iT邦幫忙

Scraping an AJAX dynamic page with Python

Site to scrape: https://m.coa.gov.tw/Transaction/PoultryTrans/Index


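This is my first time scraping AJAX data, and it is proving fairly difficult.
I have read a lot of material, but this is where I am stuck.
Here is my code: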

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://m.coa.gov.tw/Transaction/PoultryTrans/Index'

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.164 Safari/537.36'}

resp = requests.post(url, headers = headers)

# Parse the HTML into a BeautifulSoup object
soup = BeautifulSoup(resp.text, 'html.parser')

table = soup.find_all("table", {"id":"searchtable"})
tr = soup.find_all('tr')

for trr in tr:
    tdlist = trr.find_all('td')

result = []    
activities = resp.json()["rows"]
for activity in activities:
    # Trade date
    date = activity["TradeDate"]
 
    # Egg farm-gate price
    price = activity["Column4Data"]

    result.append(
        dict(date = date, price = price))
print(result)  # the scraped values
case_dataframe = pd.DataFrame(result, columns = ['交易日期', '雞蛋產地價格'])
print(case_dataframe)  # the values loaded into a DataFrame

Printed output:

[{'date': '110/07/19', 'price': '52'}, {'date': '110/07/12', 'price': '52'}, {'date': '110/07/05', 'price': '52'}, {'date': '110/06/28', 'price': '52'}, {'date': '110/06/07', 'price': '52'}, {'date': '110/06/07', 'price': '52'}, {'date': '110/05/31', 'price': '52'}, {'date': '110/05/24', 'price': '52'}, {'date': '110/05/24', 'price': '52'}, {'date': '110/05/17', 'price': '52'}, {'date': '110/05/17', 'price': '52'}, {'date': '110/05/10', 'price': '52'}, {'date': '110/05/10', 'price': '52'}, {'date': '110/05/03', 'price': '51.1'}, {'date': '110/04/26', 'price': '50'}, {'date': '110/04/19', 'price': '50'}, {'date': '110/04/19', 'price': '50'}, {'date': '110/04/12', 'price': '50'}, {'date': '110/04/05', 'price': '50'}, {'date': '110/03/29', 'price': '50'}]
    交易日期  雞蛋產地價格
0    NaN       NaN
1    NaN       NaN
2    NaN       NaN
3    NaN       NaN
4    NaN       NaN
5    NaN       NaN
6    NaN       NaN
7    NaN       NaN
8    NaN       NaN
9    NaN       NaN
10   NaN       NaN
11   NaN       NaN
12   NaN       NaN
13   NaN       NaN
14   NaN       NaN
15   NaN       NaN
16   NaN       NaN
17   NaN       NaN
18   NaN       NaN
19   NaN       NaN

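I want to scrape the data for the period from January 1, 2018 to May 31, 2021,
but I have not set a date range yet, and the code returns one record every 7 days going back to 110/03/29.
I don't understand where the 7-day interval comes from.
Also, why does the DataFrame show NaN
when the data was clearly scraped?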

1 Answer

froce
iT邦大師 Level 1 ‧ 2021-07-27 16:50:28
Best answer

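I want to scrape the data for the period from January 1, 2018 to May 31, 2021,
but I have not set a date range yet, and the code returns one record every 7 days going back to 110/03/29.
I don't understand where the 7-day interval comes from.
Also, why does the DataFrame show NaN
when the data was clearly scraped?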

https://m.coa.gov.tw/Transaction/PoultryTrans/Index

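  1. When you POST to this URL without a body, it appears to return the data at 7-day intervals, in JSON format.
  2. When you POST with a body, it returns the data for every day in the range; the Content-Type is application/x-www-form-urlencoded.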
# Example of the body sent with the POST
{
	"StartDate": "2021/06/27",
	"EndDate": "2021/07/27",
	"DataSource": "1",
	"NoRest": "false",
	"NowPage": "1",
	"SortAction": "DESC",
	"SortField": "TradeDateTime",
	"PageSize": "20"
}

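  3. Your result should be a list of lists; otherwise the keys of your dicts have to match the DataFrame's columns. pandas is not smart enough to translate English into Chinese for you: you tell it to look for 交易日期 and 雞蛋產地價格, but the data you give it only has date and price, so of course it gives you NaN.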

result = [['110/07/19','52'], ....]
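
Point 3 can be sketched in two ways: either build a list of lists so the columns are assigned by position, or keep the dicts but give them keys that match the column names. The sample rows are copied from the output printed in the question; the helper names `frame_from_lists` and `frame_from_dicts` are made up for this illustration and appear nowhere in the original code.

```python
import pandas as pd

COLUMNS = ["交易日期", "雞蛋產地價格"]

# A few rows shaped like the entries in resp.json()["rows"];
# the values are copied from the output printed in the question.
rows = [
    {"TradeDate": "110/07/19", "Column4Data": "52"},
    {"TradeDate": "110/05/03", "Column4Data": "51.1"},
    {"TradeDate": "110/04/26", "Column4Data": "50"},
]


def frame_from_lists(rows):
    """Option A: a list of lists, so columns are assigned by position."""
    result = [[row["TradeDate"], row["Column4Data"]] for row in rows]
    return pd.DataFrame(result, columns=COLUMNS)


def frame_from_dicts(rows):
    """Option B: keep dicts, but use keys identical to the column names."""
    result = [
        {"交易日期": row["TradeDate"], "雞蛋產地價格": row["Column4Data"]}
        for row in rows
    ]
    return pd.DataFrame(result, columns=COLUMNS)


# What the question's code does: the keys 'date' and 'price' match neither
# requested column, so every cell ends up as NaN.
mismatched = pd.DataFrame(
    [dict(date=row["TradeDate"], price=row["Column4Data"]) for row in rows],
    columns=COLUMNS,
)

print(frame_from_lists(rows))
print(frame_from_dicts(rows))
print(mismatched)
```

Both options give the same frame. Option A is the closest to the `result` line above, while option B tends to be easier to extend when more fields are added later.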

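
Points 1 and 2 of the answer can be combined into a request that sends the date range as a form-encoded body. This is only a sketch under stated assumptions: the field names and values are copied from the answer's body example, the `rows` key comes from the question's code, and `build_payload` and `fetch_rows` are hypothetical helper names. The thread does not say how the endpoint signals the last page, so stopping on an empty `rows` list is an assumption to verify against the real responses.

```python
import requests

URL = "https://m.coa.gov.tw/Transaction/PoultryTrans/Index"


def build_payload(start_date, end_date, page=1, page_size=20):
    """Form fields copied from the answer's body example.

    Dates use the 'YYYY/MM/DD' format shown there.
    """
    return {
        "StartDate": start_date,
        "EndDate": end_date,
        "DataSource": "1",
        "NoRest": "false",
        "NowPage": str(page),
        "SortAction": "DESC",
        "SortField": "TradeDateTime",
        "PageSize": str(page_size),
    }


def fetch_rows(start_date, end_date, headers=None, max_pages=200):
    """Collect rows page by page.

    Assumption: an empty or missing 'rows' list means there is no more data.
    """
    all_rows = []
    for page in range(1, max_pages + 1):
        payload = build_payload(start_date, end_date, page)
        # Passing a dict via data= makes requests send it as
        # application/x-www-form-urlencoded.
        resp = requests.post(URL, data=payload, headers=headers, timeout=30)
        resp.raise_for_status()
        rows = resp.json().get("rows") or []
        if not rows:
            break
        all_rows.extend(rows)
    return all_rows
```

For the range in the question, the call would be `fetch_rows("2018/01/01", "2021/05/31", headers=headers)`, reusing the `headers` dict from the question's code in case the site rejects the default requests user agent.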