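Site being scraped: https://m.coa.gov.tw/Transaction/PoultryTrans/Index
This is my first time scraping AJAX-loaded data, and it is proving fairly difficult.
I have read through a lot of material, and this is where I am currently stuck.
Here is my code: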
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://m.coa.gov.tw/Transaction/PoultryTrans/Index'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.164 Safari/537.36'}
resp = requests.post(url, headers=headers)

# Convert the HTML into a BeautifulSoup object
soup = BeautifulSoup(resp.text, 'html.parser')
table = soup.find_all("table", {"id": "searchtable"})
tr = soup.find_all('tr')
for trr in tr:
    tdlist = trr.find_all('td')

result = []
activities = resp.json()["rows"]
for activity in activities:
    # Trade date
    date = activity["TradeDate"]
    # Farm-gate egg price
    price = activity["Column4Data"]
    result.append(
        dict(date=date, price=price))
print(result)  # the scraped values

# Column labels: 交易日期 = trade date, 雞蛋產地價格 = farm-gate egg price
case_dataframe = pd.DataFrame(result, columns=['交易日期', '雞蛋產地價格'])
print(case_dataframe)  # after loading into a DataFrame
Printed output:
[{'date': '110/07/19', 'price': '52'}, {'date': '110/07/12', 'price': '52'}, {'date': '110/07/05', 'price': '52'}, {'date': '110/06/28', 'price': '52'}, {'date': '110/06/07', 'price': '52'}, {'date': '110/06/07', 'price': '52'}, {'date': '110/05/31', 'price': '52'}, {'date': '110/05/24', 'price': '52'}, {'date': '110/05/24', 'price': '52'}, {'date': '110/05/17', 'price': '52'}, {'date': '110/05/17', 'price': '52'}, {'date': '110/05/10', 'price': '52'}, {'date': '110/05/10', 'price': '52'}, {'date': '110/05/03', 'price': '51.1'}, {'date': '110/04/26', 'price': '50'}, {'date': '110/04/19', 'price': '50'}, {'date': '110/04/19', 'price': '50'}, {'date': '110/04/12', 'price': '50'}, {'date': '110/04/05', 'price': '50'}, {'date': '110/03/29', 'price': '50'}]
交易日期 雞蛋產地價格
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
5 NaN NaN
6 NaN NaN
7 NaN NaN
8 NaN NaN
9 NaN NaN
10 NaN NaN
11 NaN NaN
12 NaN NaN
13 NaN NaN
14 NaN NaN
15 NaN NaN
16 NaN NaN
17 NaN NaN
18 NaN NaN
19 NaN NaN
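I want to scrape the data for the period from January 1, 2018 to May 31, 2021,
but I have not set any date range yet, and the request already returns data at 7-day intervals going back to 110/03/29.
I don't understand where this 7-day interval comes from.
Also, why does the data, which was clearly scraped successfully,
show up as NaN once it is loaded into the DataFrame?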
https://m.coa.gov.tw/Transaction/PoultryTrans/Index
# Example of the body sent with the POST request
"StartDate": "2021/06/27",
"EndDate": "2021/07/27",
"DataSource": "1",
"NoRest": "false",
"NowPage": "1",
"SortAction": "DESC",
"SortField": "TradeDateTime",
"PageSize": "20"
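3. Either make result a list of lists, or make the keys of your dicts match the DataFrame's columns. pandas is not smart enough to translate English into Chinese for you: you tell it to look for 交易日期 (trade date) and 雞蛋產地價格 (farm-gate egg price), but the data you hand it only has date and price, so of course it gives you NaN.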
result = [['110/07/19','52'], ....]
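Both fixes from point 3 can be sketched as below, using two rows copied from the printed output in the question as sample data. The variable names `df_bad`, `df_renamed`, and `df_from_lists` are just illustrative.

```python
import pandas as pd

# Two sample rows in the same shape as the scraped result in the question.
result = [
    {"date": "110/07/19", "price": "52"},
    {"date": "110/05/03", "price": "51.1"},
]

# The original problem: the dict keys ("date", "price") do not match the
# requested columns, so pandas finds nothing to put in them.
df_bad = pd.DataFrame(result, columns=["交易日期", "雞蛋產地價格"])

# Fix A: build the frame from the dicts as they are, then rename the columns.
df_renamed = pd.DataFrame(result).rename(
    columns={"date": "交易日期", "price": "雞蛋產地價格"}
)

# Fix B: use a list of lists, so `columns` only labels the positions.
rows = [[r["date"], r["price"]] for r in result]
df_from_lists = pd.DataFrame(rows, columns=["交易日期", "雞蛋產地價格"])

print(df_bad)
print(df_renamed)
print(df_from_lists)
```

With a list of dicts, `columns` selects by key; with a list of lists, it assigns labels by position. That difference is why the same `columns` argument gives NaN in one case and real values in the other.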