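With Python's requests module, requests are sent one after another: send the first request -> receive the first response -> send the second request, and so on. Between sending and receiving, however, there is a long stretch spent waiting on I/O, far longer than the time spent sending the request or post-processing the response. Asynchronous execution means continuing to do other work during that I/O wait.
In Python, asynchronous functions are declared with the async keyword, and you need to import asyncio.
All it takes is wrapping the code that runs after sending the request into a function. Here is the modified code: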
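To make the idea concrete, here is a minimal sketch that is separate from the crawler itself. It assumes a hypothetical blocking_fetch function that uses time.sleep as a stand-in for a slow requests.get call, and hands several of those calls to the default thread pool with run_in_executor so that their waits overlap instead of adding up.

```python
import asyncio
import time


def blocking_fetch(n):
    # Hypothetical stand-in for a blocking call such as requests.get
    time.sleep(0.2)
    return n * 2


async def fetch_all(items):
    loop = asyncio.get_running_loop()
    # Each blocking call is handed to the default thread pool executor,
    # so the waits overlap instead of running back to back
    futures = [loop.run_in_executor(None, blocking_fetch, n) for n in items]
    # gather returns the results in the same order as the inputs
    return await asyncio.gather(*futures)


start = time.perf_counter()
results = asyncio.run(fetch_all([1, 2, 3, 4]))
elapsed = time.perf_counter() - start
print(results, f"{elapsed:.2f}s")
```

Run sequentially, the four calls would take about 0.8 seconds; with the waits overlapped, the total should be close to the duration of a single call.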
#!/usr/bin/env python
# coding: utf-8
import asyncio
import csv

import requests
from bs4 import BeautifulSoup

loop = asyncio.get_event_loop()
tasks = []
month_result = []


async def send_req(url, row):
    # requests.get is blocking, so run it in the default executor and await the result
    res = await loop.run_in_executor(None, requests.get, url)
    soup = BeautifulSoup(res.text, 'lxml')
    player_td = soup.select(".display_a1")
    month = soup.select("tr > td:nth-of-type(1)")
    # Drop the first two matched cells before reading the month labels
    month.remove(month[0])
    month.remove(month[0])
    player_info, avg_list, OBP_list, SLG_list, PA_list, AB_list, RBI_list, H_list, HR_list = {
    }, {}, {}, {}, {}, {}, {}, {}, {}
    # Each month occupies 10 cells; collect every stat keyed by month label
    for i in range(len(month)):
        avg_list[month[i].text] = player_td[19 + i*10].text
        OBP_list[month[i].text] = player_td[17 + i*10].text
        SLG_list[month[i].text] = player_td[18 + i*10].text
        PA_list[month[i].text] = player_td[10 + i*10].text
        AB_list[month[i].text] = player_td[11 + i*10].text
        RBI_list[month[i].text] = player_td[12 + i*10].text
        H_list[month[i].text] = player_td[13 + i*10].text
        HR_list[month[i].text] = player_td[14 + i*10].text
    # The first row of cells holds the column headers, used here as keys
    player_info[player_td[9].text] = avg_list
    player_info[player_td[7].text] = OBP_list
    player_info[player_td[8].text] = SLG_list
    player_info[player_td[0].text] = PA_list
    player_info[player_td[1].text] = AB_list
    player_info[player_td[2].text] = RBI_list
    player_info[player_td[3].text] = H_list
    player_info[player_td[4].text] = HR_list
    attr = ['3月', '4月', '5月', '6月', '7月', '8月', '9月', '10月']
    total_info = {}
    # Fill in '0' for months that have no data
    for info_type in player_info:
        info = ['0', '0', '0', '0', '0', '0', '0', '0']
        for data in player_info[info_type]:
            n = attr.index(data)
            info[n] = player_info[info_type][data]
        total_info[info_type] = info
    total_info['Name'] = row['Name']
    month_result.append(total_info)


player_url = "http://www.cpbl.com.tw/players/apart.html?year=2020&type=05&"
with open('player_ID.csv', 'r', encoding='utf8') as csvfile:
    rows = csv.DictReader(csvfile)
    for row in rows:
        player_full_url = player_url + 'player_id=' + \
            str(row['ID']) + '&teamno=' + str(row['Team ID'])
        task = loop.create_task(send_req(player_full_url, row))
        tasks.append(task)

# Run the queued tasks to completion; the listing above only creates them
loop.run_until_complete(asyncio.wait(tasks))
Note: last time, on macOS, open() worked without any extra arguments; this time, on Windows, I had to add encoding='utf8'.
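With the crawler optimized, the next step is to store the scraped data. Before writing that code, a small question came to mind: if I store the data in a CSV file and put the writing code inside send_req, could responses that arrive at the same time open the CSV simultaneously and cause errors?
First, async code is still single-threaded: the body of each coroutine runs on the event loop's thread (here, only the blocking requests.get calls are handed off to executor threads), so even if responses arrive at the same moment they are still processed one after another. However, writing inside send_req would call open() once per request, so writing everything once at the end is the better approach.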
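As a rough sketch of the "write once at the end" approach, the helper below opens the CSV a single time and writes every collected record. The function name write_results, the output path, the sample records, and the choice to join each monthly list with '|' are all assumptions made for illustration; the records only mimic the shape of the total_info dictionaries appended to month_result.

```python
import csv
import os
import tempfile


def write_results(path, results):
    """Write all collected records in one pass, opening the file only once."""
    if not results:
        return 0
    fieldnames = list(results[0].keys())
    with open(path, 'w', newline='', encoding='utf8') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        for record in results:
            # Monthly values are lists; join them so each stat fits in one cell
            writer.writerow({
                key: '|'.join(value) if isinstance(value, list) else value
                for key, value in record.items()
            })
    return len(results)


# Hypothetical records shaped like the entries in month_result
sample = [
    {'AVG': ['0', '0.300'], 'Name': 'player_a'},
    {'AVG': ['0', '0.250'], 'Name': 'player_b'},
]
out_path = os.path.join(tempfile.gettempdir(), 'month_result_demo.csv')
written = write_results(out_path, sample)
print(written, 'rows written to', out_path)
```

In the crawler, a call like this would go after all tasks have finished, with month_result passed in place of the sample list.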
The next post will store this data in Firebase.