iT邦幫忙

12th iThome Ironman Challenge (iThome 鐵人賽)

DAY 1

Self-Challenge Group

From Crawler to Website series, part 2

Day 1: From Crawler to Website - Optimizing the Crawler


Async

With Python's requests module, requests are handled one at a time: send the first request -> wait for its response -> send the second request, and so on. Between sending and receiving, the program spends a long stretch just waiting on network I/O, which takes far longer than sending the request or processing the response afterwards. Asynchronous execution means continuing to do other work during that I/O wait.
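As a minimal, standalone sketch of the idea (the URLs are placeholders, not the CPBL pages used below), three blocking requests.get calls can be handed to the event loop's thread pool so their waiting time overlaps, using the same get_event_loop / run_in_executor pattern as the crawler code that follows:

import asyncio
import requests

urls = ['https://example.com'] * 3  # placeholder URLs

loop = asyncio.get_event_loop()

async def fetch(url):
    # requests.get blocks, so hand it to a worker thread; await lets the
    # event loop switch to another task while this request waits on I/O
    res = await loop.run_in_executor(None, requests.get, url)
    return res.status_code

tasks = [loop.create_task(fetch(u)) for u in urls]
done, _ = loop.run_until_complete(asyncio.wait(tasks))
print([t.result() for t in done])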

In Python, asynchronous functions are declared with the async keyword, and you need to import asyncio. Below is the revised code.

All that is needed is to wrap the code that runs after sending a request into its own function.

#!/usr/bin/env python
# coding: utf-8
import requests
from bs4 import BeautifulSoup
import csv
import asyncio

loop = asyncio.get_event_loop()
tasks = []          # one task per player request
month_result = []   # per-player monthly stats collected by send_req

async def send_req(url, row):
    # run the blocking requests.get in a worker thread so the event loop
    # can keep dispatching other requests while this one waits on I/O
    res = await loop.run_in_executor(None, requests.get, url)
    soup = BeautifulSoup(res.text, 'lxml')
    player_td = soup.select(".display_a1")
    month = soup.select("tr > td:nth-of-type(1)")

    # drop the first two cells, which are not month rows
    month.remove(month[0])
    month.remove(month[0])

    player_info, avg_list, OBP_list, SLG_list, PA_list, AB_list, RBI_list, H_list, HR_list = \
        {}, {}, {}, {}, {}, {}, {}, {}, {}

    # each month's stats occupy a block of 10 <td> cells; the fixed offsets
    # below pick the individual stats out of that block
    for i in range(len(month)):
        avg_list[month[i].text] = player_td[19 + i*10].text
        OBP_list[month[i].text] = player_td[17 + i*10].text
        SLG_list[month[i].text] = player_td[18 + i*10].text
        PA_list[month[i].text] = player_td[10 + i*10].text
        AB_list[month[i].text] = player_td[11 + i*10].text
        RBI_list[month[i].text] = player_td[12 + i*10].text
        H_list[month[i].text] = player_td[13 + i*10].text
        HR_list[month[i].text] = player_td[14 + i*10].text

    # key each stat dict by its column header text scraped from the page
    player_info[player_td[9].text] = avg_list
    player_info[player_td[7].text] = OBP_list
    player_info[player_td[8].text] = SLG_list
    player_info[player_td[0].text] = PA_list
    player_info[player_td[1].text] = AB_list
    player_info[player_td[2].text] = RBI_list
    player_info[player_td[3].text] = H_list
    player_info[player_td[4].text] = HR_list

    attr = ['3月', '4月', '5月', '6月', '7月', '8月', '9月', '10月']

    total_info = {}

    # fill in '0' for months with no data so every stat covers all eight months
    for info_type in player_info:
        info = ['0', '0', '0', '0', '0', '0', '0', '0']
        for data in player_info[info_type]:
            n = attr.index(data)
            info[n] = player_info[info_type][data]
        total_info[info_type] = info

    total_info['Name'] = row['Name']

    month_result.append(total_info)


player_url = "http://www.cpbl.com.tw/players/apart.html?year=2020&type=05&"
with open('player_ID.csv', 'r', encoding='utf8') as csvfile:
    rows = csv.DictReader(csvfile)
    for row in rows:
        player_full_url = player_url + 'player_id=' + \
            str(row['ID']) + '&teamno=' + str(row['Team ID'])
        task = loop.create_task(send_req(player_full_url, row))
        tasks.append(task)

# actually run the scheduled tasks; without this the requests are never sent
loop.run_until_complete(asyncio.wait(tasks))

Note: last time, on macOS, open() worked as-is, but this time on Windows you have to pass encoding='utf8', because open() there defaults to the system locale encoding (e.g. cp950) rather than UTF-8.

Saving

With the crawler optimized, the next step is to save the data it collects. Before writing that code, a small question came to mind: if the data is saved to a CSV file and the saving code lives inside send_req, could responses that arrive at the same time open the CSV file simultaneously and cause an error?

First of all, asyncio is still single-threaded, so even if replies arrive at the same time they are still processed one after another. However, putting the write inside send_req would call open() once per player, so doing a single write at the very end is the better approach, as sketched below.
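A minimal sketch of that idea, assuming the month_result structure built by send_req above (one dict per player mapping each stat name to a list of eight monthly values, plus a 'Name' key); the filename month_stats.csv is just an example. After loop.run_until_complete(...) returns, every send_req has finished, so the file is opened and written exactly once:

# Sketch only: write month_result to CSV in one pass after the event loop
# has finished. 'month_stats.csv' is a placeholder filename.
with open('month_stats.csv', 'w', encoding='utf8', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Name', 'Stat', '3月', '4月', '5月', '6月', '7月', '8月', '9月', '10月'])
    for player in month_result:
        for stat, values in player.items():
            if stat == 'Name':
                continue  # 'Name' is a scalar, not a monthly list
            writer.writerow([player['Name'], stat] + list(values))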

The next post will store this data in Firebase.


Previous post
Day 0: From Crawler to Website - Getting the Data
Next post
Day 2: From Crawler to Website - Saving the Data
Series
From Crawler to Website (從爬蟲到架站), 21 articles