Day 13: From Web Scraper to Website - Adding a Feature (Recent Form)

12th iThome Ironman Contest

Self-Challenge category

Part 14 of the series "From Web Scraper to Website" (從爬蟲到架站)

jeff3071

2020-09-15 22:22:22

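As before, I start with the scraper and first locate the target data. The game-by-game results (逐場成績) on each player's personal page have what I need. The goal is to get each player's performance over the last five games. There is no chart to draw this time, only a table to update.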

Take the player 江坤宇 as an example.

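This one is more troublesome: there is no shared class to select on, so I look for a pattern among the td tags instead and, as last time, send the requests asynchronously.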

Before writing the scraper, note that some players have played fewer than five games, so they need special handling.
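The special handling can be sketched as a small helper. This is only a minimal sketch, assuming the page layout that crawl.py below relies on (8 leading td cells, then 30 cells per game). The function name `count_recent_games` and its parameters are hypothetical and not part of the original code.

```python
def count_recent_games(n_cells, header_cells=8, cells_per_game=30, limit=5):
    """Return how many game rows to read, capped at `limit`.

    Assumes the first `header_cells` <td> cells are not game data and that
    every game occupies exactly `cells_per_game` cells.
    """
    complete_games = (n_cells - header_cells) // cells_per_game
    # Clamp to 0 so a page without any game rows never gives a negative count.
    return max(0, min(complete_games, limit))


print(count_recent_games(8 + 30 * 3))   # a player with three games
print(count_recent_games(8 + 30 * 12))  # a player with twelve games
```

Capping with `min` keeps the loop from reading past the last game for players with short game logs, which is the case the paragraph above warns about.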

crawl.py

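import asyncio
import csv

import requests
from bs4 import BeautifulSoup

from db_connect import store_recent

# loop, tasks and recent_result are assumed to be module-level globals
# defined elsewhere in crawl.py.


async def send_recent_req(url, row):
    # requests.get is blocking, so run it in the default thread pool executor.
    res = await loop.run_in_executor(None, requests.get, url)
    soup = BeautifulSoup(res.text, 'lxml')
    # Flat list of every <td> on the page: the first game starts at index 8
    # and each game spans 30 cells.
    html_info = soup.select("td")  # 8:Date 9:oppo 10:PA

    # One list per column, filled game by game.
    Date, opp, PA, AB, RBI, R, H, two_B, third_B, HR, SO, SB, CS, AVG = (
        [] for _ in range(14))
    # Players with fewer than five games only yield the games they have played.
    Game_num = min((len(html_info)-8)//30, 5)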
    for i in range(Game_num):
        Date.append(html_info[8+i*30].text)
        opp.append(html_info[9+i*30].text)
        PA.append(html_info[10+i*30].text)
        AB.append(html_info[11+i*30].text)
        RBI.append(html_info[12+i*30].text)
        R.append(html_info[13+i*30].text)
        H.append(html_info[14+i*30].text)
        two_B.append(html_info[15+i*30].text)
        third_B.append(html_info[16+i*30].text)
        HR.append(html_info[17+i*30].text)
        SO.append(html_info[19+i*30].text)
        SB.append(html_info[20+i*30].text)
        CS.append(html_info[21+i*30].text)
        AVG.append(html_info[22+i*30].text)
    d = {'Date': Date, 'Name': row['Name'], 'OPP': opp, 'PA': PA, 'AB': AB, 'RBI': RBI, 'R': R,
         'H': H, '2B': two_B, '3B': third_B, 'HR': HR, 'SO': SO, 'SB': SB, 'CS': CS, 'AVG': AVG}
    recent_result.append(d)

def get_player_recent_stat():
    player_recent_url = "http://www.cpbl.com.tw/players/follow.html?"
    with open('player_ID.csv', 'r', encoding='utf8') as csvfile:
        rows = csv.DictReader(csvfile)
        for row in rows:
            player_recent_full_url = player_recent_url + 'player_id=' + \
                str(row['ID']) + '&teamno=' + str(row['Team ID'])

            r_task = loop.create_task(
                send_recent_req(player_recent_full_url, row))

            tasks.append(r_task)
        loop.run_until_complete(asyncio.wait(tasks))

        store_recent(recent_result)

The column-by-column lists in send_recent_req are a bit ugly; the next feature will use a dict to improve on this.
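One possible dict-based shape is sketched here. It is not necessarily what the next post uses: the names `OFFSETS`, `CELLS_PER_GAME` and `extract_recent` are hypothetical, and the offsets are simply copied from send_recent_req. The function takes plain strings, so with BeautifulSoup it would be called with something like `[td.text for td in html_info]`.

```python
# Offset of each column inside the flat list of <td> cell texts.
# Copied from send_recent_req; offset 18 is not collected there either.
OFFSETS = {
    'Date': 8, 'OPP': 9, 'PA': 10, 'AB': 11, 'RBI': 12, 'R': 13, 'H': 14,
    '2B': 15, '3B': 16, 'HR': 17, 'SO': 19, 'SB': 20, 'CS': 21, 'AVG': 22,
}
CELLS_PER_GAME = 30


def extract_recent(cell_texts, name, limit=5):
    """Build a dict with the same keys as send_recent_req from cell strings."""
    game_num = max(0, min((len(cell_texts) - 8) // CELLS_PER_GAME, limit))
    result = {
        key: [cell_texts[offset + i * CELLS_PER_GAME] for i in range(game_num)]
        for key, offset in OFFSETS.items()
    }
    result['Name'] = name
    return result


# Fake page with two games; each cell's text is just its own index.
cells = [str(i) for i in range(8 + 30 * 2)]
recent = extract_recent(cells, 'demo')
print(recent['Date'], recent['SO'])
```

Keeping the offsets in one table means adding or dropping a column is a one-line change instead of a new list, a new append and a new dict entry.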

As before, the function that stores the data goes in db_connect.py.

db_connect.py

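from firebase_admin import firestore

# firebase_admin.initialize_app(...) must already have been called before this runs.


def store_recent(data):
    db = firestore.client()
    batch = db.batch()

    for player in data:
        doc_ref = db.collection(u'打者').document(player['Name'])
        doc = doc_ref.get()
        if doc.exists:
            # update only works on a document that already exists.
            batch.update(doc_ref, {u'近況': player})
        else:
            # Create the document; merge=True keeps any fields already stored.
            # Queued on the batch so every write is sent by the single commit below.
            batch.set(doc_ref, {u'近況': player}, merge=True)

    batch.commit()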

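This turned up a problem I had not mentioned before: update only succeeds if the document that doc_ref points to already exists, so a document that does not exist yet has to be written with set. The set call is given merge=True because a plain set overwrites the whole document; without that argument, the data stored earlier would be lost.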
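Because set with merge=True works whether or not the document exists, the existence check could in principle be dropped. The following is a minimal sketch of that variant, not the author's code: the name `store_recent_merged` is hypothetical, and `db` is assumed to be a Firestore client such as the one returned by `firestore.client()` from firebase_admin.

```python
def store_recent_merged(db, data):
    """Queue one merge-write per player and commit them together.

    `db` is assumed to be a Firestore client (for example the object
    returned by firebase_admin.firestore.client()).
    """
    batch = db.batch()
    for player in data:
        doc_ref = db.collection('打者').document(player['Name'])
        # set(..., merge=True) creates the document if it is missing and
        # leaves the document's other fields untouched if it exists,
        # so no doc_ref.get() read is needed first.
        batch.set(doc_ref, {'近況': player}, merge=True)
    batch.commit()
```

The trade-off is one fewer read per player, at the cost of no longer distinguishing new players from existing ones in the code.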

That's about it for this time; next time I'll keep fleshing out the features.