[Learn Python Web Scraping This Way and It's Sure to Be a Thumbs-Up!] DAY23 - Hands-On Practice: HTML Response - Scraping the Stock Code List (2)

2021 iThome Ironman Contest

DAY 23

Software Development

Part 23 of the series "Learn Python Web Scraping This Way and It's Sure to Be a Thumbs-Up!"

Tags: 13th Ironman Contest, crawler, web crawler, python, requests

GreedIsGood

Team: 請支援 Coding

2021-10-08 01:21:54

Before we start, let me quickly go over the Beautiful Soup usage that this crawler relies on:

from bs4 import BeautifulSoup
html = "<title>example1</title><title>example2</title>"

soup = BeautifulSoup(html, "lxml")

# find_all() searches the whole HTML and returns every match
print(soup.find_all("title"))
# [<title>example1</title>, <title>example2</title>]

# find() returns only the "first" match
print(soup.find("title"))
# <title>example1</title>

# Get the text between the tags
print(soup.find("title").text)
# example1

Link to the official documentation

That's about it. Not too hard, right?

Alright, let's get started!
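The same calls chain naturally when the data lives in a table, which is the pattern the crawler later in this post relies on. Here is a minimal sketch, assuming a made-up three-row table (the HTML and its rows are placeholders for illustration, not real response data) and the built-in "html.parser" backend so that no extra parser needs to be installed:

```python
from bs4 import BeautifulSoup

# Hypothetical sample HTML; the real page has a different layout.
sample_html = """
<table>
  <tr><td>Code</td><td>Name</td></tr>
  <tr><td>0001</td><td>Example A</td></tr>
  <tr><td>0002</td><td>Example B</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# find_all() can be chained: first every <tr>, then every <td> inside each row
rows = []
for tr in soup.find("table").find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    rows.append(cells)

# The first row holds the column names, the rest hold the data
header, data = rows[0], rows[1:]
print(header)
print(data)
```

Each row ends up as a plain list of strings, so picking a column is just a matter of indexing into that list.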

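Scraping the Stock Code List - Code

Based on what we learned in the previous post:
- URL: https://isin.twse.com.tw/isin/class_main.jsp
- Required query: market=1&issuetype=1&Page=1&chklike=Y
- So you can substitute your own values for market, issuetype, Page, and chklike in the query as needed.
- The HTTP method is GET.
- Content-Type: text/html;charset=MS950, so the format is HTML and the encoding is MS950.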
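To make the list above concrete, here is a small standard-library sketch that assembles the GET URL from the query values and reads the charset out of the Content-Type value. The helper names `build_url` and `charset_from_content_type` are hypothetical, introduced only for this illustration, and nothing is sent over the network:

```python
import codecs
from email.message import Message
from urllib.parse import urlencode

BASE_URL = "https://isin.twse.com.tw/isin/class_main.jsp"


def build_url(market="1", issuetype="1", page="1", chklike="Y"):
    """Return the full GET URL for the given query values."""
    query = urlencode({
        "market": market,
        "issuetype": issuetype,
        "Page": page,
        "chklike": chklike,
    })
    return f"{BASE_URL}?{query}"


def charset_from_content_type(content_type):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


url = build_url()
charset = charset_from_content_type("text/html;charset=MS950")
# Python resolves the name MS950 to its cp950 codec
codec_name = codecs.lookup(charset).name
print(url)
print(charset, codec_name)
```

Passing a dict of parameters and letting the library build the query string avoids hand-concatenating `&` and `=`, which is the same idea as the `params=` argument of `requests.get`.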

The crawler code:

If any of the Beautiful Soup parts are unclear, open the official documentation linked above and read along with it.

import json
import requests
from bs4 import BeautifulSoup

# Index constants: each number is the position of the data we want within a list
TARGET_TABLE_INDEX = 1
STOCK_NO_INDEX = 2
STOCK_NAME_INDEX = 3
STOCK_INDUSTRY_INDEX = 6
# JSON settings
TITLE = "stock"
JSON_INDENT = 4

# Send the HTTP request
url = "https://isin.twse.com.tw/isin/class_main.jsp"
res = requests.get(url, params={
    "market": "1",
    "issuetype": "1",
    "Page": "1",
    "chklike": "Y"
})

# Handle the encoding: if res.text is decoded with the wrong codec (e.g. utf-8), the content comes out garbled
res.encoding = "MS950"
res_html = res.text

# Parse
soup = BeautifulSoup(res_html, "lxml")

# This HTML contains two tables,
# so the list returned by find_all("table") has length 2,
# and the data we want is in the second one
tr_list = soup.find_all("table")[TARGET_TABLE_INDEX].find_all("tr")

# The first item in tr_list is the row of column names;
# we don't need it here, so pop it off
tr_list.pop(0)

# Start processing the data
result = []
for tr in tr_list:

    td_list = tr.find_all("td")

    # Stock code
    stock_no_val = td_list[STOCK_NO_INDEX].text

    # Stock name
    stock_name_val = td_list[STOCK_NAME_INDEX].text

    # Stock industry category
    stock_industry_val = td_list[STOCK_INDUSTRY_INDEX].text

    # Collect the values into a dict and store it
    result.append({
        "stockNo": stock_no_val,
        "stockName": stock_name_val,
        "stockIndustry": stock_industry_val
    })


# Write the dict out to a file
stock_list_dict = {TITLE: result}
with open("stock_info_list.json", "w", encoding="utf-8") as f:
    f.write(
        json.dumps(stock_list_dict,
                   indent=JSON_INDENT,
                   ensure_ascii=False)
    )
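
The loop above indexes straight into each row, so it assumes every row has at least seven cells. As a hedged sketch of one way to guard against that, the hypothetical helper `rows_to_records` below takes rows that have already been reduced to lists of cell text, skips any row that is too short, and strips stray whitespace. The sample rows are invented for illustration and are not real response data. It also shows what `ensure_ascii=False` changes in the JSON output:

```python
import json

STOCK_NO_INDEX = 2
STOCK_NAME_INDEX = 3
STOCK_INDUSTRY_INDEX = 6


def rows_to_records(rows):
    """Convert lists of cell text into dicts, skipping rows that are too short."""
    min_len = STOCK_INDUSTRY_INDEX + 1
    records = []
    for cells in rows:
        if len(cells) < min_len:
            continue  # not enough columns to index safely
        records.append({
            "stockNo": cells[STOCK_NO_INDEX].strip(),
            "stockName": cells[STOCK_NAME_INDEX].strip(),
            "stockIndustry": cells[STOCK_INDUSTRY_INDEX].strip(),
        })
    return records


# Invented rows for illustration only; not real response data.
sample_rows = [
    ["1", "X", " 0001 ", "範例甲", "-", "-", "示範業"],
    ["too", "short"],
]
records = rows_to_records(sample_rows)

# ensure_ascii=False keeps non-ASCII characters readable in the output;
# the default escapes them as \uXXXX sequences instead
readable = json.dumps({"stock": records}, ensure_ascii=False)
escaped = json.dumps({"stock": records})
print(readable)
```

Both strings load back into the same data with `json.loads`; the only difference is how readable the file is when opened in an editor.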