07. Scraping Stock Codes and Industry Categories

2019 iT 邦幫忙 Ironman Contest

DAY 7

AI & Data

Quantitative Investing and Machine Learning Research series, part 7

kuankuan

2018-10-21 19:30:53

Scraping all stock codes

The scrapers in later posts will look up this table frequently to decide what to crawl.

from IPython.display import HTML
from pyquery import PyQuery as pq
import pandas as pd

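# ISIN listing pages: strMode=2 for TWSE, strMode=4 for TPEx
TWSE_URL = 'http://isin.twse.com.tw/isin/C_public.jsp?strMode=2'
TPEX_URL = 'http://isin.twse.com.tw/isin/C_public.jsp?strMode=4'

# Output columns; the last cell of each page row is dropped by row[1: -1] below
columns = ['dtype', 'code', 'name', '國際證券辨識號碼', '上市日', '市場別', '產業別', 'CFI']

items = []
for url in [TWSE_URL, TPEX_URL]:
    response_dom = pq(url)  # fetch and parse the page
    # Skip the header row, then walk every row after it
    for tr in response_dom('.h4 tr:eq(0)').next_all().items():
        if tr('b'):
            # A bold row is a section header naming the security type
            dtype = tr.text()
        else:
            row = [td.text() for td in tr('td').items()]
            # The first cell holds "code<full-width space>name"
            code, name = row[0].split('\u3000')
            items.append(dict(zip(columns, [dtype, code, name, *row[1: -1]])))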

data = pd.DataFrame(items)

HTML(data.head().to_html())

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>CFI</th>
      <th>code</th>
      <th>dtype</th>
      <th>name</th>
      <th>上市日</th>
      <th>國際證券辨識號碼</th>
      <th>市場別</th>
      <th>產業別</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>ESVUFR</td>
      <td>1101</td>
      <td>股票</td>
      <td>台泥</td>
      <td>1962/02/09</td>
      <td>TW0001101004</td>
      <td>上市</td>
      <td>水泥工業</td>
    </tr>
    <tr>
      <th>1</th>
      <td>ESVUFR</td>
      <td>1102</td>
      <td>股票</td>
      <td>亞泥</td>
      <td>1962/06/08</td>
      <td>TW0001102002</td>
      <td>上市</td>
      <td>水泥工業</td>
    </tr>
    <tr>
      <th>2</th>
      <td>ESVUFR</td>
      <td>1103</td>
      <td>股票</td>
      <td>嘉泥</td>
      <td>1969/11/14</td>
      <td>TW0001103000</td>
      <td>上市</td>
      <td>水泥工業</td>
    </tr>
    <tr>
      <th>3</th>
      <td>ESVUFR</td>
      <td>1104</td>
      <td>股票</td>
      <td>環泥</td>
      <td>1971/02/01</td>
      <td>TW0001104008</td>
      <td>上市</td>
      <td>水泥工業</td>
    </tr>
    <tr>
      <th>4</th>
      <td>ESVUFR</td>
      <td>1108</td>
      <td>股票</td>
      <td>幸福</td>
      <td>1990/06/06</td>
      <td>TW0001108009</td>
      <td>上市</td>
      <td>水泥工業</td>
    </tr>
  </tbody>
</table>

data['dtype'].value_counts()

上市認購(售)權證       15763
上櫃認購(售)權證        5008
股票               1684
ETF               142
臺灣存託憑證(TDR)        17
受益證券-資產基礎證券        10
特別股                 9
受益證券-不動產投資信託        6
臺灣存託憑證              1
Name: dtype, dtype: int64
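
Since later scrapers look rows up in this table, a small lookup helper may be useful. The following is only a sketch under assumptions, not the author's code: `build_industry_lookup` is a hypothetical name, and the sample records just mirror the first rows shown above plus one illustrative non-stock row. It uses plain dicts so it runs without pandas; with the DataFrame, `data.to_dict('records')` yields records of the same shape.

```python
def build_industry_lookup(records, dtype='股票'):
    """Map security code -> industry for records of one security type."""
    return {
        r['code']: r['產業別']
        for r in records
        if r['dtype'] == dtype
    }


# Sample records mirroring the first rows of the table above.
sample = [
    {'dtype': '股票', 'code': '1101', 'name': '台泥', '產業別': '水泥工業'},
    {'dtype': '股票', 'code': '1102', 'name': '亞泥', '產業別': '水泥工業'},
    # Illustrative non-stock row, included only to show the filter.
    {'dtype': 'ETF', 'code': '0050', 'name': '元大台灣50', '產業別': ''},
]

lookup = build_industry_lookup(sample)
print(lookup)
```

Filtering on `dtype` first matters because, as the counts above show, warrants far outnumber ordinary stocks in the raw table.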

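Once the scrape is confirmed to work, rewrite it with Scrapy and write the data to the database.
Because the list of securities changes over time, periodically drop this table and re-crawl it.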

import scrapy

TWSE_URL = 'http://isin.twse.com.tw/isin/C_public.jsp?strMode=2'
TPEX_URL = 'http://isin.twse.com.tw/isin/C_public.jsp?strMode=4'

columns = ['dtype', 'code', 'name', '國際證券辨識號碼', '上市日', '市場別', '產業別', 'CFI']


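class StockCodeSpider(scrapy.Spider):
    name = 'stock_code'
    start_urls = [TWSE_URL, TPEX_URL]

    # The MONGODB_* keys are not Scrapy settings; they are presumably read
    # by the project's own MongoDB item pipeline.
    custom_settings = {
        'DOWNLOAD_DELAY': 1,
        'CONCURRENT_REQUESTS': 1,
        'MONGODB_COLLECTION': name,
        'MONGODB_ITEM_CACHE': 1000,
        'MONGODB_DROP': True
    }

    def parse(self, response):
        # response.dom is not a built-in Scrapy attribute; it is assumed to be
        # a PyQuery document attached by the project's own middleware.
        for tr in response.dom('.h4 tr:eq(0)').next_all().items():
            if tr('b'):
                # A bold row is a section header naming the security type
                dtype = tr.text()
            else:
                row = [td.text() for td in tr('td').items()]
                # The first cell holds "code<full-width space>name"
                code, name = row[0].split('\u3000')
                yield dict(zip(columns, [dtype, code, name, *row[1: -1]]))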
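
The row-handling logic shared by both versions can be isolated and tried without any network access. The sketch below is an assumption-laden illustration rather than the author's code: `rows_to_items` is a hypothetical helper, a one-cell row stands in for the bold header check (`tr('b')`), and `str.partition` replaces `split` so that a malformed first cell is skipped instead of raising.

```python
COLUMNS = ['dtype', 'code', 'name', '國際證券辨識號碼', '上市日', '市場別', '產業別', 'CFI']


def rows_to_items(rows):
    """Turn rows of cell text into dicts, carrying the section header forward.

    A one-cell row is treated as a section header (e.g. '股票'); this stands
    in for the bold-row check, which needs the real HTML.
    """
    items = []
    dtype = None
    for row in rows:
        if len(row) == 1:
            dtype = row[0].strip()
            continue
        # partition never raises, unlike unpacking the result of split
        code, _, name = row[0].partition('\u3000')
        if dtype is None or not name:
            continue  # no header seen yet, or first cell lacks a name
        items.append(dict(zip(COLUMNS, [dtype, code.strip(), name.strip(), *row[1:-1]])))
    return items


rows = [
    ['股票'],
    ['1101\u3000台泥', 'TW0001101004', '1962/02/09', '上市', '水泥工業', 'ESVUFR', ''],
    ['malformed row', 'x', 'x', 'x', 'x', 'x', ''],  # made-up bad row
]
items = rows_to_items(rows)
print(items)
```

Initializing `dtype` to `None` also avoids the `NameError` the loop would hit if a data row ever appeared before the first header row.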

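Companies are newly listed or delisted every so often, and this table will be used constantly, so a bad copy would badly affect everything downstream. The spider is therefore set to run once a month: it first drops the whole table, then crawls the pages 3 times.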

    start_urls = [TWSE_URL, TPEX_URL]*3

    custom_settings = {
        'DOWNLOAD_DELAY': 30,
        'CONCURRENT_REQUESTS': 1,
        'MONGODB_COLLECTION': name,
        'MONGODB_ITEM_CACHE': 1000,
        'MONGODB_UNIQ_KEY': [("code", 1)],
        'MONGODB_DROP': True
    }
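
`MONGODB_UNIQ_KEY` belongs to the project's MongoDB pipeline, so its exact behavior is not shown in this post. Assuming it deduplicates items on `code`, the effect of crawling the same pages three times can be illustrated in memory. `upsert_by_key` is a hypothetical helper and the dict is only a stand-in for the collection; in MongoDB itself this would roughly correspond to a unique index on `code` combined with upserts.

```python
def upsert_by_key(store, items, key='code'):
    """Insert items into `store`, replacing any earlier item with the same key."""
    for item in items:
        store[item[key]] = item
    return store


# One crawl pass over the listing pages (sample data only).
one_pass = [
    {'code': '1101', 'name': '台泥'},
    {'code': '1102', 'name': '亞泥'},
]

store = {}
for _ in range(3):  # three passes, as with start_urls * 3
    upsert_by_key(store, one_pass)

print(len(store))  # 2
```

Repeating the crawl then only helps: a row missed in one pass can be filled in by another, while rows seen several times are stored once.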