Why do I keep writing crawlers? Because I haven't touched program trading in a long time and rarely trade the market anymore (aside from still watching 57金錢爆 every day, though the show hasn't been very accurate lately). Right now the plan is simply to stay short, resist the urge to cover, and add to the short on every bounce. Over the past few months I was disappointed enough to turn to program trading in search of another approach, but the recent heavily weighted short position has restored my confidence. If the motivation still hadn't come back, I probably wouldn't be able to finish these 30 days...
"證券代號",
"證券名稱",
"外資買進股數",
"外資賣出股數",
"外資買賣超股數",
"投信買進股數",
"投信賣出股數",
"投信買賣超股數",
"自營商買賣超股數",
"自營商買進股數(自行買賣)",
"自營商賣出股數(自行買賣)",
"自營商買賣超股數(自行買賣)",
"自營商買進股數(避險)",
"自營商賣出股數(避險)",
"自營商買賣超股數(避險)",
"三大法人買賣超股數"
"證券代號",
"證券名稱",
"外陸資買進股數(不含外資自營商)",
"外陸資賣出股數(不含外資自營商)",
"外陸資買賣超股數(不含外資自營商)",
"外資自營商買進股數",
"外資自營商賣出股數",
"外資自營商買賣超股數",
"投信買進股數",
"投信賣出股數",
"投信買賣超股數",
"自營商買賣超股數",(比上櫃多的)
"自營商買進股數(自行買賣)",
"自營商賣出股數(自行買賣)",
"自營商買賣超股數(自行買賣)",
"自營商買進股數(避險)",
"自營商賣出股數(避險)",
"自營商買賣超股數(避險)",
"三大法人買賣超股數"
"證券代號",
"證券名稱",
"外陸資買進股數(不含外資自營商)",
"外陸資賣出股數(不含外資自營商)",
"外陸資買賣超股數(不含外資自營商)",
"外資自營商買進股數",
"外資自營商賣出股數",
"外資自營商買賣超股數",
"外資及陸資買進股數",
"外資及陸資賣出股數",
"外資及陸資買賣超股數",
"投信買進股數",
"投信賣出股數",
"投信買賣超股數",
"自營商買進股數(自行買賣)",
"自營商賣出股數(自行買賣)",
"自營商買賣超股數(自行買賣)",
"自營商買進股數(避險)",
"自營商賣出股數(避險)",
"自營商買賣超股數(避險)",
"三大法人買賣超股數"
There isn't much else to introduce: once the URLs are worked out, what's left is the grunt work of cleaning the data, which is then stored alongside the daily price/volume records.
TWSE_URL = 'http://www.twse.com.tw/fund/T86?response=json&date={y}{m:02d}{d:02d}&selectType=ALL'
TPEX_URL = 'http://www.tpex.org.tw/web/stock/3insti/daily_trade/3itrade_hedge_result.php?l=zh-tw&se=AL&t=D&d={y}/{m:02d}/{d:02d}'
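Note that the two templates expect different date formats: TWSE takes a Gregorian date packed as YYYYMMDD, while TPEX wants the ROC year (Gregorian year minus 1911) with slashes. A minimal sketch for sanity-checking both endpoints by hand, assuming requests is installed (fetch_raw is a hypothetical helper, not part of the spider):

import requests

TWSE_URL = 'http://www.twse.com.tw/fund/T86?response=json&date={y}{m:02d}{d:02d}&selectType=ALL'
TPEX_URL = 'http://www.tpex.org.tw/web/stock/3insti/daily_trade/3itrade_hedge_result.php?l=zh-tw&se=AL&t=D&d={y}/{m:02d}/{d:02d}'

def fetch_raw(year, month, day):
    """Fetch one day's raw rows from both exchanges (hypothetical helper)."""
    twse = requests.get(TWSE_URL.format(y=year, m=month, d=day)).json()
    # TPEX uses the ROC calendar: Gregorian year minus 1911.
    tpex = requests.get(TPEX_URL.format(y=year - 1911, m=month, d=day)).json()
    # TWSE rows live under 'data', TPEX rows under 'aaData' (see parse() below).
    return twse.get('data', []), tpex.get('aaData', [])

if __name__ == '__main__':
    twse_rows, tpex_rows = fetch_raw(2018, 10, 3)
    print(len(twse_rows), len(tpex_rows))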
The crawled result looks like the record below; margin trading (融資券) and securities lending (借券) data still have to be merged into the same table later:
{'_id': '2018/10/03_0050',
'市場別': '上市',
'產業別': '',
'name': '元大台灣50',
'code': '0050',
'date': '2018/10/03',
'成交股數': 4369,
'成交金額': 375778,
'開盤價': 86.05,
'最高價': 86.3,
'最低價': 85.8,
'收盤價': 85.95,
'成交筆數': 1554,
'三大法人買賣超': 1120.0,
'外資自營商買賣超': 0.0,
'外資自營商買進': 0.0,
'外資自營商賣出': 0.0,
'外陸資買賣超': 1609.0,
'外陸資買進': 1611.0,
'外陸資賣出': 2.0,
'投信買賣超': -725.0,
'投信買進': 0.0,
'投信賣出': 725.0,
'自營商買賣超': -747.0,
'自營商買賣超避險': 983.0,
'自營商買進': 154.0,
'自營商買進避險': 1215.0,
'自營商賣出': 901.0,
'自營商賣出避險': 232.0}
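Since every record is one stock on one day keyed by _id, quick sanity checks are easy with pandas before any merging happens. A small sketch: the 0050 numbers are copied from the record above (a subset of its fields), while the 0056 row is invented purely for illustration:

import pandas as pd

records = [
    # copied from the sample record above (institutional columns only)
    {'_id': '2018/10/03_0050', 'code': '0050', 'date': '2018/10/03',
     '外陸資買賣超': 1609.0, '投信買賣超': -725.0,
     '自營商買賣超': -747.0, '三大法人買賣超': 1120.0},
    # invented row, purely for illustration
    {'_id': '2018/10/03_0056', 'code': '0056', 'date': '2018/10/03',
     '外陸資買賣超': -120.0, '投信買賣超': 30.0,
     '自營商買賣超': 15.0, '三大法人買賣超': -75.0},
]

# Rank the day's stocks by total institutional net buy/sell (in lots/張).
df = pd.DataFrame(records).set_index('_id')
print(df.sort_values('三大法人買賣超', ascending=False))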
# -*- coding: utf-8 -*-
import json
import time
from datetime import datetime
import pandas as pd
import scrapy
TWSE_URL = 'http://www.twse.com.tw/fund/T86?response=json&date={y}{m:02d}{d:02d}&selectType=ALL'
TPEX_URL = 'http://www.tpex.org.tw/web/stock/3insti/daily_trade/3itrade_hedge_result.php?l=zh-tw&se=AL&t=D&d={y}/{m:02d}/{d:02d}'
columns = ["_id",
"外陸資買進",
"外陸資賣出",
"外陸資買賣超",
"外資自營商買進",
"外資自營商賣出",
"外資自營商買賣超",
"投信買進",
"投信賣出",
"投信買賣超",
"自營商買進",
"自營商賣出",
"自營商買賣超",
"自營商買進避險",
"自營商賣出避險",
"自營商買賣超避險",
"三大法人買賣超"]
def parse_info(d, m):
    """Normalize one raw row into a dict keyed by the unified columns."""
    # Build the document _id from the date and stock code, e.g. '2018/10/03_0050'.
    _id = m['date'] + '_' + d[0]
    if m['市場別'] == '上市':
        d.pop(11)    # drop the aggregate 自營商買賣超股數 (the TWSE-only column)
        d = d[2:]    # drop 證券代號 and 證券名稱
    else:
        del d[8:11]  # drop the three aggregate 外資及陸資 columns (TPEX-only)
        d = d[2:-1]  # drop 證券代號 / 證券名稱 and the trailing field
    # Strip thousands separators and convert shares to lots (張).
    d = [int(x.replace(',', '')) / 1000 for x in d]
    return dict(zip(columns, [_id, *d]))
class StockDaySpider(scrapy.Spider):
    name = 'stock_investor'
    custom_settings = {
        'DOWNLOAD_DELAY': 1,
        'CONCURRENT_REQUESTS': 1,
        'MONGODB_COLLECTION': 'stock_day',  # same collection as the daily price/volume data
        'MONGODB_ITEM_CACHE': 1,
        'MONGODB_HAS_ID_FIELD': True,       # documents provide their own _id
        'COOKIES_ENABLED': False
    }

    def __init__(self, beginDate=None, endDate=None, *args, **kwargs):
        super(StockDaySpider, self).__init__(beginDate=beginDate, endDate=endDate, *args, **kwargs)

    def start_requests(self):
        # Crawl the range passed via -a beginDate / -a endDate, or just today.
        if self.beginDate and self.endDate:
            start = self.beginDate
            end = self.endDate
        else:
            start = end = datetime.today().strftime("%Y-%m-%d")
        for date in pd.date_range(start, end)[::-1]:  # newest day first
            today = '{}/{:02d}/{:02d}'.format(date.year, date.month, date.day)
            y, m, d = date.year, date.month, date.day
            url = TWSE_URL.format(y=y, m=m, d=d)
            time.sleep(8)  # extra throttle on top of DOWNLOAD_DELAY
            yield scrapy.Request(url, meta={'date': today, '市場別': '上市'})
            y = y - 1911   # TPEX dates use the ROC calendar
            url = TPEX_URL.format(y=y, m=m, d=d)
            yield scrapy.Request(url, meta={'date': today, '市場別': '上櫃'})

    def parse(self, response):
        m = response.meta
        json_data = json.loads(response.text)
        # TWSE puts its rows under 'data', TPEX under 'aaData'; days with no
        # data (holidays) lack the key and are simply skipped.
        key = 'data' if m['市場別'] == '上市' else 'aaData'
        try:
            for d in json_data[key]:
                yield parse_info(d, m)
        except KeyError:
            pass
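With everything in place, the spider runs over a date range via Scrapy's -a spider arguments:

scrapy crawl stock_investor -a beginDate=2018-10-01 -a endDate=2018-10-03

parse_info can also be exercised offline with a hand-built TWSE row. The share counts below are reverse-engineered from the 0050 record above, so the printed dict should reproduce its institutional fields (the price/volume fields in that record come from the existing stock_day documents, merged by _id). This snippet assumes it runs in the same module as parse_info (or imports it):

# Offline check of parse_info with an illustrative, hand-built 上市 row.
sample_row = ['0050', '元大台灣50',
              '1,611,000', '2,000', '1,609,000',  # 外陸資 buy / sell / net
              '0', '0', '0',                      # 外資自營商
              '0', '725,000', '-725,000',         # 投信
              '236,000',                          # aggregate 自營商買賣超 (removed by pop(11))
              '154,000', '901,000', '-747,000',   # 自營商(自行買賣)
              '1,215,000', '232,000', '983,000',  # 自營商(避險)
              '1,120,000']                        # 三大法人買賣超
print(parse_info(sample_row, {'date': '2018/10/03', '市場別': '上市'}))
# -> {'_id': '2018/10/03_0050', '外陸資買進': 1611.0, ..., '三大法人買賣超': 1120.0}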