Stock Prediction, Round Two :: A Simple Test and Some Small Fixes

11th Ironman Contest

Expecting to last two days XD

2019-09-08 18:04:22

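Part 0: Introduction

The implementation in the previous post did produce a small, simple result in the end, but it was really a last-minute cram job.
The main reason, of course, is...

Honestly, just wrangling the earlier stuff (the API) and figuring out TFRecord ate up a ton of time QQ

Excuses! It's all just excuses!
Fine, excuses it is, so today I get to paste the process here and call that today's post (throws confetti~
Not to brag about how diligent I am, but I did still tweak some of the code and carefully re-implement the whole thing!!!
So, a quick rundown of what's below!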

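1. Network model
- This is a very simple test, but I can still share the network model I used
2. Input data and output data
- How I handle the input and output data, and why
3. Simple use of TFRecord
- I think TFRecord is really quite nice, but it's hard Orz... so today I'll only touch on it briefly XD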

Part 1: Network Model

Purely as a test, I used two LSTM+Dropout blocks, then a third LSTM layer, followed by 3 fully connected (FC) layers.

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
lstm (LSTM)                  (None, 90, 50)            10400     
_________________________________________________________________
dropout (Dropout)            (None, 90, 50)            0         
_________________________________________________________________
lstm_1 (LSTM)                (None, 90, 50)            20200     
_________________________________________________________________
dropout_1 (Dropout)          (None, 90, 50)            0         
_________________________________________________________________
lstm_2 (LSTM)                (None, 30)                9720      
_________________________________________________________________
dense (Dense)                (None, 40)                1240      
_________________________________________________________________
dense_1 (Dense)              (None, 20)                820       
_________________________________________________________________
dense_2 (Dense)              (None, 10)                210       
=================================================================
Total params: 42,590
Trainable params: 42,590
Non-trainable params: 0
_________________________________________________________________
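
As a sanity check on the summary above, the parameter counts can be reproduced by hand. This is only a sketch: it assumes the standard Keras LSTM layout (four gates, each with an input kernel, a recurrent kernel and a bias) and a single input feature, as in the model code later in this post. The helper names are made up for illustration.

```python
def lstm_params(input_dim, units):
    # 4 gates, each with an input kernel, a recurrent kernel and a bias
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # weight matrix plus bias vector
    return input_dim * units + units

layer_params = [
    lstm_params(1, 50),    # lstm
    lstm_params(50, 50),   # lstm_1
    lstm_params(50, 30),   # lstm_2
    dense_params(30, 40),  # dense
    dense_params(40, 20),  # dense_1
    dense_params(20, 10),  # dense_2
]
total_params = sum(layer_params)
print(layer_params, total_params)
# prints [10400, 20200, 9720, 1240, 820, 210] 42590
```

The numbers match the Param # column and the 42,590 total, so the summary is consistent with three LSTM layers and three Dense layers.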

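Part 2: Data Input and Output

1. Choosing the data: the dataset is too large to use in full, so I pick companies at random
- The training data comes from 120 randomly chosen companies
- The test data comes from 20 randomly chosen companies
2. Input and output: I want the input to be the previous N days, and the model to predict the following M days
- The previous post used the previous 30 days of data to predict the next day's closing price
- This post uses the previous 90 days to predict the closing prices of the following 10 days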
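
The "previous N days in, next M days out" idea can be sketched with plain NumPy. This is a minimal illustration, not the exact code used for the TFRecord files in this post: the function name `make_windows` and the toy price series are made up, and N=90, M=10 are taken from the list above.

```python
import numpy as np

def make_windows(series, n_in=90, n_out=10):
    """Cut a 1-D series into (previous n_in values, next n_out values) pairs."""
    series = np.asarray(series, dtype=np.float32)
    n_windows = len(series) - n_in - n_out + 1
    if n_windows <= 0:
        # Series too short to form even one input/output pair
        return (np.empty((0, n_in), np.float32),
                np.empty((0, n_out), np.float32))
    xs = np.stack([series[i:i + n_in] for i in range(n_windows)])
    ys = np.stack([series[i + n_in:i + n_in + n_out] for i in range(n_windows)])
    return xs, ys

# Toy series standing in for one company's prices
toy_prices = np.arange(105, dtype=np.float32)
x, y = make_windows(toy_prices)
print(x.shape, y.shape)
# prints (6, 90) (6, 10)
```

A company with fewer than N + M = 100 days of data yields no window at all, which is why such companies get skipped when the records are generated.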

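Part 3: Simple Use of TFRecord

This continues with the daily-historical-stock-prices-1970-2018 dataset from the previous post.
Before using TFRecord, I first converted the .csv files to .npy files, which cut the size by a bit over 40% (2GB -> 1.13GB).

1. Converting to .npy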

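import numpy as np
import pandas as pd

# `dataset_path` and `dataset_dict` are assumed to be defined as in the previous post.
# Save the unique ticker symbols.
names = pd.read_csv(dataset_path + dataset_dict['names'], engine='python')
companies = names.ticker.unique()
np.save('companies.npy', companies)
# Save the full price table as one array (mixed column types, so it is pickled).
prices = pd.read_csv(dataset_path + dataset_dict['prices'], engine='python')
prices = prices.values
np.save('dataGen_full.npy', prices)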

2. Generating the TFRecord

import os
import tensorflow as tf
import numpy as np

companies = np.load('companies.npy', allow_pickle=True)
prices = np.load('dataGen_full.npy', allow_pickle=True)

def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=value))

def _int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=value))

def _float32_feature(value):
    return tf.train.Feature(float_list=tf.train.FloatList(value=value))

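def create_tfrecords(writer, days, stock_sc):
    # Slide a window over one company's scaled prices:
    # x = the previous `days` values, y = the following 10 values.
    stock_sc = np.asarray(stock_sc, np.float32)
    max_days = len(stock_sc) - 10

    # Note: range() excludes max_days, so the last possible window is skipped.
    for i in range(days, max_days):
        x = stock_sc[i-days:i, 0]
        y = stock_sc[i:i+10, 0]
        # Store both windows as raw float32 bytes
        # (tobytes() replaces the deprecated tostring()).
        feature = {
            'x': _bytes_feature([x.tobytes()]),
            'y': _bytes_feature([y.tobytes()])}
        example = tf.train.Example(features=tf.train.Features(feature=feature))
        writer.write(example.SerializeToString())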

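# Data Normalization
from sklearn.preprocessing import MinMaxScaler

# 20 companies for the test set (per the text above, the training set uses 120)
num_choose = 20

count = len(companies)

# Note: randint samples with replacement, so a company can be picked more than once
rand_company_indexes = np.random.randint(count, size=num_choose)
companies_choose = companies[rand_company_indexes]

# open    close    adj_close    low    high    volume
scaler = MinMaxScaler()
days_before = 90
save_name = './testing.tfrecord'
# TF 1.x API; in TF 2.x the writer is tf.io.TFRecordWriter
with tf.python_io.TFRecordWriter(save_name) as writer:
    for company in companies_choose:
        # Rows for this company; drop the ticker column
        indexes = prices[:, 0] == company
        stock = prices[indexes, 1:]
        # Keep one price column (index 2 of `stock`, i.e. adj_close if the
        # column order above is right) as a 2-D array for the scaler
        training_data = stock[:, 2:3]
        # Need at least 90 input days + 10 output days
        if len(training_data) < 100:
            continue
        # print(training_data.shape)
        # Scale each company's prices to [0, 1] independently
        get_sc = scaler.fit_transform(training_data)
        # print(get_sc)
        create_tfrecords(writer=writer, days=days_before, stock_sc=get_sc)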
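
One thing to be aware of: `np.random.randint` draws with replacement, so the same company can come up more than once, and separate random draws for the training and test sets can overlap. Below is a minimal sketch of one way to get disjoint picks, assuming a single draw without replacement is acceptable. The function name `split_companies`, the fake ticker names and the fixed seed are all made up for illustration.

```python
import numpy as np

def split_companies(companies, n_train, n_test, seed=0):
    """Pick n_train + n_test distinct companies and split them in two."""
    rng = np.random.default_rng(seed)
    # replace=False guarantees every picked index is unique
    picked = rng.choice(len(companies), size=n_train + n_test, replace=False)
    return companies[picked[:n_train]], companies[picked[n_train:]]

# Fake tickers standing in for companies.npy
tickers = np.array(['T%03d' % i for i in range(500)])
train_companies, test_companies = split_companies(tickers, 120, 20)
print(len(train_companies), len(test_companies))
# prints 120 20
```

With this kind of split, no test company is ever seen during training, which makes the test result a bit more trustworthy.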

3. A quick look at the data inside the TFRecord

import tensorflow as tf
file = "training.tfrecord"
# TF 1.x API: iterate over the serialized records in the file
record_iterator = tf.python_io.tf_record_iterator(path=file)
count = 0
for string_record in record_iterator:
    if count == 0:
        # Parse and print only the first record; this is purely demonstrative.
        example = tf.train.Example()
        example.ParseFromString(string_record)
        print(example)
    # Keep counting so the total below covers the whole file
    count += 1
print('total of count : ', count)

If you see output, that should mean the data really got saved, right?
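
The `x` and `y` features are stored as raw bytes, so the printed example is not very readable. Here is a small NumPy-only sketch of what happens to one window on the way in and out, assuming float32 windows of 90 values as in `create_tfrecords`. `np.frombuffer` plays the role here that `tf.decode_raw` plays in the training pipeline, and the toy window is made up.

```python
import numpy as np

# A made-up 90-day window of scaled prices
window = np.linspace(0.0, 1.0, 90, dtype=np.float32)

# What gets written into the record: 4 bytes per float32 value
raw = window.tobytes()

# What the reader has to do: reinterpret the bytes with the same dtype
restored = np.frombuffer(raw, dtype=np.float32)

print(len(raw), restored.shape, np.array_equal(window, restored))
# prints 360 (90,) True
```

The bytes carry no dtype or shape information, so the reader must already know both. That is why the parsing code decodes as float32 and then reshapes explicitly.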

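Part 4: Training

1. Loading the training data with the `tf.data` API

For the initial settings, I set the batch size to 3000, simply following how someone else wrote it.
The previous post only used 200, which felt far too small for the model to learn anything... (just a gut feeling)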

import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import os
reshape_size = 90
length = 352983
batch_size = 3000

def extract_features(example, reshape_size):
    # Parse one serialized Example; x and y were stored as raw byte strings
    features = tf.parse_single_example(
        example,
        features={
            'x': tf.FixedLenFeature([], tf.string),
            'y': tf.FixedLenFeature([], tf.string),
        }
    )
    # Decode the bytes back to float32 and restore the shapes:
    # input (reshape_size, 1) for the LSTM, label (10,)
    stock = tf.decode_raw(features['x'], tf.float32)
    stock = tf.reshape(stock, [reshape_size])
    stock = tf.expand_dims(stock, -1)
    label = tf.decode_raw(features['y'], tf.float32)
    label = tf.reshape(label, [10])
    return stock, label

tfrecords_path = './training.tfrecord'
dataset = tf.data.TFRecordDataset(tfrecords_path)
dataset = dataset.map(lambda x: extract_features(x, reshape_size))
dataset = dataset.shuffle(buffer_size=1000)
dataset = dataset.batch(batch_size, drop_remainder=True)
dataset = dataset.repeat()
train_gen = dataset.make_initializable_iterator()

2. The model

You can inspect your own model with summary()!!

# LSTM Training
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

model = Sequential()

model.add(LSTM(units = 50, return_sequences = True, input_shape = (reshape_size, 1)))
model.add(Dropout(0.2))

model.add(LSTM(units = 50, return_sequences = True))
model.add(Dropout(0.2))

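# The last LSTM returns only its final output; the three Dense layers
# (no activation set, so they are linear) map it to the 10 predicted days
model.add(LSTM(units = 30, return_sequences = False))
model.add(Dense(units = 40))
model.add(Dense(units = 20))
model.add(Dense(units = 10))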
model.compile(optimizer = 'adam', loss = 'mean_squared_error')
model.summary()

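3. The training code

Here I use a callback to save the model weights at every epoch.
The number 352983 below is simply the number of samples; that's how many there are in total.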

import os
from tensorflow import keras

epochs = 20
checkpoint_path = './first_check/'
os.makedirs(checkpoint_path, exist_ok=True)  # make sure the checkpoint folder exists
with tf.Session() as sess:
    sess.run(train_gen.initializer)
    # One weights-only checkpoint per epoch, e.g. cp-0001.ckpt
    checkpoint_file = os.path.join(checkpoint_path, "cp-{epoch:04d}.ckpt")
    cp_callback = keras.callbacks.ModelCheckpoint(checkpoint_file, save_weights_only=True, verbose=1, period=1)
    # The dataset is already batched, so batch_size is not passed to fit()
    train = model.fit(train_gen, epochs = epochs,
                      steps_per_epoch=352983 // batch_size, callbacks=[cp_callback])
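
To make the `steps_per_epoch` value concrete, here is the arithmetic spelled out, using the sample count and batch size from this post (the variable name `dropped` is just for illustration):

```python
length = 352983      # number of samples in training.tfrecord
batch_size = 3000

# Number of full batches in one pass over the data
steps_per_epoch = length // batch_size
# Samples left over in each pass; with drop_remainder=True they are skipped
dropped = length - steps_per_epoch * batch_size
print(steps_per_epoch, dropped)
# prints 117 1983
```

So one epoch is 117 batches, and because the dataset is batched with `drop_remainder=True`, the last 1983 samples of each pass never reach the model.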