Time Series

2018 iT 邦幫忙 Ironman Contest

DAY 13

AI & Machine Learning

Exploring the Microsoft CNTK Machine Learning Toolkit, part 13 of the series

HO-HSUN

2018-01-01 21:15:51

Introduction

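Time series analysis starts from the distribution of historical data recorded by week, month, or quarter and estimates trend, seasonal, cyclical, and random components, for example to forecast the future sales of a business.

It is commonly applied in retail (supermarkets, convenience stores, hypermarkets) to forecast sales. Most convenience-store point-of-sale (POS) systems, for example, include built-in sales analysis features.

pandas is a Python library whose data structures are well suited to time series analysis. NumPy data structures can also be used with pandas.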
http://pandas.pydata.org/
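
As a small illustration of how pandas and NumPy work together, the sketch below wraps a NumPy array in a date-indexed pandas Series and converts it back. The prices and dates are made up for the example and are not part of the tutorial data.

```python
import numpy as np
import pandas as pd

# Hypothetical closing prices held in a plain NumPy array
closes = np.array([100.0, 101.5, 101.0, 103.0, 102.5])

# Wrap the array in a Series indexed by business days
dates = pd.date_range("2017-01-02", periods=len(closes), freq="B")
series = pd.Series(closes, index=dates, name="Close")

# NumPy functions accept the Series directly and keep the date index
log_returns = np.log(series).diff().dropna()

# Label-based selection by date string
two_days = series.loc["2017-01-03":"2017-01-04"]

# Convert back to a NumPy array when a library expects one
as_array = np.asarray(series)
```

The date index is what makes selections such as a training period or a test period easy to express later on.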

Tasks

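Learning resource: cntk\Tutorials\CNTK_104_Finance_Timeseries_Basic_with_Pandas_Numpy.ipynb
The task is to classify an exchange-traded fund (ETF) as a buy or a sell.
Stock analysis requires specialized domain knowledge.
Fund trading and stock markets are complex and hard to predict. Whatever model is used, take the many factors involved into account and consult the analysis of experts in the field.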

CNTK object declarations

# Import the required modules
# Compatibility: makes the newer print function available on older Python versions
from __future__ import print_function
import datetime
import numpy as np
import os
import pandas as pd
pd.options.mode.chained_assignment = None  # default='warn'

import cntk as C
import cntk.tests.test_utils

# Detect and select CPU or GPU as the device for the current test environment
cntk.tests.test_utils.set_device_from_pytest_env() # (only needed for our build system)

# Fix the random seeds for CNTK and NumPy
C.cntk_py.set_fixed_random_seed(1)
np.random.seed(123)

# Show plots inline in the notebook
%matplotlib inline

1. Data reading:

Import: the pandas_datareader data-reading module.

import time
try:
    from  pandas_datareader import data
except ImportError:
    !pip install pandas_datareader
    from  pandas_datareader import data

Define function: get_stock_data reads stock data from Google Finance. It needs an internet connection to run.

def get_stock_data(contract, s_year, s_month, s_day, e_year, e_month, e_day):
    """
    Args:
        contract (str): the name of the stock/etf
        s_year (int): start year for data
        s_month (int): start month
        s_day (int): start day
        e_year (int): end year
        e_month (int): end month
        e_day (int): end day
    Returns:
        Pandas Dataframe: Daily OHLCV bars
    """
    start = datetime.datetime(s_year, s_month, s_day)
    end = datetime.datetime(e_year, e_month, e_day)

    retry_cnt, max_num_retry = 0, 3

    while retry_cnt < max_num_retry:
        try:
            bars = data.DataReader(contract, "google", start, end)
            return bars
        except Exception:
            # Wait a random 1 to 9 seconds, then try again
            retry_cnt += 1
            time.sleep(np.random.randint(1, 10))

    print("Google Finance is not reachable")
    raise Exception('Google Finance is not reachable')
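
The retry loop above can be factored into a small reusable helper. The following is only a sketch: `call_with_retries` and `flaky` are hypothetical names introduced for illustration, and the wait is set to zero so the example runs instantly.

```python
import random
import time

def call_with_retries(func, max_retries=3, max_wait_seconds=0.0):
    """Call func() until it succeeds or max_retries attempts have failed."""
    last_error = None
    for _ in range(max_retries):
        try:
            return func()
        except Exception as err:
            # Remember the failure and pause for a random time before retrying
            last_error = err
            time.sleep(random.uniform(0, max_wait_seconds))
    raise RuntimeError("gave up after {} attempts: {}".format(max_retries, last_error))

# Hypothetical flaky operation: fails twice, then succeeds
attempts = {"count": 0}

def flaky():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise IOError("temporary failure")
    return "ok"

result = call_with_retries(flaky)
```

Randomizing the wait, as the tutorial code does, avoids retrying at a fixed rhythm against a service that may be throttling requests.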

Import: the pickle serialization module.

import pickle as pkl

# We search in cached stock data set with symbol SPY.               
# Check for an environment variable defined in CNTK's test infrastructure
envvar = 'CNTK_EXTERNAL_TESTDATA_SOURCE_DIRECTORY'
def is_test(): return envvar in os.environ

Define function: download fetches the data and saves it locally.
It retrieves the data for SPY.

def download(data_file):
    try:
        data = get_stock_data("SPY", 2000, 1, 2, 2017, 1, 1)
    except Exception:
        raise Exception("Data could not be downloaded")

    # Create the target directory if it does not exist yet
    data_dir = os.path.dirname(data_file)

    if not os.path.exists(data_dir):
        os.makedirs(data_dir)
        
    if not os.path.isfile(data_file):
        print("Saving", data_file )
        with open(data_file, 'wb') as f:
            pkl.dump(data, f, protocol = 2)
    return data
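
The download function combines two concerns: fetching the data and caching it with pickle. The caching half can be sketched on its own as below. `load_or_create` and `expensive` are hypothetical names, and a temporary directory stands in for the tutorial's data folder.

```python
import os
import pickle
import tempfile

def load_or_create(cache_file, create):
    """Return the cached object if cache_file exists; otherwise build and cache it."""
    if os.path.isfile(cache_file):
        with open(cache_file, "rb") as f:
            return pickle.load(f)
    obj = create()
    cache_dir = os.path.dirname(cache_file)
    if cache_dir and not os.path.exists(cache_dir):
        os.makedirs(cache_dir)
    with open(cache_file, "wb") as f:
        # Protocol 2 keeps the file readable from Python 2 as well
        pickle.dump(obj, f, protocol=2)
    return obj

calls = []

def expensive():
    # Stand-in for a slow download
    calls.append(1)
    return {"symbol": "SPY", "rows": 3}

cache_file = os.path.join(tempfile.mkdtemp(), "Stock", "example.pkl")
first = load_or_create(cache_file, expensive)   # builds the object and saves it
second = load_or_create(cache_file, expensive)  # reads it back from disk
```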

data_file = os.path.join("data", "Stock", "stock_SPY.pkl")

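Check whether the data already exists locally. The dataset should have been installed together with CNTK; if it is not there, check the CNTK test environment.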

if os.path.exists(data_file):
    print("File already exists", data_file)
    data = pd.read_pickle(data_file)
else:
    # The data should have been stored locally when CNTK was installed; if it is missing, check the CNTK test environment
    if is_test():
        test_file = os.path.join(os.environ[envvar], 'Tutorials','data','stock','stock_SPY.pkl')
        if os.path.isfile(test_file):
            print("Reading data from test data directory")
            data = pd.read_pickle(test_file)
        else:
            print("Test data directory missing file", test_file)
            print("Downloading data from Google Finance")
            data = download(data_file)         
    else:
        data = download(data_file)

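2. Data preprocessing:

To reduce noise in the data and limit overfitting, the data is usually smoothed or otherwise processed first.

The goal is to predict whether the market will close higher or lower the next day. The input features describe the movement between previous days and the current day; the next day's movement is the label.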
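
A moving average is one common form of smoothing. The sketch below applies it to synthetic data with NumPy; the series, the noise level, and the window of 5 are assumptions chosen for illustration, not values from the tutorial.

```python
import numpy as np

def moving_average(values, window):
    """Average each run of `window` consecutive values."""
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode="valid")

rng = np.random.RandomState(0)

# Synthetic series: a linear trend plus random noise
trend = np.linspace(100.0, 110.0, 50)
noisy = trend + rng.normal(scale=1.0, size=trend.shape)

smooth = moving_average(noisy, window=5)

# How far each version strays from the underlying trend
noise_before = np.std(noisy - trend)
noise_after = np.std(smooth - moving_average(trend, 5))
```

With `mode="valid"` the result is shorter than the input by `window - 1` values, because only full windows are averaged.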

# List of feature names
predictor_names = []

# Feature: absolute relative change in closing price versus the previous day
data["diff"] = np.abs((data["Close"] - data["Close"].shift(1)) / data["Close"]).fillna(0) 
predictor_names.append("diff")

# Feature: absolute relative change in volume versus the previous day
data["v_diff"] = np.abs((data["Volume"] - data["Volume"].shift(1)) / data["Volume"]).fillna(0) 
predictor_names.append("v_diff")

# For each of the previous 8 days: 1 if today's close is above the close i days earlier, otherwise 0
num_days_back = 8

for i in range(1,num_days_back+1):
    data["p_" + str(i)] = np.where(data["Close"] > data["Close"].shift(i), 1, 0) # i: number of look back days
    predictor_names.append("p_" + str(i))
    
# Save the data locally
data.to_csv("PATH_TO_SAVE.csv")
data.head(10)
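
To see what these features contain, the same expressions can be run on a toy table. The five closing prices are made up, and only two look-back days are used to keep the output small.

```python
import numpy as np
import pandas as pd

# Hypothetical closing prices
toy = pd.DataFrame({"Close": [10.0, 11.0, 10.5, 10.5, 12.0]})

# Same expression as above: the change is divided by the current day's close
toy["diff"] = np.abs((toy["Close"] - toy["Close"].shift(1)) / toy["Close"]).fillna(0)

# 1 if today's close is above the close i days earlier, otherwise 0
for i in range(1, 3):
    toy["p_" + str(i)] = np.where(toy["Close"] > toy["Close"].shift(i), 1, 0)

p_1 = toy["p_1"].tolist()
p_2 = toy["p_2"].tolist()
```

Here `p_1` is `[0, 1, 0, 0, 1]` and `p_2` is `[0, 0, 1, 0, 1]`. The first rows have no earlier day to compare with, so the comparison against the missing value is false and they come out as 0; an unchanged close also counts as 0.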

The label records whether the market rises (1) or falls (0) on the next day.

data["next_day"] = np.where(data["Close"].shift(-1) > data["Close"], 1, 0)
data["next_day_opposite"] = np.where(data["next_day"]==1,0,1) # The label must be one-hot encoded
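
The two label columns form a one-hot encoding: exactly one of them is 1 in every row. A minimal check on made-up prices:

```python
import numpy as np
import pandas as pd

# Hypothetical closing prices
toy = pd.DataFrame({"Close": [10.0, 11.0, 10.5, 10.5, 12.0]})

toy["next_day"] = np.where(toy["Close"].shift(-1) > toy["Close"], 1, 0)
toy["next_day_opposite"] = np.where(toy["next_day"] == 1, 0, 1)

labels = np.asarray(toy[["next_day", "next_day_opposite"]], dtype="float32")
up_column = labels[:, 0].tolist()
```

`up_column` is `[1.0, 0.0, 0.0, 1.0, 0.0]`. The last row has no following day, so it falls into the "down" class by default, as does a day followed by an unchanged close.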

# Training set: select the date range
training_data = data["2001-02-05":"2009-01-20"] 

# Test set: select the date range
test_data = data["2009-01-20":"2016-12-29"]

training_features = np.asarray(training_data[predictor_names], dtype = "float32")
training_labels = np.asarray(training_data[["next_day","next_day_opposite"]], dtype="float32")
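
Slicing a date-indexed DataFrame by date strings includes both endpoints, so two ranges that share a boundary date also share that row, provided the date is present in the data. The sketch below shows this on a hypothetical five-day table, together with one possible way to keep the two sets disjoint.

```python
import pandas as pd

# Hypothetical five-day table
idx = pd.date_range("2009-01-16", periods=5, freq="D")
frame = pd.DataFrame({"Close": [1.0, 2.0, 3.0, 4.0, 5.0]}, index=idx)

# Both ends of a label-based date slice are included
train = frame.loc["2009-01-16":"2009-01-18"]
test = frame.loc["2009-01-18":"2009-01-20"]
shared = train.index.intersection(test.index)

# One option: start the test set strictly after the last training row
test_no_overlap = frame.loc[frame.index > train.index[-1]]
```

Note that changing the training range also changes the number of training rows, which matters later because the minibatch split needs the row count to divide evenly.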

3. Model creation:

The model is a feed-forward neural network with 10 input dimensions and two hidden layers of 10 neurons each.

print(training_features.shape)

The network has two output classes: the market rises (1) or falls (0).

input_dim = 2 + num_days_back

# Up (1) or down (0)
num_output_classes = 2 

num_hidden_layers = 2
hidden_layers_dim = 2 + num_days_back
input_dynamic_axes = [C.Axis.default_batch_axis()]
input = C.input_variable(input_dim, dynamic_axes=input_dynamic_axes)
label = C.input_variable(num_output_classes, dynamic_axes=input_dynamic_axes)

def create_model(input, num_output_classes):
    h = input
    with C.layers.default_options(init = C.glorot_uniform()):
        for i in range(num_hidden_layers):
            h = C.layers.Dense(hidden_layers_dim, 
                               activation = C.relu)(h)
        r = C.layers.Dense(num_output_classes, activation=None)(h)   
    return r
    
z = create_model(input, num_output_classes)
loss = C.cross_entropy_with_softmax(z, label)
label_error = C.classification_error(z, label)
lr_per_minibatch = C.learning_parameter_schedule(0.125)
trainer = C.Trainer(z, (loss, label_error), [C.sgd(z.parameters, lr=lr_per_minibatch)])
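
For readers who want to see the arithmetic behind the network, the forward pass, the softmax cross-entropy loss, and the classification error can be sketched in plain NumPy. This is an approximation for illustration: the helper names are hypothetical, the input is random, and the textbook Glorot uniform formula used here may differ in detail from CNTK's initializer.

```python
import numpy as np

rng = np.random.RandomState(0)

def glorot_uniform(fan_in, fan_out):
    # Textbook Glorot uniform range (assumed; CNTK's scaling may differ)
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def dense(x, w, b, activation=None):
    y = x.dot(w) + b
    return np.maximum(y, 0.0) if activation == "relu" else y

def softmax_cross_entropy(logits, one_hot):
    # Subtract the row maximum for numerical stability
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(one_hot * log_probs).sum(axis=1)

input_dim, hidden_dim, num_classes = 10, 10, 2
w1, b1 = glorot_uniform(input_dim, hidden_dim), np.zeros(hidden_dim)
w2, b2 = glorot_uniform(hidden_dim, hidden_dim), np.zeros(hidden_dim)
w3, b3 = glorot_uniform(hidden_dim, num_classes), np.zeros(num_classes)

x = rng.rand(4, input_dim)             # 4 random samples
y = np.eye(num_classes)[[0, 1, 1, 0]]  # one-hot labels

h = dense(dense(x, w1, b1, "relu"), w2, b2, "relu")  # two hidden layers
z = dense(h, w3, b3)                                 # linear output layer
loss = softmax_cross_entropy(z, y)
error = np.mean(z.argmax(axis=1) != y.argmax(axis=1))
```

The output layer has no activation because the softmax is folded into the loss, which mirrors the `activation=None` and `cross_entropy_with_softmax` pairing in the CNTK code.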

Set the training parameters.

# Set the minibatch size
minibatch_size = 100
num_minibatches = len(training_data.index) // minibatch_size

training_progress_output_freq = 1

Prepare a container for the values to plot.

plotdata = {"batchsize":[], "loss":[], "error":[]}

4. Learning the model:

Split the data into minibatches and feed them to the model in time order. Different minibatch sizes can lead to different training results.

tf = np.split(training_features,num_minibatches)

print("Number of mini batches")
print(len(tf))

print("The shape of the training feature minibatch")
print(tf[0].shape)

tl = np.split(training_labels, num_minibatches)

# Make a single pass over the data
num_passes = 1
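
`np.split` only succeeds when the number of rows divides evenly by the number of sections. The sketch below shows the failure case on a small made-up array and `np.array_split`, which accepts uneven sizes, as a possible alternative.

```python
import numpy as np

features = np.arange(20).reshape(10, 2)  # 10 rows, 2 columns

# 10 rows split evenly into 5 sections of 2 rows
even = np.split(features, 5)

# 10 rows cannot be split evenly into 3 sections
try:
    np.split(features, 3)
    split_failed = False
except ValueError:
    split_failed = True

# array_split allows sections of different sizes
uneven = np.array_split(features, 3)
sizes = [len(part) for part in uneven]
```

Here `sizes` is `[4, 3, 3]`: the extra row goes to the first section.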

Define function: print_training_progress reports the loss and error during training.

def print_training_progress(trainer, mb, frequency, verbose=1):
    training_loss = "NA"
    eval_error = "NA"
    if mb%frequency == 0:
        training_loss = trainer.previous_minibatch_loss_average
        eval_error = trainer.previous_minibatch_evaluation_average
        if verbose: 
            print ("Minibatch: {0}, Loss: {1:.4f}, Error: {2:.2f}%".format(mb, training_loss, eval_error*100))
    return mb, training_loss, eval_error

Train the model.

# Train the neural network
tf = np.split(training_features,num_minibatches)
tl = np.split(training_labels, num_minibatches)

for i in range(num_minibatches*num_passes): # one iteration per minibatch per pass
    features = np.ascontiguousarray(tf[i%num_minibatches])
    labels = np.ascontiguousarray(tl[i%num_minibatches])
    
    # Specify the mapping of input variables in the model to actual minibatch data to be trained with
    trainer.train_minibatch({input : features, label : labels})
    batchsize, loss, error = print_training_progress(trainer, i, training_progress_output_freq, verbose=1)
    if not (loss == "NA" or error =="NA"):
        plotdata["batchsize"].append(batchsize)
        plotdata["loss"].append(loss)
        plotdata["error"].append(error)
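
What a minibatch training loop does can be sketched without CNTK. The example below trains a plain softmax regression (not the two-layer network above) with stochastic gradient descent on synthetic data, visiting the minibatches once in their original order. The data, the model, and the way the learning rate is applied are assumptions for illustration; CNTK's handling of the learning rate relative to minibatch size is not reproduced here.

```python
import numpy as np

rng = np.random.RandomState(0)

# Synthetic data: 200 rows, 3 features; class 1 when the first feature is positive
x = rng.normal(size=(200, 3))
y = np.eye(2)[(x[:, 0] > 0).astype(int)]

w = np.zeros((3, 2))
b = np.zeros(2)
learning_rate = 0.125
losses = []

# One pass over 20 minibatches of 10 rows, in their original order
for xb, yb in zip(np.split(x, 20), np.split(y, 20)):
    logits = xb.dot(w) + b
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

    # Mean cross-entropy on this minibatch, measured before the update
    losses.append(-np.mean(np.sum(yb * np.log(probs), axis=1)))

    # Gradient of the mean loss with respect to the logits, then the SGD step
    grad_logits = (probs - yb) / len(xb)
    w -= learning_rate * xb.T.dot(grad_logits)
    b -= learning_rate * grad_logits.sum(axis=0)
```

Because the loss is recorded before each update, every value is measured on data the model has not yet trained on, which is also how the per-minibatch loss printed above should be read.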

Visualize the results.

import matplotlib.pyplot as plt

plt.figure(1)
plt.subplot(211)
plt.plot(plotdata["batchsize"], plotdata["loss"], 'b--')
plt.xlabel('Minibatch number')
plt.ylabel('Loss')
plt.title('Minibatch run vs. Training loss ')
plt.show()

plt.subplot(212)
plt.plot(plotdata["batchsize"], plotdata["error"], 'r--')
plt.xlabel('Minibatch number')
plt.ylabel('Label Prediction Error')
plt.title('Minibatch run vs. Label Prediction Error ')
plt.show()