Day 6: Predicting Whether a Student Is High-Concern (高關懷)

11th Ironman Contest (鐵人賽)

阿瑜

2019-09-21 02:13:23

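Reading examples is really important!!!

Once you truly understand one example, you can put it to use when you run into other situations ~~~

I have two first-hand experiences of this: the first was my graduation project; the second was a midterm assignment for a course.

The first came from studying the example in Professor Hung-yi's (宏毅老師) first homework: past PM2.5 records are the input and future PM2.5 is the output; in other words, the task is to predict the PM2.5 value.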

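In the implementation, you first slice the data following the slicing scheme in the provided slides, cutting it into as many samples as possible.
Then process the data into a shape the machine can understand and accept as input.
Finally, choose the optimizer, the initial learning rate, and the (random) weight & bias, and start training.
Write the actual predictions to a CSV file and render them as a table.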
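
The steps above can be sketched in a few lines of plain Python. This is only a minimal sketch under my own assumptions, not the homework's actual solution: the toy series, the window length of 3, the plain gradient descent update, and the helper names `make_windows`, `train_linear`, and `predict` are all made up for illustration.

```python
import random

def make_windows(series, window):
    """Slice a series into (previous `window` values, next value) samples."""
    samples = []
    for start in range(len(series) - window):
        features = series[start:start + window]
        target = series[start + window]
        samples.append((features, target))
    return samples

def predict(weights, bias, x):
    """Linear model: y = w . x + b."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def train_linear(samples, learning_rate=0.1, epochs=1000, seed=0):
    """Fit the linear model by full-batch gradient descent on the mean squared error."""
    rng = random.Random(seed)
    n_features = len(samples[0][0])
    weights = [rng.uniform(-0.1, 0.1) for _ in range(n_features)]  # random initial weights
    bias = rng.uniform(-0.1, 0.1)                                   # random initial bias
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in samples:
            error = predict(weights, bias, x) - y
            for i, xi in enumerate(x):
                grad_w[i] += 2 * error * xi / len(samples)
            grad_b += 2 * error / len(samples)
        # Update all parameters together, using gradients from the current weights
        weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b
    return weights, bias

# Made-up readings standing in for past PM2.5 records (small values keep the step size stable)
toy_series = [0.1 * t for t in range(12)]
samples = make_windows(toy_series, window=3)
weights, bias = train_linear(samples)
next_value = predict(weights, bias, toy_series[-3:])
```

Cutting the series with a stride of one is what "as many samples as possible" looks like here: 12 readings with a window of 3 give 9 training samples.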

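Once I had learned and thought through this example, I knew there were more advanced versions that could make the model more accurate, but getting a basic version that works came first, because I needed to graduate! So I changed it to use the past speed readings, one every 5 minutes × 5, as my input; in other words, the previous 25 minutes of data are the features, used to predict the speed at the 30th minute.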
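
To make the 25-minutes-in, 30th-minute-out layout concrete, here is a small sketch. The speed values are invented, and `build_speed_samples` is a hypothetical helper of mine; it assumes the readings are evenly spaced with the first one taken at minute 5.

```python
def build_speed_samples(speeds, n_inputs=5, step_minutes=5):
    """Pair each run of `n_inputs` consecutive readings with the reading that follows.

    speeds[i] is assumed to be the reading taken at minute (i + 1) * step_minutes.
    """
    samples = []
    for start in range(len(speeds) - n_inputs):
        features = speeds[start:start + n_inputs]            # e.g. minutes 5..25
        target = speeds[start + n_inputs]                     # e.g. minute 30
        target_minute = (start + n_inputs + 1) * step_minutes
        samples.append((features, target, target_minute))
    return samples

# Made-up speed readings taken at minutes 5, 10, ..., 40
toy_speeds = [52, 48, 50, 47, 45, 44, 49, 51]
samples = build_speed_samples(toy_speeds)
first_features, first_target, first_minute = samples[0]
```

The first sample uses the readings from minutes 5 to 25 as features and the reading at minute 30 as the target.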

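The second is a binary classification case using Keras's binary_crossentropy.
It is one of the examples in a book (I don't remember which number), probably the one right after digit recognition.
It predicts whether a Titanic passenger survived. The inputs are date of birth, sex, cabin class, port of embarkation, and number of relatives aboard; the output is alive (1) or dead (0). Working through it also led me to the touching stories behind the data. I think the book is an introductory one, and since Keras is a high-level API, the model part really takes just a few lines. This is rather different from the traditional way of implementing the model directly with math: the traditional way lets you feel each step of building the model and is more down-to-earth, since you do everything yourself, whereas with Keras you call library functions, tune parameters, and get a result very quickly. You just need a bit more imagination, because most of the work has already been done for you.
The application I extended from this second example is predicting which students will be classified as high-concern (高關懷) based on their current grades in each subject.
Today I am posting that case here.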
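
For a taste of the "do the math yourself" side, the loss that Keras hides behind the string 'binary_crossentropy' can be written out by hand. This is a rough sketch of the standard formula, not Keras's internal code; the labels and scores are made up, and the clipping constant `eps` is my own choice.

```python
import math

def sigmoid(z):
    """Squash a raw score into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip so log(0) never happens
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

labels = [1, 0, 1, 0]                    # made-up ground truth
scores = [2.0, -2.0, 1.0, -1.0]          # made-up raw model outputs
confident = [sigmoid(z) for z in scores]
unsure = [0.5, 0.5, 0.5, 0.5]

loss_confident = binary_crossentropy(labels, confident)
loss_unsure = binary_crossentropy(labels, unsure)
```

Predictions that lean the right way give a smaller loss than always answering 0.5, which is what the optimizer pushes toward during training.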

Description

The linked description contains the judging criteria and suggested directions; most importantly, it provides the training data and testing data.

[Code]

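IDE: Spyder
Many packages come pre-installed with it, so nothing extra needs to be installed.
The console on the right works just like IPython, and its visual presentation is very well done.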

# When you create a new file, Spyder automatically adds the creation time and the encoding line
# -*- coding: utf-8 -*-
"""
Created on Tue Mar 19 22:31:02 2019
"""


import numpy
import pandas as pd
from sklearn import preprocessing
numpy.random.seed(10)

# Load the training data
all_df = pd.read_excel("training.xlsx")

cols = ['程式設計','UNIX應用實務','微積分(I)','普通物理學','計算機概論','高關懷']
all_df = all_df[cols]

# Use about 80% of the data for training and about 20% for evaluation
msk = numpy.random.rand(len(all_df)) < 0.8
train_df = all_df[msk]
test_df = all_df[~msk]

# Preprocess the data: fill missing values with the column mean, then scale the features
def PreprocessData(raw_df):
    # Work on a copy so the DataFrame that was passed in is left untouched
    df = raw_df.copy()

    # Fill missing values in every column, including the 高關懷 label, with that column's mean
    for col in ['程式設計','UNIX應用實務','微積分(I)','普通物理學','計算機概論','高關懷']:
        df[col] = df[col].fillna(df[col].mean())

    array = df.values

    Label = array[:,5]       # last column: the 高關懷 label
    Features = array[:,0:5]  # first five columns: the course grades
    # Rescale each feature column to the range [0, 1]
    minmax_scale = preprocessing.MinMaxScaler(feature_range=(0, 1))
    scaledFeatures = minmax_scale.fit_transform(Features)
    return scaledFeatures, Label

train_Features,train_Label=PreprocessData(train_df)
test_Features,test_Label=PreprocessData(test_df)

# Build the Sequential model and add its layers and nodes
# Choose the loss function for binary classification
from keras.models import Sequential
from keras.layers import Dense,Dropout
model = Sequential()
model.add(Dense(units=40, input_dim=5, 
                kernel_initializer='uniform', 
                activation='relu'))
model.add(Dense(units=30, 
                kernel_initializer='uniform', 
                activation='relu'))
model.add(Dense(units=1, 
                kernel_initializer='uniform',
                activation='sigmoid'))
model.compile(loss='binary_crossentropy', 
              optimizer='adam', metrics=['accuracy'])
train_history =model.fit(x=train_Features, 
                         y=train_Label, 
                         validation_split=0.1, 
                         epochs=50, 
                         batch_size=30,verbose=2)

# Plotting
import matplotlib.pyplot as plt
def show_train_history(train_history,train,validation):
    plt.plot(train_history.history[train])
    plt.plot(train_history.history[validation])
    plt.title('Train History')
    plt.ylabel(train)
    plt.xlabel('Epoch')
    plt.legend(['train', 'validation'], loc='upper left')
    plt.show()

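# Evaluate the accuracy and the loss
# Older Keras versions store the metric as 'acc'; newer ones store it as 'accuracy'
acc_key = 'acc' if 'acc' in train_history.history else 'accuracy'
show_train_history(train_history, acc_key, 'val_' + acc_key)
show_train_history(train_history,'loss','val_loss')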

scores = model.evaluate(x=test_Features, 
                        y=test_Label)

#print(scores[1])
#test_df = pd.read_excel("testing.xlsx")
# Ten hand-entered testing rows; only the first three carry a 高關懷 label
q1 = pd.Series([28,60,54,77,39,1])
q2 = pd.Series([41,70,70,60,64,0])
q3 = pd.Series([67,79,75,68,51,1])
q4 = pd.Series([63,69,71,49,51])
q5 = pd.Series([78,94,99,69,74])
q6 = pd.Series([56,96,89,90,68])
q7 = pd.Series([39,49,74,67,43])
q8 = pd.Series([55,49,81,61,35])
q9 = pd.Series([61,69,78,67,63])
q10 = pd.Series([36,0,93,64,52])

# Rows without a label get NaN in the 高關懷 column
q = pd.DataFrame([list(q1),list(q2),list(q3),list(q4),list(q5),list(q6),list(q7),list(q8),list(q9),list(q10)],
                  columns=cols)

# Append the new rows to the full dataset so that they become its last ten rows
all_df = pd.concat([all_df, q], ignore_index=True)

# Preprocess the combined data and predict a probability for every row
test_Features, _ = PreprocessData(all_df)
test_probability = model.predict(test_Features)

# Predicted probabilities for the ten testing rows
print(test_probability[-10:])
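
One thing worth thinking about in the code above is that the scaler is fitted again every time PreprocessData runs. A possible alternative, sketched here in plain Python so the arithmetic is visible, is to record the minimum and maximum from the training rows once and reuse them for new rows. The helper names `fit_minmax`, `apply_minmax`, and `to_labels` are hypothetical, the rows are borrowed from the hand-entered ones above purely as toy data, and the 0.5 cutoff is my own assumption rather than the criterion in the linked description.

```python
def fit_minmax(rows):
    """Record the per-column minimum and maximum of the training rows."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]

def apply_minmax(rows, mins, maxs):
    """Scale rows with previously recorded minimums and maximums."""
    scaled = []
    for row in rows:
        scaled.append([
            (v - lo) / (hi - lo) if hi > lo else 0.0  # guard against constant columns
            for v, lo, hi in zip(row, mins, maxs)
        ])
    return scaled

def to_labels(probabilities, threshold=0.5):
    """Turn predicted probabilities into 1 / 0 labels with a fixed cutoff."""
    return [1 if p >= threshold else 0 for p in probabilities]

# Toy data: three rows play the role of training data, one row is "new"
train_rows = [[28, 60, 54, 77, 39], [41, 70, 70, 60, 64], [67, 79, 75, 68, 51]]
new_rows = [[63, 69, 71, 49, 51]]

mins, maxs = fit_minmax(train_rows)
scaled_new = apply_minmax(new_rows, mins, maxs)
example_labels = to_labels([0.2, 0.5, 0.9])
```

With this approach a new row can land outside [0, 1] when one of its grades falls outside the range seen in training, as the fourth column of `scaled_new` does here.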