Day 24: Machine Learning Never Tells You That You Are Wrong -- Keras Debugging Tips

12th iThome Ironman Contest


AI & Data

Part 24 of the series "Mastering Keras and Related Applications with Ease"


I code so I am

2020-09-24 12:10:09


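Introduction

Machine learning never tells you that you are wrong. Whatever you feed into training or prediction, it will still give you an answer; this is the classic Garbage In, Garbage Out. That raises two questions:

How do we know something is wrong?
How do we debug it?
Today we look at both.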

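How do we know something is wrong?

Two symptoms usually indicate a problem:

Accuracy is too low. This is especially true for neural networks: if the input data is wrong, accuracy tends to stay stuck at a low level.
The results differ sharply from what Exploratory Data Analysis (EDA) suggested.

So when working on a project, never throw the data straight into model training. Follow the 10 steps in the figure below in order. The first 3 steps are the real key to success: finding the key factors (X) that influence the target (Y) matters far more than the choice of model, unless you are competing on Kaggle.

Figure 1. Machine learning workflow, adapted from Free Machine learning diagram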

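1. Import the dataset.
2. Data cleaning: iterate between data cleaning and EDA to understand each variable's characteristics and how the variables relate to one another.
3. Feature engineering: find the key factors (X) that influence the target (Y).
4. Data split: use random sampling to split the data into training data and test data.
5. Choose the learning algorithm.
6. Model training.
7. Score the model: compute accuracy to measure model performance.
8. Evaluate the model: compare the performance of several models to find the best model and the best combination of parameters.
9. Delivery: deploy the model and serve it.
10. Predict: feed in new data and obtain predictions from the model.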
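Step 4 can be sketched in a few lines of NumPy. This is only an illustration under assumptions: the helper name `train_test_split_random`, the 20% test ratio, and the toy arrays are all hypothetical and do not come from the original article.

```python
import numpy as np


def train_test_split_random(x, y, test_ratio=0.2, seed=42):
    """Shuffle the sample indices, then cut them into train and test parts."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(x))
    n_test = int(len(x) * test_ratio)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return x[train_idx], x[test_idx], y[train_idx], y[test_idx]


# Toy data: 10 samples with 2 features each
x = np.arange(20).reshape(10, 2)
y = np.arange(10)

x_train, x_test, y_train, y_test = train_test_split_random(x, y)
print(x_train.shape, x_test.shape)
```

Shuffling the indices once and slicing them keeps X and Y aligned, and guarantees that no sample lands in both splits.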

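Common simple mistakes

Inconsistent input data: for example, the training data went through feature scaling, i.e. normalization or standardization, but the same step was forgotten for the test data.
The target (Y) is not handled properly: for example, one-hot encoding is forgotten, so the task turns into predicting a continuous variable.
Special data arrangements: for example, a time series needs the previous N periods of Y as input but the data was not transformed accordingly, or a moving average is required but was forgotten.
Image color handling: whether to use grayscale, RGB, or the HSV color space.
None of these situations raises an error message unless the input dimensions or sizes are wrong; in most cases the only symptom is abnormally low accuracy.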
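The first three mistakes above can be illustrated with a minimal NumPy sketch. All helper names (`fit_standardizer`, `standardize`, `one_hot`, `make_windows`) and the synthetic data are assumptions made for this example, not code from the original article.

```python
import numpy as np


def fit_standardizer(x_train):
    """Compute per-feature mean and std from the training data only."""
    return x_train.mean(axis=0), x_train.std(axis=0)


def standardize(x, mean, std):
    """Apply the training-set statistics to any split (train or test)."""
    return (x - mean) / std


def one_hot(labels, num_classes):
    """Turn integer class labels into one-hot rows for a softmax output."""
    return np.eye(num_classes, dtype="float32")[labels]


def make_windows(series, n_lags):
    """Use the previous n_lags values as input and the next value as target."""
    x = np.stack([series[i:i + n_lags] for i in range(len(series) - n_lags)])
    y = series[n_lags:]
    return x, y


# Synthetic features centered around 50, so unscaled data is easy to spot
rng = np.random.default_rng(0)
x_train = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
x_test = rng.normal(loc=50.0, scale=10.0, size=(20, 3))

mean, std = fit_standardizer(x_train)
x_train_scaled = standardize(x_train, mean, std)
x_test_scaled = standardize(x_test, mean, std)  # same statistics as training

# A forgotten transform shows up as a large gap between the two splits
print("scaled test mean:  ", x_test_scaled.mean(axis=0))
print("unscaled test mean:", x_test.mean(axis=0))

y_encoded = one_hot(np.array([0, 2, 1]), num_classes=3)
x_lag, y_next = make_windows(np.arange(6.0), n_lags=3)
print(y_encoded)
print(x_lag, y_next)
```

Comparing simple statistics of the training and test inputs just before they enter the model is a cheap way to catch a skipped preprocessing step, since Keras itself will not complain.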

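Keras debugging tips

Debugging Keras is not easy. model.compile() usually performs only basic checks, and because the layers of a neural network are chained together, you often cannot start looking for errors until a run has finished. I therefore suggest the following steps:

Use model.summary() to confirm each layer's output dimensions and parameter count.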

import tensorflow as tf
from tensorflow.keras import layers

# Input dimensions
input_shape = (28, 28, 1)

# Build the model
model = tf.keras.Sequential(
    [
        tf.keras.Input(shape=input_shape),
        layers.Conv2D(32, kernel_size=(3, 3), activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Conv2D(64, kernel_size=(3, 3), activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Flatten(),
        layers.Dropout(0.5),
        layers.Dense(10, activation="softmax"),
    ]
)

# Show the model summary
model.summary()

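The last line produces the output shown in Figure 2, where you can see each layer's output dimensions. For a more complex network such as the one in Figure 3, pay special attention to whether merged layers are correct: the red box marks the merge layer, and the last column, Connected to, lists the source layers being merged.

Figure 2. Model summary, the output of model.summary()

Figure 3. Model summary of a more complex neural network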
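The same information can also be checked in code instead of by eye. The sketch below rebuilds the model from this section and walks over its layers; the final assertion on the output shape is an assumption about what this particular 10-class model should produce, added here for illustration.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Same architecture as the model.summary() example in this section
model = tf.keras.Sequential(
    [
        tf.keras.Input(shape=(28, 28, 1)),
        layers.Conv2D(32, kernel_size=(3, 3), activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Conv2D(64, kernel_size=(3, 3), activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Flatten(),
        layers.Dropout(0.5),
        layers.Dense(10, activation="softmax"),
    ]
)

# Print name, output shape, and parameter count for every layer
for layer in model.layers:
    shape = tuple(layer.output.shape)
    print(f"{layer.name:20s} {str(shape):22s} params={layer.count_params()}")

# Fail early if the final output does not match the number of classes
assert tuple(model.output.shape) == (None, 10)
print("total params:", model.count_params())
```

An explicit assertion like this one turns a silent shape mismatch into an immediate failure, which is useful when the architecture is edited often.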

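Plot the model to confirm that its structure is correct. show_shapes=True displays the input/output dimensions, which helps with cross-checking. If the image is too small, save it with to_file='XXX' and zoom in with an image viewer. Note that plot_model requires pydot and Graphviz to be installed.

tf.keras.utils.plot_model(model, show_shapes=True, to_file='debug_model.png')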

After training, you can display the weights of any layer, for example the last one.

print(model.layers[-1].weights)
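Raw weight tensors are hard to read, so a short summary per weight can be more practical. The following is a minimal sketch; the helper `summarize_weights` and the small two-layer model are hypothetical and exist only to make the example self-contained.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

# Small stand-in model; in practice this would be the trained model
model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    layers.Dense(8, activation="relu"),
    layers.Dense(3, activation="softmax"),
])


def summarize_weights(model):
    """Collect shape, mean, std, and a NaN flag for every weight tensor."""
    report = []
    for layer in model.layers:
        for weight in layer.weights:
            values = weight.numpy()
            report.append({
                "layer": layer.name,
                "weight": weight.name,
                "shape": values.shape,
                "mean": float(values.mean()),
                "std": float(values.std()),
                "has_nan": bool(np.isnan(values).any()),
            })
    return report


report = summarize_weights(model)
for row in report:
    print(row)
```

Weights that contain NaN, or whose spread is far larger than at initialization, are a common sign that training diverged.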

Display each layer's output values

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Input dimensions
input_shape = (28, 28, 1)

# Build the model
model = tf.keras.Sequential([
    tf.keras.Input(shape=input_shape),
    layers.Conv2D(32, 3, activation='relu'),
    layers.Conv2D(32, 3, activation='relu'),
    layers.MaxPooling2D(2),
    layers.Conv2D(32, 3, activation='relu'),
    layers.Conv2D(32, 3, activation='relu'),
    layers.GlobalMaxPooling2D(),
    layers.Dense(10),
])
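# Build a second model that returns the output of every layer
extractor = keras.Model(inputs=model.inputs,
                        outputs=[layer.output for layer in model.layers])
# x_test is assumed to hold test images of shape (N, 28, 28, 1), e.g. MNIST
features = extractor(x_test[0:1])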
features

Display only the output of the second-to-last layer

intermediate_layer_model = keras.Model(inputs=model.input,
                                       outputs=model.layers[-2].output)
features = intermediate_layer_model(x_test[0:1])
features
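Printing the full tensors quickly becomes unreadable, so it can help to reduce each layer's output to a few numbers. The sketch below repeats the extractor from this section and feeds it a random image; the random input `x_sample` is an assumption that stands in for a real test image.

```python
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(28, 28, 1)),
    layers.Conv2D(32, 3, activation="relu"),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(2),
    layers.Conv2D(32, 3, activation="relu"),
    layers.Conv2D(32, 3, activation="relu"),
    layers.GlobalMaxPooling2D(),
    layers.Dense(10),
])
extractor = keras.Model(inputs=model.inputs,
                        outputs=[layer.output for layer in model.layers])

# Random stand-in for one test image (batch of 1)
x_sample = np.random.rand(1, 28, 28, 1).astype("float32")
features = extractor(x_sample)

# One line per layer: shape plus min, max, and mean of the activations
for layer, feature in zip(model.layers, features):
    values = np.asarray(feature)
    print(f"{layer.name:25s} {str(values.shape):18s} "
          f"min={values.min():.4f} max={values.max():.4f} mean={values.mean():.4f}")
```

A layer whose output is all zeros, or whose values explode compared with the previous layer, is a good place to start looking.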

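Debugging custom layers

If you write a custom layer, you can test it on its own before combining it with other layers, for example:

Custom layer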

import tensorflow as tf
from tensorflow.keras import layers

# Custom layer
class MyAntirectifier(layers.Layer):
    def build(self, input_shape):
        output_dim = input_shape[-1]
        self.kernel = self.add_weight(
            shape=(output_dim * 2, output_dim),
            initializer="he_normal",
            name="kernel",
            trainable=True,
        )

    def call(self, inputs):
        # Take the positive part of the input
        pos = tf.nn.relu(inputs)
        # Take the negative part of the input
        neg = tf.nn.relu(-inputs)
        
        # Concatenate the positive and negative parts
        # ****** Intentional bug: axis should be 1 ******* #
        concatenated = tf.concat([pos, neg], axis=0)
        
        # Project the concatenation down to the same dimensionality as the input
        return tf.matmul(concatenated, self.kernel)

Test:

x = tf.random.normal(shape=(2, 5))
y = MyAntirectifier()(x)

The error message is shown below. The matrix multiplication fails because the dimensions do not match:

InvalidArgumentError: Matrix size-incompatible: In[0]: [4,5], In[1]: [10,5] [Op:MatMul]

You can add debug messages with print or logging (the version below also corrects axis to 1):

class MyAntirectifier(layers.Layer):
    def build(self, input_shape):
        output_dim = input_shape[-1]
        self.kernel = self.add_weight(
            shape=(output_dim * 2, output_dim),
            initializer="he_normal",
            name="kernel",
            trainable=True,
        )

    def call(self, inputs):
        pos = tf.nn.relu(inputs)
        neg = tf.nn.relu(-inputs)
        print("pos.shape:", pos.shape)
        print("neg.shape:", neg.shape)
        concatenated = tf.concat([pos, neg], axis=1)
        print("concatenated.shape:", concatenated.shape)
        print("kernel.shape:", self.kernel.shape)
        return tf.matmul(concatenated, self.kernel)

Test again:

x = tf.random.normal(shape=(2, 5))
y = MyAntirectifier()(x)

The debug messages are now displayed:

pos.shape: (2, 5)
neg.shape: (2, 5)
concatenated.shape: (2, 10)
kernel.shape: (10, 5)
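Once the layer works, the manual check can be turned into a small repeatable test. This is a sketch under assumptions: the helper `check_layer_shapes` and the chosen batch sizes are hypothetical, and the layer is the corrected version (axis=1) without the print statements.

```python
import tensorflow as tf
from tensorflow.keras import layers


class MyAntirectifier(layers.Layer):
    def build(self, input_shape):
        output_dim = input_shape[-1]
        self.kernel = self.add_weight(
            shape=(output_dim * 2, output_dim),
            initializer="he_normal",
            name="kernel",
            trainable=True,
        )

    def call(self, inputs):
        pos = tf.nn.relu(inputs)
        neg = tf.nn.relu(-inputs)
        # Concatenate along the feature axis, not the batch axis
        concatenated = tf.concat([pos, neg], axis=1)
        return tf.matmul(concatenated, self.kernel)


def check_layer_shapes(layer, feature_dim, batch_sizes=(1, 2, 7)):
    """Call the layer alone and verify the output shape for several batches."""
    for batch_size in batch_sizes:
        x = tf.random.normal(shape=(batch_size, feature_dim))
        y = layer(x)
        assert tuple(y.shape) == (batch_size, feature_dim), (
            f"unexpected output shape {tuple(y.shape)} for batch {batch_size}"
        )
    return True


layer = MyAntirectifier()
ok = check_layer_shapes(layer, feature_dim=5)
print("shape check passed:", ok)
```

Trying several batch sizes matters because some axis mistakes only surface when the batch dimension differs from the feature dimension.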

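A thought occurred to me: we could write a custom layer that performs no processing at all and simply prints its input, purely for debugging. That might be a good idea.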
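One possible shape for such a layer is sketched below. The class name `DebugPrint` and its `tag` argument are hypothetical, and the sketch uses tf.print rather than Python's print so that the message is also emitted when the model runs as a compiled graph.

```python
import tensorflow as tf
from tensorflow.keras import layers


class DebugPrint(layers.Layer):
    """Hypothetical pass-through layer: prints its input and returns it unchanged."""

    def __init__(self, tag="debug", **kwargs):
        super().__init__(**kwargs)
        self.tag = tag

    def call(self, inputs):
        # tf.print writes to stderr by default and also works inside graphs
        tf.print(self.tag, "shape:", tf.shape(inputs),
                 "mean:", tf.reduce_mean(inputs))
        return inputs


# Drop the layer between two existing layers to inspect what flows through
model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    layers.Dense(8, activation="relu"),
    DebugPrint(tag="after dense"),
    layers.Dense(2),
])

out = model(tf.random.normal(shape=(3, 4)))
print(out.shape)
```

Because the layer has no weights and returns its input untouched, it can be removed again without changing the model's behavior.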

Training with run_eagerly=True

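During training, TensorFlow Keras runs a fully-compiled computation graph rather than ordinary Python code, which makes it hard to debug in the middle of the program. In that case we can pass run_eagerly=True to compile(); training then runs in a debug mode in which tensor values can be printed mid-program. The downside is that training becomes very slow. For details, see the Tip 3 section on the Keras official website. For example, if training never converges and the gradients keep flipping between positive and negative from one batch or step to the next, this is how you can track down the cause; it may be that the data should be standardized first.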
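A minimal sketch of this setup follows. The layer `PrintBatchStats`, the random data, and the tiny model are all assumptions made for illustration; only the run_eagerly=True argument to compile() comes from the text above.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers


class PrintBatchStats(layers.Layer):
    """Hypothetical pass-through layer that prints the mean of each batch."""

    def call(self, inputs):
        # Concrete values exist only in eager mode; skip while Keras traces shapes
        if tf.executing_eagerly():
            print("batch mean:", float(tf.reduce_mean(inputs)))
        return inputs


# Random stand-in data: 16 samples, 4 features, 1 regression target
rng = np.random.default_rng(0)
x = rng.normal(size=(16, 4)).astype("float32")
y = rng.normal(size=(16, 1)).astype("float32")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    PrintBatchStats(),
    layers.Dense(1),
])

# run_eagerly=True makes each training step execute as ordinary Python
model.compile(optimizer="sgd", loss="mse", run_eagerly=True)
history = model.fit(x, y, batch_size=8, epochs=1, verbose=0)
print("loss:", history.history["loss"])
```

With a batch size of 8 and 16 samples, the layer prints one line per training step. Once the problem is found, removing run_eagerly=True restores normal training speed.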