[Day 22] Evaluating Model Test Results

2024 iThome Ironman Contest

DAY 22

AI/ ML & Data

Easy Start with AI Projects: From Image Classification to Model Deployment, Part 22

16th Ironman Contest, python, image classification, deep learning

Eunice

2024-10-05 02:53:35

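Preface

Yesterday we ran inference with the trained model on a single image. But what if you have prepared a test dataset with many images? Feeding them in one at a time is far too inefficient. Today we'll run inference over an entire test dataset and then evaluate the results.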

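Overview

The test dataset should contain images relevant to the topic. For the bear dataset, that means bear images from the 5 corresponding classes. This series uses 10 images per class for inference and evaluates the test results with several metrics: Accuracy, Precision, Recall, F1-score, and Cohen's Kappa.

Running Inference on the Test Dataset

Implementation: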

from keras.models import load_model
import os
import numpy as np
from keras.preprocessing import image

# Load the model
saved_model = load_model('your_model_name.h5')  # change to your own model file name

# Path of the test dataset to walk through
test_path = 'your_test_dataset_path'  # change to your own test dataset path
# One folder per class; the position in this list is the class index
# (the order is assumed to match the class indices used during training)
class_names = ['black', 'grizzly', 'panda', 'polar', 'teddy']
# os.path.join adds the path separator that plain string concatenation would miss
folderlist = [os.path.join(test_path, name) for name in class_names]
# folderlist: ['your_test_dataset_path/black', 'your_test_dataset_path/grizzly', 'your_test_dataset_path/panda', 'your_test_dataset_path/polar', 'your_test_dataset_path/teddy']

# Ground truth: 10 images per class, in the same order as folderlist
y_true = [0]*10 + [1]*10 + [2]*10 + [3]*10 + [4]*10
# Predictions; the inference results are appended to this list below
y_pred = []

# Inference
for i, folder in enumerate(folderlist):
    test_list = sorted(os.listdir(folder))
    for j, filename in enumerate(test_list):
        img = image.load_img(os.path.join(folder, filename))
        img = img.resize((256, 256))
        img_array = image.img_to_array(img)
        # Add a batch dimension: (256, 256, 3) -> (1, 256, 256, 3)
        img_array = np.expand_dims(img_array, axis=0)
        output = saved_model.predict(img_array)
        result = np.argmax(output, axis=1)[0]
        y_pred.append(result)
        # Assumes each class folder contains exactly 10 images
        ground_truth_index = i * 10 + j
        # Print the prediction next to the ground truth
        print(f'Predict: {result}, Ground truth: {y_true[ground_truth_index]}')
        # Example output: Predict: 0, Ground truth: 0
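
The ground-truth list above is hard-coded for 10 images per class, so an extra or missing file in any folder would shift the labels. As a rough sketch of one alternative (not part of the original example), the labels could be derived from the folder contents instead. The helper name `collect_test_files`, the extension filter, and the demo directory are all assumptions made for illustration.

```python
import os
import tempfile

# Assumed set of image extensions; adjust to match your own dataset
IMAGE_EXTENSIONS = ('.jpg', '.jpeg', '.png')

def collect_test_files(test_path, class_names):
    """Return (file_paths, y_true); a folder's index in class_names is its label."""
    file_paths, y_true = [], []
    for label, name in enumerate(class_names):
        folder = os.path.join(test_path, name)
        for filename in sorted(os.listdir(folder)):
            # Skip anything that does not look like an image file
            if filename.lower().endswith(IMAGE_EXTENSIONS):
                file_paths.append(os.path.join(folder, filename))
                y_true.append(label)
    return file_paths, y_true

# Demo on a throwaway directory with empty placeholder files
with tempfile.TemporaryDirectory() as tmp:
    for name, count in [('black', 2), ('panda', 3)]:
        os.makedirs(os.path.join(tmp, name))
        for k in range(count):
            open(os.path.join(tmp, name, f'{k}.jpg'), 'w').close()
    # A stray non-image file that should be ignored
    open(os.path.join(tmp, 'black', 'notes.txt'), 'w').close()
    demo_paths, demo_labels = collect_test_files(tmp, ['black', 'panda'])

print(demo_labels)  # [0, 0, 1, 1, 1]
```

With the paths and labels collected this way, the inference loop could iterate over `file_paths` directly and would no longer need to compute a ground-truth index.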

💡 A problem you may run into:

If you hit the following error while running the code:

ImportError: Could not import PIL.Image. The use of load_img requires PIL.

Installing Pillow resolves it:

pip install Pillow

Evaluating the Test Results

In addition to Accuracy, Precision, and Recall, which were introduced earlier, today we'll use two more common evaluation metrics: F1-score and Cohen's Kappa.

F1-score

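F1-score is the harmonic mean of Precision and Recall. It ranges over [0, 1], and higher is better. When the classes in a dataset are imbalanced, relying on a single metric (for example, Accuracy alone) can skew the picture and lead to misreading the results; F1-score is a useful metric in that situation.

The F1-score formula:
$F1 = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}$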
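
To make the harmonic mean concrete, here is a small sketch in plain Python. The function name `f1_from_precision_recall` and the sample values are made up for illustration; in practice scikit-learn's `f1_score()` does this work.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall; defined here as 0.0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Balanced case: the harmonic mean equals the shared value
f1_balanced = f1_from_precision_recall(0.8, 0.8)
# Lopsided case: the arithmetic mean would be 0.55, but F1 stays low
f1_lopsided = f1_from_precision_recall(1.0, 0.1)

print(round(f1_balanced, 3))  # 0.8
print(round(f1_lopsided, 3))  # 0.182
```

The second case shows why the harmonic mean is used: a model cannot get a high F1-score by doing well on only one of the two metrics.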

Cohen's Kappa

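Cohen's Kappa is a metric commonly used to evaluate classification problems. It comes from statistics, where it measures inter-rater agreement while correcting for the agreement expected by chance. It ranges over [-1, 1], and higher values mean the two raters agree more closely. In model evaluation, the two "raters" are the ground-truth labels and the model's predictions.

The Cohen's Kappa formula, where $p_o$ is the observed agreement and $p_e$ is the agreement expected by chance:
$\kappa = \frac{p_o - p_e}{1 - p_e}$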
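
As a rough illustration of what the metric measures, the sketch below computes Cohen's Kappa by hand in plain Python. The observed agreement (`p_o`) is the share of samples where the two label lists match, and the chance agreement (`p_e`) is estimated from how often each class appears in each list. The function name `cohen_kappa` and the toy labels are assumptions for this example; the article itself uses scikit-learn's `cohen_kappa_score()`.

```python
from collections import Counter

def cohen_kappa(y_true, y_pred):
    """Cohen's Kappa for two equal-length label lists (undefined when p_e is 1)."""
    n = len(y_true)
    # Observed agreement: fraction of positions where the labels match
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Chance agreement: sum over classes of (true share) * (predicted share)
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Toy labels: 3 of 4 predictions match the ground truth
y_true_demo = [0, 0, 1, 1]
y_pred_demo = [0, 0, 1, 0]
kappa_demo = cohen_kappa(y_true_demo, y_pred_demo)

print(kappa_demo)  # 0.5
```

Here the accuracy is 0.75, but half of that agreement could be expected by chance, so the Kappa value is only 0.5.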

Implementation Example

Evaluate the inference results:

# Use the scikit-learn package
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, cohen_kappa_score

# Evaluation metrics
average_param = "macro"
accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred, average=average_param)
recall = recall_score(y_true, y_pred, average=average_param)
f1 = f1_score(y_true, y_pred, average=average_param)
kappa = cohen_kappa_score(y_true, y_pred)

# Print the results
print(f'Accuracy: {accuracy}')
print(f'Precision: {precision}')
print(f'Recall: {recall}')
print(f'F1-score: {f1}')
print(f'Cohen\'s Kappa: {kappa}')

# Print the metric values rounded to three decimal places
print('\n### Round ###')
print(f'Accuracy: {round(accuracy, 3)}')
print(f'Precision: {round(precision, 3)}')
print(f'Recall: {round(recall, 3)}')
print(f'F1-score: {round(f1, 3)}')
print(f'Cohen\'s Kappa: {round(kappa, 3)}')

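Explanation

accuracy_score() computes Accuracy, precision_score() computes Precision, recall_score() computes Recall, f1_score() computes the F1-score, and cohen_kappa_score() computes Cohen's Kappa. round() rounds each value: the first argument is the value to round, and the second is the number of decimal places to keep. Some of the metric functions take an average parameter. Because this example is a multi-class classification problem, it is set to macro, which computes the metric for each class separately and then takes the unweighted mean, so every class influences the result equally. For the other parameter settings, see the scikit-learn API documentation.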
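
To show what macro averaging does, here is a minimal plain-Python sketch that computes Precision per class and then takes the unweighted mean. The helper `per_class_precision` and the toy labels are hypothetical and only meant to mirror the idea behind `average="macro"`, not scikit-learn's actual implementation.

```python
def per_class_precision(y_true, y_pred, labels):
    """Precision of each class: correct predictions of c / all predictions of c."""
    scores = []
    for c in labels:
        predicted_c = sum(1 for p in y_pred if p == c)
        correct_c = sum(1 for t, p in zip(y_true, y_pred) if p == c and t == c)
        # A class that is never predicted gets 0.0 here
        scores.append(correct_c / predicted_c if predicted_c else 0.0)
    return scores

# Toy data: class 0 has 4 samples, class 1 has only 2
toy_true = [0, 0, 0, 0, 1, 1]
toy_pred = [0, 0, 0, 1, 1, 1]

scores = per_class_precision(toy_true, toy_pred, labels=[0, 1])
# Macro average: every class counts equally, regardless of its sample count
macro_precision = sum(scores) / len(scores)

print([round(s, 3) for s in scores])  # [1.0, 0.667]
print(round(macro_precision, 3))      # 0.833
```

Class 1 has half as many samples as class 0, yet it contributes exactly half of the macro score, which is what "every class influences the result equally" means.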

Output

Accuracy: 0.88
Precision: 0.8974825174825176
Recall: 0.8799999999999999
F1-score: 0.8795211814702999
Cohen's Kappa: 0.85

### Round ###
Accuracy: 0.88
Precision: 0.897
Recall: 0.88
F1-score: 0.88
Cohen's Kappa: 0.85
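
As a sanity check on these numbers, here is a short worked calculation. It assumes each of the 5 classes has exactly 10 test images, as described earlier. The symbols are notation chosen here for illustration: $p_o$ is the observed agreement (the Accuracy), $p_e$ is the agreement expected by chance, and $q_k$ is the share of predictions assigned to class $k$. Since every true class has a share of $1/5$ and the $q_k$ sum to 1, the chance agreement is 0.2 regardless of how the predictions are distributed:

$p_e = \sum_{k=1}^{5} \frac{1}{5} q_k = \frac{1}{5} = 0.2$

$\kappa = \frac{p_o - p_e}{1 - p_e} = \frac{0.88 - 0.2}{1 - 0.2} = 0.85$

This matches the Cohen's Kappa value printed above.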

Today we covered how to evaluate test results. Tomorrow we'll continue with today's example and look at how to use a confusion matrix.