【Day 6】Image Recognition -- Splitting into Training and Test Sets and Converting Types

2022 iThome Ironman Contest

DAY 6

AI & Data

Getting to Know AI a Little Better series, Part 6

14th Ironman Contest

mingchang

2022-09-06 11:18:29

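Last time, we resized all the photos to the same dimensions, assigned each one its label, and stored the images and labels in two separate lists. Next, we split those photos and labels into a training set, used to train the model, and a test set, used to evaluate the result. Without this split, the only photos available for testing after training would be the ones the model was trained on, and such a test is meaningless because it says nothing about how the model handles photos it has never seen. That is why we separate the two sets before training.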

Splitting into training and test sets

from sklearn.model_selection import train_test_split

# split picture into train set and test set
train_feature, test_feature, train_label, test_label = train_test_split(images, labels, test_size = 0.2, random_state = 42)

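We use scikit-learn's train_test_split function to split the dataset into a training set and a test set. The first argument is all the photos in the dataset, the second is the corresponding labels, the third (test_size) is the fraction of the original dataset assigned to the test set, and the last (random_state) is the random seed, which makes the split come out the same every time you run this line.
Once this line has run, the training set and test set are separated. Next we convert all the data in both sets to numpy arrays, which is the form we will use to train the model.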
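To see what test_size and random_state do in practice, here is a minimal sketch on made-up data. The names toy_images and toy_labels are hypothetical stand-ins for the images and labels lists, not part of the original project.

```python
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 10 tiny "images" and 10 labels (0 or 1).
toy_images = [[i] for i in range(10)]
toy_labels = [i % 2 for i in range(10)]

# Run the same split twice with the same seed.
split_a = train_test_split(toy_images, toy_labels, test_size=0.2, random_state=42)
split_b = train_test_split(toy_images, toy_labels, test_size=0.2, random_state=42)

train_x, test_x, train_y, test_y = split_a

# With test_size=0.2, 2 of the 10 samples go to the test set.
print(len(train_x), len(test_x))

# The same random_state gives the identical split on both runs.
print(split_a == split_b)
```

Changing random_state to another number will generally give a different (but still 80/20) split, which is why fixing it is handy when you want to compare results between runs.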

Converting types

import numpy as np

# convert from list into matrix
train_feature = np.array(train_feature)
test_feature = np.array(test_feature)
train_label = np.array(train_label)
test_label = np.array(test_label)
print("The shape of train_feature: ", train_feature.shape, ", the shape of train_label: ", train_label.shape)
print("The shape of test_feature: ", test_feature.shape, ", the shape of test_label: ", test_label.shape)

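As the code above shows, all we have to do is convert each list to a numpy array and assign it back to the same variable. Here we also print the shapes of the images and labels: the training and test image arrays have shapes (19955, 40, 40, 3) and (4989, 40, 40, 3). If your first number differs from mine, don't worry. The first dimension is the number of samples, and as long as the ratio of training set to test set is roughly four to one, the result is correct; it means train_test_split did its job when we split the data into training and test sets. The second and third dimensions are the image size, which is 40 * 40 here because we resized every image to 40 * 40 during the initial preprocessing. The last dimension, 3, is the number of color channels: our dataset consists of color images, so each pixel's color is made up of three values, r, g, and b.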
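The shape numbers above can be checked with a little arithmetic. The sketch below uses the two counts from this article; the small batch array is a hypothetical stand-in with the same layout as train_feature, and the remark about rounding the test set up is my understanding of how the split size is computed rather than something shown in the output above.

```python
import math
import numpy as np

# Sample counts taken from the shapes printed in this article.
n_train, n_test = 19955, 4989
n_total = n_train + n_test
print(n_total)

# 20% of the dataset, rounded up, matches the test set size.
print(math.ceil(0.2 * n_total))

# The train/test ratio is very close to four to one.
print(round(n_train / n_test, 2))

# Hypothetical small batch with the same layout as train_feature:
# (number of images, height, width, color channels).
batch = np.zeros((5, 40, 40, 3), dtype=np.uint8)
count, height, width, channels = batch.shape
print(count, height, width, channels)
```

Because the total is not an exact multiple of five, the ratio is only approximately four to one, which is why a small deviation in your own numbers is nothing to worry about.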

Saving the training and test sets

import os
import numpy as np

# store the features and labels into a folder
imagesavepath = "Cat_Dog_Dataset/"
if not os.path.exists(imagesavepath):
    os.makedirs(imagesavepath)
np.save(imagesavepath + "train_feature.npy", train_feature)
np.save(imagesavepath + "test_feature.npy", test_feature)
np.save(imagesavepath + "train_label.npy", train_label)
np.save(imagesavepath + "test_label.npy", test_label)

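Finally, we create a folder in the current working directory and save the split and converted training images, training labels, test images, and test labels into it. We save them as .npy files, a file format specific to Python's numpy that is designed for storing numpy arrays such as our image data and labels.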
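When you need the data again, np.load reads a .npy file back into a numpy array with the same shape and dtype. The following is a minimal round-trip sketch; it uses a temporary folder and small made-up arrays (features, labels) instead of the real Cat_Dog_Dataset files, so those names are assumptions for illustration only.

```python
import os
import tempfile
import numpy as np

# Hypothetical stand-ins for train_feature and train_label.
features = np.zeros((4, 40, 40, 3), dtype=np.uint8)
labels = np.array([0, 1, 0, 1])

# Save into a temporary folder, then load the files back.
with tempfile.TemporaryDirectory() as save_dir:
    feature_path = os.path.join(save_dir, "train_feature.npy")
    label_path = os.path.join(save_dir, "train_label.npy")
    np.save(feature_path, features)
    np.save(label_path, labels)

    loaded_features = np.load(feature_path)
    loaded_labels = np.load(label_path)

# Shape, dtype, and contents survive the round trip.
print(loaded_features.shape, loaded_features.dtype)
print(np.array_equal(loaded_features, features))
print(np.array_equal(loaded_labels, labels))
```

With the real files, the same call would look like np.load("Cat_Dog_Dataset/train_feature.npy"), assuming the folder was created as shown above.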