DAY 16 - Preprocessing the MNIST Dataset

2023 iThome Ironman Contest (15th)

Self-Challenge Group

Python Data Analysis and Machine Learning series, part 16

cc0303

2023-10-01 11:36:15

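MNIST
MNIST (Modified National Institute of Standards and Technology) is a widely used computer vision dataset of handwritten digits, commonly used as a benchmark for image classification and other machine learning tasks. The training set contains 60,000 images and the test set contains 10,000 images. It is a popular practice dataset for machine learning beginners.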

Downloading and loading MNIST

from keras.datasets import mnist

# Load the dataset, which comes already split into a training set and a test set
(train_feature, train_label), (test_feature, test_label) = mnist.load_data()
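
Before any preprocessing, it can help to confirm what was actually loaded. The sketch below uses a hypothetical helper, `describe_split` (not part of Keras), that reports the shape, dtype, and value range of a feature/label pair. It runs on a small random stand-in array so that nothing has to be downloaded. With the real data you would pass `train_feature` and `train_label`, which should report 60000 images of 28×28 `uint8` pixels.

```python
import numpy as np

def describe_split(features, labels):
    """Return basic facts about one dataset split (hypothetical helper)."""
    return {
        "n_samples": features.shape[0],
        "image_shape": features.shape[1:],
        "dtype": str(features.dtype),
        "pixel_min": int(features.min()),
        "pixel_max": int(features.max()),
        "n_labels": labels.shape[0],
    }

# Stand-in data with the same layout as MNIST: uint8 images of 28x28 pixels.
rng = np.random.default_rng(0)
fake_feature = rng.integers(0, 256, size=(5, 28, 28), dtype=np.uint8)
fake_label = rng.integers(0, 10, size=5)

info = describe_split(fake_feature, fake_label)
print(info)
```

If `n_samples` and `n_labels` differ, or the pixel range is not within 0 to 255, the later reshape and normalization steps would be working on unexpected input.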

Displaying the first training sample

import matplotlib.pyplot as plt

def show_image(image):
    fig = plt.gcf()
    fig.set_size_inches(4, 4)          # figure size in inches
    plt.imshow(image, cmap='binary')   # grayscale display: low values white, high values black
    plt.show()

show_image(train_feature[0])

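Feature preprocessing
Each MNIST digit image is a 28×28 two-dimensional array. Before training, it is flattened into a one-dimensional vector of 784 floats, and those values are then normalized, which helps the model train more efficiently.
Image conversion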

train_feature_vector = train_feature.reshape(len(train_feature),784).astype('float32')
test_feature_vector = test_feature.reshape(len(test_feature),784).astype('float32')
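
To make the flattening step concrete, here is a minimal sketch on a tiny stand-in array (three blank 28×28 images, an assumption made so the example runs without the dataset). It shows that `reshape` lays the rows of each image end to end, and that passing `-1` lets NumPy infer the number of samples instead of using `len(...)`.

```python
import numpy as np

# Stand-in for train_feature: 3 blank "images" of 28x28 uint8 pixels.
images = np.zeros((3, 28, 28), dtype=np.uint8)
images[0, 0, 1] = 255  # row 0, column 1 of the first image

# Same call pattern as in the article.
flat = images.reshape(len(images), 784).astype('float32')
# Equivalent form: -1 lets NumPy infer the sample count.
flat_inferred = images.reshape(-1, 28 * 28).astype('float32')

print(flat.shape, flat.dtype)  # (3, 784) float32
print(flat[0, 1])              # pixel (row 0, col 1) lands at index 0 * 28 + 1
```

In general, the pixel at row `r`, column `c` ends up at index `r * 28 + c` of the flattened vector.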

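The values in train_feature_vector are floats between 0 and 255, so dividing by 255 yields floats between 0 and 1. This rescaling is called normalization, and it can improve the accuracy of the prediction model.
Image normalization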

train_feature_nor = train_feature_vector/255
test_feature_nor = test_feature_vector/255
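
The effect of dividing by 255 can be checked on a handful of values. The sketch below uses a small hand-made array as a stand-in for `train_feature_vector` (the specific pixel values are chosen only for illustration).

```python
import numpy as np

# Stand-in for train_feature_vector: float32 pixel values covering 0..255.
pixels = np.array([[0, 51, 102, 255]], dtype='float32')

# Dividing a float32 array by a Python int keeps the float32 dtype.
normalized = pixels / 255

print(normalized)
print(normalized.min(), normalized.max())
```

The smallest possible pixel value, 0, stays at 0 and the largest, 255, maps to 1, so every value falls in the range 0 to 1.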

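Label preprocessing
The labels are the digits 0 to 9. So that the model can treat them as ten separate classes rather than as ordered numbers, they can be converted to one-hot encoding (a vector in which exactly one element is 1 and all the others are 0).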

1. Display the label values

print(train_label[0:10])

[5 0 4 1 9 2 1 3 1 4]

2. Convert to one-hot encoding

# keras.src is an internal path; to_categorical is exposed publicly in keras.utils
from keras.utils import to_categorical
train_label_onehot = to_categorical(train_label)
test_label_onehot = to_categorical(test_label)

3. Display the one-hot encoded labels

print(train_label_onehot[0:10])

[[0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]  --->5
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]  --->0
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]  --->4
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]  --->1
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]  --->9
 [0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]  --->2
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]  --->1
 [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]  --->3
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]  --->1
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]] --->4
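
To see what one-hot encoding does without relying on Keras, the following is a minimal NumPy sketch. It is a stand-in for illustration, not how `to_categorical` is implemented. It encodes the ten labels printed above by indexing an identity matrix, then recovers them with `argmax`.

```python
import numpy as np

# The first ten training labels printed in the article.
labels = np.array([5, 0, 4, 1, 9, 2, 1, 3, 1, 4])
num_classes = 10

# Row i of the identity matrix is the one-hot vector for class i,
# so indexing it with the labels encodes all of them at once.
onehot = np.eye(num_classes, dtype='float32')[labels]

# argmax returns the position of the 1, which recovers the original label.
decoded = onehot.argmax(axis=1)

print(onehot[0])  # one-hot vector for the label 5
print(decoded)
```

The same `argmax` step is commonly used on a model's ten-element output to turn it back into a predicted digit.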

---20231001---