[27] Experiment: Handling Imbalanced Data with Undersampling

2021 iThome Ironman Contest

DAY 27

AI & Data

Part 27 of the series "30 Image Classification Training Experiments on Colab in 30 Days"

13th Ironman Contest tensorflow colab

Capillary J

2021-10-11 14:32:51


Colab link

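An imbalanced dataset is one in which some labels are extremely rare. In that situation, accuracy alone is a biased metric, and it makes it easy for the model to be lazy: in a binary classification problem where one class appears only 1% of the time, always answering the other class already yields 99% accuracy.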
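The "lazy model" effect is easy to reproduce. The following is a minimal sketch, assuming a hypothetical binary dataset with 1% positives; the labels are synthetic and are not part of the MNIST experiment.

```python
import numpy as np

# Synthetic labels for illustration only: 990 negatives, 10 positives.
y_true = np.array([0] * 990 + [1] * 10)

# A "lazy" model that always answers the majority class.
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()
# Recall on the minority class: the fraction of true positives that were found.
minority_recall = (y_pred[y_true == 1] == 1).mean()

print(f"accuracy={accuracy:.2f}, minority recall={minority_recall:.2f}")
# prints: accuracy=0.99, minority recall=0.00
```

Accuracy looks excellent even though not a single minority sample is recognized, which is why the rest of this post evaluates the minority classes separately.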

Today we use MNIST to try undersampling on this problem: reducing the number of majority-class samples in order to improve accuracy on the minority classes.

Because we deliberately shrink some MNIST classes, this time we do not use the dataset provided by tfds; instead we download the original MNIST files ourselves.

!wget http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
!wget http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
!wget http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
!wget http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz

Once the downloads finish, decompress the files and parse them into arrays.

import gzip
import numpy as np

image_size = 28

# Training set: 60,000 samples.
num_images = 60000
with gzip.open('train-labels-idx1-ubyte.gz') as bytestream:
    bytestream.read(8)  # skip the 8-byte header (magic number, item count)
    buf = bytestream.read(1*num_images)  # one byte per label
    train_labels = np.frombuffer(buf, dtype=np.uint8).astype(np.int64)

with gzip.open('train-images-idx3-ubyte.gz') as bytestream:
    bytestream.read(16)  # skip the 16-byte header (magic number, count, rows, cols)
    buf = bytestream.read(image_size*image_size*num_images)  # one byte per pixel
    train_images = np.frombuffer(buf, dtype=np.uint8).astype(np.float32)
    train_images = train_images.reshape(num_images, image_size, image_size, 1)

# Test set: 10,000 samples, same file layout.
num_images = 10000
with gzip.open('t10k-labels-idx1-ubyte.gz') as bytestream:
    bytestream.read(8)
    buf = bytestream.read(1*num_images)
    test_labels = np.frombuffer(buf, dtype=np.uint8).astype(np.int64)

with gzip.open('t10k-images-idx3-ubyte.gz') as bytestream:
    bytestream.read(16)
    buf = bytestream.read(image_size*image_size*num_images)
    test_images = np.frombuffer(buf, dtype=np.uint8).astype(np.float32)
    test_images = test_images.reshape(num_images, image_size, image_size, 1)
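The fixed 8- and 16-byte skips above correspond to the IDX file header. As a hedged sketch, the header can also be parsed rather than skipped, so the sample count and image size come from the file itself. The helper name `read_idx_images` is hypothetical, and the demo uses a tiny in-memory file instead of the downloaded MNIST archives.

```python
import gzip
import io
import struct

import numpy as np


def read_idx_images(fileobj):
    """Read an IDX3 image stream and return an array of shape (n, rows, cols)."""
    # The header holds four big-endian 32-bit integers.
    magic, n, rows, cols = struct.unpack(">IIII", fileobj.read(16))
    if magic != 2051:  # 0x00000803: unsigned-byte data with 3 dimensions
        raise ValueError(f"unexpected magic number: {magic}")
    data = np.frombuffer(fileobj.read(n * rows * cols), dtype=np.uint8)
    return data.reshape(n, rows, cols)


# Tiny in-memory file (2 images of 2x2 pixels) so the sketch runs without downloads.
raw = struct.pack(">IIII", 2051, 2, 2, 2) + bytes(range(8))
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as f:
    f.write(raw)
buf.seek(0)

with gzip.open(buf) as bytestream:
    demo_images = read_idx_images(bytestream)
print(demo_images.shape)  # (2, 2, 2)
```

Checking the magic number guards against accidentally feeding a label file (or a failed download) into the image parser.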

For this experiment I chose the digits 6, 8, and 9 as the minority classes, because I feel their shapes are partly similar.

First, sort the dataset by label.

idx = np.argsort(train_labels)
train_labels_sorted = train_labels[idx]
train_images_sorted = train_images[idx]

idx = np.argsort(test_labels)
test_labels_sorted = test_labels[idx]
test_images_sorted = test_images[idx]

Check how many samples each class has:

unique, counts = np.unique(train_labels_sorted, return_counts=True)
dict(zip(unique, counts))

Output:

{0: 5923,
 1: 6742,
 2: 5958,
 3: 6131,
 4: 5842,
 5: 5421,
 6: 5918,
 7: 6265,
 8: 5851,
 9: 5949}

Next, keep only 100 samples each for 6, 8, and 9.

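from sklearn.utils import shuffle  # assumption: the post does not show where shuffle() comes from

# In the sorted arrays, class k starts at index sum(counts[:k]).
# Keep all of 0-5 plus 100 of 6, all of 7 plus 100 of 8, and 100 of 9.
idx_we_want = list(range(sum(counts[:6])+100)) + list(range(sum(counts[:7]) ,sum(counts[:8])+100)) + list(range(sum(counts[:9]) ,sum(counts[:9])+100))
train_label_imbalanced = train_labels_sorted[idx_we_want]
train_images_imbalanced = train_images_sorted[idx_we_want]

train_images_imbalanced, train_label_imbalanced = shuffle(train_images_imbalanced, train_label_imbalanced)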

After the selection, check the per-class counts again:

{0: 5923,
 1: 6742,
 2: 5958,
 3: 6131,
 4: 5842,
 5: 5421,
 6: 100,
 7: 6265,
 8: 100,
 9: 100}
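The offset arithmetic used to build the imbalanced set depends on the arrays being sorted by label. An alternative sketch that selects indices class by class, and therefore also works on unsorted labels, could look like this. The helper `cap_classes` is hypothetical, and the labels are synthetic stand-ins for `train_labels`.

```python
import numpy as np


def cap_classes(labels, capped, n_keep):
    """Return indices keeping every sample, but at most n_keep per class listed in `capped`."""
    keep = []
    for cls in np.unique(labels):
        cls_idx = np.flatnonzero(labels == cls)
        if int(cls) in capped:
            cls_idx = cls_idx[:n_keep]  # take the first n_keep samples of this class
        keep.append(cls_idx)
    return np.concatenate(keep)


# Synthetic labels for illustration: 50 samples of each digit 0-9, in mixed order.
rng = np.random.default_rng(0)
demo_labels = rng.permutation(np.repeat(np.arange(10), 50))

idx = cap_classes(demo_labels, capped={6, 8, 9}, n_keep=5)
classes, kept = np.unique(demo_labels[idx], return_counts=True)
demo_counts = {int(c): int(k) for c, k in zip(classes, kept)}
print(demo_counts)
```

With the real data, the same call would use `capped={6, 8, 9}` and `n_keep=100`, and the resulting indices would be applied to both the image and label arrays.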

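That covers the training set. Since 6, 8, and 9 now have few samples while the other classes remain large, we want a clearer view of how accurate the model is on these three digits. So we extract only these three classes from the test set and use them as the validation dataset during training.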

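# Recompute the counts on the test set; the training counts would index past the 10,000 test samples.
unique, test_counts = np.unique(test_labels_sorted, return_counts=True)
idx_we_want = list(range(sum(test_counts[:6]),sum(test_counts[:6])+test_counts[6])) + list(range(sum(test_counts[:8]),sum(test_counts[:8])+test_counts[8])) + list(range(sum(test_counts[:9]),sum(test_counts[:9])+test_counts[9]))
test_label_689 = test_labels_sorted[idx_we_want]
test_images_689 = test_images_sorted[idx_we_want]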

Class distribution of the test subset:

{6: 958, 8: 974, 9: 1009}
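Extracting a subset of classes does not require sorted arrays or offset sums. A minimal sketch using a boolean mask from `np.isin` is shown below; the arrays are synthetic stand-ins with MNIST-like shapes, not the real test set.

```python
import numpy as np

# Synthetic stand-ins for test_labels / test_images (shapes mimic MNIST).
rng = np.random.default_rng(0)
demo_test_labels = rng.integers(0, 10, size=200)
demo_test_images = rng.random((200, 28, 28, 1), dtype=np.float32)

# True wherever the label is one of the three minority digits.
mask = np.isin(demo_test_labels, [6, 8, 9])
demo_labels_689 = demo_test_labels[mask]
demo_images_689 = demo_test_images[mask]

print(np.unique(demo_labels_689), demo_images_689.shape)
```

The mask keeps the samples in their original order, so images and labels stay aligned as long as both are indexed with the same mask.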

With the data prepared, let's see what problems arise when training a model under this imbalance.

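import tensorflow as tf

# LR, EPOCHS, ds_train_im, and ds_test are not defined in this post. Presumably ds_train_im is the
# dataset built from the imbalanced training arrays and ds_test the 6/8/9 validation subset.
model = tf.keras.Sequential()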
model.add(tf.keras.layers.Conv2D(32, [3, 3], activation='relu', input_shape=(28,28,1)))
model.add(tf.keras.layers.Conv2D(64, [3, 3], activation='relu'))
model.add(tf.keras.layers.MaxPooling2D(pool_size=(2, 2)))
model.add(tf.keras.layers.Dropout(0.25))
model.add(tf.keras.layers.Flatten())
model.add(tf.keras.layers.Dense(128, activation='relu'))
model.add(tf.keras.layers.Dropout(0.5))
model.add(tf.keras.layers.Dense(10))

model.compile(
    optimizer=tf.keras.optimizers.SGD(LR),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy()],
)

history = model.fit(
    ds_train_im,
    epochs=EPOCHS,
    validation_data=ds_test,
)

Output:

Epoch 24/30
loss: 0.0195 - sparse_categorical_accuracy: 0.9932 - val_loss: 0.8394 - val_sparse_categorical_accuracy: 0.8089

Accuracy on the 6/8/9 test subset is only about 80%, far below the 99% training accuracy.
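A single accuracy figure does not show which of the three digits suffers most. As a hedged sketch, per-class accuracy can be computed from the model's logits; the helper `per_class_accuracy` is hypothetical and the logits below are hand-made for illustration, not results from this experiment.

```python
import numpy as np


def per_class_accuracy(logits, labels):
    """Map each class present in `labels` to the fraction of its samples predicted correctly."""
    preds = np.argmax(logits, axis=1)
    return {int(c): float((preds[labels == c] == c).mean()) for c in np.unique(labels)}


# Hand-made logits for illustration (rows = samples, columns = classes 0-9).
demo_labels = np.array([6, 6, 8, 9])
demo_logits = np.zeros((4, 10))
demo_logits[0, 6] = 1.0  # a 6 predicted as 6
demo_logits[1, 0] = 1.0  # a 6 predicted as 0
demo_logits[2, 8] = 1.0  # an 8 predicted as 8
demo_logits[3, 4] = 1.0  # a 9 predicted as 4

demo_acc = per_class_accuracy(demo_logits, demo_labels)
print(demo_acc)  # {6: 0.5, 8: 1.0, 9: 0.0}
```

With the trained model, `model.predict(test_images_689)` would supply the logits, since the final `Dense(10)` layer has no activation; this assumes the images are preprocessed the same way as during training.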

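Next, we take 100 samples of every class in the training set, drastically cutting the classes other than 6, 8, and 9 down to the same 100 samples, and train again.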

{0: 100,
 1: 100,
 2: 100,
 3: 100,
 4: 100,
 5: 100,
 6: 100,
 7: 100,
 8: 100,
 9: 100}
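The post does not show how the 100 samples per class were picked. One common option is random undersampling, sketched below under that assumption. The helper `undersample` is hypothetical, and the labels are a synthetic imbalanced set rather than the real MNIST arrays.

```python
import numpy as np


def undersample(labels, n_per_class, seed=0):
    """Return shuffled indices with at most n_per_class randomly chosen samples per class."""
    rng = np.random.default_rng(seed)
    keep = []
    for cls in np.unique(labels):
        cls_idx = np.flatnonzero(labels == cls)
        n = min(n_per_class, cls_idx.size)
        # Sample without replacement so no image is duplicated.
        keep.append(rng.choice(cls_idx, size=n, replace=False))
    return rng.permutation(np.concatenate(keep))


# Synthetic imbalanced labels: 500 of each digit, except 100 each for 6, 8, and 9.
sizes = [100 if d in (6, 8, 9) else 500 for d in range(10)]
demo_labels = np.repeat(np.arange(10), sizes)

idx = undersample(demo_labels, n_per_class=100)
classes, kept = np.unique(demo_labels[idx], return_counts=True)
demo_counts = {int(c): int(k) for c, k in zip(classes, kept)}
print(demo_counts)
```

Applying the returned indices to both the image and label arrays yields a balanced, already shuffled training set. Sampling at random rather than taking the first 100 of each class avoids depending on the order of the source files.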

model = tf.keras.Sequential()
model.add(tf.keras.layers.Conv2D(32, [3, 3], activation='relu', input_shape=(28,28,1)))
model.add(tf.keras.layers.Conv2D(64, [3, 3], activation='relu'))
model.add(tf.keras.layers.MaxPooling2D(pool_size=(2, 2)))
model.add(tf.keras.layers.Dropout(0.25))
model.add(tf.keras.layers.Flatten())
model.add(tf.keras.layers.Dense(128, activation='relu'))
model.add(tf.keras.layers.Dropout(0.5))
model.add(tf.keras.layers.Dense(10))

model.compile(
    optimizer=tf.keras.optimizers.SGD(LR),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy()],
)

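# Note: ds_train_im is presumably rebuilt here from the balanced 100-per-class subset (not shown in the post).
history = model.fit(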
    ds_train_im,
    epochs=EPOCHS,
    validation_data=ds_test,
)

Output:

Epoch 27/30
loss: 0.1910 - sparse_categorical_accuracy: 0.9370 - val_loss: 0.2793 - val_sparse_categorical_accuracy: 0.9300

Accuracy on the 6/8/9 subset rises to 93%, so applying undersampling in this MNIST experiment does have an effect.
