Continuing with yesterday's example, let's look at the loss and accuracy over the course of training. The training accuracy has reached 1 while the test accuracy has flattened out, and the training loss approaches 0 while the test loss keeps rising, which is the familiar signature of overfitting:
https://ithelp.ithome.com.tw/upload/images/20220919/20141158gxj9YRlqkp.png
https://ithelp.ithome.com.tw/upload/images/20220919/20141158d7lLaQnb6h.png
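
Curves like these can be reproduced from the History object returned by model.fit; a minimal sketch, assuming the model was trained with validation_data so the val_* metrics exist:

import matplotlib.pyplot as plt

# Plot a training metric against its validation counterpart,
# e.g. plot_graphs(history, 'accuracy') and plot_graphs(history, 'loss')
def plot_graphs(history, metric):
    plt.plot(history.history[metric])
    plt.plot(history.history['val_' + metric])
    plt.xlabel('epochs')
    plt.legend([metric, 'val_' + metric])
    plt.show()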

Going back to the previous day's example: if you recall, there were 29,657 distinct words and 26,709 sentences in total, with the longest sentence at 40 words. We start by capping the vocabulary at 10,000 words and each sentence at 32 words, and we use 20,000 sentences for training with the remainder held out for testing.
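
For completeness, here is a minimal sketch of how sentences and labels can be rebuilt; the file name sarcasm.json and its field names follow the News Headlines Sarcasm dataset used in the previous day's example and are assumptions here:

import json

# Load the dataset from the previous day's example; the path is an assumption
with open('sarcasm.json', 'r') as f:
    datastore = json.load(f)

# Collect the headlines and their sarcasm labels
sentences = [item['headline'] for item in datastore]
labels = [item['is_sarcastic'] for item in datastore]

With sentences and labels in hand, the split and the hyperparameters are: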

# Number of examples to use for training
training_size = 20000

# Vocabulary size of the tokenizer
vocab_size = 10000

# Maximum length of the padded sequences
max_length = 32

# Output dimensions of the Embedding layer
embedding_dim = 16

# Split the sentences
training_sentences = sentences[0:training_size]
testing_sentences = sentences[training_size:]

# Split the labels
training_labels = labels[0:training_size]
testing_labels = labels[training_size:]

import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Parameters for padding and OOV tokens
trunc_type='post'
padding_type='post'
oov_tok = "<OOV>"

# Initialize the Tokenizer class
tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)

# Generate the word index dictionary
tokenizer.fit_on_texts(training_sentences)
word_index = tokenizer.word_index

# Generate and pad the training sequences
training_sequences = tokenizer.texts_to_sequences(training_sentences)
training_padded = pad_sequences(training_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)

# Generate and pad the testing sequences
testing_sequences = tokenizer.texts_to_sequences(testing_sentences)
testing_padded = pad_sequences(testing_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)

# Convert the labels lists into numpy arrays
training_labels = np.array(training_labels)
testing_labels = np.array(testing_labels)

import tensorflow as tf

# Build the model
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(24, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
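
The compile and fit step isn't shown at this point in the post; a minimal sketch, assuming binary cross-entropy with the Adam optimizer and 30 epochs (the exact optimizer and epoch count are assumptions):

# Binary cross-entropy matches the single sigmoid output unit
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Train and keep the History object so the curves below can be plotted;
# the epoch count is an assumption, not the article's exact value
num_epochs = 30
history = model.fit(training_padded, training_labels,
                    epochs=num_epochs,
                    validation_data=(testing_padded, testing_labels))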

The loss and accuracy here show the same symptoms, and the test accuracy even declines:
https://ithelp.ithome.com.tw/upload/images/20220919/20141158ZGVmngXaPm.png
https://ithelp.ithome.com.tw/upload/images/20220919/201411586LAmoYD7FM.png

But if we shrink the vocabulary to 1,000 words and cap sentences at 16 words, the test loss stays flat:
https://ithelp.ithome.com.tw/upload/images/20220919/20141158XRjxuileWv.png
https://ithelp.ithome.com.tw/upload/images/20220919/20141158PJvFnRhAAT.png
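
To reproduce this run, only the two hyperparameters change; everything from the tokenizer onwards is then re-executed (a sketch of the changed values, with the rest of the pipeline identical):

# Shrink the vocabulary and the padded sequence length, then retokenize,
# re-pad, rebuild the model, and retrain
vocab_size = 1000
max_length = 16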

So can we hypothesize that too large a vocabulary actually hurts the model's judgment on the test set?
Going back to yesterday's example with the same settings and the same model, we see a similar shift, but both the loss and the accuracy end up at worse levels, so it takes several rounds of experimenting to tune this properly.
https://ithelp.ithome.com.tw/upload/images/20220919/201411586tyN48kDrK.png
https://ithelp.ithome.com.tw/upload/images/20220919/20141158icl36mRAv5.png

