Today let's play a different game: word chain (text continuation). The rules here are a bit different from the usual chaining, though. The model has to keep a sentence going based on the words that came before it. Let's see how to make that happen.
First we'll use some fairly poetic lines. Load the file, then build the word index dictionary the same way as before:
# Imports needed for this and the later snippets
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
# Load the dataset
data = open('./irish-lyrics-eof.txt').read()
# Lowercase and split the text
corpus = data.lower().split("\n")
# Initialize the Tokenizer class
tokenizer = Tokenizer()
# Generate the word index dictionary
tokenizer.fit_on_texts(corpus)
# Define the total words. You add 1 for the index `0` which is just the padding token.
total_words = len(tokenizer.word_index) + 1
Next, after converting each sentence into its token IDs, we split it into subphrases, starting from the first two words and growing one word at a time until we reach the full sentence. Taking the Chinese sentence "我是一隻小小小小鳥" ("I am a tiny little bird") as an example, splitting this way lets each word act as the label, while the words in front of it form the corresponding input data:
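Roughly, the split looks like this. Here is a tiny sketch of my own that treats each character as a word and skips the real tokenizer, purely to illustrate the idea:
# Toy illustration of the subphrase splitting (character-level, illustrative only)
example = list("我是一隻小小小小鳥")
for i in range(1, len(example)):
    subphrase = example[:i + 1]
    data_part, label = subphrase[:-1], subphrase[-1]
    print("data:", "".join(data_part), "-> label:", label)
# data: 我 -> label: 是
# data: 我是 -> label: 一
# ... and so on, up to the full sentence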
The code is shown below. After that, the token ID of each label is converted into a one-hot array the size of the vocabulary, where the value at that word's index is 1 and everything else is 0:
# Initialize the sequences list
input_sequences = []
# Loop over every line
for line in corpus:
    # Tokenize the current line
    token_list = tokenizer.texts_to_sequences([line])[0]
    # Loop over the line several times to generate the subphrases
    for i in range(1, len(token_list)):
        # Generate the subphrase
        n_gram_sequence = token_list[:i+1]
        # Append the subphrase to the sequences list
        input_sequences.append(n_gram_sequence)
# Get the length of the longest line
max_sequence_len = max([len(x) for x in input_sequences])
# Pad all sequences
input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))
# Create inputs and labels by splitting off the last token of each subphrase
xs, labels = input_sequences[:,:-1], input_sequences[:,-1]
# Convert the label into one-hot arrays
ys = tf.keras.utils.to_categorical(labels, num_classes=total_words)
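As a quick sanity check (my own addition, not part of the original walkthrough), you can print the resulting shapes; every input row has max_sequence_len - 1 tokens and every label is a one-hot vector of length total_words:
# Check the shapes of the inputs and one-hot labels (illustrative addition)
print(xs.shape)  # (number of subphrases, max_sequence_len - 1)
print(ys.shape)  # (number of subphrases, total_words)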
Next we build the model. Since context matters here as well, we use an LSTM-based version, and the output is set up just like the multi-class configuration we used for image classification:
# Hyperparameters
embedding_dim = 100
lstm_units = 150
learning_rate = 0.01
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense
# Build the model
model = Sequential([
    Embedding(total_words, embedding_dim, input_length=max_sequence_len-1),
    Bidirectional(LSTM(lstm_units)),
    Dense(total_words, activation='softmax')
])
# Use categorical crossentropy because this is a multi-class problem
model.compile(
    loss='categorical_crossentropy',
    optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
    metrics=['accuracy']
)
epochs = 100
# Train the model
history = model.fit(xs, ys, epochs=epochs)
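To see how training went, a small matplotlib sketch like the one below can plot the accuracy recorded in history (this plotting snippet is my own addition, not part of the original post):
import matplotlib.pyplot as plt
# Plot the training accuracy stored by model.fit (illustrative addition)
plt.plot(history.history['accuracy'])
plt.xlabel('epoch')
plt.ylabel('accuracy')
plt.show()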
Once the model is trained, let's first look at its ability to compose sentences. We start it off with a seed phrase, use that fragment to predict the next word, then append the predicted word back onto the fragment, and repeat for however long a sentence we want; it will just keep going:
# Define seed text
seed_text = "good morning"
# Define total words to predict
next_words = 20
# Loop until desired length is reached
for _ in range(next_words):
    # Convert the seed text to a token sequence
    token_list = tokenizer.texts_to_sequences([seed_text])[0]
    # Pad the sequence
    token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
    # Feed to the model and get the probabilities for each index
    probabilities = model.predict(token_list)
    # Get the index with the highest probability
    predicted = np.argmax(probabilities, axis=-1)[0]
    # Ignore if index is 0 because that is just the padding.
    if predicted != 0:
        # Look up the word associated with the index.
        output_word = tokenizer.index_word[predicted]
        # Combine with the seed text
        seed_text += " " + output_word
# Print the result
print(seed_text)
以"good morning”為例,造出20個字的句子是:
good morning of the day before the last dim weeping and the song they sang love love love love me he love
Surprisingly, it actually turns out kind of interesting!