Day12 A Rookie's Training Course: Lining Everything Up Neatly and Reading More Books

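2022 iThome Ironman Contest

DAY 13

Self-Challenge Group

Let's Create an AI Character: A Beginner's Path of Exploration, part 13 of the series

14th Ironman Contest

Leonard Lin

Team: The Team Leader Asked Me to Come to Cambodia to Join the Ironman Contest

2022-09-17 22:00:29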

Today covers two topics: padding, and reading sentences from an external file.

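Sentences vary in length, but the neural network we will build later needs inputs that are all the same length. For that we can use "pad_sequences", which in plain terms fills the missing positions with 0. It has a few options: you can align the sequences to the left or to the right (padding), set your own maximum length (maxlen), and choose whether to cut from the head or from the tail when a sequence exceeds the maximum length (truncating):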

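from tensorflow.keras.preprocessing.sequence import pad_sequences

# `sequences` is assumed to be the list of token sequences built earlier in the series
# Pad (or truncate) every sequence to length 10, adding zeros at the end
padded = pad_sequences(sequences, padding='post', maxlen=10)

# Print the padded result
print(padded)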

The original sequences were:

[[4, 2, 1, 3], [4, 2, 1, 6], [5, 2, 1, 3], [7, 5, 8, 1, 3, 9, 10]]

After padding they turn into a neat block of tofu:

[[ 4  2  1  3  0  0  0  0  0  0]
 [ 4  2  1  6  0  0  0  0  0  0]
 [ 5  2  1  3  0  0  0  0  0  0]
 [ 7  5  8  1  3  9 10  0  0  0]]
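
To make the options easier to see, here is a small pure-Python sketch of what padding and truncating do. It is not the Keras implementation, only a hand-written illustration; the function name pad_sequences_sketch is made up for this post. As far as I know, the defaults it mirrors (padding on the left, truncating from the front, filling with 0) match those of pad_sequences.

```python
def pad_sequences_sketch(sequences, maxlen=None, padding="pre",
                         truncating="pre", value=0):
    """Hand-written illustration of padding; NOT the Keras implementation."""
    # With no maxlen given, use the length of the longest sequence
    if maxlen is None:
        maxlen = max(len(seq) for seq in sequences)

    padded = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > maxlen:
            if truncating == "pre":
                # Drop tokens from the front, keep the last maxlen tokens
                seq = seq[len(seq) - maxlen:]
            else:
                # Drop tokens from the back, keep the first maxlen tokens
                seq = seq[:maxlen]
        fill = [value] * (maxlen - len(seq))
        # 'pre' puts the zeros on the left, 'post' puts them on the right
        padded.append(fill + seq if padding == "pre" else seq + fill)
    return padded


sequences = [[4, 2, 1, 3], [4, 2, 1, 6], [5, 2, 1, 3], [7, 5, 8, 1, 3, 9, 10]]

pre_padded = pad_sequences_sketch(sequences, maxlen=5)
post_padded = pad_sequences_sketch(sequences, maxlen=5,
                                   padding="post", truncating="post")

print(pre_padded)
print(post_padded)
```

With maxlen=5, the short sequences gain one zero and the long one loses two tokens; which end is affected depends on the padding and truncating arguments.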

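I used to hear the advice that the more books you read, the more words you learn. The same goes for a bot, so we can let it read all kinds of written material. Thanks to the kind people who shared their data, here is an example in JSON format that contains many news headlines, so we can have the program read the sentences in those headlines: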

import json

# Load the JSON file
with open("./sarcasm.json", 'r') as f:
    datastore = json.load(f)
    
# Initialize lists
sentences = [] 

# Append elements in the dictionaries into each list
for item in datastore:
    sentences.append(item['headline'])
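
The loop above assumes the file is a JSON list of dictionaries, each with a 'headline' key. The sketch below shows that shape on a tiny made-up file so the loading step can be tried without the real data. The records, the file name sample_headlines.json, and the choice to skip records without a headline are all assumptions for illustration.

```python
import json
import os
import tempfile

# Hypothetical sample records; the real file is assumed to have the same shape
sample_records = [
    {"headline": "robot learns to read the morning news"},
    {"headline": "local reader finishes one more book"},
    {"note": "a record without a headline"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample_headlines.json")

    # Write the sample data to disk as a JSON list
    with open(sample_path, "w", encoding="utf-8") as f:
        json.dump(sample_records, f)

    # Load it back the same way as the real file
    with open(sample_path, "r", encoding="utf-8") as f:
        datastore = json.load(f)

# Skip records that have no 'headline' key instead of raising KeyError
sentences = [item["headline"] for item in datastore if "headline" in item]

print(f"loaded {len(datastore)} records, kept {len(sentences)} headlines")
```

Passing an explicit encoding to open() is a small safety habit here, since the platform default is not always UTF-8.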

Then we can build the word dictionary the same way as yesterday:

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Initialize the Tokenizer class
tokenizer = Tokenizer(oov_token="{OOV}")

# Generate the word index dictionary
tokenizer.fit_on_texts(sentences)

word_index = tokenizer.word_index
# Print the length of the word index
print(f'number of words in word_index: {len(word_index)}')
# Print the word index
print(f'word_index: {word_index}')

# Generate and pad the sequences
sequences = tokenizer.texts_to_sequences(sentences)
padded = pad_sequences(sequences, padding='post')

# Print dimensions of padded sequences
print(f'shape of padded sequences: {padded.shape}')

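Looking at word_index and padded, we can tally what was read: word_index holds 29657 words, there are 26709 sentences in total, and the longest sentence has 40 words.
If these were recent Taiwanese news headlines, the vocabulary would probably be much smaller, since most of them just reuse the same sentence patterns. Is that too rude to say?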
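
To show where the three numbers above come from, here is a simplified pure-Python sketch on three made-up headlines. It is not the Keras Tokenizer: the helper names are hypothetical, and the cleaning rule (lowercase, punctuation replaced by spaces, split on whitespace) is only an approximation. The vocabulary size is the length of the word index, the sentence count is the number of rows, and the longest sentence is the number of columns after padding.

```python
import string
from collections import Counter

# Map every punctuation character to a space (a simplification)
PUNCT_TO_SPACE = str.maketrans({ch: " " for ch in string.punctuation})


def split_words(sentence):
    """Lowercase, replace punctuation with spaces, split on whitespace."""
    return sentence.lower().translate(PUNCT_TO_SPACE).split()


def build_word_index(sentences, oov_token="{OOV}"):
    """Give index 1 to the OOV token, then number words by frequency."""
    counts = Counter()
    for sentence in sentences:
        counts.update(split_words(sentence))
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index


# Hypothetical sample headlines, not taken from the real dataset
sentences = [
    "Local man reads a book",
    "Robot reads the news",
    "A robot reads a very long book about the news!",
]

word_index = build_word_index(sentences)

# Unknown words would map to 1, the OOV index
sequences = [[word_index.get(w, 1) for w in split_words(s)] for s in sentences]

# Post-pad with zeros up to the longest sequence
longest = max(len(seq) for seq in sequences)
padded = [seq + [0] * (longest - len(seq)) for seq in sequences]

print(f"number of words in word_index: {len(word_index)}")
print(f"shape of padded sequences: ({len(padded)}, {longest})")
```

Note that the word count includes the OOV token itself, so the number of distinct real words is one less than the length of word_index.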