NLP - Taiwanese Romanization: Word Embeddings Using Keras

12th iThome Ironman Contest

AI & Data

Machine Learning series, part 31

12th Ironman Contest, nlp, word embedding, written Taiwanese

tjabi

2020-11-06 20:58:53


In word embeddings, each word is represented by an n-dimensional dense vector; words with similar meanings have similar vectors.

To build word embeddings, we can use Embedding() from the Keras library.

embedding_layer = Embedding(200, 32, input_length=50)

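First parameter: the vocabulary size, i.e. the number of unique words in the text.
Second parameter: the dimension of each word vector.
Third parameter (input_length): the length of each input sentence.
For each input sentence, Embedding() produces a 2D array: each row corresponds to a word in the sentence and each column to one dimension of its vector.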
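
Conceptually, the layer behaves like a trainable lookup table. The following is a minimal pure-Python sketch of that idea, not the Keras implementation: the helpers `make_table` and `embed` are hypothetical names, and the random values only stand in for weights that Keras would learn during training.

```python
import random

def make_table(vocab_size, dim, seed=0):
    """Build a vocab_size x dim table of small random values (stand-in for trainable weights)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)] for _ in range(vocab_size)]

def embed(table, word_indices):
    """Look up one row of the table for every word index in the sentence."""
    return [table[i] for i in word_indices]

# Same sizes as Embedding(200, 32, input_length=50)
table = make_table(vocab_size=200, dim=32)
rng = random.Random(1)
sentence = [rng.randrange(200) for _ in range(50)]  # 50 word indices
vectors = embed(table, sentence)
print(len(vectors), len(vectors[0]))  # 50 32
```

One sentence of 50 indices therefore becomes a 50 x 32 array: one row per word position, one column per vector dimension.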

Embedding() can either learn custom word embeddings or load pretrained ones. Here we learn custom word embeddings.

First, import the required libraries:

from numpy import array
from keras.preprocessing.text import one_hot
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Embedding  # keras.layers.embeddings is not available in newer Keras versions

We will use this dataset:

corpus = [
    # Positive Reviews

    'tse sī tsi̍t tshut tshut-sik ê tiān-iánn',   # This is an excellent movie
    'tse tiān-iánn tsiok tsán guá kah-ì',        # The movie was fantastic, I like it
    
    # Negative Reviews
    'khióng-pòo ê tshut-ián',    # horrible acting
    'guá bô kah-ì tse tiān-iánn',  # I did not like the movie
    'tse tiān-iánn si̍t-tsāi-sī khióng-pòo', # The movie was horrible
   
]

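Remove the '-' from the words. If the hyphens are kept, the tokenizer behind one_hot() treats them as separators and splits each hyphenated word into two.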

import re
from nltk.tokenize import word_tokenize
# remove -
def remove_re(corpus):
    results = []
    for text in corpus:
        text = re.sub(r'-', "", text)
        results.append(text)
    return results
corpus = remove_re(corpus)
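
To see why the hyphens matter, here is a small sketch. The helper `split_on_punctuation` is a hypothetical, simplified stand-in for a tokenizer that treats '-' as a separator (my assumption about how the default Keras text filters behave); it is not the Keras code itself.

```python
import re

def split_on_punctuation(text):
    """Simplified tokenizer: treat '-' and whitespace as word separators."""
    return [tok for tok in re.split(r"[-\s]+", text) if tok]

sentence = 'khióng-pòo ê tshut-ián'
# With hyphens, each hyphenated word falls apart into two tokens.
print(split_on_punctuation(sentence))                   # ['khióng', 'pòo', 'ê', 'tshut', 'ián']
# After removing the hyphens, each word stays in one piece.
print(split_on_punctuation(sentence.replace('-', '')))  # ['khióngpòo', 'ê', 'tshutián']
```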

Count the number of unique words in the text.

all_words = []
for sent in corpus:
    tokenize_word = word_tokenize(sent)
    for word in tokenize_word:
        all_words.append(word)
       
unique_words = set(all_words)
print(len(unique_words))
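
As a cross-check that does not depend on NLTK (word_tokenize may require downloading NLTK's tokenizer data first), the same count can be sketched with plain whitespace splitting. This assumes the corpus contains no punctuation once the hyphens are gone, which holds for the five sentences above; the corpus is repeated here so the snippet runs on its own.

```python
from collections import Counter

# The corpus after hyphen removal
corpus = [
    'tse sī tsi̍t tshut tshutsik ê tiāniánn',
    'tse tiāniánn tsiok tsán guá kahì',
    'khióngpòo ê tshutián',
    'guá bô kahì tse tiāniánn',
    'tse tiāniánn si̍ttsāisī khióngpòo',
]

# Count every whitespace-separated word across all sentences
counts = Counter(word for sent in corpus for word in sent.split())
print(len(counts))    # 15 unique words
print(counts['tse'])  # 4
```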

Next, the words must be converted to integers before Embedding() can read them. We use the one_hot() function from keras.preprocessing.text.

vocab_length = 30
embedded_sentences = [one_hot(sent, vocab_length) for sent in corpus]
print(embedded_sentences)

[[3, 25, 29, 15, 24, 1, 23], [3, 23, 22, 28, 17, 6], [7, 1, 20], [17, 23, 6, 3, 23], [3, 23, 2, 7]]
The first sentence has 7 words, so the first item in the list contains 7 integers.
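
Despite its name, one_hot() does not build one-hot vectors: it hashes each word to an integer, so two different words can end up with the same index. In the output above, the fourth sentence 'guá bô kahì tse tiāniánn' is encoded as [17, 23, 6, 3, 23], so bô and tiāniánn both map to 23. The sketch below illustrates the idea with hypothetical helpers `hash_index` and `encode`; it uses MD5 as a stable hash, which is an assumption for illustration and not necessarily the hash that one_hot() uses, so the integers will not match the ones printed above.

```python
import hashlib

def hash_index(word, n):
    """Map a word to an integer in [1, n-1] using a stable MD5-based hash."""
    digest = hashlib.md5(word.encode('utf-8')).hexdigest()
    return int(digest, 16) % (n - 1) + 1

def encode(sentence, n):
    """Encode a whitespace-separated sentence as a list of hashed word indices."""
    return [hash_index(word, n) for word in sentence.split()]

vocab_length = 30
codes = encode('guá bô kahì tse tiāniánn', vocab_length)
print(codes)  # five integers, each between 1 and 29
```

Choosing vocab_length larger than the number of unique words makes collisions less likely, but does not rule them out.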

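Next, we fix the sentence length and fill the empty positions with 0 so that all sentences have the same length and can be read by Embedding(). We use the pad_sequences() function.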

padded_sentences = pad_sequences(embedded_sentences, maxlen=12, padding='post') 
print(padded_sentences)

[[ 3 25 29 15 24 1 23 0 0 0 0 0]
[ 3 23 22 28 17 6 0 0 0 0 0 0]
[ 7 1 20 0 0 0 0 0 0 0 0 0]
[17 23 6 3 23 0 0 0 0 0 0 0]
[ 3 23 2 7 0 0 0 0 0 0 0 0]]
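
What padding='post' does can be sketched in a few lines of plain Python. The function `pad_post` is a hypothetical helper, not the Keras implementation; keeping the last maxlen items of an over-long sequence is my assumption, chosen to mirror the default truncating='pre' of pad_sequences().

```python
def pad_post(sequences, maxlen, value=0):
    """Append `value` to each sequence up to maxlen; keep the last maxlen items if longer."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # truncate from the front when too long
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded

embedded_sentences = [[3, 25, 29, 15, 24, 1, 23], [3, 23, 22, 28, 17, 6],
                      [7, 1, 20], [17, 23, 6, 3, 23], [3, 23, 2, 7]]
rows = pad_post(embedded_sentences, maxlen=12)
for row in rows:
    print(row)  # first row: [3, 25, 29, 15, 24, 1, 23, 0, 0, 0, 0, 0]
```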

Now we can build the model.

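length_long_sentence = 12  # sentence length; must match maxlen used in pad_sequences()
model = Sequential()
# 5-dimensional word vectors, matching the printed output below
model.add(Embedding(vocab_length, 5, input_length=length_long_sentence))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])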

Inspect the vectors produced by Embedding(). For each sentence the layer returns a 2D array with one row per word position:

output = model.predict(padded_sentences)
print(output)

[[[-0.01115279 -0.0053846 0.0145705 0.01441126 -0.01934116]
[ 0.01724459 0.03577454 0.02544147 0.0369082 0.02247829]
[-0.00657413 0.04421231 0.03926947 0.01498995 0.00432252]
[-0.01672726 0.04325547 -0.01818988 0.01232086 0.03949806]
[ 0.03714544 -0.03660127 0.03566999 -0.03256686 0.03914088]
[-0.00261252 0.01996125 -0.03446733 -0.01299053 0.00557587]
[ 0.01985036 0.02891095 0.04272795 0.03223069 0.01777556]
[-0.02478491 0.03551097 -0.03647963 -0.0455409 -0.04592093]
[-0.02478491 0.03551097 -0.03647963 -0.0455409 -0.04592093]
[-0.02478491 0.03551097 -0.03647963 -0.0455409 -0.04592093]
[-0.02478491 0.03551097 -0.03647963 -0.0455409 -0.04592093]
[-0.02478491 0.03551097 -0.03647963 -0.0455409 -0.04592093]]
....
....
....
[-0.02478491 0.03551097 -0.03647963 -0.0455409 -0.04592093]
[-0.02478491 0.03551097 -0.03647963 -0.0455409 -0.04592093]
[-0.02478491 0.03551097 -0.03647963 -0.0455409 -0.04592093]]]
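
The repeated rows in the output correspond to the padded positions: every 0 index selects the same row of the embedding table. The small sketch below makes this visible for one padded sentence. The 30 x 5 table is hypothetical random data standing in for the layer's weights, so the numbers differ from those printed above.

```python
import random

rng = random.Random(0)
# Hypothetical 30 x 5 table standing in for the Embedding() weights
table = [[round(rng.uniform(-0.05, 0.05), 4) for _ in range(5)] for _ in range(30)]

padded = [17, 23, 6, 3, 23, 0, 0, 0, 0, 0, 0, 0]  # fourth sentence after padding
output = [table[i] for i in padded]                # one 5-dimensional row per position

# Positions that hold the same index receive the same vector.
print(output[1] == output[4])                       # True (both are index 23)
print(all(row == output[5] for row in output[5:]))  # True (all padding, index 0)
```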