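[Common Natural Language Processing Techniques] Text Similarity (IV): Building Your Own Word2vec Model

2021 iThome Ironman Contest

DAY 15

AI & Data

Part 15 of the series "When Natural Language Processing Meets Deep Learning"

13th Ironman Contest natural language processing word embedding word2vec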

Friedrich1942

2021-09-23 23:57:23


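Introduction

I originally thought the text similarity topic would take two days, but it ended up taking four. Today's post is the last one on the fundamentals of natural language processing, so let's wrap up by building a custom embedding model.

The Word2vec Model (Continued)

Continuing yesterday's discussion and implementation of CBoW, today we will let the neural network learn and build two-dimensional word embeddings.

Continuous Bag-of-Words Architecture (CBoW), Continued

Because I ended up training the model the TensorFlow 1.x way, today's model definition differs somewhat from yesterday's. We first list every training pair (context, target) produced by the CBoW algorithm: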

# Build a CBoW (context, target) generator

from sklearn.feature_extraction.text import CountVectorizer

# number of words taken on each side of the target
context_length = 2

# function to get CBoW data pairs
def get_cbow_datapairs(tokens, context_length):
    cbows = list()
    # only positions with a full window on both sides are used as targets
    for i in range(context_length, len(tokens) - context_length):
        target = tokens[i]
        context = tokens[i - context_length : i] + tokens[i + 1 : i + context_length + 1]
        # CountVectorizer is used here only to get the unique context words
        # (lowercased and sorted); word order inside the window is discarded
        vectoriser = CountVectorizer()
        vectoriser.fit(context)
        # get_feature_names() was removed in scikit-learn 1.2
        context_no_order = vectoriser.get_feature_names_out()
        for word in context_no_order:
            cbows.append([word, target])
    return cbows

# generate data pairs (tokens is the token list defined earlier in the series)
cbows_data = get_cbow_datapairs(tokens, context_length)

# print the dataset
for cbow in cbows_data:
    print(cbow)
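
To make the windowing logic easier to follow, here is a minimal sketch that runs on a tiny made-up token list. The sentence `toy_tokens` and the helper name `cbow_pairs` are assumptions for illustration and are not the corpus or function used in this series. It also swaps `CountVectorizer` for a plain `sorted(set(...))`, so unlike the generator above it does not lowercase words or drop one-character tokens.

```python
def cbow_pairs(tokens, context_length):
    """Return [context_word, target] pairs for every position with a full window."""
    pairs = []
    for i in range(context_length, len(tokens) - context_length):
        target = tokens[i]
        window = tokens[i - context_length:i] + tokens[i + 1:i + context_length + 1]
        # CBoW ignores word order, so deduplicate and sort the context words
        for word in sorted(set(window)):
            pairs.append([word, target])
    return pairs


# hypothetical toy sentence, not the corpus from the post
toy_tokens = ["the", "king", "loves", "the", "queen"]
toy_pairs = cbow_pairs(toy_tokens, context_length=2)
for pair in toy_pairs:
    print(pair)
```

With five tokens and a context length of 2, only the middle word "loves" has a full window, so the sketch emits one pair per distinct context word around it.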

On our corpus this yields 33 [context word, target word] pairs in total (the context word is the feature and the target word is the label).

Next, one-hot encode the words in every training pair:

import numpy as np

def get_onehot_list(word, vocab):
    # vocab maps each word to its index; unknown words stay all-zero
    onehot_encoded = [0] * len(vocab)
    if word in vocab:
        onehot_encoded[vocab[word]] = 1
    return onehot_encoded


X_train = list()
y_train = list()

# one-hot encode each data pair
for context_word, target_word in cbows_data:
    X_train.append(get_onehot_list(context_word, vocab))
    y_train.append(get_onehot_list(target_word, vocab))
X_train = np.asarray(X_train)
y_train = np.asarray(y_train)
print("X_train: ", X_train, ", size: ", X_train.shape) # (33, 8)
print("y_train:", y_train, ", size: ", y_train.shape) # (33, 8)
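
The `vocab` dictionary is not shown in this post, so the sketch below assumes one plausible layout: the sorted unique tokens mapped to the indices 0..V-1. The names `toy_vocab`, `one_hot`, `X_toy` and `y_toy` are hypothetical, and the pairs are hand-written for a toy sentence rather than taken from the 33 pairs above.

```python
import numpy as np

# hypothetical toy sentence and vocabulary layout (an assumption, not the post's vocab)
toy_tokens = ["the", "king", "loves", "the", "queen"]
toy_vocab = {word: idx for idx, word in enumerate(sorted(set(toy_tokens)))}


def one_hot(word, vocab):
    """Return a length-V vector with a single 1 at the word's index."""
    vec = np.zeros(len(vocab), dtype=int)
    if word in vocab:
        vec[vocab[word]] = 1
    return vec


# hand-written [context word, target word] pairs for the toy sentence
toy_pairs = [["king", "loves"], ["queen", "loves"], ["the", "loves"]]
X_toy = np.stack([one_hot(context, toy_vocab) for context, _ in toy_pairs])
y_toy = np.stack([one_hot(target, toy_vocab) for _, target in toy_pairs])

print(toy_vocab)
print(X_toy.shape, y_toy.shape)
```

Each row of `X_toy` and `y_toy` has exactly one non-zero entry, which is why the input layer only needs V units once the pairs are flattened.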

We use TensorFlow as the framework for building the network. Since today I use TensorFlow 1.x syntax to design the shallow Word2vec network, readers on TensorFlow 2.x can add the following lines:

import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

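Yesterday we fed all of a target word's contexts in together, so the input layer had dimension C x V, where C is twice the context length and V is the vocabulary size (8 in our example). Today we changed how the training data is prepared and "flattened" each (context words, target word) group into individual (context word, target word) pairs, so the input layer's dimension is simply V.
Next, we build the weights W1 and bias b1 between the input layer and the hidden layer. In our example, the word embedding is the two-dimensional vector that the hidden layer produces from a one-hot encoded word. The part of the network from the input layer to the hidden layer is called the encoder. The other part, the decoder, maps the two-dimensional hidden vector back to a vector of the original dimension V and applies softmax to estimate a probability for each dimension, so that the output approximates the one-hot encoded target word.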

vocab_size = len(vocab) # V = 8 in this example (skip if already defined earlier)

# placeholders for the one-hot context words (input) and target words (label)
x = tf.placeholder(tf.float32, shape = (None, vocab_size))
y_label = tf.placeholder(tf.float32, shape = (None, vocab_size))

# Build our model - Embedding part (encoder)
embed_dim = 2 # you can choose your own number
W1 = tf.Variable(tf.random_normal([vocab_size, embed_dim]))
b1 = tf.Variable(tf.random_normal([embed_dim])) # bias
hidden_repre = tf.add(tf.matmul(x, W1), b1)

# Decoder: map the hidden vector back to V dimensions, then apply softmax
W2 = tf.Variable(tf.random_normal([embed_dim, vocab_size]))
b2 = tf.Variable(tf.random_normal([vocab_size]))
predict = tf.nn.softmax(tf.add(tf.matmul(hidden_repre, W2), b2))
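
To check the shapes and the encoder/decoder roles without TensorFlow, the same forward pass can be sketched in plain NumPy. This is an illustrative re-statement under assumptions: the weights are random, the word index 3 is arbitrary, and the helper `softmax` is written here rather than taken from the post.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 8, 2  # vocabulary size and embedding size used in the post

# randomly initialised parameters, standing in for the TensorFlow variables
W1 = rng.normal(size=(V, D))
b1 = rng.normal(size=D)
W2 = rng.normal(size=(D, V))
b2 = rng.normal(size=V)


def softmax(z):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


x = np.zeros((1, V))
x[0, 3] = 1.0                      # one-hot vector for an arbitrary word index 3

hidden = x @ W1 + b1               # encoder output, shape (1, D)
probs = softmax(hidden @ W2 + b2)  # decoder output, shape (1, V)

print(hidden, W1[3] + b1)          # the two vectors are identical
print(probs.sum())                 # probabilities over the vocabulary sum to 1
```

Because the input is one-hot, the matrix product simply selects row 3 of `W1`, which is why the rows of `W1` (plus `b1`) can later be read off directly as word embeddings.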

Now it is time to train. The whole training process runs for 5000 epochs:

import time

# Start training
sess = tf.Session()

# cross-entropy between the one-hot target and the predicted distribution
cross_entropy_loss = tf.reduce_mean(-tf.reduce_sum(y_label * tf.log(predict), axis = 1))
train_step = tf.train.GradientDescentOptimizer(0.1).minimize(cross_entropy_loss)

init = tf.global_variables_initializer()
sess.run(init)

n_epochs = 5000
# train for n_epochs epochs (full-batch gradient descent)
# Note: tf.device only affects ops created inside its scope, so wrapping
# sess.run in it would not change where this already-built graph runs.
# TensorFlow places ops on a visible GPU by default when one is available.
print("Training:")
start = time.time()
for n in range(n_epochs):
    sess.run(train_step, feed_dict = {x: X_train, y_label: y_train})
    # print("epoch {}: loss is {}".format(n, sess.run(cross_entropy_loss, feed_dict = {x: X_train, y_label: y_train})))
print("Training is done! Time spent: {} s".format(time.time() - start))
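
For readers curious about what each `train_step` does internally, here is a rough NumPy sketch of full-batch gradient descent on the same two-layer softmax network. It is not the author's code: the toy index pairs, the 2000-step budget and every variable name are assumptions chosen for illustration, and the gradients are written out by hand instead of being derived by TensorFlow.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D, lr = 4, 2, 0.1   # toy vocabulary size, embedding size, learning rate

# hypothetical (context index, target index) pairs, one-hot encoded
contexts = np.array([0, 1, 2, 3, 0, 2])
targets = np.array([1, 0, 3, 2, 1, 3])
X = np.eye(V)[contexts]
Y = np.eye(V)[targets]

W1 = rng.normal(size=(V, D))
b1 = rng.normal(size=D)
W2 = rng.normal(size=(D, V))
b2 = rng.normal(size=V)


def forward(X):
    """Return the hidden representation and the softmax probabilities."""
    h = X @ W1 + b1
    z = h @ W2 + b2
    z = z - z.max(axis=1, keepdims=True)
    p = np.exp(z)
    p /= p.sum(axis=1, keepdims=True)
    return h, p


def loss_fn(p, Y):
    """Mean cross-entropy between one-hot labels and predictions."""
    return float(np.mean(-np.sum(Y * np.log(p + 1e-12), axis=1)))


losses = []
for step in range(2000):
    h, p = forward(X)
    losses.append(loss_fn(p, Y))
    dz = (p - Y) / len(X)              # gradient of the loss w.r.t. the logits
    dW2, db2 = h.T @ dz, dz.sum(axis=0)
    dh = dz @ W2.T                     # back-propagate into the hidden layer
    dW1, db1 = X.T @ dh, dh.sum(axis=0)
    W1 -= lr * dW1
    b1 -= lr * db1
    W2 -= lr * dW2
    b2 -= lr * db2

print("first loss:", losses[0], "last loss:", losses[-1])
```

The key identity is that the gradient of softmax cross-entropy with respect to the logits is simply `p - Y`, which is what makes this shallow network cheap to train.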

Training finished in 12 seconds! Next, let's look at the word embedding of the word "king":

# look up the learned embedding: a one-hot input selects one row of W1, then b1 is added
vectors = sess.run(W1 + b1)
text_word = "king"
word_id = vocab[text_word]
print("word embedding of {} is {}".format(text_word, vectors[word_id]))

Its two-dimensional word embedding is as follows:

Next, we use t-SNE from the scikit-learn package to place every word in the vocabulary on a two-dimensional plane:

from sklearn.manifold import TSNE
from sklearn import preprocessing
import matplotlib.pyplot as plt

# perplexity must be smaller than the number of samples (8 words here);
# the value 5 is an arbitrary choice, not taken from the original post
model = TSNE(n_components = 2, perplexity = 5, random_state = 0)
np.set_printoptions(suppress = True)
vectors = model.fit_transform(vectors)

# scale every 2-D point to unit L2 norm
normalizer = preprocessing.Normalizer(norm = "l2")
vectors = normalizer.fit_transform(vectors)

fig, ax = plt.subplots(figsize = (10, 8))
fig.suptitle("My Word Embeddings", fontsize = 20)
ax.set_xlim([-1.5, 1.5])
ax.set_ylim([-1.5, 1.5])
# annotate each vocabulary word once
for token in vocab:
    x_coord, y_coord = vectors[vocab[token]]
    print(token, x_coord, y_coord)
    ax.annotate(token, (x_coord, y_coord))
plt.show()

From the plot we can observe how the words are distributed, and we can also use cosine distance to find the closest word:

text_word = "queen"
closest_word_cos = idx2word(find_closest_cosine(vocab[text_word], vectors), vocab)
print("using cosine distance:", end = ' ')
print("'{}' is closest to '{}'".format(text_word, closest_word_cos))
# using cosine distance: 'queen' is closest to 'woman'
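
The helpers `find_closest_cosine` and `idx2word` are not defined in this post. The sketch below shows one plausible way to write them so that they match the call above; the bodies are an assumption, not the author's implementation, and the toy vocabulary and vectors are made up.

```python
import numpy as np


def find_closest_cosine(word_index, vectors):
    """Return the index of the row most similar (by cosine) to the query row."""
    query = vectors[word_index]
    norms = np.linalg.norm(vectors, axis=1) * np.linalg.norm(query)
    sims = vectors @ query / np.maximum(norms, 1e-12)
    sims[word_index] = -np.inf   # exclude the query word itself
    return int(np.argmax(sims))


def idx2word(index, vocab):
    """Reverse lookup in a word -> index dictionary."""
    for word, idx in vocab.items():
        if idx == index:
            return word
    return None


# hypothetical toy data, not the trained embeddings from the post
toy_vocab = {"king": 0, "queen": 1, "apple": 2}
toy_vectors = np.array([[1.0, 0.1], [0.9, 0.2], [-0.2, 1.0]])
closest = idx2word(find_closest_cosine(toy_vocab["king"], toy_vectors), toy_vocab)
print("'king' is closest to '{}'".format(closest))
```

Maximising cosine similarity is equivalent to minimising cosine distance, so the sketch picks the row with the largest similarity after masking out the query word.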

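Skip-Gram Model Architecture

Another algorithm for extracting (context, target) pairs is Skip-Gram (SG), which uses the center word to predict the surrounding context. Note that CBoW embeds each context word and averages the results to obtain the word embedding in the hidden layer, so the ordering of the context words does not matter. In the standard Skip-Gram formulation, each (center word, context word) pair is treated as a separate training example, so the positions of the context words within the window are likewise not encoded. We keep this algorithm at the conceptual level and will not define the model and build word embeddings step by step as we did for CBoW.

The Skip-Gram model looks at the center word to infer its context: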

圖片來源：Practical Natural Language Processing by Sowmya Vajjala et al.
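
Although the post stays at the conceptual level for Skip-Gram, the pair generation is easy to sketch. The helper `skipgram_pairs` and the toy sentence below are hypothetical. Note two choices that differ from the CBoW generator earlier: the center word is now the input and each context word is a label, and this sketch keeps the partial windows at the edges of the sentence.

```python
def skipgram_pairs(tokens, context_length):
    """Return [center_word, context_word] pairs for every position."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - context_length)
        hi = min(len(tokens), i + context_length + 1)
        for j in range(lo, hi):
            if j != i:               # skip the center word itself
                pairs.append([center, tokens[j]])
    return pairs


# hypothetical toy sentence, not the corpus from the post
toy_tokens = ["the", "king", "loves", "the", "queen"]
sg_pairs = skipgram_pairs(toy_tokens, context_length=1)
for pair in sg_pairs:
    print(pair)
```

These pairs could then be one-hot encoded and fed to the same shallow network, with the roles of feature and label swapped relative to CBoW.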

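Conclusion

Besides building a Word2vec model from scratch with frameworks such as TensorFlow or PyTorch, we can also use the Gensim package to build our own custom word embedding models; interested readers can refer to the article links below. That's all for today, which also brings the four-day introduction to text similarity to a close. Tomorrow we will quickly review deep learning concepts and key models, paving the way for building our own translator!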