[Day 19] - The Past and Present of Text Vectorization: Bag of Words

2024 iThome Ironman Contest

DAY 19

Self-Challenge Group

Part 19 of the series "A 30-Day Introductory Training Plan for NLP Beginners"

16th Ironman Contest

sfg

2024-08-24 09:45:17

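Next up is Bag of Words (BoW), another way to convert text into numerical data. We introduced it briefly in yesterday's post, so today let's look at it in detail.

Suppose our corpus contains many texts. We can extract every word and build a complete vocabulary. Think of this vocabulary as a big bag holding many different words. A whole sentence can then be encoded by how many times each word in the bag occurs in it. For example: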

I like to watch movies on the weekend.
She often watches movies at home.
They enjoy watching movies together.

Suppose the corpus starts out with these three sentences. First, preprocess and tokenize them:

['i', 'like', 'to', 'watch', 'movies', 'on', 'the', 'weekend']
['she', 'often', 'watch', 'movies', 'at', 'home']
['they', 'enjoy', 'watch', 'movies', 'together']

Then collect all the unique words into a single list:

['watch', 'home', 'i', 'they', 'together', 'like', 'often', 'enjoy', 'weekend', 'movies', 'to', 'on', 'at', 'she', 'the']

Let's implement everything up to this point with NLTK:

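import string

import nltk
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer

# Download the tokenizer model and the WordNet data used by the lemmatizer.
# Note: newer NLTK releases may also ask for the 'punkt_tab' resource.
nltk.download('punkt')
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

corpus = [
    "I like to watch movies on the weekend.",
    "She often watches movies at home.",
    "They enjoy watching movies together."
]

def preprocess(text):
    # Lowercase, strip punctuation, tokenize, then lemmatize each token as a verb.
    text = text.lower()
    text = ''.join([ch for ch in text if ch not in string.punctuation])
    tokens = word_tokenize(text)
    tokens = [lemmatizer.lemmatize(word, 'v') for word in tokens]
    return tokens

def corpus_to_bow(corpus):
    # Gather every token in the corpus, then keep only the unique ones.
    preprocessed_corpus = []
    for text in corpus:
        tokens = preprocess(text)
        preprocessed_corpus.extend(tokens)
    # A set has no guaranteed order, so the word order may differ between runs.
    return list(set(preprocessed_corpus))

bow = corpus_to_bow(corpus)
print(bow)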

['watch', 'home', 'i', 'they', 'together', 'like', 'often', 'enjoy', 'weekend', 'movies', 'to', 'on', 'at', 'she', 'the']

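Once again we are using the preprocessing techniques we learned earlier. One thing worth adding: without lemmatization, watch, watches, and watching would be treated as three different words, which makes the bag of words needlessly long.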
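To make the effect of lemmatization concrete, here is a minimal sketch that compares the vocabulary size with and without it. To stay free of NLTK data downloads, it swaps the real lemmatizer for `TOY_LEMMAS`, a hypothetical hand-written lookup table that covers only the inflected forms in this corpus; `simple_tokens` is likewise an illustrative helper, not part of the original code.

```python
import string

corpus = [
    "I like to watch movies on the weekend.",
    "She often watches movies at home.",
    "They enjoy watching movies together.",
]

# Hypothetical stand-in for a lemmatizer: a tiny lookup table that only
# knows the inflected verb forms appearing in this corpus.
TOY_LEMMAS = {"watches": "watch", "watching": "watch"}

def simple_tokens(text):
    """Lowercase, strip punctuation, and split on whitespace."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()

# Vocabulary built from the raw tokens.
raw_vocab = {tok for text in corpus for tok in simple_tokens(text)}

# Vocabulary built after mapping each token to its (toy) lemma.
lemma_vocab = {TOY_LEMMAS.get(tok, tok) for text in corpus for tok in simple_tokens(text)}

print(len(raw_vocab), len(lemma_vocab))
print(sorted(raw_vocab - lemma_vocab))
```

With this toy table, the two extra entries in the raw vocabulary are exactly the inflected forms of watch, so every vector would carry two more columns for what is really the same word.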

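Now that we know which words the corpus contains, the next step is encoding. For each word in the bag, we check whether it appears in the sentence and record the number of occurrences at that word's position in the bag. That completes the encoding.

Let's implement it: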

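from collections import Counter

def text_to_vector(text, bow):
    # Count each token in the sentence, then read the counts off in vocabulary order.
    tokens = preprocess(text)
    counts = Counter(tokens)
    # Words missing from the sentence get 0; tokens missing from bow are ignored.
    vector = [counts.get(word, 0) for word in bow]
    return vector

# Encode every sentence in the corpus.
corpus_vector = [text_to_vector(text, bow) for text in corpus]
print(corpus_vector)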

[[1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1], [1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0], [1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0]]

This way, we can turn the texts in corpus into vectors. To encode a new sentence, we just pass the sentence and the bag of words to text_to_vector. For example, We love watching movies on the weekend. can be represented as [1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1].
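Because the vocabulary above is built from a set, its word order, and therefore the column order of every vector, is not guaranteed to stay the same from one run to the next. Below is a minimal sketch of one way to make the encoding reproducible, assuming the token lists from the preprocessing step are already available: sort the vocabulary and keep a word-to-column index. The names `tokenized_corpus`, `vocab`, `index`, and `encode` are illustrative and do not come from the original code.

```python
from collections import Counter

# Token lists as produced by the preprocessing step earlier in the post.
tokenized_corpus = [
    ["i", "like", "to", "watch", "movies", "on", "the", "weekend"],
    ["she", "often", "watch", "movies", "at", "home"],
    ["they", "enjoy", "watch", "movies", "together"],
]

# Sorting fixes the column order, so the same corpus always gives the same vectors.
vocab = sorted({tok for tokens in tokenized_corpus for tok in tokens})

# Map each word to its column so lookups do not need to scan the list.
index = {word: i for i, word in enumerate(vocab)}

def encode(tokens, index):
    """Count-encode a token list; tokens outside the vocabulary are skipped."""
    vector = [0] * len(index)
    for tok, n in Counter(tokens).items():
        if tok in index:
            vector[index[tok]] = n
    return vector

new_tokens = ["we", "love", "watch", "movies", "on", "the", "weekend"]
new_vector = encode(new_tokens, index)

print(vocab)
print(new_vector)
```

The resulting vector holds the same counts as the one in the text, only arranged in alphabetical column order rather than in the arbitrary order of the set.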

In practice, though, this encoding has a number of problems:

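Cannot handle Out-of-Vocabulary (OOV) words: in the sentence above, We and love do not appear in the bag of words, so there is no way to encode them.
Ignores grammatical structure: when the bag of words is built, the words of a sentence are stored without their order, so we only get an encoding of the sentence as a whole and the computer cannot learn the sentence's structure.
Wastes storage: the encoded result is a sparse vector. The more words the bag contains, the higher the vector's dimensionality, and a lot of space goes to storing useless 0s.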
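The three drawbacks above can be seen in a small sketch. The toy vocabulary and sentences are made up for illustration, and `encode` and `to_sparse` are hypothetical helpers rather than functions from the code above or from NLTK.

```python
from collections import Counter

# Hypothetical toy vocabulary, chosen only to keep the vectors short.
vocab = ["bites", "cat", "dog", "man", "sleeps", "the"]

def encode(sentence, vocab):
    """Count-encode a whitespace-tokenized sentence against a fixed vocabulary."""
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in vocab]

def to_sparse(vector):
    """Keep only the non-zero entries as {column index: count}."""
    return {i: n for i, n in enumerate(vector) if n != 0}

# 1. OOV: "bird" and "sings" are not in the vocabulary, so they leave no trace.
oov_vector = encode("the bird sings", vocab)

# 2. Word order: two sentences with opposite meanings get identical vectors.
dog_bites = encode("the dog bites the man", vocab)
man_bites = encode("the man bites the dog", vocab)

# 3. Sparsity: most entries are 0, so storing only the non-zero ones is smaller.
sparse = to_sparse(oov_vector)

print(oov_vector)
print(dog_bites == man_bites)
print(sparse)
```

Storing only the non-zero entries, as `to_sparse` does, is one common way to soften the storage cost, but it does nothing for the OOV and word-order problems.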

Because of all these drawbacks, most work today relies on other, more effective encoding methods, such as Word2Vec, which we will cover tomorrow.
