Day 9: 親手寫個檢索系統吧（二）倒排索引

11th鐵人賽 nlp 自然語言處理搜尋引擎資訊檢索

丹尼爾胡

2019-09-10 04:35:49

4240 瀏覽

分享至

1. 預處理

今天的實作我們會使用華爾街日報的的文件集，我有預先將文件集切割成只有兩萬份文件的集合，這份文件集能夠從以下的code中下載。在今天的實作中，我們會將每一行視為一份文件來處理（利用NLTK工具來記號化和正規劃）。

import requests
from pathlib import Path

fname = 'wsta_col_20k.gz'
my_file = Path(fname)
if not my_file.is_file():
    url = 'https://hyhu.me/resources/' + fname
    r = requests.get(url)

    # Save to the current directory
    with open(fname, 'wb') as f:
        f.write(r.content)

來測試一下我們下載的結果，讀讀看第一行（第一份文件）。

import gzip

raw_docs = []
with gzip.open(fname, 'rt') as f:
    for raw_doc in f:
        raw_docs.append(raw_doc)

print(len(raw_docs))
print(raw_docs[0])

接著，我們開始預處理。我們可以先用NLTK的工具word_tokenize來記號化每份文件，接著使用PorterStemmer來stem和小寫化(lowercase)各文件，並且把這些正規化後的字彙加上Unique ID存進Python的dict資料型態，所有的字彙M有自己的字彙ID[0..M−1]。這過程可能需要幾分鐘，如果一直沒跑出來不要緊張，等他一下～

import nltk

# 感謝thwu的提醒，這邊需要下載`punkt`以及宣告`stemmer`
nltk.download('punkt')
stemmer = nltk.stem.PorterStemmer()

# processed_docs 儲存預處理過的文件列表
processed_docs = []
# vocab 儲存(term, term id)組合
vocab = {}
# total_tokens 儲存總共字數（不是字彙量，而是記號總量）
total_tokens = 0

for raw_doc in raw_docs:
    
    # norm_doc 儲存正規化後的文件
    norm_doc = []
    
    # 使用word_tokenize
    tokenized_document = nltk.word_tokenize(raw_doc)
    for token in tokenized_document:
        stemmed_token = stemmer.stem(token).lower()
        norm_doc.append(stemmed_token)

        total_tokens += 1
        
        # 將正規化後的字彙存進vocab (不重複存同樣字型的字彙)
        if stemmed_token not in vocab:
            vocab[stemmed_token] = len(vocab)
            
    processed_docs.append(norm_doc)

print("Number of documents = {}".format(len(processed_docs)))
print("Number of unique terms = {}".format(len(vocab)))
print("Number of tokens = {}".format(total_tokens))

接著，我們可以使用Python的Counter來計算每個文件的詞頻。我們將每個字彙當作key，它的詞頻當作value。最後我們將所有文件各自的Counter存進doc_term_freqs列表中。

以一個Document為例：

the old night keeper keeps the keep in the town. in the big old house in the big old gown. The house in the town had the big old keep where the old night keeper never did sleep. The keeper keeps the keep in the night and keeps in the dark and sleeps in the light.

在經過記號化和stemming之後，它的Counter應該長這樣：

Counter({'the': 14, 'in': 7, 'keep': 6, 'old': 5, '.': 4, 'night': 3, 'keeper': 3, 'big': 3, 'town': 2, 'hous': 2, 'sleep': 2, 'and': 2, 'gown': 1, 'had': 1, 'where': 1, 'never': 1, 'did': 1, 'dark': 1, 'light': 1})

from collections import Counter

# doc_term_freqs 儲存每個文件分別的字彙及詞頻Counter
doc_term_freqs = []

for norm_doc in processed_docs:
    tfs = Counter()
    # 計算詞頻
    for token in norm_doc:
        tfs[token] += 1
    doc_term_freqs.append(tfs)

print(len(doc_term_freqs))
print(doc_term_freqs[0])
print(doc_term_freqs[100])

2. 倒排索引

再來就是我們的倒排索引，開發上我主要分成六個部分：

字彙表 vocab ，用以記錄字彙與其ID
每個文件的長度 doc_len
doc_ids 是一個list，其中儲存了這個doc所包含的所有字彙的ID。is a list indexed by term IDs. For each term ID, it stores a list of document ids of all documents containing that term
doc_term_freqs 是一個list，其中儲存了這個doc中相對應doc_ids的詞頻（就像Day 7中的第二個倒排索引）。每一個term ID都應該儲存著自己的文件-字彙詞頻列表(document term frequencies f_{d,t}，這個列表說明了每個檔案 d 中各個字彙 t 分別出現的頻率 f
doc_term_freqs 記錄了各個詞出現在每個文件的詞頻，而 doc_freqs 則記錄各個詞總共出現在多少個文件中。這會跟我們明天所要說的TF-IDF有關。文件頻率(Document Frequency) ft 的記法是，只要他曾經出現過，ft 就會+1。
最後我存了兩個數字 total_num_docs 和 max_doc_len ，他們分別記錄了總共處理的文件數量（應該要是兩萬份）以及單一文件最長的長度

這裡有些儲存的資料並不是為了之後計算TF-IDF用的，而是一些方便我們驗證開發正確性的統計數字。

class InvertedIndex:
    def __init__(self, vocab, doc_term_freqs):
        self.vocab = vocab
        self.doc_len = [0] * len(doc_term_freqs)
        self.doc_term_freqs = [[] for i in range(len(vocab))]
        self.doc_ids = [[] for i in range(len(vocab))]
        self.doc_freqs = [0] * len(vocab)
        self.total_num_docs = 0
        self.max_doc_len = 0
        for docid, term_freqs in enumerate(doc_term_freqs):
            doc_len = sum(term_freqs.values())
            self.max_doc_len = max(doc_len, self.max_doc_len)
            self.doc_len[docid] = doc_len
            self.total_num_docs += 1
            for term, freq in term_freqs.items():
                term_id = vocab[term]
                self.doc_ids[term_id].append(docid)
                self.doc_term_freqs[term_id].append(freq)
                self.doc_freqs[term_id] += 1
    
    def num_terms(self):
        return len(self.doc_ids)
    
    def num_docs(self):
        return self.total_num_docs
    
    def docids(self, term):
        term_id = self.vocab[term]
        return self.doc_ids[term_id]
    		
	def freqs(self, term):
        term_id = self.vocab[term]
        return self.doc_term_freqs[term_id]
    
    def f_t(self, term): 
        term_id = self.vocab[term]
        return self.doc_freqs[term_id]
    
    def space_in_bytes(self):
        # 我們假設每個integer使用8 bytes
        space_usage = 0
        for doc_list in self.doc_ids:
            space_usage += len(doc_list) * 8
        for freq_list in self.doc_term_freqs:
            space_usage += len(freq_list) * 8
        return space_usage
        
    
invindex = InvertedIndex(vocab, doc_term_freqs)
    
# print inverted index stats
print("documents = {}".format(invindex.num_docs()))
print("number of terms = {}".format(invindex.num_terms()))
print("longest document length = {}".format(invindex.max_doc_len))
print("uncompressed space usage MiB = {:.3f}".format(invindex.space_in_bytes() / (1024.0 * 1024.0)))

今天實作的Jupyter Notebook存在這裡。

Day 8: 認識文件矩陣以及索引的建立

Day 10: TF-IDF 文件加權與實作

系列文

深入淺出搜尋引擎和自然語言處理共 30 篇

RSS系列文訂閱系列文

133 人訂閱

完整目錄

熱門推薦

{{ item.channelVendor }} | {{ item.webinarstarted }} |

直播中

1 則留言

thwu

iT邦新手 4 級 ‧ 2019-10-09 15:40:57

您好
我於 Jupyter 中輸入在 "預處理" 這段的實作，並在執行程式時出現錯誤提示。
加入以下兩行程式碼後才可以執行

import nltk
# ---加入以下兩行
nltk.download('punkt')
stemmer = nltk.stem.PorterStemmer()
# ---

# processed_docs 儲存預處理過的文件列表
processed_docs = []
# vocab 儲存(term, term id)組合
vocab = {}
# total_tokens 儲存總共字數（不是字彙量，而是記號總量）
total_tokens = 0
...

提供給您參考，謝謝