1. Introduction
Yesterday's post covered some of the ways N-grams are computed. This post uses 'next-word prediction' as a hands-on N-gram example: given the preceding words, predict which word should come next. It comes from an assignment in a course I took; I thought the application was neat, so I'm sharing it~ I did make a few small tweaks so it's easier for me to explain XD~
2. Dataset
The dataset is Chinese text taken from Wikipedia. The version provided by the course is fairly small, so I'm guessing they trimmed it down, but the source is definitely Wikipedia.
3. Algorithm
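The model is a plain maximum-likelihood N-gram over characters, which is exactly what the code below computes. For the bigram case, the probability that character w follows character v is estimated directly from counts:

P(w | v) = count(vw) / count(v)

where count(vw) is how many times the two-character sequence vw appears in the corpus, and count(v) how many times v appears. The unigram case is the same idea with no context: P(w) = count(w) / total number of characters. There is no smoothing here, so any pair never seen in the corpus simply gets probability zero.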
4. Implementation
Import packages:
import re
from collections import Counter
Simple preprocessing: keep only Chinese characters and break the text into segments:
def preprocess_text(line: str) -> list:
    # Keep only Chinese characters; any non-Chinese character splits the line into segments
    chinese = r'[\u4E00-\u9FFF]+'
    segments = re.findall(chinese, line)
    return segments
Read the data and preprocess each line with preprocess_text:
word_list = []
with open('./wiki_zh_small.txt', encoding="utf-8") as file:
    for line in file.readlines():
        word_list += preprocess_text(line)
preprocess_text('“今天”雨會下非常大,大到你受不了')
# output: ['今天', '雨會下非常大', '大到你受不了']
Build a counter class that records how many times each single character appears and how many times each two-character sequence appears:
class etWordCounters:
    def __init__(self, n):
        self.n = n
        self.counters = [Counter() for _ in range(n + 1)]

    def generate_gram(self, segments):
        # i=1 -> unigram counts, i=2 -> bigram counts
        for i in range(1, 1 + self.n):
            for segment in segments:
                self.counters[i].update(self._skip(segment, i))
        # Use self.counters[0] to record the total number of characters; 'eating' is the key XD
        self.counters[0] = Counter({'eating': sum(self.counters[1].values())})

    def __getitem__(self, k):
        return self.counters[k]

    def _skip(self, segment, n):
        # Yield every window of n consecutive characters in the segment
        if len(segment) < n:
            return
        shift = n - 1
        for i in range(len(segment) - shift):
            yield segment[i : i + shift + 1]
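Before wiring this into an N-gram model, a quick sanity check on a tiny made-up corpus (the two segments below are hypothetical examples, not from the dataset) shows what _skip and generate_gram produce:

demo = etWordCounters(n=2)
demo.generate_gram(['今天天氣好', '今天下雨'])

list(demo._skip('今天天氣', 2))  # ['今天', '天天', '天氣']
demo[1]['天']    # 3: '天' appears three times across the two segments
demo[2]['今天']  # 2: both segments start with the pair '今天'
demo[0]          # Counter({'eating': 9}): nine characters in total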
Build the N-gram class:
class Ngram:
    def __init__(self, n: int, counters):
        """
        n: the n in n-gram
        counters: an etWordCounters object
        """
        self.n = n
        self.major_counter = counters[n]      # counts of n-character sequences
        self.minor_counter = counters[n - 1]  # counts of (n-1)-character prefixes

    def predict_next_word(self, prefix: str = '', top_k: int = 5):
        """
        If prefix is an empty string, the 1-gram predicts the next word;
        if prefix has one or more characters, the 2-gram is used.
        """
        if self.n <= 1:
            # No preceding word to condition on: fall back to the total-count key
            prefix = 'eating'
        else:
            prefix = prefix[-(self.n - 1):]  # keep only the last n-1 characters
        count_prefix = self.minor_counter[prefix]
        # 'eating' stands for the empty prefix, so every key matches it
        prefix = '' if prefix == 'eating' else prefix
        probs = []
        # collect each candidate word with its probability
        for key, count in self.major_counter.items():
            if key.startswith(prefix):
                prob = count / count_prefix
                probs.append((prob, key[-1]))
        sorted_probs = sorted(probs, reverse=True)
        return sorted_probs[:top_k] if top_k > 0 else sorted_probs

    def get_word_dict(self, prefix=''):
        """
        Run predict_next_word and turn the result into a {word: prob} dict.
        """
        return {word: prob for prob, word in self.predict_next_word(prefix, top_k=-1)}
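Reusing the tiny demo counters from the sanity check above, the bigram math is easy to verify by hand: '今' appears twice and is followed by '天' both times, so P(天|今) should be 1.0:

tiny_bigram = Ngram(2, demo)
tiny_bigram.predict_next_word('今')  # [(1.0, '天')]
tiny_bigram.get_word_dict('今')      # {'天': 1.0}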
Count the characters:
counters = etWordCounters(n=2)
counters.generate_gram(word_list)
Let's see how many characters there are in total:
# total number of characters
counters[0]
# output: Counter({'eating': 371373}), i.e. 371,373 characters in total
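You can also poke at the raw counts directly with the usual Counter methods (the actual output depends on the corpus):

counters[1].most_common(5)  # the five most frequent characters
counters[2]['我思']          # how many times the pair '我思' appears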
Build the uni-gram and bi-gram models:
unigram = Ngram(1, counters)
bigram = Ngram(2, counters)
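With an empty prefix, the unigram model falls back to the 'eating' total, so each probability is just count(char) / 371373 and the result is simply the most frequent characters in the corpus (the exact ranking depends on the dataset; high-frequency characters such as '的' typically come first):

unigram.predict_next_word('', top_k=3)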
Build the class for the next-character prediction model: with no preceding text use the uni-gram, with one or more characters use the bi-gram:
class ChineseWordPredict:
    def __init__(self, unigram, bigram):
        self.unigram = unigram
        self.bigram = bigram

    def predict_proba(self, prefix='', top_k=5):
        # route to the right N-gram model based on the prefix length
        if len(prefix) == 0:
            return self.unigram.predict_next_word(prefix, top_k)
        return self.bigram.predict_next_word(prefix, top_k)
model = ChineseWordPredict(unigram, bigram)
Predict the next character after '我思':
probs = model.predict_proba('我思', top_k=4)
probs
# output:
# [(0.3370165745856354, '想'),
# (0.12154696132596685, '考'),
# (0.09944751381215469, '維'),
# (0.04419889502762431, '是')]
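As a small extra that is not part of the course assignment, predict_proba can be chained to grow a sentence character by character. Here is a minimal greedy sketch using the model built above; note that always taking the single top candidate tends to fall into repetitive loops, so sampling from the top-k usually gives more varied text:

def generate(model, seed='我', length=10):
    """Greedily append the most probable next character, starting from seed."""
    text = seed
    for _ in range(length):
        candidates = model.predict_proba(text, top_k=1)
        if not candidates:  # the last character was never seen as a bigram prefix
            break
        text += candidates[0][1]  # each candidate is a (prob, word) tuple
    return text

generate(model, seed='我')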
This post showed how to build an N-gram model. The next post will start on POS (part of speech) tagging and how to implement the algorithm in Python. POS is quite a long topic, so it will probably take a few days of posts~