
2022 iThome 鐵人賽

DAY 4

Continuing from yesterday's walkthrough of pre-processing, another important step is tokenization. After scraping a corpus from the web, we first tidy it with the regular expressions covered in the previous post, but pre-processing does not end there. Looking at the data at hand, it is easy to see that most of it comes as sentences, because whether it is social media posts, news articles, or most other written material, only complete sentences convey clear and complete meaning. Sentences, however, carry a huge amount of information and complex structure, so if an NLP task takes whole sentences as its smallest input unit, the computer will struggle to extract the information in them and to digest data in that form (I will absolutely not admit that the first time I ran an NLP task I threw long raw sentences straight into model training, sent my machine into orbit, and got no output at all...), and no language model can be trained.

So if sentences are not the unit, what should the input be? The answer is the protagonist of this post: the tokens produced by tokenization! A token can basically be understood as what we usually call a word. The process of cutting a sentence into many tokens is called tokenization, or word segmentation. In NLP work, English is a comparatively easy language to tokenize: the written form already leaves spaces between words, so splitting at the whitespace is basically error-free. Chinese tokenization, by contrast, is very hard, because there are no explicit boundaries between words, and cutting in different places can change the meaning. Chinese word segmentation therefore remains a problem worth studying to this day. The main Chinese segmentation engines at present are CKIP and Jieba.
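To see why English is the comparatively easy case, here is a minimal sketch in base R (my own toy example, not part of any tokenization package): a naive split on whitespace already gets you most of the way for English, while the same call is useless for Chinese because there is no whitespace to split on.

strsplit("Life expectancy at birth fell to 76.1 years", " ")[[1]] # splits into 8 word-like tokens
strsplit("食藥署表示同意核准疫苗", " ")[[1]] # no spaces, so the sentence comes back in one piece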

In addition, tokenization usually strips the punctuation as well, leaving only individual words; punctuation is kept only if it happens to be an important feature in your study. Besides punctuation, there is another class of words that is removed at tokenization time: stop words. Stop words are the words filtered out in an NLP task because their presence can mislead the computer, lower efficiency, and so on. The explanation on Wikipedia makes it clear why stop words exist: "For a given purpose, any class of words can be chosen as stop words. Generally speaking, stop words fall into two broad categories. One is the function words of human language, which are extremely common and, compared with other words, carry little concrete meaning, such as 'the', 'is', 'at', 'which', 'on'. The other category consists of content words, such as 'want', that are used so widely that a search engine cannot guarantee genuinely relevant results for them; they do little to narrow a search and also lower its efficiency, so they are usually removed to improve search performance."
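As a minimal illustration of the idea (again a toy example of mine, no package needed yet): once a sentence is tokenized, removing stop words is just filtering the token vector against a list.

toks <- c("the", "drop", "in", "life", "expectancy", "is", "historic")
stops <- c("the", "in", "is") # a tiny hand-made stop word list
toks[!toks %in% stops] # keeps "drop" "life" "expectancy" "historic"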

Enough talk; let's try a simple hands-on~ We will start with the easier case: English tokenization.

English Tokenization

  1. Open RStudio. Since we will be tokenizing, we need to install an extra package. Type install.packages("tokenizers") into the RStudio console, or, as the screenshot below shows, click Packages -> Install -> type tokenizers into the Packages field -> click Install.

(Screenshot: installing the tokenizers package via the RStudio Packages pane)

  2. Once it is installed, we need to load the package so R knows about it, using library().

library(tokenizers)

  3. Now we can tokenize. Here is a short English passage as a demo; the text comes from CNN News (https://edition.cnn.com/2022/08/31/health/life-expectancy-declines-2021/index.html).

sents = c("After a historic drop in 2020, life expectancy in the United States took another significant hit in 2021. According to provisional data published Wednesday by the US Centers for Disease Control and Prevention, life expectancy at birth dropped by nearly a year between 2020 and 2021 -- and by more than two and a half years overall since the start of the Covid-19 pandemic. Life expectancy at birth fell to 76.1 years, the lowest it has been in the US since 1996, and the biggest 2-year decline in a century.")

tokenize_words(sents)

Output:
[1] "after" "a" "historic" "drop" "in"
[6] "2020" "life" "expectancy" "in" "the"
[11] "united" "states" "took" "another" "significant"
[16] "hit" "in" "2021" "according" "to"
[21] "provisional" "data" "published" "wednesday" "by"
[26] "the" "us" "centers" "for" "disease"
[31] "control" "and" "prevention" "life" "expectancy"
[36] "at" "birth" "dropped" "by" "nearly"
[41] "a" "year" "between" "2020" "and"
[46] "2021" "and" "by" "more" "than"
[51] "two" "and" "a" "half" "years"
[56] "overall" "since" "the" "start" "of"
[61] "the" "covid" "19" "pandemic" "life"
[66] "expectancy" "at" "birth" "fell" "to"
[71] "76.1" "years" "the" "lowest" "it"
[76] "has" "been" "in" "the" "us"
[81] "since" "1996" "and" "the" "biggest"
[86] "2" "year" "decline" "in" "a"
[91] "century"

Notice that, besides the punctuation disappearing, all the words have been lowercased. Also, words that often serve as stop words, such as 'a', 'to', 'the', are still kept. We can therefore define our own stop words to deal with this.


tokenize_words(sents, stopwords=c('a', 'to', 'the', 'in', 'at', 'and', 'of'))

Output:
[1] "after" "historic" "drop" "2020" "life"
[6] "expectancy" "united" "states" "took" "another"
[11] "significant" "hit" "2021" "according" "provisional"
[16] "data" "published" "wednesday" "by" "us"
[21] "centers" "for" "disease" "control" "prevention"
[26] "life" "expectancy" "birth" "dropped" "by"
[31] "nearly" "year" "between" "2020" "2021"
[36] "by" "more" "than" "two" "half"
[41] "years" "overall" "since" "start" "covid"
[46] "19" "pandemic" "life" "expectancy" "birth"
[51] "fell" "76.1" "years" "lowest" "it"
[56] "has" "been" "us" "since" "1996"
[61] "biggest" "2" "year" "decline" "century"

As you can see, the stop words we defined have been removed successfully~

Besides tokenize_words(), you can also use tokenize_ptb(), a tokenizer based on the Penn Treebank.


tokenize_ptb(sents)

Output:
[1] "After" "a" "historic" "drop" "in"
[6] "2020" "," "life" "expectancy" "in"
[11] "the" "United" "States" "took" "another"
[16] "significant" "hit" "in" "2021." "According"
[21] "to" "provisional" "data" "published" "Wednesday"
[26] "by" "the" "US" "Centers" "for"
[31] "Disease" "Control" "and" "Prevention" ","
[36] "life" "expectancy" "at" "birth" "dropped"
[41] "by" "nearly" "a" "year" "between"
[46] "2020" "and" "2021" "--" "and"
[51] "by" "more" "than" "two" "and"
[56] "a" "half" "years" "overall" "since"
[61] "the" "start" "of" "the" "Covid-19"
[66] "pandemic." "Life" "expectancy" "at" "birth"
[71] "fell" "to" "76.1" "years" ","
[76] "the" "lowest" "it" "has" "been"
[81] "in" "the" "US" "since" "1996"
[86] "," "and" "the" "biggest" "2-year"
[91] "decline" "in" "a" "century" "."

Looking at this output, you can see that this tokenizer keeps the punctuation and the original casing, so you can choose whichever tokenizer fits your own needs.
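As an aside, if I remember the tokenizers documentation correctly, tokenize_words() itself also exposes arguments that control this behaviour, so you may not even need to switch tokenizers. Treat the argument names below as something to verify with ?tokenize_words.

# Sketch: keep the original casing and punctuation with tokenize_words()
# (lowercase and strip_punct are my recollection of the argument names)
tokenize_words(sents, lowercase = FALSE, strip_punct = FALSE)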

You can also tokenize directly using the stop word list predefined in the tm package~


library(tm)
tokenize_words(sents, stopwords=stopwords())

Output:
[1] "historic" "drop" "2020" "life" "expectancy"
[6] "united" "states" "took" "another" "significant"
[11] "hit" "2021" "according" "provisional" "data"
[16] "published" "wednesday" "us" "centers" "disease"
[21] "control" "prevention" "life" "expectancy" "birth"
[26] "dropped" "nearly" "year" "2020" "2021"
[31] "two" "half" "years" "overall" "since"
[36] "start" "covid" "19" "pandemic" "life"
[41] "expectancy" "birth" "fell" "76.1" "years"
[46] "lowest" "us" "since" "1996" "biggest"
[51] "2" "year" "decline" "century"

That is it for English; next up is Chinese tokenization. This post uses the Jieba segmentation system.

Chinese Tokenization

  1. Install and load jiebaR, then set up the segmenter
install.packages("jiebaR") # install once if you do not have it yet
library(jiebaR)
cutter <- worker() # set up the segmenter
  2. Use a short Chinese passage as a demo. The Chinese text comes from (https://tw.news.yahoo.com/莫德納次世代疫苗是什麼-副作用有哪些-誰要接種-完整解析-看這-040000853.html)

text = '食藥署表示,經整體評估其有效性及安全性,並考量國內緊急公共衛生需求,同意核准莫德納雙價疫苗可適用於18歲以上成人主動免疫之追加接種,其用法用量為在國內已授權的COVID-19疫苗之基礎接種或追加劑後,間隔至少3個月施打。未來食藥署將持續監控國內外接種COVID-19疫苗的安全警訊,分析評估疫苗不良事件通報資料,執行安全監視機制,保障民眾接種疫苗之安全。'

cutter[text] 

Output:
[1] "食藥署" "表示" "經" "整體" "評估" "其" "有效性"
[8] "及" "安全性" "並" "考量" "國內" "緊急" "公共衛生"
[15] "需求" "同意" "核准" "莫" "德納" "雙價" "疫苗"
[22] "可" "適用於" "18" "歲" "以上" "成人" "主動免疫"
[29] "之" "追加" "接種" "其" "用法" "用量" "為"
[36] "在" "國內" "已" "授權" "的" "COVID" "19"
[43] "疫苗" "之" "基礎" "接種" "或" "追加" "劑"
[50] "後" "間隔" "至少" "3" "個" "月" "施"
[57] "打" "未來" "食藥" "署將" "持續" "監控" "國內外"
[64] "接種" "COVID" "19" "疫苗" "的" "安全" "警訊"
[71] "分析" "評估" "疫苗" "不良" "事件" "通報" "資料"
[78] "執行" "安全" "監視" "機制" "保障" "民眾" "接種"
[85] "疫苗" "之" "安全"

Here we can see that 「食藥署」 was not always segmented correctly: near the end of the output it was split into 食藥 / 署將. We can set up our own dictionary to help the segmenter.


new_words = c("食藥署") # define the new word
writeLines(new_words, '/Users/biaoyun/Documents/Ithome/new_words.txt') # write it out to a file
cutter = worker(user='/Users/biaoyun/Documents/Ithome/new_words.txt')
# use the newly defined dictionary for the cutter

cutter[text]

Output:
[1] "食藥署" "表示" "經" "整體" "評估" "其" "有效性"
[8] "及" "安全性" "並" "考量" "國內" "緊急" "公共衛生"
[15] "需求" "同意" "核准" "莫" "德納" "雙價" "疫苗"
[22] "可" "適用於" "18" "歲" "以上" "成人" "主動免疫"
[29] "之" "追加" "接種" "其" "用法" "用量" "為"
[36] "在" "國內" "已" "授權" "的" "COVID" "19"
[43] "疫苗" "之" "基礎" "接種" "或" "追加" "劑"
[50] "後" "間隔" "至少" "3" "個" "月" "施"
[57] "打" "未來" "食藥署" "將" "持續" "監控" "國內外"
[64] "接種" "COVID" "19" "疫苗" "的" "安全" "警訊"
[71] "分析" "評估" "疫苗" "不良" "事件" "通報" "資料"
[78] "執行" "安全" "監視" "機制" "保障" "民眾" "接種"
[85] "疫苗" "之" "安全"

Now every 「食藥署」 is segmented correctly~ Easy, isn't it? That is it for today; more tomorrow!
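P.S. If I remember the jiebaR API correctly, you can also add user words to a running worker at run time, without writing a dictionary file, via new_user_word(). Treat the following as an unverified sketch and check ?new_user_word before relying on it.

cutter2 <- worker()
new_user_word(cutter2, "食藥署") # add the word at run time (assumed API)
cutter2[text]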


Previous post
Day 3 Corpus Pre-processing Explained + Hands-on
Next post
Day 5 Introduction to Data Types (Part 1)
Series
Linguistics and NLP (30 articles)