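A word cloud, as the name suggests, is a large number of words gathered into a cloud-like shape. You have probably seen this kind of visualization before. So what is it good for? A word cloud lets readers grasp the main topics of a large batch of articles quickly, without reading every one of them. At the end of each year we also often see reviews of the year's trends, topics, and major events. A common approach is to run a keyword analysis and present the result as a word cloud, so that every theme is clear at a glance.
In NLP tasks, especially research related to text analysis and sentiment analysis, word clouds are also frequently used to present results. The goal is the same: to let readers quickly understand the content, or see which key terms appear in a particular text. Plenty of websites can generate word clouds these days, but doing it in code is just as simple and fast, so let's build one together!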
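Making a word cloud boils down to the following steps:
Tokenization: Remember from earlier in this series that a computer cannot work on raw sentences directly? So we first split the text into units it can handle. For Chinese, make sure proper nouns, personal names, and the like are segmented correctly!
Build a frequency table: Since this is a word cloud, we do not put every word in; only words that are both frequent and meaningful belong there. So we build a frequency table and pick the most frequent entries (I usually keep the top 20).
Remove stop words: The previous step asked for words that are frequent and meaningful. As introduced earlier, stop words are frequent but never "key" words, so we must remove them to keep them from distorting the final result. (This can happen before or after counting; the code below removes them while tokenizing.)
Draw the word cloud from the frequency table!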
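Before turning to the packages, the four steps above can be sketched in base R alone. This is only a minimal illustration, not the code used later in this post: the toy sentence, the two-word stop list, and the variable names are all made up for the example.

```r
# Toy text; any character string works here (made up for illustration).
text <- "The queen met the president. The president thanked the queen and the people."

# 1. Tokenize: lowercase, then split on anything that is not a letter.
tokens <- unlist(strsplit(tolower(text), "[^a-z]+"))
tokens <- tokens[tokens != ""]

# 2. Build the frequency table, most frequent first.
freq <- sort(table(tokens), decreasing = TRUE)

# 3. Drop stop words (a tiny hand-written list, for illustration only).
stop_words <- c("the", "and")
freq <- freq[!names(freq) %in% stop_words]

# 4. Keep the top entries; these are what a word cloud function would draw.
top_words <- head(freq, 20)
print(top_words)
```

Without step 3, "the" would be the largest word in the cloud even though it says nothing about the content, which is why the stop word list matters.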
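With the steps covered, it's time to try them out!
I'll make one word cloud for English and one for Chinese. Each text is a news article picked at random.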
News source (https://edition.cnn.com/2022/09/08/opinions/duty-reign-queen-elizabeth-peter-bergen/index.html)
library(tm) # provides stopwords("english")
library(tokenizers)
library(wordcloud) # first-generation word cloud
library(wordcloud2) # second-generation word cloud
library(RColorBrewer)
news_text = c("Duty is a rather old-fashioned concept today in a world rife with public figures who hunger only for power to be achieved by any means available.But duty is the one word to best summarize the reign of Queen Elizabeth II, who died on Thursday at 96. The Queen selflessly gave of herself. Hers was a role that is ceremonial, but it is also deeply embedded in the oldest constitutional monarchy in the world and in a country that has given the world so many of the concepts and policies that we associate with democracy.Seven years after the end of World War II, the Queen, aged only 25, ascended to the British throne. Harry Truman was the President of the United States, and Winston Churchill was Prime Minister of the United Kingdom.Since then, the Queen reigned for 13 additional US presidencies: Dwight Eisenhower, John F. Kennedy, Lyndon Johnson, Richard Nixon, Gerald Ford, Jimmy Carter, Ronald Reagan, George H.W. Bush, Bill Clinton, George W. Bush, Barack Obama, Donald Trump and, now, Joe Biden.In many ways the Queen symbolized the special relationship between the United Kingdom and the United States. A rite of passage for almost every one of the 14 US presidents since she took the throne was her hosting a state visit for the president in the UK, or her attending a formal state dinner put on by the president in Washington, DC. Most recently she met with President Joe Biden in June at Windsor Castle. According to Robert Hardman, the dean of royal biographers, she was particularly close to Reagan who she found to be the most charming. They shared a love of the outdoors and of horses. It was a friendship that went on long after Reagan had stepped down as president, Hardman reported in his 2018 book Queen of the World. The Queen and Obama also enjoyed a close relationship, according to Hardman. She had an extraordinary run; most British subjects can only remember one monarch. 
During her long reign, the Queen presided over the dissolution of great swaths of the British Empire, continuing a process that began under her father's reign. She also officially installed three women as her prime ministers -- Margaret Thatcher, Theresa May and, just on Tuesday, Liz Truss, who met with the Queen for her formal investiture as prime minister at Balmoral Castle in Scotland.
As Queen, she performed an astonishing 21,000 engagements and was patron of hundreds of organizations, including those dedicated to education and training, sports and recreation, faith, arts and culture, according to statistics released by the Royal Household in May when Britain celebrated the Queen's 70 years on the throne.
The contrast is striking between who the Queen was and the former British Prime Minister Boris Johnson, who stepped down on Tuesday after being forced out of office. Johnson is a serial liar about matters both large and small, who attended private parties at his official residence at Downing Street during a rigorous Covid-19 lockdown that he himself had authorized. He later apologized.") # the news article
news_text_clean = gsub("[[:punct:]]+", "", news_text) # remove punctuation
news_text_clean = gsub("[[:digit:]]+", "", news_text_clean) # remove digits
news_tokenized = tokenize_words(news_text_clean, lowercase = FALSE, stopwords = c(stopwords("english"), "The", "A", "one", "also", "As", "according")) # tokenize; case is kept, so capitalized stop words such as "The" are listed explicitly
freq = sort(table(unlist(news_tokenized)), decreasing = TRUE) # frequency table, most frequent first
text_freq = data.frame(word = names(freq), freq = as.vector(freq)) # convert the frequency table to a data frame
customed_colors = c("#000080", "#ffff00", "#6495ed", "#00bfff", "#87cefa", "#db7093", "#ba55d3", "#b22222", "#008080", "#ff8c00", "#6b8e23") # word cloud color palette
wordcloud(text_freq$word, text_freq$freq, min.freq = 2, random.order = FALSE, ordered.colors = FALSE, colors = customed_colors) # first-generation word cloud; min.freq is the minimum number of occurrences a word needs to be drawn
wordcloud2(text_freq, size = 1.7, color = customed_colors, backgroundColor = "white")
# second-generation word cloud; size scales the font, so a larger value draws bigger words
The results:
First-generation word cloud
Second-generation word cloud (it is actually an HTML file, so you can interact with it)
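Because the second-generation cloud is an HTML widget, one way to keep it is to write it out as an HTML file. The sketch below assumes the `htmlwidgets` package (which `wordcloud2` builds on) is installed; the four-word frequency table and the temporary output path are made up for the example and do not come from the article above.

```r
library(wordcloud2)
library(htmlwidgets)

# Hypothetical frequency table, for illustration only.
toy_freq <- data.frame(
  word = c("queen", "reign", "duty", "throne"),
  freq = c(12, 5, 4, 3)
)

# Build the widget; nothing is drawn until it is printed or saved.
cloud <- wordcloud2(toy_freq, size = 1)

# Write the widget to an HTML file in a temporary directory.
# selfcontained = FALSE puts the JavaScript in a folder next to the file,
# which avoids the pandoc dependency of a single self-contained file.
out_file <- file.path(tempdir(), "wordcloud.html")
saveWidget(cloud, out_file, selfcontained = FALSE)
```

Opening the saved file in a browser should show the same interactive cloud that appears in the RStudio viewer.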
News source (https://udn.com/news/story/123021/6599997?from=udn-cardnews)
library(jiebaR)
library(wordcloud) # first-generation word cloud
library(wordcloud2) # second-generation word cloud
library(RColorBrewer)
ch_news = c("英國白金漢宮宣布,英國女王伊麗莎白二世八日(英國時間)駕崩,享耆壽九十六歲。伊麗莎白二世在位七十年,六日剛任命她的第十五位首相特拉斯。她是當今世上在位最久的君主,也早已成為英國象徵。
任命特拉斯 最後公開露面
女王晚年行動不便,六日在蘇格蘭巴摩拉城堡接見並任命新首相特拉斯,是在位七十年來首度沒有在白金漢宮辦理新舊任首相交接。從官方照片可以看到,女王和特拉斯見面時,雖然面帶笑容,但看起來很虛弱,拿著拐杖支撐,她在象徵性典禮中與特拉斯握手,任命特拉斯領導新政府。
這是女王最後一次公開露面。她隨後沒有主持國協運動會開幕式,並推遲與樞密院顧問團的會議。八日一早白金漢宮發表聲明,稱伊麗莎白二世身體欠安,在御醫團隊建議下接受「醫療監看」。隨後傳出王儲查理王子夫婦及王位第二順位繼承人女王長孫威廉、女王最疼愛的次子安德魯王子相繼趕往蘇格蘭巴摩拉城堡消息。就連與王室漸行漸遠但人在英國的女王次孫哈利,得知消息後也趕往巴摩拉城堡。
菲立普親王過世 形單影隻
伊麗莎白二世去年以來一直受行走和站立時的問題所苦,不得不取消一連串公眾活動。今年二月她曾感染新冠病毒,雖然康復,但她說染疫讓她「筋疲力竭」。
統治英國七十年的伊麗莎白二世是英國史上在位最久的君主,親見十五位首相來來去去。她登基時,英國首相是邱吉爾。
今年六月,英國為女王登基七十年舉辦白金禧慶典,當時她雖在白金漢宮陽台露面,但缺席其他活動。她的夫婿菲立普親王於去年四月九日逝世,享耆壽九十九歲。女王夫婦結縭七十三年,是英國王室中維持最久的婚姻。菲立普親王過世後,她在葬禮上顯得形單影隻,心情不免深受影響。
在位70年 外訪經歷豐富
伊麗莎白二世一九五三年六月二日在西敏寺大教堂加冕,時年廿七歲。這也是英國王室加冕典禮首度在電視播出,觀眾包括兩千萬英國人和近一億北美居民。
伊麗莎白二世即位後,面臨英國戰後國力衰退和大英帝國逐漸解體。英國殖民地在一九五○年代至一九六○年代紛紛獨立,女王嘗試組織大英國協維繫關係,以大英國協元首身分訪問澳洲、紐西蘭和加拿大等國。
女王曾在一九七六年前往美國,出席美國從英國獨立兩百周年紀念,她任內共接見十三位美國總統。雖然女王外訪經歷豐富,但對於英國政治,她始終恪守中立原則。
女王夫婦子孫昌茂,育有王儲查理等三男一女,孫子女包括威廉和哈利王子等,曾孫子女則有喬治王子、夏綠蒂公主、路易王子和哈利王子的兒子亞契等。查理繼位後,威廉王子將成為王儲,兒子喬治王子則是王位第二順位繼承人。") # load the news article
new_terms = c("巴摩拉城堡", "菲立普親王", "染疫", "伊麗莎白二世")
writeLines(new_terms, '/Users/biaoyun/Documents/Ithome/ch_terms.txt') # custom dictionary
stopwords = c( "在","的","上", "下","是", "個","來","為","亦","或", "之", "與", "於", "用", "都", "等", "日", "月", "年", "週", "嗎", "以", "就", "但", "及", "也", "了", "要", "不", "會", "和", "對", "著", "後", "她", "他")
writeLines(stopwords, '/Users/biaoyun/Documents/Ithome/ch_stopwords.txt') # custom stop words
cutter = worker(user = '/Users/biaoyun/Documents/Ithome/ch_terms.txt', stop_word = '/Users/biaoyun/Documents/Ithome/ch_stopwords.txt') # load the custom dictionary and stop words
ch_news <- gsub("[0-9a-zA-Z]+", "", ch_news) # remove digits and Latin letters
ch_news <- cutter[ch_news] # segment the text into words
freq_ch <- sort(table(ch_news), decreasing = TRUE)
freq_ch = as.data.frame(freq_ch, stringsAsFactors = FALSE)
colnames(freq_ch) <- c("Words", "Freq") # frequency table
head(freq_ch, 10) # show the first 10 rows
par(family = "DFHsiu-W3-WINP-BF") # set a font that can render Chinese (macOS)
customed_colors = c("#000080", "#6495ed", "#00bfff", "#87cefa", "#db7093", "#ba55d3", "#b22222", "#ff8c00", "#6b8e23") # colors
wordcloud(freq_ch$Words, freq_ch$Freq, min.freq = 2, random.order = FALSE, ordered.colors = FALSE, colors = customed_colors) # first-generation word cloud; wordcloud() draws directly and returns NULL, so there is nothing to assign
ch_wordcloud2 = wordcloud2(freq_ch, size = 1.3, color = customed_colors, backgroundColor = "white"); ch_wordcloud2
# second-generation word cloud
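Both examples above filter words with `min.freq`. If you would rather keep a fixed number of words, such as the top 20 mentioned in the steps, one option is to trim the frequency table before plotting. The sketch below uses a made-up table of 30 placeholder words (`word01` to `word30`); the name `top_n` is hypothetical as well. The drawing call is guarded so the data step still runs when the `wordcloud` package is not installed.

```r
# Hypothetical frequency table with 30 placeholder words (word01 ... word30).
toy_freq <- data.frame(
  word = sprintf("word%02d", 1:30),
  freq = 30:1,
  stringsAsFactors = FALSE
)

# Sort by frequency and keep the 20 most frequent rows.
top_n <- head(toy_freq[order(toy_freq$freq, decreasing = TRUE), ], 20)

# Draw only if the package is available. max.words caps the number of words
# drawn, so it would also work on the untrimmed table.
if (requireNamespace("wordcloud", quietly = TRUE)) {
  wordcloud::wordcloud(top_n$word, top_n$freq, min.freq = 1,
                       max.words = 20, random.order = FALSE)
}
```

The trimmed data frame can be passed to `wordcloud2()` in the same way, since it takes the words and frequencies from the first two columns.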
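You have probably noticed by now, right? Yes, I wrote this post on the day (Taiwan time) the news of the Queen's death broke, so both articles are about her. Let's remember the Queen together ><
Word clouds have plenty of other small settings to play with, so go explore them on your own. See you tomorrow!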