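A word cloud, as the name suggests, gathers a large number of words into a cloud-like shape. Most readers will have seen this kind of visualization before. So what is it good for? A word cloud lets readers quickly grasp and focus on the main topics of a large batch of articles without reading them all. At the end of each year we also often see retrospectives of the year's trends, topics, and major events. One common technique there is keyword analysis presented as a word cloud, which makes all the themes clear at a glance.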
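In NLP tasks, especially research related to text analysis and sentiment analysis, word clouds are also frequently used to present results. The goal is again to let readers quickly understand the content, or to see the key terms that appear in particular texts. Although many websites can now generate word clouds, they are also quick and easy to produce with code. Let's try it out!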
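Build a word frequency table: since this is a word cloud, we do not put every word into it; the words that go in should be both frequent and important. So we first build a frequency table and then pick the most frequent words (I usually keep the top 20).
Remove stop words: the previous point called for words that are both frequent and important. As covered previously, stop words are frequent but are never "key" words, so we must remove them to keep them from skewing the final result.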
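The two preparation steps above can be sketched in base R. This is only a minimal illustration: the `tokens` vector, the two-word `stop_words` list, and the cutoff `n_keep` are made-up values, not data from the article.

```r
# Toy tokens standing in for the output of a tokenizer (hypothetical data)
tokens <- c("the", "queen", "reign", "the", "queen", "of", "duty",
            "queen", "the", "reign", "of", "throne")

# A tiny illustrative stop word list; a real list would be much longer
stop_words <- c("the", "of")

# Step 1: drop stop words so frequent but uninformative words do not dominate
kept <- tokens[!tokens %in% stop_words]

# Step 2: build the frequency table, most frequent first
freq <- sort(table(kept), decreasing = TRUE)
freq_df <- data.frame(word = names(freq), freq = as.vector(freq))

# Step 3: keep only the top rows (the text above suggests about 20;
# 2 is used here because the toy data is so small)
n_keep <- 2
top_words <- head(freq_df, n_keep)
print(top_words)
```

Filtering before counting keeps the stop words out of the ranking entirely, so the cutoff applies only to candidate keywords.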
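library(wordcloud) # first-generation word cloud
library(wordcloud2) # second-generation word cloud
library(RColorBrewer)
library(tokenizers) # provides tokenize_words()
library(tm) # assumption: source of stopwords("english"); the original does not say which package supplied it
library(jiebaR) # provides worker() for Chinese word segmentation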
news_text = c("Duty is a rather old-fashioned concept today in a world rife with public figures who hunger only for power to be achieved by any means available.But duty is the one word to best summarize the reign of Queen Elizabeth II, who died on Thursday at 96. The Queen selflessly gave of herself. Hers was a role that is ceremonial, but it is also deeply embedded in the oldest constitutional monarchy in the world and in a country that has given the world so many of the concepts and policies that we associate with democracy.Seven years after the end of World War II, the Queen, aged only 25, ascended to the British throne. Harry Truman was the President of the United States, and Winston Churchill was Prime Minister of the United Kingdom.Since then, the Queen reigned for 13 additional US presidencies: Dwight Eisenhower, John F. Kennedy, Lyndon Johnson, Richard Nixon, Gerald Ford, Jimmy Carter, Ronald Reagan, George H.W. Bush, Bill Clinton, George W. Bush, Barack Obama, Donald Trump and, now, Joe Biden.In many ways the Queen symbolized the special relationship between the United Kingdom and the United States. A rite of passage for almost every one of the 14 US presidents since she took the throne was her hosting a state visit for the president in the UK, or her attending a formal state dinner put on by the president in Washington, DC. Most recently she met with President Joe Biden in June at Windsor Castle. According to Robert Hardman, the dean of royal biographers, she was particularly close to Reagan who she found to be the most charming. They shared a love of the outdoors and of horses. It was a friendship that went on long after Reagan had stepped down as president, Hardman reported in his 2018 book Queen of the World. The Queen and Obama also enjoyed a close relationship, according to Hardman. She had an extraordinary run; most British subjects can only remember one monarch. 
During her long reign, the Queen presided over the dissolution of great swaths of the British Empire, continuing a process that began under her father's reign. She also officially installed three women as her prime ministers -- Margaret Thatcher, Theresa May and, just on Tuesday, Liz Truss, who met with the Queen for her formal investiture as prime minister at Balmoral Castle in Scotland.
As Queen, she performed an astonishing 21,000 engagements and was patron of hundreds of organizations, including those dedicated to education and training, sports and recreation, faith, arts and culture, according to statistics released by the Royal Household in May when Britain celebrated the Queen's 70 years on the throne.
The contrast is striking between who the Queen was and the former British Prime Minister Boris Johnson, who stepped down on Tuesday after being forced out of office. Johnson is a serial liar about matters both large and small, who attended private parties at his official residence at Downing Street during a rigorous Covid-19 lockdown that he himself had authorized. He later apologized.") # news article
news_text_clean = gsub("[[:punct:]]+", "", news_text) # remove punctuation
news_text_clean = gsub("[[:digit:]]+", "", news_text_clean) # remove digits
news_tokenized = tokenize_words(news_text_clean, lowercase = FALSE, stopwords = c(stopwords("english"), "The", "A", "one", "also", "As", "according")) # tokenize, keeping the original case
freq = sort(table(unlist(news_tokenized)), decreasing = TRUE) # word frequency table, most frequent first
text_freq = data.frame(word = names(freq), freq = as.vector(freq)) # convert the frequency table to a data frame
customed_colors = c("#000080", "#ffff00", "#6495ed", "#00bfff", "#87cefa", "#db7093", "#ba55d3", "#b22222", "#008080", "#ff8c00", "#6b8e23") # word cloud color palette
wordcloud(text_freq$word, text_freq$freq, min.freq = 2, random.order = FALSE, ordered.colors = FALSE, colors = customed_colors) # first-generation word cloud; min.freq is the minimum number of occurrences a word needs in order to appear in the cloud
wordcloud2(text_freq, size = 1.7, color = customed_colors, backgroundColor = "white")
# second-generation word cloud; size scales the font size, so a larger value draws bigger words
Second-generation word cloud (it is actually an HTML file, so you can interact with it)
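One caveat about the punctuation-removal step in the code above: the sample article contains sentences that run together without a space (for example "available.But"), so deleting punctuation outright fuses the two words into one token. The sketch below, on a made-up `sample_text`, compares deletion with replacing punctuation by a space. It is an optional variation, not what the original code does.

```r
sample_text <- "any means available.But duty is the one word"

# Deleting punctuation outright glues the neighbouring words together
glued <- gsub("[[:punct:]]+", "", sample_text)

# Replacing punctuation with a space keeps the word boundary;
# the second gsub() collapses any runs of spaces this creates
spaced <- gsub("[[:punct:]]+", " ", sample_text)
spaced <- gsub(" +", " ", spaced)

print(glued)
print(spaced)
```

The trade-off is that the space variant also splits possessives such as "Queen's" into "Queen" and "s", so the stray "s" would need to be added to the stop word list.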
ch_news = c("英國白金漢宮宣布,英國女王伊麗莎白二世八日(英國時間)駕崩,享耆壽九十六歲。伊麗莎白二世在位七十年,六日剛任命她的第十五位首相特拉斯。她是當今世上在位最久的君主,也早已成為英國象徵。
任命特拉斯 最後公開露面
菲立普親王過世 形單影隻
在位70年 外訪經歷豐富
女王夫婦子孫昌茂,育有王儲查理等三男一女,孫子女包括威廉和哈利王子等,曾孫子女則有喬治王子、夏綠蒂公主、路易王子和哈利王子的兒子亞契等。查理繼位後,威廉王子將成為王儲,兒子喬治王子則是王位第二順位繼承人。") # load the news article
new_terms = c("巴摩拉城堡", "菲立普親王", "染疫", "伊麗莎白二世")
writeLines(new_terms, '/Users/biaoyun/Documents/Ithome/ch_terms.txt') # custom dictionary
stopwords = c( "在","的","上", "下","是", "個","來","為","亦","或", "之", "與", "於", "用", "都", "等", "日", "月", "年", "週", "嗎", "以", "就", "但", "及", "也", "了", "要", "不", "會", "和", "對", "著", "後", "她", "他")
writeLines(stopwords, '/Users/biaoyun/Documents/Ithome/ch_stopwords.txt') # custom stop words
cutter = worker(user = '/Users/biaoyun/Documents/Ithome/ch_terms.txt', stop_word = '/Users/biaoyun/Documents/Ithome/ch_stopwords.txt') # load the custom dictionary and stop words
ch_news <- gsub("[0-9a-zA-Z]+", "", ch_news) # remove digits and Latin letters
ch_news <- cutter[ch_news] # segment the text into words
freq_ch <- sort(table(ch_news), decreasing = TRUE)
freq_ch = as.data.frame(freq_ch)
colnames(freq_ch) <- c("Words", "Freq") # word frequency table
head(freq_ch, 10) # show the first 10 rows
par(family = "DFHsiu-W3-WINP-BF") # set a font that can render Chinese (Mac)
customed_colors = c("#000080", "#6495ed", "#00bfff", "#87cefa", "#db7093", "#ba55d3", "#b22222", "#ff8c00", "#6b8e23") # colors
wordcloud(freq_ch$Words, freq_ch$Freq, min.freq = 2, random.order = FALSE, ordered.colors = FALSE, colors = customed_colors) # first-generation word cloud; wordcloud() draws the plot as a side effect, so its result is not assigned
ch_wordcloud2 = wordcloud2(freq_ch, size = 1.3, color = customed_colors, backgroundColor = "white"); ch_wordcloud2
# second-generation word cloud
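Because the second-generation cloud is an HTML widget, it can also be written to disk and opened in a browser. The sketch below is one possible way to do that, assuming the `htmlwidgets` package is installed (wordcloud2 is built on it). The `demo_freq` table and the output file name are made up for illustration.

```r
library(wordcloud2)
library(htmlwidgets) # assumption: available alongside wordcloud2

# Hypothetical frequency table used only for illustration
demo_freq <- data.frame(
  word = c("queen", "reign", "duty", "throne"),
  freq = c(3, 2, 1, 1)
)

widget <- wordcloud2(demo_freq, size = 1, backgroundColor = "white")

# Write the widget to an HTML file; selfcontained = FALSE keeps the
# JavaScript dependencies in a sibling folder instead of inlining them
out_file <- file.path(tempdir(), "wordcloud_demo.html")
saveWidget(widget, out_file, selfcontained = FALSE)
print(out_file)
```

Writing to `tempdir()` keeps the example from touching your project folder; point `out_file` at a real path to keep the page.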