How to Count Word Frequencies

2024 iThome Ironman Contest

DAY 26

AI/ ML & Data

From Python Beginner To AI Engineer series, part 27

16th Ironman Contest

Penut Chen

2024-10-10 09:17:13

Python has a very handy Counter class that helps us tally things, for example:

>>> from collections import Counter
>>>
>>> arr = [9, 8, 7, 8, 7]
>>> print(Counter(arr))
Counter({8: 2, 7: 2, 9: 1})

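The chapter on random numbers was the first time we used the import statement. The from collections import Counter here is another form of that statement: it imports the Counter class from the standard-library collections module. When all you need is access to Counter, the two forms are interchangeable: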

>>> import collections
>>>
>>> collections.Counter(arr)
Counter({8: 2, 7: 2, 9: 1})

But writing collections.Counter every time we need Counter is far too long! That is what the from ... import ... syntax is for: it lets us write more concise code.
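As a small check of the point above, the sketch below shows that both import forms give access to the same class object. The alias WordCounter is a made-up name used only for illustration; it does not appear in the original article.

```python
import collections
from collections import Counter
from collections import Counter as WordCounter  # WordCounter is a made-up alias

arr = [9, 8, 7, 8, 7]

# All three names are bound to the very same class object.
same_class = collections.Counter is Counter is WordCounter
print(same_class)  # True

# So the results are equal no matter which name is used.
print(collections.Counter(arr) == WordCounter(arr))  # True
```

Which form to pick is mostly a readability choice: the short name is convenient, while the full collections.Counter makes it obvious where the class comes from.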

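Next, let's look at what Counter actually did. It tallied the elements of the list arr: 8 appears 2 times, 7 appears 2 times, and 9 appears 1 time, so Counter gives us a result that resembles a dictionary (dict). In fact, Counter is a subclass of dict, and it is used in much the same way: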

>>> c = Counter(arr)
>>> print(c[8])
2
>>> print(c[7])
2
>>> print(c[9])
1
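To make the tally less magical, here is a minimal sketch of the same counting done by hand with a plain dict. It is only an illustration of the idea, not how Counter is implemented internally, and the name manual_counts is ours.

```python
from collections import Counter

arr = [9, 8, 7, 8, 7]

# Count by hand: start each unseen item at 0, then add 1 per occurrence.
manual_counts = {}
for item in arr:
    manual_counts[item] = manual_counts.get(item, 0) + 1

print(manual_counts)  # {9: 1, 8: 2, 7: 2}

# Converting the Counter to a plain dict gives the same key/value pairs.
print(manual_counts == dict(Counter(arr)))  # True
```

The manual version keeps the order in which items were first seen, while printing a Counter lists the entries from most to least common.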

Unlike a dict, however, accessing a key that does not exist returns the default value 0 instead of raising a KeyError, for example:

>>> print(c[6])  # arr does not contain 6
0

This result makes sense, since arr simply has no 6. If you want to add a value, you can operate on the Counter directly:

>>> c[6] = 2  # direct assignment works
>>> c[6] += 0.5  # floats are allowed too
>>> c
Counter({6: 2.5, 8: 2, 7: 2, 9: 1})
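The sketch below puts the two behaviours side by side, starting from a fresh Counter: a plain dict raises KeyError on a missing key, while Counter returns 0 without storing the key. It also uses the update() method, which the article does not cover, as another way to add counts from an iterable. The variable names are ours.

```python
from collections import Counter

arr = [9, 8, 7, 8, 7]
plain = {9: 1, 8: 2, 7: 2}  # an ordinary dict with the same counts
c = Counter(arr)

# A plain dict raises KeyError for a key it does not have.
try:
    plain[6]
    dict_raised = False
except KeyError:
    dict_raised = True
print(dict_raised)  # True

# Counter returns 0 instead, and merely reading does not insert the key.
missing = c[6]
still_absent = 6 not in c
print(missing, still_absent)  # 0 True

# update() adds one count per element of the iterable it receives.
c.update([6, 6, 9])
print(c[6], c[9])  # 2 2
```

Direct assignment overwrites a count, whereas update() adds to whatever is already there, so the two are not interchangeable.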

Counter also works on strings, which makes it handy for counting characters:

>>> s = "庭院深深深幾許"
>>> print(Counter(s))
Counter({'深': 3, '庭': 1, '院': 1, '幾': 1, '許': 1})

We have not yet learned how to segment Chinese text into words, but we can simply split an English article into words on spaces:

>>> article = "Alan Mathison Turing OBE FRS was an English mathematician, computer scientist, logician, cryptanalyst, philosopher and theoretical biologist. He was highly influential in the development of theoretical computer science, providing a formalisation of the concepts of algorithm and computation with the Turing machine, which can be considered a model of a general-purpose computer. Turing is widely considered to be the father of theoretical computer science."
>>>
>>> c = Counter(article.split(" "))
>>> c
Counter({'of': 5,
         'the': 4,
         'Turing': 3,
...
         'to': 1,
         'father': 1,
         'science.': 1})

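article is the introduction from Turing's English Wikipedia entry. We split the text on spaces and tally the pieces with the Counter class. Because the passage contains many distinct words, the result is long. Note also that splitting on spaces leaves punctuation attached, so 'science.' is counted as a token of its own. If you do not want to see everything at once, the .most_common() method shows only the words with the highest counts: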

>>> c.most_common(5)
[('of', 5), ('the', 4), ('Turing', 3), ('computer', 3), ('theoretical', 3)]

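As you can see, the most frequent words in this short passage are the preposition of and the article the, followed by Turing's name, which ties with computer and theoretical at three occurrences each. With code this simple, we can already learn some basic properties of a piece of text. .most_common() returns a list of (word, count) tuples, so you can iterate over the result like this: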

>>> for word, count in c.most_common(5):
...     print(f"{word} appears {count} times")
...
of appears 5 times
the appears 4 times
Turing appears 3 times
computer appears 3 times
theoretical appears 3 times
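Because splitting on spaces keeps punctuation attached to words, 'computer.' and 'computer' end up as different keys. One possible cleanup, sketched below under our own assumptions, is to strip punctuation from both ends of each token and lowercase it before counting. The helper normalize and the variable sample (two fragments taken from the article text) are hypothetical names, not part of the original.

```python
import string
from collections import Counter

# Two fragments from the article, short enough to check by eye.
sample = (
    "which can be considered a model of a general-purpose computer. "
    "Turing is widely considered to be the father of theoretical computer science."
)

def normalize(token):
    """Strip punctuation from both ends of a token and lowercase it."""
    return token.strip(string.punctuation).lower()

raw = Counter(sample.split())
cleaned = Counter(normalize(t) for t in sample.split() if normalize(t))

print(raw["computer"], raw["computer."])  # 1 1
print(cleaned["computer"])                # 2
print(cleaned["general-purpose"])         # 1 (inner hyphen is kept)
```

Lowercasing also turns Turing into turing, which merges capitalized and lowercase spellings but loses the hint that a word is a proper noun, so whether to do it depends on the analysis.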

Being familiar with the Counter class is very helpful in artificial intelligence, a field that constantly calls for statistical analysis.