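TF-IDF is a formula for measuring how important a word is to a document.
The TF-IDF value is proportional to the number of times the word appears in the document and inversely proportional to how often the word appears across the corpus.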
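TF (term frequency): how often a term occurs in a document.
$$\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$$
$n_{i,j}$: number of times the term $t_i$ occurs in document $d_j$
$\sum_k n_{k,j}$: total number of words in document $d_j$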
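IDF (inverse document frequency): how informative a term is across the corpus; a term that occurs in fewer documents gets a higher value.
$$\mathrm{idf}_i = \log\frac{|D|}{1 + |\{j : t_i \in d_j\}|}$$
$|D|$: total number of documents in the corpus
$|\{j : t_i \in d_j\}|$: number of documents that contain the term $t_i$
$1$: added to the denominator so that it can never be 0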
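Suppose a document contains 100 words in total and the term 「大角怪」 appears in it 5 times,
and suppose 「大角怪」 appears in 1,000 documents out of a corpus of 10,000,000 documents.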
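The arithmetic for this example can be checked with a short sketch. It assumes a base-10 logarithm (the notes do not state the base) and the 1 + document-count denominator from the IDF definition; the variable names are illustrative only.

```python
from math import log10

# Figures taken from the example
term_count = 5            # occurrences of the term in the document
doc_length = 100          # total words in the document
docs_with_term = 1_000    # documents that contain the term
total_docs = 10_000_000   # documents in the corpus

# TF: share of the document's words that are the term
tf_value = term_count / doc_length

# IDF: base-10 log is an assumption; the +1 keeps the denominator above zero
idf_value = log10(total_docs / (1 + docs_with_term))

# TF-IDF is the product of the two
tf_idf_value = tf_value * idf_value

print('TF     = %.4f' % tf_value)
print('IDF    = %.4f' % idf_value)
print('TF-IDF = %.4f' % tf_idf_value)
```

Under these assumptions TF is 0.05 and IDF is just under 4, so the TF-IDF comes out at roughly 0.2.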
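from math import log


def tf(term, doc, normalize=True):
    # Lowercase the document and split it into words on whitespace
    doc = doc.lower().split()
    if normalize:
        # Relative frequency: occurrences divided by document length
        result = doc.count(term.lower()) / float(len(doc))
    else:
        # Raw number of occurrences
        result = doc.count(term.lower()) / 1
    return result


def idf(term, docs):
    # Number of documents that contain the term
    num_text_with_term = len(
        [True for doc in docs if term.lower() in doc.lower().split()])
    # This variant adds 1 to the logarithm rather than to the denominator,
    # and returns 1.0 when no document contains the term
    try:
        return 1.0 + log(len(docs) / num_text_with_term)
    except ZeroDivisionError:
        return 1.0


def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)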
corpus = {
    'a': 'Mr. Green killed Colonel Mustard in the study with the candlestick. Mr. Green is not a very nice fellow.',
    'b': 'Professor Plumb has a green plant in his study ',
    'c': "Miss Scarlett watered Professor Plumb's green plant while he was away from his office last week.",
}
## lower() => converts a string to lowercase
## split() => splits a string into words on whitespace
QUERY_TERMS = ['green']
for term in [t.lower() for t in QUERY_TERMS]:
    for doc in sorted(corpus):
        print('TF(%s): %s' % (doc, term), tf(term, corpus[doc]))
    print('IDF: %s' % (term, ), idf(term, corpus.values()), "\n")
    for doc in sorted(corpus):
        score = tf_idf(term, corpus[doc], corpus.values())
        print('TF-IDF(%s): %s' % (doc, term), score, "\n")
# Add up the tf*idf values of each query term to get a document's total score
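Summing tf*idf over several query terms can be sketched as follows. This is a minimal illustration, not the notes' own code: the three-document `docs` dictionary, the `query_terms` list, and the `query_scores` accumulator are hypothetical, and `tf` and `idf` are compact restatements of the functions defined earlier (whitespace tokenization, 1 + log(N / n) smoothing).

```python
from math import log


def tf(term, doc):
    # Term count divided by document length (lowercased whitespace tokens)
    words = doc.lower().split()
    return words.count(term.lower()) / len(words)


def idf(term, docs):
    # 1 + log(N / n), or 1.0 when no document contains the term
    n = sum(1 for doc in docs if term.lower() in doc.lower().split())
    return 1.0 + log(len(docs) / n) if n else 1.0


# Hypothetical corpus, for illustration only
docs = {
    'a': 'the green plant sits in the study',
    'b': 'the professor left the study',
    'c': 'a green car parked outside the house',
}

# Hypothetical multi-term query
query_terms = ['green', 'study']

# Accumulate tf*idf per document over all query terms
query_scores = {name: 0.0 for name in docs}
for term in query_terms:
    term_idf = idf(term, list(docs.values()))
    for name, doc in docs.items():
        query_scores[name] += tf(term, doc) * term_idf

# Rank documents from highest to lowest total score
ranking = sorted(query_scores, key=query_scores.get, reverse=True)
for name in ranking:
    print('%s: %.4f' % (name, query_scores[name]))
```

Document `a` contains both query terms, so it collects two contributions and ranks first; `b` and `c` each match one term, and the shorter document scores higher because its term frequency is larger.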
import nltk
terms = "Develop daily routines before and after school—for example, things to pack for school in the morning (like hand sanitizer and a backup mask) and things to do when you return home (like washing hands immediately and washing worn cloth masks). Wash your hands immediately after taking off a mask.People who live in multi-generational households may find it difficult to take precautions to protect themselves from COVID-19 or isolate those who are sick, especially if space in the household is limited and many people live in the same household. CDC recently created guidance for multi-generational households. Although the guidance was developed as part of CDC’s outreach to tribal communities, the information could be useful for all families, including those with both children and older adults in the same home."
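# Split the passage into sentences and tokenize each one;
# each sentence is treated as one document (a list of lowercase tokens)
text = [sentence.lower().split() for sentence in terms.split('. ')]
# Hand the tokenized documents to NLTK's TextCollection
tc = nltk.TextCollection(text)
# Term to search for
term = 'a'
# Index of the document to score
idx = 0
print('TF(%s): %s' % (idx, term), tc.tf(term, text[idx]))
# If a term does not appear in the corpus, 0.0 is returned.
print('IDF(%s): %s' % (idx, term), tc.idf(term))
print('TF-IDF(%s): %s' % (idx, term), tc.tf_idf(term, text[idx]))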