iT邦幫忙

[Text Analysis] 3-4 TF-IDF Concepts


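Overview

TF-IDF is a formula for measuring how important a word is to a document.
A term's TF-IDF value increases with the number of times the term appears in the document and decreases with how frequently the term appears across the corpus.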

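TF

    Term frequency: how often a term occurs within a single document.

$$\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$$

n_{i,j}: number of times the term t_i occurs in document d_j
Σ_k n_{k,j}: total number of term occurrences (words) in document d_j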
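
As a minimal sketch of the TF definition above, the following computes the term frequency for a whitespace-tokenized document. The function name `term_frequency` and the sample sentence are illustrative assumptions, not part of the original article.

```python
def term_frequency(term, tokens):
    """Occurrences of term divided by the total number of tokens."""
    if not tokens:
        # An empty document has no terms, so avoid dividing by zero
        return 0.0
    return tokens.count(term) / len(tokens)

# Hypothetical sample document, tokenized on whitespace
tokens = "the cat sat on the mat".split()
tf_the = term_frequency("the", tokens)  # 2 occurrences out of 6 tokens
tf_dog = term_frequency("dog", tokens)  # term absent from the document
print(tf_the, tf_dog)
```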

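IDF

    Inverse document frequency: a measure of how rare, and therefore how informative, a term is across the whole corpus.

$$\mathrm{idf}_i = \log \frac{|D|}{1 + |\{j : t_i \in d_j\}|}$$

|D|: total number of documents in the corpus
1 + |{j : t_i ∈ d_j}|: number of documents that contain the term t_i, plus 1
1: added so that the denominator is never 0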
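
The IDF definition above, with 1 added to the denominator, can be sketched as follows. The base-10 logarithm is an assumption (the article does not state the base), and the function name and the tiny corpus are hypothetical.

```python
from math import log10

def inverse_document_frequency(term, documents):
    """log10(|D| / (1 + df)), where df counts the documents containing term."""
    containing = sum(1 for tokens in documents if term in tokens)
    # The +1 keeps the denominator non-zero even when no document has the term
    return log10(len(documents) / (1 + containing))

# Hypothetical corpus of four tokenized documents
documents = [
    "the cat sat".split(),
    "the dog barked".split(),
    "a bird sang".split(),
    "fish swim".split(),
]
idf_the = inverse_document_frequency("the", documents)      # log10(4 / 3)
idf_zebra = inverse_document_frequency("zebra", documents)  # log10(4 / 1)
print(idf_the, idf_zebra)
```

One side effect of this variant is that a term present in every document gets a slightly negative value, because the denominator then exceeds the number of documents.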

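TF-IDF

$$\mathrm{tfidf}_{i,j} = \mathrm{tf}_{i,j} \times \mathrm{idf}_i$$

Example:

Suppose a document contains 100 words in total and the term "大角怪" appears in it 5 times,
while "大角怪" appears in 1,000 documents out of a corpus of 10,000,000 documents.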
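
The arithmetic for this example can be worked through as below. The base-10 logarithm is an assumption, and both the plain ratio and the variant with 1 added to the denominator are shown, since the two give slightly different values. Variable names are illustrative.

```python
from math import log10

term_count = 5               # occurrences of the term in the document
total_words = 100            # words in the document
docs_with_term = 1_000       # documents containing the term
total_docs = 10_000_000      # documents in the corpus

tf_value = term_count / total_words

# Plain ratio, without smoothing
idf_plain = log10(total_docs / docs_with_term)
# Variant with 1 added to the denominator to avoid division by zero
idf_smoothed = log10(total_docs / (1 + docs_with_term))

print("TF:", tf_value)
print("IDF (plain):", idf_plain, "TF-IDF:", tf_value * idf_plain)
print("IDF (smoothed):", idf_smoothed, "TF-IDF:", tf_value * idf_smoothed)
```

With these numbers the TF is 0.05, the plain IDF is 4, and the TF-IDF is about 0.2; the smoothed variant comes out marginally lower.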

Term weighting

Code examples

Formula functions

tf

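from math import log

def tf(term, doc, normalize=True):
    # Lowercase the document and split it into words on whitespace
    words = doc.lower().split()
    count = words.count(term.lower())
    if normalize:
        # Relative frequency; an empty document yields 0.0 instead of dividing by zero
        return count / len(words) if words else 0.0
    # Raw count, returned as a float for consistency
    return float(count)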

idf

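def idf(term, docs):
    # Materialize docs so it can be both iterated and measured with len()
    docs = list(docs)
    # Number of documents that contain the term (case-insensitive, whole words)
    num_text_with_term = sum(
        1 for doc in docs if term.lower() in doc.lower().split())
    try:
        # Natural log; this variant adds 1 to the result rather than to the denominator
        return 1.0 + log(len(docs) / num_text_with_term)
    except ZeroDivisionError:
        # The term appears in no document
        return 1.0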

tf-idf

def tf_idf(term, doc, docs):
    return tf(term, doc)*idf(term, docs)

Using the formulas

Declaring the data

corpus = \
    {'a': 'Mr. Green killed Colonel Mustard in the study with the candlestick. Mr. Green is not a very nice fellow.',
     'b': 'Professor Plumb has a green plant in his study ',
     'c': "Miss Scarlett watered Professor Plumb's green plant while he was away from his office last week."}
# lower() converts the text to lowercase
# split() splits the text into words on whitespace

QUERY_TERMS = ['green']

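Plugging in the values

for term in [t.lower() for t in QUERY_TERMS]:
    # Term frequency of the query term in each document
    for doc in sorted(corpus):
        print('TF(%s): %s' % (doc, term), tf(term, corpus[doc]))
    # Inverse document frequency of the term across the whole corpus
    print('IDF: %s' % (term, ), idf(term, corpus.values()), "\n")

    for doc in sorted(corpus):
        # TF-IDF score of the term for this document
        score = tf_idf(term, corpus[doc], corpus.values())
        print('TF-IDF(%s): %s' % (doc, term), score, "\n")
        # With several query terms, the tf*idf scores would be summed per document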
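
The loop above only prints per-term scores. The sketch below shows one way the summing step could look for a query with more than one term. It repeats compact versions of `tf` and `idf` so that it runs on its own; the second query term `'plant'` and the names `scores` and `ranked` are assumptions added for illustration.

```python
from math import log

def tf(term, doc):
    words = doc.lower().split()
    return words.count(term.lower()) / len(words)

def idf(term, docs):
    n = sum(1 for d in docs if term.lower() in d.lower().split())
    if n == 0:
        return 1.0
    return 1.0 + log(len(docs) / n)

corpus = {
    'a': 'Mr. Green killed Colonel Mustard in the study with the candlestick. Mr. Green is not a very nice fellow.',
    'b': 'Professor Plumb has a green plant in his study ',
    'c': "Miss Scarlett watered Professor Plumb's green plant while he was away from his office last week.",
}
query_terms = ['green', 'plant']  # 'plant' is an added, hypothetical query term

docs = list(corpus.values())
scores = {name: 0.0 for name in corpus}
for term in query_terms:
    for name, text in corpus.items():
        # Accumulate the tf*idf contribution of each query term per document
        scores[name] += tf(term, text) * idf(term, docs)

# Rank documents from most to least relevant to the query
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
for name, score in ranked:
    print(name, round(score, 4))
```

Because tokens are split on whitespace only, punctuation stays attached to words (for example `candlestick.`), which is a limitation of this simple tokenization.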

Using a package

Declaring the data

import nltk

terms = "Develop daily routines before and after school—for example, things to pack for school in the morning (like hand sanitizer and a backup mask) and things to do when you return home (like washing hands immediately and washing worn cloth masks). Wash your hands immediately after taking off a mask.People who live in multi-generational households may find it difficult to take precautions to protect themselves from COVID-19 or isolate those who are sick, especially if space in the household is limited and many people live in the same household. CDC recently created guidance for multi-generational households. Although the guidance was developed as part of CDC’s outreach to tribal communities, the information could be useful for all families, including those with both children and older adults in the same home."

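# Split the passage into sentences and tokenize each one into lowercase words,
# so that every sentence is treated as one document (a list of tokens)
text = [s.lower().split() for s in terms.split('.') if s.strip()]
# Build an NLTK TextCollection from the tokenized documents
tc = nltk.TextCollection(text)
# Term to look up
term = 'a'
# Index of the document to score
idx = 0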

Computing the values

# TF of the term within the document at position idx
print('TF(%s): %s' % (idx, term), tc.tf(term, text[idx]))
# If a term does not appear in the corpus, 0.0 is returned.
print('IDF: %s' % (term, ), tc.idf(term))
print('TF-IDF(%s): %s' % (idx, term), tc.tf_idf(term, text[idx]))
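
For readers without NLTK installed, the sketch below shows one plain-Python way to compute comparable values over tokenized documents. It assumes TF is the term count divided by the document length and IDF is the natural log of the number of documents over the number of documents containing the term, with 0.0 for a term that never occurs (in line with the comment above). It is an illustration and not NLTK's own code; the function names and sample documents are hypothetical.

```python
from math import log

def tf(term, tokens):
    """Occurrences of term divided by the number of tokens in the document."""
    return tokens.count(term) / len(tokens)

def idf(term, documents):
    """ln(N / matches); 0.0 when the term occurs in no document."""
    matches = sum(1 for tokens in documents if term in tokens)
    return log(len(documents) / matches) if matches else 0.0

# Hypothetical corpus: each document is a list of tokens
documents = [
    "wash your hands after taking off a mask".split(),
    "pack a backup mask for school".split(),
    "cdc created guidance for households".split(),
]

tf_a = tf("a", documents[0])            # 1 occurrence out of 8 tokens
idf_a = idf("a", documents)             # ln(3 / 2): two of three documents match
tfidf_a = tf_a * idf_a
idf_missing = idf("zebra", documents)   # term absent from every document
print(tf_a, idf_a, tfidf_a, idf_missing)
```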

Output

