[Text Analysis] 3-4 TF-IDF Concepts

蔡基 2021-02-01 13:44:04 ‧ 2029 views

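Overview

TF-IDF is a formula for measuring how important a term is to a document within a corpus.
A term's TF-IDF value rises in proportion to the number of times the term appears in the document and falls as the term becomes more frequent across the corpus.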

TF

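    Term frequency: how often a term appears in a document.

$$\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$$

n_{i,j}: number of times term t_i appears in document d_j
\sum_k n_{k,j}: total number of term occurrences in document d_j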

IDF

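    Inverse document frequency: how informative a term is across the corpus. The fewer documents contain the term, the higher the value.

$$\mathrm{idf}_i = \log \frac{|D|}{1 + |\{j : t_i \in d_j\}|}$$

|D|: total number of documents in the corpus
1 + |\{j : t_i \in d_j\}|: number of documents that contain term t_i, plus 1
1: added so that the denominator is never 0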
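The definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `smoothed_idf` and the toy `docs` list are hypothetical, each document is assumed to be a list of tokens, and the natural logarithm is assumed because the formula does not fix a base.

```python
from math import log

def smoothed_idf(term, docs):
    """IDF with 1 added to the document count, so the denominator is never 0."""
    # docs is assumed to be a list of token lists, one list per document
    containing = sum(1 for doc in docs if term in doc)
    return log(len(docs) / (1 + containing))

# Hypothetical toy corpus of four tokenized documents
docs = [
    ["green", "plant", "study"],
    ["green", "candlestick"],
    ["office", "week"],
    ["plant", "office"],
]

print(smoothed_idf("candlestick", docs))  # in 1 document: log(4 / 2)
print(smoothed_idf("green", docs))        # in 2 documents: log(4 / 3)
print(smoothed_idf("mustard", docs))      # in 0 documents: log(4 / 1), no division by zero
```

One consequence of this smoothing is that a term found in every document gets a slightly negative value, because the denominator then exceeds the number of documents.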

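TF-IDF

$$\mathrm{tfidf}_{i,j} = \mathrm{tf}_{i,j} \times \mathrm{idf}_i$$

Example:

Suppose a document contains 100 terms in total and "大角怪" appears 5 times,
while "大角怪" appears in 1,000 documents out of a corpus of 10,000,000 documents.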
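The arithmetic for this example can be checked with a short script. The base-10 logarithm is an assumption, since the text does not state a base, and the variable names are illustrative only. Both the plain document count and the smoothed count (with 1 added to the denominator) are shown.

```python
from math import log10

term_count = 5            # occurrences of the term in the document
doc_length = 100          # total terms in the document
docs_with_term = 1_000    # documents that contain the term
total_docs = 10_000_000   # documents in the corpus

tf_value = term_count / doc_length                       # 5 / 100
idf_plain = log10(total_docs / docs_with_term)           # log10(10,000)
idf_smoothed = log10(total_docs / (1 + docs_with_term))  # 1 added to the denominator

print("TF:", tf_value)
print("TF-IDF (plain):", tf_value * idf_plain)
print("TF-IDF (smoothed):", tf_value * idf_smoothed)
```

With these numbers the plain variant gives 0.05 × 4 = 0.2, and the smoothed variant comes out marginally lower because the denominator is 1,001 instead of 1,000.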

Term weighting

Code examples

Formula functions

tf

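from math import log

def tf(term, doc, normalize=True):
    # Lowercase the document and split it into words on whitespace
    words = doc.lower().split()
    count = words.count(term.lower())
    if normalize:
        # Relative frequency: occurrences divided by document length
        return count / float(len(words))
    # Raw occurrence count
    return float(count)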

idf

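def idf(term, docs):
    # Count the documents that contain the term (case-insensitive, whole words)
    num_text_with_term = sum(
        1 for doc in docs if term.lower() in doc.lower().split())
    try:
        # This variant adds 1 to the natural log instead of to the denominator
        return 1.0 + log(len(docs) / num_text_with_term)
    except ZeroDivisionError:
        # The term appears in no document
        return 1.0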

tf-idf

def tf_idf(term, doc, docs):
    return tf(term, doc)*idf(term, docs)

Applying the formulas

Declaring the data

corpus = \
    {'a': 'Mr. Green killed Colonel Mustard in the study with the candlestick. Mr. Green is not a very nice fellow.',
     'b': 'Professor Plumb has a green plant in his study ',
     'c': "Miss Scarlett watered Professor Plumb's green plant while he was away from his office last week."}
## lower() => converts the text to lowercase
## split() => splits the text into words on whitespace

QUERY_TERMS = ['green']

Plugging into the formulas

for term in [t.lower() for t in QUERY_TERMS]:
    for doc in sorted(corpus):
        print('TF(%s): %s' % (doc, term), tf(term, corpus[doc]))
    print('IDF: %s' % (term, ), idf(term, corpus.values()),"\n")

    for doc in sorted(corpus):
        score = tf_idf(term, corpus[doc], corpus.values())
        print('TF-IDF(%s): %s' % (doc, term), score,"\n")
        # With several query terms, these tf*idf scores would be summed per document
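The summing step mentioned in the last comment is not carried out in the loop above. A minimal sketch of it follows. The helper `rank`, the shortened corpus, and the second query term are all hypothetical, and `tf` and `idf` are compact restatements of the functions defined earlier so that the block runs on its own.

```python
from math import log

def tf(term, doc):
    # Relative frequency of the term among the whitespace-separated words
    words = doc.lower().split()
    return words.count(term.lower()) / len(words)

def idf(term, docs):
    # 1 + ln(N / df); falls back to 1.0 when no document contains the term
    n = sum(1 for doc in docs if term.lower() in doc.lower().split())
    return 1.0 + log(len(docs) / n) if n else 1.0

def rank(query_terms, corpus):
    """Sum tf*idf over all query terms per document, highest score first."""
    docs = list(corpus.values())
    scores = {name: 0.0 for name in corpus}
    for term in query_terms:
        for name, doc in corpus.items():
            scores[name] += tf(term, doc) * idf(term, docs)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Shortened, hypothetical variant of the corpus used above
corpus = {
    'a': 'Mr. Green killed Colonel Mustard in the study with the candlestick.',
    'b': 'Professor Plumb has a green plant in his study',
    'c': 'Miss Scarlett watered the green plant while he was away.',
}

for name, score in rank(['green', 'plant'], corpus):
    print(name, round(score, 4))
```

Documents that match more of the query terms, and match the rarer ones, accumulate higher totals, so the sorted list can serve as a simple relevance ranking.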

Using a library

Declaring the data

import nltk

terms = "Develop daily routines before and after school—for example, things to pack for school in the morning (like hand sanitizer and a backup mask) and things to do when you return home (like washing hands immediately and washing worn cloth masks). Wash your hands immediately after taking off a mask.People who live in multi-generational households may find it difficult to take precautions to protect themselves from COVID-19 or isolate those who are sick, especially if space in the household is limited and many people live in the same household. CDC recently created guidance for multi-generational households. Although the guidance was developed as part of CDC’s outreach to tribal communities, the information could be useful for all families, including those with both children and older adults in the same home."

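text = [s.lower().split() for s in terms.split('.') if s.strip()]
## Split the passage into sentences and tokenize each one; every sentence is treated as one document
tc = nltk.TextCollection(text)
## Build an NLTK TextCollection from the tokenized documents
term = 'a'
## Term to look up
idx = 0
## Index of the document to score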

Computing the values

print('TF(doc %s): %s' % (idx, term), tc.tf(term, text[idx]))
# idf() returns 0.0 if the term does not appear anywhere in the corpus
print('IDF: %s' % (term, ), tc.idf(term))
print('TF-IDF(doc %s): %s' % (idx, term), tc.tf_idf(term, text[idx]))

Output
