# How TF-IDF Is Computed

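For the IDF part, the classic formula is `log(n/df(k))`, where `n` is the total number of documents and `df(k)` is the number of documents containing term `k`. To avoid a zero denominator, `log(n/(df(k)+1))` is also commonly used, that is, the denominator gets `+1`. scikit-learn offers two variants. The default is the `smooth` version, `log((n+1)/(df(k)+1))+1`, which adds one to both the numerator and the denominator and then adds 1 to the result. The other is the classic version with 1 added to the result: `log(n/df(k))+1`.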
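The variants above can be compared side by side with a few lines of plain Python. This is only a minimal sketch: the function names are made up for illustration, and the two scikit-learn formulas follow the `TfidfTransformer` documentation, where `math.log` is the natural logarithm.

```python
import math

def idf_classic(n, df):
    """Textbook IDF: log(n / df). Fails when df == 0."""
    return math.log(n / df)

def idf_plus_one_denominator(n, df):
    """Common variant that adds 1 to the denominator only."""
    return math.log(n / (df + 1))

def idf_sklearn(n, df, smooth=True):
    """IDF as documented for scikit-learn (smooth_idf=True or False)."""
    if smooth:
        return math.log((1 + n) / (1 + df)) + 1
    return math.log(n / df) + 1

# Example: 3 documents, the term appears in 1 of them.
n, df = 3, 1
print(idf_classic(n, df))
print(idf_plus_one_denominator(n, df))
print(idf_sklearn(n, df, smooth=True))
print(idf_sklearn(n, df, smooth=False))
```

A term that appears in every document gets an IDF of exactly 1 under both scikit-learn variants, so it is down-weighted but never zeroed out.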

# Experiment

## Import packages

``````
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
import math
from sklearn.preprocessing import normalize
``````

## Test data

``````
d1 = 'a b d e d f a f e fa d s a b n'
d2 = 'a z a f e fa h'
d3 = 'a z a f e fa h'
``````

## Compute TF-IDF

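``````
# This token_pattern keeps single-character tokens; the default pattern requires two or more characters
vectorizer = TfidfVectorizer(sublinear_tf=False, stop_words=None, token_pattern="(?u)\\b\\w+\\b", smooth_idf=True, norm='l2')
tfidf = vectorizer.fit_transform([d1, d2, d3])
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0
df_tfidf = pd.DataFrame(tfidf.toarray(), columns=vectorizer.get_feature_names_out(), index=['d1', 'd2', 'd3'])
print("TFIDF")
df_tfidf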
``````

Output:

| | a | b | d | e | f | fa | h | n | s | z |
|---|---|---|---|---|---|---|---|---|---|---|
| d1 | 0.384107 | 0.433566 | 0.650349 | 0.256071 | 0.256071 | 0.128036 | 0.000000 | 0.216783 | 0.216783 | 0.000000 |
| d2 | 0.622686 | 0.000000 | 0.000000 | 0.311343 | 0.311343 | 0.311343 | 0.400911 | 0.000000 | 0.000000 | 0.400911 |
| d3 | 0.622686 | 0.000000 | 0.000000 | 0.311343 | 0.311343 | 0.311343 | 0.400911 | 0.000000 | 0.000000 | 0.400911 |

## Recompute from scratch

### CountVector (TF)

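``````
vectorizer = CountVectorizer(stop_words=None, token_pattern="(?u)\\b\\w+\\b")
tf = vectorizer.fit_transform([d1, d2, d3])
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0
df_tf = pd.DataFrame(tf.toarray(), columns=vectorizer.get_feature_names_out(), index=['d1', 'd2', 'd3'])
print("CountVector")
# Display the labelled DataFrame rather than the raw sparse matrix
df_tf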
``````

Output:

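| | a | b | d | e | f | fa | h | n | s | z |
|---|---|---|---|---|---|---|---|---|---|---|
| d1 | 3 | 2 | 3 | 2 | 2 | 1 | 0 | 1 | 1 | 0 |
| d2 | 2 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 1 |
| d3 | 2 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 1 |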
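As a cross-check that does not depend on scikit-learn, the same counts can be reproduced with the standard library. This is a rough sketch that assumes the same token pattern as above and skips the lowercasing step that `CountVectorizer` applies by default (the test data is already lowercase); the names `docs`, `counts`, and `vocab` are only illustrative.

```python
import re
from collections import Counter

docs = {
    "d1": "a b d e d f a f e fa d s a b n",
    "d2": "a z a f e fa h",
    "d3": "a z a f e fa h",
}

# Same pattern as passed to the vectorizer: any run of word characters.
token_re = re.compile(r"(?u)\b\w+\b")

# One Counter of term frequencies per document.
counts = {name: Counter(token_re.findall(text)) for name, text in docs.items()}

# Sorted vocabulary across all documents, like the vectorizer's column order.
vocab = sorted(set().union(*counts.values()))

print(vocab)
for name in docs:
    # A Counter returns 0 for terms the document does not contain.
    print(name, [counts[name][term] for term in vocab])
```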

## IDF

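``````
# norm=None leaves the TF x IDF products unnormalized
vectorizer = TfidfVectorizer(sublinear_tf=False, stop_words=None, token_pattern="(?u)\\b\\w+\\b", smooth_idf=True, norm=None)
X = vectorizer.fit_transform([d1, d2, d3])
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0
r = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out(), index=['d1', 'd2', 'd3'])
print("IDF")
# idf_ holds one learned IDF weight per vocabulary term
idf = vectorizer.idf_
pd.DataFrame([vectorizer.idf_], columns=vectorizer.get_feature_names_out())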
``````

Output:

| | a | b | d | e | f | fa | h | n | s | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 1.693147 | 1.693147 | 1.0 | 1.0 | 1.0 | 1.287682 | 1.693147 | 1.693147 | 1.287682 |

Let's compute the IDF of `b` by hand using the `smooth` formula:

``````
# Parentheses matter: (n + 1) / (df + 1), then add 1 to the logarithm
math.log((3 + 1) / (1 + 1)) + 1 # => approximately 1.693147
``````

Note: there are 3 documents in total, and 1 of them contains `b`.

``````
# math.log is the natural logarithm (base e)
math.log(math.e) # => 1.0
``````
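The hand calculation for `b` extends to every term. The sketch below counts document frequencies and applies the `smooth` formula in plain Python; it assumes simple whitespace tokenization, which gives the same tokens as the vectorizer for this test data, and the variable names are only illustrative.

```python
import math

docs = [
    "a b d e d f a f e fa d s a b n".split(),
    "a z a f e fa h".split(),
    "a z a f e fa h".split(),
]
n = len(docs)

# Sorted vocabulary over all documents.
vocab = sorted({term for doc in docs for term in doc})

# Document frequency: in how many documents each term occurs at least once.
df = {term: sum(term in doc for doc in docs) for term in vocab}

# Smooth IDF: log((1 + n) / (1 + df)) + 1, using the natural logarithm.
idf = {term: math.log((1 + n) / (1 + df[term])) + 1 for term in vocab}

for term in vocab:
    print(f"{term:>2}  df={df[term]}  idf={idf[term]:.6f}")
```

Terms found in all three documents come out at 1.0, terms found in two at about 1.287682, and terms found in only one at about 1.693147.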

## Compute TF × IDF

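``````
# Multiply the labelled count DataFrame element-wise; `*` on the sparse matrix `tf` would be a matrix product
df_tf * idf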
``````

Output:

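| | a | b | d | e | f | fa | h | n | s | z |
|---|---|---|---|---|---|---|---|---|---|---|
| d1 | 3.0 | 3.386294 | 5.079442 | 2.0 | 2.0 | 1.0 | 0.000000 | 1.693147 | 1.693147 | 0.000000 |
| d2 | 2.0 | 0.000000 | 0.000000 | 1.0 | 1.0 | 1.0 | 1.287682 | 0.000000 | 0.000000 | 1.287682 |
| d3 | 2.0 | 0.000000 | 0.000000 | 1.0 | 1.0 | 1.0 | 1.287682 | 0.000000 | 0.000000 | 1.287682 |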
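The same element-wise product can be written out without pandas, which makes it explicit that each cell is simply count × IDF. This is a sketch under the assumption that the counts and document frequencies are the ones shown earlier; the dictionary names are illustrative.

```python
import math

n_docs = 3

# Term counts for d1 and d2 (d3 is identical to d2); zero counts are omitted.
tf = {
    "d1": {"a": 3, "b": 2, "d": 3, "e": 2, "f": 2, "fa": 1, "n": 1, "s": 1},
    "d2": {"a": 2, "e": 1, "f": 1, "fa": 1, "h": 1, "z": 1},
}

# Number of documents that contain each term.
df = {"a": 3, "b": 1, "d": 1, "e": 3, "f": 3, "fa": 3, "h": 2, "n": 1, "s": 1, "z": 2}

# Smooth IDF per term.
idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}

# Unnormalized TF x IDF: missing terms count as 0.
weighted = {
    doc: {term: counts.get(term, 0) * idf[term] for term in df}
    for doc, counts in tf.items()
}

for doc, row in weighted.items():
    print(doc, [round(value, 6) for value in row.values()])
```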

## Normalization

``````
# L2-normalize each row (document) of the unnormalized TF x IDF table
tfidf = normalize(df_tf * idf, norm="l2")
r = pd.DataFrame(tfidf, columns=vectorizer.get_feature_names_out(), index=['d1', 'd2', 'd3'])
r
``````

Output:

| | a | b | d | e | f | fa | h | n | s | z |
|---|---|---|---|---|---|---|---|---|---|---|
| d1 | 0.384107 | 0.433566 | 0.650349 | 0.256071 | 0.256071 | 0.128036 | 0.000000 | 0.216783 | 0.216783 | 0.000000 |
| d2 | 0.622686 | 0.000000 | 0.000000 | 0.311343 | 0.311343 | 0.311343 | 0.400911 | 0.000000 | 0.000000 | 0.400911 |
| d3 | 0.622686 | 0.000000 | 0.000000 | 0.311343 | 0.311343 | 0.311343 | 0.400911 | 0.000000 | 0.000000 | 0.400911 |
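L2 normalization divides each row by its Euclidean length, so every document vector ends up with length 1. The sketch below applies that to the `d1` row of the unnormalized TF × IDF output; the values are copied from that output as rounded figures, so the last digit of a result may differ slightly from what `normalize` returns.

```python
import math

terms = ["a", "b", "d", "e", "f", "fa", "h", "n", "s", "z"]

# Unnormalized TF x IDF values for d1 (rounded, copied from the earlier output).
row = [3.0, 3.386294, 5.079442, 2.0, 2.0, 1.0, 0.0, 1.693147, 1.693147, 0.0]

# Euclidean (L2) length of the row.
length = math.sqrt(sum(value * value for value in row))

# Divide every entry by the length to get a unit vector.
unit = [value / length for value in row]

for term, value in zip(terms, unit):
    print(f"{term:>2}  {value:.6f}")

# The squared entries of a unit vector sum to 1 (up to floating-point error).
print(sum(value * value for value in unit))
```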

OK, this matches the `TfidfVectorizer` output.

