A Closer Look at How scikit-learn Computes TF-IDF

11th iT Ironman Contest

lagagain

2019-10-28 15:46:29

How TF-IDF is computed

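While taking part in this year's iT Ironman Contest, I wrote a post titled "A quick try of TF-IDF in scikit-learn" (簡單使用scikit-learn裡的TFIDF看看) and mentioned there that scikit-learn computes TF-IDF differently from the classic formulation. I later found the explanation in the official documentation and ran a small experiment of my own, which I'm sharing here.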

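In the classic formulation, TF is computed as $\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$, where $n_{i,j}$ is the number of times term $i$ occurs in document $j$. scikit-learn, however, uses $n_{i,j}$ directly, which is simply the output of CountVectorizer.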

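For IDF, the classic formulation is $\mathrm{idf}(k) = \log\frac{n}{\mathrm{df}(k)}$, where n is the number of documents and df(k) is the number of documents that contain term k. To avoid a zero denominator, log(n/(df(k)+1)) is also commonly used, that is, 1 is added to the denominator. scikit-learn offers two variants. The default is the smoothed version, log((n+1)/(df(k)+1)) + 1, which adds 1 to both the numerator and the denominator and then adds 1 to the result. The other is the classic version with 1 added on top: log(n/df(k)) + 1.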
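
To get a feel for how far apart these variants are, here is a minimal sketch that evaluates each formula for n = 3 documents and df(k) = 1, using natural logarithms. The function names are made up for this illustration and are not part of scikit-learn.

```python
import math

def classic_idf(n, df):
    # log(n / df); breaks down when df == 0
    return math.log(n / df)

def classic_idf_plus_one_denominator(n, df):
    # common workaround: add 1 to the denominator
    return math.log(n / (df + 1))

def sklearn_smooth_idf(n, df):
    # smooth_idf=True: add 1 to numerator and denominator, then add 1
    return math.log((n + 1) / (df + 1)) + 1

def sklearn_plain_idf(n, df):
    # smooth_idf=False: classic formula plus 1
    return math.log(n / df) + 1

n, df = 3, 1
for f in (classic_idf, classic_idf_plus_one_denominator,
          sklearn_smooth_idf, sklearn_plain_idf):
    print(f"{f.__name__}: {f(n, df):.6f}")
# classic_idf: 1.098612
# classic_idf_plus_one_denominator: 0.405465
# sklearn_smooth_idf: 1.693147
# sklearn_plain_idf: 2.098612
```

Because of the trailing +1, a term that appears in every document still gets an IDF of 1 in both scikit-learn variants instead of 0.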

Finally, scikit-learn normalizes each document vector, so the end result is normalize(tf*idf).

Experiment

Imports

from sklearn.feature_extraction.text import TfidfTransformer 
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
import math
from sklearn.preprocessing import normalize

Test data

d1 = 'a b d e d f a f e fa d s a b n'
d2 = 'a z a f e fa h'
d3 = 'a z a f e fa h'

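Yes, this is essentially the same setup I tested in "A quick try of TF-IDF in scikit-learn", so stop_words=None and token_pattern="(?u)\\b\\w+\\b" have to be set explicitly here as well. I won't go over the reasons again in this post.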
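
For readers who have not seen the earlier post, here is a minimal sketch of what the token_pattern setting changes. The default pattern of CountVectorizer and TfidfVectorizer only keeps tokens of two or more characters, so nearly all of the single-letter tokens in the test data would be dropped. The variable names below are only for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

d1 = 'a b d e d f a f e fa d s a b n'

# Default token_pattern: single-character tokens are ignored.
default_vocab = CountVectorizer().fit([d1]).vocabulary_

# Custom token_pattern: tokens of one or more characters are kept.
custom_vocab = CountVectorizer(token_pattern=r"(?u)\b\w+\b").fit([d1]).vocabulary_

print(sorted(default_vocab))  # ['fa']
print(sorted(custom_vocab))   # ['a', 'b', 'd', 'e', 'f', 'fa', 'n', 's']
```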

Computing TF-IDF

vectorizer = TfidfVectorizer(sublinear_tf=False, stop_words=None, token_pattern="(?u)\\b\\w+\\b", smooth_idf=True, norm='l2')
tfidf = vectorizer.fit_transform([d1,d2,d3])
# get_feature_names_out() needs scikit-learn >= 1.0; older releases use get_feature_names()
df_tfidf = pd.DataFrame(tfidf.toarray(),columns=vectorizer.get_feature_names_out(), index=['d1', 'd2', 'd3'])
print("TFIDF")
df_tfidf

Output:

	a	b	d	e	f	fa	h	n	s	z
d1	0.384107	0.433566	0.650349	0.256071	0.256071	0.128036	0.000000	0.216783	0.216783	0.000000
d2	0.622686	0.000000	0.000000	0.311343	0.311343	0.311343	0.400911	0.000000	0.000000	0.400911
d3	0.622686	0.000000	0.000000	0.311343	0.311343	0.311343	0.400911	0.000000	0.000000	0.400911

Recomputing from scratch

CountVector (TF)

vectorizer = CountVectorizer(stop_words=None, token_pattern="(?u)\\b\\w+\\b")  
tf = vectorizer.fit_transform([d1,d2,d3])
df_tf = pd.DataFrame(tf.toarray(),columns=vectorizer.get_feature_names_out(), index=['d1', 'd2', 'd3'])
print("CountVector")
df_tf

Output:

	a	b	d	e	f	fa	h	n	s	z
d1	3	2	3	2	2	1	0	1	1	0
d2	2	0	0	1	1	1	1	0	0	1
d3	2	0	0	1	1	1	1	0	0	1

This result is identical to what TfidfVectorizer(sublinear_tf=False, stop_words=None, token_pattern="(?u)\\b\\w+\\b", smooth_idf=False, use_idf=False, norm=None) produces.
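
That claim is easy to check directly. The sketch below compares the two matrices element by element; the names docs, pattern, counts, raw_tf and same are my own choices for this example.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ['a b d e d f a f e fa d s a b n',
        'a z a f e fa h',
        'a z a f e fa h']
pattern = r"(?u)\b\w+\b"

# Raw term counts.
counts = CountVectorizer(token_pattern=pattern).fit_transform(docs)

# TF-IDF with the IDF weighting and the normalization both switched off.
raw_tf = TfidfVectorizer(token_pattern=pattern, use_idf=False,
                         norm=None).fit_transform(docs)

# The values agree; only the dtype differs (integers vs. floats).
same = np.array_equal(counts.toarray(), raw_tf.toarray())
print(same)  # True
```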

IDF

First, let's look at the values scikit-learn itself produces:

vectorizer = TfidfVectorizer(sublinear_tf=False, stop_words=None, token_pattern="(?u)\\b\\w+\\b", smooth_idf=True, norm=None)  
X = vectorizer.fit_transform([d1,d2,d3])
r = pd.DataFrame(X.toarray(),columns=vectorizer.get_feature_names_out(), index=['d1', 'd2', 'd3'])
print("IDF")
idf = vectorizer.idf_
pd.DataFrame([vectorizer.idf_], columns=vectorizer.get_feature_names_out())

Output:

	a	b	d	e	f	fa	h	n	s	z
0	1.0	1.693147	1.693147	1.0	1.0	1.0	1.287682	1.693147	1.693147	1.287682

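Let's compute the smoothed IDF of b by hand:

math.log((3+1)/(1+1)) + 1 # => about 1.693147

※ There are 3 documents in total, and 1 of them contains b.

Note that Python's math.log defaults to base math.e, that is, the natural logarithm. The expression above is equivalent to math.log((3+1)/(1+1), math.e) + 1.

math.log(math.e) # => 1.0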
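
The same calculation can be done for every term at once. This is a minimal sketch, assuming the smoothed formula log((n+1)/(df(k)+1)) + 1 with natural logarithms; the names n_docs, df, manual_idf and fitted are mine.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ['a b d e d f a f e fa d s a b n',
        'a z a f e fa h',
        'a z a f e fa h']
pattern = r"(?u)\b\w+\b"

counts = CountVectorizer(token_pattern=pattern).fit_transform(docs).toarray()
n_docs = counts.shape[0]           # number of documents (3)
df = (counts > 0).sum(axis=0)      # documents containing each term

# Smoothed IDF computed by hand for the whole vocabulary.
manual_idf = np.log((1 + n_docs) / (1 + df)) + 1

fitted = TfidfVectorizer(token_pattern=pattern, smooth_idf=True).fit(docs)
print(np.allclose(manual_idf, fitted.idf_))  # True
```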

Computing TF-IDF

# use the DataFrame df_tf so that each column is scaled by its IDF
df_tf * idf

Output:

	a	b	d	e	f	fa	h	n	s	z
d1	3.0	3.386294	5.079442	2.0	2.0	1.0	0.000000	1.693147	1.693147	0.000000
d2	2.0	0.000000	0.000000	1.0	1.0	1.0	1.287682	0.000000	0.000000	1.287682
d3	2.0	0.000000	0.000000	1.0	1.0	1.0	1.287682	0.000000	0.000000	1.287682

Hmm, this still differs from the result at the start, because it has not been normalized yet.

Normalization

tfidf = normalize((df_tf * idf).to_numpy(), norm="l2")
r = pd.DataFrame(tfidf,columns=vectorizer.get_feature_names_out(), index=['d1', 'd2', 'd3'])
r

Output:

	a	b	d	e	f	fa	h	n	s	z
d1	0.384107	0.433566	0.650349	0.256071	0.256071	0.128036	0.000000	0.216783	0.216783	0.000000
d2	0.622686	0.000000	0.000000	0.311343	0.311343	0.311343	0.400911	0.000000	0.000000	0.400911
d3	0.622686	0.000000	0.000000	0.311343	0.311343	0.311343	0.400911	0.000000	0.000000	0.400911
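
As a last check, the L2 normalization itself can be written out by hand and compared with what TfidfVectorizer(smooth_idf=True, norm='l2') returns. This is only a sketch; unnormalized, row_norms, manual and reference are names I picked for the example.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['a b d e d f a f e fa d s a b n',
        'a z a f e fa h',
        'a z a f e fa h']
pattern = r"(?u)\b\w+\b"

# tf * idf without normalization.
unnormalized = TfidfVectorizer(token_pattern=pattern, smooth_idf=True,
                               norm=None).fit_transform(docs).toarray()

# L2 normalization: divide each row by its Euclidean length.
row_norms = np.sqrt((unnormalized ** 2).sum(axis=1, keepdims=True))
manual = unnormalized / row_norms

# The library's own normalized output.
reference = TfidfVectorizer(token_pattern=pattern, smooth_idf=True,
                            norm='l2').fit_transform(docs).toarray()

print(np.allclose(manual, reference))  # True
```

After this step every row has length 1, so the values are comparable across documents of different lengths.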