iT邦幫忙

2018 iT 邦幫忙鐵人賽
DAY 16
0

Loading & exploring Wikipedia data

https://ithelp.ithome.com.tw/upload/images/20180101/20107448GIpcBvnDjy.png

接下來我們想要透過 tf-idf 來建立一個 document retrieval system

import graphlab

# load some text data from wiki pages on people
people = graphlab.SFrame('/home/user/nylon7/machine_learning/week4/people_wiki.gl')
people.head()
URI name text
http://dbpedia.org/resource/Digby_Morrell ... Digby Morrell digby morrell born 10october 1979 is a former ...
http://dbpedia.org/resource/Alfred_J._Lewy ... Alfred J. Lewy alfred j lewy aka sandylewy graduated from ...
http://dbpedia.org/resource/Harpdog_Brown ... Harpdog Brown harpdog brown is a singerand harmonica player who ...
http://dbpedia.org/resource/Franz_Rottensteiner ... Franz Rottensteiner franz rottensteiner bornin waidmannsfeld lower ...
http://dbpedia.org/resource/G-Enka ... G-Enka henry krvits born 30december 1974 in tallinn ...
http://dbpedia.org/resource/Sam_Henderson ... Sam Henderson sam henderson bornoctober 18 1969 is an ...
http://dbpedia.org/resource/Aaron_LaCrate ... Aaron LaCrate aaron lacrate is anamerican music producer ...

接著我們要做的是瀏覽資料,我們先具體的關注某一個人的資料

# explore the dataset and checkout the text it contains
obama = people[people['name'] == 'Barack Obama']

obama['text']
# barack hussein obama ii brk husen bm born august 4 1961 is the 44th and current president of the united states and the first african american.....

Exploring word counts

# Get the word count for obama article 

obama['word_count'] = graphlab.text_analytics.count_words(obama['text'])
print obama['word_count']

https://ithelp.ithome.com.tw/upload/images/20180101/20107448M01kqVXo9c.png

# Sort the word count for obama article

# Turning dictonary of word counts into a table
obama_word_count_table = obama[['word_count']].stack('word_count', new_column_name = ['word','count'])

# Sorting the word counts to show most common words at the top
obama_word_count_table.sort('count',ascending=False)

從中可以發現,有很多單字都是沒有意義的(the、in、and、of...),除了obama

word count
the 40
in 30
and 21
of 18
to 14
his 11
obama 9
act 8
he 7
a 7

Computing & exploring TF-IDFs

我們無法只針對 obama 做 TF-IDF 因為它是基於全部的文檔,你需要把那些在每篇文章中出現過的文字標準化( normalizer)

# Compute TF-IDF for the corpus

# step 1:get word counts
people['word_count'] = graphlab.text_analytics.count_words(people['text'])

# step 2:#calculate tf-idf & normalizer
tfidf = graphlab.text_analytics.tf_idf(people['word_count'])
people['tfidf'] = tfidf
tfidf

https://ithelp.ithome.com.tw/upload/images/20180101/201074485ZC8rZXxr8.png

接著我們就可以檢查 Obama 的 tf-idf

# Examine the TF-IDF for the Obama article
obama[['tfidf']].stack('tfidf',new_column_name['word','tfidf']).sort('tfidf',ascending=False)
word tfidf
obama 43.2956530721
act 27.678222623
iraq 17.747378588
control 14.8870608452
law 14.7229357618
ordered 14.5333739509
military 13.1159327785
involvement 12.7843852412
response 12.7843852412
democratic 12.4106886973

先前我們也做過類似的事情,只不過有意義的單字只有 obama ,但是這次你可以發現單字中充滿了關聯性

Computing distances between Wikipedia articles

先看看目標與觀測點間距離是如何展示的

我們選擇了 Bill Clinton 與 David Beckham ,理論上 Clinton 應該跟 obama 比較近

原因是兩個都是政治人物、前美國總統並且同為民主黨的黨員

# Manually compute distances between a few people
clinton = people[people['name'] == 'Bill Clinton']
beckham = people[people['name'] == 'David Beckham']

# Is Obama closer to Clinton than to Beckham?
graphlab.distances.cosine(obama['tfidf'][0],clinton['tfidf'][0])
# 0.8339854936884276

graphlab.distances.cosine(obama['tfidf'][0],beckham['tfidf'][0])
# 0.9791305844747478

這裡呈現的數字意含著,數字越小代表越接近,因此由結果來看合乎推測

從剛剛到現在我們已經針對個別的人來算出其距離,接下來我想做到的是一篇文章與其他篇文章的距離

Building & exploring a nearest neighbors model for Wikipedia articles

在前面的課程中我們有討論過 nearest neighbors model 以及若何將之用於文章檢索上

# Build a nearest neighbor model for document retrieval
knn_model = graphlab.nearest_neighbors.create(people,features=['tfidf'],label='name')

# Applying the nearest-neighbors model for retrieval
knn_model.query(obama)

https://ithelp.ithome.com.tw/upload/images/20180101/201074482vkyeteRpc.png

Other examples of document retrieval

jolie = people[people['name'] == 'Angelina Jolie']
knn_model.query(jolie)
query_label reference_label distance rank
0 Angelina Jolie 0.0 1
0 Brad Pitt 0.784023668639 2
0 Julianne Moore 0.795857988166 3
0 Billy Bob Thornton 0.803069053708 4
0 George Clooney 0.8046875 5

從結果來看,毫無疑問 obama 離自己最近,而 Joe Biden 則是他之前的副手(前美國副總統)

而離 Angelina Jolie 最近的是 Brad Pitt(丈夫),不過我想主因是兩個人都是美國好萊塢影音,而不是因為夫妻關係

Reference:


上一篇
[day 14] 分群與相似度-4
下一篇
[day 16] 推薦系統 -1
系列文
到底是在learning什麼拉30

尚未有邦友留言

立即登入留言