[day 10] 分類-實作

2018 iT 邦幫忙鐵人賽

DAY 11

AI & Machine Learning

到底是在learning什麼拉系列第 11 篇

2018鐵人賽 machine learning

nylon

團隊晶心壯士

2017-12-29 18:28:13

2081 瀏覽

分享至

Loading & exploring product review data

在分類這章裡面，我們討論了如何透過分類來搭建情緒分析器

接下將透過一個簡單的例子，帶大家做個簡單的情緒分析器

一樣要導入GraphLab

接著引入產品的資料(amazon_baby.gl)，透過graphlab來讓資料視覺化

import graphlab

products = graphlab.SFrame('/home/user/nylon7/machine_learning/week3/amazon_baby.gl')
products.head()

最後我們可以透過head這個method來得到資料的前幾行

name	review	rating
Planetwise Flannel Wipes	These flannel wipes areOK, but in my opinion ...	3.0
Planetwise Wipe Pouch	it came early and was notdisappointed. i love ...	5.0
Annas Dream Full Quiltwith 2 Shams ...	Very soft and comfortableand warmer than it ...	5.0
Stop Pacifier Suckingwithout tears with ...	This is a product wellworth the purchase. I ...	5.0
Stop Pacifier Suckingwithout tears with ...	All of my kids have criednon-stop when I tried to ...	5.0
Stop Pacifier Suckingwithout tears with ...	When the Binky Fairy cameto our house, we didn't ...	5.0
A Tale of Baby's Dayswith Peter Rabbit ...	Lovely book, it's boundtightly so you may no ...	4.0
Baby Tracker® - DailyChildcare Journal, ...	Perfect for new parents.We were able to keep ...	5.0
Baby Tracker® - DailyChildcare Journal, ...	A friend of mine pinnedthis product on Pinte ...	5.0
Baby Tracker® - DailyChildcare Journal, ...	This has been an easy wayfor my nanny to record ...	4.0

這份資料大致上有三個項目，分別是品名、評論、等級

Creating the word count vector

接著我們想要建立自己的word count，把整條評論變成分詞，這樣才有助於我們做情緒分析

products['word_count'] = graphlab.text_analytics.count_words(products['review'])
products.head()

統計完之後，同樣可以透過head將其印出，我們可以看到word_count的欄位，他幫我們統計了每一個單字的數量

name	review	rating	word_count	sentiment
Planetwise Wipe Pouch	it came early and was notdisappointed. i love ...	5.0	{'and': 3, 'love': 1,'it': 2, 'highly': 1, ...	1
Annas Dream Full Quiltwith 2 Shams ...	Very soft and comfortableand warmer than it ...	5.0	{'and': 2, 'quilt': 1,'it': 1, 'comfortable': ...	1
Stop Pacifier Suckingwithout tears with ...	This is a product wellworth the purchase. I ...	5.0	{'ingenious': 1, 'and':3, 'love': 2, ...	1
Stop Pacifier Suckingwithout tears with ...	All of my kids have criednon-stop when I tried to ...	5.0	{'and': 2, 'parents!!':1, 'all': 2, 'puppet.': ...	1
Stop Pacifier Suckingwithout tears with ...	When the Binky Fairy cameto our house, we didn't ...	5.0	{'and': 2, 'this': 2,'her': 1, 'help': 2, ...	1
A Tale of Baby's Dayswith Peter Rabbit ...	Lovely book, it's boundtightly so you may no ...	4.0	{'shop': 1, 'noble': 1,'is': 1, 'it': 1, 'as': ...	1
Baby Tracker® - DailyChildcare Journal, ...	Perfect for new parents.We were able to keep ...	5.0	{'and': 2, 'all': 1,'right': 1, 'when': 1, ...	1
Baby Tracker® - DailyChildcare Journal, ...	A friend of mine pinnedthis product on Pinte ...	5.0	{'and': 1, 'help': 1,'give': 1, 'is': 1, ' ...	1
Baby Tracker® - DailyChildcare Journal, ...	This has been an easy wayfor my nanny to record ...	4.0	{'journal.': 1, 'nanny':1, 'standarad': 1, ...	1
Baby Tracker® - DailyChildcare Journal, ...	I love this journal andour nanny uses it ...	4.0	{'all': 1, 'forget': 1,'just': 1, 'food': 1, ...	1

Exploring the most popular product

將著我們可以先來看看產品的某一例，比如

products['name'].show()

從中你可以發現，得到最多評論的產品是Vulli Sophie

同樣我們可以透過show()將評等視覺化

# explore Vulli Sophie
giraffe_reviews = products[ products['name'] == 'Vulli Sophie the Giraffe Teether']

#785
len(giraffe_reviews)

giraffe_reviews['rating'].show(view='Categorical')

直觀來看我們可以發現，這像產品的評價是不錯的，原因是其5顆星的評等遠高於其他

Defining which reviews have positive or negative sentiment

回顧一下上面的圖表，這像產品沒有"讚"與"爛"的標籤，它是一種光譜，而不是一種二元的對立

因此這時候我們就要做出一點取捨，我們假設5跟4都代表著正向的情緒而1跟2代表負面

至於3我們先將他撇除因為我認為3顆星代表著不好不壞（當然你可以嚴格一點，這邊我只是想展示當你想把某個區間的評價消除可以怎麼做）

# build a sentiment classifiers

# ignore all 3* reviews
products = products[products['rating'] != 3]

# positive sentiment = 4* or 5* reviews
products['sentiment'] = products['rating'] >= 4

products.head()

此時在檢視一次會發現多了一個sentiment的欄位了(1為正向、0為負面)

Training a sentiment classifier

我們必須分兩步驟來實現這個classifier

先將資料分成測試集與訓練集
建立模型，以訓練集做為輸入訓練機器

這個範例中我們呼叫的是logistic_classifier（邏輯回歸），但事實上還有Decision Trees（決策樹）、Support Vector Machine（支持向量機）這兩種方法，這將會在之後討論

在 logistic classifier 中除了要給 train_data 外還要輸入 train 的目標還有 features，再將測試集輸入到validation_set，最後就可以得到訓練的結果

# 80% train set 20% test set
train_data,test_data = products.random_split(.8,seed=0)

sentiment_model = graphlab.logistic_classifier.create(train_data,
                                                      target = 'sentiment',
                                                      features = ['word_count'],
                                                      validation_set=test_data)

從圖表中可以很清楚看到，就結果來說Training-accuracy越來越趨近1(越來越好)

Evaluating a classifier & the ROC curve

接著我們要來評估我們的情感分析模組

還記得之前提到的False negative 與 False positive 嗎？

ROC curve 就是專門用來針對confusion matrix中的 False negative 與 False positive

# evaluate the sentiment model use roc curve
sentiment_model.evaluate(test_data,metric='roc_curve')

# Data Visualization
sentiment_model.show(view='Evaluation')

ROC curve 的圖表隱含著的是對 True Positive 與 False Positive 的取捨

	Predicted label +	Predicted label -
True label +	True Positive	False Negative
True label -	False Negativ	True Negativ

我們可以看到當中的 True Positive 與 True Negativ 資料非常的不平衡，總體精確率為91.1%

Applying model to find most positive & negative reviews for a product

接下來我想用我的訓練好的模組來分析我 giraffe ，這邊有一點要注意，還記得前面提到的嗎？若要夠精確就必須將結果加上機率才行，所以 output_type 要設為 probability

# Applying the learned model to understand sentiment for giraffe
giraffe_reviews['predicted_sentiment'] = sentiment_model.predict(giraffe_reviews,output_type='probability')

giraffe_reviews.head()

name	review	rating	word_count	predicted_sentiment
Vulli Sophie the GiraffeTeether ...	He likes chewing on allthe parts especially the ...	5.0	{'and': 1, 'all': 1,'because': 1, 'it': 1, ...	0.999513023521
Vulli Sophie the GiraffeTeether ...	My son loves this toy andfits great in the diaper ...	5.0	{'and': 1, 'right': 1,'help': 1, 'just': 1, ...	0.999320678306
Vulli Sophie the GiraffeTeether ...	There really should be alarge warning on the ...	1.0	{'and': 2, 'all': 1,'would': 1, 'latex.': 1, ...	0.013558811687
Vulli Sophie the GiraffeTeether ...	All the moms in my moms'group got Sophie for ...	5.0	{'and': 2, 'one!': 1,'all': 1, 'love': 1, ...	0.995769474148
Vulli Sophie the GiraffeTeether ...	I was a little skepticalon whether Sophie was ...	5.0	{'and': 3, 'all': 1,'months': 1, 'old': 1, ...	0.662374415673
Vulli Sophie the GiraffeTeether ...	I have been reading aboutSophie and was going ...	5.0	{'and': 6, 'seven': 1,'already': 1, 'love': 1, ...	0.999997148186
Vulli Sophie the GiraffeTeether ...	My neice loves her sophieand has spent hours ...	5.0	{'and': 4, 'drooling,':1, 'love': 1, ...	0.989190989536
Vulli Sophie the GiraffeTeether ...	What a friendly face!And those mesmerizing ...	5.0	{'and': 3, 'chew': 1,'be': 1, 'is': 1, ...	0.999563518413
Vulli Sophie the GiraffeTeether ...	We got this just for myson to chew on instea ...	5.0	{'chew': 2, 'seemed': 1,'because': 1, 'about.': ...	0.970160542725
Vulli Sophie the GiraffeTeether ...	My baby seems to likethis toy, but I could ...	3.0	{'and': 2, 'already': 1,'some': 1, 'it': 3, ...	0.195367644588

我們可以看到模組將每個預測中都增加了機率

接著如果我們想要知道給予負面評價的人都怎麼說？如果要視覺化呈現我們透過排序的方式，然後將 ascending 設為即可看到從負評開始的排列

# sort the reviews based on the predicted sentiment
giraffe_reviews = giraffe_reviews.sort('predicted_sentiment',ascending=False)
giraffe_reviews.head()

最後我們也可以透過存取 review 欄位來更細緻看看評論到底寫些什麼？