11th iThome Ironman Contest

DAY 6

AI & Data

Learning from Top Kagglers How to Win Data Analysis Competitions series, Part 6

[Day 6] Bag of words / BOW

11th Ironman Contest, kaggle

madeleine

2019-09-07 22:50:53


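Feature extraction from texts, images
Text-only competition: Allen AI challenge
Image-only competition: Data Science Bowl

There are two ways to turn text into features: bag of words and word2vec. Bag of words is the simpler of the two.

Screenshot from Coursera

The idea behind BOW (bag of words) is to convert text into vectors; see the definition further below.

Screenshot from Coursera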

sklearn.feature_extraction.text.CountVectorizer
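A minimal sketch of how CountVectorizer turns text into count vectors. The two-sentence corpus is made up for illustration and is not from the course.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, made up for illustration.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

vectorizer = CountVectorizer()
x = vectorizer.fit_transform(docs)  # sparse matrix: one row per document

# vocabulary_ maps each term to its column index; list the terms in column order.
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
counts = x.toarray()

print(vocab)
print(counts)
```

Each column is one word and each row is one document, so "the" gets a count of 2 in both rows.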

Bag of words: TF-IDF

Term frequency / TF

tf = 1/x.sum(axis=1)[:,None]
x = x * tf

Screenshot from Coursera

Inverse Document Frequency / IDF

idf = np.log(x.shape[0] / (x > 0).sum(0))
x = x * idf
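The TF and IDF steps can be tried on a small dense matrix. This is only a sketch: the count matrix is hypothetical, and it assumes every term appears in at least one document (otherwise the IDF division hits zero).

```python
import numpy as np

# Hypothetical count matrix: 3 documents (rows) x 4 terms (columns).
x = np.array([
    [2, 1, 0, 1],
    [0, 1, 0, 3],
    [1, 1, 2, 0],
], dtype=float)

# Term frequency: divide each row by its total so every row sums to 1.
tf = 1 / x.sum(axis=1)[:, None]
x_tf = x * tf

# Inverse document frequency:
# log(number of documents / number of documents containing the term).
idf = np.log(x.shape[0] / (x > 0).sum(0))
x_tfidf = x_tf * idf

print(x_tfidf)
```

The second term occurs in all three documents, so its IDF is log(1) = 0 and its column is zeroed out, while the third term, which occurs in a single document, gets the largest weight.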

Screenshot from Coursera

sklearn.feature_extraction.text.TfidfVectorizer
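A short sketch of TfidfVectorizer on a made-up corpus. Note that its defaults differ from the plain formulas: with smooth_idf=True the IDF is ln((1 + n) / (1 + df)) + 1, and with norm="l2" each row is scaled to unit length rather than to sum 1.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, made up for illustration.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

vectorizer = TfidfVectorizer()  # defaults: smooth_idf=True, norm="l2"
dense = vectorizer.fit_transform(docs).toarray()

# With L2 normalisation every non-empty row has Euclidean length 1.
row_norms = (dense ** 2).sum(axis=1) ** 0.5

# Rare terms get a larger IDF than common ones.
idf_cat = vectorizer.idf_[vectorizer.vocabulary_["cat"]]  # in 1 document
idf_the = vectorizer.idf_[vectorizer.vocabulary_["the"]]  # in 2 documents

print(row_norms, idf_cat, idf_the)
```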

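Bag of words model: definition

Quoted from 曾元顯, 圖書館學與資訊科學大辭典 (Dictionary of Library and Information Science), October 2012
http://terms.naer.edu.tw/detail/1679006/

The point of the bag of words model is not the imaginary bag itself but how the words inside it are treated: every word is an independent unit, and dependencies between words are ignored. For example, if the content of document A (say, its title) is "A Study of Disputes between Patients and Doctors", then under the bag of words model the document can be expressed as four independent words: "patient, dispute, doctor, research".

Each word in a document represents one dimension of the space, and the dimensions are independent of one another. This yields a document vector that is convenient for later vector computations. Continuing the example, with the 8 words (patient, doctor, dispute, research, medical care, deficiency, improvement, discussion) as dimensions, document A and document B can be represented as the vectors (1, 1, 1, 1, 0, 0, 0, 0) and (0, 0, 0, 0, 1, 1, 1, 1) respectively.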
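The two vectors in the definition can be reproduced with a few lines of plain Python. This is a sketch: the words are English translations, the documents are assumed to be already split into words, and document B's word list is an assumption inferred from its vector, since the quoted text does not give its content.

```python
# Fixed vocabulary: the eight words from the example, one per dimension.
vocabulary = ["patient", "doctor", "dispute", "research",
              "medical care", "deficiency", "improvement", "discussion"]

# Documents as already-segmented word lists.
doc_a = ["patient", "dispute", "doctor", "research"]
# Assumption: document B's words are inferred from its vector.
doc_b = ["medical care", "deficiency", "improvement", "discussion"]

def to_bow_vector(words, vocabulary):
    """Count how often each vocabulary word occurs; word order is ignored."""
    return [words.count(term) for term in vocabulary]

vec_a = to_bow_vector(doc_a, vocabulary)
vec_b = to_bow_vector(doc_b, vocabulary)

# Dot product of the two vectors: 0 means the documents share no words.
overlap = sum(a * b for a, b in zip(vec_a, vec_b))

print(vec_a, vec_b, overlap)
```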

Bag of words: N-gram

sklearn.feature_extraction.text.CountVectorizer: ngram_range, analyzer

Screenshot from Coursera
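A small sketch of the two CountVectorizer parameters: ngram_range sets the minimum and maximum n, and analyzer switches between word and character n-grams. The input strings are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Word unigrams and bigrams.
word_ngrams = CountVectorizer(ngram_range=(1, 2))
word_ngrams.fit(["bag of words model"])
word_terms = sorted(word_ngrams.vocabulary_)

# Character bigrams only.
char_ngrams = CountVectorizer(analyzer="char", ngram_range=(2, 2))
char_ngrams.fit(["bow"])
char_terms = sorted(char_ngrams.vocabulary_)

print(word_terms)
print(char_terms)
```

The word vectorizer yields the four single words plus "bag of", "of words" and "words model"; the character vectorizer yields "bo" and "ow".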

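N-gram: definition

Quoted from 資訊與通信術語辭典 (Dictionary of Information and Communication Terms), June 2003
http://terms.naer.edu.tw/detail/1283111/
A probabilistic grammar built on an (n−1)-order Markov model, which uses statistics on the co-occurrence probability of n words in an utterance to infer the structural relations of a sentence. When n = 2 it is called a bigram; when n = 3 it is called a trigram.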
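As a sketch of what the (n−1)-order Markov assumption means, the probability of a sentence w_1 … w_m is approximated by conditioning each word only on the n−1 words before it. The symbols w_i and m are notation introduced here, not taken from the quoted definition.

```latex
P(w_1, \dots, w_m) \approx \prod_{i=1}^{m} P\left(w_i \mid w_{i-n+1}, \dots, w_{i-1}\right)
```

For a bigram model (n = 2) each factor reduces to P(w_i | w_{i-1}).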

Data preprocessing

Lowercase
Stemming
democracy, democratic, and democratization -> democr
Lemmatization (reducing a word to its dictionary form)
democracy, democratic, and democratization -> democracy
Stopwords: articles and prepositions, words that carry little meaning, and possibly words that occur very frequently

NLTK, the Natural Language Toolkit library for Python
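A minimal stemming sketch with NLTK's PorterStemmer, chosen here because it needs no corpus download. The real stems depend on the algorithm and may not match the idealized "democr" above exactly. Lemmatization with WordNetLemmatizer is left out because it requires downloading the WordNet data first.

```python
from nltk.stem import PorterStemmer

words = ["democracy", "democratic", "democratization"]

# Rule-based suffix stripping; no language data has to be downloaded.
stemmer = PorterStemmer()
stems = [stemmer.stem(word) for word in words]

print(stems)
```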

sklearn.feature_extraction.text.CountVectorizer: max_df
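A sketch of how max_df can act as an automatic stopword filter: a float value is read as a share of documents, and terms that occur in more than that share are dropped. The corpus and the 0.9 threshold are arbitrary choices for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus; "The" appears in every document.
docs = [
    "The cat sat on the mat",
    "The dog sat on the log",
    "The bird sang",
]

# lowercase=True is the default and is spelled out here for clarity.
# max_df=0.9 drops terms found in more than 90% of the documents.
vectorizer = CountVectorizer(lowercase=True, max_df=0.9)
vectorizer.fit(docs)

kept_terms = sorted(vectorizer.vocabulary_)
print(kept_terms)
```

"the" occurs in all three documents and is removed, while "sat" and "on" (two of three documents) stay.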

Feature extraction from text

Recap: the BOW pipeline

-Preprocessing: lowercase, stemming, lemmatization, and stopwords
-N-grams
-Post-processing: TF-IDF
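The recap can be sketched as one scikit-learn pipeline. This is one possible arrangement, not the course's code: the corpus is made up, the built-in English stopword list is used, and stemming and lemmatization are skipped because scikit-learn does not provide them.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus, made up for illustration.
docs = [
    "The patient talked to the doctor",
    "The doctor wrote a research paper",
    "Researchers discussed the paper",
]

bow = make_pipeline(
    # Preprocessing (lowercase, stopwords) and n-grams (unigrams + bigrams).
    CountVectorizer(lowercase=True, stop_words="english", ngram_range=(1, 2)),
    # Post-processing: reweight the counts with TF-IDF.
    TfidfTransformer(),
)
features = bow.fit_transform(docs)

# make_pipeline names each step after its lowercased class name.
vocabulary = bow.named_steps["countvectorizer"].vocabulary_
print(features.shape, sorted(vocabulary))
```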

Bag of words

Feature extraction from text with Sklearn(http://scikit-learn.org/stable/modules/feature_extraction.html)
More examples of using Sklearn(https://andhint.github.io/machine-learning/nlp/Feature-Extraction-From-Text/)

Word2vec

Tutorial to Word2vec(https://www.tensorflow.org/tutorials/word2vec)
Tutorial to word2vec usage(https://rare-technologies.com/word2vec-tutorial/)
Text Classification With Word2Vec(http://nadbordrozd.github.io/blog/2016/05/20/text-classification-with-word2vec/)
Introduction to Word Embedding Models with Word2Vec(https://taylorwhitten.github.io/blog/word2vec)

NLP Libraries

NLTK(http://www.nltk.org/)
TextBlob(https://github.com/sloria/TextBlob)
