DAY 20
AI & Machine Learning

## Preface

1. Evaluation metrics: accuracy, precision, and recall.
2. Bag of Words (BOW).
3. N-grams.
4. tf-idf (term frequency–inverse document frequency).
5. Named Entity Recognition (NER).

## Evaluation Metrics

1. Accuracy = (tp + tn) / (tp + fp + fn + tn)
2. Precision = tp / (tp + fp)
3. Recall = tp / (tp + fn)
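The three formulas translate directly into Python; the tp/fp/fn/tn counts below are made-up numbers for illustration:

```python
def accuracy(tp, fp, fn, tn):
    """Fraction of all predictions that were correct."""
    return (tp + tn) / (tp + fp + fn + tn)

def precision(tp, fp):
    """Of everything predicted positive, how much really was positive."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Of everything really positive, how much was predicted positive."""
    return tp / (tp + fn)

# Example: 8 true positives, 2 false positives, 4 false negatives, 6 true negatives
print(accuracy(8, 2, 4, 6))   # 0.7
print(precision(8, 2))        # 0.8
print(recall(8, 4))           # 0.666...
```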

NLTK provides functions for computing these metrics; the short program below makes their usage clear.

```python
from __future__ import print_function
from nltk.metrics import accuracy, precision, recall

training = '雞 貓 雞 貓 狗 人'.split()   # reference labels: chicken, cat, chicken, cat, dog, person
testing  = '雞 貓 雞 貓 貓 貓'.split()   # predicted labels

# 4 of the 6 predictions match the reference
print("accuracy=", accuracy(training, testing))

trainset = set(training)
testset = set(testing)
print("trainset=", trainset)
print("testset=", testset)

# every predicted class falls within the 4 reference classes
print("precision=", precision(trainset, testset))
# 2 of the 4 reference classes were predicted
print("recall=", recall(trainset, testset))
```

## Bag of Words (BOW)

Given the two sentences:

(1) John likes to watch movies. Mary likes movies too.
(2) John also likes to watch football games.

the combined vocabulary is:

[
"John",
"likes",
"to",
"watch",
"movies",
"Mary",
"too",
"also",
"football",
"games"
]

and each sentence becomes a vector of word counts over that vocabulary:

(1) [1, 2, 1, 1, 2, 1, 1, 0, 0, 0]
(2) [1, 1, 1, 1, 0, 0, 0, 1, 1, 1]
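A minimal sketch of building these count vectors in plain Python, with the vocabulary order following the list above:

```python
from collections import Counter

VOCAB = ["John", "likes", "to", "watch", "movies", "Mary",
         "too", "also", "football", "games"]

def bow_vector(sentence, vocab=VOCAB):
    """Count how many times each vocabulary word occurs in the sentence."""
    counts = Counter(sentence.replace(".", "").split())
    return [counts[word] for word in vocab]

print(bow_vector("John likes to watch movies. Mary likes movies too."))
# [1, 2, 1, 1, 2, 1, 1, 0, 0, 0]
print(bow_vector("John also likes to watch football games."))
# [1, 1, 1, 1, 0, 0, 0, 1, 1, 1]
```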

Continuous Bag-of-Words (CBOW) takes the middle word of a window as the label and the words on either side as the input, so multiple input words predict one output label. With a window of one word on each side, sentence (1) yields:
[
"John to", ==> likes
"likes watch", ==> to
"to movies", ==> watch
"watch Mary", ==> movies
"movies likes", ==> Mary
"Mary movies", ==> likes
"likes too", ==> movies
]
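These (context, target) pairs can be generated with a short helper; `cbow_pairs` is a hypothetical function written for illustration, with a window of one word on each side:

```python
def cbow_pairs(tokens, window=1):
    """Generate (context words, target word) pairs for CBOW training."""
    pairs = []
    for i in range(window, len(tokens) - window):
        # the words immediately left and right of position i
        context = tokens[i - window:i] + tokens[i + 1:i + 1 + window]
        pairs.append((" ".join(context), tokens[i]))
    return pairs

tokens = "John likes to watch movies Mary likes movies too".split()
for context, target in cbow_pairs(tokens):
    print(f"{context} ==> {target}")
# John to ==> likes
# likes watch ==> to
# ...
# likes too ==> movies
```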

## N-Grams

The bigrams (2-grams) of the two sentences, taken within each sentence:

[
"John likes",
"likes to",
"to watch",
"watch movies",
"Mary likes",
"likes movies",
"movies too",
]

The 1-skip bigrams over the full token stream, pairing each word with the word two positions ahead:

[
"John to",
"likes watch",
"to movies",
"watch Mary",
"movies likes",
"Mary movies",
"likes too",
]
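Both lists can be reproduced in plain Python; `word_ngrams` is a hypothetical helper written for illustration:

```python
def word_ngrams(tokens, n):
    """Return the list of n-grams (as strings) over a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sent1 = "John likes to watch movies".split()
sent2 = "Mary likes movies too".split()

# Bigrams computed per sentence, so pairs never cross the sentence boundary
bigrams = word_ngrams(sent1, 2) + word_ngrams(sent2, 2)
print(bigrams)

# 1-skip bigrams over the whole token stream: pair word i with word i+2
tokens = sent1 + sent2
skip1 = [f"{a} {b}" for a, b in zip(tokens, tokens[2:])]
print(skip1)
```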

Word2Vec, the word-embedding model created at Google, computes its word vectors using exactly these CBOW and skip-gram schemes.

## tf-idf (term frequency–inverse document frequency)

tf-idf = tf * idf, where tf is a term's frequency within one document and idf is the log of the total number of documents divided by the number of documents containing the term.

```python
from nltk.text import TextCollection

# Load NLTK's sample texts
from nltk.book import text1, text2, text3
# text1 = "Moby Dick by Herman Melville 1851"
# text2 = "Sense and Sensibility by Jane Austen 1811"
# text3 = "The Book of Genesis"
# ...
# text9 = "The Man Who Was Thursday by G . K . Chesterton 1908"

# Use the TextCollection class to compute tf-idf
# over the collection text1, text2, text3
mytexts = TextCollection([text1, text2, text3])

# tf: e.g. the term frequency of "book" in text3
# tf = text.count(term) / len(text)
print(mytexts.tf("book", text3))
# 2.233937985881512e-05

# idf: e.g. the idf of "Book" across text1, text2, text3
# (note that matching is case-sensitive)
# idf = (log(len(self._texts) / matches) if matches else 0.0)
print(mytexts.idf("Book"))
# 1.0986122886681098
```
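To see where the numbers come from, the same formulas can be sketched in plain Python over a toy corpus; the three documents below are made up for illustration:

```python
import math

docs = [
    "john likes movies".split(),
    "mary likes football".split(),
    "john watches football games".split(),
]

def tf(term, doc):
    # Same definition NLTK's TextCollection uses: raw count / document length
    return doc.count(term) / len(doc)

def idf(term, docs):
    # log(number of documents / documents containing the term), 0 if absent
    matches = sum(1 for d in docs if term in d)
    return math.log(len(docs) / matches) if matches else 0.0

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tf("john", docs[0]))            # 1/3: one occurrence in a 3-word document
print(idf("john", docs))              # log(3/2): appears in 2 of 3 documents
print(tf_idf("john", docs[0], docs))
```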

## Named Entity Recognition (NER)

NER is a technique that parses a document and tags each entity with its class, e.g. person, organization, or location. NLTK supports it with two tagging approaches, ne_chunk and the Stanford NER tagger. An ne_chunk example:

```python
import nltk
from nltk import ne_chunk, word_tokenize

sent = "Mark is studying at Stanford University in California"
print(ne_chunk(nltk.pos_tag(word_tokenize(sent)), binary=False))
# output
# (S
#   (PERSON Mark/NNP)
#   is/VBZ
#   studying/VBG
#   at/IN
#   (ORGANIZATION Stanford/NNP University/NNP)
#   in/IN
#   (GPE California/NNP))
```

NLTK can also extract relations between recognized entities, e.g. ORG-in-LOC pairs from the IEER corpus:

```python
import re
import nltk

# Match an "in" that is not part of a phrase such as "in ...ing"
IN = re.compile(r'.*\bin\b(?!\b.+ing)')
for doc in nltk.corpus.ieer.parsed_docs('NYT_19980315'):
    for rel in nltk.sem.extract_rels('ORG', 'LOC', doc, corpus='ieer', pattern=IN):
        print(nltk.sem.rtuple(rel))
# [ORG: 'WHYY'] 'in' [LOC: 'Philadelphia']
# ...
# [ORG: 'Bastille Opera'] 'in' [LOC: 'Paris']
# ...
# [ORG: 'Georgia-Pacific'] 'in' [LOC: 'Atlanta']
```