iT邦幫忙

2023 iThome Ironman Contest (鐵人賽)

DAY 28

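NLP Steps


Collect documents into a corpus


Tokenization (splitting text into lexical tokens)

Sentence tokenization & language detection

  • Text is usually split into sentences first
  • Language detection uses langdetect: pip install langdetect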

I:

import nltk
from nltk.tokenize import sent_tokenize
from langdetect import detect

# Sentence tokenizer models; newer NLTK releases may ask for 'punkt_tab' instead
nltk.download('punkt')

# langdetect is probabilistic, so very short inputs can be misdetected
# (the English sample is reported as 'so' in the output below)
DeutschText = 'Tschüss, Danke'
print(sent_tokenize(DeutschText))
for sentence in sent_tokenize(DeutschText, language='german'):
    print(sentence)
    print(detect(sentence))
EngText = 'GoodBye, Thank you'
for sentence in sent_tokenize(EngText, language='english'):
    print(sentence)
    print(detect(sentence))
JpText = 'さようなら、ありがとう'
for sentence in sent_tokenize(JpText):
    print(sentence)
    print(detect(sentence))
EsText = 'Adios, Gracias'
for sentence in sent_tokenize(EsText, language='spanish'):
    print(sentence)
    print(detect(sentence))

O:

['Tschüss, Danke']
Tschüss, Danke
de
GoodBye, Thank you
so
さようなら、ありがとう
ja
Adios, Gracias
es
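Each sample above is a single sentence, so no actual splitting is visible. As a rough sketch of the idea only (this is not how NLTK's punkt models work), the hypothetical helper `naive_sent_split` below splits on whitespace that follows sentence-ending punctuation, using just the standard library:

```python
import re


def naive_sent_split(text):
    """Split after '.', '!' or '?' when followed by whitespace (illustrative only)."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]


sample = "Guten Tag. Wie geht es dir? Danke!"
print(naive_sent_split(sample))
```

A rule this simple breaks on abbreviations such as "Dr. Smith", which is one reason trained sentence tokenizers are usually preferred.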

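Word tokenization

  • The words in a sentence are split into basic units => tokenization
  • Common approaches: regular expressions, syntactic parsing, or a pretrained model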
    I:
import nltk
sent = "I am almost dead this time"
token = nltk.word_tokenize(sent)
print(token)

O:

['I', 'am', 'almost', 'dead', 'this', 'time']
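The regular-expression approach mentioned above can be sketched without NLTK. The helper name `regex_tokenize` and its pattern are assumptions chosen for illustration; the pattern keeps runs of word characters and treats each punctuation mark as its own token:

```python
import re


def regex_tokenize(sentence):
    """Return runs of word characters, or single non-space punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)


print(regex_tokenize("I am almost dead this time"))
print(regex_tokenize("Hello, world!"))
```

Unlike `nltk.word_tokenize`, this sketch has no special handling for contractions or abbreviations.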

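Removing stop words

  • Stop words are words that occur very frequently in text and usually carry little meaningful information, so they are often removed in NLP

  • Listing the stop words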

import nltk
nltk.download('stopwords')

from nltk.corpus import stopwords

ENGsw = set(stopwords.words('english'))
n = 8
for i, word in enumerate(ENGsw):
    if i % n == 0 and i != 0:
        print()
    print(word, end=' , ')

  • Stop word list for English
because , that , s , how , by , you're , were , doing , 
y , below , has , d , above , no , an , that'll , 
shan , all , such , i , than , mightn't , not , is , 
before , hasn't , they , having , you'd , had , while , wasn't , 
you've , we , too , through , shouldn , yourself , was , them , 
off , don't , why , so , herself , he , ain , him , 
more , on , both , mightn , other , and , over , who , 
there , my , some , aren't , between , same , up , re , 
yourselves , she , wouldn't , once , aren , doesn't , hadn't , ours , 
wasn , the , in , mustn , it , been , whom , t , 
itself , from , with , ll , have , shouldn't , against , again , 
doesn , being , haven , if , be , only , hadn , o , 
won't , just , for , wouldn , to , are , or , mustn't , 
isn't , own , do , very , yours , theirs , then , our , 
couldn , won , hasn , when , at , it's , into , didn't , 
about , hers , but , those , himself , haven't , its , you , 
few , ourselves , during , myself , am , further , don , you'll , 
can , isn , these , ve , his , she's , does , weren , 
any , needn , should , her , now , themselves , couldn't , out , 
me , should've , most , under , which , where , as , down , 
what , each , this , your , nor , after , weren't , m , 
didn , ma , shan't , did , until , of , a , will , 
here , their , needn't , 
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
  • Stop word list for Traditional Chinese (ZH-TC)

  • pip install opencc

from nltk.corpus import stopwords
from opencc import OpenCC

# Convert Simplified Chinese to Traditional Chinese (Taiwan standard)


def convert_to_traditional(text):
    cc = OpenCC('s2twp')
    return cc.convert(text)


sw = set(stopwords.words('chinese'))
n = 8

for i, word in enumerate(sw):
    if i % n == 0 and i != 0:
        print()
    traditional_word = convert_to_traditional(word)
    print(traditional_word, end=',')
多次,哎,可是,以下,不怕,每天,然後,一邊,
以後,並不是,倘使,所,為何,嚇,哪些,莫若,
絕對,被,各級,達到,認為,成為,設若,特點,
幫助,相對,各自,哈哈,已經,當時,哈,呀,
瞭解,一來,是不是,不是,相信,繼續,人家,故,
各地,滿足,看看,每年,那些,直到,其餘,現代,
過來,一些,即便,爾後,任,不論,仍然,不久,
決定,到,咱,維持,不特,強調,總的來說,替,
組成,啥,防止,當,進而,彼此,無寧,也是,
作為,即令,照,適用,保持,只要,另一方面,向著,
遇到,己,得到,烏乎,省得,這麼些,這麼樣,其一,
啐,任憑,本,何況,上來,各個,兩者,由此可見,
乃至,假如,其,按,換言之,諸位,那麼樣,附近,
需要,能夠,這裡,縱,因,無法,恰恰相反,自家,
下面,要麼,論,毋寧,本著,多數,主要,給,
只是,別說,不變,嗎,一方面,它,於是,不一,
你的,一則,呸,來著,較之,除此之外,吱,總的說來,
嘔,哼,似的,大大,合理,說明,不敢,吧噠,
該,最高,失去,安全,目前,打,特別是,起見,
裡面,那邊,並,重大,叫做,它的,緊接著,受到,
上升,前進,知道,深入,互相,至,寧肯,哪年,
不能,如上所述,雖則,地,如其,好的,上述,可,
極了,全面,可以,或者,變成,集中,乃,真是,
於是乎,透過,跟,用,自個兒,出去,規定,此外,
不但,全部,戰鬥,中間,接著,造成,者,廣大,
練習,突然,每,設使,俺們,什麼,憑,的,
她的,另外,換句話說,這,在,若是,自各兒,往,
且,為什麼,嘻,根本,能否,大批,有著,而況,
啪達,它們,我,共同,之前,注意,普通,比方,
著,但,唄,應當,移動,良好,上面,今後,
反之,必然,的話,為主,產生,不比,今年,後面,
吧,旁人,最大,為,而且,臨,甚至,考慮,
譁,不拘,怎,從事,以前,雖然,看來,做到,
存在,某個,普遍,可能,各人,總而言之,各位,具體說來,
前面,嗯,中小,這兒,形成,隨著,當著,適應,
與,相反,罷了,一片,豐富,怎麼,加以,並且,
採取,一致,也罷,嘎登,有著,最後,過去,大多數,
這麼,限制,哼唷,然則,此間,允許,寧,怎麼辦,
必須,正如,之後,以及,即,既是,歡迎,前者,
尤其,個,不獨,因而,宣佈,周圍,時候,依照,
騰,怎麼樣,大力,部分,呼哧,好象,嘛,豈但,
但是,有的,越是,呃,無論,把,則,咋,
嗚,倘然,複雜,望,同時,矣,毫不,先後,
相應,自身,通常,主張,們,鞏固,不過,其二,
範圍,怎麼,嘎,之後,當然,往往,爭取,慢說,
哪個,並沒有,遭到,鑑於,其它,能,將,喲,
哎呀,之,反過來,焉,盡,此,沿著,由,
縱令,除了,不得,朝著,就是,以至,其次,哇,
如此,在下,有力,與其,顯著,實現,構成,大約,
分別,接著,一般,上下,真正,屬於,而是,當前,
還是,據,應用,行動,及時,云云,何,密切,
這會兒,不同,表明,這樣,趁著,對應,其實,那時,
嚴格,他們,完成,即若,其他,總之,朝,那樣,
啊,是,也,同樣,各種,嗚呼,徹底,總的來看,
嚴重,故此,即或,曾經,管,使得,一切,咦,
一,假使,因此,什麼,比如,要,以後,不單,
及其,哪天,就是說,以,沿,於,邊,一起,
這些,趕,最後,獲得,那,此時,從,他人,
彼,正常,結果,倘或,靠,避免,看到,特殊,
覺得,而,專門,及,相當,有所,同一,有,
相同,除非,經過,上去,假若,寧願,確定,少數,
然而,直接,心裡,問題,企圖,出來,待,咚,
願意,起,今後,看出,及至,只限,這時,別的,
除,著呢,轉動,更加,總是,呢,了,那個,
加入,所以,你,若,任務,哪樣,之一,開展,
之類,趁,出現,處理,巨大,要不,最近,俺,
這麼點兒,某些,相對而言,哦,這麼,今天,寧可,自從,
為了,隨,各,之所以,至於,強烈,憑藉,與此同時,
而言,向,舉行,充分,幾乎,噓,開始,準備,
也好,後來,拿,等等,不然,他的,不只,結合,
一面,大量,雖,不管,只有,況且,應該,即使,
堅決,以便,啦,基本,方面,不斷,首先,逐步,
每當,起來,促進,卻不,掌握,要不是,認識,另,
既然,非徒,呵,喏,左右,果真,同,衝,
根據,隨著,不足,照著,召開,總結,乎,固然,
儘管,綜上所述,反過來說,有些,行為,我的,比較,後面,
而已,一定,對於,繼而,以致,反映,原來,所有,
為什麼,那裡,有點,加強,一時,比,經常,這個,
兮,過,咳,哪,什麼樣,引起,以為,再者,
得,得出,這邊,如何,要不然,轉貼,嗬,一直,
要求,其中,歸,顯然,前後,完全,就,要是,
似乎,她們,清楚,那兒,讓,現在,由於,幾時,
多少,鄙人,是否,何時,下去,順著,廣泛,重新,
或,如,來,那麼,不夠,以至於,雖說,反應,
抑或,轉變,依,後來,那麼,這種,以外,偉大,
和,有效,自己,具體地說,她,意思,立即,喔唷,
所謂,非但,擴大,乘,以免,順,何處,有時,
迅速,然後,如若,開外,沒有,個人,以來,關於,
認真,取得,一下,哪裡,方便,雙方,不可,容易,
既,正在,果然,按照,進入,不僅,誰知,甚麼,
叮咚,某,叫,有關,不會,整個,又,逐漸,
依靠,若非,以上,是的,再說,不惟,咱們,萬一,
怎樣,每個,一次,並不,突出,具體,較,嗡嗡,
不光,加之,喂,為著,那麼些,與否,個別,堅持,
第,下來,漫說,連同,有利,像,嘿,積極,
而外,唉,我們,等,下列,尚且,一旦,一樣,
別,先生,大家,哎喲,縱然,他,實際,哪怕,
表示,重要,離,倘,非常,例如,最好,如下,
相等,縱使,倘若,噯,你們,任何,哪兒,多,
您,可見,甚而,麼,它們的,不成,對,阿,
必要,那會兒,代替,不問,如果,聯絡,進行,經,
哉,相似,使用,自,幾,許多,還有,適當,
這就是說,一天,進步,從而,不要,不如,具有,十分,
因為,常常,人們,明顯,明確,看見,連,高興,
先後,哩,運用,冒,哪邊,否則,或是,誰,
這點,
  • Stop word list for English
from nltk.corpus import stopwords
from opencc import OpenCC

# Convert Simplified Chinese to Traditional Chinese (leaves English words unchanged)

def convert_to_traditional(text):
    cc = OpenCC('s2twp')
    return cc.convert(text)

sw = set(stopwords.words('english'))
n = 8

for i, word in enumerate(sw):
    if i % n == 0 and i != 0:
        print()
    traditional_word = convert_to_traditional(word)
    print(traditional_word, end=', ')
because, that, s, how, by, you're, were, doing, 
y, below, has, d, above, no, an, that'll, 
shan, all, such, i, than, mightn't, not, is, 
before, hasn't, they, having, you'd, had, while, wasn't, 
you've, we, too, through, shouldn, yourself, was, them, 
off, don't, why, so, herself, he, ain, him, 
more, on, both, mightn, other, and, over, who, 
there, my, some, aren't, between, same, up, re, 
yourselves, she, wouldn't, once, aren, doesn't, hadn't, ours, 
wasn, the, in, mustn, it, been, whom, t, 
itself, from, with, ll, have, shouldn't, against, again, 
doesn, being, haven, if, be, only, hadn, o, 
won't, just, for, wouldn, to, are, or, mustn't, 
isn't, own, do, very, yours, theirs, then, our, 
couldn, won, hasn, when, at, it's, into, didn't, 
about, hers, but, those, himself, haven't, its, you, 
few, ourselves, during, myself, am, further, don, you'll, 
can, isn, these, ve, his, she's, does, weren, 
any, needn, should, her, now, themselves, couldn't, out, 
me, should've, most, under, which, where, as, down, 
what, each, this, your, nor, after, weren't, m, 
didn, ma, shan't, did, until, of, a, will, 
here, their, needn't
  • Stop word list for German
from nltk.corpus import stopwords
from opencc import OpenCC

# Convert Simplified Chinese to Traditional Chinese (leaves German words unchanged)

def convert_to_traditional(text):
    cc = OpenCC('s2twp')
    return cc.convert(text)

sw = set(stopwords.words('german'))
n = 8

for i, word in enumerate(sw):
    if i % n == 0 and i != 0:
        print()
    traditional_word = convert_to_traditional(word)
    print(traditional_word, end=', ')
seiner, von, einer, sich, jener, sie, dieselben, wollen, 
andere, da, dazu, das, einmal, keinen, keines, um, 
an, diesem, jenes, zu, jedem, war, dessen, die, 
alles, haben, es, einem, derselben, ich, sein, meiner, 
solchem, eine, unseres, dich, als, seinem, gewesen, nicht, 
meinem, also, deines, eurer, manches, was, nur, dort, 
habe, so, kein, welchem, sonst, diesen, wirst, wie, 
ihr, andern, meine, meinen, sollte, mein, zum, würden, 
oder, welches, dein, dieses, zwischen, welchen, dieselbe, einiger, 
auch, anderem, er, euer, derselbe, ob, und, nichts, 
der, dem, ein, wird, nun, ihm, keine, einen, 
manche, in, anderes, deine, dieser, keiner, etwas, man, 
deiner, seinen, jeden, ihnen, auf, unseren, bis, könnte, 
doch, welche, daß, im, unsere, solche, solchen, anderer, 
hatten, solches, vor, wenn, hinter, allem, noch, wollte, 
seine, euren, indem, hab, damit, sind, ihrem, ander, 
hatte, allen, anderr, jeder, einigen, unter, du, mich, 
welcher, kann, über, jenen, ihren, warst, jene, denselben, 
diese, durch, solcher, soll, unserem, wir, würde, weil, 
eurem, werde, euch, weiter, können, mit, bei, jetzt, 
ohne, uns, ins, muss, ihn, aller, desselben, dir, 
jenem, vom, werden, wo, zwar, einig, aber, am, 
für, wieder, dies, denn, bin, den, ihre, des, 
gegen, ihrer, musste, waren, ihres, dann, deinen, meines, 
weg, jede, derer, jedes, anderm, unser, eure, machen, 
einige, eures, nach, deinem, manchem, dass, einigem, ist, 
sondern, während, bist, keinem, aus, hat, selbst, sehr, 
eines, viel, anderen, demselben, dasselbe, zur, mancher, seines, 
hier, einiges, anders, alle, manchen, will, hin, mir, 
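The listings above only print the stop words. A minimal sketch of actually filtering tokens could look like the following; `STOP_WORDS` is a small hand-picked subset of the English list, used here as an assumption so the example runs without downloading the NLTK corpus:

```python
# Hand-picked subset of the English stop words (assumption: a real pipeline
# would use the full set returned by stopwords.words('english')).
STOP_WORDS = {"i", "am", "this"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form is a stop word."""
    return [t for t in tokens if t.lower() not in stop_words]


tokens = ['I', 'am', 'almost', 'dead', 'this', 'time']
print(remove_stop_words(tokens))
```

Lowercasing before the lookup matters because the NLTK lists are stored in lowercase.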

Stemming

  • Reduces a word to its stem (base form); the result is not always a dictionary word
  • Handles the inflected forms of a word
    I:
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.snowball import PorterStemmer
from nltk.stem.lancaster import LancasterStemmer

# ignore_stopwords=True leaves stop words unstemmed (needs the NLTK 'stopwords' corpus)
ess = SnowballStemmer('english', ignore_stopwords=True)
print(ess.stem('files'))
fss = SnowballStemmer('french', ignore_stopwords=True)
print(fss.stem('courais'))

# Compare three stemmers on an irregular plural
print(ess.stem('teeth'))
ps = PorterStemmer()
print(ps.stem('teeth'))
ls = LancasterStemmer()
print(ls.stem('teeth'))

O:

file
cour
teeth
teeth
tee
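To make the idea of suffix stripping concrete, here is a deliberately tiny toy stemmer. It is not the Porter, Snowball, or Lancaster algorithm; the suffix list and the minimum stem length are assumptions chosen for illustration:

```python
SUFFIXES = ("ing", "ed", "s")  # checked in this order


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for w in ["files", "walking", "jumped", "teeth"]:
    print(w, "->", toy_stem(w))
```

Like the rule-based stemmers above, it leaves the irregular plural "teeth" alone, because no suffix rule matches it.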

Build a common vocabulary


Document vectorization

Count vectorization

I:

from sklearn.feature_extraction.text import CountVectorizer
corpus = [
    'Hallo, ich bin Student',
    'ich bin 20 Jahre alt',
    'Du auch? Danke'
]
cv = CountVectorizer()
vectorized_corpus = cv.fit_transform(corpus)
print(vectorized_corpus.todense())

O:

[[0 0 0 1 0 0 1 1 0 1]
 [1 1 0 1 0 0 0 1 1 0]
 [0 0 1 0 1 1 0 0 0 0]]
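To see where each column of this matrix comes from, the counting can be reproduced in plain Python. This is a sketch that mirrors what I understand to be CountVectorizer's defaults (lowercasing, tokens of two or more word characters, alphabetically sorted vocabulary); it is not scikit-learn's implementation:

```python
import re

corpus = [
    'Hallo, ich bin Student',
    'ich bin 20 Jahre alt',
    'Du auch? Danke',
]


def tokenize(text):
    # Lowercase, then keep runs of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())


tokenized = [tokenize(doc) for doc in corpus]
vocabulary = sorted({tok for doc in tokenized for tok in doc})
matrix = [[doc.count(term) for term in vocabulary] for doc in tokenized]

print(vocabulary)
for row in matrix:
    print(row)
```

The printed vocabulary gives the meaning of the ten columns: the first column counts "20", the last counts "student".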

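TF-IDF (term frequency-inverse document frequency) vectorization

  • Term frequency (TF) is how often a given term appears in a document. Terms with a high frequency are more important to that document.
  • Inverse document frequency (IDF) is how rare a given term is across the whole corpus. Rare terms get a higher inverse document frequency.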
    I:
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    '這是第一個文件。',
    '這是第二個文件。',
    '這是第三個文件。',
    '這是一個示例文件。',
]

# Initialize the TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer()

# Tokenize the documents and compute their TF-IDF vectors
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)

# The TF-IDF matrix
print(tfidf_matrix.toarray())

O:

[[0. 1. 0. 0.]
 [0. 0. 0. 1.]
 [0. 0. 1. 0.]
 [1. 0. 0. 0.]]
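Every row of this matrix holds a single 1.0 because the Chinese sentences contain no spaces: a whitespace and punctuation based token pattern treats each sentence as one token, so each document ends up with exactly one unique term. The sketch below illustrates this with a regular expression of the same shape as scikit-learn's default token pattern; the space-separated version is a hypothetical manual segmentation of the kind a word segmenter would normally produce:

```python
import re

TOKEN_PATTERN = r"\b\w\w+\b"  # same shape as scikit-learn's default token pattern

unsegmented = '這是第一個文件。'
print(re.findall(TOKEN_PATTERN, unsegmented))

# Hypothetical manual segmentation into words, separated by spaces
segmented = '這是 第一個 文件。'
print(re.findall(TOKEN_PATTERN, segmented))
```

Segmenting the documents before vectorizing lets shared words such as "文件" appear as common terms across documents.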

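Classify or cluster the documents


Text representation models

  • Text must first be quantified (quantification), that is, turned into numbers. An algorithm that represents language with numbers is called a language model.

Word Embedding

Bag-of-words

  • Pros: intuitive, easy to implement
  • Cons: cannot express the semantic relationship between neighboring words

Bag of Words Model (BoW)

a. 看到他我就不爽。 ("Seeing him annoys me.") Segment first => 看到/他/我/就/不爽
b. 看到他我就火大。 ("Seeing him makes me furious.") Segment first => 看到/他/我/就/火大
  • Cons:
    • Prone to the curse of dimensionality
    • Vectors are very sparse
    • Cannot express meaning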
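The two example sentences can be turned into bag-of-words vectors with a few lines of plain Python. This is a minimal sketch; the first-seen vocabulary order and the cosine similarity check are choices made for illustration:

```python
import math

doc_a = ["看到", "他", "我", "就", "不爽"]
doc_b = ["看到", "他", "我", "就", "火大"]

# Vocabulary in first-seen order
vocabulary = list(dict.fromkeys(doc_a + doc_b))


def bow_vector(tokens):
    """Count how often each vocabulary term occurs in the token list."""
    return [tokens.count(term) for term in vocabulary]


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))


vec_a, vec_b = bow_vector(doc_a), bow_vector(doc_b)
print(vocabulary)
print(vec_a)
print(vec_b)
print(round(cosine(vec_a, vec_b), 3))
```

The similarity comes entirely from the four shared words. "不爽" and "火大" are near-synonyms, yet they occupy unrelated dimensions, which is the "cannot express meaning" limitation in practice.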

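BoW extension: TF (term frequency)-IDF (inverse document frequency)

  • The frequency of a given term within a document, multiplied by the logarithm of the ratio between the total number of documents and the number of documents that contain the term
"蘋果" ("apple") appears 10 times in document A, so TF = 10.
"蘋果" appears in 100 documents of the corpus, and the corpus holds 1,000 documents in total.
We can now compute the IDF of "蘋果" (using the natural logarithm):

IDF("蘋果") = log(1,000 / 100) = log(10) ≈ 2.3026

Finally, the TF-IDF score of "蘋果" is:

TF-IDF("蘋果") = TF * IDF = 10 * 2.3026 ≈ 23.026

So the relative importance of "蘋果" in document A is about 23.026.

TF-IDF is used in text retrieval, information retrieval, text classification, and other natural language processing tasks to assess the relative importance of a word in a text.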
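The arithmetic above can be checked directly. The variable names are illustrative, and the formula is the plain textbook form used in this example:

```python
import math

tf = 10          # occurrences of the term in document A
n_docs = 1000    # documents in the corpus
doc_freq = 100   # documents that contain the term

idf = math.log(n_docs / doc_freq)  # natural logarithm
tf_idf = tf * idf

print(round(idf, 4))
print(round(tf_idf, 3))
```

Library implementations often use variants of this formula. As far as I know, scikit-learn's TfidfVectorizer smooths the IDF and L2-normalizes each row by default, so its numbers will not match this hand calculation.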


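BoW extension: CBoW & Skip-gram Model

CBoW

  • Given the vectors of the context words, predict the target word in the middle
  1. Suppose we have the sentence "我喜歡學習自然語言處理" ("I like learning natural language processing").
  2. We split the sentence into words: "我", "喜歡", "學習", "自然語言處理".
  3. For each target word, the model collects the context words around it. For the target word "學習", the context words are ["我", "喜歡", "自然語言處理"].
  4. The model averages the word vectors of the context words and uses that average to predict the target word.

Target word: "學習"
Context words: ["我", "喜歡", "自然語言處理"]
The model learns a word vector for "學習" that best captures the semantics of its context.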
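Steps 3 and 4 can be sketched in plain Python. The window size of 2 and the two-dimensional embedding values are made-up assumptions; a real model would learn the embeddings during training rather than use fixed numbers:

```python
tokens = ["我", "喜歡", "學習", "自然語言處理"]
WINDOW = 2  # assumed window size: up to two words on each side


def cbow_pairs(tokens, window=WINDOW):
    """Return (context_words, target_word) training pairs."""
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        pairs.append((context, target))
    return pairs


# Toy 2-dimensional embeddings with made-up values (hypothetical, untrained)
embeddings = {
    "我": [1.0, 0.0],
    "喜歡": [0.0, 1.0],
    "學習": [0.5, 0.5],
    "自然語言處理": [2.0, 2.0],
}


def average_vectors(words):
    """Average the embedding vectors of the given words, dimension by dimension."""
    dims = len(next(iter(embeddings.values())))
    return [sum(embeddings[w][d] for w in words) / len(words) for d in range(dims)]


for context, target in cbow_pairs(tokens):
    print(target, "<-", context, average_vectors(context))
```

In a trained CBoW model the averaged context vector is fed to an output layer that scores every vocabulary word as a candidate target.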

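Skip-gram

  • Predict the surrounding context words from the target word
  • Commonly used on large corpora; it tends to represent rare words better than CBoW, at the cost of slower training
  1. Take "我喜歡學習自然語言處理" as the example again.
  2. Split it into words: "我", "喜歡", "學習", "自然語言處理".
  3. Treat each word as the target word and predict the context words around it. For the target word "學習", the context words can be ["我", "喜歡", "自然語言處理"].
  4. The model uses the word vector of the target word to predict each of its context words.

Target word: "學習"
Context words: ["我", "喜歡", "自然語言處理"]
The model learns a word vector for "學習" that best predicts its context words.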
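The training pairs for Skip-gram can be generated the same way, with the direction reversed: each pair is one target word and one of its context words. The window size of 2 is again an assumption made for illustration:

```python
tokens = ["我", "喜歡", "學習", "自然語言處理"]
WINDOW = 2  # assumed window size


def skipgram_pairs(tokens, window=WINDOW):
    """Return one (target_word, context_word) pair per context position."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


for pair in skipgram_pairs(tokens):
    if pair[0] == "學習":
        print(pair)
```

Each target word yields several training pairs here, whereas CBoW produces a single example per target.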

Scikit-learn Example

Example from the web

from sklearn.datasets import fetch_openml  # loads datasets from OpenML
from sklearn.datasets import fetch_california_housing  # loads the California housing dataset
import pandas as pd  # data handling
import numpy as np  # numerical computation

data_url = "http://lib.stat.cmu.edu/datasets/boston"  # dataset URL

raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
# sep=r"\s+": fields are separated by whitespace
# skiprows=22: skip the first 22 lines
# header=None: the data has no header row

data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])  # take parts of the data and stack them horizontally
# raw_df.values[::2, :]: all columns of the even-indexed rows
# raw_df.values[1::2, :2]: the first two columns of the odd-indexed rows

target = raw_df.values[1::2, 2]  # target values

housing = fetch_california_housing()  # load the California housing dataset

housing = fetch_openml(name="house_prices", as_frame=True)  # load the "house_prices" dataset (overwrites the previous value of housing)

print(housing)

{'data':         Id  MSSubClass MSZoning  LotFrontage  LotArea Street Alley LotShape LandContour  ... PoolArea PoolQC  Fence MiscFeature MiscVal MoSold YrSold SaleType  SaleCondition
0        1          60       RL         65.0     8450   Pave  None      Reg         Lvl  ...        0   None   None        None       0      2   2008       WD         Normal
1        2          20       RL         80.0     9600   Pave  None      Reg         Lvl  ...        0   None   None        None       0      5   2007       WD         Normal
2        3          60       RL         68.0    11250   Pave  None      IR1         Lvl  ...        0   None   None        None       0      9   2008       WD         Normal
3        4          70       RL         60.0     9550   Pave  None      IR1         Lvl  ...        0   None   None        None       0      2   2006       WD        Abnorml
4        5          60       RL         84.0    14260   Pave  None      IR1         Lvl  ...        0   None   None        None       0     12   2008       WD         Normal
...    ...         ...      ...          ...      ...    ...   ...      ...         ...  ...      ...    ...    ...         ...     ...    ...    ...      ...            ...
1455  1456          60       RL         62.0     7917   Pave  None      Reg         Lvl  ...        0   None   None        None       0      8   2007       WD         Normal
1456  1457          20       RL         85.0    13175   Pave  None      Reg         Lvl  ...        0   None  MnPrv        None       0      2   2010       WD         Normal
1457  1458          70       RL         66.0     9042   Pave  None      Reg         Lvl  ...        0   None  GdPrv        Shed    2500      5   2010       WD         Normal
1458  1459          20       RL         68.0     9717   Pave  None      Reg         Lvl  ...        0   None   None        None       0      4   2010       WD         Normal
1459  1460          20       RL         75.0     9937   Pave  None      Reg         Lvl  ...        0   None   None        None       0      6   2008       WD         Normal

[1460 rows x 80 columns], 'target': 0       208500
1       181500
2       223500
3       140000
4       250000
         ...
1455    175000
1456    210000
1457    266500
1458    142125
1459    147500
Name: SalePrice, Length: 1460, dtype: int64, 'frame':         Id  MSSubClass MSZoning  LotFrontage  LotArea Street Alley LotShape LandContour Utilities  ... PoolArea PoolQC  Fence MiscFeature MiscVal MoSold YrSold  SaleType  SaleCondition  SalePrice
0        1          60       RL         65.0     8450   Pave  None      Reg         Lvl    AllPub  ...        0   None   None        None       0      2   2008        WD         Normal     208500
1        2          20       RL         80.0     9600   Pave  None      Reg         Lvl    AllPub  ...        0   None   None        None       0      5   2007        WD         Normal     181500
2        3          60       RL         68.0    11250   Pave  None      IR1         Lvl    AllPub  ...        0   None   None        None       0      9   2008        WD         Normal     223500
3        4          70       RL         60.0     9550   Pave  None      IR1         Lvl    AllPub  ...        0   None   None        None       0      2   2006        WD        Abnorml     140000
4        5          60       RL         84.0    14260   Pave  None      IR1         Lvl    AllPub  ...        0   None   None        None       0     12   2008        WD         Normal     250000
...    ...         ...      ...          ...      ...    ...   ...      ...         ...       ...  ...      ...    ...    ...         ...     ...    ...    ...       ...            ...        ...
1455  1456          60       RL         62.0     7917   Pave  None      Reg         Lvl    AllPub  ...        0   None   None        None       0      8   2007        WD         Normal     175000
1456  1457          20       RL         85.0    13175   Pave  None      Reg         Lvl    AllPub  ...        0   None  MnPrv        None       0      2   2010        WD         Normal     210000
1457  1458          70       RL         66.0     9042   Pave  None      Reg         Lvl    AllPub  ...        0   None  GdPrv        Shed    2500      5   2010        WD         Normal     266500
1458  1459          20       RL         68.0     9717   Pave  None      Reg         Lvl    AllPub  ...        0   None   None        None       0      4   2010        WD         Normal     142125
1459  1460          20       RL         75.0     9937   Pave  None      Reg         Lvl    AllPub  ...        0   None   None        None       0      6   2008        WD         Normal     147500

[1460 rows x 81 columns], 'categories': None, 'feature_names': ['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', 'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual', 'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType', 'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual', 'GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'PoolQC', 'Fence', 'MiscFeature', 'MiscVal', 'MoSold', 'YrSold', 'SaleType', 'SaleCondition'], 'target_names': ['SalePrice'], 'DESCR': "Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. 
But this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.\n\nWith 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.\n\nMSSubClass: Identifies the type of dwelling involved in the sale.\t\n\n        20\t1-STORY 1946 & NEWER ALL STYLES\n        30\t1-STORY 1945 & OLDER\n        40\t1-STORY W/FINISHED ATTIC ALL AGES\n        45\t1-1/2 STORY - UNFINISHED ALL AGES\n        50\t1-1/2 STORY FINISHED ALL AGES\n        60\t2-STORY 1946 & NEWER\n        70\t2-STORY 1945 & OLDER\n        75\t2-1/2 STORY ALL AGES\n        80\tSPLIT OR MULTI-LEVEL\n        85\tSPLIT FOYER\n        90\tDUPLEX - ALL STYLES AND AGES\n       120\t1-STORY PUD (Planned Unit Development) - 1946 & NEWER\n       150\t1-1/2 STORY PUD - ALL AGES\n       160\t2-STORY PUD - 1946 & NEWER\n       180\tPUD - MULTILEVEL - INCL SPLIT LEV/FOYER\n       190\t2 FAMILY CONVERSION - ALL STYLES AND AGES\n\nMSZoning: Identifies the general zoning classification of the sale.\n\t\t\n       A\tAgriculture\n       C\tCommercial\n       FV\tFloating Village Residential\n       I\tIndustrial\n       RH\tResidential High Density\n       RL\tResidential Low Density\n       RP\tResidential Low Density Park \n       RM\tResidential Medium Density\n\t\nLotFrontage: Linear feet of street connected to property\n\nLotArea: Lot size in square feet\n\nStreet: Type of road access to property\n\n       Grvl\tGravel\t\n       Pave\tPaved\n       \t\nAlley: Type of alley access to property\n\n       Grvl\tGravel\n       Pave\tPaved\n       NA \tNo alley access\n\t\t\nLotShape: General shape of property\n\n       Reg\tRegular\t\n       IR1\tSlightly irregular\n       IR2\tModerately Irregular\n       IR3\tIrregular\n       \nLandContour: Flatness of the property\n\n       Lvl\tNear Flat/Level\t\n       Bnk\tBanked - Quick and significant 
rise from street grade to building\n       HLS\tHillside - Significant slope from side to side\n       Low\tDepression\n\t\t\nUtilities: Type of utilities available\n\t\t\n       AllPub\tAll public Utilities (E,G,W,& S)\t\n       NoSewr\tElectricity, Gas, and Water (Septic Tank)\n       NoSeWa\tElectricity and Gas Only\n       ELO\tElectricity only\t\n\t\nLotConfig: Lot configuration\n\n       Inside\tInside lot\n       Corner\tCorner lot\n       CulDSac\tCul-de-sac\n       FR2\tFrontage on 2 sides of property\n       FR3\tFrontage on 3 sides of property\n\t\nLandSlope: Slope of property\n\t\t\n       Gtl\tGentle slope\n       Mod\tModerate Slope\t\n       Sev\tSevere Slope\n\t\nNeighborhood: Physical locations within Ames city limits\n\n       Blmngtn\tBloomington Heights\n       Blueste\tBluestem\n       BrDale\tBriardale\n       BrkSide\tBrookside\n       ClearCr\tClear Creek\n       CollgCr\tCollege Creek\n       Crawfor\tCrawford\n       Edwards\tEdwards\n       Gilbert\tGilbert\n       IDOTRR\tIowa DOT and Rail Road\n       MeadowV\tMeadow Village\n       Mitchel\tMitchell\n       Names\tNorth Ames\n       NoRidge\tNorthridge\n       NPkVill\tNorthpark Villa\n       NridgHt\tNorthridge Heights\n       NWAmes\tNorthwest Ames\n       OldTown\tOld Town\n       SWISU\tSouth & West of Iowa State University\n       Sawyer\tSawyer\n       SawyerW\tSawyer West\n       Somerst\tSomerset\n       StoneBr\tStone Brook\n       Timber\tTimberland\n       Veenker\tVeenker\n\t\t\t\nCondition1: Proximity to various conditions\n\t\n       Artery\tAdjacent to arterial street\n       Feedr\tAdjacent to feeder street\t\n       Norm\tNormal\t\n       RRNn\tWithin 200' of North-South Railroad\n       RRAn\tAdjacent to North-South Railroad\n       PosN\tNear positive off-site feature--park, greenbelt, etc.\n       PosA\tAdjacent to postive off-site feature\n       RRNe\tWithin 200' of East-West Railroad\n       RRAe\tAdjacent to East-West Railroad\n\t\nCondition2: Proximity to 
various conditions (if more than one is present)\n\t\t\n       Artery\tAdjacent to arterial street\n       Feedr\tAdjacent to feeder street\t\n       Norm\tNormal\t\n       RRNn\tWithin 200' of North-South Railroad\n       RRAn\tAdjacent to North-South Railroad\n       PosN\tNear positive off-site feature--park, greenbelt, etc.\n       PosA\tAdjacent to postive off-site feature\n       RRNe\tWithin 200' of East-West Railroad\n       RRAe\tAdjacent to East-West Railroad\n\t\nBldgType: Type of dwelling\n\t\t\n       1Fam\tSingle-family Detached\t\n       2FmCon\tTwo-family Conversion; originally built as one-family dwelling\n       Duplx\tDuplex\n       TwnhsE\tTownhouse End Unit\n       TwnhsI\tTownhouse Inside Unit\n\t\nHouseStyle: Style of dwelling\n\t\n       1Story\tOne story\n       1.5Fin\tOne and one-half story: 2nd level finished\n       1.5Unf\tOne and one-half story: 2nd level unfinished\n       2Story\tTwo story\n       2.5Fin\tTwo and one-half story: 2nd level finished\n       2.5Unf\tTwo and one-half story: 2nd level unfinished\n       SFoyer\tSplit Foyer\n       SLvl\tSplit Level\n\t\nOverallQual: Rates the overall material and finish of the house\n\n       10\tVery Excellent\n       9\tExcellent\n       8\tVery Good\n       7\tGood\n       6\tAbove Average\n       5\tAverage\n       4\tBelow Average\n       3\tFair\n       2\tPoor\n       1\tVery Poor\n\t\nOverallCond: Rates the overall condition of the house\n\n       10\tVery Excellent\n       9\tExcellent\n       8\tVery Good\n       7\tGood\n       6\tAbove Average\t\n       5\tAverage\n       4\tBelow Average\t\n       3\tFair\n       2\tPoor\n       1\tVery Poor\n\t\t\nYearBuilt: Original construction date\n\nYearRemodAdd: Remodel date (same as construction date if no remodeling or additions)\n\nRoofStyle: Type of roof\n\n       Flat\tFlat\n       Gable\tGable\n       Gambrel\tGabrel (Barn)\n       Hip\tHip\n       Mansard\tMansard\n       Shed\tShed\n\t\t\nRoofMatl: Roof material\n\n       
ClyTile\tClay or Tile\n       CompShg\tStandard (Composite) Shingle\n       Membran\tMembrane\n       Metal\tMetal\n       Roll\tRoll\n       Tar&Grv\tGravel & Tar\n       WdShake\tWood Shakes\n       WdShngl\tWood Shingles\n\t\t\nExterior1st: Exterior covering on house\n\n       AsbShng\tAsbestos Shingles\n       AsphShn\tAsphalt Shingles\n       BrkComm\tBrick Common\n       BrkFace\tBrick Face\n       CBlock\tCinder Block\n       CemntBd\tCement Board\n       HdBoard\tHard Board\n       ImStucc\tImitation Stucco\n       MetalSd\tMetal Siding\n       Other\tOther\n       Plywood\tPlywood\n       PreCast\tPreCast\t\n       Stone\tStone\n       Stucco\tStucco\n       VinylSd\tVinyl Siding\n       Wd Sdng\tWood Siding\n       WdShing\tWood Shingles\n\t\nExterior2nd: Exterior covering on house (if more than one material)\n\n       AsbShng\tAsbestos Shingles\n       AsphShn\tAsphalt Shingles\n       BrkComm\tBrick Common\n       BrkFace\tBrick Face\n       CBlock\tCinder Block\n       CemntBd\tCement Board\n       HdBoard\tHard Board\n       ImStucc\tImitation Stucco\n       MetalSd\tMetal Siding\n       Other\tOther\n       Plywood\tPlywood\n       PreCast\tPreCast\n       Stone\tStone\n       Stucco\tStucco\n       VinylSd\tVinyl Siding\n       Wd Sdng\tWood Siding\n       WdShing\tWood Shingles\n\t\nMasVnrType: Masonry veneer type\n\n       BrkCmn\tBrick Common\n       BrkFace\tBrick Face\n       CBlock\tCinder Block\n       None\tNone\n       Stone\tStone\n\t\nMasVnrArea: Masonry veneer area in square feet\n\nExterQual: Evaluates the quality of the material on the exterior \n\t\t\n       Ex\tExcellent\n       Gd\tGood\n       TA\tAverage/Typical\n       Fa\tFair\n       Po\tPoor\n\t\t\nExterCond: Evaluates the present condition of the material on the exterior\n\t\t\n       Ex\tExcellent\n       Gd\tGood\n       TA\tAverage/Typical\n       Fa\tFair\n       Po\tPoor\n\t\t\nFoundation: Type of foundation\n\t\t\n       BrkTil\tBrick & Tile\n       CBlock\tCinder 
Block\n       PConc\tPoured Contrete\t\n       Slab\tSlab\n       Stone\tStone\n       Wood\tWood\n\t\t\nBsmtQual: Evaluates the height of the basement\n\n       Ex\tExcellent (100+ inches)\t\n       Gd\tGood (90-99 inches)\n       TA\tTypical (80-89 inches)\n       Fa\tFair (70-79 inches)\n       Po\tPoor (<70 inches\n       NA\tNo Basement\n\t\t\nBsmtCond: Evaluates the general condition of the basement\n\n       Ex\tExcellent\n       Gd\tGood\n       TA\tTypical - slight dampness allowed\n       Fa\tFair - dampness or some cracking or settling\n       Po\tPoor - Severe cracking, settling, or wetness\n       NA\tNo Basement\n\t\nBsmtExposure: Refers to walkout or garden level walls\n\n       Gd\tGood Exposure\n       Av\tAverage Exposure (split levels or foyers typically score average or above)\t\n       Mn\tMimimum Exposure\n       No\tNo Exposure\n       NA\tNo Basement\n\t\nBsmtFinType1: Rating of basement finished area\n\n       GLQ\tGood Living Quarters\n       ALQ\tAverage Living Quarters\n       BLQ\tBelow Average Living Quarters\t\n       Rec\tAverage Rec Room\n       LwQ\tLow Quality\n       Unf\tUnfinshed\n       NA\tNo Basement\n\t\t\nBsmtFinSF1: Type 1 finished square feet\n\nBsmtFinType2: Rating of basement finished area (if multiple types)\n\n       GLQ\tGood Living Quarters\n       ALQ\tAverage Living Quarters\n       BLQ\tBelow Average Living Quarters\t\n       Rec\tAverage Rec Room\n       LwQ\tLow Quality\n       Unf\tUnfinshed\n       NA\tNo Basement\n\nBsmtFinSF2: Type 2 finished square feet\n\nBsmtUnfSF: Unfinished square feet of basement area\n\nTotalBsmtSF: Total square feet of basement area\n\nHeating: Type of heating\n\t\t\n       Floor\tFloor Furnace\n       GasA\tGas forced warm air furnace\n       GasW\tGas hot water or steam heat\n       Grav\tGravity furnace\t\n       OthW\tHot water or steam heat other than gas\n       Wall\tWall furnace\n\t\t\nHeatingQC: Heating quality and condition\n\n       Ex\tExcellent\n       Gd\tGood\n       
TA\tAverage/Typical\n       Fa\tFair\n       Po\tPoor\n\t\t\nCentralAir: Central air conditioning\n\n       N\tNo\n       Y\tYes\n\t\t\nElectrical: Electrical system\n\n       SBrkr\tStandard Circuit Breakers & Romex\n       FuseA\tFuse Box over 60 AMP and all Romex wiring (Average)\t\n       FuseF\t60 AMP Fuse Box and mostly Romex wiring (Fair)\n       FuseP\t60 AMP Fuse Box and mostly knob & tube wiring (poor)\n       Mix\tMixed\n\t\t\n1stFlrSF: First Floor square feet\n \n2ndFlrSF: Second floor square feet\n\nLowQualFinSF: Low quality finished square feet (all floors)\n\nGrLivArea: Above grade (ground) living area square feet\n\nBsmtFullBath: Basement full bathrooms\n\nBsmtHalfBath: Basement half bathrooms\n\nFullBath: Full bathrooms above grade\n\nHalfBath: Half baths above grade\n\nBedroom: Bedrooms above grade (does NOT include basement bedrooms)\n\nKitchen: Kitchens above grade\n\nKitchenQual: Kitchen quality\n\n       Ex\tExcellent\n       Gd\tGood\n       TA\tTypical/Average\n       Fa\tFair\n       Po\tPoor\n       \t\nTotRmsAbvGrd: Total rooms above grade (does not include bathrooms)\n\nFunctional: Home functionality (Assume typical unless deductions are warranted)\n\n       Typ\tTypical Functionality\n       Min1\tMinor Deductions 1\n       Min2\tMinor Deductions 2\n       Mod\tModerate Deductions\n       Maj1\tMajor Deductions 1\n       Maj2\tMajor Deductions 2\n       Sev\tSeverely Damaged\n       Sal\tSalvage only\n\t\t\nFireplaces: Number of fireplaces\n\nFireplaceQu: Fireplace quality\n\n       Ex\tExcellent - Exceptional Masonry Fireplace\n       Gd\tGood - Masonry Fireplace in main level\n       TA\tAverage - Prefabricated Fireplace in main living area or Masonry Fireplace in basement\n       Fa\tFair - Prefabricated Fireplace in basement\n       Po\tPoor - Ben Franklin Stove\n       NA\tNo Fireplace\n\t\t\nGarageType: Garage location\n\t\t\n       2Types\tMore than one type of garage\n       Attchd\tAttached to home\n       Basment\tBasement 
Garage\n       BuiltIn\tBuilt-In (Garage part of house - typically has room above garage)\n       CarPort\tCar Port\n       Detchd\tDetached from home\n       NA\tNo Garage\n\t\t\nGarageYrBlt: Year garage was built\n\t\t\nGarageFinish: Interior finish of the garage\n\n       Fin\tFinished\n       RFn\tRough Finished\t\n       Unf\tUnfinished\n       NA\tNo Garage\n\t\t\nGarageCars: Size of garage in car capacity\n\nGarageArea: Size of garage in square feet\n\nGarageQual: Garage quality\n\n       Ex\tExcellent\n       Gd\tGood\n       TA\tTypical/Average\n       Fa\tFair\n       Po\tPoor\n       NA\tNo Garage\n\t\t\nGarageCond: Garage condition\n\n       Ex\tExcellent\n       Gd\tGood\n       TA\tTypical/Average\n       Fa\tFair\n       Po\tPoor\n       NA\tNo Garage\n\t\t\nPavedDrive: Paved driveway\n\n       Y\tPaved \n       P\tPartial Pavement\n       N\tDirt/Gravel\n\t\t\nWoodDeckSF: Wood deck area in square feet\n\nOpenPorchSF: Open porch area in square feet\n\nEnclosedPorch: Enclosed porch area in square feet\n\n3SsnPorch: Three season porch area in square feet\n\nScreenPorch: Screen porch area in square feet\n\nPoolArea: Pool area in square feet\n\nPoolQC: Pool quality\n\t\t\n       Ex\tExcellent\n       Gd\tGood\n       TA\tAverage/Typical\n       Fa\tFair\n       NA\tNo Pool\n\t\t\nFence: Fence quality\n\t\t\n       GdPrv\tGood Privacy\n       MnPrv\tMinimum Privacy\n       GdWo\tGood Wood\n       MnWw\tMinimum Wood/Wire\n       NA\tNo Fence\n\t\nMiscFeature: Miscellaneous feature not covered in other categories\n\t\t\n       Elev\tElevator\n       Gar2\t2nd Garage (if not described in garage section)\n       Othr\tOther\n       Shed\tShed (over 100 SF)\n       TenC\tTennis Court\n       NA\tNone\n\t\t\nMiscVal: $Value of miscellaneous feature\n\nMoSold: Month Sold (MM)\n\nYrSold: Year Sold (YYYY)\n\nSaleType: Type of sale\n\t\t\n       WD \tWarranty Deed - Conventional\n       CWD\tWarranty Deed - Cash\n       VWD\tWarranty Deed - VA Loan\n       
New\tHome just constructed and sold\n       COD\tCourt Officer Deed/Estate\n       Con\tContract 15% Down payment regular terms\n       ConLw\tContract Low Down payment and low interest\n       ConLI\tContract Low Interest\n       ConLD\tContract Low Down\n       Oth\tOther\n\t\t\nSaleCondition: Condition of sale\n\n       Normal\tNormal Sale\n       Abnorml\tAbnormal Sale -  trade, foreclosure, short sale\n       AdjLand\tAdjoining Land Purchase\n       Alloca\tAllocation - two linked properties with separate deeds, typically condo with a garage unit\t\n       Family\tSale between family members\n       Partial\tHome was not completed when last assessed (associated with New Homes)\n\nDownloaded from openml.org.", 'details': {'id': '42165', 'name': 'house_prices', 'version': '1', 'description_version': '1', 'format': 'arff', 'creator': 'OkCupid', 'collection_date': 'NA', 'upload_date': '2019-10-04T12:57:32', 'language': 'English', 'licence': 'NA', 'url': 'https://api.openml.org/data/v1/download/21754539/house_prices.arff', 'parquet_url': 'http://openml1.win.tue.nl/dataset42165/dataset_42165.pq', 'file_id': '21754539', 'default_target_attribute': 'SalePrice', 'version_label': '0.1', 'visibility': 'public', 'original_data_url': 'https://www.kaggle.com/c/house-prices-advanced-regression-techniques/overview', 'paper_url': 'https://www.kaggle.com/c/house-prices-advanced-regression-techniques/overview', 'minio_url': 'http://openml1.win.tue.nl/dataset42165/dataset_42165.pq', 'status': 'active', 'processing_date': '2019-10-04 12:58:02', 'md5_checksum': 'd5ca59f8d02b1b1c127034392c0f995f'}, 'url': 'https://www.openml.org/d/42165'}
