Day7-Feature Engineering -- 2. Categorical Encoding(6)

12th iThome Ironman Contest

DAY 7

AI & Data

Machine Learning series, part 7

tjabi

2020-09-07 23:11:24


2.18 CatBoost encoding

We will use the following data frame. It has two independent variables (features) and one label (target), with ten records in total.
Rec-No | Temperature | Color | Target
------ | ----------- | ----- | ------
0 | Hot | Red | 1
1 | Cold | Yellow | 1
2 | Very Hot | Blue | 1
3 | Warm | Blue | 0
4 | Hot | Red | 1
5 | Warm | Yellow | 0
6 | Warm | Red | 1
7 | Hot | Yellow | 0
8 | Hot | Yellow | 1
9 | Cold | Yellow | 1

CatBoost encoding is a method proposed by Yandex, the largest Russian-language search engine company. It is built into the company's machine learning model, CatBoost.

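This method is a target-based encoder, similar to target encoding. It was designed to overcome the target leakage problem of Leave One Out. To do so, CatBoost encoding introduces the notion of time: the order in which observations appear in the data set. When we compute the target mean for an observation (that is, for one row), we use only the rows that come before it. Because the result depends on row order, the data set should be randomly shuffled before CatBoost encoding is applied.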
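To make the "only the rows before it" idea concrete, here is a minimal pandas sketch of an ordered target mean on the Temperature column. The smoothing form, (sum of earlier targets + prior * a) / (number of earlier rows + a), with the prior set to the global target mean and a = 1, is an assumption; it was chosen because it reproduces the Temperature_cb values shown in this post. The function name `ordered_target_encode` is hypothetical and is not part of any library.

```python
import pandas as pd

# The ten-row example data frame from this post (Temperature and Target only)
df = pd.DataFrame({
    "Temperature": ["Hot", "Cold", "Very Hot", "Warm", "Hot",
                    "Warm", "Warm", "Hot", "Hot", "Cold"],
    "Target": [1, 1, 1, 0, 1, 0, 1, 0, 1, 1],
})

def ordered_target_encode(categories, target, a=1.0):
    """Encode each row using only the targets of EARLIER rows in the same category.

    Assumed formula: (previous_sum + prior * a) / (previous_count + a),
    where prior is the global target mean.
    """
    prior = target.mean()
    grouped = target.groupby(categories)
    # Running sum per category, minus the current row's own target
    prev_sum = grouped.cumsum() - target
    # Number of earlier rows that share this row's category
    prev_count = grouped.cumcount()
    return (prev_sum + prior * a) / (prev_count + a)

encoded = ordered_target_encode(df["Temperature"], df["Target"])
print(encoded.round(4).tolist())
# [0.7, 0.7, 0.7, 0.7, 0.85, 0.35, 0.2333, 0.9, 0.675, 0.85]
```

Row 4, for example, is the second "Hot" row: the only earlier "Hot" row has target 1, so the value is (1 + 0.7) / (1 + 1) = 0.85. A row never sees its own target, which is how the leakage is avoided.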

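We can perform CatBoost encoding with either category_encoders or the CatBoost library. With category_encoders, we have to write our own code to shuffle the data first. With the CatBoost library this step is unnecessary: to prevent overfitting, the library automatically shuffles the training data set several times during training and then computes the corresponding target means.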
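The manual shuffle needed before using category_encoders can be sketched with pandas alone. This is one common way to do it, not the only one; the `random_state` value is an arbitrary choice made here for reproducibility.

```python
import pandas as pd

# The ten-row example data frame from this post
df = pd.DataFrame({
    "Temperature": ["Hot", "Cold", "Very Hot", "Warm", "Hot",
                    "Warm", "Warm", "Hot", "Hot", "Cold"],
    "Color": ["Red", "Yellow", "Blue", "Blue", "Red",
              "Yellow", "Red", "Yellow", "Yellow", "Yellow"],
    "Target": [1, 1, 1, 0, 1, 0, 1, 0, 1, 1],
})

# frac=1 samples every row once, i.e. a random permutation of the rows;
# reset_index(drop=True) renumbers the rows 0..n-1 in their new order.
shuffled = df.sample(frac=1, random_state=42).reset_index(drop=True)

X = shuffled.drop(columns=["Target"])
y = shuffled["Target"]
```

`X` and `y` can then be passed to the encoder in place of the unshuffled data.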

Using category_encoders:

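import category_encoders as ce

# Separate the features from the target
X = df.drop(['Target'], axis=1)
y = df['Target']

# Option 1: fit the encoder and encode the training data in one step
ce_CBE = ce.CatBoostEncoder(cols=['Temperature'])
dfC = ce_CBE.fit_transform(X, y)
dfC

# Option 2: fit first, then transform; y is passed to transform as well
ce_CBE = ce.CatBoostEncoder(cols=['Temperature'])
ce_CBE.fit(X, y)

# Put the encoder output (suffix '_cb') next to the original columns
X_CBE = X.join(ce_CBE.transform(X, y).add_suffix('_cb'))
X_CBE['Target'] = y
X_CBE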

Rec-No | Temperature | Color | Temperature_cb | Color_cb | Target
------ | ----------- | ----- | -------------- | -------- | ------
0 | Hot | Red | 0.700000 | Red | 1
1 | Cold | Yellow | 0.700000 | Yellow | 1
2 | Very Hot | Blue | 0.700000 | Blue | 1
3 | Warm | Blue | 0.700000 | Blue | 0
4 | Hot | Red | 0.850000 | Red | 1
5 | Warm | Yellow | 0.350000 | Yellow | 0
6 | Warm | Red | 0.233333 | Red | 1
7 | Hot | Yellow | 0.900000 | Yellow | 0
8 | Hot | Yellow | 0.675000 | Yellow | 1
9 | Cold | Yellow | 0.850000 | Yellow | 1

Using the CatBoost library:

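In this example we create an argument for the categorical variables, cat_features = [0, 1], which tells CatBoost that the two columns at positions 0 and 1 are categorical. We pass it to model.fit, and CatBoost then converts these columns to numeric ones automatically. Without it, CatBoost treats every column as numeric.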

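# Notebook shell command; in a terminal, run it without the leading "!"
!pip3 install catboost
from catboost import CatBoostRegressor

# Positions of the categorical columns in X (Temperature, Color)
cat_features = [0, 1]
X = df.drop(['Target'], axis=1)
y = df['Target']

# A deliberately tiny model: 2 iterations, tree depth 2
model = CatBoostRegressor(iterations=2, learning_rate=1, depth=2)
model.fit(X, y, cat_features=cat_features)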
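Hard-coded positions such as [0, 1] silently go wrong if the column order changes. As a small sketch using only pandas, the positions can be derived from the column names instead. The variable name `categorical_columns` is introduced here for illustration.

```python
import pandas as pd

# The ten-row example data frame from this post
df = pd.DataFrame({
    "Temperature": ["Hot", "Cold", "Very Hot", "Warm", "Hot",
                    "Warm", "Warm", "Hot", "Hot", "Cold"],
    "Color": ["Red", "Yellow", "Blue", "Blue", "Red",
              "Yellow", "Red", "Yellow", "Yellow", "Yellow"],
    "Target": [1, 1, 1, 0, 1, 0, 1, 0, 1, 1],
})

X = df.drop(columns=["Target"])

# Names of the categorical columns (illustrative variable name)
categorical_columns = ["Temperature", "Color"]

# Look up each column's position in X instead of hard-coding it
cat_features = [X.columns.get_loc(name) for name in categorical_columns]
print(cat_features)  # [0, 1]
```

The resulting list can be passed to model.fit in the same way as the hand-written one.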

The categorical encodings covered so far can be grouped into three classes by their nature:

Classic Encoders
- One hot encoding
- Count and Frequency encoding
- Binary encoding & BaseN encoding
- Label encoding
- Ordinal encoding
- Feature hashing
- Sum Encoder (Deviation Encoding or Effect Encoding)
Contrast Encoders
- Helmert encoding
- Backward Difference
- Polynomial
Bayesian Encoders
- Target encoding / Mean encoding
- Weight of Evidence
- Rare label encoding
- Leave One Out
- James-Stein
- M-estimator
- CatBoost Encoding