2.1 One Hot Encoding
2.2 Count and Frequency Encoding
2.3 Target Encoding / Mean Encoding
2.4 Ordinal Encoding
2.5 Weight of Evidence
2.6 Rare Label Encoding
2.7 Helmert Encoding
2.8 Probability Ratio Encoding
2.9 Label Encoding
2.10 Feature Hashing
2.11 Binary Encoding & BaseN Encoding
2.12 Sum Encoding (Deviation Encoding or Effect Encoding)
2.13 Backward Difference Encoding
2.14 Polynomial Encoding
2.15 Leave-One-Out Encoding
2.16 James-Stein Encoding
2.17 M-Estimate Encoding
2.18 CatBoost Encoding
We will use the following DataFrame, which has two independent variables (features) and one label (target), with ten records in total.
| Rec-No | Temperature | Color | Target |
|---|---|---|---|
| 0 | Hot | Red | 1 |
| 1 | Cold | Yellow | 1 |
| 2 | Very Hot | Blue | 1 |
| 3 | Warm | Blue | 0 |
| 4 | Hot | Red | 1 |
| 5 | Warm | Yellow | 0 |
| 6 | Warm | Red | 1 |
| 7 | Hot | Yellow | 0 |
| 8 | Hot | Yellow | 1 |
| 9 | Cold | Yellow | 1 |
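The snippets below assume this data lives in a pandas DataFrame named `df`; a minimal way to build it:

```python
import pandas as pd

# The ten-record sample used throughout this section.
df = pd.DataFrame({
    'Temperature': ['Hot', 'Cold', 'Very Hot', 'Warm', 'Hot',
                    'Warm', 'Warm', 'Hot', 'Hot', 'Cold'],
    'Color': ['Red', 'Yellow', 'Blue', 'Blue', 'Red',
              'Yellow', 'Red', 'Yellow', 'Yellow', 'Yellow'],
    'Target': [1, 1, 1, 0, 1, 0, 1, 0, 1, 1],
})
```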
2.12 Sum Encoding (Deviation Encoding or Effect Encoding)
Sum encoding compares the mean of the target for one category of a variable against the mean of the target over all categories.
import category_encoders as ce

# Sum-encode Temperature: the encoder returns an intercept column plus
# k-1 contrast columns for the k = 4 temperature levels.
sum_encoder = ce.SumEncoder(cols=['Temperature'])
df_se = sum_encoder.fit_transform(df['Temperature'])
df_se.columns = ['se_' + str(i) for i in df_se.columns]
df = pd.concat([df, df_se], axis=1)
df
| Rec-No | Temperature | Color | Target | se_intercept | se_Temperature_0 | se_Temperature_1 | se_Temperature_2 |
|---|---|---|---|---|---|---|---|
| 0 | Hot | Red | 1 | 1 | 1.0 | 0.0 | 0.0 |
| 1 | Cold | Yellow | 1 | 1 | 0.0 | 1.0 | 0.0 |
| 2 | Very Hot | Blue | 1 | 1 | 0.0 | 0.0 | 1.0 |
| 3 | Warm | Blue | 0 | 1 | -1.0 | -1.0 | -1.0 |
| 4 | Hot | Red | 1 | 1 | 1.0 | 0.0 | 0.0 |
| 5 | Warm | Yellow | 0 | 1 | -1.0 | -1.0 | -1.0 |
| 6 | Warm | Red | 1 | 1 | -1.0 | -1.0 | -1.0 |
| 7 | Hot | Yellow | 0 | 1 | 1.0 | 0.0 | 0.0 |
| 8 | Hot | Yellow | 1 | 1 | 1.0 | 0.0 | 0.0 |
| 9 | Cold | Yellow | 1 | 1 | 0.0 | 1.0 | 0.0 |
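To see why this is called deviation or effect coding, one can regress the target on the encoded columns; a minimal sketch using statsmodels (not part of the original example):

```python
import statsmodels.api as sm

# With sum coding, the fitted intercept is the unweighted mean of the
# four category means, and each coefficient is one category's deviation
# from it; the level coded -1 (Warm here) gets minus the sum of the
# other coefficients.
model = sm.OLS(df['Target'], df_se).fit()  # df_se already has an intercept
print(model.params)
```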
2.13 Backward Difference Encoding
In backward difference encoding, a category's target mean is compared with the target mean of the category immediately before it. This method can be beneficial for nominal or ordinal variables.
# Backward-difference-encode Temperature; as with sum coding, an
# intercept plus k-1 contrast columns are produced.
ce_backward = ce.BackwardDifferenceEncoder(cols=['Temperature'])
df_ce = ce_backward.fit_transform(df['Temperature'])
df_ce.columns = ['bk_' + str(i) for i in df_ce.columns]
df = pd.concat([df, df_ce], axis=1)
df
| Rec-No | Temperature | Color | Target | bk_intercept | bk_Temperature_0 | bk_Temperature_1 | bk_Temperature_2 |
|---|---|---|---|---|---|---|---|
| 0 | Hot | Red | 1 | 1 | -0.75 | -0.5 | -0.25 |
| 1 | Cold | Yellow | 1 | 1 | 0.25 | -0.5 | -0.25 |
| 2 | Very Hot | Blue | 1 | 1 | 0.25 | 0.5 | -0.25 |
| 3 | Warm | Blue | 0 | 1 | 0.25 | 0.5 | 0.75 |
| 4 | Hot | Red | 1 | 1 | -0.75 | -0.5 | -0.25 |
| 5 | Warm | Yellow | 0 | 1 | 0.25 | 0.5 | 0.75 |
| 6 | Warm | Red | 1 | 1 | 0.25 | 0.5 | 0.75 |
| 7 | Hot | Yellow | 0 | 1 | -0.75 | -0.5 | -0.25 |
| 8 | Hot | Yellow | 1 | 1 | -0.75 | -0.5 | -0.25 |
| 9 | Cold | Yellow | 1 | 1 | 0.25 | -0.5 | -0.25 |
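The same regression check applies here (a sketch under the same assumptions as above): each fitted coefficient is the difference between one level's target mean and the target mean of the level ordered just before it.

```python
import statsmodels.api as sm

# bk_Temperature_0 ≈ mean(Cold) - mean(Hot) = 1.0 - 0.75 = 0.25,
# bk_Temperature_1 ≈ mean(Very Hot) - mean(Cold) = 0.0, and so on,
# following the encoder's level order: Hot, Cold, Very Hot, Warm.
model_bk = sm.OLS(df['Target'], df_ce).fit()
print(model_bk.params)
```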
2.14 Polynomial Encoding
Polynomial coding is a less commonly used method, but it is one of the methods that best reflects the information in a variable. Its goal is to detect tendencies toward linear and non-linear relationships between the dependent and independent variables: it looks for linear, quadratic and cubic trends across the (ordered) categories of the variable.
# Polynomial-encode Temperature: the three columns are orthogonal
# linear, quadratic and cubic contrasts over the four levels.
ce_poly = ce.PolynomialEncoder(cols=['Temperature'])
dfp = ce_poly.fit_transform(df['Temperature'])
dfp.columns = ['poly_' + str(i) for i in dfp.columns]
df = pd.concat([df, dfp], axis=1)
df
| Rec-No | Temperature | Color | Target | poly_intercept | poly_Temperature_0 | poly_Temperature_1 | poly_Temperature_2 |
|---|---|---|---|---|---|---|---|
| 0 | Hot | Red | 1 | 1 | -0.670820 | 0.5 | -0.223607 |
| 1 | Cold | Yellow | 1 | 1 | -0.223607 | -0.5 | 0.670820 |
| 2 | Very Hot | Blue | 1 | 1 | 0.223607 | -0.5 | -0.670820 |
| 3 | Warm | Blue | 0 | 1 | 0.670820 | 0.5 | 0.223607 |
| 4 | Hot | Red | 1 | 1 | -0.670820 | 0.5 | -0.223607 |
| 5 | Warm | Yellow | 0 | 1 | 0.670820 | 0.5 | 0.223607 |
| 6 | Warm | Red | 1 | 1 | 0.670820 | 0.5 | 0.223607 |
| 7 | Hot | Yellow | 0 | 1 | -0.670820 | 0.5 | -0.223607 |
| 8 | Hot | Yellow | 1 | 1 | -0.670820 | 0.5 | -0.223607 |
| 9 | Cold | Yellow | 1 | 1 | -0.223607 | -0.5 | 0.670820 |
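The numbers above are simply the normalized orthogonal polynomial contrasts for a four-level factor; a quick sketch reproducing them:

```python
import numpy as np

# Linear, quadratic and cubic contrasts for 4 levels, scaled to unit
# length; e.g. -3 / sqrt(20) = -0.670820 matches poly_Temperature_0
# for Hot (the first level in the encoder's ordering).
lin = np.array([-3, -1, 1, 3]) / np.sqrt(20)
quad = np.array([1, -1, -1, 1]) / 2
cub = np.array([-1, 3, -3, 1]) / np.sqrt(20)
print(np.column_stack([lin, quad, cub]).round(6))
```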
2.15 Leave-One-Out Encoding
This is similar to target encoding, except that when we compute the mean target for each category, we exclude the current row's own target. This reduces the effect of outliers.
# Leave-one-out needs the target at fit time, so split features and label.
# Use the original feature columns so the output matches the table below.
X = df[['Temperature', 'Color']]
y = df['Target']
ce_leave = ce.LeaveOneOutEncoder(cols=['Temperature'])
dfl = ce_leave.fit_transform(X, y)
dfl
| Rec-No | Temperature | Color |
|---|---|---|
0 | 0.666667 | Red |
1 | 1.000000 | Yellow |
2 | 0.700000 | Blue |
3 | 0.500000 | Blue |
4 | 0.666667 | Red |
5 | 0.500000 | Yellow |
6 | 0.000000 | Red |
7 | 1.000000 | Yellow |
8 | 0.666667 | Yellow |
9 | 1.000000 | Yellow |
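A manual check of row 0 (Hot) confirms the values above:

```python
# The other Hot rows (4, 7, 8) have targets 1, 0, 1, so row 0 is
# encoded as 2/3 ≈ 0.666667. A category seen only once (Very Hot,
# row 2) has no other rows to average over, so it falls back to the
# global target mean, 0.7.
hot = df[df['Temperature'] == 'Hot']
print((hot['Target'].sum() - hot.loc[0, 'Target']) / (len(hot) - 1))
```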
2.16 James-Stein Encoding
The James-Stein encoder is a target-based encoder. It is similar to target encoding, but the values it produces are shrunk toward the overall target mean, so each encoding is a weighted average of the category's target mean and the global target mean.
The James-Stein estimator has one practical limitation: it was designed for normal distributions, so it does not suit classification models directly. To work around this, we can convert the binary target to a log-odds ratio or use a beta distribution.
# James-Stein-encode Temperature, reusing X and y from above.
ce_james = ce.JamesSteinEncoder(cols=['Temperature'])
dfj = ce_james.fit_transform(X, y)
dfj
| Rec-No | Temperature | Color |
|---|---|---|
0 | 0.741379 | Red |
1 | 1.000000 | Yellow |
2 | 1.000000 | Blue |
3 | 0.405229 | Blue |
4 | 0.741379 | Red |
5 | 0.405229 | Yellow |
6 | 0.405229 | Red |
7 | 0.741379 | Yellow |
8 | 0.741379 | Yellow |
9 | 1.000000 | Yellow |
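Schematically, the estimate is a weighted average of the two means. A hedged sketch of this form (the real encoder derives the weight B from between- and within-group variances; this toy function, `james_stein_sketch`, takes B explicitly and is not the library's internal computation):

```python
def james_stein_sketch(cat_target, all_target, B):
    # Weighted average of the category mean and the global mean;
    # B in [0, 1] grows as the category mean becomes less reliable,
    # pulling the estimate toward the global mean.
    return (1 - B) * cat_target.mean() + B * all_target.mean()

# With B = 0 this reduces to plain target encoding of Hot (0.75);
# with B = 1 it collapses to the global mean (0.7).
print(james_stein_sketch(df.loc[df['Temperature'] == 'Hot', 'Target'],
                         df['Target'], B=0.5))
```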
2.17 M-Estimate Encoding
The M-estimate encoder is a simpler version of the target encoder, similar to the James-Stein encoder: it shrinks each category's target mean toward the global mean, with the amount of shrinkage controlled by an additional parameter m, whose default value is 1.
# M-estimate-encode Temperature with the default smoothing parameter m = 1.
ce_m_estimate = ce.MEstimateEncoder(cols=['Temperature'])
dfM = ce_m_estimate.fit_transform(X, y)
dfM
| Rec-No | Temperature | Color |
|---|---|---|
0 | 0.740 | Red |
1 | 0.900 | Yellow |
2 | 0.850 | Blue |
3 | 0.425 | Blue |
4 | 0.740 | Red |
5 | 0.425 | Yellow |
6 | 0.425 | Red |
7 | 0.740 | Yellow |
8 | 0.740 | Yellow |
9 | 0.900 | Yellow |
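The values above follow the additive-smoothing form encoded_k = (n_k · mean_k + m · global_mean) / (n_k + m); a quick check for Hot:

```python
# Hot: n = 4, category target sum = 3, global mean = 0.7, m = 1,
# so (3 + 1 * 0.7) / (4 + 1) = 0.74, matching the table above.
n_hot = (df['Temperature'] == 'Hot').sum()
sum_hot = df.loc[df['Temperature'] == 'Hot', 'Target'].sum()
print((sum_hot + 1 * df['Target'].mean()) / (n_hot + 1))
```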