2.1 One hot encoding
2.2 Count and Frequency encoding
2.3 Target encoding / Mean encoding
2.4 Ordinal encoding
2.5 Weight of Evidence
2.6 Rare label encoding
2.7 Helmert encoding
2.8 Probability Ratio Encoding
2.9 Label encoding
2.10 Feature hashing
2.11 Binary encoding & BaseN encoding
The examples use the following data frame, which has two independent variables (features) and one label (the target), with ten records in total.
import pandas as pd
import numpy as np
data = {'Temperature': ['Hot','Cold','Very Hot','Warm','Hot','Warm','Warm','Hot','Hot','Cold'],
'Color': ['Red','Yellow','Blue','Blue','Red','Yellow','Red','Yellow','Yellow','Yellow'],
'Target':[1,1,1,0,1,0,1,0,1,1]}
df = pd.DataFrame(data, columns = ['Temperature', 'Color', 'Target'])
Rec-No | Temperature | Color | Target |
---|---|---|---|
0 | Hot | Red | 1 |
1 | Cold | Yellow | 1 |
2 | Very Hot | Blue | 1 |
3 | Warm | Blue | 0 |
4 | Hot | Red | 1 |
5 | Warm | Yellow | 0 |
6 | Warm | Red | 1 |
7 | Hot | Yellow | 0 |
8 | Hot | Yellow | 1 |
9 | Cold | Yellow | 1 |
Supplement: 2.5 Weight of Evidence, with a code example
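First, for each Temperature category, compute the proportion of records with Target = 1 and with Target = 0. For example, the Hot category has three records with Target = 1 and one record with Target = 0, so within Hot the proportion of Target = 1 is 0.75 and the proportion of Target = 0 is 0.25.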
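As a worked illustration of the simplified WoE used in this section, the calculation for one category can be written out as below. The symbols $p_1(c)$ and $p_0(c)$ are notation introduced here (not from the original text) for the proportions of Target = 1 and Target = 0 within category $c$.

$$
\mathrm{WoE}(c) = \ln\frac{p_1(c)}{p_0(c)}, \qquad \mathrm{WoE}(\text{Hot}) = \ln\frac{0.75}{0.25} = \ln 3 \approx 1.0986
$$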
# target = 1 i.e. Good = 1
woe_df = df.groupby('Temperature')['Target'].mean()
woe_df = pd.DataFrame(woe_df)
# Rename the column 'Target' to 'Good'
woe_df = woe_df.rename(columns={'Target':'Good'})
# Calculate Bad probability : 1 - Good probability
woe_df['Bad'] = 1-woe_df.Good
# Replace a zero Bad probability with a small value to avoid division by zero
woe_df['Bad'] = np.where(woe_df['Bad']==0, 0.000001, woe_df['Bad'])
# Compute the WoE
woe_df['WoE'] = np.log(woe_df.Good/woe_df.Bad)
woe_df
Temperature | Good | Bad | WoE |
---|---|---|---|
Cold | 1.000000 | 0.000001 | 13.815511 |
Hot | 0.750000 | 0.250000 | 1.098612 |
Very Hot | 1.000000 | 0.000001 | 13.815511 |
Warm | 0.333333 | 0.666667 | -0.693147 |
Once the WoE values are computed, we add them back to the original data.
# Map the WoE value back to each row of the data frame
df.loc[:, 'WoE_Encode'] = df['Temperature'].map(woe_df['WoE'])
df
Rec-No | Temperature | Color | Target | WoE_Encode |
---|---|---|---|---|
0 | Hot | Red | 1 | 1.098612 |
1 | Cold | Yellow | 1 | 13.815511 |
2 | Very Hot | Blue | 1 | 13.815511 |
3 | Warm | Blue | 0 | -0.693147 |
4 | Hot | Red | 1 | 1.098612 |
5 | Warm | Yellow | 0 | -0.693147 |
6 | Warm | Red | 1 | -0.693147 |
7 | Hot | Yellow | 0 | 1.098612 |
8 | Hot | Yellow | 1 | 1.098612 |
9 | Cold | Yellow | 1 | 13.815511 |
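The WoE steps above can be collected into one small helper. This is only a sketch: the function name `woe_encode` and its `eps` parameter are hypothetical, and, unlike the code above, it also clips the Good probability away from zero so that a category with no Target = 1 records does not produce negative infinity.

```python
import numpy as np
import pandas as pd

def woe_encode(frame, column, target, eps=1e-6):
    """Hypothetical helper: return the simplified WoE value for each row."""
    good = frame.groupby(column)[target].mean()   # proportion of target = 1 per category
    bad = 1 - good                                # proportion of target = 0 per category
    # Assumption beyond the original: clip both sides so the log stays finite.
    good = good.clip(lower=eps)
    bad = bad.clip(lower=eps)
    woe = np.log(good / bad)
    # Map the per-category value back onto every row.
    return frame[column].map(woe)

demo = pd.DataFrame({
    'Temperature': ['Hot', 'Cold', 'Very Hot', 'Warm', 'Hot',
                    'Warm', 'Warm', 'Hot', 'Hot', 'Cold'],
    'Target': [1, 1, 1, 0, 1, 0, 1, 0, 1, 1],
})
demo['WoE_Encode'] = woe_encode(demo, 'Temperature', 'Target')
print(demo)
```

On the ten-row example this should reproduce the WoE_Encode column shown above.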
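Probability Ratio Encoding is similar to Weight of Evidence (WoE). The only difference is that it uses the ratio of the Good and Bad probabilities directly instead of the natural logarithm of that ratio.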
# target = 1 i.e. Good = 1
pr_df = df.groupby('Temperature')['Target'].mean()
pr_df = pd.DataFrame(pr_df)
# Rename the column 'Target' to 'Good'
pr_df = pr_df.rename(columns={'Target':'Good'})
# Calculate Bad probability : 1 - Good probability
pr_df['Bad'] = 1-pr_df.Good
# Replace a zero Bad probability with a small value to avoid division by zero
pr_df['Bad'] = np.where(pr_df['Bad']==0, 0.000001, pr_df['Bad'])
# Compute the Probability Ratio
pr_df['PR'] = pr_df.Good/pr_df.Bad
pr_df
Temperature | Good | Bad | PR |
---|---|---|---|
Cold | 1.000000 | 0.000001 | 1000000.0 |
Hot | 0.750000 | 0.250000 | 3.0 |
Very Hot | 1.000000 | 0.000001 | 1000000.0 |
Warm | 0.333333 | 0.666667 | 0.5 |
Once the Probability Ratio values are computed, we add them back to the original data.
# Map the Probability Ratio value back to each row of the data frame
df.loc[:, 'PR_Encode'] = df['Temperature'].map(pr_df['PR'])
df
Rec-No | Temperature | Color | Target | PR_Encode |
---|---|---|---|---|
0 | Hot | Red | 1 | 3.0 |
1 | Cold | Yellow | 1 | 1000000.0 |
2 | Very Hot | Blue | 1 | 1000000.0 |
3 | Warm | Blue | 0 | 0.5 |
4 | Hot | Red | 1 | 3.0 |
5 | Warm | Yellow | 0 | 0.5 |
6 | Warm | Red | 1 | 0.5 |
7 | Hot | Yellow | 0 | 3.0 |
8 | Hot | Yellow | 1 | 3.0 |
9 | Cold | Yellow | 1 | 1000000.0 |
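Because the two encodings differ only by the logarithm, exponentiating the WoE value should give back the probability ratio. The following sketch checks this on the same ten records; the variable names are illustrative, and the zero guard uses the same small value (0.000001) as the code above.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    'Temperature': ['Hot', 'Cold', 'Very Hot', 'Warm', 'Hot',
                    'Warm', 'Warm', 'Hot', 'Hot', 'Cold'],
    'Target': [1, 1, 1, 0, 1, 0, 1, 0, 1, 1],
})

good = demo.groupby('Temperature')['Target'].mean()  # proportion of Target = 1
bad = 1 - good                                        # proportion of Target = 0
bad = bad.mask(bad == 0, 1e-6)                        # guard against division by zero

pr = good / bad        # Probability Ratio
woe = np.log(pr)       # simplified WoE, as used in this section

# exp(WoE) recovers the ratio, so the encodings carry the same category ordering
print(pd.DataFrame({'PR': pr, 'WoE': woe, 'exp(WoE)': np.exp(woe)}))
```

Note that categories with no Target = 0 records (Cold and Very Hot here) get a very large ratio that is determined entirely by the chosen small value.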
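Label encoding assigns each category an integer, where N is the total number of categories (the Scikit-learn example below uses 0 to N-1). A drawback is that, even when the categories have no order or other relationship, the encoded integers still suggest that one exists. In the example below, the codes make it look as if Cold < Hot < Very Hot < Warm, because 0 < 1 < 2 < 3.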
Using Scikit-learn:
from sklearn.preprocessing import LabelEncoder
df['Temp_label_encoded'] = LabelEncoder().fit_transform(df.Temperature)
df
Rec-No | Temperature | Color | Target | Temp_label_encoded |
---|---|---|---|---|
0 | Hot | Red | 1 | 1 |
1 | Cold | Yellow | 1 | 0 |
2 | Very Hot | Blue | 1 | 2 |
3 | Warm | Blue | 0 | 3 |
4 | Hot | Red | 1 | 1 |
5 | Warm | Yellow | 0 | 3 |
6 | Warm | Red | 1 | 3 |
7 | Hot | Yellow | 0 | 1 |
8 | Hot | Yellow | 1 | 1 |
9 | Cold | Yellow | 1 | 0 |
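The codes in the table follow from the fact that `LabelEncoder` sorts the unique labels and uses each label's position as its code. A minimal sketch, using a plain list instead of the data frame, that inspects the mapping through `classes_` and reverses it with `inverse_transform`:

```python
from sklearn.preprocessing import LabelEncoder

temps = ['Hot', 'Cold', 'Very Hot', 'Warm', 'Hot',
         'Warm', 'Warm', 'Hot', 'Hot', 'Cold']

encoder = LabelEncoder()
codes = encoder.fit_transform(temps)

# classes_ holds the sorted unique labels; the index of a label is its code
print(list(encoder.classes_))
print(list(codes))

# inverse_transform maps codes back to the original labels
print(list(encoder.inverse_transform(codes[:3])))
```

Keeping the fitted encoder around (rather than calling `LabelEncoder().fit_transform(...)` inline) makes it possible to apply the same mapping to new data later.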
You can also use Pandas' **factorize**:
# factorize returns (codes, uniques); the codes array is already one-dimensional
df['Temp_factorize_encoded'] = pd.factorize(df['Temperature'])[0]
df
Rec-No | Temperature | Color | Target | Temp_factorize_encoded |
---|---|---|---|---|
0 | Hot | Red | 1 | 0 |
1 | Cold | Yellow | 1 | 1 |
2 | Very Hot | Blue | 1 | 2 |
3 | Warm | Blue | 0 | 3 |
4 | Hot | Red | 1 | 0 |
5 | Warm | Yellow | 0 | 3 |
6 | Warm | Red | 1 | 3 |
7 | Hot | Yellow | 0 | 0 |
8 | Hot | Yellow | 1 | 0 |
9 | Cold | Yellow | 1 | 1 |
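The factorize codes differ from the Scikit-learn ones because `pd.factorize` numbers categories in order of first appearance by default. The sketch below, which uses a standalone Series rather than the data frame, shows the returned `uniques` and how `sort=True` yields the same alphabetical codes as `LabelEncoder`:

```python
import pandas as pd

temps = pd.Series(['Hot', 'Cold', 'Very Hot', 'Warm', 'Hot',
                   'Warm', 'Warm', 'Hot', 'Hot', 'Cold'])

# Default: codes follow the order in which each category first appears
codes, uniques = pd.factorize(temps)
print(list(uniques), list(codes))

# sort=True: categories are sorted first, so codes follow alphabetical order
sorted_codes, sorted_uniques = pd.factorize(temps, sort=True)
print(list(sorted_uniques), list(sorted_codes))
```

Either way the integers are arbitrary identifiers, so the caveat about implied order applies to both variants.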