Day-4 Feature Engineering -- 2. Categorical Encoding(3)

12th iThome Ironman Contest

DAY 4

AI & Data

Machine Learning series, part 4

tjabi

2020-09-04 22:37:15

2.1 One hot encoding
2.2 Count and Frequency encoding
2.3 Target encoding / Mean encoding
2.4 Ordinal encoding
2.5 Weight of Evidence
2.6 Rare label encoding
2.7 Helmert encoding
2.8 Probability Ratio Encoding
2.9 Label encoding
2.10 Feature hashing
2.11 Binary encoding & BaseN encoding

We will use the following data frame, which has two independent variables (features) and one label (target), with ten records in total.

import pandas as pd
import numpy as np
data = {'Temperature': ['Hot','Cold','Very Hot','Warm','Hot','Warm','Warm','Hot','Hot','Cold'],
        'Color': ['Red','Yellow','Blue','Blue','Red','Yellow','Red','Yellow','Yellow','Yellow'],
        'Target':[1,1,1,0,1,0,1,0,1,1]}

df = pd.DataFrame(data, columns = ['Temperature', 'Color', 'Target'])

Rec-No	Temperature	Color	Target
0	Hot	Red	1
1	Cold	Yellow	1
2	Very Hot	Blue	1
3	Warm	Blue	0
4	Hot	Red	1
5	Warm	Yellow	0
6	Warm	Red	1
7	Hot	Yellow	0
8	Hot	Yellow	1
9	Cold	Yellow	1

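Supplement to 2.5 Weight of Evidence, with a code example
First, for each category of Temperature, compute the proportion of rows with Target = 1 and the proportion with Target = 0. For example, the Hot category has three rows with Target = 1 and one row with Target = 0, so within Hot the proportion of Target = 1 is 0.75 and the proportion of Target = 0 is 0.25.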
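
Written as a formula, the quantity computed in this section is the log of the ratio of the two proportions within a category. The symbol c (a category) is notation introduced here for illustration; the Hot line simply plugs in the proportions stated above.

```latex
\mathrm{WoE}(c) = \ln\frac{P(\text{Target}=1 \mid c)}{P(\text{Target}=0 \mid c)},
\qquad
\mathrm{WoE}(\text{Hot}) = \ln\frac{0.75}{0.25} = \ln 3 \approx 1.0986
```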

# Target = 1 is treated as "Good"; the group mean is the Good probability
woe_df = df.groupby('Temperature')['Target'].mean()
woe_df = pd.DataFrame(woe_df)
# rename the column 'Target' to 'Good'
woe_df = woe_df.rename(columns={'Target':'Good'})
# Calculate Bad probability: 1 - Good probability
woe_df['Bad'] = 1-woe_df.Good
# replace a zero Bad probability with a small value
# so that the denominator (divisor) is never 0
woe_df['Bad'] = np.where(woe_df['Bad']==0, 0.000001, woe_df['Bad'])
# Compute the WoE
woe_df['WoE'] = np.log(woe_df.Good/woe_df.Bad)
woe_df

/	Good	Bad	WoE
Temperature
Cold	1.000000	0.000001	13.815511
Hot	0.750000	0.250000	1.098612
Very Hot	1.000000	0.000001	13.815511
Warm	0.333333	0.666667	-0.693147

With the WoE computed, we add the WoE values back to the original data.

# Map the WoE value back to each row of the data frame
df.loc[:, 'WoE_Encode'] = df['Temperature'].map(woe_df['WoE'])
df

/	Temperature	Color	Target	WoE_Encode
0	Hot	Red	1	1.098612
1	Cold	Yellow	1	13.815511
2	Very Hot	Blue	1	13.815511
3	Warm	Blue	0	-0.693147
4	Hot	Red	1	1.098612
5	Warm	Yellow	0	-0.693147
6	Warm	Red	1	-0.693147
7	Hot	Yellow	0	1.098612
8	Hot	Yellow	1	1.098612
9	Cold	Yellow	1	13.815511
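
The code above only guards against Bad = 0. A category whose rows are all Target = 0 would give Good = 0 and therefore log(0). The following is a minimal sketch of one way to guard both sides by clipping; the helper name `woe_encode` and its `eps` parameter are hypothetical and not part of the original article. Because it clips Good to 1 - eps instead of replacing Bad with a constant, the values for Cold and Very Hot differ very slightly from the table above.

```python
import numpy as np
import pandas as pd

def woe_encode(frame, column, target, eps=1e-6):
    """Hypothetical helper: per-category ln(Good / Bad), clipped on both sides."""
    good = frame.groupby(column)[target].mean()
    # Keep Good inside [eps, 1 - eps] so neither Good nor Bad is exactly 0.
    good = good.clip(lower=eps, upper=1 - eps)
    bad = 1 - good
    woe = np.log(good / bad)
    # Map each row's category to its WoE value.
    return frame[column].map(woe)

data = {'Temperature': ['Hot','Cold','Very Hot','Warm','Hot','Warm','Warm','Hot','Hot','Cold'],
        'Color': ['Red','Yellow','Blue','Blue','Red','Yellow','Red','Yellow','Yellow','Yellow'],
        'Target': [1,1,1,0,1,0,1,0,1,1]}
df = pd.DataFrame(data, columns=['Temperature', 'Color', 'Target'])

encoded = woe_encode(df, 'Temperature', 'Target')
print(encoded)
```

Clipping is only one possible choice; adding a small constant to both counts would be another.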

2.8 Probability Ratio Encoding

Probability Ratio Encoding is similar to Weight of Evidence (WoE). The only difference is that it uses the ratio Good/Bad directly instead of taking its natural log.

# Target = 1 is treated as "Good"; the group mean is the Good probability
pr_df = df.groupby('Temperature')['Target'].mean()
pr_df = pd.DataFrame(pr_df)
# rename the column 'Target' to 'Good'
pr_df = pr_df.rename(columns={'Target':'Good'})
# Calculate Bad probability: 1 - Good probability
pr_df['Bad'] = 1-pr_df.Good
# replace a zero Bad probability with a small value
# so that the denominator (divisor) is never 0
pr_df['Bad'] = np.where(pr_df['Bad']==0, 0.000001, pr_df['Bad'])
# Compute the Probability Ratio
pr_df['PR'] = pr_df.Good/pr_df.Bad
pr_df

/	Good	Bad	PR
Temperature
Cold	1.000000	0.000001	1000000.0
Hot	0.750000	0.250000	3.0
Very Hot	1.000000	0.000001	1000000.0
Warm	0.333333	0.666667	0.5

With the Probability Ratio computed, we add the values back to the original data.

# Map the Probability Ratio value back to each row of the data frame
df.loc[:, 'PR_Encode'] = df['Temperature'].map(pr_df['PR'])
df

/	Temperature	Color	Target	PR_Encode
0	Hot	Red	1	3.0
1	Cold	Yellow	1	1000000.0
2	Very Hot	Blue	1	1000000.0
3	Warm	Blue	0	0.5
4	Hot	Red	1	3.0
5	Warm	Yellow	0	0.5
6	Warm	Red	1	0.5
7	Hot	Yellow	0	3.0
8	Hot	Yellow	1	3.0
9	Cold	Yellow	1	1000000.0
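
Since WoE is the natural log of the same Good/Bad ratio, exponentiating the WoE should recover the Probability Ratio. The sketch below checks this on the article's data, using the same 0.000001 replacement for Bad = 0. The variable names `good`, `bad`, `woe` and `pr` are chosen here for illustration only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Temperature': ['Hot','Cold','Very Hot','Warm','Hot','Warm','Warm','Hot','Hot','Cold'],
    'Target': [1,1,1,0,1,0,1,0,1,1],
})

# Good = P(Target = 1) per category, Bad = 1 - Good
good = df.groupby('Temperature')['Target'].mean()
bad = 1 - good
# Replace a zero Bad probability with a small value, as in the article
bad = bad.mask(bad == 0, 0.000001)

woe = np.log(good / bad)   # Weight of Evidence
pr = good / bad            # Probability Ratio

# exp(WoE) and PR agree up to floating-point error
print(np.allclose(np.exp(woe), pr))
print(pr)
```

The very large value for Cold and Very Hot is an artifact of the small replacement constant, so it may be worth capping or rescaling it before feeding the column to a model.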

2.9 Label encoding

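This method assigns each category an integer from 0 to N-1, where N is the number of categories (scikit-learn's LabelEncoder numbers the classes in sorted order). Its drawback is that even when the categories have no inherent order, the numeric codes imply one, and a model may treat the categories as ordered. In the example below, the codes suggest Cold < Hot < Very Hot < Warm (0 < 1 < 2 < 3).

Using scikit-learn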

from sklearn.preprocessing import LabelEncoder
df['Temp_label_encoded'] = LabelEncoder().fit_transform(df.Temperature)
df

/	Temperature	Color	Target	Temp_label_encoded
0	Hot	Red	1	1
1	Cold	Yellow	1	0
2	Very Hot	Blue	1	2
3	Warm	Blue	0	3
4	Hot	Red	1	1
5	Warm	Yellow	0	3
6	Warm	Red	1	3
7	Hot	Yellow	0	1
8	Hot	Yellow	1	1
9	Cold	Yellow	1	0
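
If the categories do have a meaningful order, one option is to state that order explicitly rather than rely on alphabetical sorting. The sketch below uses pandas' Categorical for this. The order Cold < Warm < Hot < Very Hot is an assumption made here for illustration; the article does not define one.

```python
import pandas as pd

temperature = ['Hot','Cold','Very Hot','Warm','Hot','Warm','Warm','Hot','Hot','Cold']

# Assumed order, for illustration only
order = ['Cold', 'Warm', 'Hot', 'Very Hot']

# codes are the positions of each value in `order`
codes = pd.Categorical(temperature, categories=order, ordered=True).codes
print(codes.tolist())
# [2, 0, 3, 1, 2, 1, 1, 2, 2, 0]
```

With an explicit order, the implied relation 0 < 1 < 2 < 3 matches the intended ranking instead of the spelling of the labels.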

Pandas' factorize can also be used:

df.loc[:, 'Temp_factorize_encode'] = pd.factorize(df['Temperature'])[0]
df

/	Temperature	Color	Target	Temp_factorize_encode
0	Hot	Red	1	0
1	Cold	Yellow	1	1
2	Very Hot	Blue	1	2
3	Warm	Blue	0	3
4	Hot	Red	1	0
5	Warm	Yellow	0	3
6	Warm	Red	1	3
7	Hot	Yellow	0	0
8	Hot	Yellow	1	0
9	Cold	Yellow	1	1
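
The two tables differ because factorize numbers the categories in order of first appearance, while the LabelEncoder result above follows sorted order. As a small sketch, passing `sort=True` to `pd.factorize` yields the sorted numbering, and the second return value shows which label each code stands for. The variable names are chosen here for illustration.

```python
import pandas as pd

temperature = pd.Series(['Hot','Cold','Very Hot','Warm','Hot','Warm','Warm','Hot','Hot','Cold'])

# Default: codes follow order of first appearance
codes, uniques = pd.factorize(temperature)
print(codes.tolist())     # [0, 1, 2, 3, 0, 3, 3, 0, 0, 1]
print(list(uniques))      # ['Hot', 'Cold', 'Very Hot', 'Warm']

# sort=True: codes follow the sorted order of the labels
sorted_codes, sorted_uniques = pd.factorize(temperature, sort=True)
print(sorted_codes.tolist())   # [1, 0, 2, 3, 1, 3, 3, 1, 1, 0]
print(list(sorted_uniques))    # ['Cold', 'Hot', 'Very Hot', 'Warm']
```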
