2021 iThome 鐵人賽
DAY 12
AI & Data
使用python學習Machine Learning 系列 第 12 篇

Day 12 [Python ML, Feature Engineering] Feature Engineering Summary

Categorical Encoding

| Encoding | Description |
| --- | --- |
| One-hot encoding | Expands the states of a column into separate binary columns; e.g. a `Sex` column with `Male` and `Female` becomes `Sex_male` and `Sex_female`. With many distinct states the dimensionality blows up and processing becomes very slow; recommended when a feature has fewer than about 4 categories. |
| Label encoding | `fit` learns a mapping from each category to an integer label (it is not normalization), and `transform` then applies that mapping to the data. |
| Count encoding | Replaces each categorical label with its frequency in the data. Rare values would otherwise be treated the same as every other value; count encoding effectively weights categories by how often they appear, which makes it useful for categorical features. |
| Target encoding | For each state of a column, computes the proportion (mean) of the target for that state and replaces the original value with it; recommended when a feature has more than about 4 categories. |
| CatBoost encoding | Similar to target encoding, but each row is encoded using only the rows that come before it. |

:::warning
These encoders compute statistics from the data (and, for target-based encoders, from the target itself), so fitting them on the validation set leaks information. Fit them on the training data only, never on the validation data:

  • count encoding
  • target encoding
  • catboost encoding
:::
:::warning
When using target encoding or catboost encoding, the target is baked into the encoded feature. Since every ip maps to a target, an `ip_target` feature will predict the training data far too well; and if the test data contains an ip that never appeared in training, the model will not know how to predict that row. The ip column should therefore be removed.

Try to remove the ip encoding:

Target encoding attempts to measure the population mean of the target for each level in a categorical feature. This means when there is less data per level, the estimated mean will be further away from the "true" mean, there will be more variance. There is little data per IP address so it's likely that the estimates are much noisier than for the other features. The model will rely heavily on this feature since it is extremely predictive. This causes it to make fewer splits on other features, and those features are fit on just the errors left over accounting for IP address. So, the model will perform very poorly when seeing new IP addresses that weren't in the training data (which is likely most new data). Going forward, we'll leave out the IP feature when trying different encodings.
:::

One hot encoding

get_dummies automatically converts a DataFrame into one-hot encoded form, ready to feed into a model

features = ["Pclass", "Sex", "SibSp", "Parch"]
X = pd.get_dummies(train_data[features])

Original data: (rendered DataFrame output omitted)

Encoded data: (rendered DataFrame output omitted)

LabelEncoder (convert categorical features to numeric values)

Some models can only be trained on numeric data, which is what this conversion is for

>>> from sklearn import preprocessing
>>> le = preprocessing.LabelEncoder()
>>> le.fit(["paris", "paris", "tokyo", "amsterdam"])
LabelEncoder()
>>> list(le.classes_)
['amsterdam', 'paris', 'tokyo']
>>> le.transform(["tokyo", "tokyo", "paris"])
array([2, 2, 1])
>>> list(le.inverse_transform([2, 2, 1]))
['tokyo', 'tokyo', 'paris']

When fit is called, each distinct value is assigned an integer label; transform then converts the data according to that mapping.

  • axis=0 applies a function down each column; axis=1 applies it across each row
  • apply runs the given function on every column (or row)
  • fit_transform first fits on the data, then transforms it
  • fit learns the label mapping; it does not normalize the data

from sklearn.preprocessing import LabelEncoder

cat_features = ['category', 'currency', 'country']
encoder = LabelEncoder()

# Apply the label encoder to each column
encoded = ks[cat_features].apply(encoder.fit_transform)

To process each column explicitly one at a time, pass a single column into fit_transform and store the result in a new _labels column:

from sklearn.preprocessing import LabelEncoder

cat_features = ['ip', 'app', 'device', 'os', 'channel']

# Create new columns in clicks using preprocessing.LabelEncoder()
encoder = LabelEncoder()
for feature in cat_features:
    encoded = encoder.fit_transform(clicks[feature])
    clicks[feature + '_labels'] = encoded

Count Encoding

  • Import category_encoders
  • Create the encoder with ce.CountEncoder()
  • Pass the features to be transformed into the encoder
  • Use add_suffix to append "_count" to the new column names
  • join the encoded columns back onto the original data

import category_encoders as ce
cat_features = ['category', 'currency', 'country']

# Create the encoder
count_enc = ce.CountEncoder()

# Transform the features, rename the columns with the _count suffix, and join to dataframe
count_encoded = count_enc.fit_transform(ks[cat_features])
data = data.join(count_encoded.add_suffix("_count"))

# Train a model 
train, valid, test = get_data_splits(data)
train_model(train, valid)
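If category_encoders is not available, the same count-encoding idea can be sketched with plain pandas by mapping each value to its frequency (toy column values below are made up):

```python
import pandas as pd

df = pd.DataFrame({"country": ["US", "GB", "US", "US", "CA"]})

# Count encoding by hand: replace each category with its number of occurrences
counts = df["country"].value_counts()
df["country_count"] = df["country"].map(counts)

print(df["country_count"].tolist())  # [3, 1, 3, 3, 1]
```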

Target Encoding

# Create the encoder
target_enc = ce.TargetEncoder(cols=cat_features)
target_enc.fit(train[cat_features], train['outcome'])

# Transform the features, rename the columns with _target suffix, and join to dataframe
train_TE = train.join(target_enc.transform(train[cat_features]).add_suffix('_target'))
valid_TE = valid.join(target_enc.transform(valid[cat_features]).add_suffix('_target'))

# Train a model
train_model(train_TE, valid_TE)
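The core of target encoding can be sketched without the library: compute the per-category mean of the target on the training split only, then map it onto the validation split. This toy version skips the smoothing toward the global mean that ce.TargetEncoder adds:

```python
import pandas as pd

# Toy train/valid split; "cat" is the categorical feature, "outcome" the target
train = pd.DataFrame({"cat": ["a", "a", "b", "b"], "outcome": [1, 0, 1, 1]})
valid = pd.DataFrame({"cat": ["a", "b", "c"]})

# Target encoding: mean of the target per category, computed on train only
means = train.groupby("cat")["outcome"].mean()
valid["cat_target"] = valid["cat"].map(means)

# Category "c" never appears in train, so it gets NaN here
# (a real encoder would fall back to the global mean / a smoothed prior)
print(valid["cat_target"].tolist())
```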

CatBoost Encoding

# Create the encoder
target_enc = ce.CatBoostEncoder(cols=cat_features)
target_enc.fit(train[cat_features], train['outcome'])

# Transform the features, rename columns with _cb suffix, and join to dataframe
train_CBE = train.join(target_enc.transform(train[cat_features]).add_suffix('_cb'))
valid_CBE = valid.join(target_enc.transform(valid[cat_features]).add_suffix('_cb'))

# Train a model
train_model(train_CBE, valid_CBE)
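CatBoost encoding is an "ordered" variant of target encoding: each row is encoded using only the target values of the rows that came before it, blended with a prior, which reduces target leakage. A rough sketch of that idea on made-up data, assuming the commonly documented defaults of ce.CatBoostEncoder (prior = global target mean, smoothing weight a = 1):

```python
import pandas as pd

train = pd.DataFrame({"cat": ["a", "a", "a", "b"], "outcome": [1, 0, 1, 1]})
prior = train["outcome"].mean()  # global target mean, used as the prior
a = 1.0                          # smoothing strength

encoded = []
stats = {}  # per-category running (sum, count) over *previous* rows only
for cat, y in zip(train["cat"], train["outcome"]):
    s, n = stats.get(cat, (0.0, 0))
    encoded.append((s + prior * a) / (n + a))  # encode before seeing this row's target
    stats[cat] = (s + y, n + 1)

print(encoded)  # first occurrence of each category falls back to the prior
```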

Feature Generation

This section collects a few ways to generate new features.

Feature Selection

Introduction

After feature encodings and feature generation we often end up with too many features, which can cause overfitting or very long training times, so we need methods to select a subset of them.

Univariate Feature Selection

baseline_data.columns.size
14

The original data has 14 columns; univariate selection will keep the 5 strongest features.

:::danger
Remember to split the data into training (Train), test (Test), and validation (Valid) sets before doing the selection.
:::

feature_cols = baseline_data.columns.drop('outcome')
train, valid, _ = get_data_splits(baseline_data)

# Keep 5 features
selector = SelectKBest(f_classif, k=5)

X_new = selector.fit_transform(train[feature_cols], train['outcome'])
X_new
array([[2.015e+03, 5.000e+00, 9.000e+00, 1.800e+01, 1.409e+03],
       [2.017e+03, 1.300e+01, 2.200e+01, 3.100e+01, 9.570e+02],
       [2.013e+03, 1.300e+01, 2.200e+01, 3.100e+01, 7.390e+02],
       ...,
       [2.011e+03, 1.300e+01, 2.200e+01, 3.100e+01, 5.150e+02],
       [2.015e+03, 1.000e+00, 3.000e+00, 2.000e+00, 1.306e+03],
       [2.013e+03, 1.300e+01, 2.200e+01, 3.100e+01, 1.084e+03]])
       

The selected array no longer carries the original column labels, so we convert it back to the original shape and then remove the all-zero columns.

.inverse_transform recovers the data in its pre-transform shape:

# Get back the features we've kept, zero out all other features
selected_features = pd.DataFrame(selector.inverse_transform(X_new), 
                                 index=train.index, 
                                 columns=feature_cols)
selected_features.head()

Then remove the all-zero columns:

# Dropped columns have values of all 0s, so var is 0, drop them
selected_columns = selected_features.columns[selected_features.var() != 0]

# Get the valid dataset with the selected features.
valid[selected_columns].join(valid['outcome']).head()
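The whole univariate-selection flow can be run end to end on synthetic data (the feature names and shapes below are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(100, 13)),
                 columns=[f"f{i}" for i in range(13)])
y = (X["f0"] + rng.normal(size=100) > 0).astype(int)  # target depends mostly on f0

selector = SelectKBest(f_classif, k=5)
X_new = selector.fit_transform(X, y)

# Recover which columns survived: inverse_transform zero-fills dropped columns
selected = pd.DataFrame(selector.inverse_transform(X_new),
                        index=X.index, columns=X.columns)
selected_columns = selected.columns[selected.var() != 0]

print(X_new.shape, list(selected_columns))
```

Since the target was built from f0, f0 scores a very high F-statistic and survives the selection.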

L1 regularization (mentioned in Hung-yi Lee's Lecture 1 on Regression as a way to smooth the fitted curve)

The univariate method above evaluates each feature's effect on the target one at a time.

L1 regularization instead judges the effect on the target using all features together.

from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel

train, valid, _ = get_data_splits(baseline_data)

X, y = train[train.columns.drop("outcome")], train['outcome']

# Set the regularization parameter C=1
logistic = LogisticRegression(C=1, penalty="l1", solver='liblinear', random_state=7).fit(X, y)
model = SelectFromModel(logistic, prefit=True)

X_new = model.transform(X)
X_new
array([[1.000e+03, 1.200e+01, 1.100e+01, ..., 1.900e+03, 1.800e+01,
        1.409e+03],
       [3.000e+04, 4.000e+00, 2.000e+00, ..., 1.630e+03, 3.100e+01,
        9.570e+02],
       [4.500e+04, 0.000e+00, 1.200e+01, ..., 1.630e+03, 3.100e+01,
        7.390e+02],
       ...,
       [2.500e+03, 0.000e+00, 3.000e+00, ..., 1.830e+03, 3.100e+01,
        5.150e+02],
       [2.600e+03, 2.100e+01, 2.300e+01, ..., 1.036e+03, 2.000e+00,
        1.306e+03],
       [2.000e+04, 1.600e+01, 4.000e+00, ..., 9.200e+02, 3.100e+01,
        1.084e+03]])

As with the univariate method, this returns only the selected columns.

Removing the all-zero columns gives the names of the selected columns:

# Get back the kept features as a DataFrame with dropped columns as all 0s
selected_features = pd.DataFrame(model.inverse_transform(X_new), 
                                 index=X.index,
                                 columns=X.columns)

# Dropped columns have values of all 0s, keep other columns 
selected_columns = selected_features.columns[selected_features.var() != 0]
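A self-contained version of the same L1-based selection on synthetic data (the names below are made up; a smaller C means a stronger penalty and more coefficients driven exactly to zero):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 10)),
                 columns=[f"f{i}" for i in range(10)])
# Target depends only on f0 and f1; the other eight features are pure noise
y = (X["f0"] - X["f1"] + 0.1 * rng.normal(size=200) > 0).astype(int)

# Smaller C = stronger L1 penalty = sparser coefficients
logistic = LogisticRegression(C=0.1, penalty="l1", solver="liblinear",
                              random_state=7).fit(X, y)
model = SelectFromModel(logistic, prefit=True)
X_new = model.transform(X)

selected = pd.DataFrame(model.inverse_transform(X_new),
                        index=X.index, columns=X.columns)
selected_columns = selected.columns[selected.var() != 0]
print(list(selected_columns))
```

The informative features f0 and f1 receive large coefficients and survive; most noise features are zeroed out by the penalty.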

Summary

| Feature Selection | AUC score | Suitable for | Result |
| --- | --- | --- | --- |
| None (baseline) | 0.7446 | - | - |
| Univariate Feature Selection | 0.6010 | Large datasets with many features | Worse |
| L1 regularization | 0.7462 | Small datasets with fewer features | Better |

Other Feature Engineering Methods

Get the unique values in a column

# Get all the unique states of a column
print('Unique values in `state` column:', list(ks.state.unique()))
Unique values in `state` column: ['failed', 'canceled', 'successful', 'live', 'undefined', 'suspended']

Drop unneeded rows

Once all the states are known, domain knowledge can tell you which records are unnecessary, and they can be dropped.
ks is a DataFrame.
query accepts a boolean, SQL-like expression (detailed usage to be added later).

# Drop live projects
ks = ks.query('state != "live"')
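For instance, on a toy stand-in for the ks DataFrame (values made up):

```python
import pandas as pd

ks = pd.DataFrame({"state": ["failed", "live", "successful", "live"],
                   "goal": [1000, 500, 2000, 800]})

# Keep only finished projects
ks = ks.query('state != "live"')
print(ks["state"].tolist())  # ['failed', 'successful']
```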

Convert the target to a number (one of its states)

The original data is:

Unique values in `state` column: ['failed', 'canceled', 'successful', 'undefined', 'suspended']

We want to map successful to 1 and everything else to 0.
assign adds a new column to the DataFrame.

feature = ['state', 'outcome']
# Add outcome column, "successful" == 1, others are 0
ks = ks.assign(outcome=(ks['state'] == 'successful').astype(int))
ks[feature].head(6)
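On toy data (values made up), the same mapping looks like:

```python
import pandas as pd

ks = pd.DataFrame({"state": ["failed", "canceled", "successful", "successful"]})

# "successful" == 1, everything else == 0
ks = ks.assign(outcome=(ks["state"] == "successful").astype(int))
print(ks["outcome"].tolist())  # [0, 0, 1, 1]
```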

Drop rows with missing values

# Filter rows with missing values
filtered_melbourne_data = melbourne_data.dropna(axis=0)
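For example, on toy data with two incomplete rows:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Rooms": [3, 2, np.nan], "Price": [1.2, np.nan, 0.9]})

# axis=0 drops any row that contains at least one missing value
filtered = df.dropna(axis=0)
print(len(filtered))  # 1
```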

Convert timestamps (expand a datetime column into separate numeric columns)

ks = ks.assign(hour=ks.launched.dt.hour,
               day=ks.launched.dt.day,
               month=ks.launched.dt.month,
               year=ks.launched.dt.year)
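On a toy launched column (dates made up), the .dt accessor pulls each component out:

```python
import pandas as pd

ks = pd.DataFrame({"launched": pd.to_datetime(["2017-06-01 13:30:00",
                                               "2015-12-25 08:00:00"])})
ks = ks.assign(hour=ks.launched.dt.hour, day=ks.launched.dt.day,
               month=ks.launched.dt.month, year=ks.launched.dt.year)
print(ks[["hour", "day", "month", "year"]].values.tolist())
```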
