Encoding | Description
---|---
One hot encoding | Splits a column into one binary column per category. For example, Sex (Male/Female) becomes Sex_male and Sex_female, each a binary attribute. If a column has too many distinct categories, the dimensionality blows up and processing becomes very slow, so it is recommended when the number of categories is < 4.
Label Encoding | Fits on the dataset to learn an integer label for each category, then transforms the data by replacing each category with its label.
Count Encoding | Replaces each categorical label with its frequency (count) in the data. Rare values would otherwise be treated exactly like every other category; count encoding effectively weights categories by how common they are, which makes it a useful transform for categorical features.
Target Encoding | Takes the labels of a column, computes the proportion (mean) of the target for each category, and replaces the original value with that statistic. Recommended when the number of categories is > 4.
CatBoost Encoding | Similar to target encoding, but the target statistic for each row is computed only from the rows before it, which reduces target leakage.
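Before the library-based examples below, here is a minimal sketch in plain pandas of what the main encodings produce; the country column and outcome values are made up purely for illustration.

```python
import pandas as pd

# Toy data, made up purely for illustration
df = pd.DataFrame({
    'country': ['US', 'US', 'GB', 'US', 'GB', 'CA'],
    'outcome': [1, 0, 1, 1, 0, 0],
})

# One-hot encoding: one binary column per category
one_hot = pd.get_dummies(df['country'], prefix='country')

# Count encoding: replace each category with how often it appears
df['country_count'] = df['country'].map(df['country'].value_counts())

# Target encoding: replace each category with the mean of the target for that category
df['country_target'] = df['country'].map(df.groupby('country')['outcome'].mean())

print(one_hot)
print(df)
```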
:::warning
When an encoding uses the target (as target encoding and CatBoost encoding do), fit the encoder only on the training data, never on the validation data, so that no target information leaks into the validation set.
:::
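A minimal sketch of that fit-on-train-only pattern, assuming the train/valid splits and the cat_features / 'outcome' names used later in this note:

```python
import category_encoders as ce

# Fit the target encoder on the training split only, then reuse the
# fitted mapping on the validation split so no target information leaks
target_enc = ce.TargetEncoder(cols=cat_features)
target_enc.fit(train[cat_features], train['outcome'])

train_encoded = target_enc.transform(train[cat_features])
valid_encoded = target_enc.transform(valid[cat_features])
```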
pd.get_dummies can automatically convert a dataframe into one-hot encoded form, so the data can then be fed to a model:
features = ["Pclass", "Sex", "SibSp", "Parch"]
X = pd.get_dummies(train_data[features])
Original data (dataframe preview omitted).

Processed data (dataframe preview omitted).
The goal is to convert categorical values into a form that these models can actually train on.
>>> from sklearn import preprocessing
>>> le = preprocessing.LabelEncoder()
>>> le.fit(["paris", "paris", "tokyo", "amsterdam"])
LabelEncoder()
>>> list(le.classes_)
['amsterdam', 'paris', 'tokyo']
>>> le.transform(["tokyo", "tokyo", "paris"])
array([2, 2, 1]...)
>>> list(le.inverse_transform([2, 2, 1]))
['tokyo', 'tokyo', 'paris']
`LabelEncoder`: when the data is fit, every distinct value is assigned an integer label; when transform is called, the data is converted according to those labels.

Notes on the pieces used in the code below:

- `axis=0` -> columns
- `axis=1` -> rows
- `apply`: applies a function to each column (or row) of the dataframe
- `fit_transform`: first fits on the data, then transforms it
- `fit`: learns the encoding (the value-to-label mapping) from the data
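A small sketch of how these pieces fit together on a made-up dataframe (the column names are hypothetical):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'color': ['red', 'blue', 'red'],
                   'size':  ['S', 'M', 'S']})

encoder = LabelEncoder()

# apply with the default axis=0 passes each column to the function,
# so every column gets its own fit_transform
encoded = df.apply(encoder.fit_transform)

# axis=1 passes each row instead
row_tags = df.apply(lambda row: row['color'] + '-' + row['size'], axis=1)

# fit_transform = fit (learn a label for each distinct value)
# followed by transform (map the values to those labels)
print(encoder.fit_transform(df['color']))   # array([1, 0, 1])
```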
from sklearn.preprocessing import LabelEncoder
cat_features = ['category', 'currency', 'country']
encoder = LabelEncoder()
# Apply the label encoder to each column
encoded = ks[cat_features].apply(encoder.fit_transform)
If you want each column to be encoded independently, you can use the following approach, passing one column at a time into fit_transform:
from sklearn.preprocessing import LabelEncoder
cat_features = ['ip', 'app', 'device', 'os', 'channel']
# Create new columns in clicks using preprocessing.LabelEncoder()
encoder = LabelEncoder()
for feature in cat_features:
    encoded = encoder.fit_transform(clicks[feature])
    clicks[feature + '_labels'] = encoded
1. First, import category_encoders
2. Create the encoder with ce.CountEncoder()
3. Pass the data to be transformed into the encoder
4. Use add_suffix to append _count to the resulting column names
5. Join the encoded columns back into the original data with join
import category_encoders as ce
cat_features = ['category', 'currency', 'country']
# Create the encoder
count_enc = ce.CountEncoder()
# Transform the features, rename the columns with the _count suffix, and join to dataframe
count_encoded = count_enc.fit_transform(ks[cat_features])
data = data.join(count_encoded.add_suffix("_count"))
# Train a model
train, valid, test = get_data_splits(data)
train_model(train, valid)
# Create the encoder
target_enc = ce.TargetEncoder(cols=cat_features)
target_enc.fit(train[cat_features], train['outcome'])
# Transform the features, rename the columns with _target suffix, and join to dataframe
train_TE = train.join(target_enc.transform(train[cat_features]).add_suffix('_target'))
valid_TE = valid.join(target_enc.transform(valid[cat_features]).add_suffix('_target'))
# Train a model
train_model(train_TE, valid_TE)
# Create the encoder
target_enc = ce.CatBoostEncoder(cols=cat_features)
target_enc.fit(train[cat_features], train['outcome'])
# Transform the features, rename columns with _cb suffix, and join to dataframe
train_CBE = train.join(target_enc.transform(train[cat_features]).add_suffix('_cb'))
valid_CBE = valid.join(target_enc.transform(valid[cat_features]).add_suffix('_cb'))
# Train a model
train_model(train_CBE, valid_CBE)
The sections above cover some ways to produce features. After feature encoding and feature generation, we may end up with too many features, which can cause overfitting or make training take a very long time, so we need methods to select features.
baseline_data.columns.size
14
The original data has 14 features; we use the method below to keep 5 of those columns.
:::danger
Remember to split the data into training (Train), test (Test), and validation (Valid) sets before doing this.
:::
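get_data_splits is used throughout this note but never defined here; a minimal sketch of what such a positional split might look like (the fractions and the idea of keeping the last slices for validation and test are assumptions):

```python
def get_data_splits(dataframe, valid_fraction=0.1):
    """Hypothetical helper: split a dataframe into train / valid / test
    slices by position, keeping the last two slices for validation and test."""
    valid_size = int(len(dataframe) * valid_fraction)
    train = dataframe[:-valid_size * 2]
    valid = dataframe[-valid_size * 2:-valid_size]
    test = dataframe[-valid_size:]
    return train, valid, test
```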
from sklearn.feature_selection import SelectKBest, f_classif

feature_cols = baseline_data.columns.drop('outcome')
train, valid, _ = get_data_splits(baseline_data)
# Keep 5 features
selector = SelectKBest(f_classif, k=5)
X_new = selector.fit_transform(train[feature_cols], train['outcome'])
X_new
array([[2.015e+03, 5.000e+00, 9.000e+00, 1.800e+01, 1.409e+03],
[2.017e+03, 1.300e+01, 2.200e+01, 3.100e+01, 9.570e+02],
[2.013e+03, 1.300e+01, 2.200e+01, 3.100e+01, 7.390e+02],
...,
[2.011e+03, 1.300e+01, 2.200e+01, 3.100e+01, 5.150e+02],
[2.015e+03, 1.000e+00, 3.000e+00, 2.000e+00, 1.306e+03],
[2.013e+03, 1.300e+01, 2.200e+01, 3.100e+01, 1.084e+03]])
At this point the selected features no longer line up with the original columns, so we need to map the result back to the original shape and then drop the columns that are all zeros. `.inverse_transform` can be used to recover the data in its original column layout.
# Get back the features we've kept, zero out all other features
selected_features = pd.DataFrame(selector.inverse_transform(X_new),
                                 index=train.index,
                                 columns=feature_cols)
selected_features.head()
Then drop the all-zero columns:
# Dropped columns have values of all 0s, so var is 0, drop them
selected_columns = selected_features.columns[selected_features.var() != 0]
# Get the valid dataset with the selected features.
valid[selected_columns].join(valid['outcome']).head()
The method above is univariate: it scores each feature individually by its effect on the target. L1 regularization, in contrast, judges feature importance using all of the features together against the target.
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel
train, valid, _ = get_data_splits(baseline_data)
X, y = train[train.columns.drop("outcome")], train['outcome']
# Set the regularization parameter C=1
logistic = LogisticRegression(C=1, penalty="l1", solver='liblinear', random_state=7).fit(X, y)
model = SelectFromModel(logistic, prefit=True)
X_new = model.transform(X)
X_new
array([[1.000e+03, 1.200e+01, 1.100e+01, ..., 1.900e+03, 1.800e+01,
1.409e+03],
[3.000e+04, 4.000e+00, 2.000e+00, ..., 1.630e+03, 3.100e+01,
9.570e+02],
[4.500e+04, 0.000e+00, 1.200e+01, ..., 1.630e+03, 3.100e+01,
7.390e+02],
...,
[2.500e+03, 0.000e+00, 3.000e+00, ..., 1.830e+03, 3.100e+01,
5.150e+02],
[2.600e+03, 2.100e+01, 2.300e+01, ..., 1.036e+03, 2.000e+00,
1.306e+03],
[2.000e+04, 1.600e+01, 4.000e+00, ..., 9.200e+02, 3.100e+01,
1.084e+03]])
As with the univariate approach, this returns the data restricted to the selected columns. After dropping the columns that are all zeros, we are left with the selected columns.
# Get back the kept features as a DataFrame with dropped columns as all 0s
selected_features = pd.DataFrame(model.inverse_transform(X_new),
                                 index=X.index,
                                 columns=X.columns)
# Dropped columns have values of all 0s, keep other columns
selected_columns = selected_features.columns[selected_features.var() != 0]
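As in the univariate example, the kept columns can then be used to rebuild the splits and retrain; a short sketch reusing the train_model helper from earlier in this note:

```python
# Keep only the selected features (plus the target) in each split
train_selected = train[selected_columns].join(train['outcome'])
valid_selected = valid[selected_columns].join(valid['outcome'])
train_model(train_selected, valid_selected)
```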
Feature Selection | AUC score | Suitable for | Effect
---|---|---|---
None (baseline) | 0.7446 | |
Univariate Feature Selection | 0.6010 | Large datasets with many features | Worse here
L1 regularization | 0.7462 | Smaller datasets with fewer features | Better here
# Get all of the unique values in a column
print('Unique values in `state` column:', list(ks.state.unique()))
Unique values in `state` column: ['failed', 'canceled', 'successful', 'live', 'undefined', 'suspended']
Once we have all of the values in a column, we can use domain knowledge to decide which ones are not needed and drop them.

ks is a dataframe; query accepts a boolean expression (an SQL-like syntax is also possible; detailed usage to be filled in later).
# Drop live projects
ks = ks.query('state != "live"')
After dropping the live projects, the unique values in `state` are:
Unique values in `state` column: ['failed', 'canceled', 'successful', 'undefined', 'suspended']
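A couple of other query patterns, sketched for reference (the combined filter below is made up; only the state column comes from this dataset):

```python
# Conditions can be combined inside the query string
ks_filtered = ks.query('state != "live" and state != "undefined"')

# Python variables can be referenced with @
excluded_states = ['live', 'undefined']
ks_filtered = ks.query('state not in @excluded_states')
```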
We want to convert 'successful' to 1 and everything else to 0. assign adds a new column to the dataframe.
feature = ['state', 'outcome']
# Add outcome column, "successful" == 1, others are 0
ks = ks.assign(outcome=(ks['state'] == 'successful').astype(int))
ks[feature].head(6)
# Filter rows with missing values
filtered_melbourne_data = melbourne_data.dropna(axis=0)
# Extract hour, day, month, and year from the 'launched' timestamp column
ks = ks.assign(hour=ks.launched.dt.hour,
               day=ks.launched.dt.day,
               month=ks.launched.dt.month,
               year=ks.launched.dt.year)