After feature encoding and feature generation we often end up with too many features, which can lead to overfitting or very long training times, so we need some methods to select the features worth keeping.
%matplotlib inline
import itertools
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.preprocessing import LabelEncoder
from sklearn import metrics
ks = pd.read_csv('./ks-projects-201801.csv',
                 parse_dates=['deadline', 'launched'])
# Drop live projects
ks = ks.query('state != "live"')
# Add outcome column, "successful" == 1, others are 0
ks = ks.assign(outcome=(ks['state'] == 'successful').astype(int))
# Timestamp features
ks = ks.assign(hour=ks.launched.dt.hour,
               day=ks.launched.dt.day,
               month=ks.launched.dt.month,
               year=ks.launched.dt.year)
# Label encoding
cat_features = ['category', 'currency', 'country']
encoder = LabelEncoder()
encoded = ks[cat_features].apply(encoder.fit_transform)
data_cols = ['goal', 'hour', 'day', 'month', 'year', 'outcome']
baseline_data = ks[data_cols].join(encoded)
cat_features = ['category', 'currency', 'country']
interactions = pd.DataFrame(index=ks.index)
for col1, col2 in itertools.combinations(cat_features, 2):
    new_col_name = '_'.join([col1, col2])
    # Convert to strings and combine
    new_values = ks[col1].map(str) + "_" + ks[col2].map(str)
    label_enc = LabelEncoder()
    interactions[new_col_name] = label_enc.fit_transform(new_values)
baseline_data = baseline_data.join(interactions)
launched = pd.Series(ks.index, index=ks.launched, name="count_7_days").sort_index()
count_7_days = launched.rolling('7d').count() - 1
count_7_days.index = launched.values
count_7_days = count_7_days.reindex(ks.index)
baseline_data = baseline_data.join(count_7_days)
def time_since_last_project(series):
    # Return the time in hours
    return series.diff().dt.total_seconds() / 3600.
df = ks[['category', 'launched']].sort_values('launched')
timedeltas = df.groupby('category').transform(time_since_last_project)
timedeltas = timedeltas.fillna(timedeltas.max())
baseline_data = baseline_data.join(timedeltas.rename({'launched': 'time_since_last_project'}, axis=1))
def get_data_splits(dataframe, valid_fraction=0.1):
    valid_size = int(len(dataframe) * valid_fraction)
    train = dataframe[:-valid_size * 2]
    # valid size == test size, last two sections of the data
    valid = dataframe[-valid_size * 2:-valid_size]
    test = dataframe[-valid_size:]
    return train, valid, test
def train_model(train, valid):
    feature_cols = train.columns.drop('outcome')
    dtrain = lgb.Dataset(train[feature_cols], label=train['outcome'])
    dvalid = lgb.Dataset(valid[feature_cols], label=valid['outcome'])
    param = {'num_leaves': 64, 'objective': 'binary',
             'metric': 'auc', 'seed': 7}
    print("Training model!")
    bst = lgb.train(param, dtrain, num_boost_round=1000, valid_sets=[dvalid],
                    early_stopping_rounds=10, verbose_eval=False)
    valid_pred = bst.predict(valid[feature_cols])
    valid_score = metrics.roc_auc_score(valid['outcome'], valid_pred)
    print(f"Validation AUC score: {valid_score:.4f}")
    return bst
We can use a few statistical methods for this analysis.
The F-value measures the linear dependency between a feature and the target.
This means that if the relationship is nonlinear, the score may understate how strongly the feature and the target are actually related.
The mutual information score, by contrast, is nonparametric, so it can also capture nonlinear relationships.
With sklearn.feature_selection.SelectKBest we can specify how many features to keep; calling .fit_transform(features, target) then returns the selected features.
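The cells below use the F-test (f_classif). For reference, a minimal sketch of the mutual-information variant, assuming the same baseline_data frame built above, could look like this (as with f_classif, in practice you would fit it on the training split only):

from sklearn.feature_selection import SelectKBest, mutual_info_classif

feature_cols = baseline_data.columns.drop('outcome')

# Keep the 5 features with the highest estimated mutual information with the target
mi_selector = SelectKBest(mutual_info_classif, k=5)
X_mi = mi_selector.fit_transform(baseline_data[feature_cols], baseline_data['outcome'])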
baseline_data.columns.size
14
baseline_data currently has 14 columns: 13 features plus the outcome target.
from sklearn.feature_selection import SelectKBest, f_classif
feature_cols = baseline_data.columns.drop('outcome')
# Keep 5 features
selector = SelectKBest(f_classif, k=5)
X_new = selector.fit_transform(baseline_data[feature_cols], baseline_data['outcome'])
X_new
array([[2015., 5., 9., 18., 1409.],
[2017., 13., 22., 31., 957.],
[2013., 13., 22., 31., 739.],
...,
[2010., 13., 22., 31., 238.],
[2016., 13., 22., 31., 1100.],
[2011., 13., 22., 31., 542.]])
The data is reduced to the 5 selected features.
However, the approach above has a problem: the data was not split into train, validation, and test sets first, so target information from the validation and test rows leaks into the selector. That tends to produce a model that looks better than it really is, so we should split the data before applying this method.
feature_cols = baseline_data.columns.drop('outcome')
train, valid, _ = get_data_splits(baseline_data)
# Keep 5 features
selector = SelectKBest(f_classif, k=5)
X_new = selector.fit_transform(train[feature_cols], train['outcome'])
X_new
array([[2.015e+03, 5.000e+00, 9.000e+00, 1.800e+01, 1.409e+03],
[2.017e+03, 1.300e+01, 2.200e+01, 3.100e+01, 9.570e+02],
[2.013e+03, 1.300e+01, 2.200e+01, 3.100e+01, 7.390e+02],
...,
[2.011e+03, 1.300e+01, 2.200e+01, 3.100e+01, 5.150e+02],
[2.015e+03, 1.000e+00, 3.000e+00, 2.000e+00, 1.306e+03],
[2.013e+03, 1.300e+01, 2.200e+01, 3.100e+01, 1.084e+03]])
At this point the selected features no longer line up with the original columns, so we need to convert the result back to the original shape and then drop the all-zero columns.
We can use .inverse_transform to get the data back in the shape of the original feature set.
# Get back the features we've kept, zero out all other features
selected_features = pd.DataFrame(selector.inverse_transform(X_new),
                                 index=train.index,
                                 columns=feature_cols)
selected_features.head()
Then drop the all-zero columns.
# Dropped columns have values of all 0s, so var is 0, drop them
selected_columns = selected_features.columns[selected_features.var() != 0]
# Get the valid dataset with the selected features.
valid[selected_columns].join(valid['outcome']).head()
After feature selection, the validation AUC score is 0.6010.
train_model(train[selected_columns].join(train['outcome']), valid[selected_columns].join(valid['outcome']))
Training model!
[LightGBM] [Info] Number of positive: 107340, number of negative: 193350
[LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.007036 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 335
[LightGBM] [Info] Number of data points in the train set: 300690, number of used features: 5
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.356979 -> initscore=-0.588501
[LightGBM] [Info] Start training from score -0.588501
Validation AUC score: 0.6010
<lightgbm.basic.Booster at 0x7fbb8b7685c0>
With the original feature set, the validation AUC score is 0.7446.
train_model(train, valid)
Training model!
[LightGBM] [Info] Number of positive: 107340, number of negative: 193350
[LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.007786 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1553
[LightGBM] [Info] Number of data points in the train set: 300690, number of used features: 13
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.356979 -> initscore=-0.588501
[LightGBM] [Info] Start training from score -0.588501
Validation AUC score: 0.7446
<lightgbm.basic.Booster at 0x7fbb8729c5f8>
The approach above is univariate: it evaluates each feature's effect on the target one at a time.
L1 regularization instead judges the features jointly, using all of them together to model the target.
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel
train, valid, _ = get_data_splits(baseline_data)
X, y = train[train.columns.drop("outcome")], train['outcome']
# Set the regularization parameter C=1
logistic = LogisticRegression(C=1, penalty="l1", solver='liblinear', random_state=7).fit(X, y)
model = SelectFromModel(logistic, prefit=True)
X_new = model.transform(X)
X_new
array([[1.000e+03, 1.200e+01, 1.100e+01, ..., 1.900e+03, 1.800e+01,
1.409e+03],
[3.000e+04, 4.000e+00, 2.000e+00, ..., 1.630e+03, 3.100e+01,
9.570e+02],
[4.500e+04, 0.000e+00, 1.200e+01, ..., 1.630e+03, 3.100e+01,
7.390e+02],
...,
[2.500e+03, 0.000e+00, 3.000e+00, ..., 1.830e+03, 3.100e+01,
5.150e+02],
[2.600e+03, 2.100e+01, 2.300e+01, ..., 1.036e+03, 2.000e+00,
1.306e+03],
[2.000e+04, 1.600e+01, 4.000e+00, ..., 9.200e+02, 3.100e+01,
1.084e+03]])
As with the univariate approach, this returns the selected columns.
After removing the all-zero columns, we are left with the names of the selected columns.
# Get back the kept features as a DataFrame with dropped columns as all 0s
selected_features = pd.DataFrame(model.inverse_transform(X_new),
                                 index=X.index,
                                 columns=X.columns)
# Dropped columns have values of all 0s, keep other columns
selected_columns = selected_features.columns[selected_features.var() != 0]
train_model(train[selected_columns].join(train['outcome']), valid[selected_columns].join(valid['outcome']))
Training model!
[LightGBM] [Info] Number of positive: 107340, number of negative: 193350
[LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.007739 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1298
[LightGBM] [Info] Number of data points in the train set: 300690, number of used features: 12
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.356979 -> initscore=-0.588501
[LightGBM] [Info] Start training from score -0.588501
Validation AUC score: 0.7462
<lightgbm.basic.Booster at 0x7fbb8729b128>
With L1 regularization, the validation AUC score is 0.7462.
In this case, the time_since_last_project column was dropped.
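To see which columns were dropped, we can compare the full feature list against the selected columns; a small sketch, reusing the X and selected_columns variables from the cells above:

# Features whose coefficients were shrunk to zero by the L1 penalty
dropped_columns = X.columns.difference(selected_columns)
print(dropped_columns)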
In practice, L1 regularization tends to outperform the univariate tests, but it can be slow when there are many features.
Univariate tests run much faster on large datasets, but the features they select usually perform worse.
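One knob worth noting is the regularization parameter C used above: smaller values of C mean stronger regularization, so more coefficients are driven to zero and fewer features survive. A rough sketch of how one might explore this trade-off (the grid of C values is only illustrative), reusing X and y from above:

from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel

# Illustrative grid: smaller C -> stronger L1 penalty -> fewer features kept
for C in [0.01, 0.1, 1]:
    logistic = LogisticRegression(C=C, penalty="l1", solver='liblinear',
                                  random_state=7).fit(X, y)
    n_kept = SelectFromModel(logistic, prefit=True).transform(X).shape[1]
    print(f"C={C}: kept {n_kept} features")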