Day30 - Feature Selection -- 3. Embedded methods - iT 邦幫忙

12th iThome Ironman Contest

DAY 30

AI & Data

Machine Learning series, part 30


tjabi

2020-09-30 23:24:19


3. Embedded methods

Embedded methods perform feature selection while the machine learning model is being trained.

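Advantages:
Embedded methods combine the strengths of filter methods and wrapper methods, addressing the problems we run into when using either one:
Like wrapper methods, they can detect interactions between variables.
Like filter methods, they run relatively quickly.
Their results are more accurate than those of filter methods.
They find the feature subset for the algorithm being trained.
They are less prone to overfitting.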

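Steps:
First, train a machine learning model.
Obtain the feature importance values from the model. The importance measures how much each feature contributes when the model makes a prediction.
Finally, remove the unimportant features.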
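
The three steps above can be sketched with scikit-learn's `SelectFromModel`. This is only a minimal sketch: the synthetic data from `make_regression`, the choice of `Lasso(alpha=1.0)` as the model, and the variable names are assumptions made so the snippet is self-contained, not the dataset or settings used later in this article.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# Synthetic data (assumption): 8 features, only 3 of which drive the target.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=8, n_informative=3, noise=1.0, random_state=0
)

# Step 1: train a model. SelectFromModel fits the wrapped estimator for us.
selector = SelectFromModel(Lasso(alpha=1.0)).fit(X_demo, y_demo)

# Step 2: read the per-feature importance (here, the absolute coefficients).
demo_importance = np.abs(selector.estimator_.coef_)

# Step 3: remove the unimportant features.
kept_idx = selector.get_support(indices=True)
X_selected = selector.transform(X_demo)

print("importance per feature:", demo_importance.round(3))
print("kept feature indices:", kept_idx)
print("shape after selection:", X_selected.shape)
```

Because selection happens inside the fitted model, there is no separate search over feature subsets as there would be with a wrapper method.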

Methods:
Regularization methods
Tree-based methods

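Regularization methods
Regularization methods are the most common approach. In machine learning, regularization adds a penalty to the model's parameters to reduce their degrees of freedom. The penalty is applied to the coefficients, and in a linear model each coefficient multiplies a feature, so constraining the coefficients in effect penalizes the features. This helps avoid overfitting and improves the model's generalization.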

For linear models, there are three main types of regularization:
Lasso regression, or L1 regularization
Ridge regression, or L2 regularization
Elastic nets, or L1/L2 regularization

Lasso (Least Absolute Shrinkage and Selection Operator) regression
Uses L1 regularization to drive the coefficients of unimportant features to 0.

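If a feature is unimportant, Lasso penalizes its coefficient and shrinks it to 0. Features whose coefficient is 0 are therefore removed, and the remaining ones are the features we select.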
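
Why an L1 penalty can produce exact zeros is easiest to see in a simplified setting. The sketch below assumes an orthonormal design matrix (an idealisation that this article does not state), in which case the L1 solution is the least-squares coefficient passed through a soft-threshold, while the L2 solution is only a rescaling. The helper names `lasso_shrink` and `ridge_shrink` and the sample values are hypothetical.

```python
def lasso_shrink(b, lam):
    """Soft-threshold: L1 solution for one coefficient (orthonormal design assumed)."""
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0  # coefficients within [-lam, lam] become exactly zero


def ridge_shrink(b, lam):
    """L2 solution under the same assumption: scaled down, never exactly zero."""
    return b / (1.0 + lam)


ols_coefs = [3.0, -0.4, 0.1, -2.5]  # illustrative values, not from the article
lam = 0.5

lasso_coefs = [lasso_shrink(b, lam) for b in ols_coefs]
ridge_coefs = [ridge_shrink(b, lam) for b in ols_coefs]

print(lasso_coefs)  # [2.5, 0.0, 0.0, -2.0]
print(ridge_coefs)  # every entry shrinks but stays non-zero
```

The two small coefficients vanish under the L1 rule and merely shrink under the L2 rule, which is the behaviour the Lasso and Ridge results in this article show on real data.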

Code example:

import numpy as np
import pandas as pd
# load_boston was removed in scikit-learn 1.2; this example needs an older version.
from sklearn.datasets import load_boston

# Load the Boston housing data into a DataFrame.
boston = load_boston()
df = pd.DataFrame(boston.data, columns=boston.feature_names)
df["MEDV"] = boston.target

# Split into features (x) and target (y).
x = df.drop("MEDV", axis=1)
y = df["MEDV"]

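from sklearn.linear_model import LassoCV

# LassoCV chooses the regularization strength (alpha) by cross-validation.
reg = LassoCV()
reg.fit(x, y)

# Pair each coefficient with its feature name.
coef = pd.Series(reg.coef_, index=x.columns)
coef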

CRIM	-0.063437
ZN	0.049165
INDUS	-0.000000
CHAS	0.000000
NOX	-0.000000
RM	0.949811
AGE	0.020910
DIS	-0.668790
RAD	0.264206
TAX	-0.015212
PTRATIO	-0.722966
B	0.008247
LSTAT	-0.761115
dtype: float64

# Keep only the features whose Lasso coefficient is non-zero.
keep_cols = [feature for feature, weight in zip(x.columns, reg.coef_) if weight != 0]
keep_cols

['CRIM', 'ZN', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']
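
How many features survive depends on the penalty strength `alpha`, which `LassoCV` picks by cross-validation. The sketch below fixes `alpha` by hand to show the effect. It uses synthetic data from `make_regression` as a stand-in (an assumption, so that it runs on current scikit-learn versions), and the `alpha` values are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic stand-in data (assumption): 10 features, 3 informative.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=5.0, random_state=0
)

alphas = [0.01, 1.0, 10.0, 1e4]  # arbitrary values, from weak to very strong
n_kept = []
for alpha in alphas:
    lasso = Lasso(alpha=alpha, max_iter=10_000).fit(X_demo, y_demo)
    # Count the coefficients that were not driven to zero.
    n_kept.append(int(np.sum(lasso.coef_ != 0)))

for alpha, k in zip(alphas, n_kept):
    print(f"alpha={alpha:g}: {k} features kept")
```

With a very large `alpha` every coefficient is pushed to zero, so the selected set is a consequence of the chosen `alpha` as much as of the data.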

Ridge regression
Uses L2 regularization, which spreads the coefficient values more evenly and does not shrink them to exactly 0.

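from sklearn.linear_model import RidgeCV

# RidgeCV chooses the regularization strength (alpha) by cross-validation.
ridgecv = RidgeCV()
ridgecv.fit(x, y)

# Pair each coefficient with its feature name.
coef_rc = pd.Series(ridgecv.coef_, index=x.columns)
coef_rc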

CRIM	-0.107474
ZN	0.046572
INDUS	0.015999
CHAS	2.670019
NOX	-16.684645
RM	3.818233
AGE	-0.000269
DIS	-1.459626
RAD	0.303515
TAX	-0.012421
PTRATIO	-0.940759
B	0.009368
LSTAT	-0.525966
dtype: float64

# Ridge does not shrink coefficients to exactly 0, so this filter keeps every feature.
keep_cols = [feature for feature, weight in zip(x.columns, ridgecv.coef_) if weight != 0]
keep_cols

['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']
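
Since the non-zero filter keeps every feature under Ridge, selecting with Ridge would need some explicit cut-off. One possible approach, sketched here and not taken from this article, is to standardize the features so the coefficients are comparable and then keep those whose absolute coefficient is at least the mean. The synthetic data and the mean cut-off are both assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 8 features, 3 informative.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=8, n_informative=3, noise=5.0, random_state=0
)

# Put all features on the same scale so coefficient sizes are comparable.
X_scaled = StandardScaler().fit_transform(X_demo)
ridge = Ridge(alpha=1.0).fit(X_scaled, y_demo)

abs_coef = np.abs(ridge.coef_)
threshold = abs_coef.mean()  # hypothetical cut-off, chosen for illustration
kept_idx = np.flatnonzero(abs_coef >= threshold)
ranking = np.argsort(abs_coef)[::-1]  # feature indices, largest |coef| first

print("all coefficients non-zero:", bool(np.all(ridge.coef_ != 0)))
print("ranking:", ranking)
print("kept:", kept_idx)
```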

Elastic nets
Elastic nets are models that combine lasso and ridge regression.

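from sklearn.linear_model import ElasticNet

# alpha sets the overall penalty strength; l1_ratio is left at its default of 0.5.
e_net = ElasticNet(alpha=1)
e_net.fit(x, y)

# Pair each coefficient with its feature name.
coef = pd.Series(e_net.coef_, index=x.columns)
coef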

CRIM	-0.080371
ZN	0.053240
INDUS	-0.012657
CHAS	0.000000
NOX	-0.000000
RM	0.933936
AGE	0.020579
DIS	-0.762044
RAD	0.301569
TAX	-0.016439
PTRATIO	-0.748046
B	0.008339
LSTAT	-0.758426
dtype: float64

# Keep only the features whose elastic net coefficient is non-zero.
keep_cols = [feature for feature, weight in zip(x.columns, e_net.coef_) if weight != 0]
keep_cols

['CRIM', 'ZN', 'INDUS', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']
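
In scikit-learn the mix between the two penalties is set by `ElasticNet`'s `l1_ratio` parameter, and with `l1_ratio=1.0` the model uses the L1 penalty alone. The sketch below compares a few mixes on synthetic data. The data, the `alpha` value, and the chosen ratios are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso

# Synthetic stand-in data (assumption): 10 features, 3 informative.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=5.0, random_state=0
)

alpha = 1.0
kept_by_ratio = {}
for l1_ratio in (0.1, 0.5, 1.0):
    enet = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, max_iter=10_000)
    enet.fit(X_demo, y_demo)
    # Count the coefficients that were not driven to zero at this mix.
    kept_by_ratio[l1_ratio] = int(np.sum(enet.coef_ != 0))

# With l1_ratio=1.0 the elastic net should match Lasso at the same alpha.
lasso_coef = Lasso(alpha=alpha, max_iter=10_000).fit(X_demo, y_demo).coef_
enet_l1_coef = (
    ElasticNet(alpha=alpha, l1_ratio=1.0, max_iter=10_000).fit(X_demo, y_demo).coef_
)

print(kept_by_ratio)
print("matches Lasso at l1_ratio=1.0:", np.allclose(lasso_coef, enet_l1_coef))
```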

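Tree-based methods
Tree-based algorithms and models also provide feature importances that we can use for feature selection. Any tree-based learner can be used. Gradient boosting algorithms (such as XGBoost and CatBoost) are often a good choice because they tend to give reliable feature importances.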

from sklearn.ensemble import RandomForestRegressor

# create the random forest with your hyperparameters.
model = RandomForestRegressor(n_estimators=340)

# fit the model to start training.
model.fit(x, y)

# get the importance of the resulting features.
importances = model.feature_importances_

# create a data frame for visualization, most important features first.
final_df = pd.DataFrame({"Features": x.columns, "Importances": importances})
final_df.sort_values("Importances", ascending=False)
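
To turn the importances into an actual selection, one option is scikit-learn's `SelectFromModel`, sketched below. The synthetic data, the `threshold="mean"` cut-off, and `n_estimators=100` are assumptions for illustration and not the settings used above.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in data (assumption): 8 features, 3 informative.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=8, n_informative=3, noise=5.0, random_state=0
)

forest = RandomForestRegressor(n_estimators=100, random_state=0)

# Keep the features whose importance is at least the mean importance.
selector = SelectFromModel(forest, threshold="mean").fit(X_demo, y_demo)

importances = selector.estimator_.feature_importances_
kept_idx = selector.get_support(indices=True)
X_selected = selector.transform(X_demo)

print("importances:", importances.round(3))
print("kept feature indices:", kept_idx)
print("shape after selection:", X_selected.shape)
```

The impurity-based importances of a random forest sum to 1, so the mean threshold here equals one divided by the number of features.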