Wrapper methods evaluate each candidate feature subset by training a machine learning model on it, guided by a search strategy. These methods are also known as greedy algorithms, because they aim to find the feature combination that achieves the best result under model training. This requires substantial computing resources, and an exhaustive search is usually not feasible.
Essentially, any combination of a search strategy and a machine learning algorithm can be used as a wrapper.
Advantages
Compared with filter methods, wrapper methods have the following two advantages:
They can detect interactions between variables.
They can find the best feature subset for the specific machine learning algorithm we intend to use.
As a result, wrapper methods usually produce more accurate predictions than filter methods.
Steps
Search for a feature subset: use a search method to pick a subset of features from the dataset.
Build a machine learning model: train a chosen machine learning algorithm on the subset selected in the previous step.
Evaluate the model's performance.
Repeat the first three steps until a stopping criterion is met (a minimal sketch of this loop follows below).
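The four steps above amount to a search loop wrapped around model training and evaluation. Below is a minimal sketch of that loop as a greedy forward search written with plain scikit-learn; the names selected, remaining, and best_score are illustrative and not part of any library API:

import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
model = DecisionTreeClassifier(min_samples_leaf=20)

selected, remaining = [], list(range(X.shape[1]))
best_score = -np.inf
while remaining:
    # Step 1 (search): score every subset formed by adding one more feature.
    # Steps 2-3 (train and evaluate) happen inside cross_val_score.
    scores = {f: cross_val_score(model, X[:, selected + [f]], y, cv=5).mean()
              for f in remaining}
    best_f = max(scores, key=scores.get)
    # Step 4 (stopping criterion): stop when no candidate improves the score.
    if scores[best_f] <= best_score:
        break
    best_score = scores[best_f]
    selected.append(best_f)
    remaining.remove(best_f)

print(selected, best_score)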
Stopping criteria
At some point we need to stop searching for feature subsets, so we define stopping conditions in advance, for example: model performance decreases, model performance increases, or a predefined number of features is reached.
The evaluation metric can be, for example, ROC-AUC for classification or RMSE for regression.
Search methods
Forward Feature Selection: also called step forward feature selection or sequential forward selection (SFS). This method starts with an empty feature subset and then adds one feature at a time.
Backward Feature Elimination: also called step backward feature selection or sequential backward selection (SBS). This method starts with a subset containing all the features in the dataset and then removes one feature at a time.
Exhaustive Feature Selection: this method tests every possible combination of features.
Bidirectional Search: to converge on a single solution, this method runs forward selection and backward elimination at the same time.
Running wrapper methods with Mlxtend, using a RandomForestClassifier to evaluate the feature subsets:
Forward feature selection:
from mlxtend.feature_selection import SequentialFeatureSelector
from sklearn.ensemble import RandomForestClassifier

sfs = SequentialFeatureSelector(RandomForestClassifier(),
                                k_features=10,   # stop once 10 features are selected
                                forward=True,    # add features one at a time
                                floating=False,
                                scoring='accuracy',
                                cv=2)
Backward feature elimination:
from mlxtend.feature_selection import SequentialFeatureSelector
from sklearn.ensemble import RandomForestClassifier

sbs = SequentialFeatureSelector(RandomForestClassifier(),
                                k_features=10,
                                forward=False,   # set to False for backward elimination
                                floating=False,
                                scoring='accuracy',
                                cv=2)
Exhaustive feature selection:
from mlxtend.feature_selection import ExhaustiveFeatureSelector
from sklearn.ensemble import RandomForestClassifier

efs = ExhaustiveFeatureSelector(RandomForestClassifier(),
                                min_features=4,   # smallest subset size to try
                                max_features=10,  # largest subset size to try
                                scoring='roc_auc',
                                cv=2)
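Whichever configuration is used, the selector is then fitted like any ordinary scikit-learn estimator. A short sketch, assuming X and y are a feature matrix and target vector that have already been loaded:

sfs = sfs.fit(X, y)
print(sfs.k_feature_idx_)      # indices of the selected features
print(sfs.k_score_)            # cross-validated score of that subset
X_selected = sfs.transform(X)  # keep only the selected columns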
Finding the most important features of the built-in wine dataset.
import numpy as np
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from mlxtend.feature_selection import ExhaustiveFeatureSelector as EXS
wine = load_wine()
data = wine['data']
target = wine['target']
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.3, random_state=0)
Machine learning model: a DecisionTreeClassifier.
m = DecisionTreeClassifier(min_samples_leaf=20)
Feature selection:
Use Mlxtend's SequentialFeatureSelector to perform backward feature elimination.
Mlxtend uses cross-validation (cv); here we set cv=10.
sfs = SFS(m, forward=False, cv=10, k_features = (2, 6), scoring='accuracy', verbose=False, n_jobs=-1)
sfs.fit(X_train, y_train, custom_feature_names=wine['feature_names'])
SequentialFeatureSelector(cv=10,
estimator=DecisionTreeClassifier(min_samples_leaf=20),
forward=False, k_features=(2, 6), n_jobs=-1,
scoring='accuracy', verbose=False)
print(f"Best score achieved: {sfs.k_score_}, Feature's names: {sfs.k_feature_names_}")
Best score achieved: 0.8871794871794872, Feature's names: ('alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'od280/od315_of_diluted_wines')
display(pd.DataFrame(sfs.get_metric_dict()))
| k | feature_idx | cv_scores | avg_score | feature_names | ci_bound | std_dev | std_err |
|---|---|---|---|---|---|---|---|
| 13 | (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12) | [0.8461538461538461, 0.9230769230769231, 0.923...] | 0.877564 | (alcohol, malic_acid, ash, alcalinity_of_ash, ...) | 0.0499994 | 0.0673199 | 0.02244 |
| 12 | (0, 1, 2, 3, 4, 5, 7, 8, 9, 10, 11, 12) | [0.8461538461538461, 0.9230769230769231, 0.923...] | 0.885897 | (alcohol, malic_acid, ash, alcalinity_of_ash, ...) | 0.0493767 | 0.0664815 | 0.0221605 |
| 11 | (0, 1, 2, 3, 4, 5, 7, 8, 9, 10, 11) | [0.8461538461538461, 0.9230769230769231, 0.923...] | 0.885897 | (alcohol, malic_acid, ash, alcalinity_of_ash, ...) | 0.0493767 | 0.0664815 | 0.0221605 |
| 10 | (0, 1, 2, 3, 4, 5, 7, 8, 10, 11) | [0.7692307692307693, 0.8461538461538461, 0.923...] | 0.887179 | (alcohol, malic_acid, ash, alcalinity_of_ash, ...) | 0.0539568 | 0.0726483 | 0.0242161 |
| 9 | (0, 1, 2, 3, 4, 5, 7, 8, 11) | [0.7692307692307693, 0.8461538461538461, 0.923...] | 0.887179 | (alcohol, malic_acid, ash, alcalinity_of_ash, ...) | 0.0539568 | 0.0726483 | 0.0242161 |
| 8 | (0, 1, 2, 3, 4, 5, 7, 11) | [0.7692307692307693, 0.8461538461538461, 0.923...] | 0.887179 | (alcohol, malic_acid, ash, alcalinity_of_ash, ...) | 0.0539568 | 0.0726483 | 0.0242161 |
| 7 | (0, 1, 2, 3, 4, 5, 11) | [0.7692307692307693, 0.8461538461538461, 0.923...] | 0.887179 | (alcohol, malic_acid, ash, alcalinity_of_ash, ...) | 0.0539568 | 0.0726483 | 0.0242161 |
| 6 | (0, 1, 2, 3, 4, 11) | [0.7692307692307693, 0.8461538461538461, 0.923...] | 0.887179 | (alcohol, malic_acid, ash, alcalinity_of_ash, ...) | 0.0539568 | 0.0726483 | 0.0242161 |
| 5 | (0, 1, 2, 3, 11) | [0.7692307692307693, 0.8461538461538461, 0.923...] | 0.887179 | (alcohol, malic_acid, ash, alcalinity_of_ash, ...) | 0.0539568 | 0.0726483 | 0.0242161 |
| 4 | (0, 1, 2, 11) | [0.7692307692307693, 0.8461538461538461, 0.923...] | 0.887179 | (alcohol, malic_acid, ash, od280/od315_of_dilu...) | 0.0539568 | 0.0726483 | 0.0242161 |
| 3 | (0, 1, 11) | [0.7692307692307693, 0.8461538461538461, 0.923...] | 0.887179 | (alcohol, malic_acid, od280/od315_of_diluted_w...) | 0.0539568 | 0.0726483 | 0.0242161 |
| 2 | (0, 11) | [0.7692307692307693, 0.8461538461538461, 0.923...] | 0.887179 | (alcohol, od280/od315_of_diluted_wines) | 0.0539568 | 0.0726483 | 0.0242161 |
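The metric dict above can also be visualized with Mlxtend's built-in plotting helper, which makes it easy to see that the average score barely changes as features are removed. A brief sketch, assuming matplotlib is installed:

from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs
import matplotlib.pyplot as plt

# Cross-validated performance (with a standard-deviation band)
# as a function of the number of selected features.
plot_sfs(sfs.get_metric_dict(), kind='std_dev')
plt.show()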
Use Mlxtend's ExhaustiveFeatureSelector to run exhaustive feature selection.
efs = EXS(m, min_features = 2, max_features=6, cv=10, scoring='accuracy')
efs.fit(X_train, y_train, custom_feature_names=wine['feature_names'])
Features: 4082/4082
ExhaustiveFeatureSelector(cv=10,
estimator=DecisionTreeClassifier(min_samples_leaf=20),
max_features=6, min_features=2)
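The 4082 subsets reported in the progress output are exactly the number of ways to choose between 2 and 6 of the wine dataset's 13 features; a quick check:

from math import comb

# C(13,2) + C(13,3) + C(13,4) + C(13,5) + C(13,6) = 4082
print(sum(comb(13, k) for k in range(2, 7)))  # 4082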
print(f"Best score achieved: {efs.best_score_}, Feature's names: {efs.best_feature_names_}")
Best score achieved: 0.8871794871794872, Feature's names: ('alcohol', 'od280/od315_of_diluted_wines')
display(pd.DataFrame(efs.get_metric_dict()))
| index | feature_idx | cv_scores | avg_score | feature_names | ci_bound | std_dev | std_err |
|---|---|---|---|---|---|---|---|
| 0 | (0, 1) | [0.6153846153846154, 0.6923076923076923, 0.769...] | 0.783974 | (alcohol, malic_acid) | 0.0727371 | 0.0979344 | 0.0326448 |
| 1 | (0, 2) | [0.46153846153846156, 0.6923076923076923, 0.53...] | 0.670513 | (alcohol, ash) | 0.0941471 | 0.126761 | 0.0422537 |
| 2 | (0, 3) | [0.7692307692307693, 0.6923076923076923, 0.769...] | 0.807692 | (alcohol, alcalinity_of_ash) | 0.0591206 | 0.0796008 | 0.0265336 |
| 3 | (0, 4) | [0.38461538461538464, 0.6923076923076923, 0.61...] | 0.670513 | (alcohol, magnesium) | 0.100843 | 0.135776 | 0.0452588 |
| 4 | (0, 5) | [0.7692307692307693, 0.6923076923076923, 0.769...] | 0.832051 | (alcohol, total_phenols) | 0.0711221 | 0.0957599 | 0.03192 |
| 5 | (0, 6) | [0.7692307692307693, 0.7692307692307693, 0.769...] | 0.823077 | (alcohol, flavanoids) | 0.0509759 | 0.0686347 | 0.0228782 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 4075 | (6, 7, 8, 9, 10, 11) | [0.8461538461538461, 0.9230769230769231, 0.923...] | 0.877564 | (flavanoids, nonflavanoid_phenols, proanthocya...) | 0.0499994 | 0.0673199 | 0.02244 |
| 4076 | (6, 7, 8, 9, 10, 12) | [0.8461538461538461, 0.9230769230769231, 0.923...] | 0.877564 | (flavanoids, nonflavanoid_phenols, proanthocya...) | 0.0499994 | 0.0673199 | 0.02244 |
| 4077 | (6, 7, 8, 9, 11, 12) | [0.8461538461538461, 0.9230769230769231, 0.923...] | 0.877564 | (flavanoids, nonflavanoid_phenols, proanthocya...) | 0.0499994 | 0.0673199 | 0.02244 |
| 4078 | (6, 7, 8, 10, 11, 12) | [0.6153846153846154, 0.7692307692307693, 0.846...] | 0.807051 | (flavanoids, nonflavanoid_phenols, proanthocya...) | 0.0748383 | 0.100763 | 0.0335878 |
| 4079 | (6, 7, 9, 10, 11, 12) | [0.8461538461538461, 0.9230769230769231, 0.923...] | 0.877564 | (flavanoids, nonflavanoid_phenols, color_inten...) | 0.0499994 | 0.0673199 | 0.02244 |
| 4080 | (6, 8, 9, 10, 11, 12) | [0.8461538461538461, 0.9230769230769231, 0.923...] | 0.877564 | (flavanoids, proanthocyanins, color_intensity,...) | 0.0499994 | 0.0673199 | 0.02244 |
| 4081 | (7, 8, 9, 10, 11, 12) | [0.8461538461538461, 0.9230769230769231, 0.923...] | 0.885897 | (nonflavanoid_phenols, proanthocyanins, color_...) | 0.0493767 | 0.0664815 | 0.0221605 |

7 rows × 4082 columns
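As with the sequential selector, the fitted exhaustive selector exposes the winning subset directly; a short sketch:

print(efs.best_idx_)                   # indices of the best subset, here (0, 11)
X_train_best = efs.transform(X_train)  # keep only the winning columns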
Although wrapper methods are more complex, they are a very good approach to feature selection, and they are best applied after a filter method has already removed some of the features.
Reader suggestion:
The main difference between filter methods and wrapper methods is that the former use an evaluation function (a class-separability measure) while the latter use a classifier's performance (accuracy, g-mean, ...). In other words, both can use the same search strategies: either individual feature selection (each feature evaluated on its own) or feature-vector selection (suboptimal, floating, or optimal search for the best combination of features).
That is:
search strategy (individual feature selection or feature-vector selection) + machine learning algorithm = wrapper method
search strategy (individual feature selection or feature-vector selection) + evaluation function (class-separability measure) = filter method
Offered for the author's reference; see also sections 5.5–5.6 of Pattern Recognition, 2nd edition (Sergios Theodoridis).
Thanks