[Day 23] Hyperparameter tuning, part II

11th iThome Ironman Contest

DAY 23

AI & Data

Learning from top Kagglers how to win data analysis competitions, part 23 of the series

madeleine

2019-09-24 22:50:31

Tree-based models

GBDT: XGBoost, LightGBM, CatBoost
RandomForest/ExtraTrees

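XGBoost and LightGBM are the most popular gradient boosting libraries today. RandomForest and ExtraTrees are not gradient boosting methods (they average independently grown trees), but they are also very strong tools. There is also Regularized Greedy Forest (RGF), which runs slowly but suits small datasets.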

Model	Library
GBDT	XGBoost (dmlc/xgboost), LightGBM (Microsoft/LightGBM), CatBoost (catboost/catboost)
RandomForest, ExtraTrees	scikit-learn
Others	RGF (baidu/fast_rgf)

GBDT

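max_depth : Controls the maximum depth of each tree. The optimal value can be as low as 2 or as high as 27. A sensible approach is to keep increasing the depth as long as the validation score improves; if deeper trees keep helping, that suggests the data contains feature interactions worth extracting as new features. A good starting value is 7. Training time grows with depth.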

max_depth/num_leaves : LightGBM has max_depth too, but it also controls the number of leaves per tree through num_leaves.
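To relate the two settings: a binary tree of depth d has at most 2^d leaves, so a max_depth value implies an upper bound on any useful num_leaves. The sketch below only does this arithmetic; the helper name max_leaves_for_depth is made up for illustration and is not part of LightGBM.

```python
def max_leaves_for_depth(max_depth: int) -> int:
    """Upper bound on the number of leaves of a binary tree with the given depth."""
    if max_depth < 0:
        raise ValueError("max_depth must be non-negative")
    return 2 ** max_depth


# Print the bound for a few depths, including the suggested starting depth of 7.
for depth in (3, 5, 7):
    print(f"max_depth={depth}: at most {max_leaves_for_depth(depth)} leaves")
```

Because LightGBM grows trees leaf by leaf, a num_leaves value well below this bound is a common way to keep a tree of a given depth from becoming too complex.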

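subsample, bagging_fraction : A value between 0 and 1. Each tree is fit on a random fraction of the training rows, which helps control overfitting, so this parameter acts much like regularization.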

colsample_bytree, colsample_bylevel : The fraction of columns (features) sampled per tree or per tree level. Lower these when the model overfits.

min_child_weight, lambda, alpha : These are regularization parameters as well.

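min_child_weight : The most important parameter of this group. Increasing it makes the model more conservative; decreasing it makes the model less constrained and more flexible. The optimal value varies widely with the data: it can be 0, 5, 15, or even 300.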

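eta, num_round : These two are tuned as a pair. eta is the learning rate, analogous to the step size in gradient descent, and num_round is the number of boosting rounds, i.e. how many trees are built. Each new tree is added to the model with weight eta. One approach is to fix eta at a small value such as 0.1 or 0.01 and train until the model starts to overfit, which tells you how many rounds are needed. Tip: multiply num_round by α and divide eta by α; the model's score usually improves, at the cost of longer training.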
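To make "each new tree is added with weight eta" concrete, here is a minimal pure-Python sketch of gradient boosting for squared error, using one-split trees (stumps) on made-up 1-D data. It is a toy, not how XGBoost or LightGBM are implemented; the function names fit_stump, boost and rescale are hypothetical, and rescale simply applies the α tip from the paragraph above.

```python
def fit_stump(xs, residuals):
    """Return (threshold, left_value, right_value) of the best one-split tree."""
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the largest value so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lv, rv)
    return best[1:]


def boost(xs, ys, eta, num_round):
    """Fit num_round stumps; return the training MSE after each round."""
    preds = [0.0] * len(xs)
    losses = []
    for _ in range(num_round):
        # For squared error, the negative gradient is just the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        t, lv, rv = fit_stump(xs, residuals)
        # The new tree enters the model scaled by eta.
        preds = [p + eta * (lv if x <= t else rv) for x, p in zip(xs, preds)]
        losses.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return losses


def rescale(eta, num_round, alpha):
    """The tip from the text: divide eta by alpha, multiply num_round by alpha."""
    return eta / alpha, int(round(num_round * alpha))


xs = [i / 10 for i in range(40)]
ys = [x * x for x in xs]  # made-up target, for illustration only

base_losses = boost(xs, ys, eta=0.1, num_round=50)
eta2, rounds2 = rescale(0.1, 50, alpha=2)
fine_losses = boost(xs, ys, eta=eta2, num_round=rounds2)
print("eta=0.1, 50 rounds:", base_losses[-1])
print(f"eta={eta2}, {rounds2} rounds:", fine_losses[-1])
```

The sketch only tracks training error. In practice the comparison should be made on a validation set, since that is where the smaller eta with more rounds is expected to pay off.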

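seed : The random seed usually has little effect on the model. If it turns out to matter a lot, re-examine your validation scheme, for example how the validation split is randomized.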
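One way to check how much the seed matters is to retrain the same configuration under several seeds and look at the spread of validation scores. The sketch below assumes scikit-learn is installed and uses its GradientBoostingClassifier on synthetic data as a stand-in for XGBoost or LightGBM; the dataset, the five seeds, and the parameter values are arbitrary choices for illustration.

```python
from statistics import mean, pstdev

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic data, only so that the example runs on its own.
X, y = make_classification(n_samples=400, n_features=15, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

scores = []
for seed in range(5):
    # subsample < 1 makes training stochastic, so the seed can change the result.
    model = GradientBoostingClassifier(
        n_estimators=50, subsample=0.8, random_state=seed
    )
    model.fit(X_tr, y_tr)
    scores.append(model.score(X_val, y_val))

print("mean accuracy:", round(mean(scores), 4))
print("std across seeds:", round(pstdev(scores), 4))
```

If the standard deviation across seeds is comparable to the gains you are chasing by tuning, the validation setup is probably too noisy to trust small improvements.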

XGBoost	LightGBM
max_depth	max_depth/num_leaves
subsample	bagging_fraction
colsample_bytree, colsample_bylevel	feature_fraction
min_child_weight, lambda, alpha	min_data_in_leaf, lambda_l2, lambda_l1
eta, num_round	learning_rate, num_iterations
Others : seed	Others : *_seed
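The table above can be written down as a small lookup for renaming XGBoost-style parameter names to their LightGBM counterparts. This is a sketch under assumptions: XGB_TO_LGBM and rename_params are hypothetical names, not part of either library, and only the names are translated. The values generally need retuning; min_child_weight (a sum of hessians) and min_data_in_leaf (a row count) in particular are on different scales.

```python
# XGBoost name -> LightGBM name, following the table above.
# In XGBoost, lambda is the L2 penalty and alpha is the L1 penalty.
XGB_TO_LGBM = {
    "max_depth": "max_depth",
    "subsample": "bagging_fraction",
    "colsample_bytree": "feature_fraction",
    "min_child_weight": "min_data_in_leaf",  # same role, different scale
    "lambda": "lambda_l2",
    "alpha": "lambda_l1",
    "eta": "learning_rate",
    "num_round": "num_iterations",
    "seed": "seed",
}


def rename_params(xgb_params: dict) -> dict:
    """Rename known keys; unknown keys are passed through unchanged."""
    return {XGB_TO_LGBM.get(name, name): value for name, value in xgb_params.items()}


print(rename_params({"eta": 0.1, "subsample": 0.8, "lambda": 1.0}))
```

Note that in LightGBM, bagging_fraction only takes effect when bagging_freq is set to a positive value, so a renamed configuration may still need that extra setting.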

sklearn.RandomForest/ExtraTrees

n_estimators (the higher the better) : Start with a small number such as 10; if the training time is acceptable, move to 300 or more. In the example figure, 50 trees were already enough.
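max_depth : Same meaning as in XGBoost, except that RandomForest/ExtraTrees also accept None, which means unlimited depth; with unlimited depth the model can overfit almost immediately. Start from 7 and do not be afraid to experiment with 10, 20, or higher.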
max_features : Plays the same role as XGBoost's colsample parameters. The fewer features considered at each split, the faster training runs.
min_samples_leaf : The counterpart of XGBoost's min_child_weight and LightGBM's min_data_in_leaf.
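To find out how many trees are enough, the validation score can be tracked while the forest grows. The sketch below assumes scikit-learn is installed and uses RandomForestClassifier with warm_start=True, so each fit call only trains the trees that are still missing. The synthetic dataset, the list of tree counts, and max_depth=7 are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data, only so that the example runs on its own.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

forest = RandomForestClassifier(
    n_estimators=10, max_depth=7, warm_start=True, random_state=0
)

tree_counts = [10, 25, 50, 100, 200]
scores = []
for n in tree_counts:
    forest.set_params(n_estimators=n)
    forest.fit(X_tr, y_tr)  # adds trees to the existing forest instead of refitting
    scores.append(forest.score(X_val, y_val))

for n, s in zip(tree_counts, scores):
    print(f"n_estimators={n}: validation accuracy={s:.3f}")
```

Once the score stops improving from one count to the next, adding more trees mostly costs training time.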

Others