Day 13: Model Training Step 8 | Hyperparameter Tuning

2024 iThome Ironman Contest

Generative AI

From 0 to 1: Learning to Build Generative AI Models and Prompt Techniques, series part 13

16th Ironman Contest

John Wu

2024-09-04 21:39:42

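Once a model has been chosen, the next step is to optimize it, which means tuning its hyperparameters. The goal of tuning is to improve the model's accuracy, so we need to establish three things: what the best parameter values are, how much performance changed between the untuned and the tuned model, and how accurate the tuned model is. These details are what let us judge whether the model is ready!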

from sklearn.model_selection import GridSearchCV, train_test_split
from xgboost import XGBRegressor
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt

# XGBoost parameter grid
xgb_param_grid = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.1, 0.3],
    'max_depth': [3, 5, 7],
    'min_child_weight': [1, 3, 5],
    'subsample': [0.7, 0.8, 0.9],
    'colsample_bytree': [0.7, 0.8, 0.9]
}

# Create the XGBoost model
xgb_model = XGBRegressor(random_state=42)

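# Run the grid search.
# X_train and y_train are assumed to be the training split prepared in the earlier steps.
# Note: 3 values for each of 6 parameters gives 3**6 = 729 combinations, so cv=5 means 3,645 fits.
xgb_grid_search = GridSearchCV(
    xgb_model,
    xgb_param_grid,
    cv=5,
    scoring='neg_mean_squared_error',
    n_jobs=-1,
    verbose=2,
)
xgb_grid_search.fit(X_train, y_train)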

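# Print the best parameters and score (the score is negative MSE, so negate it before the square root)
print("XGBoost best parameters:", xgb_grid_search.best_params_)
print("XGBoost best cross-validation RMSE:", np.sqrt(-xgb_grid_search.best_score_))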

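# Build a new model with the best parameters and fit it on the full training set
# (with the default refit=True, xgb_grid_search.best_estimator_ holds an equivalent fitted model)
best_xgb = XGBRegressor(**xgb_grid_search.best_params_, random_state=42)
best_xgb.fit(X_train, y_train)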

# Evaluate the model on the test set
y_pred = best_xgb.predict(X_test)
test_rmse = np.sqrt(mean_squared_error(y_test, y_pred))
test_r2 = r2_score(y_test, y_pred)

print("\nPerformance on the test set:")
print(f"RMSE: {test_rmse:.4f}")
print(f"R2 Score: {test_r2:.4f}")

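# Feature importance, labelled with generic names Feature_0, Feature_1, ...
# The column count is taken from X_train so it always matches the fitted model.
feature_importance = pd.DataFrame({
    'feature': [f'Feature_{i}' for i in range(X_train.shape[1])],
    'importance': best_xgb.feature_importances_
}).sort_values('importance', ascending=False)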

print("\nXGBoost feature importance (top 10):")
print(feature_importance.head(10))

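# Visualize feature importance.
# Pass ax explicitly: DataFrame.plot would otherwise open its own figure and ignore figsize.
fig, ax = plt.subplots(figsize=(10, 6))
feature_importance.head(10).plot(x='feature', y='importance', kind='bar', ax=ax)
ax.set_title('XGBoost feature importance (top 10)')
plt.tight_layout()
plt.show()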

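After tuning, the fitted model's feature importances also tell us which features matter most. You can consider removing the features that contribute very little and see whether that improves accuracy.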

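I consider this the most important step in training a model. If we simply hand the data to the AI and walk away, a poor result is sometimes not the model's fault: the features or the data we chose at the very beginning may have been wrong. That makes this step a chance to re-examine the earlier optimization decisions. Once it is done, we can move on to model evaluation and see how the model actually performs!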