iT邦幫忙

2024 iThome 鐵人賽

DAY 28
Self-Challenge Group

30 Days of Programming Study Notes: My Self-Learning Journey, Part 28 of the series

[DAY 28] AutoML in Practice: Finding the Best Hyperparameters with Bayesian Optimization

Implementing AutoML and Bayesian Optimization in Python

Here we use optuna to perform Bayesian optimization and integrate it into an AutoML workflow that automatically selects a model and tunes its hyperparameters.

  1. First, install optuna and scikit-learn

     pip install optuna
     pip install scikit-learn
    
  2. Run the following code

    import optuna
    import sklearn.datasets
    import sklearn.ensemble
    import sklearn.model_selection
    import sklearn.svm
    
    # Load an example dataset
    data = sklearn.datasets.load_breast_cancer()
    X = data.data
    y = data.target
    
    # Define the objective function
    def objective(trial):
        classifier_name = trial.suggest_categorical("classifier", ["RandomForest", "SVC"])
    
        if classifier_name == "RandomForest":
            n_estimators = trial.suggest_int("n_estimators", 10, 100)
            max_depth = trial.suggest_int("max_depth", 2, 32, log=True)
            classifier_obj = sklearn.ensemble.RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth)
        else:
            # suggest_loguniform is deprecated; suggest_float(..., log=True) replaces it
            C = trial.suggest_float("C", 1e-10, 1e10, log=True)
            classifier_obj = sklearn.svm.SVC(C=C, gamma="auto")
    
        # Evaluate the model with cross-validation
        score = sklearn.model_selection.cross_val_score(classifier_obj, X, y, n_jobs=-1, cv=3)
        accuracy = score.mean()
    
        return accuracy
    
    # Create a study to run the hyperparameter search
    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=100)
    
    # Print the best result
    print("Best trial:")
    trial = study.best_trial
    
    print(f"  Accuracy: {trial.value}")
    print("  Best hyperparameters: ", trial.params)
    
    • We use optuna to run Bayesian optimization. Here we tune the hyperparameters of two model families: a random forest (RandomForest) and a support vector machine (SVC).
    • Each trial picks a model and samples a hyperparameter combination; over many trials the search converges on the set of parameters that maximizes the model's accuracy.
    • cross_val_score performs cross-validation to evaluate the model on each trial.
  3. Results

    [I 2024-09-16 10:35:01,297] A new study created in memory with name: no-name-9e42bdc7-c73e-4e82-80e7-af0461d45672
    [I 2024-09-16 10:35:01,844] Trial 0 finished with value: 0.6274204028589994 and parameters: {'classifier': 'SVC', 'C': 9.012114180566522}. Best is trial 0 with value: 0.6274204028589994.
    [I 2024-09-16 10:35:02,105] Trial 1 finished with value: 0.6274204028589994 and parameters: {'classifier': 'SVC', 'C': 62678.40611887763}. Best is trial 0 with value: 0.6274204028589994.
    [I 2024-09-16 10:35:02,486] Trial 2 finished with value: 0.9437668244685788 and parameters: {'classifier': 'RandomForest', 'n_estimators': 92, 'max_depth': 2}. Best is trial 2 with value: 0.9437668244685788.
    [I 2024-09-16 10:35:02,788] Trial 3 finished with value: 0.9525851666202544 and parameters: {'classifier': 'RandomForest', 'n_estimators': 16, 'max_depth': 3}. Best is trial 3 with value: 0.9525851666202544.
    [I 2024-09-16 10:35:03,057] Trial 4 finished with value: 0.6274204028589994 and parameters: {'classifier': 'SVC', 'C': 72965996.5695198}. Best is trial 3 with value: 0.9525851666202544.
    [I 2024-09-16 10:35:03,453] Trial 5 finished with value: 0.9560753736192332 and parameters: {'classifier': 'RandomForest', 'n_estimators': 58, 'max_depth': 14}. Best is trial 5 with value: 0.9560753736192332.
    ...
    ...
    ...
    [I 2024-09-16 10:35:12,654] Trial 95 finished with value: 0.9543024227234754 and parameters: {'classifier': 'RandomForest', 'n_estimators': 74, 'max_depth': 9}. Best is trial 44 with value: 0.9718926946997123.
    [I 2024-09-16 10:35:12,718] Trial 96 finished with value: 0.9525758841548315 and parameters: {'classifier': 'RandomForest', 'n_estimators': 19, 'max_depth': 15}. Best is trial 44 with value: 0.9718926946997123.
    [I 2024-09-16 10:35:12,829] Trial 97 finished with value: 0.9595841455490578 and parameters: {'classifier': 'RandomForest', 'n_estimators': 47, 'max_depth': 13}. Best is trial 44 with value: 0.9718926946997123.
    [I 2024-09-16 10:35:12,940] Trial 98 finished with value: 0.9560660911538105 and parameters: {'classifier': 'RandomForest', 'n_estimators': 53, 'max_depth': 11}. Best is trial 44 with value: 0.9718926946997123.
    [I 2024-09-16 10:35:13,021] Trial 99 finished with value: 0.959574863083635 and parameters: {'classifier': 'RandomForest', 'n_estimators': 29, 'max_depth': 10}. Best is trial 44 with value: 0.9718926946997123.
    Best trial:
      Accuracy: 0.9718926946997123
      Best hyperparameters:  {'classifier': 'RandomForest', 'n_estimators': 36, 'max_depth': 6}
    

    The best solution found was:

    Accuracy: 0.9718926946997123
    Best hyperparameters: {'classifier': 'RandomForest', 'n_estimators': 36, 'max_depth': 6}

Conclusion

AutoML reduces the manpower and time that hyperparameter tuning demands, and the efficiency of Bayesian optimization stands out especially in high-dimensional hyperparameter spaces. As these techniques mature, researchers and developers can focus on model architecture and the application itself rather than spending their time on manual tuning. Imagine an e-commerce platform using AutoML and Bayesian optimization to automatically tune its recommender system and lift click-through rate by 15%, or a financial institution automatically adjusting its risk model to cut its bad-debt rate by 10%. Advances like these let companies bring machine learning into real business use faster and create more value. We can expect many more fields to benefit from automated machine learning and to unlock even greater potential.
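Note that study.optimize only scores candidates via cross-validation; the best hyperparameters still need to be used to fit a final model on the training data. A minimal sketch of that last step (the parameter values are taken from this run's best trial, {'n_estimators': 36, 'max_depth': 6}; your search may find different ones, so in practice read them from study.best_trial.params):

```python
import sklearn.datasets
import sklearn.ensemble
import sklearn.model_selection

# Same dataset as in the search above, with a held-out test split
data = sklearn.datasets.load_breast_cancer()
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# Hyperparameters from the best trial; in practice: study.best_trial.params
best_params = {"n_estimators": 36, "max_depth": 6}

# Refit the winning model on the full training split
final_model = sklearn.ensemble.RandomForestClassifier(**best_params, random_state=42)
final_model.fit(X_train, y_train)

test_accuracy = final_model.score(X_test, y_test)
print(f"Held-out accuracy: {test_accuracy:.4f}")
```

Evaluating on a held-out split the optimizer never saw guards against the cross-validation score being slightly optimistic after 100 trials of selection.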


上一篇
[DAY 27]告別手動調參:AutoML 打造高效機器學習流程
下一篇
[DAY 29]Python API 教學:使用 Flask 和 ngrok 打造你的公開服務
系列文
30 天程式學習筆記:我的自學成長之路30
圖片
  直播研討會
圖片
{{ item.channelVendor }} {{ item.webinarstarted }} |
{{ formatDate(item.duration) }}
直播中

尚未有邦友留言

立即登入留言