Tree Contender No. 2: random forest [Python example]

2021 iThome Ironman Contest

DAY 5

AI & Data

Python Machine Learning Lab ʘ ͜ʖ ʘ series, part 5

13th Ironman Contest random forest python3

nancysunnn

2021-09-19 02:19:01

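Today we'll run a random forest on the benign-versus-malignant tumor example from the past few days. As before, we first define a score function so that different models are easy to compare later: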

from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, f1_score

def score(m, x_train, y_train, x_test, y_test, train=True):
    if train:
        pred=m.predict(x_train)
        print('Train Result:\n')
        print(f"Accuracy Score: {accuracy_score(y_train, pred)*100:.2f}%")
        print(f"Precision Score: {precision_score(y_train, pred)*100:.2f}%")
        print(f"Recall Score: {recall_score(y_train, pred)*100:.2f}%")
        print(f"F1 score: {f1_score(y_train, pred)*100:.2f}%")
        print(f"Confusion Matrix:\n {confusion_matrix(y_train, pred)}")
    else:
        pred=m.predict(x_test)
        print('Test Result:\n')
        print(f"Accuracy Score: {accuracy_score(y_test, pred)*100:.2f}%")
        print(f"Precision Score: {precision_score(y_test, pred)*100:.2f}%")
        print(f"Recall Score: {recall_score(y_test, pred)*100:.2f}%")
        print(f"F1 score: {f1_score(y_test, pred)*100:.2f}%")
        print(f"Confusion Matrix:\n {confusion_matrix(y_test, pred)}")
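
To make the four numbers printed by score more concrete, here is a minimal sketch that computes them by hand from the counts in a binary confusion matrix. The helper name `metrics_from_counts` and the counts themselves are made up for illustration; they are not results from the tumor dataset.

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from binary confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # of everything predicted positive, how much was right
    recall = tp / (tp + fn)      # of everything actually positive, how much was found
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical counts, chosen only for illustration
acc, prec, rec, f1 = metrics_from_counts(tp=40, fp=10, fn=5, tn=45)
print(f"Accuracy: {acc*100:.2f}%  Precision: {prec*100:.2f}%  "
      f"Recall: {rec*100:.2f}%  F1: {f1*100:.2f}%")
```

For binary labels 0 and 1, scikit-learn's confusion_matrix lays the counts out as [[TN, FP], [FN, TP]], so the same arithmetic can be checked against the matrix that score prints.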

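The important parameters of a random forest model include:

n_estimators: how many trees to grow
max_features: the number of features considered when looking for the best split; pass a count or one of “auto”, “sqrt”, “log2”
- “auto” >> max_features=sqrt(n_features); deprecated in scikit-learn 1.1 and removed in 1.3, so use “sqrt” on current versions.
- “sqrt” >> max_features=sqrt(n_features) (same as “auto”).
- “log2” >> max_features=log2(n_features).
max_depth (default=None): limits the maximum depth of each tree; a very commonly used parameter
min_samples_split (default=2): the minimum number of samples an internal node must contain before it can be split (that is, before it asks a yes/no question)
min_samples_leaf (default=1): the minimum number of samples each child node must contain after a split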
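
As a rough illustration of what the max_features settings mean in practice, the sketch below converts each setting into a number of features per split. The helper `features_per_split` and the value n_features = 30 are assumptions made for this example, and the helper is a simplification (it ignores the fractional form that scikit-learn also accepts).

```python
import math

def features_per_split(n_features, max_features):
    """Sketch: how many features are considered at each split for a given setting."""
    if max_features == "sqrt":
        return max(1, int(math.sqrt(n_features)))
    if max_features == "log2":
        return max(1, int(math.log2(n_features)))
    if max_features is None:
        return n_features          # consider every feature
    return int(max_features)       # an explicit count

n_features = 30  # hypothetical number of columns in the training data
for setting in ("sqrt", "log2", None, 10):
    print(setting, "->", features_per_split(n_features, setting))
```

Smaller values make the individual trees more different from one another, which is the point of the "random" in random forest.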

Next, let's build the simplest possible random forest and look at its results on the test set:

from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=1000, random_state= 42)
forest = forest.fit(x_train,y_train)
score(forest, x_train, y_train, x_test, y_test, train=False)
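
With the default bootstrap=True, each tree in the forest is trained on a bootstrap sample: rows drawn from the training set with replacement. The following is a minimal standard-library sketch of that sampling step only, not of scikit-learn's internals; the helper name, the seed, and n_rows = 1000 are arbitrary choices for illustration.

```python
import random

def bootstrap_indices(n_rows, rng):
    """Draw n_rows row indices with replacement, as one tree's training sample."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]

rng = random.Random(42)   # fixed seed so the sketch is repeatable
n_rows = 1000             # hypothetical training-set size
sample = bootstrap_indices(n_rows, rng)
unique_fraction = len(set(sample)) / n_rows
print(f"{unique_fraction:.1%} of the rows appear at least once in this sample")
```

On average about 63% of the rows (1 - 1/e) end up in any one sample, so every tree sees a slightly different view of the data.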

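Next, let's try tuning. We use cross-validation to find the best parameter combination, via RandomizedSearchCV: you give each parameter you want to tune a set of candidate values, parameter combinations are drawn at random from those candidates, each one is evaluated by cross-validation, and the best combination is returned. The important parameters of RandomizedSearchCV are: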

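n_iter: how many parameter combinations to try
cv: the number of cross-validation folds
Larger values improve the chance of finding a better combination, but running time grows with them. The classic machine learning dilemma is performance vs. time!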
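
To show what the cv setting does, here is a minimal sketch of a plain k-fold split written with the standard library. The helper `k_fold_indices` is hypothetical, and it is a simplification: for classifiers, scikit-learn uses a stratified split when cv is an integer, which this sketch does not attempt.

```python
def k_fold_indices(n_rows, k):
    """Split row indices 0..n_rows-1 into k contiguous folds; yield (train, validation) pairs."""
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = [i for i in range(n_rows) if i < start or i >= start + size]
        yield train, validation
        start += size

# With cv=3, every candidate is trained 3 times, each time holding out one fold
for train_idx, val_idx in k_fold_indices(n_rows=10, k=3):
    print("train:", train_idx, "validation:", val_idx)
```

Each row is used for validation exactly once, and a candidate's score is the average over the folds.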

from sklearn.model_selection import RandomizedSearchCV
import numpy as np

# Define the candidate values for each parameter
n_estimators = [int(x) for x in np.linspace(start=200, stop=2000, num=10)]
max_features = ['sqrt', 'log2']  # 'auto' was an alias of 'sqrt' for classifiers and was removed in scikit-learn 1.3
max_depth = [int(x) for x in np.linspace(10, 110, num=11)]
max_depth.append(None)
min_samples_split = [2, 5, 10]
min_samples_leaf = [1, 2, 4]
bootstrap = [True, False]

random_grid = {'n_estimators': n_estimators, 'max_features': max_features,
               'max_depth': max_depth, 'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf, 'bootstrap': bootstrap}
random_grid

forest2 = RandomForestClassifier(random_state=42)
rf_random = RandomizedSearchCV(estimator = forest2, param_distributions=random_grid,
                              n_iter=100, cv=3, verbose=2, random_state=42, n_jobs=-1)

rf_random.fit(x_train,y_train)
rf_random.best_params_
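
A quick back-of-the-envelope calculation shows why a random search is used here rather than trying every combination. The sketch below only counts model fits, using the number of candidates per parameter in random_grid and the n_iter and cv values passed above.

```python
import math

# Number of candidate values per parameter in random_grid
grid_sizes = {
    "n_estimators": 10,
    "max_features": 2,
    "max_depth": 12,          # 11 depths plus None
    "min_samples_split": 3,
    "min_samples_leaf": 3,
    "bootstrap": 2,
}
total_combinations = math.prod(grid_sizes.values())
n_iter, cv = 100, 3

print("all combinations:", total_combinations)
print("fits for an exhaustive search:", total_combinations * cv)
print("fits for the random search:", n_iter * cv)
```

With the default refit=True, the search also trains one more model on the whole training set using the best combination found.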

Next, build the final model with the returned parameter combination!

forest3 = RandomForestClassifier(bootstrap=True,
                                 max_depth=20, 
                                 max_features='sqrt', 
                                 min_samples_leaf=2, 
                                 min_samples_split=2,
                                 n_estimators=1200)
forest3 = forest3.fit(x_train, y_train)
score(forest3, x_train, y_train, x_test, y_test, train=False)