Tree Contestant No. 1: Decision Tree [Python Example]

2021 iThome Ironman Contest

DAY 3

AI & Data

Python Machine Learning Lab ʘ ͜ʖ ʘ series, part 3

13th Ironman Contest, decisiontree, python3

nancysunnn

2021-09-17 12:05:01

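Today we use a decision tree to predict whether a tumor is malignant or benign. I skip the earlier data preprocessing and train/test split here and start directly from applying the model. If you are interested in the full analysis, the Kaggle link is in the references below.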
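Since the preprocessing and split are skipped, here is a minimal sketch of one way to obtain x_train, y_train, x_test and y_test so the later snippets can run. It is not the notebook's actual preprocessing: it assumes scikit-learn's bundled breast cancer data as a stand-in for the Kaggle CSV, and the 70/30 split, the stratification and random_state=42 are illustrative choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Assumption: the bundled dataset stands in for the Kaggle CSV used in the post.
data = load_breast_cancer()
x = data.data

# The bundled data encodes malignant as 0 and benign as 1. Flipping it makes
# malignant the positive class (1); this labelling is an assumed choice.
y = 1 - data.target

# Hold out 30% of the rows for testing, keeping the class ratio in both splits.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, stratify=y, random_state=42
)

print("train shape:", x_train.shape)
print("test shape:", x_test.shape)
```

Making malignant the positive class matters later, because precision and recall are reported for the class labelled 1.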

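I started by writing a score function that runs the model's predictions on either the training set or the test set, then prints several metrics (accuracy, precision, recall, F1) together with the confusion matrix so we can judge how good the model is.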

from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, f1_score

def score(m, x_train, y_train, x_test, y_test, train=True):
    """Print evaluation metrics for a fitted model m on the training or test split."""
    # train=True evaluates on the training split, anything else on the test split.
    if train:
        x, y, title = x_train, y_train, 'Train Result:\n'
    else:
        x, y, title = x_test, y_test, 'Test Result:\n'
    pred = m.predict(x)
    print(title)
    # Precision, recall and F1 use scikit-learn's default pos_label=1,
    # so the labels are assumed to be encoded as 0/1.
    print(f"Accuracy Score: {accuracy_score(y, pred)*100:.2f}%")
    print(f"Precision Score: {precision_score(y, pred)*100:.2f}%")
    print(f"Recall Score: {recall_score(y, pred)*100:.2f}%")
    print(f"F1 score: {f1_score(y, pred)*100:.2f}%")
    print(f"Confusion Matrix:\n {confusion_matrix(y, pred)}")
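To make the link between the confusion matrix and the other scores concrete, here is a small sketch on made-up labels. The y_true and y_pred lists are purely illustrative and do not come from the tumor data. It relies on scikit-learn's layout for binary labels, where rows are true classes and columns are predicted classes.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Illustrative labels only: 1 is the positive class, 0 the negative class.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary 0/1 labels the matrix is [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # of the predicted positives, how many are right
recall = tp / (tp + fn)      # of the true positives, how many were found
accuracy = (tp + tn) / (tp + tn + fp + fn)

print("TN, FP, FN, TP:", tn, fp, fn, tp)
print("precision:", precision, "recall:", recall, "accuracy:", accuracy)

# The hand-computed values agree with scikit-learn's own functions.
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred))
```

For a tumor classifier, recall on the malignant class is usually the number to watch, since a false negative means a missed malignant case.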

First we build a tree that uses the default value for every parameter, then look at the training result.

from sklearn import tree

tree1 = tree.DecisionTreeClassifier()
tree1 = tree1.fit(x_train, y_train)
score(tree1, x_train, y_train, x_test, y_test, train=True)

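Surprisingly, the predictions are 100% correct on the training set, but test performance is what we really care about. Looking at the test results next, the scores suggest the model is overfitting to some degree.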

score(tree1, x_train, y_train, x_test, y_test, train=False)

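How do we deal with decision tree overfitting? Mainly by constraining the tree through its parameters:
1. max_depth (default=None): limits the maximum depth of the tree; a very commonly used parameter
2. min_samples_split (default=2): the minimum number of samples an internal node must contain before it can be split (that is, before it can ask another yes/no question)
3. min_samples_leaf (default=1): the minimum number of samples each child node must contain after a split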
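The effect of these three parameters can be seen by comparing the size of the fitted trees. The sketch below is only an illustration: it assumes scikit-learn's bundled breast cancer data rather than the Kaggle file, and the values 3, 20 and 10 are arbitrary picks, not recommendations.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

# Assumption: bundled dataset used as a stand-in for the post's data.
x, y = load_breast_cancer(return_X_y=True)

# One unconstrained tree plus one tree per constraint (illustrative values).
settings = {
    "default": {},
    "max_depth=3": {"max_depth": 3},
    "min_samples_split=20": {"min_samples_split": 20},
    "min_samples_leaf=10": {"min_samples_leaf": 10},
}

sizes = {}
for name, params in settings.items():
    clf = DecisionTreeClassifier(random_state=0, **params).fit(x, y)
    # get_depth() and get_n_leaves() describe how large the fitted tree grew.
    sizes[name] = (clf.get_depth(), clf.get_n_leaves())
    print(f"{name:>22}: depth={sizes[name][0]}, leaves={sizes[name][1]}")
```

A smaller tree has fewer ways to memorize the training set, which is why these constraints tend to reduce overfitting.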

Let's use a loop to pick the most suitable max_depth:

import numpy as np
import pandas as pd

# Decide the tree depth: try each max_depth and record the test metrics.
depth_list = list(range(2, 15))
depth_tuning = np.zeros((len(depth_list), 4))
depth_tuning[:, 0] = depth_list

for index, depth in enumerate(depth_list):
    mytree = tree.DecisionTreeClassifier(max_depth=depth)
    mytree.fit(x_train, y_train)
    pred_test_Y = mytree.predict(x_test)
    depth_tuning[index, 1] = accuracy_score(y_test, pred_test_Y)
    depth_tuning[index, 2] = precision_score(y_test, pred_test_Y)
    depth_tuning[index, 3] = recall_score(y_test, pred_test_Y)

col_names = ['Max_Depth', 'Accuracy', 'Precision', 'Recall']
print(pd.DataFrame(depth_tuning, columns=col_names))

The loop's results show that max_depth=3 already performs well, so let's build a new tree with it:

tree2 = tree.DecisionTreeClassifier(max_depth=3)
tree2 = tree2.fit(x_train,y_train)
score(tree2, x_train, y_train, x_test, y_test, train=True)

score(tree2, x_train, y_train, x_test, y_test, train=False)

Compared with the original tree, setting max_depth=3 gives better prediction results. Tuning succeeded!
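The depth loop in this post picks max_depth by comparing scores on the test set. A common variation, sketched here as an alternative and not as what the post's notebook does, is to choose the depth with cross-validation on the training set only, so the test set stays untouched until the final check. The bundled dataset, the split, cv=5 and both random_state values are assumptions made for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: bundled dataset and a 70/30 stratified split as a stand-in.
x, y = load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, stratify=y, random_state=42
)

# Try the same depth range as the loop, scoring each depth by 5-fold
# cross-validated accuracy computed on the training data only.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": list(range(2, 15))},
    scoring="accuracy",
    cv=5,
)
search.fit(x_train, y_train)

best_depth = search.best_params_["max_depth"]
test_accuracy = search.score(x_test, y_test)

print("best max_depth:", best_depth)
print(f"cross-validated accuracy: {search.best_score_:.3f}")
print(f"test accuracy: {test_accuracy:.3f}")
```

The depth chosen this way may differ from 3, since it depends on the data, the split and the random seed.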

References:
https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
http://www.taroballz.com/2019/05/15/ML_decision_tree_detail/
https://www.kaggle.com/nancysunxx/breast-cancer-prediction
