Day 12: Judging and Analyzing Model Performance, and Random Forest

2022 iThome Ironman Contest

AI & Data

Linguistics and NLP series, part 12

14th Ironman Contest # performance metrics # confusion matrix # random forest # machine learning

cjom06991

Team KnULPers_from_NCCU

2022-09-27 14:45:56

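Remember the previous post, where we showed how to use a decision tree model for a simple classification task? This post briefly explains how to judge a model's performance and what each of the values in the Performance Metrics means!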

Performance Metrics & Confusion Matrix

Before looking at what each value means, we first need to see what a confusion matrix is.

Figure: confusion matrix

Image source: (http://rasbt.github.io/mlxtend/user_guide/evaluate/confusion_matrix/)

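The figure above is divided into 4 cells: TP, FN, FP, and TN. I will explain each of them using the classification task from the previous post.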

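True Positive: the case is actually positive and the model predicts positive (it really is a Bad Movie and the model predicts Bad Movie)
False Negative: the case is actually positive but the model predicts negative (it really is a Bad Movie but the model predicts Good Movie)
False Positive: the case is actually negative but the model predicts positive (it is not a Bad Movie but the model predicts Bad Movie)
True Negative: the case is actually negative and the model predicts negative (it is not a Bad Movie and the model does not predict Bad Movie)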

In other words, only the TP and TN cells (the red boxes) are the cases the model actually classified correctly.
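To make the four cells concrete, here is a minimal sketch in R that counts them by hand. The label vectors are made-up toy values (not the IMDB data), and treating "Bad" as the positive class is an assumption that simply mirrors the explanation above.

```r
# Hypothetical toy labels, not taken from the IMDB data
actual    <- c("Bad", "Bad",  "Bad", "Good", "Good", "Good", "Good", "Bad")
predicted <- c("Bad", "Good", "Bad", "Good", "Bad",  "Good", "Good", "Bad")

positive <- "Bad"  # assumption: "Bad" is the positive class, as in the text

tp <- sum(actual == positive & predicted == positive)  # actually Bad, predicted Bad
fn <- sum(actual == positive & predicted != positive)  # actually Bad, predicted Good
fp <- sum(actual != positive & predicted == positive)  # actually Good, predicted Bad
tn <- sum(actual != positive & predicted != positive)  # actually Good, predicted Good

c(TP = tp, FN = fn, FP = fp, TN = tn)
# TP = 3, FN = 1, FP = 1, TN = 3
```

The four counts always add up to the number of cases, and the correct predictions are exactly TP + TN.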

Now that we know what TP, FN, FP, and TN mean, we can take a closer look at the 4 values in Performance Metrics that are most commonly used to judge how good a model is!

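Accuracy: out of all cases, the proportion that the model classifies correctly (as positive or as negative).

$\text{Accuracy} = \frac{TP + TN}{TP + FN + FP + TN}$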

Precision: of the cases the model predicts as positive, how many really are positive.

$\text{Precision} = \frac{TP}{TP + FP}$

Recall: of the cases that really are positive, how many the model correctly identifies.

$\text{Recall} = \frac{TP}{TP + FN}$

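F1-score (F-score): also called the balanced F-score, it is the harmonic mean of precision and recall.

$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$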
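As a quick worked example, the four metrics can be computed directly from the cell counts. The counts below are hypothetical numbers chosen only to make the arithmetic easy to follow; they are not results from the movie data.

```r
# Hypothetical cell counts (100 cases in total), for illustration only
tp <- 30; fn <- 10; fp <- 20; tn <- 40

accuracy  <- (tp + tn) / (tp + fn + fp + tn)  # 70 / 100 = 0.7
precision <- tp / (tp + fp)                   # 30 / 50  = 0.6
recall    <- tp / (tp + fn)                   # 30 / 40  = 0.75
f1        <- 2 * precision * recall / (precision + recall)  # 0.9 / 1.35, about 0.667

round(c(accuracy = accuracy, precision = precision, recall = recall, f1 = f1), 3)
```

Note that the F1-score (about 0.667) sits between precision and recall but closer to the lower of the two, which is what a harmonic mean does.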

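Now that you understand what these numbers mean, doesn't model performance analysis make a bit more sense? Next, we will reuse yesterday's data and switch to a Random Forest model for the classification task!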

Random Forest

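Random Forest is built on decision trees. Simply put, instead of growing a single tree, it grows a great many of them (just keep planting trees until you have a forest??), each trained on a random bootstrap sample of the training data and considering a random subset of the features at each split. For a classification task, the forest returns the majority vote of all its trees, and the error rate usually levels off once enough trees have been grown. The walkthrough below shows how to do this on the movie data.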
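Before the full walkthrough on the movie data, the idea of "many trees voting" can be sketched on its own. This is only an illustration under stated assumptions: it uses the `rpart` package and the built-in `iris` data (neither appears in this post), and it shows plain bagging with a majority vote. A real random forest additionally samples a random subset of features at each split, which this sketch does not do.

```r
library(rpart)  # assumption: rpart is used here only to grow the individual trees

set.seed(1)
n_trees <- 25

# Grow each tree on a bootstrap sample (rows drawn with replacement)
trees <- lapply(seq_len(n_trees), function(i) {
  idx <- sample(nrow(iris), replace = TRUE)
  rpart(Species ~ ., data = iris[idx, ], method = "class")
})

# One column of predicted labels per tree
votes <- sapply(trees, function(tree) {
  as.character(predict(tree, iris, type = "class"))
})

# Majority vote across trees for every row
majority <- apply(votes, 1, function(v) names(which.max(table(v))))

mean(majority == iris$Species)  # share of rows the vote gets right (on the training data)
```

The accuracy printed here is measured on the same rows the trees were trained on, so it is optimistic; the walkthrough uses a separate test set for that reason.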



library(tm)
library(tokenizers)
library(caret) 

# This part is the same preprocessing as yesterday

movie <- read.csv("IMDB-Movie-Data.csv") # read the data

movie <- movie[!is.na(movie$Metascore), ]
movie <- movie[!is.na(movie$Revenue..Millions.), ] # drop rows with NA


# Compare Rating as a number; comparing against the string '6.0' would compare text instead
movie$Comments <- ifelse(movie$Rating >= 6.0, "Good movie", "Bad movie") # first feature (used as the label)

g_type1 <- gsub("\\,\\w+", "", movie$Genre)
g_type2 <- gsub("\\-\\w+", "", g_type1)
g_type3 <- gsub("\\s", "", g_type2)
movie$Genre_type <- g_type3 # extract the movie genre (only the first listed genre is kept)

g_clean <- sapply(strsplit(as.character(movie$Genre), ","), length)
movie$Genre <- g_clean # replace Genre with the number of genres listed for the movie; this is the second feature


# Third feature
d1 <- grepl('family|parents|daughter|son|children|wife|husband', movie$Description)
movie$Description_family <- d1 # family-related feature

d2 <- grepl('lonely|sad|depressed|fear|hopeless|grief|isolated|lifeless|lost|frustrated|terrible', movie$Description)
movie$Description_ng <- d2 # negative sentiment feature



# Features + Model Training

set.seed(53) # an arbitrary seed
shuffle_index <- sample(1:nrow(movie))
movie <- movie[shuffle_index, ] # shuffle the rows

movie <- subset(movie, select = -c(Rank, Title, Actors, Director, Year, Runtime..Minutes., Rating, Description)) # drop the columns that should not go into training

movie$Genre <- as.integer(movie$Genre) # the genre count stays numeric
movie$Description_family <- as.factor(movie$Description_family)
movie$Description_ng <- as.factor(movie$Description_ng)
movie$Comments <- as.factor(movie$Comments)
movie$Genre_type <- as.factor(movie$Genre_type) # convert the non-numeric features used in training to factors so the model can handle them

set.seed(111)

# Split the data into a training set and a test set at 8:2: 80% of the data trains the model, 20% tests its performance

trainIndex <- createDataPartition(movie$Comments, p = 0.8, list = FALSE) # the Comments column is the ground truth

train_set <- movie[trainIndex, ]
test_set <- movie[-trainIndex, ]

prop.table(table(train_set$Comments))
prop.table(table(test_set$Comments)) # check the class proportions in the train set and the test set after the split

Next, load the randomForest package and use it to train the model.


library(randomForest)

set.seed(1117)

randomforest <- randomForest(Comments ~ ., data = train_set, importance = TRUE, proximity = TRUE, do.trace = 100)

plot(randomforest) # plot the error rates against the number of trees

round(importance(randomforest), 2) # see how important each training feature is to the random forest's classification

The output is:


predict_labelsx <- predict(randomforest, test_set, type = 'response') # predicted class for each test row

table_matrix <- confusionMatrix(predict_labelsx, test_set$Comments, mode = 'prec_recall')

table_matrix # take a look at the confusion matrix

The output is:

Figure: rfr
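If you want the individual numbers rather than the printed summary, they can be pulled out of the object that `confusionMatrix()` returns. The sketch below uses made-up toy labels instead of the movie data, so the values are illustrative only. It relies on caret treating the first factor level ("Bad movie" here) as the positive class by default.

```r
library(caret)

# Hypothetical toy labels; "Bad movie" is the first level, so it is the positive class
lv <- c("Bad movie", "Good movie")
reference <- factor(c("Bad movie", "Bad movie", "Bad movie", "Good movie",
                      "Good movie", "Good movie", "Good movie", "Bad movie"), levels = lv)
predicted <- factor(c("Bad movie", "Good movie", "Bad movie", "Good movie",
                      "Bad movie", "Good movie", "Good movie", "Bad movie"), levels = lv)

cm <- confusionMatrix(predicted, reference, mode = "prec_recall")

cm$table                                    # the 2 x 2 counts (rows: prediction, columns: reference)
cm$overall["Accuracy"]                      # overall accuracy
cm$byClass[c("Precision", "Recall", "F1")]  # metrics for the positive class
```

With these toy labels there are 3 TP, 1 FN, 1 FP, and 3 TN, so accuracy, precision, recall, and F1 all come out to 0.75.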