iT邦幫忙

2022 iThome Ironman Contest (鐵人賽), AI & Data

Linguistics and NLP series, Part 17

Day 17: Logistic Regression in Practice

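Having introduced logistic regression yesterday, today we naturally put it into practice! As before, the walkthrough has two parts: the first half uses Python, and the second half is a simple exercise in R. Without further ado, let's get started!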

Implementing Logistic Regression in Python

In this post we build a diabetes prediction model, using a logistic regression classifier to predict diabetes. We first download the Pima Indian Diabetes dataset from Kaggle (https://www.kaggle.com/uciml/pima-indians-diabetes-database) and then read it with pandas.


# import pandas
import pandas as pd

col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']

# load dataset; header=0 replaces the CSV's own header row with col_names,
# so no row has to be dropped afterwards and every column stays numeric
pima = pd.read_csv("/content/diabetes.csv", header=0, names=col_names)
pima.head()
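
Why the header argument matters can be seen on a tiny in-memory CSV. The sketch below uses made-up values and a hypothetical two-column file, not the real dataset: when a file already has a header row, header=None keeps that row as data and leaves the columns as strings, while header=0 replaces it with the new names.

```python
import io
import pandas as pd

# Made-up two-row CSV that, like diabetes.csv, starts with its own header row.
csv_text = "Glucose,Outcome\n148,1\n85,0\n"
col_names = ["glucose", "label"]

# header=None: the header row is read as data, so the columns hold strings.
raw = pd.read_csv(io.StringIO(csv_text), header=None, names=col_names)

# header=0: the header row is replaced by col_names and values stay numeric.
clean = pd.read_csv(io.StringIO(csv_text), header=0, names=col_names)

print(len(raw), pd.api.types.is_numeric_dtype(raw["glucose"]))
print(len(clean), pd.api.types.is_numeric_dtype(clean["glucose"]))
```

Dropping the first row after a header=None read removes the stray header, but the remaining values are still strings, which is why reading with header=0 is the tidier option.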

The output is:

[Image: db]

Next, we split the columns into two kinds: the dependent (target) variable and the independent (feature) variables.


#split dataset in features and target variable

feature_cols = ['pregnant', 'insulin', 'bmi', 'age','glucose','bp','pedigree']
X = pima[feature_cols] # Features
y = pima.label # Target variable

Split the data into a train set and a test set. random_state plays roughly the same role as set.seed in R (a random seed).


# split X and y into training and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=16)
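
The role of the seed can be illustrated without scikit-learn. The following is only a sketch of the idea, using Python's standard random module and a hypothetical helper named split_indices; it is not how train_test_split is implemented internally. The point is that the same seed gives the same shuffle, and therefore the same split.

```python
import random

def split_indices(n_rows, test_size, seed):
    """Shuffle row indices with a seeded generator and cut off a test portion."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_size)
    return indices[n_test:], indices[:n_test]

train_a, test_a = split_indices(10, test_size=0.2, seed=16)
train_b, test_b = split_indices(10, test_size=0.2, seed=16)

print(test_a == test_b)            # same seed, same test rows
print(len(train_a), len(test_a))   # sizes of the two parts
```

Changing the seed would generally give a different split, which is why fixing random_state makes results reproducible between runs.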

Build the model


# import the logistic regression
from sklearn.linear_model import LogisticRegression

# instantiate the model (using the default parameters)
logreg = LogisticRegression(random_state=16)

# fit the model with data
logreg.fit(X_train, y_train)

# predict the labels of the test set
y_pred = logreg.predict(X_test)
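
What a fitted binary logistic regression does at prediction time can be sketched in a few lines: it takes a weighted sum of the features, passes it through the sigmoid function to get a probability, and assigns class 1 when that probability is above 0.5. The weights, intercept, and feature values below are hypothetical numbers chosen for illustration, not coefficients learned from the diabetes data.

```python
import math

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(features, weights, intercept, threshold=0.5):
    """Return (probability of class 1, predicted class) for one sample."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    probability = sigmoid(z)
    return probability, int(probability > threshold)

# Hypothetical coefficients and one hypothetical sample with two features.
weights = [0.5, -0.25]
intercept = -1.0
prob, label = predict_label([4.0, 2.0], weights, intercept)

print(round(prob, 4), label)
```

Here the weighted sum is 0.5, so the probability lands above 0.5 and the sample is assigned to class 1.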

Build the confusion matrix


# import the metrics 
from sklearn import metrics

cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
cnf_matrix

The output is:

array([[116, 9],
[ 26, 41]])
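
To read this matrix: scikit-learn puts the actual classes on the rows and the predicted classes on the columns, so the four cells are true negatives, false positives, false negatives, and true positives. A short sketch that unpacks the numbers reported above and derives the accuracy from them:

```python
# Confusion matrix as reported above: rows are actual, columns are predicted.
cnf_matrix = [[116, 9],
              [26, 41]]

(tn, fp), (fn, tp) = cnf_matrix

total = tn + fp + fn + tp
accuracy = (tn + tp) / total  # share of correct predictions

print(total)
print(round(accuracy, 4))
```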

Visualize the confusion matrix as a heat map


# import required modules
import matplotlib.pyplot as plt
import seaborn as sns

class_names = [0, 1]  # names of classes
fig, ax = plt.subplots()

# create heatmap; the tick labels are passed to heatmap directly,
# because ticks set before this call would be overwritten by it
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu", fmt='g',
            xticklabels=class_names, yticklabels=class_names, ax=ax)
ax.xaxis.set_label_position("top")
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
plt.tight_layout()
plt.show()

The output is:

[Image: hm]

Next, look at the model's overall performance metrics (accuracy, precision, recall, and F1-score).


from sklearn.metrics import classification_report
target_names = ['without diabetes', 'with diabetes']
print(classification_report(y_test, y_pred, target_names=target_names))
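
The per-class numbers in such a report follow directly from the confusion matrix counts. As a minimal sketch, assuming the counts from the matrix shown earlier (116, 9, 26, 41) and a hypothetical helper named class_metrics, precision, recall, and F1 for each class can be computed by hand:

```python
def class_metrics(tp, fp, fn):
    """Precision, recall and F1 for the class treated as positive."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts taken from the confusion matrix above (class 1 = with diabetes).
tn, fp, fn, tp = 116, 9, 26, 41

with_diabetes = class_metrics(tp=tp, fp=fp, fn=fn)
# For class 0 the roles swap: its "true positives" are the true negatives.
without_diabetes = class_metrics(tp=tn, fp=fn, fn=fp)

print([round(v, 2) for v in without_diabetes])
print([round(v, 2) for v in with_diabetes])
```

Recall for the "with diabetes" class is noticeably lower than its precision, because 26 of the 67 actual positive cases were predicted as negative.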

The output is:

[Image: pmp]

Implementing Logistic Regression in R



library(text2vec)
library(data.table)
library(magrittr)
data = read.csv("/Users/biaoyun/Documents/Ithome/diabetes.csv")

head(data)

The output is:

[Image: dbr]

Split into train & test sets


library(caret)

set.seed(16)

trainIndex <- createDataPartition(data$Outcome, p=0.8, list=FALSE)
train_set <- data[trainIndex,]
test_set <- data[-trainIndex,]



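library(glmnet)

NFOLDS = 10 # k-fold cross-validation

# the predictor matrices must exclude the Outcome column,
# otherwise the model is handed the label it is supposed to predict
x_train = as.matrix(train_set[, names(train_set) != "Outcome"])
x_test = as.matrix(test_set[, names(test_set) != "Outcome"])

glmnet_classifier = cv.glmnet(x_train, y = train_set$Outcome,
                 family = 'binomial',
                 alpha = 1,
                 type.measure = "auc",
                 nfolds = NFOLDS,
                 thresh = 1e-3,
                 maxit = 1e3)


preds = predict(glmnet_classifier, x_test, type = 'response')[,1]
glmnet:::auc(test_set$Outcome, preds) # AUC (not accuracy) on the test set


If the Outcome column is left in the predictor matrix (for example by passing as.matrix(train_set) and as.matrix(test_set)), the label leaks into the features and the result is a perfect AUC:

[1] 1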
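
An AUC value can be read as the probability that a randomly chosen positive case gets a higher score than a randomly chosen negative one, so a value of 1 means the scores separate the two classes perfectly. The pure-Python sketch below computes it by comparing every positive/negative pair; the labels and scores are made-up illustration values, and this pairwise version is only meant to show the definition, not to replace a library routine.

```python
def auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

labels = [0, 0, 1, 1]
imperfect = auc(labels, [0.1, 0.4, 0.35, 0.8])  # one pair is ranked wrongly
perfect = auc(labels, [0.1, 0.2, 0.8, 0.9])     # every positive outranks every negative

print(imperfect, perfect)
```

Unlike accuracy, AUC does not depend on any particular cutoff: it only looks at how the scores are ordered.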

Confusion matrix & Performance metrics


# assign class 1 when the predicted probability exceeds 0.37, otherwise class 0
assigner <- function(prediction){
  ifelse(prediction > 0.37, 1, 0)
}

confusionMatrix(as.factor(assigner(preds)), as.factor(test_set$Outcome))
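
The effect of the 0.37 cutoff used by assigner can be sketched in Python as well. The probabilities below are hypothetical values, and assign_classes is an illustrative name rather than part of any library; the example simply compares the 0.37 cutoff with the more common 0.5.

```python
def assign_classes(probabilities, threshold=0.37):
    """Label a case 1 when its probability is strictly above the threshold."""
    return [1 if p > threshold else 0 for p in probabilities]

# Hypothetical predicted probabilities for five cases.
example_probs = [0.10, 0.37, 0.40, 0.55, 0.90]

at_037 = assign_classes(example_probs)        # cutoff used in the R code
at_050 = assign_classes(example_probs, 0.5)   # the usual default cutoff

print(at_037)
print(at_050)
```

Lowering the cutoff turns more borderline cases into positive predictions, which tends to raise recall for the positive class at the cost of more false positives.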


The output is:

[Image: pmr]

Thanks for reading today as well xd. See you tomorrow!

