Day 20: Linear Regression and Logistic Regression

2021 iThome 鐵人賽 (Ironman Contest), Self-Challenge Group, Day 20

Python資料分析學習地圖 (Python Data Analysis Learning Map) series, part 20

皮卡丘打排球

2021-10-02 00:03:57


Linear Regression

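Suppose we have data (x, y), where x is years of experience and y is salary, and we want to find the relationship between them, that is, the weight w and the bias b in y = w * x + b.

From this data we can draw a line that describes it.

That line is learned by finding the w and b that minimize the error against the training data.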
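To make "finding the line with the smallest error" concrete, here is a minimal sketch that computes w and b with the closed-form least-squares formulas. The toy numbers are hypothetical and chosen so the answer is easy to check by hand; this is one standard way to minimize the squared error, not necessarily the exact procedure a library uses internally.

```python
import numpy as np

# Hypothetical toy data: years of experience (x) and salary (y).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([32.0, 34.0, 36.0, 38.0, 40.0])  # lies exactly on y = 2 * x + 30

# Closed-form least-squares estimates for y = w * x + b.
x_mean, y_mean = x.mean(), y.mean()
w = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
b = y_mean - w * x_mean

# Mean squared error of the fitted line on the same data.
mse = np.mean((y - (w * x + b)) ** 2)
print("w =", w, "b =", b, "MSE =", mse)
```

Because the toy points lie exactly on a line, the recovered slope and intercept match that line and the error is zero; with noisy data the same formulas give the best-fitting line in the squared-error sense.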

Generating the data

import numpy as np
import matplotlib.pyplot as plt

# Generate random data ourselves
np.random.seed(0)
noise = np.random.rand(100, 1)
x = np.random.rand(100, 1)
y = 8 * x + 100 + noise
# plot
plt.scatter(x, y, s=10)
plt.xlabel('x')
plt.ylabel('y')
plt.show()

Building the Linear Regression model

from sklearn.linear_model import LinearRegression
# Create the model
linearMmodel = LinearRegression(fit_intercept=True)
# Fit the model on the training data
linearMmodel.fit(x, y)
# Predict on the training data
predicted = linearMmodel.predict(x)

from sklearn import metrics
print('R2 score: ', linearMmodel.score(x, y))
mse = metrics.mean_squared_error(y, predicted)
print('MSE score: ', mse)
>>> R2 score:  0.9831081424561687
    MSE score:  0.08275457812228725
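For readers who want to see what these two scores measure, the following sketch computes MSE and R2 directly from their definitions. The helper name `mse_and_r2` and the sample numbers are assumptions made for illustration only.

```python
import numpy as np

def mse_and_r2(y_true, y_pred):
    """Return (MSE, R2) computed from their textbook definitions."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    residual_ss = np.sum((y_true - y_pred) ** 2)       # unexplained variation
    total_ss = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    mse = residual_ss / y_true.size
    r2 = 1.0 - residual_ss / total_ss
    return mse, r2

# Hypothetical true values and predictions.
mse_value, r2_value = mse_and_r2([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
print("MSE:", mse_value, "R2:", r2_value)
```

An R2 close to 1 means the predictions explain almost all of the variation in y, while MSE reports the average squared error in the units of y squared.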

What the model's predictions look like

plt.scatter(x, y, s=10, label='True')
plt.scatter(x, predicted, color="r",s=10, label='Predicted')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.show()

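# The slope and intercept are stored in linearMmodel.coef_[0] and linearMmodel.intercept_
coef = linearMmodel.coef_
intercept = linearMmodel.intercept_

print("slope w = ", coef[0][0])
print("intercept b = ", intercept[0])
>>> slope w =  7.931123354540897
    intercept b =  100.50916633941445

The fitted intercept is close to 100.5 rather than 100 because the noise drawn with np.random.rand is uniform on [0, 1) and therefore has a mean of 0.5.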

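Logistic Regression

Although we are discussing regression models here, one point must be clear: logistic regression is used for classification problems.

The following table summarizes the difference.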

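Model	Predicted output	Use case	Formula
Linear regression	A numeric value	Predicting numeric quantities, for example a price index	$\sum_i w_i x_i + b$
Logistic regression	A probability between 0 and 1, or a boolean	Binary classification, for example the probability that a smoker develops cancer, or credit card scoring models	$\sigma(\sum_i w_i x_i + b)$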

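How it works

Before introducing the logistic regression formula, we need to understand the "odds" (sometimes loosely called the odds ratio), which compares how likely a specific event is to occur with how likely it is not to occur.

The formula is:

$$\text{odds} = \frac{P}{1 - P}$$

P is the probability that the "positive event" occurs. A positive event is not necessarily a good thing; it is simply the event we want to predict, such as the occurrence of cancer.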
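As a small worked example of the odds, the sketch below converts a probability into odds and log-odds, then checks that the sigmoid function maps the log-odds back to the original probability. The function names are hypothetical helpers, not part of any library.

```python
import math

def odds(p):
    """Odds of an event with probability p, where 0 <= p < 1."""
    return p / (1.0 - p)

def logit(p):
    """Log-odds: the natural logarithm of the odds."""
    return math.log(odds(p))

def sigmoid(z):
    """Inverse of logit: maps a real number back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

p = 0.8
print("odds:", odds(p))                   # roughly 4, i.e. 4 to 1 in favor
print("log-odds:", logit(p))
print("recovered p:", sigmoid(logit(p)))  # approximately 0.8 again
```

This is why the sigmoid appears in logistic regression: the linear part of the model can be read as a log-odds, and the sigmoid turns it into a probability.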

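The logistic regression formula is:

$$P = \sigma\left(\sum_i w_i x_i + b\right)$$

where the sigmoid function is:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$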

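We pass the output of the linear model through the sigmoid function and use the result for binary classification: an output greater than 0.5 is assigned to class 1, and an output less than 0.5 is assigned to class 0.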
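The decision rule can be sketched in a few lines of NumPy. The weights `w`, bias `b`, and sample points below are made-up values for illustration, not parameters learned from any data, and ties at exactly 0.5 are assigned to class 1 here as an arbitrary choice.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w, b):
    """Probability of the positive class: sigmoid of the linear output."""
    return sigmoid(X @ w + b)

def predict_label(X, w, b, threshold=0.5):
    """Assign class 1 when the probability reaches the threshold, else class 0."""
    return (predict_proba(X, w, b) >= threshold).astype(int)

# Hypothetical parameters and inputs (two features per sample).
w = np.array([1.5, -2.0])
b = 0.25
X = np.array([[2.0, 0.5],
              [0.0, 1.0],
              [1.0, 1.0]])

probabilities = predict_proba(X, w, b)
labels = predict_label(X, w, b)
print(probabilities)
print(labels)
```

Thresholding the probability at 0.5 is the same as checking whether the linear output w·x + b is positive, which is why the decision boundary of logistic regression is a straight line (or a hyperplane in higher dimensions).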


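Advantages:
- No assumption about the type of distribution is required
- Results are obtained quickly
- The classification probability for each class is available
Disadvantages:
- Cannot solve nonlinear problems
- Does not handle a large number of features well and tends to overfit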
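The advantage of getting a probability for each class can be seen with scikit-learn's `predict_proba`. The sketch below is a simplified variant of the iris workflow: it fits on the whole dataset for brevity, and `max_iter=200` is an assumed setting chosen only to leave room for convergence.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X, y = iris.data, iris.target

# Standardize the features, then fit logistic regression.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
model.fit(X, y)

# One row per sample, one column per class; each row sums to 1.
probabilities = model.predict_proba(X[:3])
predicted_classes = model.predict(X[:3])
print(probabilities.round(3))
print(predicted_classes)
```

The predicted class is simply the column with the highest probability, so the probabilities also show how confident the model is in each prediction.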

The Sigmoid function

Let's first see what the Sigmoid function actually looks like.

import matplotlib.pyplot as plt
import numpy as np

def sigmoid(z):
    return 1/(1+np.exp(-z))

z = np.arange(-10, 10, 0.1)
phi_z = sigmoid(z)
plt.plot(z, phi_z)
plt.axhline(y=1, ls='dotted', color='black')
plt.axhline(y=0.5, ls='dotted', color='black')
plt.axhline(y=0, ls='dotted', color='black')
plt.show()
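One practical note on the implementation above: for very negative z, `np.exp(-z)` can overflow and emit a runtime warning. The following is a sketch of a numerically safer variant; the name `stable_sigmoid` is hypothetical, and the simple version is perfectly adequate for plotting over the range -10 to 10.

```python
import numpy as np

def stable_sigmoid(z):
    """Sigmoid that avoids overflow by never exponentiating a large positive number."""
    z = np.atleast_1d(np.asarray(z, dtype=float))
    out = np.empty_like(z)
    pos = z >= 0
    # For z >= 0, exp(-z) is at most 1, so this form is safe.
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    # For z < 0, rewrite as exp(z) / (1 + exp(z)); exp(z) is below 1 here.
    exp_z = np.exp(z[~pos])
    out[~pos] = exp_z / (1.0 + exp_z)
    return out

values = stable_sigmoid([-1000.0, -2.0, 0.0, 2.0, 1000.0])
print(values)
```

Both branches are algebraically the same function, so the curve is unchanged; only the order of operations differs.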

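Implementing a logistic regression model

Next, let's implement logistic regression. We will use the iris dataset, which is available directly from sklearn. Note that this dataset has three classes, so the model here performs multiclass classification rather than the binary case described above.

Logistic regression code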

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import datasets
# Load the iris data
iris = datasets.load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df['target'] = iris.target
iris_df['target_name'] = iris_df['target'].map({0: "setosa", 1: "versicolor", 2: "virginica"})
# Define X and y
X = iris_df.drop(labels=['target_name', 'target'] ,axis=1)
y = iris_df['target'].values
# Split the data into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state=0)
# Feature scaling (fit on the training set only, then apply to the test set)
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
# Fit the model
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state=0)
classifier.fit(X_train, y_train)
# Predict with the model
y_pred = classifier.predict(X_test)
# Fraction of correct predictions (accuracy)
print('Training set: ', classifier.score(X_train,y_train))
print('Test set: ', classifier.score(X_test,y_test))
>>> Training set:  0.9642857142857143
    Test set:  1.0

Confusion matrix

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
>>> array([[15,  0,  0],
          [ 0, 11,  0],
          [ 0,  0, 12]])
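To read a confusion matrix like this one, recall that in scikit-learn the rows are the true classes and the columns are the predicted classes, so correct predictions sit on the diagonal. The sketch below reuses the matrix shown above and derives the accuracy and per-class recall from it.

```python
import numpy as np

# The confusion matrix shown above: rows = true class, columns = predicted class.
cm = np.array([[15,  0,  0],
               [ 0, 11,  0],
               [ 0,  0, 12]])

# Accuracy: correct predictions (the diagonal) divided by all predictions.
accuracy = np.trace(cm) / cm.sum()

# Per-class recall: diagonal entry divided by the row total (all true samples of that class).
recall_per_class = np.diag(cm) / cm.sum(axis=1)

print("accuracy:", accuracy)
print("recall per class:", recall_per_class)
```

All off-diagonal entries are zero, which matches the test-set score of 1.0 reported earlier.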

F1-Score

from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
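The report lists precision, recall, and F1-score for each class. As a rough illustration of how those numbers are defined, the sketch below computes them by hand for one class; the helper `per_class_f1` and the label lists are hypothetical and deliberately include a few mistakes so the scores are not all 1.

```python
import numpy as np

def per_class_f1(y_true, y_pred, label):
    """Return (precision, recall, f1) for a single class label."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_true == label) & (y_pred == label))  # true positives
    fp = np.sum((y_true != label) & (y_pred == label))  # false positives
    fn = np.sum((y_true == label) & (y_pred != label))  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical labels with two misclassifications.
y_true_demo = [0, 0, 1, 1, 1, 2, 2, 2]
y_pred_demo = [0, 0, 1, 1, 2, 2, 2, 1]

for label in (0, 1, 2):
    print(label, per_class_f1(y_true_demo, y_pred_demo, label))
```

F1 is the harmonic mean of precision and recall, so it is high only when both are high.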

GitHub code

For more details, see the link.
