2024 iThome Ironman Contest

DAY 14
Self-Challenge Group

Learn Python from Scratch series, part 14

[Day14] Python Machine Learning (Multiple Linear Regression)

In the simple linear regression example, years of experience alone was used to predict salary, but in practice more factors matter. That is where multiple linear regression comes in: it folds more features into the prediction.

Simple linear regression:   y = w*x + b
Multiple linear regression: y = w1*x1 + w2*x2 + w3*x3 + ... + b
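
As a toy illustration of the formula (the numbers here are made up), the multi-feature prediction is just a weighted sum of the features plus the intercept:

import numpy as np

w = np.array([0.5, 1.2, 0.3])    # hypothetical weights, one per feature
b = 2.0                          # intercept
x = np.array([5.0, 1.0, 0.0])    # one sample with three features

np.dot(w, x) + b                 # w1*x1 + w2*x2 + w3*x3 + b = 5.7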


  1. Data preprocessing: two techniques, label encoding and one-hot encoding.
    Not every column is numeric the way years of experience is. A column like education level (below high school, bachelor's, master's and above) has to be converted first, for example by assigning a number according to the level (below high school: 0, bachelor's: 1, master's and above: 2).
import pandas as pd

url = "https://raw.githubusercontent.com/GrandmaCan/ML/main/Resgression/Salary_Data2.csv"
data = pd.read_csv(url)  # load the salary dataset
data

(Screenshot of the loaded DataFrame: https://ithelp.ithome.com.tw/upload/images/20240904/20168811yVmnuO3nQt.png)

Label encoding => data handling: since education level has a natural order, assign a number to each level (below high school: 0, bachelor's: 1, master's and above: 2).

data["EducationLevel"] = data["EducationLevel"].map({"高中以下":0, "大學":1, "碩士以上":2})
data

(Screenshot of the DataFrame after label encoding: https://ithelp.ithome.com.tw/upload/images/20240904/2016881186JYA81t1b.png)

One-hot encoding => data handling: the City column has no meaningful ordering, so instead of ranking it we split it into several features.
The data contains cityA, cityB, and cityC. Suppose a sample is cityA =>
after splitting, it becomes (cityA, cityB, cityC) = (1, 0, 0).
One of the features can actually be deleted: if cityC is removed, then cityA = (1, 0), cityB = (0, 1), cityC = (0, 0).
(Deleting a feature is optional.)

from sklearn.preprocessing import OneHotEncoder

onehot_encoder = OneHotEncoder()
onehot_encoder.fit(data[["City"]])  # double brackets: the encoder expects a 2-D input
city_encoded = onehot_encoder.transform(data[["City"]])
city_encoded
# Output:
# <36x3 sparse matrix of type '<class 'numpy.float64'>'
#     with 36 stored elements in Compressed Sparse Row format>
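
One detail: transform returns a SciPy sparse matrix, and assigning that directly into DataFrame columns may fail depending on the pandas version, so convert it to a dense array first. onehot_encoder.categories_ confirms the column order:

city_encoded = city_encoded.toarray()   # dense 36x3 array of 0s and 1s
onehot_encoder.categories_              # e.g. [array(['cityA', 'cityB', 'cityC'], dtype=object)]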

Append the encoded columns to the original table:

data[["CityA", "CityB", "CityC"]] = city_encoded
data

(Screenshot of the DataFrame with the three new city columns: https://ithelp.ithome.com.tw/upload/images/20240904/20168811oduE9NO5X2.png)

Drop the original City column and the CityC column:

data = data.drop(["City", "CityC"], axis = 1)  # axis = 1 means drop columns, not rows
data
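
As an aside, the encoder can also drop a redundant dummy column by itself. OneHotEncoder(drop="first") (the drop parameter is available in scikit-learn 0.21+) removes the alphabetically first category (cityA here) instead of the manual CityC drop used above; the resulting model is equivalent:

raw = pd.read_csv(url)                                      # re-read, since City was just dropped
encoder = OneHotEncoder(drop = "first")                     # drops the first category, cityA
dummies = encoder.fit_transform(raw[["City"]]).toarray()    # 36x2 array: cityB, cityC columns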

(Screenshot of the final preprocessed DataFrame: https://ithelp.ithome.com.tw/upload/images/20240904/201688111Dj9aBP5mD.png)


  2. Split the data: divide it into a training set and a test set; the training set usually accounts for 70-80% of the samples.
from sklearn.model_selection import train_test_split

x = data[["YearsExperience", "EducationLevel", "CityA", "CityB"]]
y = data["Salary"]

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 87)

# convert to NumPy arrays for easier calculation
x_train = x_train.to_numpy()
x_test = x_test.to_numpy()
y_train = y_train.to_numpy()
y_test = y_test.to_numpy()

Use StandardScaler from the sklearn.preprocessing module to standardize the data. The scaler is fit on the training set only and then applied to both sets, so no information from the test set leaks into training.

from sklearn.preprocessing import StandardScaler
import numpy as np

scaler = StandardScaler()
scaler.fit(x_train)                  # learn mean and std from the training set only
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)

w = np.array([1, 2, 3, 4])           # trial weights, one per feature
b = 1

y_pred = (x_train*w).sum(axis = 1) + b   # per-row w1*x1 + w2*x2 + w3*x3 + w4*x4 + b

(Screenshot of the resulting y_pred array: https://ithelp.ithome.com.tw/upload/images/20240904/20168811rgDgl8sULT.png)

  3. Find the cost function
    As with simple linear regression, multiple linear regression finds the best parameters (w1, w2, w3, w4, b) through a cost function.
    Which line fits the data best?
    How to choose it? => Look at the distance between the data points and the line, and pick the line that fits most closely.
    How to estimate that? => Subtract the predicted value on the line from each data point and square the difference; this can be written as a function.
    What function? => The cost function: cost = (actual value - predicted value)^2, averaged over all samples.
def compute_cost(x, y, w, b):
    y_pred = (x*w).sum(axis = 1) + b    # prediction for every sample
    cost = ((y - y_pred)**2).mean()     # mean squared error
    return cost

w = np.array([1, 2, 3, 4])
b = 1
compute_cost(x_train, y_train, w, b)
# Output: 1772.9485714285713
  4. Set up the optimizer: use gradient descent (adjust the parameters according to the slope).
    Differentiating the cost function gives the gradients (shown as an image in the original post; image source: screenshot from the GrandmaCan -我阿嬤都會 tutorial video):

    dcost/dw_i = 2 * mean((y_pred - y) * x_i)
    dcost/db   = 2 * mean(y_pred - y)

    The leading factor of 2 can be absorbed into the learning rate, so it is dropped in the computation below.
  • Compute the gradients
y_pred = (x_train*w).sum(axis = 1) + b
b_gradient = (y_pred - y_train).mean()
w_gradient = np.zeros(x_train.shape[1])   # one gradient entry per feature

for i in range(x_train.shape[1]):
    w_gradient[i] = (x_train[:, i]*(y_pred - y_train)).mean()

w_gradient, b_gradient
  • Wrap it into functions
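The gradient_descent function below takes a gradient_function argument, and the final call passes one named compute_gradient, so first package the gradient computation above as a function:

def compute_gradient(x, y, w, b):
    y_pred = (x*w).sum(axis = 1) + b
    w_gradient = np.zeros(x.shape[1])
    for i in range(x.shape[1]):
        w_gradient[i] = (x[:, i]*(y_pred - y)).mean()
    b_gradient = (y_pred - y).mean()
    return w_gradient, b_gradient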
np.set_printoptions(formatter = {"float": "{: .2e}".format})

def gradient_descent(x, y, w_init, b_init, learning_rate, cost_function, gradient_function, run_iter, p_iter = 1000):
    c_hist = []
    w_hist = []
    b_hist = []

    w = w_init
    b = b_init
    for i in range(run_iter):
        w_gradient, b_gradient = gradient_function(x, y, w, b)

        # step against the gradient
        w = w - w_gradient*learning_rate
        b = b - b_gradient*learning_rate
        cost = cost_function(x, y, w, b)

        w_hist.append(w)
        b_hist.append(b)
        c_hist.append(cost)

        if i%p_iter == 0:
            print(f"Iteration {i:5}: cost {cost: .4e}, w: {w}, b: {b: .2e}, w_gradient: {w_gradient}, b_gradient: {b_gradient: .2e}")

    return w, b, w_hist, b_hist, c_hist
w_init = np.array([1, 2, 2, 4])
b_init = 0
learning_rate = 1.0e-2
run_iter = 10000

w_final, b_final, w_hist, b_hist, c_hist = gradient_descent(x_train, y_train, w_init, b_init, learning_rate, compute_cost, compute_gradient, run_iter, p_iter = 1000)
y_pred = (w_final*x_test).sum(axis = 1) + b_final

pd.DataFrame({
    "y_pred" : y_pred, 
    "y_test" : y_test
})

(Screenshot of the y_pred vs. y_test comparison table: https://ithelp.ithome.com.tw/upload/images/20240904/20168811zCl7pwdaub.png)
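
To summarize the fit in a single number, compute_cost can be reused on the held-out test set (a small extra check not shown in the original post):

compute_cost(x_test, y_test, w_final, b_final)   # mean squared error on the test set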

# Predict salaries for two new samples:
# 5.3 years of experience, master's and above, City A
# 7.2 years of experience, below high school, City B
x_real = np.array([[5.3, 2, 1, 0], [7.2, 0, 0, 1]])
x_real = scaler.transform(x_real)   # standardize with the same scaler fitted on the training data
y_real = (w_final*x_real).sum(axis=1) + b_final
y_real
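
As a sanity check, scikit-learn's LinearRegression minimizes the same squared-error cost, so it should land on nearly the same parameters and predictions as the hand-rolled gradient descent above:

from sklearn.linear_model import LinearRegression

reg = LinearRegression()
reg.fit(x_train, y_train)      # same standardized training data
reg.coef_, reg.intercept_      # compare against w_final and b_final
reg.predict(x_real)            # compare against y_real above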

Previous: [Day13] Python Machine Learning (Simple Linear Regression) - 2
Next: [Day15] Python Machine Learning (Logistic Regression)
Series: Learn Python from Scratch (30 parts)