Python Algorithms Learning Journal, Day 2

The nine major steps of machine learning:

https://ithelp.ithome.com.tw/upload/images/20210621/20138527bNvtSwZ6Fc.png

https://yourfreetemplates.com/free-machine-learning-diagram/

Supervised learning:

  1. Regression: predicts a "continuous" output, e.g. house-price prediction.
  2. Classification: predicts a "discrete" output, e.g. predicting survival on the Titanic.
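The two task types can be told apart directly by their targets. A minimal sketch using two datasets bundled with scikit-learn, iris (classification) and diabetes (regression):

```python
import numpy as np
from sklearn import datasets

# Classification: the iris target is a small set of discrete class labels
iris = datasets.load_iris()
print(np.unique(iris.target))     # [0 1 2] -> three discrete classes

# Regression: the diabetes target is a continuous real-valued measurement
diabetes = datasets.load_diabetes()
print(diabetes.target[:3])        # real numbers, not class labels
```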

Example 1: iris (classification):

  1. Data Set
import pandas as pd
import numpy as np
from sklearn import datasets     # import the datasets module from Scikit-Learn

ds = datasets.load_iris()        # load_iris: load the bundled iris dataset
print(ds.DESCR)                  # DESCR: description of the loaded dataset

# Build X/y (ds.data: values, ds.feature_names: column names)
X = pd.DataFrame(ds.data, columns=ds.feature_names)
y = ds.target
  2. Data Clean (missing-value check)
print(X.isna().sum())

Output:
sepal length (cm) 0
sepal width (cm) 0
petal length (cm) 0
petal width (cm) 0
dtype: int64
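Iris happens to have no missing values. If isna().sum() did report some, two common fixes are dropping the incomplete rows or imputing them, sketched here on a tiny hypothetical frame:

```python
import numpy as np
import pandas as pd

# A toy frame with one deliberately missing entry (hypothetical data)
df = pd.DataFrame({'sepal length (cm)': [5.1, np.nan, 6.3],
                   'sepal width (cm)':  [3.5, 3.0, 3.3]})
print(df.isna().sum())            # one NaN in the first column

# Option 1: drop incomplete rows
print(df.dropna().shape)          # (2, 2)

# Option 2: impute with the column mean
filled = df.fillna(df.mean())
print(filled.isna().sum().sum())  # 0
```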

  3. Feature Engineering
    e.g. soybean length/width/height → soybean volume
    This example needs none, so no extra data preparation is required.

  4. Data Split (Training data & Test data)
    Split the data into a training set and a test set.
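The soybean feature-engineering example above (length/width/height → volume) amounts to deriving one more informative column from raw measurements; the column names and values here are hypothetical:

```python
import pandas as pd

# Hypothetical raw soybean measurements
beans = pd.DataFrame({'length': [7.0, 6.5],
                      'width':  [4.0, 3.8],
                      'height': [3.0, 2.9]})

# Combine three raw measurements into a single derived feature
beans['volume'] = beans['length'] * beans['width'] * beans['height']
print(beans['volume'].round(2).tolist())   # [84.0, 71.63]
```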

from sklearn.model_selection import train_test_split    

# test_size=0.2: hold out 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

print(X_train.shape, y_train.shape)
>> (120, 4) (120,)
  5. Define and train the KNN model
from sklearn.neighbors import KNeighborsClassifier

# n_neighbors: a hyperparameter
clf = KNeighborsClassifier(n_neighbors=3)

# Fit (train); regression/classification/dimensionality reduction all use fit(X_train, y_train)
clf.fit(X_train, y_train)

# score: run the model on the test inputs and grade the result
print(f'score={clf.score(X_test, y_test)}')
>> score=1.0

# Compare predictions with the true answers
print(' '.join(y_test.astype(str)))
print(' '.join(clf.predict(X_test).astype(str)))

# result: all correct
>> 0 2 1 1 2 2 0 0 0 0 0 0 2 0 0 2 2 1 1 2 0 1 2 2 0 0 0 2 0 1
>> 0 2 1 1 2 2 0 0 0 0 0 0 2 0 0 2 2 1 1 2 0 1 2 2 0 0 0 2 0 1

# Inspect the predicted probabilities
print(clf.predict_proba(X_test))  # class probabilities for each row of X_test
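For KNN, predict_proba is simply the fraction of the k nearest neighbours that belong to each class, so with n_neighbors=3 every probability is a multiple of 1/3. A minimal sketch on hypothetical 1-D data:

```python
from sklearn.neighbors import KNeighborsClassifier

# Tiny 1-D toy set: two clusters, one per class (hypothetical data)
X_toy = [[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]]
y_toy = [0, 0, 0, 1, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_toy, y_toy)

# For x=6.9 the three nearest points are 10.0, 11.0 (class 1) and 2.0
# (class 0), so the probabilities are [1/3, 2/3]
print(knn.predict_proba([[6.9]]))
```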

Example 2: breast cancer (classification):

import pandas as pd
import numpy as np
from sklearn import datasets

# 1. Dataset
ds = datasets.load_breast_cancer()
X = pd.DataFrame(ds.data, columns=ds.feature_names)
y = ds.target

# 2. Data clean
print(X.isna().sum())

# 3. Feature Engineering

# 4. Split
from sklearn.model_selection import train_test_split    
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

# 5. Define and train the KNN model
from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier(n_neighbors = 3)

# Fit (train); regression/classification/dimensionality reduction all use fit(X_train, y_train)
clf.fit(X_train, y_train)

# score: run the model on the test inputs and grade the result
print(f'score={clf.score(X_test, y_test)}')
>> score=0.9385964912280702

# Compare predictions with the true answers
print(' '.join(y_test.astype(str)))
print(' '.join(clf.predict(X_test).astype(str)))

# Inspect the predicted probabilities
print(clf.predict_proba(X_test))
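KNN is distance-based, and the breast-cancer features live on very different scales, so standardising them before fitting usually helps. A sketch using a Pipeline (whether it beats the 0.938 above depends on the random split; random_state=0 here is an arbitrary choice for reproducibility):

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

ds = datasets.load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    ds.data, ds.target, test_size=0.2, random_state=0)

# Rescale each feature to zero mean / unit variance before KNN
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
pipe.fit(X_train, y_train)
print(f'scaled score={pipe.score(X_test, y_test)}')
```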

Example 3: Boston housing prices (regression):

import pandas as pd
import numpy as np
from sklearn import datasets

# 1. Dataset
ds = datasets.load_boston()      # note: removed in scikit-learn 1.2; newer versions need an alternative dataset
X = pd.DataFrame(ds.data, columns=ds.feature_names)
y = ds.target

# 2. Data clean
print(X.isna().sum())

# 3. Feature Engineering

# 4. Split
from sklearn.model_selection import train_test_split    
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print(X_train.shape, y_train.shape)
>> (404, 13) (404,)

# 5. Define and train the LinearRegression model
from sklearn.linear_model import LinearRegression

reg = LinearRegression()

# Fit (train); regression/classification/dimensionality reduction all use fit(X_train, y_train)
reg.fit(X_train, y_train)

# score: run the model on the test inputs and grade the result
print(f'score={reg.score(X_test, y_test)}')
>> score=0.6008214413101689
# Regression score: R² = 1 - (SSE/SST)

# Compare predictions with the true answers
print(list(y_test))
b = [round(i, 2) for i in reg.predict(X_test)]   # predictions rounded to 2 decimal places
print(b)
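The regression score is R² = 1 − SSE/SST, which can be verified by hand against .score(). A sketch using load_diabetes instead of Boston (since load_boston was removed from scikit-learn 1.2); random_state=0 is an arbitrary choice for reproducibility:

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

ds = datasets.load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(
    ds.data, ds.target, test_size=0.2, random_state=0)

reg = LinearRegression().fit(X_train, y_train)
pred = reg.predict(X_test)

# R^2 = 1 - SSE/SST, exactly what .score() reports
sse = ((y_test - pred) ** 2).sum()
sst = ((y_test - y_test.mean()) ** 2).sum()
print(np.isclose(1 - sse / sst, reg.score(X_test, y_test)))  # True
```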

Homework 1: tips (regression):

The 'tips.csv' data looks like this:
https://ithelp.ithome.com.tw/upload/images/20210621/20138527HWDjRZ7hzZ.png
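A possible starting point for the homework. The inline frame below is a tiny hypothetical stand-in for pd.read_csv('tips.csv') (assumed columns: total_bill, tip, sex, smoker, day, time, size); the categorical columns are one-hot encoded so LinearRegression can use them:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for pd.read_csv('tips.csv')
df = pd.DataFrame({
    'total_bill': [16.99, 10.34, 21.01, 23.68, 24.59, 25.29, 8.77, 26.88],
    'tip':        [1.01, 1.66, 3.50, 3.31, 3.61, 4.71, 2.00, 3.12],
    'sex':        ['F', 'M', 'M', 'M', 'F', 'M', 'M', 'M'],
    'smoker':     ['No'] * 8,
    'day':        ['Sun'] * 8,
    'time':       ['Dinner'] * 8,
    'size':       [2, 3, 3, 2, 4, 4, 2, 4],
})

# One-hot encode the categorical columns; drop_first avoids redundant dummies
X = pd.get_dummies(df.drop(columns='tip'), drop_first=True)
y = df['tip']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=0)
reg = LinearRegression().fit(X_train, y_train)
print(f'score={reg.score(X_test, y_test):.3f}')
```

With real tips.csv data the same steps apply; only the read_csv call and the resulting score change.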

