Python 演算法 Day 11 - Feature Extraction

自學筆記

s790502ss 2021-12-01 20:08:02 ‧ 5321 瀏覽

分享至

Chap.II Machine Learning 機器學習

https://yourfreetemplates.com/free-machine-learning-diagram/

Part 2. Feature Engineering 特徵工程

2-3. Feature Extraction 特徵萃取

同特徵選擇，特徵萃取一樣有幾種方式。以下介紹 PCA、LDA & Kernal PCA

A. Principal Component Analysis (PCA) 主成分分析

為非監督式學習（不需要有 Y），可線性分離。
將原數據轉換，投影到較低維度的特徵空間，使該座標軸保留最多資訊為前提的壓縮方式。
因為其非監督式學習的特性，對於預測資料來說泛用度較高，是大企業常用的工具。

*下圖為例，將 x1+ x2 轉換為 PC1 + PC2。

以下使用 wine 作為範例：

# 1. Datasets
import pandas as pd

df_wine = pd.read_csv('https://archive.ics.uci.edu/ml/'
                      'machine-learning-databases/wine/wine.data',
                      header=None)
df_wine.columns = ['Class label', 'Alcohol', 'Malic acid', 'Ash',
                   'Alcalinity of ash', 'Magnesium', 'Total phenols',
                   'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins',
                   'Color intensity', 'Hue',
                   'OD280/OD315 of diluted wines', 'Proline']

df_wine.head()

# 定義 y
df_wine['Class label'].value_counts()

>>  2    71
    1    59
    3    48
    Name: Class label, dtype: int64

# 2. Clean Data
# 不需要

# 3. Split Data
# 拆 30% 出來作測試資料
from sklearn.model_selection import train_test_split
X, y = df_wine.iloc[:, 1:].values, df_wine.iloc[:, 0].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

4-1. Standardlization
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)
X_test_std = sc.transform(X_test)

# 有 13 個特徵
X_train_std.shape, X_test_std.shape

>>  ((124, 13), (54, 13))

Sklearn 內建 PCA

(自行開發 PCA 見補充 1.)

from sklearn.decomposition import PCA

pca1 = PCA()
X_train_pca = pca1.fit_transform(X_train_std)
eigen_vals = pca1.explained_variance_ratio_
eigen_vals

>>  array([0.36951469, 0.18434927, 0.11815159, 0.07334252, 0.06422108,
           0.05051724, 0.03954654, 0.02643918, 0.02389319, 0.01629614,
           0.01380021, 0.01172226, 0.00820609])

使用權重 w = 2 項

pca2 = PCA(n_components=2)
X_train_pca = pca2.fit_transform(X_train_std)
X_test_pca = pca2.transform(X_test_std)

X_train_pca.shape, X_test_pca.shape
>>  ((124, 2), (54, 2))

作圖

plt.scatter(X_train_pca[:, 0], X_train_pca[:, 1])
plt.xlabel('PC 1')
plt.ylabel('PC 2')

plt.show()

也可以指定若要達到 90% 覆蓋率，要取多少項

pca5 = PCA(0.9)
X_train_pca = pca5.fit_transform(X_train_std)
pca5.explained_variance_ratio_

PS. 只是這樣就沒辦法畫出圖片了~畢竟維度提升到了 5 維。

B. Linear Discriminant Analysis (LDA)

是一種監督式學習（需要有 Y），也可線性分離。
因為為監督式，故在求取「類別內散布矩陣（Sw）」望小，「類別外散布矩陣（Sb）」則望大。
類似 PCA，也有降冪排序。

圖中藍圈內部為 Sw，藍圈與黃圈互為 Sb

我們沿用上面 wine 的前處理，直接從使用 LDA 開始。

Sklearn 內建 LDA

(自行開發 LDA 見補充 2.)

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

lda = LDA(n_components=2)
X_train_lda = lda.fit_transform(X_train_std, y_train)

接著選用一個演算法來評分：

# LDA 畫 train data 圖
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr = lr.fit(X_train_lda, y_train)

plot_decision_regions(X_train_lda, y_train, classifier=lr)
plt.xlabel('LD 1')
plt.ylabel('LD 2')
plt.legend(loc='lower left')
plt.tight_layout()

plt.show()

效果顯著的比 PCA 還好~（當然是因為 LDA 多考慮了 Y）

C. Kernal PCA (KPCA)

以下使用 make_circles 作為 datasets：

from sklearn.datasets import make_circles

X, y = make_circles(n_samples=1000, random_state=123, noise=0.1, factor=0.2)

plt.scatter(X[y == 0, 0], X[y == 0, 1], color='red', marker='^', alpha=0.5)
plt.scatter(X[y == 1, 0], X[y == 1, 1], color='blue', marker='o', alpha=0.5)
plt.tight_layout()

plt.show()

Part I. 使用原本 PCA 來轉換

from sklearn.decomposition import PCA

scikit_pca = PCA(n_components=2)
X_spca = scikit_pca.fit_transform(X)

fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(7, 3))

ax[0].scatter(X_spca[y == 0, 0], X_spca[y == 0, 1],
              color='red', marker='^', alpha=0.5)
ax[0].scatter(X_spca[y == 1, 0], X_spca[y == 1, 1],
              color='blue', marker='o', alpha=0.5)

ax[1].scatter(X_spca[y == 0, 0], np.zeros((500, 1)) + 0.02,
              color='red', marker='^', alpha=0.5)
ax[1].scatter(X_spca[y == 1, 0], np.zeros((500, 1)) - 0.02,
              color='blue', marker='o', alpha=0.5)

ax[0].set_xlabel('PC1')
ax[0].set_ylabel('PC2')
ax[1].set_ylim([-1, 1])
ax[1].set_yticks([])
ax[1].set_xlabel('PC1')

plt.tight_layout()
plt.savefig('Pic/KernalPCA (numpy) Ans02-1.png', dpi=300)

plt.show()

Part II. Sklearn 內建 KPCA

(自行開發 KPCA 見補充 3.)

額外提下，gamma 值表示距離決策邊界越近的點的權重。
gamma 越大，決策邊界附近的點會大幅度影響決策邊界的形狀，可能導致 overfitting。
PS. 此處有 KPCA gamma 值的影片。

from sklearn.decomposition import KernelPCA

# kernel: 'linear'=線性，即 PCA。'poly'=多項式。'rbf'=以半徑當基準。'sigmoid'=邏輯式迴歸。
# n_components: 降維至多少維度。
# gamma: Kernel coefficient for linear, poly, rbf or sigmoid。
clf = KernelPCA(kernel='rbf', n_components=2, gamma=15)
X_kpca2 = clf.fit_transform(X)

fig, ax = plt.subplots(nrows=1,ncols=2, figsize=(7,3))
ax[0].scatter(X_kpca2[y == 0, 0], X_kpca2[y == 0, 1],
              color='red', marker='^', alpha=0.5)
ax[0].scatter(X_kpca2[y == 1, 0], X_kpca2[y == 1, 1],
              color='blue', marker='o', alpha=0.5)

ax[1].scatter(X_kpca2[y == 0, 0], np.zeros((500, 1)) + 0.02,
              color='red', marker='^', alpha=0.5)
ax[1].scatter(X_kpca2[y == 1, 0], np.zeros((500, 1)) - 0.02,
              color='blue', marker='o', alpha=0.5)

ax[0].set_xlabel('PC1')
ax[0].set_ylabel('PC2')
ax[1].set_ylim([-1, 1])
ax[1].set_yticks([])
ax[1].set_xlabel('PC1')

plt.tight_layout()

plt.show()

結論：

在 3 種資料萃取的方法中，PCA 最泛用，可以處理大部分的資料轉換。
LDA 則是 PCA 強化版，但條件嚴苛。
實作中常遇到沒收集到齊全的 Y，即使有，也可能沒多餘資源去把資料一筆一筆標記上。
最後，KPCA 則可以應對較為複雜的，非線性分離（多項式、球形、邏輯迴歸）的資料類型。
.
.
.
.
.
*補充1.：
自行開發 PCA

# 取 X_train_std 的 eigenvalues & eigenvector
import numpy as np
cov_mat = np.cov(X_train_std.T)
eigen_vals, eigen_vecs = np.linalg.eig(cov_mat)

# 把 eigenvalues 依照比例排序
tot = sum(eigen_vals)

import matplotlib.pyplot as plt
# 柱狀圖
var_exp = [(i / tot) for i in sorted(eigen_vals, reverse=True)]
plt.bar(range(1, 14), var_exp, alpha=0.5, align='center', label='individual explained variance')

# 階梯圖
cum_var_exp = np.cumsum(var_exp)
plt.step(range(1, 14), cum_var_exp, where='mid', label='cumulative explained variance')

plt.ylabel('Explained variance ratio')
plt.xlabel('Principal component index')
plt.legend(loc='best')
plt.tight_layout()
plt.axhline(0.9, color='r', linestyle='--', linewidth=1)

plt.show()

# 把 eigenvalue & eigenvector 合成一個 list
eigen_pairs = [(np.abs(eigen_vals[i]), eigen_vecs[:, i]) for i in range(len(eigen_vals))]

# 由大至小排序 list
eigen_pairs.sort(key=lambda k: k[0], reverse=True)

定義權重!!!

# 為 eigen_pairs 增加一個維度，並黏合成為一個 13x2 的矩陣
w2 = np.hstack((eigen_pairs[0][1][:, np.newaxis],
                eigen_pairs[1][1][:, np.newaxis]))
                
print('Matrix W:\n', w2)

>>  Matrix W:
     [[-0.13724218  0.50303478]
     [ 0.24724326  0.16487119]
     [-0.02545159  0.24456476]
     [ 0.20694508 -0.11352904]
     [-0.15436582  0.28974518]
     [-0.39376952  0.05080104]
     [-0.41735106 -0.02287338]
     [ 0.30572896  0.09048885]
     [-0.30668347  0.00835233]
     [ 0.07554066  0.54977581]
     [-0.32613263 -0.20716433]
     [-0.36861022 -0.24902536]
     [-0.29669651  0.38022942]]

轉換特徵值，由 13 to 2

X_train_pca = X_train_std.dot(w2)
X_train_std.shape, X_train_pca.shape

>>  ((124, 13), (124, 2))

作圖，看看能否以 2 項權重分辨出 3 種酒類。

colors = ['r', 'b', 'g']
markers = ['s', 'x', 'o']

for l, c, m in zip(np.unique(y_train), colors, markers):
    # 把 X_train_pca 的第一項作為 x 軸，第二項作為 y 軸作圖
    x = X_train_pca[y_train == l, 0]
    y = X_train_pca[y_train == l, 1]
    print(x.shape, y.shape)
    
    plt.scatter(x, y, c=c, label=l, marker=m)

plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.legend(loc='lower left')
plt.tight_layout()

plt.show()

*補充 2.：
自行開發 LDA

np.set_printoptions(precision=4)

mean_vecs = []
for label in range(1, 4):
    mean_vecs.append(np.mean(X_train_std[y_train == label], axis=0))
    print('MV %s: %s\n' % (label, mean_vecs[label - 1]))
    
d = 13 # 原特徵數
S_W = np.zeros((d, d))
for label, mv in zip(range(1, 4), mean_vecs):
    class_scatter = np.zeros((d, d))  # scatter matrix for each class
    for row in X_train_std[y_train == label]:
        row, mv = row.reshape(d, 1), mv.reshape(d, 1)  # make column vectors
        class_scatter += (row - mv).dot((row - mv).T)
    S_W += class_scatter                          # sum class scatter matrices

print('Within-class scatter matrix: %sx%s' % (S_W.shape[0], S_W.shape[1]))
>>  Within-class scatter matrix: 13x13

d = 13  # 原特徵數
S_W = np.zeros((d, d))
for label, mv in zip(range(1, 4), mean_vecs):
    class_scatter = np.cov(X_train_std[y_train == label].T)
    S_W += class_scatter
print('Scaled within-class scatter matrix: %sx%s' % (S_W.shape[0],
                                                     S_W.shape[1]))

>>  Scaled within-class scatter matrix: 13x13

mean_overall = np.mean(X_train_std, axis=0)
d = 13  # 原特徵數
S_B = np.zeros((d, d))
for i, mean_vec in enumerate(mean_vecs):
    n = X_train[y_train == i + 1, :].shape[0]
    mean_vec = mean_vec.reshape(d, 1)  # make column vector
    mean_overall = mean_overall.reshape(d, 1)  # make column vector
    S_B += n * (mean_vec - mean_overall).dot((mean_vec - mean_overall).T)

print('Between-class scatter matrix: %sx%s' % (S_B.shape[0], S_B.shape[1]))

>>  Between-class scatter matrix: 13x13

# 為新特徵子空間選擇線性判別式
eigen_vals, eigen_vecs = np.linalg.eig(np.linalg.inv(S_W).dot(S_B))

# (eigenvalue, eigenvector) tuples
eigen_pairs = [(np.abs(eigen_vals[i]), eigen_vecs[:, i])
               for i in range(len(eigen_vals))]

# Sort (eigenvalue, eigenvector)
eigen_pairs = sorted(eigen_pairs, key=lambda k: k[0], reverse=True)

# 通過減少特徵值直觀地確認列表是否正確排序
print('Eigenvalues in descending order:\n')
for eigen_val in eigen_pairs:
    print(eigen_val[0])
    
>>  Eigenvalues in descending order:
    349.617808905994
    172.76152218979385
    3.478228588635107e-14
    2.842170943040401e-14
    2.0792193804944213e-14
    2.0792193804944213e-14
    1.460811844224635e-14
    1.460811844224635e-14
    1.4555923097122117e-14
    7.813418013637288e-15
    7.813418013637288e-15
    6.314269790397111e-15
    6.314269790397111e-15

# 做出權重
w = np.hstack((eigen_pairs[0][1][:, np.newaxis].real,
              eigen_pairs[1][1][:, np.newaxis].real))
print('Matrix W:\n', w)
>> Matrix W:
     [[-0.1481 -0.4092]
     [ 0.0908 -0.1577]
     [-0.0168 -0.3537]
     [ 0.1484  0.3223]
     [-0.0163 -0.0817]
     [ 0.1913  0.0842]
     [-0.7338  0.2823]
     [-0.075  -0.0102]
     [ 0.0018  0.0907]
     [ 0.294  -0.2152]
     [-0.0328  0.2747]
     [-0.3547 -0.0124]
     [-0.3915 -0.5958]]

# 作圖
X_train_lda = X_train_std.dot(w)
colors = ['r', 'b', 'g']
markers = ['s', 'x', 'o']

for l, c, m in zip(np.unique(y_train), colors, markers):
    plt.scatter(X_train_lda[y_train == l, 0],
                X_train_lda[y_train == l, 1] * (-1),
                c=c, label=l, marker=m)

plt.xlabel('LD 1')
plt.ylabel('LD 2')
plt.legend(loc='lower right')
plt.tight_layout()

plt.show()

*補充 3.：
自行開發 KPCA

from scipy.spatial.distance import pdist, squareform
from scipy import exp
from scipy.linalg import eigh
import numpy as np

def rbf_kernel_pca(X, gamma, n_components):
    """
    RBF kernel PCA implementation.

    Parameters
    ------------
    X: {NumPy ndarray}, shape = [n_samples, n_features]
        
    gamma: float
      Tuning parameter of the RBF kernel
        
    n_components: int
      Number of principal components to return

    Returns
    ------------
     X_pc: {NumPy ndarray}, shape = [n_samples, k_features]
       Projected dataset

    """
    # 計算 MxN 維數據集中的 pairwise squared Euclidean distances
    sq_dists = pdist(X, 'sqeuclidean')

    # 將 pairwise distances 轉換為 square matrix
    mat_sq_dists = squareform(sq_dists)

    # 計算 symmetric kernel matrix
    K = exp(-gamma * mat_sq_dists)

    # 將 kernel matrix 居中。
    N = K.shape[0]
    one_n = np.ones((N, N)) / N
    K = K - one_n.dot(K) - K.dot(one_n) + one_n.dot(K).dot(one_n)

    # 從 centered kernel matrix 獲得 obtaining eigenpairs
    # scipy.linalg.eigh: 按升序返還
    eigvals, eigvecs = eigh(K)
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

    # 收集前 k 個 eigenvectors (projected samples)
    X_pc = np.column_stack((eigvecs[:, i]
                            for i in range(n_components)))

    return X_pc

X_kpca = rbf_kernel_pca(X, gamma=15, n_components=2)

fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(7, 3))
ax[0].scatter(X_kpca[y == 0, 0], X_kpca[y == 0, 1],
              color='red', marker='^', alpha=0.5)
ax[0].scatter(X_kpca[y == 1, 0], X_kpca[y == 1, 1],
              color='blue', marker='o', alpha=0.5)

ax[1].scatter(X_kpca[y == 0, 0], np.zeros((500, 1)) + 0.02,
              color='red', marker='^', alpha=0.5)
ax[1].scatter(X_kpca[y == 1, 0], np.zeros((500, 1)) - 0.02,
              color='blue', marker='o', alpha=0.5)

ax[0].set_xlabel('PC1')
ax[0].set_ylabel('PC2')
ax[1].set_ylim([-1, 1])
ax[1].set_yticks([])
ax[1].set_xlabel('PC1')

plt.tight_layout()

plt.show()

.
.
.
.
.

Homework Answer：

試著用 sklearn 的資料集 breast_cancer，操作 Featuring Selection (by RandomForest)。

# Datasets
from sklearn.datasets import load_breast_cancer
import pandas as pd

ds = load_breast_cancer()

df_X = pd.DataFrame(ds.data, columns=ds.feature_names)
df_y = pd.DataFrame(ds.target, columns=['Cancer or Not'])

df_X.head()

# define y
df_y['Cancer or Not'].unique()

>>  array([0, 1])

# Split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df_X, df_y, test_size=0.3)
X_train.shape, X_test.shape

>>  ((398, 30), (171, 30))

# 隨機森林演算法
from sklearn.ensemble import RandomForestClassifier
import numpy as np

rfc = RandomForestClassifier(n_estimators=500, random_state=1)
rfc.fit(X_train, y_train)

# 把每一個變數特徵的重要性列出，從大排到小
ipt = rfc.feature_importances_
ipt_sort = np.argsort(ipt)[::-1]

for i in range(X_train.shape[1]):
    print(f'{i+1:>2d}) {ds.feature_names[ipt_sort[f]]:<30s} {ipt[ipt_sort[i]]:.4f}')
    
>>   1) worst perimeter                0.1442
     2) worst radius                   0.1199
    ...
    ..
    .
    30) concave points error           0.0035

# 只取兩項 feature
X_train_2 = X_train[['worst radius', 'worst perimeter']]
X_test_2 = X_test[['worst radius', 'worst perimeter']]

X_train_2.shape, X_test_2.shape

>>  ((398, 2), (171, 2))

# Modeling (by LogisticRegression)
from sklearn.linear_model import LogisticRegression as lr

clf = lr(solver='liblinear')
clf.fit(X_train_2, y_train)

print(clf.score(X_test_2, y_test))

>>  0.9064327485380117

.
.
.
.
.

Homework：

試著用 sklearn 的資料集 breast_cancer，操作 Featuring Extraction (by PCA)。

熱門推薦

{{ item.channelVendor }} | {{ item.webinarstarted }} |

直播中

尚未有邦友留言

立即登入留言

參賽組數

902 組

團體組數

37 組

累計文章數

19864 篇

完賽人數

529 人

15th鐵人賽 16th鐵人賽 13th鐵人賽 14th鐵人賽 17th鐵人賽 12th鐵人賽 11th鐵人賽鐵人賽 2019鐵人賽 javascript 2018鐵人賽 python 2017鐵人賽 windows php c# linux windows server css react

IT邦幫忙