Day 26 [Python ML, Data Cleaning] Data Scaling and Normalization

2021 iThome Ironman Contest

DAY 26

AI & Data

Learning Machine Learning with Python series, Part 26

13th Ironman Contest

guancioul

Team: 人工逗點智慧

2021-10-09 11:20:04


In this post we'll learn how to normalize and scale data.

Setting up the environment

# modules we'll use
import pandas as pd
import numpy as np

# for Box-Cox Transformation
from scipy import stats

# for min_max scaling
from mlxtend.preprocessing import minmax_scaling

# plotting modules
import seaborn as sns
import matplotlib.pyplot as plt

# set seed for reproducibility
np.random.seed(0)

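The difference between scaling and normalization

Scaling and normalization are easy to confuse, because in both cases you transform the values of numeric variables.

The main difference between the two is:

Scaling - changes the range of the data
Normalization - changes the shape of the distribution of the data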
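To make the distinction concrete, here is a minimal sketch. It assumes a made-up exponential sample, uses skewness as a rough summary of "shape", and stands in a hand-written min-max rescale for scaling and SciPy's Box-Cox for normalization.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.exponential(size=1000)  # right-skewed sample, for illustration only

# Scaling: min-max rescale to [0, 1]. This is a linear map, so the shape is kept.
scaled = (data - data.min()) / (data.max() - data.min())

# Normalization: Box-Cox reshapes the distribution toward a normal one.
normalized, fitted_lambda = stats.boxcox(data)

print("range, original:", data.min(), data.max())
print("range, scaled:  ", scaled.min(), scaled.max())
print("skew, original: ", stats.skew(data))
print("skew, scaled:   ", stats.skew(scaled))
print("skew, Box-Cox:  ", stats.skew(normalized))
```

The scaled data should report the same skewness as the original (only the range moved), while the Box-Cox output should have a skewness much closer to 0.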

Let's now look at each of these in more depth.

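Scaling

Scaling means transforming your data so that it fits within a specific range, such as 0-100 or 0-1.

You want to scale data when you use methods based on how far apart data points are, such as support vector machines (SVM) or k-nearest neighbors (KNN).

With these algorithms, a change of "1" in any numeric feature is given the same importance.

For example, suppose a dataset has prices in both Yen and US Dollars. 1 US Dollar is worth about 100 Yen, but if we don't scale the data, SVM or KNN will treat a difference of 1 Yen as just as important as a difference of 1 US Dollar.

Scaling your variables lets you compare different variables on an equal footing.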
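The Yen versus US Dollar point can be sketched with a tiny nearest-neighbor lookup. Everything here is hypothetical: the four products, their prices, and the helper `nearest_to_first` are made up for illustration, and the scaling is a plain per-column min-max written with NumPy.

```python
import numpy as np

# Hypothetical products: column 0 is a fee in US Dollars, column 1 a price in Yen
prices = np.array([
    [1.0, 1000.0],   # product A (the query point)
    [9.0, 1010.0],   # product B
    [1.2, 1050.0],   # product C
    [10.0, 2000.0],  # product D
])

def nearest_to_first(points):
    """Return the index of the row closest (Euclidean) to row 0, excluding row 0."""
    distances = np.linalg.norm(points[1:] - points[0], axis=1)
    return int(np.argmin(distances)) + 1

# Min-max scale each column independently to the range [0, 1]
col_min = prices.min(axis=0)
col_max = prices.max(axis=0)
scaled_prices = (prices - col_min) / (col_max - col_min)

print("nearest to A, unscaled:", nearest_to_first(prices))
print("nearest to A, scaled:  ", nearest_to_first(scaled_prices))
```

With these made-up numbers, the unscaled lookup picks product B (index 1), because a 10 Yen gap counts as much as a 10 Dollar gap would. After scaling, the lookup picks product C (index 2), which is close to A in both columns.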

# generate 1000 data points randomly drawn from an exponential distribution
original_data = np.random.exponential(size=1000)

# min-max scale the data between 0 and 1
scaled_data = minmax_scaling(original_data, columns=[0])

# plot both together to compare
fig, ax = plt.subplots(1, 2)
sns.histplot(original_data, ax=ax[0], kde=True, legend=False)
ax[0].set_title("Original Data")
sns.histplot(scaled_data, ax=ax[1], kde=True, legend=False)
ax[1].set_title("Scaled data")

Text(0.5, 1.0, 'Scaled data')

[Figure: distribution of the original data (left) and the scaled data (right)]

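fig, ax = plt.subplots(1,2) creates a figure with two plots side by side: 1 row, 2 columns.
minmax_scaling(original_data, columns=[0]) rescales the data to lie between 0 and 1.

In the figure above, the data that originally ranged from about 0 to 8 now ranges from 0 to 1. The shape of the distribution has not changed.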
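For reference, min-max scaling is just x' = (x - min) / (max - min), optionally stretched to another target range. The sketch below writes that formula out by hand with NumPy; the function name `min_max_scale` and its parameters are my own for illustration, not part of mlxtend.

```python
import numpy as np

def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly rescale a 1-D array so it spans [new_min, new_max]."""
    values = np.asarray(values, dtype=float)
    value_range = values.max() - values.min()
    if value_range == 0:
        raise ValueError("cannot min-max scale a constant array")
    unit = (values - values.min()) / value_range  # now in [0, 1]
    return unit * (new_max - new_min) + new_min

example = np.array([2.0, 4.0, 6.0, 10.0])
print(min_max_scale(example))          # scaled to 0-1
print(min_max_scale(example, 0, 100))  # scaled to 0-100
```

Because the map is linear, the order of the points and the relative spacing between them stay the same; only the units change.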

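Normalization

Scaling only changes the range of your data. Normalization is a more radical transformation.

The point of normalization is to change your observations so that they can be described as a normal distribution.

The normal distribution, also known as the "bell curve", is a specific statistical distribution in which roughly equal numbers of observations fall above and below the mean, the mean and the median are the same, and more observations lie close to the mean. The normal distribution is also called the Gaussian distribution.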
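These properties are easy to check numerically. The sketch below draws one normal and one exponential sample (the sizes and parameters are arbitrary choices of mine) and compares the mean with the median, along with the share of observations above the mean.

```python
import numpy as np

rng = np.random.default_rng(0)
normal_sample = rng.normal(loc=50, scale=10, size=10_000)
skewed_sample = rng.exponential(scale=10, size=10_000)

def describe(name, sample):
    """Print the statistics that characterize a bell curve."""
    mean = sample.mean()
    median = np.median(sample)
    share_above_mean = (sample > mean).mean()
    share_within_one_std = (np.abs(sample - mean) <= sample.std()).mean()
    print(f"{name}: mean={mean:.2f} median={median:.2f} "
          f"above_mean={share_above_mean:.2f} within_1_std={share_within_one_std:.2f}")

describe("normal     ", normal_sample)
describe("exponential", skewed_sample)
```

For the normal sample the mean and median should nearly coincide and about half of the points should sit above the mean. For the exponential sample the mean is pulled well above the median by the long right tail.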

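In general, you should normalize your data if you are going to use a machine learning or statistics technique that assumes the data is normally distributed.

Examples include linear discriminant analysis (LDA) and Gaussian naive Bayes.

Pro tip: any method with "Gaussian" in its name probably assumes normality.

The method we use here to normalize is called the Box-Cox Transformation.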

# normalize the exponential data with boxcox
# (stats.boxcox returns a tuple: the transformed data and the fitted lambda)
normalized_data = stats.boxcox(original_data)

# plot both together to compare
fig, ax = plt.subplots(1, 2)
sns.histplot(original_data, ax=ax[0], kde=True, legend=False)
ax[0].set_title("Original Data")
sns.histplot(normalized_data[0], ax=ax[1], kde=True, legend=False)
ax[1].set_title("Normalized data")

Text(0.5, 1.0, 'Normalized data')

[Figure: distribution of the original data (left) and the normalized data (right)]
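To see what the Box-Cox transformation does under the hood, here is a minimal sketch of its usual definition: (x^λ - 1) / λ for λ ≠ 0 and log(x) for λ = 0, defined only for strictly positive data. The helper `box_cox` is my own illustration. It reuses the λ that `stats.boxcox` fits rather than estimating one itself.

```python
import numpy as np
from scipy import stats

def box_cox(x, lmbda):
    """Apply the Box-Cox transform with a given lambda to strictly positive data."""
    x = np.asarray(x, dtype=float)
    if np.any(x <= 0):
        raise ValueError("Box-Cox requires strictly positive data")
    if lmbda == 0:
        return np.log(x)
    return (x ** lmbda - 1.0) / lmbda

rng = np.random.default_rng(0)
sample = rng.exponential(size=1000)  # illustrative data, not the post's original_data

# SciPy picks the lambda that makes the result look most normal
transformed, fitted_lambda = stats.boxcox(sample)

# Applying the formula by hand with that lambda should reproduce SciPy's output
manual = box_cox(sample, fitted_lambda)
print("fitted lambda:", fitted_lambda)
print("matches scipy:", np.allclose(manual, transformed))
```

The positivity requirement matters in practice: data containing zeros or negative values has to be shifted, or handled with a different transform, before Box-Cox can be applied.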