In this section we'll learn how to normalize and scale data.
# modules we'll use
import pandas as pd
import numpy as np
# for Box-Cox Transformation
from scipy import stats
# for min_max scaling
from mlxtend.preprocessing import minmax_scaling
# plotting modules
import seaborn as sns
import matplotlib.pyplot as plt
# set seed for reproducibility
np.random.seed(0)
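Scaling and normalization are easy to confuse.
The main difference is that scaling changes the range of your data, while normalization changes the shape of its distribution.
Let's look at each of them in more depth.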
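Scaling means transforming your data so that it fits within a specific range, such as 0-100 or 0-1.
You want to scale data when the methods you use are based on the distance between data points, such as support vector machines (SVM) or k-nearest neighbors (KNN).
With these algorithms, a change of 1 in any numeric feature is given the same importance.
For example, suppose a dataset has prices in both Yen and US Dollars. 1 US Dollar is worth about 100 Yen, but if we don't scale the data, SVM or KNN will treat a difference of 1 US Dollar as being just as important as a difference of 1 Yen.
Scaling your variables puts them on an equal footing so that different variables can be compared.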
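To make the Yen and US Dollar point concrete, here is a minimal sketch that uses only NumPy. The price table and the euclidean helper are made up for illustration and are not part of the original notebook. It compares a 1 Yen gap with a 1 US Dollar gap, first on the raw values and then after min-max scaling each column.

```python
import numpy as np

# Hypothetical prices: column 0 is in Yen, column 1 is in US Dollars.
# The first two rows only set the range of each column.
prices = np.array([
    [0.0, 0.0],
    [10000.0, 100.0],
    [5000.0, 50.0],   # reference item
    [5001.0, 50.0],   # 1 Yen more than the reference
    [5000.0, 51.0],   # 1 US Dollar more than the reference
])

def euclidean(a, b):
    """Straight-line distance between two feature vectors."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

# On raw values, both gaps are a change of 1, so they look equally important.
raw_yen_gap = euclidean(prices[2], prices[3])
raw_usd_gap = euclidean(prices[2], prices[4])

# Min-max scale each column to the range 0-1.
col_min = prices.min(axis=0)
col_max = prices.max(axis=0)
scaled = (prices - col_min) / (col_max - col_min)

# After scaling, the 1 US Dollar gap is about 100 times the 1 Yen gap.
scaled_yen_gap = euclidean(scaled[2], scaled[3])
scaled_usd_gap = euclidean(scaled[2], scaled[4])

print("raw gaps:   ", raw_yen_gap, raw_usd_gap)
print("scaled gaps:", scaled_yen_gap, scaled_usd_gap)
```

The ratio of 100 appears here only because the made-up Yen column spans a range 100 times wider than the Dollar column. Real data will give a different ratio.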
# generate 1000 data points randomly drawn from an exponential distribution
original_data = np.random.exponential(size=1000)
# min-max scale the data between 0 and 1
scaled_data = minmax_scaling(original_data, columns=[0])
# plot both together to compare
fig, ax = plt.subplots(1,2)
sns.histplot(original_data, ax=ax[0], kde=True)
ax[0].set_title("Original Data")
sns.histplot(np.ravel(scaled_data), ax=ax[1], kde=True)
ax[1].set_title("Scaled data")
Text(0.5, 1.0, 'Scaled data')
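fig, ax = plt.subplots(1,2) lays out two plots side by side: one row, two columns.
minmax_scaling(original_data, columns=[0]) scales the data so that it lies between 0 and 1.
In the resulting plot we can see that the data, which originally ranged from about 0 to 8, now ranges from 0 to 1.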
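The rescaling itself is a short formula: subtract the minimum, then divide by the range. Below is a minimal NumPy sketch, assuming a hypothetical helper named min_max_scale. It is not mlxtend's implementation, only an illustration of the same idea, and it also covers a target range such as 0-100.

```python
import numpy as np

def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so they span [new_min, new_max]."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    if hi == lo:
        raise ValueError("cannot scale data whose values are all identical")
    unit = (values - lo) / (hi - lo)            # now in the range 0-1
    return unit * (new_max - new_min) + new_min

rng = np.random.default_rng(0)
sample = rng.exponential(size=1000)

scaled_01 = min_max_scale(sample)               # range 0-1
scaled_0_100 = min_max_scale(sample, 0.0, 100.0)  # range 0-100

print(sample.min(), sample.max())
print(scaled_01.min(), scaled_01.max())
print(scaled_0_100.min(), scaled_0_100.max())
```

Because the mapping is linear, the ordering of the points and the shape of the histogram stay the same. Only the axis values change.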
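Scaling only changes the range of your data. Normalization is a more radical transformation.
The point of normalization is to change your observations so that they can be described by a normal distribution.
The normal distribution, also known as the bell curve, is a specific statistical distribution in which roughly equal numbers of observations fall above and below the mean, the mean and the median are the same, and more observations lie close to the mean. It is also called the Gaussian distribution.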
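These properties can be checked numerically. The sketch below assumes a hypothetical summarize helper and randomly generated samples, neither of which comes from the original notebook. It compares a normal sample with an exponential one.

```python
import numpy as np

rng = np.random.default_rng(0)
normal_sample = rng.normal(loc=0.0, scale=1.0, size=100_000)
skewed_sample = rng.exponential(size=100_000)

def summarize(sample):
    """Return mean, median, share above the mean, share within one std of the mean."""
    mean = sample.mean()
    median = np.median(sample)
    share_above_mean = np.mean(sample > mean)
    share_near_mean = np.mean(np.abs(sample - mean) <= sample.std())
    return mean, median, share_above_mean, share_near_mean

normal_summary = summarize(normal_sample)
skewed_summary = summarize(skewed_sample)

# For the normal sample the mean and median nearly coincide and about half
# of the observations lie above the mean. The exponential sample does not
# share these properties.
print("normal:     ", normal_summary)
print("exponential:", skewed_summary)
```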
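In general, you'll want to normalize your data if you're going to use a machine learning or statistics technique that assumes the data is normally distributed.
Examples include linear discriminant analysis (LDA) and Gaussian naive Bayes.
Pro tip: any method with "Gaussian" in its name usually needs the data to be normalized.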
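One way to judge whether a sample already looks normal is a normality test. The original text does not use one. As an assumption, this sketch picks scipy.stats.normaltest and applies it to two randomly generated samples.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
skewed_sample = rng.exponential(size=1000)
bell_sample = rng.normal(size=1000)

# The null hypothesis of normaltest is that the sample comes from a normal
# distribution, so a very small p-value is evidence against normality.
_, p_skewed = stats.normaltest(skewed_sample)
_, p_bell = stats.normaltest(bell_sample)

print("exponential sample p-value:", p_skewed)
print("normal sample p-value:     ", p_bell)
```

A tiny p-value for the exponential sample suggests it would need normalizing before being passed to a method that assumes normality.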
The method we'll use to normalize here is called the Box-Cox Transformation.
# normalize the exponential data with boxcox
normalized_data = stats.boxcox(original_data)
# plot both together to compare
fig, ax = plt.subplots(1,2)
sns.histplot(original_data, ax=ax[0], kde=True)
ax[0].set_title("Original Data")
sns.histplot(normalized_data[0], ax=ax[1], kde=True)
ax[1].set_title("Normalized data")
Text(0.5, 1.0, 'Normalized data')
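The cell above indexes normalized_data[0] because stats.boxcox, when called without a fixed lambda, returns both the transformed data and the lambda it estimated. The sketch below writes out the Box-Cox formula by hand in a hypothetical helper named box_cox, which is for illustration only, and checks it against SciPy's result on a fresh exponential sample.

```python
import numpy as np
from scipy import stats

def box_cox(values, lam):
    """Box-Cox transform: (x**lam - 1) / lam, or log(x) when lam is 0."""
    values = np.asarray(values, dtype=float)
    if np.any(values <= 0):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return np.log(values)
    return (values ** lam - 1.0) / lam

rng = np.random.default_rng(0)
sample = rng.exponential(size=1000)

# SciPy returns a pair: the transformed data and the fitted lambda.
transformed, fitted_lambda = stats.boxcox(sample)

# Applying the formula by hand with the same lambda should give the same values.
manual = box_cox(sample, fitted_lambda)

# Skewness near 0 indicates a more symmetric, bell-like shape.
skew_before = stats.skew(sample)
skew_after = stats.skew(transformed)

print("fitted lambda:", fitted_lambda)
print("skewness before and after:", skew_before, skew_after)
```

Box-Cox is only defined for strictly positive data, which holds for the exponential sample used here.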