DAY 13

## [Improving Data Quality] Part-3 Normalizing and Standardizing Data_Z-score normalization

z = (x - μ) / σ

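• z is the value after normalization
• x is the value before normalization
• μ is the arithmetic mean of the batch of data
• σ is the standard deviation of the batch of data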
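To make the formula concrete, here is a small worked example using only the Python standard library. The eight input values are made up for illustration and are not taken from the dataset used below; the population standard deviation (`pstdev`) is used here as an assumption, since the formula does not say which variant of σ is meant.

```python
from statistics import mean, pstdev

# Made-up sample values (illustration only)
values = [2, 4, 4, 4, 5, 5, 7, 9]

mu = mean(values)       # arithmetic mean: 5
sigma = pstdev(values)  # population standard deviation: 2.0

# Apply z = (x - mu) / sigma to every value
z_scores = [(x - mu) / sigma for x in values]

print(z_scores)  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

A z-score of 2.0 for the value 9 means it lies two standard deviations above the mean.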

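``````
import pandas as pd

# Data type of each column
column_types = {'PassengerId': 'category',
                'Survived': int,
                'Pclass': int,
                'Name': 'category',
                'Sex': 'category',
                'Age': float,
                'SibSp': int,
                'Parch': int,
                'Fare': float,
                'Cabin': 'category',
                'Embarked': 'category'}

# Training set
# Assumption: train_set was loaded beforehand (the loading step is not shown),
# for example with pd.read_csv(<path>, dtype=column_types).
train_set = train_set.drop(['Cabin'], axis=1)  # drop the Cabin column
train_set = train_set.dropna()                 # drop rows with missing values
train_set = pd.get_dummies(train_set)          # one-hot encode categorical columns
``````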

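``````
# Assumption: fare_data_sample holds the Fare column, e.g. train_set['Fare']
# Arithmetic mean of all Fare values
mu = fare_data_sample.mean()
# Standard deviation (pandas defaults to the sample standard deviation, ddof=1)
std = fare_data_sample.std()
# Result after standardization
z_score_normalized = (fare_data_sample - mu) / std

print(z_score_normalized)
``````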
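A quick sanity check for this step: after Z-score normalization the data should have a mean of about 0 and a standard deviation of about 1. The sketch below assumes a small made-up series, `fare_toy`, standing in for the real Fare column; the values are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the Fare column; these values are made up.
fare_toy = pd.Series([7.25, 71.28, 7.92, 53.10, 8.05], name="Fare")

mu = fare_toy.mean()
std = fare_toy.std()  # sample standard deviation (ddof=1), the pandas default
z = (fare_toy - mu) / std

# After standardization the mean is close to 0 and the
# sample standard deviation is close to 1 (up to floating-point error).
print(z.mean(), z.std())
```

The check only holds when the same kind of standard deviation is used for scaling and for verification (here both use `ddof=1`).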

### Using the Scikit-learn API

``````
# z-score function - pandas version
def z_score_normalization(df, cols):
    """Normalize the specified columns of a dataframe.

    Keyword arguments:
    df -- the input dataframe (pandas.DataFrame)
    cols -- the specified columns to be normalized (list)

    """
    # Work on a copy so the input dataframe is not modified
    df_normalized = df.copy()
    for col in cols:
        col_data = df_normalized[col]
        mu = col_data.mean()
        std = col_data.std()  # sample standard deviation (ddof=1)

        df_normalized[col] = (col_data - mu) / std
    return df_normalized

normalized = z_score_normalization(train_set, train_set.columns)
``````

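``````
# z-score function - sklearn version
from sklearn.preprocessing import StandardScaler

scale = StandardScaler()  # z-score scaler object
# fit_transform returns a NumPy array, so wrap it back into a DataFrame
train_set_scaled = pd.DataFrame(scale.fit_transform(train_set),
                                columns=train_set.columns,
                                index=train_set.index)
``````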

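The pandas version and the sklearn version produce nearly the same result, but not an identical one: pandas `std()` defaults to the sample standard deviation (`ddof=1`), while `StandardScaler` divides by the population standard deviation (`ddof=0`), so the two outputs differ by a constant factor of sqrt(n / (n - 1)) unless `std(ddof=0)` is used in the pandas version. The practical difference is that the sklearn API lets you do Z-score normalization in a couple of lines.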
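The comparison between the two versions can be checked on a small example. The sketch below uses a made-up single-column dataframe (the `fare` values are hypothetical, not the real dataset) and compares a pandas computation with `StandardScaler`, once with the pandas default `ddof=1` and once with `ddof=0`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for a numeric column such as Fare.
df = pd.DataFrame({"fare": [7.25, 71.28, 7.92, 53.10, 8.05]})
n = len(df)

# pandas default: sample standard deviation (ddof=1)
z_pandas = (df["fare"] - df["fare"].mean()) / df["fare"].std()

# pandas with the population standard deviation (ddof=0)
z_pandas_pop = (df["fare"] - df["fare"].mean()) / df["fare"].std(ddof=0)

# scikit-learn: StandardScaler scales by the population standard deviation
z_sklearn = StandardScaler().fit_transform(df[["fare"]]).ravel()

matches_default = np.allclose(z_pandas, z_sklearn)    # ddof=1 vs sklearn
matches_ddof0 = np.allclose(z_pandas_pop, z_sklearn)  # ddof=0 vs sklearn
# The ddof=1 result differs from sklearn by the factor sqrt(n / (n - 1))
matches_rescaled = np.allclose(z_pandas * np.sqrt(n / (n - 1)), z_sklearn)

print(matches_default, matches_ddof0, matches_rescaled)  # False True True
```

For a dataset with many rows the factor sqrt(n / (n - 1)) is very close to 1, so the gap is small in practice, but it is worth knowing about when comparing outputs value by value.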