Day 16: Numerical Data 2/2, Reducing Skewness

11th iThome Ironman Contest

DAY 16

AI & Data

Hands on Data Cleaning and Scraping, part 16 of the series

11th Ironman Contest, missing value, outlier

kyt

2019-09-17 07:03:45

In the Day 14 article we discussed how to read the skewness of data. When a large share of the data are outliers, or the mean does not represent the data well, consider one of the following transformations to reduce skewness:

Log transform (np.log1p) - adds 1 and then takes the natural log; to restore the original values, take the exponential and then subtract 1 (np.expm1). Use it when the data may contain zeros.
Square-root transform (np.sqrt) - subtracts the minimum value and then takes the square root. Suitable when the data has a finite maximum, for example when converting test scores.
Box-Cox transform (scipy.stats.boxcox) - here the lmbda parameter is kept between 0 and 0.5, and every value must be strictly positive before transforming. lmbda=0 is the plain natural log (log, not log1p), and lmbda=0.5 is the square root up to a shift and scale, (sqrt(x) - 1) / 0.5.
The goal of reducing skewness is to bring the data closer to a normal distribution: more symmetric, with a mean that better represents the data.
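The relationships in the list above can be checked numerically. The following is a minimal sketch on a small made-up array (the values are an assumption for illustration, not taken from the Titanic data); it confirms that np.expm1 undoes np.log1p, and that Box-Cox with lmbda=0 and lmbda=0.5 match the log and the rescaled square root.

```python
import numpy as np
from scipy import stats

# Hypothetical right-skewed sample that includes a zero (not from the Titanic data)
x = np.array([0.0, 1.0, 7.25, 30.0, 512.0])

# log1p accepts zeros, and expm1 is its inverse
logged = np.log1p(x)
restored = np.expm1(logged)
roundtrip_ok = np.allclose(restored, x)

# Box-Cox requires strictly positive input, so shift by 1 first
shifted = x + 1
bc_zero = stats.boxcox(shifted, lmbda=0)    # equals log(shifted)
bc_half = stats.boxcox(shifted, lmbda=0.5)  # equals (sqrt(shifted) - 1) / 0.5

lambda0_is_log = np.allclose(bc_zero, np.log(shifted))
lambda_half_is_sqrt = np.allclose(bc_half, (np.sqrt(shifted) - 1) / 0.5)

print(roundtrip_ok, lambda0_is_log, lambda_half_is_sqrt)
```

Because the lmbda=0.5 result differs from a plain square root only by a constant shift and scale, both produce a distribution with the same shape.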

We will use the training data from the Kaggle competition Titanic: Machine Learning from Disaster as the example dataset.

import pandas as pd
import numpy as np
import copy

df = pd.read_csv('data/train.csv')  # read in the file
df.head()  # show the first five rows

# collect the names of the int64 and float64 columns in num_features
num_features = []
for dtype, feature in zip(df.dtypes, df.columns):
    if dtype == 'float64' or dtype == 'int64':
        num_features.append(feature)
print(f'{len(num_features)} Numeric Features : {num_features}\n')

# drop the text columns and keep only the numeric ones
df = df[num_features]
# fill missing values with 0
df = df.fillna(0)
df.head()

# plot the distribution of Fare (histogram with a KDE curve)
import seaborn as sns
import matplotlib.pyplot as plt
sns.histplot(df['Fare'], kde=True)
plt.show()

# apply log1p to Fare and plot the distribution again
df_fixed = copy.deepcopy(df)

df_fixed['Fare'] = np.log1p(df_fixed['Fare'])
sns.histplot(df_fixed['Fare'], kde=True)
plt.show()
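Besides comparing the plots by eye, the effect of log1p can be quantified with the sample skewness. This is a minimal sketch that assumes a synthetic log-normal column as a stand-in for a right-skewed feature such as Fare (the name fare_like and the distribution parameters are hypothetical); pandas' Series.skew() returns a value near 0 for a symmetric distribution.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical stand-in for a right-skewed, non-negative column such as Fare
fare_like = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=1000))

skew_before = fare_like.skew()           # sample skewness of the raw values
skew_after = np.log1p(fare_like).skew()  # sample skewness after log1p

print(f'skewness before log1p: {skew_before:.3f}')
print(f'skewness after log1p : {skew_after:.3f}')
```

On the real data, the same two calls could be applied to df['Fare'] and df_fixed['Fare'].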

# apply Box-Cox to Fare and plot the distribution again
from scipy import stats
df_fixed = copy.deepcopy(df)

# Box-Cox needs strictly positive input and Fare contains zeros, so shift by 1 first
df_fixed['Fare'] = df_fixed['Fare'] + 1
df_fixed['Fare'] = stats.boxcox(df_fixed['Fare'], lmbda=0.3)

sns.histplot(df_fixed['Fare'], kde=True)
plt.show()
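The square-root transform described at the top of the article is not demonstrated on Fare, so here is a minimal sketch of it. It assumes synthetic, right-skewed scores bounded between 0 and 100 (the scores series and its Beta parameters are hypothetical); the minimum is subtracted first so the smallest value maps to 0, then the square root is taken.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Hypothetical bounded, right-skewed scores in [0, 100]
scores = pd.Series(100 * rng.beta(a=1.5, b=6.0, size=1000))

shifted = scores - scores.min()  # smallest value becomes 0
rooted = np.sqrt(shifted)        # square root compresses the long right tail

print(f'skewness before sqrt: {scores.skew():.3f}')
print(f'skewness after sqrt : {rooted.skew():.3f}')
```

The square root is a milder correction than the log, which is one reason it tends to suit bounded data such as scores.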