Day 1. Feature Engineering - 1. Missing Data Imputation

12th iThome Ironman Contest

DAY 1

AI & Data

Machine Learning series, part 1

tjabi

2020-09-01 23:45:37

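Feature engineering is an important task that must be done before starting a machine learning analysis.
Feature engineering applies mathematics, statistics, and domain knowledge of the data to transform existing features, or to create new variables from existing features, so that a machine learning model can read the raw data.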

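Feeding features to a machine learning model in the right format gives better results and has the following advantages:

Lets us use simpler models and still get better results
Simpler models are more transparent, so it is easier to understand how the model makes its predictions
Reduces the need for ensemble learning
Reduces the need for hyperparameter optimization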

The main feature engineering techniques include:

Missing Data Imputation
Categorical Encoding
Variable Transformation
Discretisation
Outlier Engineering
Feature Scaling
Date and Time Engineering
Feature Creation
Aggregating Transaction Data

1. Missing Data Imputation
Real-world data usually has missing values, so data imputation is needed to produce a complete dataset that a machine learning model can use.

Two types of imputation:
Numerical Imputation
Categorical Imputation

Missing data imputation techniques:

Complete Case Analysis
Mean / Median / Mode Imputation
Random Sample Imputation
Replacement by Arbitrary Value
End of Distribution Imputation

1.1 Complete Case Analysis
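Analyze only the observations that have values for every variable, that is, remove every row that contains missing data. When the amount of missing information is small, this is an acceptable approach.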

import pandas as pd
df = pd.read_csv('../input/house-prices-advanced-regression-techniques/train.csv')
df = df[['LotFrontage', 'MasVnrArea', 'GarageYrBlt']]
# Fraction of missing values in each column
df.isnull().mean()

LotFrontage	0.177397
MasVnrArea	0.005479
GarageYrBlt	0.055479
dtype: float64

Use dropna() with axis=0 to remove every row that contains missing data.

df_cc = df.dropna(axis=0)
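
Before committing to complete case analysis it helps to check how many rows survive the drop. The following is a minimal sketch on a hypothetical toy DataFrame (the values are made up and only borrow the three column names used above); it is not the real house-prices data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the three house-price columns.
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, 60.0, np.nan],
    "MasVnrArea": [196.0, 0.0, 162.0, np.nan, 350.0],
    "GarageYrBlt": [2003.0, 1976.0, 2001.0, 1998.0, 2000.0],
})

# Share of missing values per column.
missing_share = toy.isnull().mean()

# Complete case analysis: keep only rows with no missing value in any column.
toy_cc = toy.dropna(axis=0)

# Fraction of rows that survive the drop.
kept_fraction = len(toy_cc) / len(toy)
print(missing_share)
print(f"rows kept: {len(toy_cc)} of {len(toy)}")
```

Note that the per-column missing shares can each look small while the share of rows lost is much larger, because a row is dropped if any one of its columns is missing.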

1.2 Mean / Median / Mode Imputation
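Impute missing data with the feature's mean, median, or mode. This method is suitable when data is missing at random and the overall amount of missing data is small. If a lot of data is missing, it distorts the variable's distribution and its relationships with other variables. A distorted distribution affects the results of linear models.
If the variable follows a Gaussian (normal) distribution, impute with the mean; if it is skewed, impute with the median.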

median = df['LotFrontage'].median()
df.loc[:, 'LotFrontage_median'] = df['LotFrontage'].fillna(median)
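
The effect of heavy median imputation on a variable's distribution can be illustrated on synthetic data. The sketch below is an assumption-laden illustration: the lognormal variable, the 40% missing rate, and all variable names are invented for the example and do not come from the house-prices dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed variable (not the real LotFrontage column).
values = pd.Series(rng.lognormal(mean=4.0, sigma=0.5, size=1000))

# Blank out roughly 40% of the values completely at random.
mask = rng.random(1000) < 0.4
with_gaps = values.mask(mask)

# Median imputation, as in the snippet above.
imputed = with_gaps.fillna(with_gaps.median())

# Standard deviation before and after; std() skips NaN by default.
std_observed = with_gaps.std()
std_imputed = imputed.std()
print(f"std of observed values: {std_observed:.2f}")
print(f"std after imputation:   {std_imputed:.2f}")
```

Because every gap receives the same central value, the spread of the imputed variable shrinks, and the shrinkage grows with the share of missing data.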

For categorical variables, imputing with the mode is also known as imputing with the most frequent category.
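
Mode imputation of a categorical variable could look like the following minimal sketch. The `GarageType` series and its values are hypothetical and only serve as an illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with gaps.
garage = pd.Series(
    ["Attchd", "Detchd", np.nan, "Attchd", np.nan, "BuiltIn"],
    name="GarageType",
)

# mode() ignores NaN and returns a Series (ties are possible),
# so take the first entry.
most_frequent = garage.mode().iloc[0]

# Fill every gap with the most frequent category.
garage_mode = garage.fillna(most_frequent)
print(garage_mode.tolist())
```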

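1.3 Random Sample Imputation
Impute missing data with values selected at random from the variable's observed values. This method preserves the variable's distribution and is suitable when data is missing at random.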

df = pd.read_csv('../input/titanic/train.csv',
                 usecols=['Age', 'Fare', 'Survived'])

df['Age_random'] = df['Age']
# Draw as many observed ages as there are missing ones
random_sample = df['Age'].dropna().sample(df['Age'].isnull().sum(), random_state=0)
# Align the drawn values with the rows that are missing Age
random_sample.index = df[df['Age'].isnull()].index
df.loc[df['Age'].isnull(), 'Age_random'] = random_sample
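
The same steps can be wrapped in a reusable helper. This is only a sketch: `random_sample_impute` is a hypothetical name, the toy ages are made up, and unlike the snippet above it samples with replacement (an assumption, chosen so that it still works when there are more gaps than observed values).

```python
import numpy as np
import pandas as pd

def random_sample_impute(series: pd.Series, seed: int = 0) -> pd.Series:
    """Fill NaN with values drawn at random from the observed values."""
    result = series.copy()
    missing = result.isnull()
    n_missing = int(missing.sum())
    if n_missing == 0:
        return result
    # Sample with replacement from the observed (non-missing) values.
    draws = result.dropna().sample(n_missing, replace=True, random_state=seed)
    # Assign by position so the index labels of the draws do not matter.
    result.loc[missing] = draws.to_numpy()
    return result

# Made-up ages with three gaps.
age = pd.Series([22.0, np.nan, 38.0, 26.0, np.nan, 35.0, 54.0, np.nan])
age_random = random_sample_impute(age)
print(age_random.tolist())
```

Fixing the seed keeps the imputation reproducible; the helper does not handle a series that is entirely missing.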

1.4 Replacement by Arbitrary Value
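Impute missing data with an arbitrary value, using the same arbitrary value for every missing entry of a given variable. This method is suitable when data is not missing at random and the overall amount of missing data is large. If all values of the variable are positive, a typical choice is -1, or alternatively 999 or -999, that is, a value that does not normally occur in the variable. This method is not suitable for linear models, because it distorts the variable's distribution and so violates the model assumptions.
For categorical variables, this method amounts to labeling the missing data as "Missing".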

df['Age_99'] = df['Age'].fillna(99)
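
Both variants, a numerical constant outside the observed range and a "Missing" label for a categorical variable, can be sketched on a small made-up DataFrame. The `people` frame and its column values are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up data with one numerical and one categorical column.
people = pd.DataFrame({
    "Age": [22.0, np.nan, 38.0, np.nan],
    "Cabin": ["C85", np.nan, "E46", np.nan],
})

# Numerical: a constant outside the observed range (all ages are positive).
people["Age_arbitrary"] = people["Age"].fillna(-1)

# Categorical: an explicit "Missing" category.
people["Cabin_missing"] = people["Cabin"].fillna("Missing")
print(people)
```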

1.5 End of Distribution Imputation
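Impute missing data with a value at the far end of the variable's distribution; it is essentially similar to imputing with an arbitrary value. This method is suitable for tree-based algorithms, but it affects the results of linear models because it distorts the variable's distribution.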

df['Age_imputed'] = df['Age'].fillna(df['Age'].mean() + 3 * df['Age'].std())
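
The mean plus three standard deviations rule can be factored into a small helper so the fill value is computed once and can be inspected. This is a sketch on synthetic, roughly normal data; `end_of_distribution_value` and the `n_std` parameter are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd

def end_of_distribution_value(series: pd.Series, n_std: float = 3.0) -> float:
    """Right-tail value: mean + n_std * std of the observed values."""
    return float(series.mean() + n_std * series.std())

rng = np.random.default_rng(0)

# Synthetic, roughly normal ages (not the Titanic data).
age = pd.Series(rng.normal(loc=30.0, scale=10.0, size=500))
age.iloc[::10] = np.nan  # blank out every tenth value

# Compute the tail value from the observed data, then fill the gaps with it.
fill_value = end_of_distribution_value(age)
age_imputed = age.fillna(fill_value)
print(f"fill value: {fill_value:.2f}")
```

All imputed rows end up clustered at a single point in the right tail, which a tree-based model can split off but which pulls the mean and variance of the variable upward.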
