Day15 Numerical Data 1/2: Replacing N/A and Outliers

11th iThome Ironman Contest

DAY 15

AI & Data

Hands on Data Cleaning and Scraping series, part 15

11th Ironman Contest, numerical data, outlier

kyt

2019-09-16 10:46:11

In the Day04 article we introduced several common values that can be used to fill in N/As or replace outliers. Today we put them into practice, using the training data from the Kaggle competition Titanic: Machine Learning from Disaster.

import pandas as pd
import numpy as np
import copy

df = pd.read_csv('data/train.csv') # read in the file
df.head() # show the first five rows

# collect the names of the columns whose dtype is int64 or float64 into num_features
num_features = []
for dtype, feature in zip(df.dtypes, df.columns):
    if dtype == 'float64' or dtype == 'int64':
        num_features.append(feature)
print(f'{len(num_features)} Numeric Features : {num_features}')

# drop the text columns and keep only the numeric ones
df = df[num_features]
df.head()
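
The dtype loop above can also be written with pandas' `select_dtypes`. The following is a minimal sketch on a small made-up frame (the `toy` data is an assumption for illustration, not the Titanic file). Note that `include="number"` is slightly broader than the loop, since it matches every numeric dtype rather than only int64 and float64.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train.csv (values are made up for illustration)
toy = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Age": [22.0, np.nan, 35.0],
    "Pclass": [3, 1, 2],
})

# Keep only the numeric columns; "number" covers all integer and float dtypes
num_cols = toy.select_dtypes(include="number").columns.tolist()
toy_num = toy[num_cols]

print(f"{len(num_cols)} Numeric Features : {num_cols}")
```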

# count the missing values (N/As) in each column, largest first
df.isnull().sum().sort_values(ascending=False)
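
To make the missing-value check concrete, here is a small sketch on made-up data (the `toy` frame and its values are assumptions for illustration). It shows the per-column count used above and, as an optional extra, the fraction of rows that are missing.

```python
import numpy as np
import pandas as pd

# Toy numeric frame with a few gaps (made-up values)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Fare": [7.25, 71.28, np.nan, 8.05],
    "Pclass": [3, 1, 2, 3],
})

# isnull() gives a boolean frame; summing counts the True values per column
na_counts = toy.isnull().sum().sort_values(ascending=False)

# The mean of the boolean frame is the share of missing rows per column
na_ratio = toy.isnull().mean()

print(na_counts)
print(na_ratio)
```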

Fill missing values with the mean

df_mn = df.fillna(df.mean())
df_mn['Age']
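
As a small illustration of what `fillna(df.mean())` does, the sketch below uses a made-up frame (the `toy` values are assumptions, not Titanic data). `mean()` returns one value per column and skips N/As, and `fillna` then fills each column with its own mean, returning a new frame and leaving the original untouched.

```python
import numpy as np
import pandas as pd

# Toy frame with one gap per column (made-up values)
toy = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0],
    "Fare": [10.0, 30.0, np.nan],
})

# Column means ignore N/As: Age -> 30.0, Fare -> 20.0
col_means = toy.mean()

# Each column's gaps are filled with that column's own mean
toy_mn = toy.fillna(col_means)

print(toy_mn)
```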

Fill missing values with the median

df_md = df.fillna(df.median())
df_md['Age']
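
One reason to consider the median over the mean is that a single extreme value pulls the mean much more than the median. The sketch below shows this on a made-up `Age` column (the values are assumptions for illustration only).

```python
import numpy as np
import pandas as pd

# Toy column with one extreme value (80) and one gap (made-up values)
toy = pd.DataFrame({"Age": [20.0, 22.0, 24.0, 80.0, np.nan]})

mean_age = toy["Age"].mean()      # pulled upward by the extreme value
median_age = toy["Age"].median()  # middle of the observed values

toy_mn = toy.fillna(toy.mean())    # gap filled with the mean
toy_md = toy.fillna(toy.median())  # gap filled with the median

print(f"mean fill: {toy_mn['Age'].iloc[-1]}, median fill: {toy_md['Age'].iloc[-1]}")
```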

The code for this article is available on GitHub.

Please let me know if there are any mistakes in this article. Thanks for reading.

References:

[1] Course materials of the 2nd 100-Day Machine Learning Marathon

[2] Titanic: Machine Learning from Disaster