[Day 11] Dataset cleaning and additional materials and links

11th iThome Ironman Contest

DAY 11

AI & Data

Part 11 of the series "Learning from top Kagglers how to win data analysis competitions"

Tags: 11th Ironman Contest, kaggle

madeleine

2019-09-12 23:18:31

Dataset cleaning

Constant features
Duplicated features

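Constant features (features with only one value, e.g., when encode = 'onehot' and certain bins do not contain any data)

Declutter and drop useless features. If a feature has exactly the same value across the train set and the test set, it only takes up memory and does not help the model, so the best option is to remove it. For example, column (feature) f0 has the same value in both the train set and the test set.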

Screenshot from Coursera

traintest.nunique(axis=0) == 1
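As a minimal sketch of the constant-feature check, the snippet below counts distinct values per column and drops the columns that have only one. The toy `traintest` frame and the names `n_unique`, `constant_cols`, and `traintest_clean` are made up for illustration; in practice `traintest` would be the concatenated train and test sets.

```python
import pandas as pd

# Hypothetical toy data standing in for the concatenated train + test set.
traintest = pd.DataFrame({
    "f0": [13, 13, 13, 13],        # constant: same value in every row
    "f1": [0.1, 0.7, 0.3, 0.9],
    "f2": ["A", "B", "A", "C"],
})

# Count distinct values per column (axis=0 works column-wise).
# dropna=False treats NaN as a value of its own, so a column that is
# "one value or missing" is not reported as constant.
n_unique = traintest.nunique(axis=0, dropna=False)
constant_cols = n_unique[n_unique == 1].index.tolist()

# Drop the constant columns; the original frame is left untouched.
traintest_clean = traintest.drop(columns=constant_cols)
print(constant_cols)
```

Whether NaN should count as a separate value is a judgment call; with the default `dropna=True`, a column holding a single value plus missing entries would also be flagged.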

Duplicated features

Continuing with the figure above, f2 and f3 are two completely identical features. As before, the best option is to remove one of them.

traintest.T.drop_duplicates()
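A small sketch of the same idea that also reports which columns are the duplicates, so they can be dropped by name. The data and the names `dup_mask`, `duplicate_cols`, and `traintest_clean` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data: f3 is an exact copy of f2.
traintest = pd.DataFrame({
    "f1": [1, 2, 3, 4],
    "f2": [0.5, 0.1, 0.5, 0.9],
    "f3": [0.5, 0.1, 0.5, 0.9],
})

# Transpose so each feature becomes a row, then flag rows that repeat
# an earlier row. keep="first" leaves the first copy unflagged.
dup_mask = traintest.T.duplicated(keep="first")
duplicate_cols = dup_mask[dup_mask].index.tolist()

# Drop only the later copies.
traintest_clean = traintest.drop(columns=duplicate_cols)
print(duplicate_cols)
```

Transposing a large frame with mixed dtypes can be slow and memory hungry, so on big datasets it may be worth running this on a sample of rows first and confirming the candidates on the full data.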

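Continuing with the figure above, f4 and f5 are duplicated categorical features: if we rename the levels of f5 (C becomes A, A becomes B, B becomes C), it becomes identical to f4. How do we find such pairs? We can use label encoding. The approach is to encode both columns, with f4 encoded top down and f5 bottom up.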

value	f4 code	f5 code
A	1	3
B	2	2
C	3	1

for f in categorical_feats:
    # factorize() returns (codes, uniques); keep only the integer codes
    traintest[f] = traintest[f].factorize()[0]
        
traintest.T.drop_duplicates()
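Here is a self-contained sketch of finding categorical features that are identical up to a renaming of levels. It relies on `factorize()` assigning integer codes in order of first appearance, so two columns that differ only in their labels end up with the same codes. The toy data follows the f4/f5 example (f5's C, A, B correspond to f4's A, B, C); `f6`, `encoded`, and `renamed_duplicates` are hypothetical names.

```python
import pandas as pd

# Hypothetical toy data: f5 is f4 with its levels renamed
# (f5's C -> A, A -> B, B -> C gives f4); f6 is unrelated.
traintest = pd.DataFrame({
    "f4": ["A", "B", "C", "A", "B"],
    "f5": ["C", "A", "B", "C", "A"],
    "f6": ["x", "x", "y", "y", "x"],
})
categorical_feats = ["f4", "f5", "f6"]

# Encode on a copy so the original labels are preserved.
encoded = traintest.copy()
for f in categorical_feats:
    # factorize() returns (codes, uniques); codes are assigned in
    # order of first appearance, which makes renamed columns match.
    encoded[f] = encoded[f].factorize()[0]

# After encoding, renamed duplicates are plain duplicated columns.
dup_mask = encoded[categorical_feats].T.duplicated(keep="first")
renamed_duplicates = dup_mask[dup_mask].index.tolist()
print(renamed_duplicates)
```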

Other things to check

Duplicated rows
Check if the dataset is shuffled
The blue line is the mean we use for observation. The plot shows that the end differs sharply from the beginning, which may be an important clue.

Screenshot from Coursera
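Both checks can be sketched in a few lines. The snippet below is an illustration under assumptions, not the course's code: the synthetic `train` frame, the column names, and the window size of 100 are all made up. It flags duplicated rows, then smooths a column over the row index with a rolling mean and compares the start with the end.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "target" drifts upward with row order (so the rows
# are clearly not shuffled), and the first row is appended again.
rng = np.random.default_rng(0)
n = 1000
train = pd.DataFrame({
    "f1": rng.integers(0, 5, size=n),
    "target": np.linspace(0.0, 1.0, n) + rng.normal(0.0, 0.1, size=n),
})
train = pd.concat([train, train.iloc[[0]]], ignore_index=True)

# Duplicated rows: keep=False flags every member of a duplicate group.
dup_rows = train[train.duplicated(keep=False)]

# Shuffle check: a rolling mean over the row index. For shuffled data it
# should stay roughly flat; a trend hints at a meaningful row order.
rolling_mean = train["target"].rolling(window=100, min_periods=1).mean()
head_mean = rolling_mean.iloc[:100].mean()
tail_mean = rolling_mean.iloc[-100:].mean()
print(len(dup_rows), head_mean < tail_mean)
```

Plotting `rolling_mean` against the row index, for example with `rolling_mean.plot()` when matplotlib is installed, gives a picture similar in spirit to the blue line described above.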

Visualization tools

Seaborn https://seaborn.pydata.org/
Plotly https://plot.ly/python/
Bokeh https://github.com/bokeh/bokeh
ggplot http://ggplot.yhathq.com/
Graph visualization with NetworkX https://networkx.github.io/
Others

Biclustering algorithms for sorting corrplots http://scikit-learn.org/stable/auto_examples/bicluster/plot_spectral_biclustering.html
