Day13 Converting Continuous Variables into Discrete Values

11th iThome Ironman Contest

DAY 13

AI & Data

Hands on Data Cleaning and Scraping series, part 13

11th Ironman Contest, discretization, discrete, continuous features

kyt

2019-09-14 06:44:34


Why convert continuous variables into discrete values?

Discretization bins many continuous values into a smaller number of groups. The main reasons to do it are:

Simplify the model - binning reduces the number of distinct values a variable can take, which can speed up training iterations and makes the data cheaper to store.
Improve robustness - binning reduces the influence of extreme values and outliers on the analysis. ex: if age above 50 is encoded as 1 and everything else as 0, an erroneous age of 200 simply becomes 1 and interferes far less than it would in the raw data.
Fit model requirements - some models, such as certain decision tree and naive Bayes implementations, expect discrete features.
Introduce non-linearity - a binned feature lets the model treat each value range separately, which gives it more expressive power.
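The robustness point can be illustrated with a small sketch. The sample values below are made up for illustration, and the 120-year cutoff used to filter the bad record is an arbitrary assumption.

```python
import pandas as pd

# Hypothetical sample: five plausible ages plus one bad record (200).
ages = pd.Series([23, 35, 47, 52, 68, 200])

# On the raw values, the single outlier drags the mean far upward.
mean_with_outlier = ages.mean()
mean_without_outlier = ages[ages <= 120].mean()  # 120 is an arbitrary sanity cutoff

# Binarized feature: 1 if the age is above 50, otherwise 0.
# The outlier becomes an ordinary 1 and no longer dominates.
over_50 = (ages > 50).astype(int)

print(mean_with_outlier, mean_without_outlier)
print(over_50.tolist())
```

After binarization the record with age 200 is indistinguishable from the records with ages 52 and 68, which is exactly why it stops distorting the analysis.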

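Main ways to convert:

Equal-width binning - split the data into bins of the same width, so every bin spans an equal interval. ex: one bin for every 10 years of age.
Equal-frequency binning - split the data into bins that each hold the same number of observations. ex: 10 bins of equal size.
Clustering-based binning - run a clustering algorithm on the data and use the resulting clusters as bins.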
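Clustering-based binning is not demonstrated in the examples of this article, so here is a minimal sketch of one way it could be done. It uses a hand-written one-dimensional k-means (Lloyd's algorithm) rather than any particular library routine; the helper name `kmeans_bin_edges`, the quantile-based initialization, and the choice of 3 bins are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def kmeans_bin_edges(values, n_bins, n_iter=100):
    """Hypothetical helper: bin edges from a simple 1-D k-means."""
    x = np.sort(np.asarray(values, dtype=float))
    # Deterministic start: spread the initial centers over the quantiles.
    centers = np.quantile(x, (np.arange(n_bins) + 0.5) / n_bins)
    for _ in range(n_iter):
        # Assign every point to its nearest center.
        labels = np.abs(x[:, None] - centers[None, :]).argmin(axis=1)
        # Move each center to the mean of its points (keep it if the cluster is empty).
        new_centers = np.array([
            x[labels == k].mean() if np.any(labels == k) else centers[k]
            for k in range(n_bins)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    centers = np.sort(centers)
    # Bin boundaries sit halfway between neighbouring centers.
    inner = (centers[:-1] + centers[1:]) / 2
    return np.concatenate(([x[0]], inner, [x[-1]]))

ages = pd.Series([18, 22, 25, 27, 7, 21, 23, 37, 30, 61,
                  45, 13, 11, 5, 2, 41, 9, 18, 80, 100])

edges = kmeans_bin_edges(ages, n_bins=3)
# include_lowest=True keeps the minimum value inside the first bin.
clustered = pd.cut(ages, bins=edges, include_lowest=True)
print(clustered.value_counts().sort_index())
```

Unlike equal-width or equal-frequency binning, the boundaries here follow where the values actually concentrate, so the bins can differ in both width and size.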

Example

'(' means the endpoint is excluded; ']' means it is included.
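A short sketch of how the interval notation plays out in `pd.cut`, using two made-up values that sit exactly on bin edges. The `right` parameter controls which side of each bin is closed.

```python
import pandas as pd

values = pd.Series([10, 20])

# Default right=True: bins are (0, 10] and (10, 20],
# so 10 belongs to the first bin and 20 to the second.
right_closed = pd.cut(values, bins=[0, 10, 20])

# right=False: bins are [0, 10) and [10, 20),
# so 10 moves to the second bin and 20 falls outside every bin (NaN).
left_closed = pd.cut(values, bins=[0, 10, 20], right=False)

print(right_closed)
print(left_closed)
```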

# import packages
import pandas as pd

# create some sample data
ages = pd.DataFrame({"age": [18, 22, 25, 27, 7, 21, 23, 37, 30, 61, 45, 13, 11, 5, 2, 41, 9, 18, 80, 100]})

Equal-width binning

# create a new column that splits age into 5 bins of equal width
ages["equal_width"] = pd.cut(ages["age"], 5)
print(ages["equal_width"])

# count how many rows fall into each equal-width bin
ages["equal_width"].value_counts() # every bin spans the same width, but the counts differ
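To see where the equal-width boundaries come from, the sketch below asks `pd.cut` for the edges it computed via `retbins=True` and compares them with the width `(max - min) / 5`. It rebuilds the same ages as a Series so that it runs on its own.

```python
import numpy as np
import pandas as pd

ages = pd.Series([18, 22, 25, 27, 7, 21, 23, 37, 30, 61,
                  45, 13, 11, 5, 2, 41, 9, 18, 80, 100])

# retbins=True also returns the array of bin edges that pd.cut computed.
binned, edges = pd.cut(ages, 5, retbins=True)

# Expected width of one bin: (max - min) / number of bins.
width = (ages.max() - ages.min()) / 5

print(edges)
print(width)
print(np.diff(edges[1:]))  # widths of every bin after the first
```

The lowest edge is nudged slightly below the minimum value so that the minimum itself lands inside the first, left-open bin; all remaining edges are evenly spaced.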

Equal-frequency binning

# create a new column that splits age into 5 bins holding the same number of rows
ages["equal_freq"] = pd.qcut(ages["age"], 5)
print(ages["equal_freq"])

# count how many rows fall into each equal-frequency bin
ages["equal_freq"].value_counts() # every bin holds the same number of rows, but the widths differ
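The equal-frequency edges are simply quantiles of the data. As a sketch, the snippet below retrieves the edges from `pd.qcut` with `retbins=True` and checks them against `Series.quantile` at the 0%, 20%, ..., 100% levels. It rebuilds the same ages as a Series so that it runs on its own.

```python
import numpy as np
import pandas as pd

ages = pd.Series([18, 22, 25, 27, 7, 21, 23, 37, 30, 61,
                  45, 13, 11, 5, 2, 41, 9, 18, 80, 100])

# retbins=True also returns the quantile-based edges that pd.qcut used.
binned, edges = pd.qcut(ages, 5, retbins=True)

# The same edges computed directly as quantiles of the data.
quantiles = ages.quantile([0, 0.2, 0.4, 0.6, 0.8, 1.0])

print(edges)
print(quantiles.to_numpy())
print(binned.value_counts().sort_index().tolist())
```

When a column contains many repeated values, several quantiles can coincide and `pd.qcut` raises an error about non-unique bin edges; passing `duplicates="drop"` merges those edges, at the cost of getting fewer and less evenly filled bins.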

Create a new column with five specified bins: (0, 10], (10, 20], (20, 30], (30, 50], (50, 100]

ages["exact_bins"] = pd.cut(ages["age"], bins=(0,10,20,30,50,100)) # specify the bin edges explicitly
print(ages["exact_bins"])

# count how many rows fall into each specified bin
ages["exact_bins"].value_counts().sort_index() # sorted by bin order instead of by count
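With explicit edges it is often convenient to name the bins and to know what happens to values outside them. The sketch below uses the same edges with the `labels` parameter of `pd.cut`; the label names and the sample ages are made up for illustration.

```python
import pandas as pd

# Hypothetical ages, including two values outside the (0, 100] range.
ages = pd.Series([5, 18, 30, 64, 0, 120])

# Hypothetical label names, one per bin.
labels = ["child", "teen", "young adult", "adult", "senior"]

groups = pd.cut(ages, bins=[0, 10, 20, 30, 50, 100], labels=labels)
print(groups)
```

Values that fall outside every bin come back as NaN. Here that includes 0, because the first bin (0, 10] excludes its left endpoint, and 120, which is above the last edge.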

The code for this article is available on GitHub.

Please let me know if there's any mistake in this article. Thanks for reading.

