iT邦幫忙

11th iThome Ironman Contest

DAY 13
AI & Data

Hands on Data Cleaning and Scraping series, part 13

Day13 Converting Continuous Variables into Discrete Values

Why convert continuous variables into discrete values?

Discretization bins many continuous values into a smaller number of groups. The main reasons for doing it are:

  1. Simplify the model - binning reduces the number of distinct values a variable can take, which can speed up iteration and reduce storage.
  2. Improve robustness - binning reduces the influence of extreme values and outliers on the analysis. ex: if age is encoded as 1 for anyone over 50 and 0 otherwise, an erroneous age of 200 disturbs the analysis far less than it would in the raw data.
  3. Meet model requirements - some models, such as certain decision tree and Bayes classifier implementations, expect discrete features.
  4. Introduce non-linearity - binned features let a model capture non-linear relationships, which gives it more expressive power.
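The robustness point can be illustrated with a small sketch. The sample ages and the helper name `over_50` are made up for illustration; only the threshold of 50 and the outlier of 200 come from the example in the list.

```python
from statistics import mean

# Hypothetical sample: five plausible ages plus one erroneous entry (200)
ages_clean = [23, 35, 47, 52, 68]
ages_dirty = ages_clean + [200]

def over_50(age):
    """Binarize age: 1 if above 50, otherwise 0."""
    return 1 if age > 50 else 0

# How far the outlier moves the mean of the raw values
raw_shift = mean(ages_dirty) - mean(ages_clean)

# How far the same outlier moves the mean of the binarized feature
bin_shift = mean([over_50(a) for a in ages_dirty]) - mean([over_50(a) for a in ages_clean])

print(f"shift in raw mean: {raw_shift:.2f}")
print(f"shift in binarized mean: {bin_shift:.2f}")
```

After binarization the outlier counts as just one more "over 50" record, so its size no longer matters.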

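Main ways to convert:

  1. Equal-width binning - split the data into bins of the same width, so every interval spans the same range. ex: one bin per 10 years of age.

  2. Equal-frequency binning - split the data into bins that each hold the same number of observations. ex: 10 bins.

  3. Cluster-based binning - run a clustering algorithm on the data and use the resulting clusters as bins.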
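Cluster-based binning is not demonstrated in the example that follows, so here is a minimal sketch of one way to do it: a plain 1-D k-means written in pure Python. The function name `kmeans_1d`, the choice of k=3, and the way the starting centers are picked are all assumptions made for illustration; a library clustering routine would normally be used instead.

```python
def kmeans_1d(values, k, n_iter=100):
    """Cluster 1-D values with basic k-means; return (labels, centers)."""
    ordered = sorted(values)
    # Deterministic start: spread the initial centers across the sorted data
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]

    def assign(current):
        # Each value is labelled with the index of its nearest center
        return [min(range(k), key=lambda j: abs(v - current[j])) for v in values]

    for _ in range(n_iter):
        labels = assign(centers)
        # Move each center to the mean of its members (keep it if empty)
        new_centers = []
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            new_centers.append(sum(members) / len(members) if members else centers[j])
        if new_centers == centers:
            break
        centers = new_centers
    return assign(centers), centers

age_values = [18, 22, 25, 27, 7, 21, 23, 37, 30, 61, 45, 13, 11, 5, 2, 41, 9, 18, 80, 100]
labels, centers = kmeans_1d(age_values, k=3)

for j, center in enumerate(centers):
    members = sorted(a for a, lab in zip(age_values, labels) if lab == j)
    print(f"cluster {j} (center {center:.1f}): {members}")
```

Unlike equal-width or equal-frequency binning, the bin boundaries here follow the gaps in the data, so neither the widths nor the counts are fixed in advance.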

Example

'(' means the endpoint is excluded; ']' means the endpoint is included.
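The bracket notation can be checked on values that sit exactly on a bin edge. This is a small sketch with made-up values; `right` is a real parameter of `pd.cut` that controls which side of each interval is closed.

```python
import pandas as pd

edge_case = pd.Series([10, 20])

# Default right=True: bins are (0, 10] and (10, 20], so 10 lands in the first bin
right_closed = pd.cut(edge_case, bins=[0, 10, 20])

# right=False: bins are [0, 10) and [10, 20), so 10 moves up and 20 falls outside
left_closed = pd.cut(edge_case, bins=[0, 10, 20], right=False)

print(right_closed.tolist())
print(left_closed.tolist())
```

A value that falls outside every bin is returned as NaN, not as an error.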

# import packages
import pandas as pd

# create some data
ages = pd.DataFrame({"age": [18, 22, 25, 27, 7, 21, 23, 37, 30, 61, 45, 13, 11, 5, 2, 41, 9, 18, 80, 100]})

Equal-width binning

# create a new column that bins age into 5 equal-width intervals
ages["equal_width"] = pd.cut(ages["age"], 5)
print(ages["equal_width"])

https://ithelp.ithome.com.tw/upload/images/20190914/20119709jEEwFBmW5f.jpg

# count how many values fall in each equal-width bin
ages["equal_width"].value_counts() # every bin spans the same range

https://ithelp.ithome.com.tw/upload/images/20190914/20119709WqHedp75YX.jpg
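To see where the equal-width edges come from, `pd.cut` can also return them through its `retbins` parameter. The sketch below rebuilds the same ages as a standalone Series (the names `age`, `binned`, `edges` and `step` are mine) and compares the spacing of the edges with (max - min) / 5.

```python
import pandas as pd

age = pd.Series([18, 22, 25, 27, 7, 21, 23, 37, 30, 61, 45, 13, 11, 5, 2, 41, 9, 18, 80, 100])

# retbins=True also returns the array of bin edges that pd.cut computed
binned, edges = pd.cut(age, 5, retbins=True)

# Interior edges sit one step apart, where step = (max - min) / number of bins
step = (age.max() - age.min()) / 5

print("step:", step)
print("edges:", edges)
print(binned.value_counts().sort_index())
```

pandas places the lowest edge slightly below the minimum so that the smallest age still falls inside the first bin. Equal width does not mean equal counts: with this data most ages end up in the two lowest bins.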

Equal-frequency binning

# create a new column that bins age into 5 groups holding the same number of values
ages["equal_freq"] = pd.qcut(ages["age"], 5)
print(ages["equal_freq"])

https://ithelp.ithome.com.tw/upload/images/20190914/20119709iipaiSIH4G.jpg

# count how many values fall in each equal-frequency bin
ages["equal_freq"].value_counts() # each bin contains the same number of values

https://ithelp.ithome.com.tw/upload/images/20190914/20119709obZj3uLkRT.jpg
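The equal-frequency edges are quantiles of the data, which a short sketch can confirm. It reuses the same ages as a standalone Series; the variable names are mine, while `labels=False` and `retbins=True` are real `pd.qcut` parameters.

```python
import pandas as pd

age = pd.Series([18, 22, 25, 27, 7, 21, 23, 37, 30, 61, 45, 13, 11, 5, 2, 41, 9, 18, 80, 100])

# labels=False returns the integer bin index (0 to 4) instead of an interval
codes, edges = pd.qcut(age, 5, labels=False, retbins=True)

# The edges should match the 0%, 20%, ..., 100% quantiles of the data
quantiles = age.quantile([0, 0.2, 0.4, 0.6, 0.8, 1.0])

print("edges:", edges)
print("quantiles:", quantiles.tolist())
print(codes.value_counts().sort_index())
```

The integer codes are often more convenient than interval labels when the binned column is passed to a model. If many values are identical, two quantile edges can coincide; `pd.qcut` then raises an error unless `duplicates="drop"` is passed.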

Create a new column with five specified bins: (0, 10], (10, 20], (20, 30], (30, 50], (50, 100]

ages["exact_bins"] = pd.cut(ages["age"], bins=(0,10,20,30,50,100)) # specify the bin edges explicitly
print(ages["exact_bins"])

https://ithelp.ithome.com.tw/upload/images/20190914/20119709Y1QrJqJASt.jpg

# count how many values fall in each specified bin
ages["exact_bins"].value_counts().sort_index() # sorted in bin order

https://ithelp.ithome.com.tw/upload/images/20190914/201197098nURpLcyEk.jpg
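When the bin edges are specified by hand, the bins can also be given readable names, and any value outside the edges is left unassigned. The sketch below uses the same five edges; the group names and the sample ages (including the invalid 200) are hypothetical.

```python
import pandas as pd

# 200 is a deliberately invalid entry that lies beyond the last edge
age = pd.Series([5, 18, 30, 61, 100, 200])

# Hypothetical group names, one per bin
names = ["child", "teen", "twenties", "adult", "senior"]
groups = pd.cut(age, bins=[0, 10, 20, 30, 50, 100], labels=names)

print(groups.tolist())
print("values outside the bins:", int(groups.isna().sum()))
```

Checking for missing values after `pd.cut` is a cheap way to spot records that fall outside the expected range.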

The code is available on GitHub.

Please let me know if there are any mistakes in this article. Thanks for reading.

