Sorting the values of a variable (either an ordinal categorical variable or a numeric variable) and placing them into intervals (bins or buckets) is also known as binning.

The following methods can be used to discretise variables:

4.1 Equal width discretisation
4.2 Equal frequency discretisation
4.3 Discretisation using decision trees

4.1 Equal width discretisation
This method sorts the values into N intervals of equal width. The width of the intervals is determined by the range of the variable and the number of intervals N:

width = (max value - min value) / N

There is no strict rule for choosing N, but as a rule of thumb it should not exceed 10. Note also that if the original distribution is skewed, this method will not improve it: the data remain skewed across the bins.
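As a minimal illustration of the formula (toy numbers only, not the Titanic data):

# toy example: a variable ranging from 0 to 100 split into N = 10 equal-width bins
min_val, max_val, N = 0, 100, 10
width = (max_val - min_val) / N              # (100 - 0) / 10 = 10.0
edges = [min_val + i * width for i in range(N + 1)]
print(width)   # 10.0
print(edges)   # [0.0, 10.0, 20.0, ..., 90.0, 100.0]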
We can use either pandas or scikit-learn to discretise the data.

We will illustrate with the Age variable from Kaggle's Titanic dataset:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import pylab
import scipy.stats as stats
from sklearn.model_selection import train_test_split
# for discretization
from sklearn.preprocessing import KBinsDiscretizer
data = pd.read_csv('../input/titanic/train.csv', usecols=['Age', 'Fare','Survived'])
data
Rec no. | Survived | Age | Fare
---|---|---|---
0 | 0 | 22.0 | 7.2500
1 | 1 | 38.0 | 71.2833
2 | 1 | 26.0 | 7.9250
3 | 1 | 35.0 | 53.1000
4 | 0 | 35.0 | 8.0500
... | ... | ... | ...
886 | 0 | 27.0 | 13.0000
887 | 1 | 19.0 | 30.0000
888 | 0 | NaN | 23.4500
889 | 1 | 26.0 | 30.0000
890 | 0 | 32.0 | 7.7500
# first fill the missing values of the variable Age with a random sample of the variable
def impute_na(data, variable):
    # fill NA with a random sample of the observed values
    df = data.copy()
    # copy the variable into a new column that will hold the imputed values
    df[variable + '_random'] = df[variable]
    # extract as many random observations as there are missing values
    random_sample = df[variable].dropna().sample(df[variable].isnull().sum(), random_state=0)
    # pandas needs matching indexes in order to assign the values
    random_sample.index = df[df[variable].isnull()].index
    df.loc[df[variable].isnull(), variable + '_random'] = random_sample
    return df[variable + '_random']
data['Age'] = impute_na(data, 'Age')
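As a quick sanity check (a small sketch, not part of the original notebook), we can confirm that no missing ages remain after the random-sample imputation:

# after imputation the Age variable should contain no missing values
print(data['Age'].isnull().sum())  # expected to print 0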
Split the data into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(data[['Age', 'Fare','Survived']], data.Survived, test_size=0.3, random_state=0)
X_train.shape, X_test.shape
((623, 3), (268, 3))
(1) Using pandas
Find the range of the data and split it into 10 intervals of equal width.
age_range = X_train['Age'].max() - X_train['Age'].min()
print(age_range)
# divide the range into 10 equal-width bins
print(age_range / 10)
79.58
7.958
min_value = int(np.floor( X_train['Age'].min()))
max_value = int(np.ceil( X_train['Age'].max()))
# let's round the bin width
inter_width = int(np.round(age_range/10))
min_value, max_value, inter_width
(0, 80, 8)
Determine the boundary values of each interval.
intervals = [i for i in range(min_value, max_value+inter_width, inter_width)]
intervals
[0, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80]
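As a side note, equivalent edges can be generated directly with np.linspace, given the rounded minimum and maximum computed above; a minimal sketch:

# 11 equally spaced edges between 0 and 80 define 10 bins of width 8
intervals_alt = np.linspace(min_value, max_value, 11)
print(intervals_alt)  # [ 0.  8. 16. 24. 32. 40. 48. 56. 64. 72. 80.]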
Write the interval that each observation falls into to a new column, age_disc.
# discretise Age
X_train['age_disc'] = pd.cut(x=X_train['Age'],
                             bins=intervals,
                             include_lowest=True)
print(X_train[['Age', 'age_disc']].head(10))
Index | Age | age_disc
---|---|---
857 | 51.0 | (48.0, 56.0]
52 | 49.0 | (48.0, 56.0]
386 | 1.0 | (-0.001, 8.0]
124 | 54.0 | (48.0, 56.0]
578 | 19.0 | (16.0, 24.0]
549 | 8.0 | (-0.001, 8.0]
118 | 24.0 | (16.0, 24.0]
12 | 20.0 | (16.0, 24.0]
157 | 30.0 | (24.0, 32.0]
127 | 24.0 | (16.0, 24.0]
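If integer bin codes are preferred over Interval objects (for example, to feed the result directly into a model), pd.cut also accepts labels=False; a small sketch along the lines of the code above (the column name age_disc_code is introduced here for illustration):

# labels=False returns the zero-based index of the bin instead of an Interval
X_train['age_disc_code'] = pd.cut(x=X_train['Age'],
                                  bins=intervals,
                                  labels=False,
                                  include_lowest=True)
print(X_train[['Age', 'age_disc', 'age_disc_code']].head())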
Check the number of observations in each bin.
# check the number of observations per bin
X_train['age_disc'].value_counts()
Bin | Count
---|---
(16.0, 24.0] | 146
(24.0, 32.0] | 145
(32.0, 40.0] | 116
(40.0, 48.0] | 62
(-0.001, 8.0] | 52
(48.0, 56.0] | 34
(8.0, 16.0] | 34
(56.0, 64.0] | 24
(64.0, 72.0] | 8
(72.0, 80.0] | 2

Name: age_disc, dtype: int64
Plot a bar chart of the number of observations per bin.
# plot the number of observations per bin
X_train.groupby('age_disc')['Age'].count().plot.bar()
plt.xticks(rotation=45)
plt.ylabel('Number of observations per bin')
Discretise the test set as well.
# discretise the variables in the test set
X_test['age_disc'] = pd.cut(x=X_test['Age'],
                            bins=intervals,
                            include_lowest=True)
X_test.head()
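One caveat: a test-set age falling outside the boundaries derived from the training set would not belong to any interval and would be returned as NaN by pd.cut, so it is worth checking; a small sketch:

# ages outside [0, 80] would fall into no bin and show up as missing values
print(X_test['age_disc'].isnull().sum())  # 0 expected for this particular split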
Compare how the observations are distributed across the bins in the training and test sets.
# determine proportion of observations in each bin
t1 = X_train['age_disc'].value_counts() / len(X_train)
t2 = X_test['age_disc'].value_counts() / len(X_test)
# concatenate aggregated views
tmp = pd.concat([t1, t2], axis=1)
tmp.columns = ['train', 'test']
# plot
tmp.plot.bar()
plt.xticks(rotation=45)
plt.ylabel('Proportion of observations per bin')
Plot of the relationship between Age and Survived in the original data.
fig = plt.figure()
fig = X_train.groupby(['Age'])['Survived'].mean().plot()
fig.set_title('Normal relationship between Age and Survived')
fig.set_ylabel('Survived')
Plot of the relationship between the discretised Age and Survived.
fig = plt.figure()
fig = X_train.groupby(['age_disc'])['Survived'].mean().plot(figsize=(12,6))
fig.set_title('Normal relationship between variable and target')
fig.set_ylabel('Survived')
(2) Using scikit-learn
disc = KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='uniform')
disc.fit(X_train[['Age']])
KBinsDiscretizer(encode='ordinal', n_bins=10, strategy='uniform')
The interval boundaries are stored in disc.bin_edges_:
disc.bin_edges_
array([array([ 0.42 , 8.378, 16.336, 24.294, 32.252, 40.21 , 48.168, 56.126,
64.084, 72.042, 80. ])], dtype=object)
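Because strategy='uniform' creates equal-width bins, the spacing between consecutive edges should be constant, roughly (80 - 0.42) / 10 ≈ 7.958; a quick check, as a sketch:

# differences between consecutive edges should all be (nearly) identical
print(np.diff(disc.bin_edges_[0]))  # approximately [7.958 7.958 ... 7.958]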
train_t = disc.transform(X_train[['Age']])
train_t = pd.DataFrame(train_t, columns = ['Age'])
test_t = disc.transform(X_test[['Age']])
test_t = pd.DataFrame(test_t, columns = ['Age'])
train_t.head()
Index | Age
---|---
0 | 6.0
1 | 6.0
2 | 0.0
3 | 6.0
4 | 2.0
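Note that rebuilding the DataFrame this way discards the original row index. If the discretised column later needs to be aligned with X_train or X_test, the index can be preserved explicitly; a minimal sketch, not part of the original code:

# keep the original index so the transformed values align with X_train / X_test
train_t = pd.DataFrame(disc.transform(X_train[['Age']]),
                       columns=['Age'], index=X_train.index)
test_t = pd.DataFrame(disc.transform(X_test[['Age']]),
                      columns=['Age'], index=X_test.index)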
Compare how the observations are distributed across the bins in the training and test sets.
t1 = train_t.groupby(['Age'])['Age'].count() / len(train_t)
t2 = test_t.groupby(['Age'])['Age'].count() / len(test_t)
tmp = pd.concat([t1, t2], axis=1)
tmp.columns = ['train', 'test']
tmp.plot.bar()
plt.xticks(rotation=45)
plt.ylabel('Proportion of observations per bin')
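KBinsDiscretizer can also discretise several numeric variables in a single step; a hedged sketch applying the same uniform strategy to Age and Fare together (the names disc_multi and train_multi are introduced here for illustration):

# discretise Age and Fare at once; each column gets its own 10 equal-width bins
disc_multi = KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='uniform')
disc_multi.fit(X_train[['Age', 'Fare']])
train_multi = pd.DataFrame(disc_multi.transform(X_train[['Age', 'Fare']]),
                           columns=['Age', 'Fare'], index=X_train.index)
print(train_multi.head())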