11th iThome Ironman Contest

DAY 3

AI & Data

Hands on Data Cleaning and Scraping series, part 3

Day03 Pandas DataFrame, Label Encoding and One Hot Encoding

kyt

2019-09-04 07:10:02

When writing code, importing existing packages saves a lot of time. Packages are usually imported under a conventional abbreviation, and these conventions are worth learning: they keep the code short and make other people's code and discussion threads easier to follow. For example, the Pandas package introduced today (the name comes from "panel data") is normally imported as pd. (On first use, run pip install pandas in the command prompt to install it.) We then use DataFrame() to convert a dictionary into a DataFrame. Check out this link to learn what dictionaries are.

import pandas as pd  # import the package under its usual abbreviation
d = {'col1': [1, 2], 'col2': [3, 4]}  # create a dictionary
df = pd.DataFrame(data=d)  # construct a DataFrame from the dictionary
df  # display the resulting DataFrame

Three column data types are most common in a Pandas DataFrame:

float64 - floating-point number, can be used for both discrete and continuous variables.
int64 - integer, can be used for both discrete and continuous variables.
object - includes strings, used for categorical variables.
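
To see these types on actual columns, a DataFrame's dtypes attribute can be inspected. The following is a minimal sketch: the column names and values (rooms, height_cm, country) are made up for illustration, and the string column is created with dtype=object explicitly because some newer Pandas releases may otherwise report a dedicated string type.

```python
import pandas as pd

# Hypothetical example data, not from the original post
df_types = pd.DataFrame({
    'rooms': [2, 3, 4],                    # whole numbers -> int64
    'height_cm': [158.5, 160.0, 175.2],    # decimals -> float64
    'country': pd.Series(['TW', 'JP', 'US'], dtype=object),  # strings -> object
})

print(df_types.dtypes)  # one dtype per column
```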

Different types of variables:

Discrete variable: a variable whose value is obtained by counting, so it only takes whole-number values (ex: number of rooms in a house, number of people).
Continuous variable: a variable whose value is obtained by measuring and can take any value within a range (ex: height, time from takeoff to landing, speed).
String or category
Others: date, Boolean, etc. Look up how to handle these when they come up in practice.
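
As a starting point for the date and Boolean formats mentioned above, here is a small sketch. The column names and values are hypothetical; pd.to_datetime() is the usual Pandas function for parsing date strings.

```python
import pandas as pd

# Hypothetical example data, not from the original post
flights = pd.DataFrame({
    'departure': ['2019-09-04', '2019-09-05'],  # dates stored as plain strings
    'delayed': [True, False],                   # Boolean column
})

# Parse the strings into real datetime values
flights['departure'] = pd.to_datetime(flights['departure'])

print(flights.dtypes)
print(flights['departure'].dt.day_name())  # datetime columns expose .dt helpers
```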

Encoding

When the data contains strings or categories, it generally has to be converted to a numerical data type before further analysis (such as training a model). There are two common ways to do this:

Label encoding - Transform each category into an integer; no new column is created. Normally used when the categories are ordered. For example, with the age groups kid, youth and elder, encoding them as 0, 1, 2 makes sense because elder > youth > kid in age.
One-hot encoding - Add a column for every category and use 0/1 to show whether the row belongs to it. Normally used when the categories are unordered (ex: country, region). More storage space is needed due to the added columns.

NumPy supports a wide range of array and matrix operations and provides a large library of mathematical functions.

# import packages
import numpy as np
import pandas as pd

ppl = ['kid', 'elder', 'youth', 'youth', 'kid', 'elder']
age = [5, 67, 25, 29, 7, 76]
height = [100, 158, 160, 175, 120, 168]
# save the data into a dictionary
dic = {'People': ppl, 'Age': age, 'Height': height}

data = pd.DataFrame(dic)  # convert the dictionary into a DataFrame
data

Label Encoding

from sklearn.preprocessing import LabelEncoder  # import LabelEncoder
labelencoder = LabelEncoder()
# leave the original DataFrame untouched; create a new one for label encoding
data_le = pd.DataFrame(dic)
data_le['People'] = labelencoder.fit_transform(data_le['People'])  # replace the column with encoded data
# classes are numbered in sorted order: elder=0, kid=1, youth=2
data_le
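
Because LabelEncoder numbers the classes in sorted (alphabetical) order, the integers it produces here do not follow the age order kid < youth < elder. If that order matters, one option is to spell the mapping out by hand. The sketch below uses a hand-written dictionary (age_order and data_ord are names chosen here for illustration) together with Series.map(); it is one possible approach, not the method from the original post.

```python
import pandas as pd

ppl = ['kid', 'elder', 'youth', 'youth', 'kid', 'elder']
data_ord = pd.DataFrame({'People': ppl})

# Explicit mapping so that kid < youth < elder
age_order = {'kid': 0, 'youth': 1, 'elder': 2}
data_ord['People'] = data_ord['People'].map(age_order)

print(data_ord['People'].tolist())  # [0, 2, 1, 1, 0, 2]
```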

One-Hot Encoding

The get_dummies() function in Pandas makes it easy to one-hot encode a DataFrame.

data_dum = pd.get_dummies(data)  # encodes the string column; numeric columns are left as they are
data_dum  # get_dummies() already returns a DataFrame
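
To make the result more explicit, the sketch below restricts the encoding to the People column and requests integer output. The columns and dtype arguments are existing get_dummies() parameters; dtype=int is used on the assumption that 0/1 output is wanted, since recent Pandas versions return True/False by default. The name data_onehot is chosen here for illustration.

```python
import pandas as pd

data = pd.DataFrame({
    'People': ['kid', 'elder', 'youth', 'youth', 'kid', 'elder'],
    'Age': [5, 67, 25, 29, 7, 76],
    'Height': [100, 158, 160, 175, 120, 168],
})

# Encode only the People column; each category becomes its own 0/1 column
data_onehot = pd.get_dummies(data, columns=['People'], dtype=int)

print(data_onehot.columns.tolist())
# ['Age', 'Height', 'People_elder', 'People_kid', 'People_youth']
print(data.shape, '->', data_onehot.shape)  # (6, 3) -> (6, 5)
```

The shape comparison shows the storage cost mentioned earlier: one text column turns into three numeric columns, one per category.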

The code for this post is available on GitHub.

Please let me know if there's any mistake in this article. Thanks for reading.
