Preprocessing Workflow for Categorical and Numerical Data

2025 iThome Ironman Contest

DAY 3

AI & Data

Beginner's ML Series, Part 3

17th Ironman Contest

konchok3po

2025-09-17 19:24:57

In machine learning, data broadly falls into two types:

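Numerical data

Features are numbers that support arithmetic (addition, subtraction, multiplication, division).

Examples: age (Age = 30), salary (Salary = 48000).

The magnitudes of these values and the differences between them are meaningful: a 30-year-old is older than a 20-year-old, and a salary of 48000 is higher than 30000.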

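Categorical data

Features are text or category labels, so arithmetic cannot be applied to them directly.

Examples: country (Country = France/Spain/Germany), color (Red/Blue/Green).

These values have no inherent order: Spain is neither greater than France nor less than Germany. It is simply a different category.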

👉 In short:

Numerical data = numbers with mathematical meaning

Categorical data = labels that only distinguish one kind from another
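As a quick illustration of the distinction, the sketch below asks pandas which columns are numeric and which are not. The sample rows are made up for illustration; only the column names (Country, Age, Salary) come from this article.

```python
import pandas as pd

# Hypothetical sample rows; only the column names follow the article.
df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44, 27, 30],
    "Salary": [72000.0, 48000.0, 54000.0],
})

# Columns with a numeric dtype are treated as numerical data;
# everything else (here, the text column) is treated as categorical.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)      # ['Age', 'Salary']
print(categorical_cols)  # ['Country']
```

Note that dtype is only a first guess: a column of integer codes (such as a postal code) is stored as numbers but is really categorical.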

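Missing Data Imputation

Real-world data often has gaps; for example, the Salary column may contain NaN.
How to handle them depends on the data type:

Numerical data: usually filled with the mean, the median, or the most frequent value (mode). Here the mean is used: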

SimpleImputer(strategy="mean")

→ A missing salary is filled with the mean of the known salaries.
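A minimal sketch of mean imputation on a single column, assuming scikit-learn and NumPy are installed. The salary figures are hypothetical and chosen so the mean is easy to check by hand: (30000 + 48000 + 54000) / 3 = 44000.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical salaries with one missing entry (np.nan).
salary = np.array([[30000.0], [48000.0], [np.nan], [54000.0]])

imputer = SimpleImputer(strategy="mean")
filled = imputer.fit_transform(salary)

# statistics_ holds the value learned for each column during fit.
print(imputer.statistics_)  # [44000.]
print(filled.ravel())       # [30000. 48000. 44000. 54000.]
```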

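Categorical data: usually filled with the most frequent value. For example, a missing Country is filled with the country that appears most often: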

SimpleImputer(strategy="most_frequent")
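The same idea on a text column, as a small sketch with hypothetical values. France appears twice, so it is the value used to fill the gap.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical Country column with one missing entry.
country = pd.DataFrame({"Country": ["France", "Spain", np.nan, "France"]})

imputer = SimpleImputer(strategy="most_frequent")
filled = imputer.fit_transform(country)

print(filled.ravel())  # ['France' 'Spain' 'France' 'France']
```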

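This leaves no column with missing values, which matters because many scikit-learn estimators raise an error when the input contains NaN.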

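Encoding Categorical Data

Mathematical models operate on numbers and cannot consume text directly, so categorical data must be converted into numbers. This process is called encoding.

One-Hot Encoding is used here:
each country becomes its own new column whose value is 0 or 1.

For example, suppose the original Country values are: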

France, Spain, Germany

After one-hot encoding they become:

France  Spain  Germany
1       0      0
0       1      0
0       0      1

The model can then tell the countries apart without any order being implied among them.

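Preprocessing Pipeline: Handling Numerical and Categorical Columns Separately

Data.csv has three input features:

Country → categorical (text)

Age → numerical

Salary → numerical

Because the types differ, each needs its own treatment:

Numerical: impute missing values (mean), then standardize (mean = 0, standard deviation = 1)

Categorical: impute missing values (most frequent value), then apply One-Hot Encoding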
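Standardization in the numerical branch means replacing each value x with (x - mean) / standard deviation, computed per column. The sketch below checks StandardScaler against that formula on three hypothetical ages; it relies on the fact that StandardScaler uses the population standard deviation, which is also NumPy's default.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical ages.
age = np.array([[20.0], [30.0], [40.0]])

scaled = StandardScaler().fit_transform(age)

# The same computation by hand: subtract the mean, divide by the
# population standard deviation (ddof=0, the NumPy default).
manual = (age - age.mean()) / age.std()

print(np.allclose(scaled, manual))  # True
print(scaled.ravel())               # roughly [-1.22  0.    1.22]
```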

Code sketch:

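from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

categorical_features = ['Country']
numeric_features = ['Age', 'Salary']

# Numerical columns: fill gaps with the column mean, then standardize.
numeric_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler())
])

# Categorical columns: fill gaps with the most frequent value, then one-hot encode.
# handle_unknown="ignore" encodes categories unseen during fit as all zeros;
# sparse_output=False (scikit-learn 1.2+) returns a dense array.
categorical_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore", sparse_output=False))
])

# Route each column list to its pipeline; columns not listed are dropped.
preprocess = ColumnTransformer(
    transformers=[
        ("num", numeric_pipeline, numeric_features),
        ("cat", categorical_pipeline, categorical_features),
    ],
    remainder="drop"
)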

This code tells the computer: "process Country with the rules for text, and Age/Salary with the rules for numbers."

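With this preprocessing in place, the model receives numerical and categorical information in a form it can use, and training and evaluation can follow.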