Article 1: AI Data Scientist Training Notes, From Raw Data to Intelligent Decisions - iT 邦幫忙

2025 iThome Ironman Contest

DAY 1

AI & Data

Series: The All-Encompassing World of AI, From Data Analysis and Predictive to Generative (Part 1 of the series)

17th Ironman Contest

superwilly1122

2025-09-15 23:59:54

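📌 Part 1: Data Cleaning and Exploration

Goal: learn to handle missing values and outliers and to run basic statistical analysis.
Tools: pandas, numpy, matplotlib
Flow: CSV file → load → check for missing values → handle missing values/outliers → basic statistics → visualization
Code:

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv('data.csv')

# info() prints its summary directly and returns None, so print() is not needed
df.info()
print(df.describe())

# Fill missing values in numeric columns with each column's mean
df = df.fillna(df.mean(numeric_only=True))

# Plot the distribution of the 'age' column
plt.hist(df['age'])
plt.show()
```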
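
The flow for this part lists outlier handling, but the snippet only fills missing values. The following is a minimal sketch of how the remaining steps might look, assuming a numeric `age` column and the common 1.5 × IQR rule. Both the rule and the small DataFrame are assumptions made for illustration; the original post does not specify them.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for data.csv
df = pd.DataFrame({
    "age": [23, 25, 31, np.nan, 29, 27, 120],
    "salary": [42000, 48000, np.nan, 51000, 50000, 46000, 47000],
})

# Count missing values per column before filling them
missing_counts = df.isna().sum()
print(missing_counts)

# Fill numeric gaps with the column mean, as in the post's snippet
df = df.fillna(df.mean(numeric_only=True))

# Flag outliers in 'age' with the 1.5 * IQR rule (an assumed convention)
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (df["age"] < lower) | (df["age"] > upper)
print(df[is_outlier])

# One option: clip outliers to the fence values instead of dropping rows
df["age_clipped"] = df["age"].clip(lower=lower, upper=upper)
```

Because the mean is itself pulled by extreme values, filling with the median, or handling outliers before imputing, may be a better fit for skewed columns.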

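📌 Part 2: Feature Engineering and Data Preprocessing

Goal: convert raw data into a format the model can consume.
Tools: scikit-learn's StandardScaler and OneHotEncoder
Flow: raw columns → numeric standardization → one-hot encoding of categorical columns → feature merging
Code:

```python
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

ct = ColumnTransformer([
    # Standardize numeric columns to zero mean and unit variance
    ('num', StandardScaler(), ['age', 'salary']),
    # One-hot encode the categorical column
    ('cat', OneHotEncoder(), ['gender']),
])

# Fit the transformers and build the combined feature matrix
X = ct.fit_transform(df)
```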

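📌 Part 3: Modeling and Training

Goal: build and train a machine learning model for prediction.
Tools: scikit-learn's RandomForestClassifier
Flow: feature matrix → train/test split → modeling → accuracy evaluation
Code:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Hold out 20% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, df['target'], test_size=0.2, random_state=42
)

# Train a random forest and evaluate accuracy on the held-out set
model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, pred))
```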
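
The Part 2 snippet calls `fit_transform` on the full DataFrame before the split, so the scaling statistics are computed with the test rows included. One way to avoid this, sketched here under assumptions, is to split first and wrap preprocessing and model in a scikit-learn `Pipeline`. The synthetic data, the labelling rule for `target`, and the temporary save path are all made up for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical synthetic data; column names follow the post's snippets
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.integers(20, 65, size=n),
    "salary": rng.integers(30000, 120000, size=n),
    "gender": rng.choice(["F", "M"], size=n),
})
# Made-up labelling rule, only so the example has something to learn
df["target"] = (df["salary"] > 75000).astype(int)

X = df[["age", "salary", "gender"]]
y = df["target"]

# Split first so the scaler and encoder never see the test rows
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

pipeline = Pipeline([
    ("prep", ColumnTransformer([
        ("num", StandardScaler(), ["age", "salary"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["gender"]),
    ])),
    ("clf", RandomForestClassifier(n_estimators=100, random_state=42)),
])
pipeline.fit(X_train, y_train)

accuracy = accuracy_score(y_test, pipeline.predict(X_test))
print("Accuracy:", accuracy)

# Persist the fitted pipeline; a temp directory is used here for illustration
model_path = os.path.join(tempfile.mkdtemp(), "model.pkl")
joblib.dump(pipeline, model_path)
```

Saving with joblib matches the library the deployment snippet uses to load `model.pkl`. Note that a pipeline built on column names expects a DataFrame with those columns at prediction time, not a bare list of numbers.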

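📌 Part 4: Model Deployment and Monitoring

Goal: expose the model as an API that supports real-time prediction.
Tools: Flask, MLflow
Flow: model → API server → input data → return prediction → write logs
Code:

```python
from flask import Flask, request, jsonify
import joblib

app = Flask(__name__)

# Load the trained model once at startup
model = joblib.load('model.pkl')

@app.route('/predict', methods=['POST'])
def predict():
    # Expects a JSON body such as {"features": [...]}
    data = request.get_json()
    result = model.predict([data['features']])
    return jsonify({'prediction': int(result[0])})

if __name__ == '__main__':
    # Start the development server only when run as a script
    app.run(port=5000)
```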
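
The flow ends with a logging step that the snippet does not show. The sketch below adds basic input validation and logging, then exercises the endpoint with Flask's built-in test client so no server has to be started. It makes several assumptions: a tiny model trained on made-up data stands in for `model.pkl`, Python's standard `logging` module is used in place of MLflow, and the error messages and logger name are hypothetical.

```python
import logging

from flask import Flask, jsonify, request
from sklearn.ensemble import RandomForestClassifier

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("predict_api")

# Stand-in for joblib.load('model.pkl'): a tiny model trained on made-up data
model = RandomForestClassifier(n_estimators=10, random_state=0)
model.fit([[0, 0], [0, 1], [1, 0], [1, 1]], [0, 0, 1, 1])

app = Flask(__name__)


@app.route("/predict", methods=["POST"])
def predict():
    # silent=True returns None instead of raising on a missing or invalid JSON body
    data = request.get_json(silent=True)
    if not isinstance(data, dict) or "features" not in data:
        return jsonify({"error": "expected JSON with a 'features' list"}), 400

    features = data["features"]
    if not isinstance(features, list) or len(features) != model.n_features_in_:
        return jsonify({"error": "wrong number of features"}), 400

    prediction = int(model.predict([features])[0])
    # Record each request and its result for later monitoring
    logger.info("features=%s prediction=%s", features, prediction)
    return jsonify({"prediction": prediction})


# Exercise the endpoint in-process with Flask's test client
client = app.test_client()
ok = client.post("/predict", json={"features": [1, 0]})
bad = client.post("/predict", json={"wrong_key": [1, 0]})
print(ok.status_code, ok.get_json())
print(bad.status_code, bad.get_json())
```

Returning a 400 response for malformed input keeps client mistakes from surfacing as server errors, and the log lines give a simple record of what the model was asked and what it answered.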

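Conclusion and Benefits

Working through the full data science workflow, from cleaning to deployment, makes it possible to validate models quickly in an enterprise setting and lowers the barrier to adoption.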