Day 24 | Heartbeat Magic on the Application Side 🪄: FastAPI Monitoring in Practice

2025 iThome Ironman Contest

DAY 24

AI & Data

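Paper Wandering Chronicles: Exploring Tools, Assembling Workflows, and Taking On a Complete Platform with AI (part 25 of the series)

17th Ironman Contest

冒牌者症候群的軟體攻城獅

Team: 等待阿毛參賽中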

2025-10-01 08:23:47

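Preface: the system needs a placebo more than I do

Honestly, an engineer's life sometimes lines up with working out more than you'd expect.
When you work out, you know to track your heart rate so you don't suddenly drop dead. Systems are the same. Nobody wants PagerDuty waking them at 3 a.m. only to find out it was a rookie-level incident like "oh, Redis died ages ago."
So your heart is still beating, but the system is dead. Ironic, right?

That's today's topic: heartbeat monitoring.
Don't worry, it's not the wristband kind that only tells you "you were lazy again today."
We're talking about heartbeats on the application side: API calls, Redis operations, and database queries can all have monitoring probes attached.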

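Why bother with "heartbeat magic"?

Don't rush to the code yet. A bit of philosophy first.
You know that state in a relationship? On the surface the other person says "I'm fine," but replies get slower and slower, you meet less and less, and eventually they vanish.
Systems are like that too.
At first it's status: "ok", then latency starts to stretch, errors pile up, and in the end it simply disappears from your monitoring view.

So monitoring is our "relationship counselor," warning you early: "hey, she's about to leave, brace yourself."
Except here we're not monitoring a partner; we're monitoring APIs and databases.
(Though honestly, the odds of either going wrong are about the same.)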

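API call monitoring: your "health check"

The API is the receptionist at the front door. If you don't record how many people they handle and how long each one takes, it's like a gym that logs lifting weights but not member churn.

So we need:

Counter: counts calls (Traffic).
Histogram: records the latency distribution (Latency).
Error counter: counts failed calls (Errors).
Gauge: tracks how busy the service is (Saturation), usually estimated from concurrent requests or queue length.

👉 Together these cover the "four golden signals" from the Google SRE book:
Latency, Traffic, Errors, and Saturation. Track these four and most problems can be caught early.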
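To make the four signals concrete, here is a minimal sketch in plain Python that derives them from a handful of made-up request records. The `Request` record, the `golden_signals` helper, and the sample numbers are all assumptions for illustration; in a real setup Prometheus does this aggregation for you, and latency would normally be reported as percentiles rather than a maximum.

```python
from dataclasses import dataclass


@dataclass
class Request:
    start: float     # seconds since some reference point
    duration: float  # seconds spent handling the request
    ok: bool         # False if the handler raised


def golden_signals(requests: list[Request]) -> dict[str, float]:
    """Summarize a non-empty batch of requests into the four golden signals."""
    # Latency: the slowest request in the batch.
    latency_max = max(r.duration for r in requests)
    # Traffic: how many requests arrived.
    traffic = len(requests)
    # Errors: how many of them failed.
    errors = sum(1 for r in requests if not r.ok)
    # Saturation: peak number of requests being handled at the same moment.
    events = []
    for r in requests:
        events.append((r.start, 1))                # request begins
        events.append((r.start + r.duration, -1))  # request ends
    # At equal timestamps, process an end (-1) before a begin (+1).
    events.sort()
    in_flight = peak = 0
    for _, delta in events:
        in_flight += delta
        peak = max(peak, in_flight)
    return {
        "latency_max": latency_max,
        "traffic": traffic,
        "errors": errors,
        "saturation_peak": peak,
    }


sample = [
    Request(start=0.0, duration=0.2, ok=True),
    Request(start=0.1, duration=0.5, ok=True),
    Request(start=0.15, duration=0.1, ok=False),
    Request(start=1.0, duration=0.05, ok=True),
]
signals = golden_signals(sample)
print(signals)
```

The first three requests overlap, so the peak concurrency is 3 even though only 4 requests arrived in total. That gap between traffic and saturation is exactly why they are tracked as separate signals.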

Code (FastAPI + Prometheus client):

import asyncio
import time
from functools import wraps

from prometheus_client import Counter, Histogram, Gauge

# Metrics for each endpoint, created once and reused
_METRICS_REGISTRY = {}


def observe_api(func):
    """
    Monitor a FastAPI endpoint, covering the four golden signals:
    - Latency: Histogram
    - Traffic: Counter
    - Errors: Counter
    - Saturation: Gauge (number of in-flight requests)
    """

    service_name = "rag-api"
    endpoint_name = func.__name__

    # Create the metrics only if they don't exist yet
    if endpoint_name not in _METRICS_REGISTRY:
        counter = Counter(
            f"{endpoint_name}_total",
            f"Total requests to {endpoint_name}",
            ["endpoint", "app_service"],
        )
        error_counter = Counter(
            f"{endpoint_name}_error_total",
            f"Error requests to {endpoint_name}",
            ["endpoint", "app_service", "error_type"],
        )
        histogram = Histogram(
            f"{endpoint_name}_latency_seconds",
            f"Latency for {endpoint_name}",
            ["endpoint", "app_service"],
            buckets=[0.005, 0.01, 0.05, 0.1, 0.25, 0.5, 1, 2, 5],
        )
        in_flight = Gauge(
            f"{endpoint_name}_in_flight",
            f"In-flight requests for {endpoint_name}",
            ["endpoint", "app_service"],
        )

        _METRICS_REGISTRY[endpoint_name] = {
            "counter": counter.labels(endpoint=endpoint_name, app_service=service_name),
            # Keep the unlabeled parent here: error_type is only known once an
            # exception occurs, and .labels() cannot be chained on a labeled child.
            "error_counter": error_counter,
            "histogram": histogram.labels(endpoint=endpoint_name, app_service=service_name),
            "in_flight": in_flight.labels(endpoint=endpoint_name, app_service=service_name),
        }

    metrics = _METRICS_REGISTRY[endpoint_name]

    def record_metrics(e=None):
        # Count the error under the exception's class name
        if e is not None:
            metrics["error_counter"].labels(
                endpoint=endpoint_name,
                app_service=service_name,
                error_type=type(e).__name__,
            ).inc()

    async def async_wrapper(*args, **kwargs):
        metrics["counter"].inc()
        metrics["in_flight"].inc()
        start = time.perf_counter()
        try:
            result = await func(*args, **kwargs)
            return result
        except Exception as e:
            record_metrics(e)
            raise
        finally:
            # Runs on success and failure alike
            metrics["histogram"].observe(time.perf_counter() - start)
            metrics["in_flight"].dec()

    def sync_wrapper(*args, **kwargs):
        metrics["counter"].inc()
        metrics["in_flight"].inc()
        start = time.perf_counter()
        try:
            result = func(*args, **kwargs)
            return result
        except Exception as e:
            record_metrics(e)
            raise
        finally:
            # Runs on success and failure alike
            metrics["histogram"].observe(time.perf_counter() - start)
            metrics["in_flight"].dec()

    # wraps() preserves the name and signature that FastAPI inspects
    if asyncio.iscoroutinefunction(func):
        return wraps(func)(async_wrapper)
    return wraps(func)(sync_wrapper)

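_METRICS_REGISTRY works like a memo: it stores each API endpoint's Counter, Histogram, and Gauge so that a metric is never registered twice (prometheus_client raises an error on duplicate metric names).
func.__name__ gives the function name, which doubles as the endpoint name.
service_name can be used to tell different microservices or applications apart.
labels give the metrics structure, which makes filtering in Prometheus + Grafana easier.
The sync wrapper does the same job for non-async functions, so metrics are collected correctly either way.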
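The `buckets` list in the decorator deserves a closer look. A Prometheus histogram doesn't store each latency; it keeps one counter per upper bound, and the counts are cumulative, so a 30 ms request counts toward every bucket from 0.05 upward. The sketch below mimics that bookkeeping in plain Python. `bucket_counts` is a made-up helper for illustration, not part of `prometheus_client`, and the sample latencies are invented.

```python
import math

# Same upper bounds as the decorator's buckets, plus the implicit +Inf bucket.
BUCKETS = [0.005, 0.01, 0.05, 0.1, 0.25, 0.5, 1, 2, 5, math.inf]


def bucket_counts(latencies: list[float]) -> dict[float, int]:
    """Return cumulative counts: how many observations were <= each upper bound."""
    return {bound: sum(1 for x in latencies if x <= bound) for bound in BUCKETS}


counts = bucket_counts([0.003, 0.03, 0.03, 0.4, 7.0])
for bound, count in counts.items():
    print(f"le={bound}: {count}")
```

Note the 7-second request: it only shows up in the +Inf bucket. If a meaningful share of your traffic lands beyond the largest finite bound, quantile estimates for the tail become unreliable, which is a hint to extend the bucket list.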

Usage

from fastapi import FastAPI

app = FastAPI()

@app.get("/health")
@observe_api
async def health_check():
    return {"status": "ok"}

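See? Isn't this just a log of a couple's fights?

Counter: which fight of the day this is.
Histogram: the distribution of how long each fight takes to go from cold war to making up.
The only difference is that Prometheus won't give you the silent treatment.

Redis operation monitoring: your little secretary's mood

Redis is like the secretary in the office: normally absurdly fast, with practically no waiting.
But one day she starts responding slowly, even handing over the wrong files (cache miss), and the whole team is done for.
Code: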

# Redis metrics
import time
from functools import wraps

from prometheus_client import Counter, Histogram, Gauge

REDIS_GET_COUNT = Counter("redis_get_total", "Total Redis GET requests")
REDIS_SET_COUNT = Counter("redis_set_total", "Total Redis SET requests")
REDIS_LATENCY = Histogram("redis_latency_seconds", "Redis operation latency")
REDIS_ERROR = Counter("redis_error_total", "Redis operation errors", ["error_type"])
REDIS_IN_FLIGHT = Gauge("redis_in_flight", "Number of in-flight Redis operations")


def monitored_redis(func):
    """Decorator that records count, latency, errors, and in-flight Redis operations."""

    @wraps(func)  # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        REDIS_IN_FLIGHT.inc()
        # Count the call up front so failed operations are included in the totals.
        # Functions whose name starts with "get" count as GET; everything else as SET.
        if func.__name__.startswith("get"):
            REDIS_GET_COUNT.inc()
        else:
            REDIS_SET_COUNT.inc()
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        except Exception as e:
            REDIS_ERROR.labels(error_type=type(e).__name__).inc()
            raise
        finally:
            # Runs on success and failure alike
            REDIS_LATENCY.observe(time.perf_counter() - start)
            REDIS_IN_FLIGHT.dec()

    return wrapper

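REDIS_GET_COUNT and REDIS_SET_COUNT: count GET and SET calls separately.
REDIS_LATENCY: records how long each operation takes, so you can quickly spot the moments when the secretary is "half a beat slow."
REDIS_ERROR: counts errors, telling you whether the secretary is on strike today.

Usage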

# redis_client is assumed to be an already-configured redis.Redis instance
@monitored_redis
def get_cache(key):
    return redis_client.get(key)

@monitored_redis
def set_cache(key, value):
    return redis_client.set(key, value)

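When Redis blows up, it's like the secretary quitting: everyone waits around for files and the whole company grinds to a halt.
Then all you can do is stand up and say, "It's fine, I'll go dig through the cabinets myself." (And break down three hours later.)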
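The "wrong file" case mentioned earlier, a cache miss, is not an exception, so the decorator never sees it. One possible way to track it is a counter labeled by outcome. This is only a sketch: the metric name `cache_lookup`, the `result` label, and the dict standing in for Redis are all assumptions, and a private `CollectorRegistry` is used so the demo doesn't collide with other metrics.

```python
from prometheus_client import CollectorRegistry, Counter

# A private registry keeps this demo separate from metrics defined elsewhere.
registry = CollectorRegistry()

CACHE_LOOKUPS = Counter(
    "cache_lookup",  # exposed as cache_lookup_total
    "Cache lookups by outcome",
    ["result"],
    registry=registry,
)

# Stand-in for a Redis client: like redis-py, .get() returns None on a miss.
fake_cache = {"user:1": b"alice"}


def get_cache(key):
    """Look up a key and record whether it was a hit or a miss."""
    value = fake_cache.get(key)
    CACHE_LOOKUPS.labels(result="hit" if value is not None else "miss").inc()
    return value


get_cache("user:1")  # hit
get_cache("user:2")  # miss
get_cache("user:3")  # miss

hits = registry.get_sample_value("cache_lookup_total", {"result": "hit"})
misses = registry.get_sample_value("cache_lookup_total", {"result": "miss"})
print(f"hit ratio: {hits / (hits + misses):.2f}")
```

A falling hit ratio is often the earliest sign that the secretary is struggling, well before latency or errors move.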

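Database query monitoring: the eternal bottleneck

The database is the company's finance system.
Everyone wants it to be fast, but every query is painfully slow.
All you can do is pray the indexes are still alive.

Databases are a common source of system bottlenecks. Observing query count, latency, and errors helps you optimize SQL and indexes.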

import time
from functools import wraps

from prometheus_client import Counter, Histogram, Gauge

# DB metrics
DB_QUERY_COUNT = Counter("db_query_total", "Total number of DB queries")
DB_QUERY_LATENCY = Histogram("db_query_latency_seconds", "DB query latency")
DB_QUERY_ERROR = Counter("db_query_error_total", "Total DB query errors", ["error_type"])
DB_IN_FLIGHT = Gauge("db_in_flight", "Number of in-flight DB queries")


def monitored_db(func):
    """Decorator that records count, latency, errors, and in-flight DB queries."""

    @wraps(func)  # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        DB_IN_FLIGHT.inc()
        DB_QUERY_COUNT.inc()
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        except Exception as e:
            DB_QUERY_ERROR.labels(error_type=type(e).__name__).inc()
            raise
        finally:
            # Runs on success and failure alike
            DB_QUERY_LATENCY.observe(time.perf_counter() - start)
            DB_IN_FLIGHT.dec()

    return wrapper

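Each query is counted first, its duration is measured, and any exception is automatically added to the error count.
The wrapper is fully transparent to the business logic, so it's easy to decorate every DB query function.

Usage: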

# db is assumed to be a SQLAlchemy Session and User a mapped model, both defined elsewhere
@monitored_db
def fetch_user(user_id):
    return db.query(User).filter(User.id == user_id).first()

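Database latency is like a late paycheck at the end of the month: everyone keeps working, but in their hearts they're already preparing to quit.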
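If the hand-written try/finally starts to feel repetitive, `prometheus_client` ships helpers that cover much of the same ground: `Histogram.time()`, `Gauge.track_inprogress()`, and `Counter.count_exceptions()` all work as decorators. The sketch below is an alternative, not a drop-in replacement for the decorator above. The `demo_` metric names, the private registry, and the dict-backed `fetch_user` are assumptions for illustration.

```python
from prometheus_client import CollectorRegistry, Counter, Gauge, Histogram

registry = CollectorRegistry()  # private registry so the demo is isolated

QUERIES = Counter("demo_db_query", "DB queries", registry=registry)
ERRORS = Counter("demo_db_query_error", "DB query errors", registry=registry)
LATENCY = Histogram("demo_db_query_latency_seconds", "DB query latency", registry=registry)
IN_FLIGHT = Gauge("demo_db_in_flight", "In-flight DB queries", registry=registry)

USERS = {1: "alice"}  # stand-in for a real table


@IN_FLIGHT.track_inprogress()  # inc on entry, dec on exit
@LATENCY.time()                # observe elapsed seconds, even when the call raises
@ERRORS.count_exceptions()     # inc when the call raises
def fetch_user(user_id):
    QUERIES.inc()
    return USERS[user_id]      # raises KeyError for unknown ids


fetch_user(1)
try:
    fetch_user(42)
except KeyError:
    pass

print(registry.get_sample_value("demo_db_query_total"))
print(registry.get_sample_value("demo_db_query_error_total"))
```

One trade-off: `count_exceptions()` increments a plain counter, so the per-exception `error_type` label from the hand-written version is lost. If that breakdown matters to you, the explicit wrapper is still the better fit.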

Prometheus configuration

global:
  scrape_interval: 15s  # scrape every 15 seconds

scrape_configs:
  - job_name: 'noteserver'
    static_configs:
      - targets: ['noteserver:8000']
    relabel_configs:
      - source_labels: ['__address__']
        target_label: 'service'
        replacement: 'noteserver'

Here, scrape_interval: 15s is like a trainer checking every 15 seconds whether you're slacking.
Except Prometheus is never late.