2024 iThome Ironman Contest

DAY 22

DevOps

Exploring Time Series Databases - Prometheus series, part 22

Prometheus - Monitoring Metric Examples

16th Ironman Contest

拍拍

Team: 天堂製造

2024-10-06 23:56:45


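Questions from the previous post

Common monitoring metrics and PromQL
The read and write flow of the Prometheus Server

This post walks through some common monitoring metrics and the PromQL used to query them.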

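Monitoring metric examples

The container with the second-highest CPU / memory usage

Auto scaling is the process of allocating resources automatically based on how heavily they are being used. The resource being allocated is usually the number of identical container replicas. A common approach is a script that periodically queries CPU usage and adds or removes replicas when usage crosses a configured threshold.
In a Kubernetes environment, you can install cAdvisor to expose metrics for Docker containers. If we use the container with the second-highest CPU usage to summarize overall CPU usage, the following PromQL can be used: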

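bottomk(1, topk(2, sum by (name) (rate(container_cpu_usage_seconds_total[5m])) / scalar(sum(machine_cpu_cores)) * 100))

topk(2, ...) keeps the two busiest containers and bottomk(1, ...) then keeps the lower of the two, which is the second-highest. scalar(sum(machine_cpu_cores)) is the total core count across all scraped machines; on a multi-node cluster you may prefer to divide by the core count of each container's own node.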
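The scaling script described above could look roughly like the following. This is a minimal sketch, not a production autoscaler: the function names, the 80% and 30% thresholds, and the replica limits are all assumptions made for illustration, and the per-container usage values are assumed to have been fetched from Prometheus already.

```python
def second_highest(usages):
    """Return the second-highest usage value, or None with fewer than two containers.

    usages maps a container name to its CPU usage in percent.
    """
    if len(usages) < 2:
        return None
    return sorted(usages.values(), reverse=True)[1]


def desired_replicas(current, usage_percent, scale_up_at=80.0, scale_down_at=30.0,
                     min_replicas=1, max_replicas=10):
    """Decide the next replica count from one usage reading.

    All thresholds and limits are hypothetical defaults.
    """
    if usage_percent is None:
        return current  # not enough data, leave the replica count alone
    if usage_percent > scale_up_at:
        return min(current + 1, max_replicas)
    if usage_percent < scale_down_at:
        return max(current - 1, min_replicas)
    return current


usage = second_highest({"api": 91.0, "worker": 85.0, "cron": 40.0})
print(desired_replicas(3, usage))
```

Using the second-highest value rather than the maximum means a single runaway container does not trigger a scale-up on its own.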

Alerting on the number of messages in a message queue

Taking Kafka as an example, the metrics can come from Kafka Exporter.

groups:
  - name: kafka-alerts
    rules:
      - alert: KafkaConsumerGroupLagHigh
        expr: kafka_consumergroup_lag > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Kafka Consumer Group Lag High"
          description: "Consumer group {{ $labels.consumergroup }} has had a lag above 100 for more than 5 minutes."
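The effect of `for: 5m` in the rule above can be sketched in a few lines: the alert stays pending while the expression is true and only fires once it has been true continuously for the whole duration. The code below is a simplified model of that behaviour, not Prometheus's actual implementation; the function name and the sample data are assumptions.

```python
def alert_states(samples, threshold=100, for_seconds=300):
    """Return the alert state at each evaluation.

    samples is a list of (timestamp_seconds, lag) in evaluation order.
    """
    states = []
    pending_since = None
    for ts, lag in samples:
        if lag > threshold:
            if pending_since is None:
                pending_since = ts  # the condition just became true
            held_for = ts - pending_since
            states.append("firing" if held_for >= for_seconds else "pending")
        else:
            pending_since = None  # any dip below the threshold resets the timer
            states.append("inactive")
    return states


# One evaluation per minute; the lag dips below the threshold at t=120.
samples = [(0, 150), (60, 150), (120, 50), (180, 150), (240, 150),
           (300, 150), (360, 150), (420, 150), (480, 150)]
print(alert_states(samples))
```

In this sample the alert fires only at t=480, five minutes after the lag last rose above the threshold at t=180. A short spike never leaves the pending state.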

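Traffic between HTTP upstream and downstream endpoints

Sometimes a server exposes many HTTP endpoints that all call the same downstream service. When the downstream receives too much traffic, we may want to know which endpoint is responsible, so both the upstream and the downstream endpoint need to be recorded as labels.
Without other tools (such as OpenTelemetry traces), you can implement this yourself with the Prometheus client.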

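package main

import (
    "context"
    "log"
    "net/http"
    "strconv"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

// Declare a CounterVec. Label names may not contain hyphens, so use underscores.
var requestCount = prometheus.NewCounterVec(
    prometheus.CounterOpts{
        Name: "http_client_request_count",
        Help: "HTTP requests count",
    },
    []string{"incoming_endpoint", "outgoing_endpoint", "status"},
)

// ctxKey is an unexported type so the context key cannot collide with other packages.
type ctxKey string

const incomingEndpointKey ctxKey = "incoming-endpoint"

// prometheusMiddleware is an HTTP client middleware that wraps another RoundTripper.
type prometheusMiddleware struct {
    next http.RoundTripper
}

func (m *prometheusMiddleware) RoundTrip(req *http.Request) (*http.Response, error) {
    resp, err := m.next.RoundTrip(req)
    incoming, _ := req.Context().Value(incomingEndpointKey).(string)
    status := "error" // no response was received
    if err == nil {
        status = strconv.Itoa(resp.StatusCode)
    }
    requestCount.WithLabelValues(incoming, req.URL.Path, status).Inc()
    return resp, err
}

func main() {
    prometheus.MustRegister(requestCount)

    // Install the middleware on the HTTP client
    client := &http.Client{
        Transport: &prometheusMiddleware{next: http.DefaultTransport},
    }

    http.HandleFunc("/order/create", func(w http.ResponseWriter, r *http.Request) {
        // Record the upstream endpoint
        ctx := context.WithValue(r.Context(), incomingEndpointKey, "/order/create")

        // Pass the Context carrying incoming-endpoint to the downstream request
        req, err := http.NewRequestWithContext(ctx, http.MethodGet, "http://stock-service/check-stock", nil)
        if err != nil {
            http.Error(w, err.Error(), http.StatusInternalServerError)
            return
        }

        // Send the request downstream; the client middleware increments the Counter
        resp, err := client.Do(req)
        if err != nil {
            http.Error(w, err.Error(), http.StatusBadGateway)
            return
        }
        defer resp.Body.Close()
    })

    // Expose the metrics for scraping and start the server (port 8080 is an assumption)
    http.Handle("/metrics", promhttp.Handler())
    log.Fatal(http.ListenAndServe(":8080", nil))
}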

This lets us query the request rate from each upstream endpoint to each downstream endpoint.

sum(rate(http_client_request_count[5m])) by (incoming_endpoint, outgoing_endpoint) # request rate between the N upstream and M downstream endpoints
sum(rate(http_client_request_count{outgoing_endpoint="/check-stock"}[5m])) by (incoming_endpoint) # request rate from each upstream to /check-stock
sum(rate(http_client_request_count{incoming_endpoint="/order/create"}[5m])) by (incoming_endpoint, outgoing_endpoint) # request rate from /order/create to each downstream
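To make the `sum(...) by (...)` aggregation concrete, here is a small sketch of what it does to a set of labelled series: series that agree on the listed labels are added together, and every other label (here `status`) is dropped. The function name and the sample values are assumptions, and `/order/cancel` is a hypothetical second upstream endpoint.

```python
from collections import defaultdict


def sum_by(series, labels):
    """Sum series values grouped by the given label names.

    series is a list of (label_dict, value) pairs.
    """
    totals = defaultdict(float)
    for label_set, value in series:
        key = tuple(label_set.get(name, "") for name in labels)
        totals[key] += value
    return dict(totals)


# Per-second request rates, one series per label combination.
series = [
    ({"incoming_endpoint": "/order/create", "outgoing_endpoint": "/check-stock", "status": "200"}, 4.0),
    ({"incoming_endpoint": "/order/create", "outgoing_endpoint": "/check-stock", "status": "500"}, 1.0),
    ({"incoming_endpoint": "/order/cancel", "outgoing_endpoint": "/check-stock", "status": "200"}, 2.0),
]
print(sum_by(series, ["incoming_endpoint", "outgoing_endpoint"]))
```

Grouping by `incoming_endpoint` alone would instead answer which upstream endpoint sends the most traffic overall.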

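Error rate and P99 latency of a single API

HTTP response status can be calculated from the http_requests_total and http_requests_duration_seconds metrics provided by the exporter of a gateway (such as Nginx or Apache).

sum(rate(http_requests_total{code!="2xx"}[5m])) / sum(rate(http_requests_total[5m]))

This assumes the exporter groups status codes into classes such as "2xx". If the code label holds the exact status code, use code!~"2.." instead.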
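An error rate is usually most useful over a recent window rather than over a counter's whole lifetime, so the calculation works on the increase between two readings. The sketch below shows that arithmetic on two counter snapshots; the function name and the numbers are assumptions, and counter resets are ignored for simplicity.

```python
def error_ratio(before, after):
    """Return the share of non-2xx responses between two counter snapshots.

    before and after map a status class ("2xx", "4xx", ...) to a counter value.
    Returns None when there was no traffic in the window.
    """
    increase = {code: after[code] - before.get(code, 0) for code in after}
    total = sum(increase.values())
    if total == 0:
        return None
    errors = sum(value for code, value in increase.items() if code != "2xx")
    return errors / total


window_start = {"2xx": 1000, "4xx": 50, "5xx": 10}
window_end = {"2xx": 1180, "4xx": 65, "5xx": 15}
print(error_ratio(window_start, window_end))
```

Dividing the raw counters would give 80 / 1260 here, an average over everything since the process started, which hides a recent change in the error rate.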

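However, to stay lightweight, these exporters usually offer only a Gauge such as http_response_time_average_seconds rather than a Histogram.
So in practice we usually rely on metrics collected by the container itself. (Implement it like the example above, but time each request and add a HistogramVec that records http_request_duration_seconds.)

histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{job="my-job", api="my-api"}[5m])) by (le))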
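It may help to see how a quantile is estimated from bucket counts. The sketch below follows the general approach of classic-histogram quantile estimation: find the bucket that contains the target rank, then interpolate linearly inside it. It is a simplified illustration rather than Prometheus's exact code, and the function name and bucket values are assumptions.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile (0 < q <= 1) from cumulative buckets.

    buckets is a list of (upper_bound, cumulative_count) pairs that includes
    a math.inf bucket and at least one finite bucket.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile lies beyond the last finite bound; report that bound.
                return buckets[-2][0]
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count


latency = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
print(histogram_quantile(0.99, latency))
```

Because of the interpolation, the result is only as precise as the bucket boundaries: with these buckets any P99 between 0.5 and 1.0 seconds is an estimate within that bucket.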

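Distribution of business metrics

Sometimes we care whether the distribution of a business metric has changed noticeably. For example, if the distribution of order amounts shifts suddenly, we may worry that a product price was set incorrectly.
To monitor a distribution, use a Histogram.
The most tedious part of recording a business metric with a Histogram is that the value ranges have to be chosen by hand beforehand. For example: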

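from flask import Flask, request
from prometheus_client import Histogram

app = Flask(__name__)

order_price_histogram = Histogram('order_price_dollar', 'The price of order by dollars', buckets=[1, 3, 10, 30, 100, 300, 1000, 3000, 10000])

@app.route('/order', methods=['POST'])
def place_order():
    # Process the order; reading the price from a JSON body is an assumption
    price = float(request.get_json()['price'])
    order_price_histogram.observe(price)
    return 'Order placed!'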

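With the Histogram in place, we can query the order rate for each price range.

sum(rate(order_price_dollar_bucket[1m])) by (le) # rate of orders at or below each price bound (buckets are cumulative)
sum(rate(order_price_dollar_bucket[1m] offset 1d)) by (le) # the same rate at the same time yesterday

We can then compare the distributions at different points in time.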
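One possible way to compare two distributions is sketched below: turn the cumulative bucket counts into the share of orders that falls in each bucket, then look at the largest change in share. This is only an illustration under assumptions: the function names, the sample counts, and the idea of using the maximum share shift as the signal are not from the original post, and both histograms are assumed to use the same bucket bounds.

```python
def bucket_shares(cumulative):
    """Convert cumulative bucket counts into per-bucket shares of the total.

    cumulative is a list of (upper_bound, cumulative_count) sorted by bound.
    """
    total = cumulative[-1][1]
    shares = {}
    previous = 0
    for bound, count in cumulative:
        shares[bound] = (count - previous) / total if total else 0.0
        previous = count
    return shares


def max_share_shift(today, yesterday):
    """Return the largest absolute change in any bucket's share."""
    now, then = bucket_shares(today), bucket_shares(yesterday)
    return max(abs(now[bound] - then[bound]) for bound in now)


yesterday = [(10, 20), (100, 80), (1000, 100)]
today = [(10, 60), (100, 90), (1000, 100)]
print(bucket_shares(today))
print(max_share_shift(today, yesterday))
```

In this sample the cheapest bucket grows from 20% to 60% of orders, the kind of shift a mispriced product might cause. Comparing shares rather than raw counts keeps the comparison meaningful when total order volume differs between the two days.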

With Grafana's Heatmap panel, you can show how the histogram changes over time.

sum(increase(order_price_dollar_bucket[1m])) by (le)
