iT邦幫忙

2025 iThome Ironman Contest

DAY 11
Cloud Native

Robot Sandbox on K8s series, part 17

Day 17 | Monitor: DCGM Exporter + Prometheus + Grafana


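Continuing from Day 16 (Isaac + Selkies in the same Pod), today fills in GPU monitoring: DCGM Exporter exposes GPU metrics, Prometheus scrapes them, and Grafana shows the key numbers for the Isaac workload, including utilization, temperature, power draw, GPU memory, and NVENC/NVDEC.

A. Prerequisites and goals

  • On Day 4 the GPU Operator already deployed nvidia-dcgm-exporter (the namespace is usually gpu-operator).

  • Today's goals:

    1. Let Prometheus scrape the DCGM metrics (ServiceMonitor or scrape config).
    2. Import a simple Grafana dashboard (with suggested queries and panels).
    3. Add a few alerting rules (temperature, power, utilization).

B. kube-prometheus-stack (Prometheus Operator)

This is the recommended approach: it is simple to manage and easy to extend.

1) Install (or confirm it is already installed)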

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
kubectl create ns monitoring || true
helm upgrade --install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  -n monitoring \
  --set grafana.defaultDashboardsEnabled=true

2) Let Prometheus scrape DCGM Exporter

If Day 4 already created a ServiceMonitor, skip this step; otherwise create one:

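# servicemonitor-dcgm.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: gpu-operator
  labels:
    release: kube-prometheus-stack   # must match the Helm release name used above
spec:
  selector:
    matchLabels:
      # Must match the labels on the exporter Service. Check with:
      #   kubectl -n gpu-operator get svc --show-labels
      # GPU Operator deployments often use "app: nvidia-dcgm-exporter" instead.
      app.kubernetes.io/name: nvidia-dcgm-exporter
  namespaceSelector:
    matchNames: [gpu-operator]
  endpoints:
  - port: metrics                    # Service port *name*; GPU Operator often names it gpu-metrics
    interval: 15s
    relabelings:
    # Without this, the job label defaults to the Service name,
    # and the job="dcgm-exporter" selectors used below would match nothing.
    - targetLabel: job
      replacement: dcgm-exporter

kubectl apply -f servicemonitor-dcgm.yaml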

Verify: run kubectl -n monitoring port-forward svc/kube-prometheus-stack-prometheus 9090:9090, then open http://localhost:9090/targets and confirm that the dcgm-exporter target is UP.
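The same check can be scripted against Prometheus's HTTP API instead of the Targets page. This is a minimal sketch, assuming the port-forward above is still running on localhost:9090. The helper names (`dcgm_target_health`, `fetch_targets`) and the substring match on the job label are illustrative choices, not part of the original setup.

```python
import json
import urllib.request


def dcgm_target_health(targets_response, job_substring="dcgm-exporter"):
    """Return (scrape_url, health) pairs for active targets whose job label
    contains job_substring. Expects the parsed JSON body of /api/v1/targets."""
    active = targets_response.get("data", {}).get("activeTargets", [])
    return [
        (target.get("scrapeUrl", ""), target.get("health", "unknown"))
        for target in active
        if job_substring in target.get("labels", {}).get("job", "")
    ]


def fetch_targets(base_url="http://localhost:9090"):
    """Fetch /api/v1/targets from a Prometheus reachable at base_url
    (assumed here to be the local port-forward)."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=10) as resp:
        return json.load(resp)


# Usage (requires the port-forward to be active):
#   for url, health in dcgm_target_health(fetch_targets()):
#       print(health, url)
```

An empty result usually means the ServiceMonitor selector or port name does not match the exporter Service, so that is the first thing to re-check.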

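C. Selkies/Isaac labels

To relate workloads to GPU nodes in Grafana, add Pod/Deployment labels to the Day 16 chart:

# values.yaml snippet
podLabels:
  app.kubernetes.io/part-of: isaac-sandbox
  sandbox.role: streaming

Then carry these labels into the time series on the Prometheus side. Note that labels picked up through kubernetes_sd_configs relabeling (or a ServiceMonitor's podTargetLabels) describe the scraped Pod, which here is the exporter rather than the Isaac Pod. A common alternative is to join the pod and namespace labels that DCGM Exporter can attach to each GPU sample with kube_pod_labels from kube-state-metrics (the workload labels must be allowed in kube-state-metrics for them to appear there).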


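D. Grafana: dashboard

First open the Grafana UI:

kubectl -n monitoring port-forward svc/kube-prometheus-stack-grafana 3000:80
# Default credentials: admin / prom-operator (or a password set through the chart; check the Secret)

Suggested panels and PromQL examples

DCGM Exporter metric names vary somewhat between versions; the ones below are common. Explore what is actually available in Prometheus first with {job="dcgm-exporter"}. If that selector returns nothing, check the job label on the Targets page: without relabeling, a ServiceMonitor target's job defaults to the Service name. Label names are case-sensitive, and dcgm-exporter typically emits the node label as Hostname with a capital H.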
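To see which metric and label names your exporter really emits, it can help to summarize its raw /metrics output (dcgm-exporter serves it on port 9400 by default). The sketch below is a deliberately simple parser for the Prometheus text format, not a full implementation; `summarize_metrics` is a hypothetical helper name.

```python
import re
from collections import defaultdict

# metric_name{label="value",...} value [timestamp]
_SAMPLE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def summarize_metrics(exposition_text):
    """Map each metric name to the sorted list of label names seen on its samples."""
    labels_by_metric = defaultdict(set)
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE_RE.match(line)
        if not match:
            continue  # ignore anything that does not look like a sample
        name, label_block, _value = match.groups()
        label_names = (key for key, _ in _LABEL_RE.findall(label_block or ""))
        labels_by_metric[name].update(label_names)
    return {name: sorted(keys) for name, keys in labels_by_metric.items()}
```

Feed it the text from a curl against the exporter's /metrics endpoint, then use the reported label names in the `by (...)` clauses of the queries that follow.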

  1. GPU utilization (%)
avg by (gpu, Hostname) (DCGM_FI_DEV_GPU_UTIL{job="dcgm-exporter"})
  2. Memory copy (bandwidth) utilization (%); this measures memory-controller activity, not how full GPU memory is
avg by (gpu, Hostname) (DCGM_FI_DEV_MEM_COPY_UTIL{job="dcgm-exporter"})
  3. GPU memory used (GiB); DCGM_FI_DEV_FB_USED is reported in MiB, so divide by 1024 once
avg by (gpu, Hostname) (DCGM_FI_DEV_FB_USED{job="dcgm-exporter"}) / 1024
  4. GPU temperature (°C)
max by (gpu, Hostname) (DCGM_FI_DEV_GPU_TEMP{job="dcgm-exporter"})
  5. Power draw (W)
avg by (gpu, Hostname) (DCGM_FI_DEV_POWER_USAGE{job="dcgm-exporter"})
  6. NVENC / NVDEC utilization (%) (if available)
avg by (gpu, Hostname) (DCGM_FI_DEV_ENC_UTIL{job="dcgm-exporter"})
avg by (gpu, Hostname) (DCGM_FI_DEV_DEC_UTIL{job="dcgm-exporter"})
  7. Map each Isaac Pod to its node (a table panel can show the mapping and the current GPU)
  • Use kube_pod_info or kube_pod_container_info (kube-state-metrics), or kube_pod_container_resource_limits, to relate Pod → Node, then use the node to relate to the DCGM metrics.
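A note on units for the memory panels: DCGM_FI_DEV_FB_USED and DCGM_FI_DEV_FB_FREE are reported in MiB, so converting to GiB is a single division by 1024. The sketch below spells out that arithmetic, plus an occupancy percentage that mirrors FB_USED / (FB_USED + FB_FREE). The function names are illustrative, and the calculation ignores any reserved framebuffer memory that newer drivers report separately.

```python
def fb_mib_to_gib(mib):
    """Convert a framebuffer reading in MiB to GiB."""
    return mib / 1024.0


def fb_used_percent(used_mib, free_mib):
    """Approximate GPU memory occupancy in percent from used and free MiB.

    Reserved memory is not counted, so the result can differ slightly
    from what nvidia-smi shows.
    """
    total = used_mib + free_mib
    if total <= 0:
        return 0.0
    return 100.0 * used_mib / total


# Example: 6144 MiB used and 18432 MiB free on a 24 GiB card
print(fb_mib_to_gib(6144))           # 6.0
print(fb_used_percent(6144, 18432))  # 25.0
```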

Advanced: you can also combine metrics and labels in one query, for example to show the "average GPU utilization of a Deployment":

avg by (deployment) (
  label_replace(
    DCGM_FI_DEV_GPU_UTIL{job="dcgm-exporter"},
    "deployment", "$1",
    "kubernetes_io_metadata_name", "(.*)"
  )
)

(Adjust the source label name to whatever labels are attached automatically in your environment.)

E. Recording/Alerting Rules

# gpu-alerts.yaml (a PrometheusRule for Prometheus Operator)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-alerts
  namespace: monitoring
  labels:
    release: kube-prometheus-stack   # needed for the chart's default rule selector to pick this up
spec:
  groups:
  - name: gpu.rules
    rules:
    - alert: GPUHighTemperature
      expr: max_over_time(DCGM_FI_DEV_GPU_TEMP{job="dcgm-exporter"}[5m]) > 80
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "GPU temperature high"
        description: "GPU temp > 80°C for 5m on {{ $labels.Hostname }} (gpu={{ $labels.gpu }})"

    # DCGM_FI_DEV_POWER_MGMT_LIMIT may not be exported by default; add it to the
    # exporter's metrics list if this expression returns no data.
    - alert: GPUNearPowerLimit
      expr: avg_over_time(DCGM_FI_DEV_POWER_USAGE{job="dcgm-exporter"}[5m])
            / avg_over_time(DCGM_FI_DEV_POWER_MGMT_LIMIT{job="dcgm-exporter"}[5m]) > 0.9
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "GPU power near limit"
        description: "Power > 90% of limit for 10m on {{ $labels.Hostname }} (gpu={{ $labels.gpu }})"

    - alert: GPUStuckLowUtilization
      expr: avg_over_time(DCGM_FI_DEV_GPU_UTIL{job="dcgm-exporter"}[30m]) < 10
      for: 30m
      labels:
        severity: info
      annotations:
        summary: "GPU long low utilization"
        description: "GPU util < 10% for 30m on {{ $labels.Hostname }} (gpu={{ $labels.gpu }})"

kubectl apply -f gpu-alerts.yaml

Tune the thresholds for your environment (80°C and 90% are only examples). If you run Alertmanager, configure notification channels (Email/Slack/Webhook).
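When tuning thresholds, keep in mind how the `for:` clause behaves: the expression has to stay true at every rule evaluation for the whole duration, and a single dip resets the timer. The sketch below is a simplified model of that behavior for a "greater than" rule. It is an illustration under stated assumptions (evenly ordered evaluations, no missing series), not Prometheus's actual implementation, and `first_firing_time` is a hypothetical name.

```python
def first_firing_time(evaluations, threshold, for_seconds):
    """Return the timestamp at which a `value > threshold` alert would start
    firing, or None if it never does.

    evaluations: list of (timestamp_seconds, value), sorted by time, one entry
    per rule evaluation.
    """
    pending_since = None
    for timestamp, value in evaluations:
        if value > threshold:
            if pending_since is None:
                pending_since = timestamp  # alert enters the pending state
            if timestamp - pending_since >= for_seconds:
                return timestamp           # held long enough: firing
        else:
            pending_since = None           # condition cleared: timer resets
    return None


# Temperature sampled once a minute; rule is "> 80 for 5m"
temps = [(0, 79), (60, 82), (120, 83), (180, 81), (240, 84), (300, 85), (360, 86)]
print(first_firing_time(temps, threshold=80, for_seconds=300))  # 360
```

Note that the temperature rule above also wraps the metric in max_over_time over 5m, so the effective delay before firing combines that window with the `for:` duration.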

F. Quick dashboard template (JSON provisioning, optional)

If you are familiar with Grafana provisioning, a ConfigMap can import the dashboard automatically:

apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboards-gpu
  namespace: monitoring
  labels:
    grafana_dashboard: "1"
    app.kubernetes.io/name: kube-prometheus-stack
    app.kubernetes.io/instance: kube-prometheus-stack
    app.kubernetes.io/part-of: grafana-dashboards
data:
  gpu-overview.json: |
    {
      "title": "GPU Overview (DCGM)",
      "timezone": "browser",
      "panels": [
        {"type":"timeseries","title":"GPU Util %","targets":[{"expr":"avg by (gpu,Hostname) (DCGM_FI_DEV_GPU_UTIL{job=\"dcgm-exporter\"})"}]},
        {"type":"timeseries","title":"FB Used (GiB)","targets":[{"expr":"avg by (gpu,Hostname) (DCGM_FI_DEV_FB_USED{job=\"dcgm-exporter\"}) / 1024"}]},
        {"type":"timeseries","title":"Temp °C","targets":[{"expr":"max by (gpu,Hostname) (DCGM_FI_DEV_GPU_TEMP{job=\"dcgm-exporter\"})"}]},
        {"type":"timeseries","title":"Power (W)","targets":[{"expr":"avg by (gpu,Hostname) (DCGM_FI_DEV_POWER_USAGE{job=\"dcgm-exporter\"})"}]},
        {"type":"timeseries","title":"NVENC %","targets":[{"expr":"avg by (gpu,Hostname) (DCGM_FI_DEV_ENC_UTIL{job=\"dcgm-exporter\"})"}]},
        {"type":"timeseries","title":"NVDEC %","targets":[{"expr":"avg by (gpu,Hostname) (DCGM_FI_DEV_DEC_UTIL{job=\"dcgm-exporter\"})"}]}
      ]
    }

The Grafana sidecar in kube-prometheus-stack detects the grafana_dashboard: "1" label and imports the dashboard automatically.
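Hand-escaping the quotes inside the dashboard JSON is easy to get wrong, so one option is to generate the file. The sketch below builds the same six panels and lets `json.dumps` handle the escaping. The function names, the `by_labels` default, and the job name are assumptions to adjust for your environment, and the output is the same minimal dashboard shape as above rather than a complete Grafana schema.

```python
import json

# (panel title, aggregation, metric name, optional PromQL suffix)
PANEL_SPECS = [
    ("GPU Util %", "avg", "DCGM_FI_DEV_GPU_UTIL", ""),
    ("FB Used (GiB)", "avg", "DCGM_FI_DEV_FB_USED", " / 1024"),  # MiB -> GiB
    ("Temp °C", "max", "DCGM_FI_DEV_GPU_TEMP", ""),
    ("Power (W)", "avg", "DCGM_FI_DEV_POWER_USAGE", ""),
    ("NVENC %", "avg", "DCGM_FI_DEV_ENC_UTIL", ""),
    ("NVDEC %", "avg", "DCGM_FI_DEV_DEC_UTIL", ""),
]


def panel_expr(agg, metric, suffix="", by_labels="gpu, Hostname", job="dcgm-exporter"):
    """Build one PromQL expression; label and job names are assumptions."""
    return f'{agg} by ({by_labels}) ({metric}{{job="{job}"}}){suffix}'


def build_dashboard(title="GPU Overview (DCGM)", specs=PANEL_SPECS, **expr_kwargs):
    """Return the dashboard as a plain dict with one timeseries panel per spec."""
    panels = [
        {
            "type": "timeseries",
            "title": panel_title,
            "targets": [{"expr": panel_expr(agg, metric, suffix, **expr_kwargs)}],
        }
        for panel_title, agg, metric, suffix in specs
    ]
    return {"title": title, "timezone": "browser", "panels": panels}


def render_dashboard(dashboard):
    """Serialize to JSON text suitable for the ConfigMap's data field."""
    return json.dumps(dashboard, indent=2, ensure_ascii=False)
```

Writing `render_dashboard(build_dashboard())` to gpu-overview.json and creating the ConfigMap from that file with kubectl create configmap --from-file avoids embedding JSON in YAML by hand; remember to add the grafana_dashboard label afterwards.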

Thoughts

Just a little more effort and it's a long weekend again!!!

