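Following on from Day 16 (Isaac + Selkies in the same Pod), today we fill in GPU monitoring: use DCGM Exporter to expose GPU metrics, have Prometheus scrape them, and finally view the key figures for the Isaac workload in Grafana: utilization, temperature, power draw, framebuffer memory, and NVENC/NVDEC.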
On Day 4 the GPU Operator already deployed nvidia-dcgm-exporter (the Namespace is usually gpu-operator).
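Today's goal: get Prometheus to scrape it (through a ServiceMonitor or a scrape config). Installing kube-prometheus-stack is the recommended route: it is simple to manage and easy to extend.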
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
kubectl create ns monitoring || true
helm upgrade --install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
-n monitoring \
--set grafana.defaultDashboardsEnabled=true
If a ServiceMonitor was already created on Day 4, you can skip this step; otherwise create one:
# servicemonitor-dcgm.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: gpu-operator
  labels:
    release: kube-prometheus-stack # matches the Helm release name used above
spec:
  selector:
    matchLabels:
      # Verify against your Service: kubectl -n gpu-operator get svc --show-labels
      app.kubernetes.io/name: nvidia-dcgm-exporter
  namespaceSelector:
    matchNames: [gpu-operator]
  endpoints:
    - port: metrics # must equal the port name defined on the Service
      interval: 15s
kubectl apply -f servicemonitor-dcgm.yaml
Verify:
kubectl -n monitoring port-forward svc/kube-prometheus-stack-prometheus 9090:9090
→ Open http://localhost:9090/targets and confirm that dcgm-exporter shows as Up.
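If you would rather check this from a script than from the browser, here is a minimal Python sketch. It assumes the port-forward above is running and reads Prometheus's `/api/v1/targets` endpoint. The helper names (`fetch_targets`, `target_health`) and the `"dcgm"` substring filter are my own choices for illustration, not part of any tool.

```python
import json
import urllib.request


def fetch_targets(base_url="http://localhost:9090"):
    """GET /api/v1/targets from a Prometheus reachable at base_url."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=10) as resp:
        return json.load(resp)


def target_health(payload, job_substring="dcgm"):
    """Return {instance: health} for active targets whose job label contains job_substring."""
    targets = payload.get("data", {}).get("activeTargets", [])
    return {
        t["labels"].get("instance", t.get("scrapeUrl", "?")): t.get("health", "unknown")
        for t in targets
        if job_substring in t.get("labels", {}).get("job", "")
    }


# Usage (needs the port-forward to be active):
#   print(target_health(fetch_targets()))
```

Every instance in the result should report `up`. An empty dict usually means the ServiceMonitor selector or port name does not match the Service.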
To relate workloads to GPU nodes in Grafana, it helps to add Pod/Deployment labels to the Day 16 Chart:
# values.yaml excerpt
podLabels:
  app.kubernetes.io/part-of: isaac-sandbox
  sandbox.role: streaming
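Then, on the Prometheus side (or through the ServiceMonitor), carry these labels into the time series. Common approaches are to let kubernetes_sd_configs discover Pod labels automatically, or to attach them with metric_relabel_configs.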
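To make the label plumbing less abstract, here is a small Python simulation of what happens to Pod labels during discovery. It is only a model of the behaviour, not Prometheus code: Pod discovery exposes each Pod label as `__meta_kubernetes_pod_label_<name>` with unsupported characters replaced by underscores, and a relabel rule with `action: labelmap` copies matching labels onto the series. Note that these are the labels of the Pod being scraped. The function names are hypothetical.

```python
import re


def sanitize(label_name):
    """Replace characters that are not valid in a Prometheus label name with '_'."""
    return re.sub(r"[^a-zA-Z0-9_]", "_", label_name)


def discovered_meta_labels(pod_labels):
    """Mimic the __meta_kubernetes_pod_label_<name> labels produced by Pod discovery."""
    return {
        f"__meta_kubernetes_pod_label_{sanitize(k)}": v for k, v in pod_labels.items()
    }


def labelmap(meta_labels, regex=r"__meta_kubernetes_pod_label_(.+)"):
    """Mimic a relabel rule with action: labelmap (regex is fully anchored, target is $1)."""
    pattern = re.compile(regex)
    out = {}
    for name, value in meta_labels.items():
        match = pattern.fullmatch(name)
        if match:
            out[match.group(1)] = value
    return out


pod_labels = {"app.kubernetes.io/part-of": "isaac-sandbox", "sandbox.role": "streaming"}
print(labelmap(discovered_meta_labels(pod_labels)))
```

The practical takeaway: in PromQL and Grafana the labels show up as `app_kubernetes_io_part_of` and `sandbox_role`, not in their dotted form.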
First, open the Grafana UI:
kubectl -n monitoring port-forward svc/kube-prometheus-stack-grafana 3000:80
# Default credentials are admin / prom-operator (or a password generated by the Chart; check the Secret)
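DCGM Exporter metric names vary somewhat between versions; the ones below are the common names. In Prometheus, first run
{job="dcgm-exporter"}
to explore which metrics are actually available. Label names matter too: with a ServiceMonitor the job label defaults to the Service name, and many dcgm-exporter versions expose the node as Hostname (capital H), so adjust the selectors and the by (...) clauses to match what you see.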
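Another way to see what your exporter version really emits is to read its `/metrics` output directly. The sketch below pulls the distinct metric names out of Prometheus exposition text. The `SAMPLE` text is made up for illustration (including its label set), and `metric_names` is a hypothetical helper.

```python
import re

# Made-up excerpt in Prometheus exposition format, for illustration only.
SAMPLE = """\
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",Hostname="node-a"} 37
DCGM_FI_DEV_FB_USED{gpu="0",Hostname="node-a"} 2048
DCGM_FI_DEV_GPU_TEMP{gpu="0",Hostname="node-a"} 61
"""


def metric_names(exposition_text, prefix="DCGM_"):
    """Return the sorted, de-duplicated metric names that start with prefix."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = re.match(r"[a-zA-Z_:][a-zA-Z0-9_:]*", line)
        if match and match.group(0).startswith(prefix):
            names.add(match.group(0))
    return sorted(names)


print(metric_names(SAMPLE))
```

Feed it the text you get from the exporter's metrics endpoint and compare the result with the names used in the queries.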
# GPU utilization (%)
avg by (gpu, hostname) (DCGM_FI_DEV_GPU_UTIL{job="dcgm-exporter"})
# Memory copy utilization (%)
avg by (gpu, hostname) (DCGM_FI_DEV_MEM_COPY_UTIL{job="dcgm-exporter"})
# Framebuffer memory used (GiB); DCGM_FI_DEV_FB_USED is reported in MiB, so divide by 1024 once
avg by (gpu, hostname) (DCGM_FI_DEV_FB_USED{job="dcgm-exporter"}) / 1024
# Temperature (°C)
max by (gpu, hostname) (DCGM_FI_DEV_GPU_TEMP{job="dcgm-exporter"})
# Power draw (W)
avg by (gpu, hostname) (DCGM_FI_DEV_POWER_USAGE{job="dcgm-exporter"})
# NVENC utilization (%)
avg by (gpu, hostname) (DCGM_FI_DEV_ENC_UTIL{job="dcgm-exporter"})
# NVDEC utilization (%)
avg by (gpu, hostname) (DCGM_FI_DEV_DEC_UTIL{job="dcgm-exporter"})
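The same queries can be run outside Grafana through the Prometheus HTTP API (`/api/v1/query`). This sketch only builds the request URL and flattens an instant-vector response. The base URL assumes the port-forward from earlier, and the helper names are hypothetical.

```python
from urllib.parse import urlencode


def instant_query_url(promql, base_url="http://localhost:9090"):
    """Build the URL for an instant query against the Prometheus HTTP API."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"


def vector_rows(payload):
    """Flatten an instant-vector response into (labels, float value) tuples."""
    result = payload.get("data", {}).get("result", [])
    return [(series["metric"], float(series["value"][1])) for series in result]


def mib_to_gib(mib):
    """DCGM reports framebuffer usage in MiB; 1 GiB = 1024 MiB."""
    return mib / 1024


# Usage: open the URL with urllib.request.urlopen, json.load the response,
# then pass the decoded dict to vector_rows().
print(instant_query_url('DCGM_FI_DEV_FB_USED{job="dcgm-exporter"}'))
```

Sample values come back as strings in the JSON, which is why `vector_rows` converts them to `float`.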
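Use kube_pod_info, kube_pod_container_info (both from kube-state-metrics), or kube_pod_container_resource_limits to relate Pod → Node, then use the Node to relate to the DCGM metrics. Advanced: you can add a combined query, for example to show "the average GPU utilization of a given Deployment":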
avg by (deployment) (
  label_replace(
    DCGM_FI_DEV_GPU_UTIL{job="dcgm-exporter"},
    "deployment", "$1",
    "kubernetes_io_metadata_name", "(.*)"
  )
)
(Adjust the label names to whatever is attached automatically in your environment.)
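To show the Pod → Node → GPU idea without depending on any particular label setup, here is the same join done by hand in Python. It is a rough sketch under loud assumptions: the Pod records look like `kube_pod_info` samples (labels `node` and `created_by_name`), the GPU series carry the node name in a `Hostname` label, and that value matches the Kubernetes node name. Check all three in your cluster.

```python
from collections import defaultdict
from statistics import mean


def avg_gpu_util_by_owner(pod_info, gpu_util):
    """
    pod_info: list of label dicts shaped like kube_pod_info samples.
    gpu_util: list of (labels, value) pairs; labels hold the node under 'Hostname' (assumed).
    Returns {owner name: average utilization of the GPUs on the nodes running its Pods}.
    """
    util_by_node = defaultdict(list)
    for labels, value in gpu_util:
        util_by_node[labels["Hostname"]].append(value)

    per_owner = defaultdict(list)
    for pod in pod_info:
        node_values = util_by_node.get(pod["node"])
        if node_values:  # skip Pods on nodes without GPU series
            per_owner[pod["created_by_name"]].append(mean(node_values))
    return {owner: mean(values) for owner, values in per_owner.items()}


pods = [
    {"node": "node-a", "pod": "isaac-1", "created_by_name": "isaac-rs"},
    {"node": "node-b", "pod": "web-1", "created_by_name": "web-rs"},
]
gpus = [
    ({"Hostname": "node-a", "gpu": "0"}, 40.0),
    ({"Hostname": "node-a", "gpu": "1"}, 60.0),
]
print(avg_gpu_util_by_owner(pods, gpus))
```

Two caveats. For a Deployment, `created_by_name` is the ReplicaSet, so you still need one more hop to reach the Deployment name. And a node-level join credits every GPU on the node to every Pod on it, which is only accurate when one workload owns the whole node.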
# gpu-alerts.yaml (a PrometheusRule, for use with the Prometheus Operator)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-alerts
  namespace: monitoring
  labels:
    release: kube-prometheus-stack # lets the chart's default rule selector pick this rule up
spec:
  groups:
    - name: gpu.rules
      rules:
        - alert: GPUHighTemperature
          expr: max_over_time(DCGM_FI_DEV_GPU_TEMP{job="dcgm-exporter"}[5m]) > 80
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "GPU temperature high"
            description: "GPU temp > 80°C for 5m on {{ $labels.hostname }} (gpu={{ $labels.gpu }})"
        - alert: GPUNearPowerLimit
          expr: |
            avg_over_time(DCGM_FI_DEV_POWER_USAGE{job="dcgm-exporter"}[5m])
              / avg_over_time(DCGM_FI_DEV_POWER_MGMT_LIMIT{job="dcgm-exporter"}[5m]) > 0.9
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "GPU power near limit"
            description: "Power > 90% of limit for 10m on {{ $labels.hostname }} (gpu={{ $labels.gpu }})"
        - alert: GPUStuckLowUtilization
          expr: avg_over_time(DCGM_FI_DEV_GPU_UTIL{job="dcgm-exporter"}[30m]) < 10
          for: 30m
          labels:
            severity: info
          annotations:
            summary: "GPU long low utilization"
            description: "GPU util < 10% for 30m on {{ $labels.hostname }} (gpu={{ $labels.gpu }})"
kubectl apply -f gpu-alerts.yaml
Adjust the thresholds for your environment (80°C and 90% are only examples). If you run Alertmanager, configure a notification channel (Email/Slack/Webhook).
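When tuning thresholds it helps to have a feel for what the `for:` clause does: the expression has to stay true across consecutive evaluations for the whole duration before the alert moves from pending to firing. The Python below is a simplified model of that behaviour, not how Prometheus is implemented. The function name and the one-minute sample spacing are assumptions for illustration.

```python
def firing_since(samples, threshold, for_seconds):
    """
    samples: list of (unix_ts, value), sorted by time, one per rule evaluation.
    Returns the timestamp at which the alert would start firing, or None.
    """
    pending_since = None
    for ts, value in samples:
        if value > threshold:
            if pending_since is None:
                pending_since = ts  # condition just became true: alert is pending
            if ts - pending_since >= for_seconds:
                return ts  # held long enough: alert fires
        else:
            pending_since = None  # any dip below the threshold resets the clock
    return None


# Temperature readings one minute apart; the dip at t=120 resets the pending state.
temps = [(0, 85), (60, 86), (120, 79), (180, 85), (240, 85),
         (300, 85), (360, 85), (420, 85), (480, 85)]
print(firing_since(temps, threshold=80, for_seconds=300))
```

This is why a GPU that briefly cools down never triggers the alert. If that hides real problems in your environment, shorten `for:` or smooth the expression over a longer window.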
If you are familiar with Grafana provisioning, you can import the dashboard automatically from a ConfigMap:
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboards-gpu
  namespace: monitoring
  labels:
    grafana_dashboard: "1"
    app.kubernetes.io/name: kube-prometheus-stack
    app.kubernetes.io/instance: kube-prometheus-stack
    app.kubernetes.io/part-of: grafana-dashboards
data:
  gpu-overview.json: |
    {
      "title": "GPU Overview (DCGM)",
      "timezone": "browser",
      "panels": [
        {"type":"timeseries","title":"GPU Util %","targets":[{"expr":"avg by (gpu,hostname) (DCGM_FI_DEV_GPU_UTIL{job=\"dcgm-exporter\"})"}]},
        {"type":"timeseries","title":"FB Used (GiB)","targets":[{"expr":"avg by (gpu,hostname) (DCGM_FI_DEV_FB_USED{job=\"dcgm-exporter\"}) / 1024"}]},
        {"type":"timeseries","title":"Temp °C","targets":[{"expr":"max by (gpu,hostname) (DCGM_FI_DEV_GPU_TEMP{job=\"dcgm-exporter\"})"}]},
        {"type":"timeseries","title":"Power (W)","targets":[{"expr":"avg by (gpu,hostname) (DCGM_FI_DEV_POWER_USAGE{job=\"dcgm-exporter\"})"}]},
        {"type":"timeseries","title":"NVENC %","targets":[{"expr":"avg by (gpu,hostname) (DCGM_FI_DEV_ENC_UTIL{job=\"dcgm-exporter\"})"}]},
        {"type":"timeseries","title":"NVDEC %","targets":[{"expr":"avg by (gpu,hostname) (DCGM_FI_DEV_DEC_UTIL{job=\"dcgm-exporter\"})"}]}
      ]
    }
The Grafana instance in kube-prometheus-stack automatically detects the grafana_dashboard: "1" label and imports the dashboard.
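Hand-writing dashboard JSON with escaped quotes gets error-prone once you have more than a few panels. One option is to generate it. The sketch below builds the same minimal structure (title, timezone, a list of timeseries panels) for a subset of three panels and lets `json.dumps` handle the escaping. The function names are hypothetical, and this is only the bare skeleton used in this post, not a full Grafana dashboard model.

```python
import json

# (panel title, PromQL expression) pairs; a subset of the panels in this post.
PANELS = [
    ("GPU Util %", 'avg by (gpu,hostname) (DCGM_FI_DEV_GPU_UTIL{job="dcgm-exporter"})'),
    ("Temp °C", 'max by (gpu,hostname) (DCGM_FI_DEV_GPU_TEMP{job="dcgm-exporter"})'),
    ("Power (W)", 'avg by (gpu,hostname) (DCGM_FI_DEV_POWER_USAGE{job="dcgm-exporter"})'),
]


def build_dashboard(title, panels):
    """Return a dict in the minimal dashboard shape used in this post."""
    return {
        "title": title,
        "timezone": "browser",
        "panels": [
            {"type": "timeseries", "title": panel_title, "targets": [{"expr": expr}]}
            for panel_title, expr in panels
        ],
    }


def render(title="GPU Overview (DCGM)", panels=PANELS):
    """Serialize the dashboard; ensure_ascii=False keeps '°' readable."""
    return json.dumps(build_dashboard(title, panels), indent=2, ensure_ascii=False)


print(render())
```

Paste the output under the `gpu-overview.json` key of the ConfigMap, indented to sit inside the YAML block scalar.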
Hang in there a little longer and it's another long weekend!!!