Picking up from Day 3 (Cilium ready), today we deploy the NVIDIA GPU Operator onto our minimal cluster (1 control plane + 1 worker with 8 GPUs). It automates installation of the Driver, Container Toolkit, Device Plugin, and GPU Feature Discovery (GFD); we then verify that the nvidia.com/gpu resource is available, and finish by emitting our first GPU monitoring metrics with the DCGM Exporter.
Installing the driver, Container Toolkit, Device Plugin, GFD, and DCGM Exporter by hand on every node is not only time-consuming but also hard to keep consistent. The GPU Operator packages these components the Helm/Operator way, so they stay automated, upgradable, and monitorable, and it exposes GPUs to the scheduler as the nvidia.com/gpu resource.

Prerequisites:
- gpu01 (the 8-GPU worker) has no manually installed NVIDIA driver, and no leftover nvidia or nouveau kernel modules are still loaded.
- containerd is installed and uses the systemd cgroup driver (set up on Day 2).
- Commands are run on the control plane cp1, with kubectl pointing at the cluster.
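A quick preflight sketch before installing (assuming Ubuntu-style paths and the containerd config from Day 2; adjust paths and node names to your environment):

# On gpu01: make sure no conflicting kernel modules are loaded
lsmod | grep -Ei 'nouveau|nvidia' || echo "no nvidia/nouveau modules loaded"
# Confirm containerd is running and uses the systemd cgroup driver (Day 2)
systemctl is-active containerd
grep -n 'SystemdCgroup' /etc/containerd/config.toml
# On cp1: confirm kubectl reaches the cluster and sees both nodes
kubectl get nodes -o wide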
helm repo add nvidia https://nvidia.github.io/gpu-operator
helm repo update
# Recommended: use a dedicated namespace
kubectl create namespace gpu-operator || true
# Install with default values (includes Driver/Toolkit/Device Plugin/GFD/DCGM Exporter)
helm install gpu-operator \
-n gpu-operator \
nvidia/gpu-operator
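To confirm the release went through, check the Helm status and the Operator's ClusterPolicy custom resource (a sketch; in my installs the chart creates the ClusterPolicy CRD and a CR named cluster-policy, verify against your chart version):

helm -n gpu-operator list
# The Operator drives all component installs through this ClusterPolicy resource
kubectl get clusterpolicies.nvidia.com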
Wait a few minutes while the driver DaemonSet compiles and loads the kernel modules on gpu01. During this time nvidia-driver-daemonset moves from Init to Running, and the Device Plugin and GFD pods become Ready one after another.
kubectl -n gpu-operator get pods -o wide
# You should see something like:
# nvidia-operator-validator-xxxxx
# nvidia-driver-daemonset-xxxxx
# nvidia-container-toolkit-daemonset-xxxxx
# nvidia-device-plugin-daemonset-xxxxx
# nvidia-gpu-feature-discovery-xxxxx
# nvidia-dcgm-exporter-xxxxx
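If the driver pod looks stuck, tailing its logs usually shows the kernel-module build progress. A sketch (the DaemonSet name matches the listing above; the validator label is my assumption, check your pods' labels first):

# Follow the driver build/load; the first install can take several minutes
kubectl -n gpu-operator logs ds/nvidia-driver-daemonset --tail=30
kubectl -n gpu-operator rollout status ds/nvidia-driver-daemonset --timeout=15m
# The validator summarizes whether the driver/toolkit/plugin checks all passed
kubectl -n gpu-operator logs -l app=nvidia-operator-validator --tail=20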
Confirm that nvidia.com/gpu is allocatable and schedulable:
kubectl get nodes gpu01 -o json | jq '.status.allocatable, .metadata.labels'
# You should see something like:
# "nvidia.com/gpu": "8"
# plus the labels applied automatically by GFD (e.g. GPU model, compute capability, MIG support)
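To list only the GFD labels instead of the whole label set, a small jq filter helps (a sketch; it keeps keys starting with nvidia.com/):

kubectl get node gpu01 -o json \
  | jq '.metadata.labels | with_entries(select(.key | startswith("nvidia.com/")))'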
# cuda-test.yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:12.4.1-runtime-ubuntu22.04
    # nvidia-smi is the core check; the PyTorch probe only succeeds if the image ships python3 + torch
    command: ["bash", "-lc", "nvidia-smi && (python3 -c 'import torch; print(\"CUDA avail:\", torch.cuda.is_available())' || echo 'no PyTorch in this image, skipping torch check')"]
    resources:
      limits:
        nvidia.com/gpu: 1
kubectl apply -f cuda-test.yaml
kubectl logs -f pod/cuda-test
# Expect nvidia-smi output; the torch line prints True only if the image actually ships PyTorch
For a more minimal CUDA check, switch the image to nvidia/cuda:12.4.1-base-ubuntu22.04 and run just nvidia-smi, or use the CUDA Samples.
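A sketch of that minimal variant, applied inline (assumes the public Docker Hub image tag; swap in your registry mirror if you use one):

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi-test
spec:
  restartPolicy: Never
  containers:
  - name: nvidia-smi
    image: nvidia/cuda:12.4.1-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
kubectl logs -f pod/nvidia-smi-test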
The GPU Operator runs the DCGM Exporter for you by default. First confirm the Service:
kubectl -n gpu-operator get svc | grep dcgm
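Before wiring it into Prometheus, you can eyeball the raw metrics through a port-forward (the Service name and port 9400 are the defaults I get from the GPU Operator; confirm them with the command above):

kubectl -n gpu-operator port-forward svc/nvidia-dcgm-exporter 9400:9400 &
sleep 2
# GPU utilization and framebuffer usage are among the default DCGM counters
curl -s http://localhost:9400/metrics | grep -E 'DCGM_FI_DEV_(GPU_UTIL|FB_USED)' | head
kill $!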
If you already have Prometheus with kubernetes-service-endpoints auto-discovery (a common default in Helm kube-state-metrics / prometheus-operator setups), it can pick up the nvidia-dcgm-exporter metrics directly. With the Prometheus Operator you can also declare a ServiceMonitor explicitly:
# servicemonitor-dcgm.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: gpu-operator
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: nvidia-dcgm-exporter
  namespaceSelector:
    matchNames: [gpu-operator]
  endpoints:
  - port: metrics
    interval: 15s
kubectl apply -f servicemonitor-dcgm.yaml
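Double-check that the Service's labels and port name really match the selector and port above (kubectl -n gpu-operator get svc -o yaml on the dcgm Service), since different GPU Operator releases have used different label schemes. Once Prometheus picks up the target, a quick sanity check against its API (a sketch; the monitoring namespace and prometheus-operated Service assume a typical kube-prometheus-stack install, adjust to yours):

kubectl -n monitoring port-forward svc/prometheus-operated 9090:9090 &
sleep 2
# A non-empty result count means the dcgm-exporter target is being scraped
curl -s 'http://localhost:9090/api/v1/query?query=DCGM_FI_DEV_GPU_UTIL' | jq '.data.result | length'
kill $!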
If the node already has the NVIDIA driver installed, let the Operator skip it:
helm upgrade --install gpu-operator -n gpu-operator nvidia/gpu-operator \
--set driver.enabled=false

If both the driver and the Container Toolkit are pre-installed:
helm upgrade --install gpu-operator -n gpu-operator nvidia/gpu-operator \
--set driver.enabled=false \
--set toolkit.enabled=false
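If you disable the toolkit, containerd must already know about the NVIDIA runtime. A minimal sketch of pre-configuring it with nvidia-ctk on the GPU node (assumes containerd; other runtimes need their own steps):

# Register the nvidia runtime in containerd's config and make it the default
sudo nvidia-ctk runtime configure --runtime=containerd --set-as-default
sudo systemctl restart containerd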
Either way, nvidia-ctk must have written the corresponding configuration in advance (as in the sketch above); adjust it for the runtime you actually use.

Wait, I have to write one of these even on holidays!? That's way too much......