Challenges of LLM inference:
Problems with a fixed-size deployment:
Goals of auto scaling:
User → Inference Service (Pods) → Metrics Exporter → Prometheus Adapter → HPA
Example goal: build an LLM inference service that scales dynamically based on GPU utilization and request volume.
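The autoscaling objects below all target a Deployment named llm-inference in the llm namespace, which is not shown in the original manifests. A minimal sketch is included here for context; the image, serving port, and GPU request are placeholders, not values from the source:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference
  namespace: llm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llm-inference
  template:
    metadata:
      labels:
        app: llm-inference
    spec:
      containers:
      - name: llm-inference
        image: myrepo/llm-inference:latest   # placeholder image
        ports:
        - name: http
          containerPort: 8000                # placeholder serving/metrics port
        resources:
          limits:
            nvidia.com/gpu: 1                # assumption: one GPU per replica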
# dcgm-exporter DaemonSet: exposes per-GPU metrics such as DCGM_FI_DEV_GPU_UTIL on port 9400 of each node
# (in practice you would usually restrict it to GPU nodes with a nodeSelector or tolerations)
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  template:
    metadata:
      labels:
        app: dcgm-exporter
    spec:
      containers:
      - name: dcgm-exporter
        image: nvidia/dcgm-exporter:3.1.8-2.6.10
        env:
        - name: DCGM_EXPORTER_KUBERNETES   # assumption: enables pod-level attribution so GPU metrics carry pod/namespace labels
          value: "true"
        ports:
        - containerPort: 9400
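The adapter rules in the next manifest assume that DCGM_FI_DEV_GPU_UTIL is already being scraped into Prometheus. Assuming the Prometheus Operator (for example via kube-prometheus-stack) is installed, a Service plus ServiceMonitor for the exporter could look like the sketch below; with a plain Prometheus deployment, an equivalent scrape_config entry would be needed instead:

apiVersion: v1
kind: Service
metadata:
  name: dcgm-exporter
  namespace: monitoring
  labels:
    app: dcgm-exporter
spec:
  selector:
    app: dcgm-exporter
  ports:
  - name: metrics
    port: 9400
    targetPort: 9400
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  endpoints:
  - port: metrics
    interval: 15s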
# prometheus-adapter rules: expose DCGM_FI_DEV_GPU_UTIL through the custom metrics API as "gpu_utilization"
apiVersion: v1
kind: ConfigMap
metadata:
  name: custom-metrics-config
  namespace: monitoring
data:
  config.yaml: |
    rules:
    - seriesQuery: 'DCGM_FI_DEV_GPU_UTIL{namespace!="",pod!=""}'
      resources:
        overrides:
          namespace: {resource: "namespace"}
          pod: {resource: "pod"}
      name:
        matches: "DCGM_FI_DEV_GPU_UTIL"
        as: "gpu_utilization"
      # <<.LabelMatchers>> and <<.GroupBy>> are filled in by the adapter so the query
      # is scoped to exactly the pods the HPA asks about
      metricsQuery: avg(DCGM_FI_DEV_GPU_UTIL{<<.LabelMatchers>>}) by (<<.GroupBy>>)
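On its own, this ConfigMap does nothing until the Prometheus Adapter loads it. With the prometheus-community/prometheus-adapter Helm chart, the rules are normally supplied through chart values instead; the following is only a rough sketch of a manually managed adapter Deployment mounting the ConfigMap via --config. The image tag and Prometheus URL are assumptions, and the ServiceAccount, RBAC, and APIService registration the adapter also needs are omitted:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus-adapter
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus-adapter
  template:
    metadata:
      labels:
        app: prometheus-adapter
    spec:
      serviceAccountName: prometheus-adapter   # assumes RBAC and the custom.metrics.k8s.io APIService are set up separately
      containers:
      - name: prometheus-adapter
        image: registry.k8s.io/prometheus-adapter/prometheus-adapter:v0.11.2   # assumption: pick a current tag
        args:
        - --prometheus-url=http://prometheus.monitoring.svc.cluster.local:9090
        - --config=/etc/adapter/config.yaml
        - --secure-port=6443
        - --cert-dir=/tmp/cert
        volumeMounts:
        - name: config
          mountPath: /etc/adapter
      volumes:
      - name: config
        configMap:
          name: custom-metrics-config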
# HPA: scales llm-inference between 1 and 10 replicas on the custom per-pod metric gpu_utilization
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-inference-hpa
  namespace: llm
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-inference
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: gpu_utilization
      target:
        type: AverageValue
        averageValue: "80"   # DCGM_FI_DEV_GPU_UTIL is reported in percent, so this targets ~80% average GPU utilization
# KEDA ScaledObject: scales llm-inference based on the request rate reported by Prometheus
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llm-inference-keda
  namespace: llm
spec:
  scaleTargetRef:
    kind: Deployment
    name: llm-inference
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
      metricName: http_requests_per_second
      # "query" is required by the prometheus scaler; the metric and labels below are
      # assumptions about what the inference service exposes, so adjust to your setup
      query: sum(rate(http_requests_total{namespace="llm"}[1m]))
      threshold: "10"
The Deployment scales out automatically once the HTTP request rate exceeds 10 req/s.
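For this trigger to fire, Prometheus must actually be collecting an http_requests_total style counter from the inference pods. Assuming the Prometheus Operator is installed, that the serving framework exposes Prometheus metrics on /metrics of its HTTP port, and that your Prometheus instance is configured to select PodMonitors in the llm namespace, a minimal sketch could be:

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: llm-inference
  namespace: llm
spec:
  selector:
    matchLabels:
      app: llm-inference
  podMetricsEndpoints:
  - port: http        # named container port from the Deployment sketch earlier
    path: /metrics    # assumption: where the serving framework exposes its metrics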
# CronJob: runs a batch inference job every 30 minutes
apiVersion: batch/v1
kind: CronJob
metadata:
  name: llm-batch-inference
spec:
  schedule: "*/30 * * * *"
  concurrencyPolicy: Forbid   # assumption: skip a new run if the previous batch is still running
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: llm-batch
            image: myrepo/llm-batch:latest
            command: ["python", "batch_infer.py"]
          restartPolicy: OnFailure
A batch inference job is launched automatically every 30 minutes.
Problem: GPU Pods typically take 30–90 seconds to start.
Solutions:
Auto scaling is only the first step; the ultimate goal is: