This article puts the two solutions introduced yesterday to the test: a preStop hook combined with terminationGracePeriodSeconds, and enabling the AWS Load Balancer Controller Pod Readiness Gates feature.
The test environment uses AWS Load Balancer Controller v2.4.3:
$ kubectl -n kube-system get deploy aws-load-balancer-controller -o yaml | grep "image: "
image: 602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon/aws-load-balancer-controller:v2.4.3
Label the namespace with elbv2.k8s.aws/pod-readiness-gate-inject=enabled to enable the Pod Readiness Gates feature:
$ kubectl label namespace demo-nginx elbv2.k8s.aws/pod-readiness-gate-inject=enabled
namespace/demo-nginx labeled
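To double-check that the label is now present on the namespace, an optional lookup can be used:
$ kubectl get namespace demo-nginx --show-labels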
The contents of nginx-demo-ingress.yaml are shown below. The game 2048 image is used as the test server: in earlier test runs the Nginx image terminated too quickly to tell whether any HTTP 500-level errors actually occurred, so it was replaced with the game 2048 image. Since game 2048 needs roughly 40 seconds to terminate completely, the preStop hook sleeps for about 40 seconds, and terminationGracePeriodSeconds is extended by an extra 20 seconds to 60 seconds.
$ cat ./nginx-demo-ingress.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: "demo-nginx"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: "demo-nginx-deployment"
  namespace: "demo-nginx"
spec:
  selector:
    matchLabels:
      app: "demo-nginx"
  replicas: 10
  template:
    metadata:
      labels:
        app: "demo-nginx"
        role: "backend"
    spec:
      terminationGracePeriodSeconds: 60
      containers:
      # - image: nginx:latest
      - image: public.ecr.aws/l6m2t8p7/docker-2048:latest
        imagePullPolicy: Always
        name: "demo-nginx"
        ports:
        - containerPort: 80
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sh", "-c", "sleep 40"]
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
---
apiVersion: v1
kind: Service
metadata:
  name: "service-demo-nginx"
  namespace: "demo-nginx"
spec:
  ports:
    - port: 80
      targetPort: 80
      protocol: TCP
  selector:
    app: "demo-nginx"
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  namespace: "demo-nginx"
  name: "ingress-nginx"
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
spec:
  rules:
    - http:
        paths:
          - path: "/"
            pathType: Prefix
            backend:
              service:
                name: service-demo-nginx
                port:
                  number: 80
$ kubectl apply -f ./nginx-demo-ingress.yaml
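Before generating traffic, it can be worth confirming that the readiness gate was actually injected into the Pod spec. One possible check using custom columns (the column names here are arbitrary) is shown below; each Pod should list a condition type carrying the target-health.elbv2.k8s.aws prefix:
$ kubectl -n demo-nginx get pods -o custom-columns='NAME:.metadata.name,READINESS_GATES:.spec.readinessGates[*].conditionType'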
Use curl to hit the ALB endpoint every 0.5 seconds:
$ counter=1; for i in {1..200}; do echo $counter; counter=$((counter+1)); sleep 0.5; curl -I k8s-demongin-ingressn-843da381ab-1111111111.eu-west-1.elb.amazonaws.com >> test-result.txt; done
$ kubectl -n demo-nginx set image deploy demo-nginx-deployment demo-nginx="nginx:latest" && kubectl -n demo-nginx rollout status deploy demo-nginx-deployment
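While the rollout is running, the Pods can optionally be watched in a separate terminal; old Pods should linger in the Terminating state for roughly the 40-second preStop sleep before disappearing:
$ kubectl -n demo-nginx get pods -w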
Inspecting the output in test-result.txt shows no HTTP 500-level responses, and the Server header can be seen gradually switching from nginx to nginx/1.23.1:
$ grep "50[2|3|4]" ./test-result.txt
$ grep "Server: " ./test-result.txt
...
Server: nginx
Server: nginx
Server: nginx
Server: nginx
Server: nginx/1.23.1
Server: nginx/1.23.1
Server: nginx
Server: nginx/1.23.1
Server: nginx
Server: nginx/1.23.1
Server: nginx
Server: nginx
Server: nginx/1.23.1
Server: nginx
Server: nginx/1.23.1
Server: nginx/1.23.1
Server: nginx
Server: nginx/1.23.1
Server: nginx/1.23.1
Server: nginx/1.23.1
Server: nginx/1.23.1
Server: nginx/1.23.1
Server: nginx/1.23.1
Server: nginx/1.23.1
...
...
$ kubectl delete ns demo-nginx --grace-period=0
Next, test exposing the service through an NLB instead, keeping the same terminationGracePeriodSeconds and preStop settings. The contents of nginx-demo-nlb.yaml:
$ cat ./nginx-demo-nlb.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: "demo-nginx"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: "demo-nginx-deployment"
  namespace: "demo-nginx"
spec:
  selector:
    matchLabels:
      app: "demo-nginx"
  replicas: 10
  template:
    metadata:
      labels:
        app: "demo-nginx"
        role: "backend"
    spec:
      terminationGracePeriodSeconds: 60
      containers:
      # - image: nginx:latest
      - image: public.ecr.aws/l6m2t8p7/docker-2048:latest
        imagePullPolicy: Always
        name: "demo-nginx"
        ports:
        - containerPort: 80
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sh", "-c", "sleep 40"]
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
---
apiVersion: v1
kind: Service
metadata:
  name: "service-demo-nginx"
  namespace: "demo-nginx"
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: nlb
    service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
spec:
  loadBalancerClass: service.k8s.aws/nlb
  type: LoadBalancer
  ports:
    - port: 80
      targetPort: 80
      protocol: TCP
  selector:
    app: "demo-nginx"
$ kubectl apply -f ./nginx-demo-nlb.yaml
namespace/demo-nginx created
deployment.apps/demo-nginx-deployment created
service/service-demo-nginx created
$ kubectl -n demo-nginx get svc -o wide
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR
service/service-demo-nginx LoadBalancer 10.100.58.251 k8s-demongin-serviced-0d3bf709e1-1111111111111111.elb.eu-west-1.amazonaws.com 80:30415/TCP 5m app=demo-nginx
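With the Service created, you can optionally confirm that the NLB endpoint responds before continuing; newly registered targets may take a few minutes to pass their initial health checks:
$ curl -I k8s-demongin-serviced-0d3bf709e1-1111111111111111.elb.eu-west-1.amazonaws.com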
Label the namespace with elbv2.k8s.aws/pod-readiness-gate-inject=enabled to enable the Pod Readiness Gates feature:
$ kubectl label namespace demo-nginx elbv2.k8s.aws/pod-readiness-gate-inject=enabled
namespace/demo-nginx labeled
If the existing Pods do not have the readiness gate condition added, restart the Deployment so that new Pods are created with it:
$ kubectl -n demo-nginx rollout restart deploy demo-nginx-deployment
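After the restart, the injected readiness gate and its condition status can be inspected on any of the new Pods (the pod name below is a placeholder):
$ kubectl -n demo-nginx describe pod <pod-name> | grep -A 3 "Readiness Gates"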
Check the Target Group that the controller created for the NLB:
$ aws elbv2 describe-target-groups --load-balancer-arn arn:aws:elasticloadbalancing:eu-west-1:111111111111:loadbalancer/net/k8s-demongin-serviced-0d3bf709e1/1111111111111111
{
"TargetGroups": [
{
"TargetGroupArn": "arn:aws:elasticloadbalancing:eu-west-1:111111111111:targetgroup/k8s-demongin-serviced-4a99c83f08/c9a702f211fbfc53",
"TargetGroupName": "k8s-demongin-serviced-4a99c83f08",
"Protocol": "TCP",
"Port": 80,
"VpcId": "vpc-0b150640c72b722db",
"HealthCheckProtocol": "TCP",
"HealthCheckPort": "traffic-port",
"HealthCheckEnabled": true,
"HealthCheckIntervalSeconds": 10,
"HealthCheckTimeoutSeconds": 10,
"HealthyThresholdCount": 3,
"UnhealthyThresholdCount": 3,
"LoadBalancerArns": [
"arn:aws:elasticloadbalancing:eu-west-1:111111111111:loadbalancer/net/k8s-demongin-serviced-0d3bf709e1/1111111111111111"
],
"TargetType": "ip",
"IpAddressType": "ipv4"
}
]
}
You can also confirm that the Target IDs are IP addresses:
$ aws elbv2 describe-target-health --target-group-arn arn:aws:elasticloadbalancing:eu-west-1:111111111111:targetgroup/k8s-demongin-serviced-4a99c83f08/c9a702f211fbfc53 --query 'TargetHealthDescriptions[].Target' --output table
------------------------------------------------
| DescribeTargetHealth |
+-------------------+------------------+-------+
| AvailabilityZone | Id | Port |
+-------------------+------------------+-------+
| eu-west-1a | 192.168.0.223 | 80 |
| eu-west-1c | 192.168.44.248 | 80 |
| eu-west-1a | 192.168.19.10 | 80 |
| eu-west-1c | 192.168.41.36 | 80 |
| eu-west-1b | 192.168.76.79 | 80 |
| eu-west-1b | 192.168.83.233 | 80 |
| eu-west-1a | 192.168.22.141 | 80 |
| eu-west-1b | 192.168.80.209 | 80 |
| eu-west-1b | 192.168.66.152 | 80 |
| eu-west-1b | 192.168.66.129 | 80 |
| eu-west-1c | 192.168.55.7 | 80 |
| eu-west-1c | 192.168.61.246 | 80 |
| eu-west-1c | 192.168.49.142 | 80 |
| eu-west-1c | 192.168.45.23 | 80 |
| eu-west-1b | 192.168.67.240 | 80 |
+-------------------+------------------+-------+
Use curl to hit the NLB endpoint every 0.5 seconds. Because NLB targets take longer to pass their initial health checks, the test duration is extended:
$ counter=1; for i in {1..600}; do echo $counter; counter=$((counter+1)); sleep 0.5; curl -I k8s-demongin-serviced-0d3bf709e1-1111111111111111.elb.eu-west-1.amazonaws.com >> test-nlb-result.txt; done
$ kubectl -n demo-nginx set image deploy demo-nginx-deployment demo-nginx="nginx:latest" && kubectl -n demo-nginx rollout status deploy demo-nginx-deployment
During the Pod update, checking the Target Group health status with aws elbv2 describe-target-health shows targets in both the initial and draining states:
$ aws elbv2 describe-target-health --target-group-arn arn:aws:elasticloadbalancing:eu-west-1:111111111111:targetgroup/k8s-demongin-serviced-4a99c83f08/c9a702f211fbfc53 --query 'TargetHealthDescriptions[].[Target.Id,TargetHealth.State]' --output table
--------------------------------
| DescribeTargetHealth |
+-----------------+------------+
| 192.168.48.65 | draining |
| 192.168.0.136 | initial |
| 192.168.25.57 | healthy |
| 192.168.84.107 | healthy |
| 192.168.22.132 | draining |
| 192.168.69.15 | healthy |
| 192.168.56.52 | draining |
| 192.168.19.10 | draining |
| 192.168.41.36 | initial |
| 192.168.80.209 | initial |
| 192.168.66.152 | healthy |
| 192.168.48.198 | healthy |
| 192.168.56.132 | healthy |
| 192.168.66.129 | healthy |
| 192.168.55.7 | initial |
| 192.168.49.142 | draining |
| 192.168.67.240 | initial |
| 192.168.45.23 | healthy |
+-----------------+------------+
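To follow these state transitions continuously while the rollout is running, the same describe-target-health call can simply be repeated in a loop (the 5-second interval is arbitrary):
$ while true; do aws elbv2 describe-target-health --target-group-arn arn:aws:elasticloadbalancing:eu-west-1:111111111111:targetgroup/k8s-demongin-serviced-4a99c83f08/c9a702f211fbfc53 --query 'TargetHealthDescriptions[].[Target.Id,TargetHealth.State]' --output table; sleep 5; done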
Inspecting the output in test-nlb-result.txt shows no HTTP 500-level responses, and the Server header can again be seen gradually switching from nginx to nginx/1.23.1:
$ grep "50[2|3|4]" ./test-nlb-result.txt
$ grep "Server: " ./test-nlb-result.txt
...
...
...
Server: nginx
Server: nginx/1.23.1
Server: nginx/1.23.1
Server: nginx
Server: nginx/1.23.1
Server: nginx
Server: nginx
Server: nginx
Server: nginx
Server: nginx/1.23.1
Server: nginx
Server: nginx/1.23.1
Server: nginx
Server: nginx/1.23.1
Server: nginx
Server: nginx
Server: nginx
Server: nginx
Server: nginx
Server: nginx
Server: nginx
Server: nginx
Server: nginx
Server: nginx/1.23.1
Server: nginx
Server: nginx
Server: nginx/1.23.1
Server: nginx/1.23.1
Server: nginx
Server: nginx/1.23.1
Server: nginx
Server: nginx
Server: nginx/1.23.1
Server: nginx
Server: nginx
Server: nginx
Server: nginx/1.23.1
Server: nginx
Server: nginx/1.23.1
Server: nginx
Server: nginx
Server: nginx/1.23.1
Server: nginx
...
...
...
Whether the service is exposed through an ALB as the Ingress endpoint or an NLB as the Service endpoint, using a preStop hook together with the Pod Readiness Gates feature ensures that the Deployment does not encounter HTTP 500-level errors during a rolling update.