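2024 iThome Ironman Contest

DAY 21

Kubernetes

Series: 都什麼年代了，還在學 Kubernetes ("What year is it, and you're still learning Kubernetes?"), part 21

Learning Kubernetes, Day 21 - Pod - Probe

Tags: 16th Ironman Contest, kubernetes, k8s

vincentlin2447

Team: Grafana 科研遠征小隊

2024-10-05 00:39:00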

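In modern application architectures, keeping services healthy and uninterrupted is a baseline requirement. In a traditional virtual machine (VM) environment, you have to build additional, often complex, check workflows by hand to keep services stable. Docker does offer a health check mechanism for monitoring container state, but it is comparatively simple and lacks flexibility and depth. In Kubernetes, Probes provide a more automated and flexible health check mechanism that helps keep applications highly available and stable.

What is a Probe

In Kubernetes, a Probe is the mechanism used to check a container's health. The kubelet runs each probe periodically against the container to confirm that it is working properly. Based on the results, Kubernetes takes appropriate action, such as restarting a failed container or removing an unhealthy one from service.

Why Probes are needed

The main purpose of Probes is to keep applications running on Kubernetes highly available and stable. Probes help achieve the following:

Automated health checks: container health is checked periodically, so failures are detected and handled promptly.
Service availability: when a container fails, a Probe can trigger a restart or take the container out of rotation so the service keeps running.
Less manual intervention: automated health checks reduce the manual work required from operators.

Types of Probes

Kubernetes provides three types of Probe for health checking: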

Liveness Probe

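Purpose: checks whether the container is still running properly. If the check fails, Kubernetes restarts the container, which recovers from problems such as deadlocks or infinite loops.
Use case: long-running applications that must stay stable.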

Readiness Probe

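Purpose: checks whether the container is ready to handle traffic. If the check fails, the Pod is removed from Service load balancing until it recovers, so only healthy containers receive requests.
Use case: applications that start slowly or are temporarily unable to serve requests.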

Startup Probe

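Purpose: checks whether the container has started successfully. If it has not started within the allowed time, Kubernetes restarts the container.
Use case: applications with a complex, time-consuming startup.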
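
The difference between the three types comes down to what Kubernetes does when a probe keeps failing. The sketch below only restates the descriptions above as a lookup; the type and function names (ProbeKind, actionOnFailure) are made up for illustration and are not part of any Kubernetes API.

```go
package main

import "fmt"

// ProbeKind names the three probe types described above.
// This type is hypothetical and exists only for this illustration.
type ProbeKind int

const (
	Liveness ProbeKind = iota
	Readiness
	Startup
)

// actionOnFailure returns a short description of what Kubernetes does once
// a probe of the given kind has failed enough times in a row.
// Restarts are still subject to the Pod's restartPolicy.
func actionOnFailure(kind ProbeKind) string {
	switch kind {
	case Liveness:
		return "restart the container"
	case Readiness:
		return "stop routing Service traffic to the Pod"
	case Startup:
		return "restart the container"
	default:
		return "unknown probe kind"
	}
}

func main() {
	names := []string{"liveness", "readiness", "startup"}
	for i, name := range names {
		fmt.Printf("%-9s failure -> %s\n", name, actionOnFailure(ProbeKind(i)))
	}
}
```

Note that only the readiness probe leaves the container running; the other two end in a restart.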

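Application scenarios

Web application health check: use a Liveness Probe so that a web service does not stay unresponsive for long because of a deadlock or resource exhaustion.
Database connection check: use a Readiness Probe so that the application only starts handling client requests after it has connected to the database.
Startup monitoring: use a Startup Probe so that a slow-starting application can finish starting without Kubernetes mistakenly treating it as failed along the way.

Configuration file walkthrough

Below is an example Pod manifest that uses all three Probe types: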

apiVersion: v1
kind: Pod
metadata:
  name: myapp
spec:
  containers:
  - name: myapp-container
    image: myapp:latest
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 10
      periodSeconds: 5
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 5
    startupProbe:
      httpGet:
        path: /startup
        port: 8080
      failureThreshold: 30
      periodSeconds: 10

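Liveness Probe:
- httpGet: sends an HTTP GET request to the /healthz path and checks the response status. If 3 consecutive checks fail (3 is the default failureThreshold), the container is restarted.
- initialDelaySeconds: waits 10 seconds after the container starts before running the first check.
- periodSeconds: runs a check every 5 seconds.
Readiness Probe:
- httpGet: sends an HTTP GET request to the /ready path and checks the response status. If the check fails, the Pod is removed from the Service load balancer.
- initialDelaySeconds: waits 5 seconds after the container starts before running the first check.
- periodSeconds: runs a check every 5 seconds.
Startup Probe:
- httpGet: sends an HTTP GET request to the /startup path and checks the response status. If the check is still failing after failureThreshold attempts (30 here), Kubernetes considers the startup failed and restarts the container.
- failureThreshold: allows at most 30 failures.
- periodSeconds: runs a check every 10 seconds.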
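
To make the numbers above concrete, the following sketch multiplies failureThreshold by periodSeconds to estimate how long a probe has to keep failing before Kubernetes acts. The ProbeTiming struct and FailureWindowSeconds helper are hypothetical names for this illustration, and the result is an approximation that ignores probe timeouts and scheduling jitter.

```go
package main

import "fmt"

// ProbeTiming mirrors the timing fields used in the manifest above.
// It is a hypothetical helper type, not a Kubernetes API type.
type ProbeTiming struct {
	InitialDelaySeconds int
	PeriodSeconds       int
	FailureThreshold    int
}

// FailureWindowSeconds approximates how long a probe must keep failing
// before Kubernetes acts: failureThreshold * periodSeconds.
func FailureWindowSeconds(p ProbeTiming) int {
	return p.FailureThreshold * p.PeriodSeconds
}

func main() {
	// failureThreshold is not set for the liveness probe in the manifest,
	// so the Kubernetes default of 3 is assumed here.
	liveness := ProbeTiming{InitialDelaySeconds: 10, PeriodSeconds: 5, FailureThreshold: 3}
	startup := ProbeTiming{PeriodSeconds: 10, FailureThreshold: 30}

	fmt.Printf("liveness: first check after %ds, restart after about %ds of continuous failure\n",
		liveness.InitialDelaySeconds, FailureWindowSeconds(liveness))
	fmt.Printf("startup: the application gets about %ds to finish starting\n",
		FailureWindowSeconds(startup))
}
```

With the values from the manifest, the liveness window is about 15 seconds and the startup budget is about 300 seconds (5 minutes).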

Hands-on

Implementing a Liveness Probe with a command

Configuration file: pod.yaml

apiVersion: v1
kind: Pod
metadata:
  labels:
    app: test-pod
  name: test-pod
spec:
  containers:
  - name: liveness
    command:
    - /bin/sh
    - -c
    args:
    - "touch /tmp/healthy; sleep 30; rm -rf /tmp/healthy; sleep 3600"
    image: registry.k8s.io/busybox
    livenessProbe:
      exec:
        command:
        - cat
        - /tmp/healthy
      initialDelaySeconds: 5
      periodSeconds: 5

Watch the test-pod events

kubectl get events --field-selector involvedObject.name=test-pod -w

Wait for the events to change. The result looks like this:

LAST SEEN   TYPE      REASON      OBJECT         MESSAGE
15s         Normal    Created     pod/test-pod   Created container liveness
15s         Normal    Started     pod/test-pod   Started container liveness
46s         Warning   Unhealthy   pod/test-pod   Liveness probe failed: cat: can't open '/tmp/healthy': No such file or directory
46s         Normal    Killing     pod/test-pod   Container liveness failed liveness probe, will be restarted
15s         Normal    Pulled      pod/test-pod   Successfully pulled image "registry.k8s.io/busybox" in 768ms (768ms including waiting). Image size: 1144547 bytes.

As shown, once the Liveness Probe detects a failure, the container is restarted.
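
An exec probe counts as successful when the command exits with status 0. The sketch below reproduces that rule locally with cat on a temporary file, mirroring the container script (create the file, then delete it). It is only an approximation: a real exec probe is run by the kubelet inside the container, and the helper names (execProbe, demo) are made up for this example. It assumes a Unix-like system where cat is available.

```go
package main

import (
	"context"
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
	"time"
)

// execProbe runs a command and reports success when it exits with status 0,
// which is the rule Kubernetes applies to exec probes.
func execProbe(timeout time.Duration, name string, args ...string) bool {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return exec.CommandContext(ctx, name, args...).Run() == nil
}

// demo mimics the container script: create the file, probe it,
// delete it, and probe again.
func demo() (bool, bool, error) {
	dir, err := os.MkdirTemp("", "probe-demo")
	if err != nil {
		return false, false, err
	}
	defer os.RemoveAll(dir)

	healthy := filepath.Join(dir, "healthy")
	if err := os.WriteFile(healthy, nil, 0o644); err != nil {
		return false, false, err
	}
	before := execProbe(time.Second, "cat", healthy)

	if err := os.Remove(healthy); err != nil {
		return before, false, err
	}
	after := execProbe(time.Second, "cat", healthy)
	return before, after, nil
}

func main() {
	before, after, err := demo()
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	fmt.Println("probe while the file exists:", before)
	fmt.Println("probe after the file is removed:", after)
}
```

The second probe fails for the same reason as in the event log: cat cannot open the file and exits with a non-zero status.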

Implementing a Liveness Probe over HTTP

Let's look at the main.go file of the lofairy/foo image.

main.go

package main

import (
	"net/http"
	"os"
	"time"

	"github.com/gin-gonic/gin"
)

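func main() {
	hostname, _ := os.Hostname()
	// Record the start time so /healthy_test can change its answer later.
	started := time.Now()

	r := gin.Default()
	r.GET("/", func(c *gin.Context) {
		c.String(http.StatusOK, "foo")
	})

	// Report the hostname, which is the Pod name when running in Kubernetes.
	r.GET("/hostname", func(c *gin.Context) {
		c.JSON(http.StatusOK, gin.H{
			"hostname": hostname,
		})
	})

	// Always healthy.
	r.GET("/healthy", func(c *gin.Context) {
		c.Status(http.StatusOK)
	})

	// Healthy for the first 30 seconds after startup, then 503.
	r.GET("/healthy_test", func(c *gin.Context) {
		duration := time.Since(started)
		if duration.Seconds() > 30 {
			c.Status(http.StatusServiceUnavailable)
		} else {
			c.Status(http.StatusOK)
		}
	})

	// With no address given, gin listens on :8080 unless PORT is set.
	r.Run()
}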

The lofairy/foo API server exposes /healthy_test as a probe test endpoint: it returns HTTP 200 for the first 30 seconds after startup and HTTP 503 afterwards. We will use this endpoint for the test.
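
For an HTTP probe, Kubernetes treats any status code from 200 up to (but not including) 400 as success and anything else as failure. The sketch below applies that rule against a local stand-in server built with net/http/httptest, which switches from 200 to 503 in the same way /healthy_test does. The function names (statusIsHealthy, httpProbe, demo) are assumptions for illustration and are not taken from the kubelet source.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
	"time"
)

// statusIsHealthy applies the HTTP probe rule: 200 <= code < 400 is success.
func statusIsHealthy(code int) bool {
	return code >= 200 && code < 400
}

// httpProbe sends one GET request and reports whether it counts as healthy.
// A connection error or timeout counts as a failure.
func httpProbe(url string, timeout time.Duration) bool {
	client := &http.Client{Timeout: timeout}
	resp, err := client.Get(url)
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return statusIsHealthy(resp.StatusCode)
}

// demo probes a stand-in server once while it returns 200 and once after it
// has switched to 503.
func demo() (bool, bool) {
	var failing atomic.Bool
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if failing.Load() {
			w.WriteHeader(http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	}))
	defer srv.Close()

	before := httpProbe(srv.URL+"/healthy_test", time.Second)
	failing.Store(true)
	after := httpProbe(srv.URL+"/healthy_test", time.Second)
	return before, after
}

func main() {
	before, after := demo()
	fmt.Println("probe while the endpoint returns 200:", before)
	fmt.Println("probe while the endpoint returns 503:", after)
}
```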

Configuration file: pod.yaml

apiVersion: v1
kind: Pod
metadata:
  labels:
    app: test-pod
  name: test-pod
spec:
  containers:
  - name: liveness
    image: lofairy/foo
    livenessProbe:
      httpGet:
        path: /healthy_test
        port: 8080
        httpHeaders:
          - name: X-Custom-Header
            value: Awesome
      initialDelaySeconds: 3
      periodSeconds: 3

Watch the test-pod events

kubectl get events --field-selector involvedObject.name=test-pod -w

Wait for the events to change. The result looks like this:

LAST SEEN   TYPE      REASON      OBJECT         MESSAGE
0s          Normal    Scheduled   pod/test-pod   Successfully assigned default/test-pod to wslkind-worker2
0s          Normal    Pulling     pod/test-pod   Pulling image "lofairy/foo"
0s          Normal    Pulled      pod/test-pod   Successfully pulled image "lofairy/foo" in 1.623s (1.623s including waiting). Image size: 9539405 bytes.
0s          Normal    Created     pod/test-pod   Created container liveness
0s          Normal    Started     pod/test-pod   Started container liveness
0s          Warning   Unhealthy   pod/test-pod   Liveness probe failed: HTTP probe failed with statuscode: 503
0s          Warning   Unhealthy   pod/test-pod   Liveness probe failed: HTTP probe failed with statuscode: 503
0s          Warning   Unhealthy   pod/test-pod   Liveness probe failed: HTTP probe failed with statuscode: 503
0s          Normal    Killing     pod/test-pod   Container liveness failed liveness probe, will be restarted
0s          Normal    Pulling     pod/test-pod   Pulling image "lofairy/foo"
0s          Normal    Pulled      pod/test-pod   Successfully pulled image "lofairy/foo" in 1.544s (1.544s including waiting). Image size: 9539405 bytes.

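As shown, once the Liveness Probe detects three consecutive failures, the container is restarted.

Readiness Probe

The configuration parameters for Readiness are essentially the same as for Liveness; the only change is that the top-level field livenessProbe becomes readinessProbe.

Configuration file: pod.yaml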

apiVersion: v1
kind: Pod
metadata:
  labels:
    app: test-pod
  name: test-pod
spec:
  containers:
  - name: nginx
    image: nginx
    ports:
    - containerPort: 80
    readinessProbe:
      httpGet:
        path: /
        port: 80
      initialDelaySeconds: 5
      periodSeconds: 10
      failureThreshold: 3

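Readiness probe configuration:

httpGet: probes readiness with an HTTP GET request to path / on port 80.
initialDelaySeconds: waits 5 seconds after the container starts before running the readiness probe.
periodSeconds: runs the readiness probe every 10 seconds.
failureThreshold: the container is considered not ready after 3 consecutive failed probes.

In short, we expect the readiness probe to report success within about 15 seconds of the Pod starting.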
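
The "consecutive" part of failureThreshold is easy to miss: a single success resets the failure count. The sketch below is a simplified model of that bookkeeping, assuming the Kubernetes defaults of successThreshold 1 and an initial state of not ready. The readinessTracker type is hypothetical and is not how the kubelet is actually implemented.

```go
package main

import "fmt"

// readinessTracker is a simplified, hypothetical model of how consecutive
// probe results turn into a ready / not ready state.
type readinessTracker struct {
	failureThreshold int
	successThreshold int

	ready     bool // starts false: a container is not ready until a probe passes
	successes int  // consecutive successes so far
	failures  int  // consecutive failures so far
}

// observe records one probe result and returns the resulting ready state.
func (r *readinessTracker) observe(ok bool) bool {
	if ok {
		r.failures = 0
		r.successes++
		if r.successes >= r.successThreshold {
			r.ready = true
		}
	} else {
		r.successes = 0
		r.failures++
		if r.failures >= r.failureThreshold {
			r.ready = false
		}
	}
	return r.ready
}

func main() {
	tracker := &readinessTracker{failureThreshold: 3, successThreshold: 1}
	results := []bool{true, false, false, true, false, false, false}
	for i, ok := range results {
		fmt.Printf("probe %d ok=%-5v ready=%v\n", i+1, ok, tracker.observe(ok))
	}
}
```

In this run the container only becomes not ready on the seventh probe, because the success at probe 4 resets the two earlier failures.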

Deploy the Pod

kubectl apply -f pod.yaml

Configuration file: service.yaml

apiVersion: v1
kind: Service
metadata:
  labels:
    type: test-pod-service
  name: test-pod-service
spec:
  type: NodePort
  selector:
    app: test-pod
  ports:
  - protocol: TCP
    port: 80
    targetPort: 80
    nodePort: 30000

Create the Service to expose the nginx server

kubectl apply -f service.yaml

Access the Service

curl -o /dev/null -s -w "%{http_code}\n" 0.0.0.0:30000
---
000

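The HTTP code is 000, which means no HTTP response came back at all: because the readiness probe has not passed yet, the cluster does not route traffic to the Pod.

After 15 seconds, access the Service again: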

curl -o /dev/null -s -w "%{http_code}\n" 0.0.0.0:30000
---
200

This time it returns 200: the readiness probe has passed, so traffic is routed to the Pod.
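
Instead of re-running curl by hand, the wait can be scripted as a polling loop. The sketch below is one possible approach, not something from the original setup: waitUntil and returnsOK are hypothetical helpers, and the demo polls a local stand-in server that answers 503 twice before returning 200. That stand-in is an assumption made so the example runs without a cluster; a Pod that is not ready yields no response at all, as in the 000 case above.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
	"time"
)

// waitUntil calls check every interval until it returns true or the timeout
// would be exceeded. It reports whether check ever succeeded.
func waitUntil(check func() bool, interval, timeout time.Duration) bool {
	deadline := time.Now().Add(timeout)
	for {
		if check() {
			return true
		}
		if time.Now().Add(interval).After(deadline) {
			return false
		}
		time.Sleep(interval)
	}
}

// returnsOK reports whether a GET to url answers with HTTP 200.
// A connection error (what curl prints as 000) counts as not OK.
func returnsOK(url string) bool {
	client := &http.Client{Timeout: 2 * time.Second}
	resp, err := client.Get(url)
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

// demo polls a stand-in server that becomes healthy on the third request and
// returns the number of requests made and whether polling succeeded.
func demo() (int, bool) {
	var hits atomic.Int32
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if hits.Add(1) < 3 {
			w.WriteHeader(http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	}))
	defer srv.Close()

	ok := waitUntil(func() bool { return returnsOK(srv.URL) }, 10*time.Millisecond, 2*time.Second)
	return int(hits.Load()), ok
}

func main() {
	attempts, ok := demo()
	fmt.Printf("service answered 200: %v after %d attempts\n", ok, attempts)
}
```

To poll the NodePort from this article instead, pass the node address and port 30000 to returnsOK and choose an interval and timeout that fit the probe settings.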

Startup Probe

We now want to verify that the Startup Probe blocks the Liveness Probe and Readiness Probe until it succeeds.

Configuration file: pod.yaml

apiVersion: v1
kind: Pod
metadata:
  labels:
    app: test-pod
  name: test-pod
spec:
  containers:
  - name: liveness
    command:
    - /bin/sh
    - -c
    args:
    - "touch /tmp/healthy; sleep 30; rm -rf /tmp/healthy; sleep 30; touch /tmp/ready; sleep 3600"
    image: registry.k8s.io/busybox
    livenessProbe:
      exec:
        command:
        - cat
        - /tmp/healthy
      initialDelaySeconds: 10
      failureThreshold: 1
      periodSeconds: 1
    startupProbe:
      exec:
        command:
        - cat
        - /tmp/ready
      failureThreshold: 10
      periodSeconds: 10

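Explanation:

When the container starts, it creates /tmp/healthy so the liveness probe can succeed.
30 seconds after the container starts, rm -rf /tmp/healthy removes the file so the liveness probe fails.
60 seconds after the container starts, touch /tmp/ready creates the file so the startup probe succeeds.

Create the Pod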

kubectl apply -f pod.yaml

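The only probe events returned by kubectl get events are Unhealthy events; successful probes do not generate events.

Based on what we have learned so far, we can predict what happens next:

The startup probe only succeeds after 60 seconds, so the last startup probe warning should appear about 60 seconds after the container starts.
The liveness probe only runs once the startup probe has succeeded, and its initialDelaySeconds is 10 seconds, so the last liveness probe warning should appear about 70 seconds after the container starts.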
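
The prediction above can be checked with a small simulation. This is a deliberately simplified model and not the kubelet's algorithm: it steps one second at a time, runs probes at exact multiples of their period, and follows this article's reasoning that the liveness initialDelaySeconds starts counting once the startup probe has succeeded. Real timings also include jitter. The simulate function and its constants are made up for this sketch.

```go
package main

import "fmt"

// Script timeline from the manifest above, in seconds since container start.
const (
	healthyRemovedAt = 30 // rm -rf /tmp/healthy
	readyCreatedAt   = 60 // touch /tmp/ready
)

func healthyExists(t int) bool { return t < healthyRemovedAt }
func readyExists(t int) bool   { return t >= readyCreatedAt }

// simulate returns the second at which the startup probe first succeeds and
// the second at which the liveness probe first fails (-1 if it never does).
// With useStartupProbe set to false the container counts as started at 0.
func simulate(useStartupProbe bool, horizon int) (startedAt, livenessFailedAt int) {
	const (
		startupPeriod        = 10
		livenessInitialDelay = 10 // liveness periodSeconds is 1, so it runs every tick
	)
	startedAt, livenessFailedAt = -1, -1
	if !useStartupProbe {
		startedAt = 0
	}
	for t := 1; t <= horizon; t++ {
		if startedAt < 0 {
			// Only the startup probe runs; the liveness probe is blocked.
			if t%startupPeriod == 0 && readyExists(t) {
				startedAt = t
			}
			continue
		}
		// Assumption taken from the article: the delay counts from startup success.
		if t < startedAt+livenessInitialDelay {
			continue
		}
		if !healthyExists(t) {
			// failureThreshold is 1, so the first failure triggers a restart.
			livenessFailedAt = t
			return
		}
	}
	return
}

func main() {
	started, failed := simulate(true, 120)
	fmt.Printf("with startup probe: started at %ds, liveness failed at %ds\n", started, failed)

	_, failed = simulate(false, 120)
	fmt.Printf("without startup probe: liveness failed at %ds\n", failed)
}
```

Under these assumptions the model gives 60 seconds for startup success and 70 seconds for the liveness failure, while the same container without a startup probe would already fail its liveness check at 30 seconds.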

Let's verify.

Watch the test-pod events

kubectl get events --field-selector involvedObject.name=test-pod -w

Wait for the events to change. The result looks like this:

LAST SEEN   TYPE      REASON      OBJECT         MESSAGE
74s         Normal    Scheduled   pod/test-pod   Successfully assigned default/test-pod to wslkind-worker
74s         Normal    Pulling     pod/test-pod   Pulling image "registry.k8s.io/busybox"
73s         Normal    Pulled      pod/test-pod   Successfully pulled image "registry.k8s.io/busybox" in 768ms (768ms including waiting). Image size: 1144547 bytes.
73s         Normal    Created     pod/test-pod   Created container liveness
73s         Normal    Started     pod/test-pod   Started container liveness
14s         Warning   Unhealthy   pod/test-pod   Startup probe failed: cat: can't open '/tmp/ready': No such file or directory
3s          Warning   Unhealthy   pod/test-pod   Liveness probe failed: cat: can't open '/tmp/healthy': No such file or directory
3s          Normal    Killing     pod/test-pod   Container liveness failed liveness probe, will be restarted

As shown, the events unfold roughly as we predicted.