2025 iThome 鐵人賽 - Day 10
So far we have written table-driven tests, added context/timeout handling, and run micro-benchmarks. Today we go one step further and learn how to:

  • Enable pprof: expose a profiling endpoint in the program.
  • Run a load test: put enough traffic on the service to produce a profile.
  • Analyze with the pprof tool: find the hot functions.
  • Make one small tweak: see the difference in the numbers.

Step 0: What is pprof

pprof locates bottlenecks such as CPU hot spots, memory allocations, and goroutine states, which makes performance tuning much easier. For how to interpret pprof data in depth, see the full documentation; in this post we only do a first-pass analysis in Step 5 to find the most expensive function.
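
For context, pprof also works without an HTTP server: the runtime/pprof package can write a profile straight to a file. Below is a minimal sketch of that standalone form (not part of this project; doWork is a hypothetical placeholder). In this series we use the net/http/pprof variant instead, so a running service can be profiled over HTTP, starting in Step 1.

// cpuprofile.go (standalone sketch, not part of the series code)
package main

import (
	"log"
	"os"
	"runtime/pprof"
)

// doWork stands in for whatever code you want to profile.
func doWork() {
	total := 0
	for i := 0; i < 10_000_000; i++ {
		total += i * i
	}
	_ = total
}

func main() {
	f, err := os.Create("cpu.pprof")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Sample the CPU while doWork runs and write the profile to cpu.pprof,
	// which can then be inspected with: go tool pprof cpu.pprof
	if err := pprof.StartCPUProfile(f); err != nil {
		log.Fatal(err)
	}
	defer pprof.StopCPUProfile()

	doWork()
}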

Step 1: Enable pprof in the project

Add a new file, pprof.go, gated by a build tag (so that profiling is not enabled by default in production).

// pprof.go
//go:build debug

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof"
)

// StartPprof starts the profiling server on :6060
func StartPprof() {
	go func() {
		log.Println("pprof listening on :6060")
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
}

Then in main.go add:

func main() {
	cfg := LoadConfig()

	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", healthHandler)
	mux.HandleFunc("/search", searchHandler)

	handler := LoggingMiddleware(RecoveryMiddleware(mux))

	// Start pprof (only compiled in when building with -tags debug)
	StartPprof()

	log.Printf("Server listening on %s", cfg.Port)
	if err := http.ListenAndServe(cfg.Port, handler); err != nil {
		log.Fatal(err)
	}
}

At this point main.go shows an error:

undefined: StartPprof (compiler: UndeclaredName)

and pprof.go shows a warning:

No packages found for open file /Users/shirley/programming/golang/cloud-native-search/pprof.go.
This file may be excluded due to its build tags; try adding "-tags=<build tag>" to your gopls "buildFlags" configuration
See the documentation for more information on working with build tags:
https://github.com/golang/tools/blob/master/gopls/doc/settings.md#buildflags-string.

The pprof.go warning is expected: gopls excludes the file because of its //go:build debug tag. The undefined: StartPprof error in main.go, however, is a real compile error for any build that omits -tags debug; a common fix is a no-op stub that is compiled only without the tag (a sketch follows below).
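
A minimal sketch of such a stub (hypothetical file name pprof_noop.go), so the project still compiles when the debug tag is absent:

// pprof_noop.go
//go:build !debug

package main

// StartPprof is a no-op in builds without the debug tag,
// so main.go compiles with and without -tags debug.
func StartPprof() {}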


Step 2: Run the service (debug mode)

# run with the debug build tag
go run -tags debug .

At this point:

  • The service still runs on :8080
  • pprof runs on :6060 and can be browsed at http://localhost:6060/debug/pprof/

Next, open two terminal tabs: the first generates traffic (Step 3), the second captures a profile (Step 4).

Step 3: Generate traffic

Use hey to simulate QPS and keep /search under sustained load.*

hey -n 1000000 -c 50 "http://localhost:8080/search?q=golang"

Flag reference:

  • -n 1000000

    → send 1,000,000 requests in total

  • -c 50

    → keep 50 concurrent connections

    i.e. at most 50 requests are in flight at any one time.

  • Expected output

    Summary:
      Total:	21.4561 secs
      Slowest:	0.3754 secs
      Fastest:	0.0000 secs
      Average:	0.0011 secs
      Requests/sec:	46606.8823
    
      Total data:	102000000 bytes
      Size/request:	102 bytes
    
    Response time histogram:
      0.000 [1]	|
      0.038 [999124]	|■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
      0.075 [429]	|
      0.113 [190]	|
      0.150 [98]	|
      0.188 [7]	|
      0.225 [61]	|
      0.263 [32]	|
      0.300 [9]	|
      0.338 [1]	|
      0.375 [48]	|
    
    Latency distribution:
      10% in 0.0001 secs
      25% in 0.0003 secs
      50% in 0.0006 secs
      75% in 0.0010 secs
      90% in 0.0020 secs
      95% in 0.0031 secs
      99% in 0.0081 secs
    
    Details (average, fastest, slowest):
      DNS+dialup:	0.0000 secs, 0.0000 secs, 0.3754 secs
      DNS-lookup:	0.0000 secs, 0.0000 secs, 0.0027 secs
      req write:	0.0000 secs, 0.0000 secs, 0.3327 secs
      resp wait:	0.0010 secs, 0.0000 secs, 0.3753 secs
      resp read:	0.0000 secs, 0.0000 secs, 0.3322 secs
    
    Status code distribution:
      [200]	1000000 responses
    

Step 4: Capture a profile

go tool pprof "http://localhost:6060/debug/pprof/profile?seconds=10"
  • Expected output

    go tool pprof "http://localhost:6060/debug/pprof/profile?seconds=10"
    
    Fetching profile over HTTP from http://localhost:6060/debug/pprof/profile?seconds=10
    Saved profile in /Users/shirley/pprof/pprof.cloud-native-search.samples.cpu.002.pb.gz
    File: cloud-native-search
    Type: cpu
    Time: 2025-09-24 00:29:37 CST
    Duration: 10.13s, Total samples = 77.10s (761.26%)
    Entering interactive mode (type "help" for commands, "o" for options)
    
    (pprof) top
    Showing nodes accounting for 75.86s, 98.39% of 77.10s total
    Dropped 264 nodes (cum <= 0.39s)
    Showing top 10 nodes out of 64
          flat  flat%   sum%        cum   cum%
        61.67s 79.99% 79.99%     61.69s 80.01%  syscall.syscall
         6.44s  8.35% 88.34%      6.44s  8.35%  runtime.kevent
         4.06s  5.27% 93.61%      4.06s  5.27%  runtime.pthread_cond_wait
         1.91s  2.48% 96.08%      1.91s  2.48%  runtime.usleep
         0.78s  1.01% 97.09%      0.78s  1.01%  runtime.pthread_cond_signal
         0.49s  0.64% 97.73%      0.49s  0.64%  runtime.madvise
         0.46s   0.6% 98.33%      0.46s   0.6%  runtime.pthread_cond_timedwait_relative_np
         0.03s 0.039% 98.37%      0.86s  1.12%  runtime.lock2
         0.01s 0.013% 98.38%     16.30s 21.14%  main.main.LoggingMiddleware.func2
         0.01s 0.013% 98.39%     22.37s 29.01%  net/http.(*connReader).Read
    
    (pprof) list main.main.LoggingMiddleware
    Total: 77.10s
    ROUTINE ======================== main.main.LoggingMiddleware.func2 in /Users/shirley/programming/golang/cloud-native-search/middleware.go
          10ms     16.30s (flat, cum) 21.14% of Total
             .          .     10:	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
             .       10ms     11:		start := time.Now()
             .      170ms     12:		next.ServeHTTP(w, r)
          10ms       10ms     13:		duration := time.Since(start)
             .          .     14:
             .     16.11s     15:		log.Printf("%s %s %v", r.Method, r.URL.Path, duration)
             .          .     16:	})
             .          .     17:}
             .          .     18:
             .          .     19:func RecoveryMiddleware(next http.Handler) http.Handler {
             .          .     20:	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
    

This downloads a CPU profile and drops you into the interactive prompt. Common commands:

  • top: list the functions that consume the most CPU
  • list <func>: show where time is spent inside that function
  • web: render the call graph in a browser (requires graphviz)*; for a flame-graph view, see the -http example below
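
The pprof tool can also serve a full web UI, including a flame-graph view, instead of the interactive prompt; for example:

go tool pprof -http=:8081 "http://localhost:6060/debug/pprof/profile?seconds=10"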

Step 5: Analyze the pprof output and find the real hot spot

(pprof) top
Showing nodes accounting for 75.86s, 98.39% of 77.10s total
Dropped 264 nodes (cum <= 0.39s)
Showing top 10 nodes out of 64
      flat  flat%   sum%        cum   cum%
    61.67s 79.99% 79.99%     61.69s 80.01%  syscall.syscall
     6.44s  8.35% 88.34%      6.44s  8.35%  runtime.kevent
     4.06s  5.27% 93.61%      4.06s  5.27%  runtime.pthread_cond_wait
     1.91s  2.48% 96.08%      1.91s  2.48%  runtime.usleep
     0.78s  1.01% 97.09%      0.78s  1.01%  runtime.pthread_cond_signal
     0.49s  0.64% 97.73%      0.49s  0.64%  runtime.madvise
     0.46s   0.6% 98.33%      0.46s   0.6%  runtime.pthread_cond_timedwait_relative_np
     0.03s 0.039% 98.37%      0.86s  1.12%  runtime.lock2
     0.01s 0.013% 98.38%     16.30s 21.14%  main.main.LoggingMiddleware.func2
     0.01s 0.013% 98.39%     22.37s 29.01%  net/http.(*connReader).Read

What each column means:

  • flat: CPU time spent in the function itself (excluding its callees).
  • flat%: flat as a percentage of total time.
  • sum%: the running total of flat% so far.
  • cum (cumulative): time spent in the function plus all of its callees.
  • cum%: cum as a percentage of total time.
  • last column: the function name

For example, main.main.LoggingMiddleware.func2 above has flat 0.01s (it does almost no work itself) but cum 16.30s, because the time spent in its callees (next.ServeHTTP and log.Printf) is counted toward cum.

How to read this pprof output

  1. syscall.syscall (79.99%): system calls, mostly network I/O; this is expected
  2. runtime.kevent (8.35%): event polling, also expected
  3. main.LoggingMiddleware.func2 (21.14% cum): our logging middleware

Key finding: the logging middleware accounts for roughly 21% of cumulative time, which makes it a good optimization target.

Step 6: Optimize the hot spot - improve logging performance

The problem is that log.Printf formats a string and writes it out on every request, which is expensive at high QPS.*

// middleware.go
func LoggingMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		next.ServeHTTP(w, r)
		duration := time.Since(start)

		// Only log in detail in debug mode, or lower the logging rate
		if duration > 100*time.Millisecond { // only log slow requests
			log.Printf("%s %s %v", r.Method, r.URL.Path, duration)
		}
	})
}
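
The comment above also mentions lowering the logging rate; here is a minimal sketch of that idea (the SampledLoggingMiddleware name and the 1-in-1000 sampling rate are arbitrary choices, not part of the series code):

// Sampled logging: log roughly one request out of every 1000.
// Needs "sync/atomic" in the imports (atomic.Uint64 requires Go 1.19+).
var requestCount atomic.Uint64

func SampledLoggingMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		next.ServeHTTP(w, r)
		duration := time.Since(start)

		// Atomically count requests and emit a log line only every 1000th one.
		if requestCount.Add(1)%1000 == 0 {
			log.Printf("%s %s %v (sampled)", r.Method, r.URL.Path, duration)
		}
	})
}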

Alternatively, with structured JSON logging:

type LogEntry struct {
	Method   string        `json:"method"`
	Path     string        `json:"path"`
	Duration time.Duration `json:"duration"` // marshals as nanoseconds
}

func LoggingMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		next.ServeHTTP(w, r)
		duration := time.Since(start)

		// Build a structured log entry instead of an ad-hoc format string
		entry := LogEntry{
			Method:   r.Method,
			Path:     r.URL.Path,
			Duration: duration,
		}

		// Only log slow requests (or sampled ones); needs "encoding/json" in the imports
		if duration > 100*time.Millisecond {
			if data, err := json.Marshal(entry); err == nil {
				log.Println(string(data))
			}
		}
	})
}
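
If the project targets Go 1.21 or newer, the standard library's log/slog already produces structured JSON, so the hand-rolled LogEntry type is optional. A sketch of that alternative (an assumption on my part, not what the series code uses):

// Structured JSON logging with log/slog (Go 1.21+).
// Assumes imports: "log/slog", "net/http", "os", "time".
var requestLogger = slog.New(slog.NewJSONHandler(os.Stdout, nil))

func SlogLoggingMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		next.ServeHTTP(w, r)
		duration := time.Since(start)

		// Keep the same "only log slow requests" policy as above.
		if duration > 100*time.Millisecond {
			requestLogger.Info("slow request",
				"method", r.Method,
				"path", r.URL.Path,
				"duration", duration,
			)
		}
	})
}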

Step 7: Verify the optimization

  1. Re-run the load test:

    hey -n 1000000 -c 50 "http://localhost:8080/search?q=golang"
    
    • Expected output

      hey -n 1000000 -c 50 "http://localhost:8080/search?q=golang"
      
      Summary:
        Total:	14.2736 secs
        Slowest:	0.1821 secs
        Fastest:	0.0000 secs
        Average:	0.0007 secs
        Requests/sec:	70059.3366
      
        Total data:	102000000 bytes
        Size/request:	102 bytes
      
      Response time histogram:
        0.000 [1]	|
        0.018 [999477]	|■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
        0.036 [426]	|
        0.055 [37]	|
        0.073 [5]	|
        0.091 [4]	|
        0.109 [6]	|
        0.127 [0]	|
        0.146 [0]	|
        0.164 [0]	|
        0.182 [44]	|
      
      Latency distribution:
        10% in 0.0001 secs
        25% in 0.0002 secs
        50% in 0.0004 secs
        75% in 0.0008 secs
        90% in 0.0014 secs
        95% in 0.0021 secs
        99% in 0.0052 secs
      
      Details (average, fastest, slowest):
        DNS+dialup:	0.0000 secs, 0.0000 secs, 0.1821 secs
        DNS-lookup:	0.0000 secs, 0.0000 secs, 0.0026 secs
        req write:	0.0000 secs, 0.0000 secs, 0.0191 secs
        resp wait:	0.0006 secs, 0.0000 secs, 0.1811 secs
        resp read:	0.0000 secs, 0.0000 secs, 0.0347 secs
      
      Status code distribution:
        [200]	1000000 responses
      
      
  2. While the load test is running, capture another profile:

    go tool pprof "http://localhost:6060/debug/pprof/profile?seconds=10"
    
  3. Results:

    After the optimization, LoggingMiddleware no longer appears in the top 10.

    go tool pprof "http://localhost:6060/debug/pprof/profile?seconds=10"
    
    Fetching profile over HTTP from http://localhost:6060/debug/pprof/profile?seconds=10
    Saved profile in /Users/shirley/pprof/pprof.cloud-native-search.samples.cpu.003.pb.gz
    File: cloud-native-search
    Type: cpu
    Time: 2025-09-24 00:56:55 CST
    Duration: 10.19s, Total samples = 71.04s (696.95%)
    Entering interactive mode (type "help" for commands, "o" for options)
    
    (pprof) top
    Showing nodes accounting for 69.55s, 97.90% of 71.04s total
    Dropped 260 nodes (cum <= 0.36s)
    Showing top 10 nodes out of 67
          flat  flat%   sum%        cum   cum%
        57.85s 81.43% 81.43%     57.87s 81.46%  syscall.syscall
         4.95s  6.97% 88.40%      4.95s  6.97%  runtime.kevent
         3.41s  4.80% 93.20%      3.41s  4.80%  runtime.pthread_cond_wait
         1.72s  2.42% 95.62%      1.72s  2.42%  runtime.usleep
         0.60s  0.84% 96.47%      0.60s  0.84%  runtime.pthread_cond_timedwait_relative_np
         0.50s   0.7% 97.17%      0.50s   0.7%  runtime.pthread_cond_signal
         0.48s  0.68% 97.85%      0.48s  0.68%  runtime.madvise
         0.02s 0.028% 97.87%      1.03s  1.45%  runtime.lock2
         0.01s 0.014% 97.89%     20.76s 29.22%  bufio.(*Reader).Peek
         0.01s 0.014% 97.90%     32.07s 45.14%  internal/poll.(*FD).Write
    

Analysis

The hot spots are now all at the system level:

  • syscall.syscall (81.43%) - network I/O
  • runtime.kevent (6.97%) - event polling
  • bufio.(*Reader).Peek (29.22% cum) - HTTP request parsing
  • internal/poll.(*FD).Write (45.14% cum) - network writes

Signs the optimization worked:

  1. LoggingMiddleware no longer appears in the hot-spot list
  2. CPU time is now concentrated in system calls - the ideal shape for an I/O-bound service
  3. Application-level overhead has dropped significantly

Possible further optimizations:

  1. Connection pooling - HTTP keep-alive
  2. Response caching - avoid recomputing identical results (see the sketch below)
  3. Batching - reduce the number of system calls
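
To make the response-caching idea concrete, here is a rough sketch (the CacheMiddleware name, the URL-keyed in-memory map, and the fixed TTL are all assumptions; a real service would also need size limits and invalidation):

// cache.go (sketch only, not part of the series code)
package main

import (
	"net/http"
	"net/http/httptest"
	"sync"
	"time"
)

type cachedResponse struct {
	body    []byte
	expires time.Time
}

// CacheMiddleware serves repeated GET requests from an in-memory cache for ttl.
func CacheMiddleware(ttl time.Duration, next http.Handler) http.Handler {
	var mu sync.RWMutex
	cache := make(map[string]cachedResponse)

	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodGet {
			next.ServeHTTP(w, r)
			return
		}
		key := r.URL.String()

		mu.RLock()
		entry, ok := cache[key]
		mu.RUnlock()
		if ok && time.Now().Before(entry.expires) {
			w.Write(entry.body) // cache hit
			return
		}

		// Cache miss: record the downstream response, store it, then replay it.
		rec := httptest.NewRecorder()
		next.ServeHTTP(rec, r)
		body := rec.Body.Bytes()

		mu.Lock()
		cache[key] = cachedResponse{body: body, expires: time.Now().Add(ttl)}
		mu.Unlock()

		for k, vs := range rec.Header() {
			for _, v := range vs {
				w.Header().Add(k, v)
			}
		}
		w.WriteHeader(rec.Code)
		w.Write(body)
	})
}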

Build the habit

  • pprof is not only for CPU: it can also profile memory (heap), blocking (block), and goroutines.*
  • Keep recording baselines: run a profile before and after every optimization and keep the numbers.
  • Pair it with Day 8: use micro-benchmarks to find single-point differences first, then use pprof to find the overall hot spots.

Wrap-up

Today we covered:

  • How to enable pprof in a Go service
  • How to run a load test and capture a profile
  • How to find hot spots with pprof top/list
  • How to make a small optimization and see the difference in the numbers

*Note 1: On Linux / Windows, install hey by following the official documentation; on macOS you can install it with brew:

brew install hey

*Note 2: On Linux / Windows, install graphviz by following the official documentation; on macOS you can install it with brew:

brew install graphviz

*Note 3: QPS stands for queries per second, i.e. the number of requests the service handles per second.

*Note 4:

heap profile: shows how much memory each function allocates; used to find memory leaks, e.g. structures that are never released or slices/maps that keep growing.

block profile: shows where goroutines are blocked waiting on synchronization (mutexes, channels); used to find where goroutines get stuck (e.g. channel send/recv, mutex.Lock).

goroutine profile: a snapshot of every goroutine's current stack; used to spot goroutine leaks (the goroutine count keeps climbing without being reclaimed) or to confirm that goroutines are stuck on the same call stack (deadlock, blocking).

go tool pprof http://localhost:6060/debug/pprof/heap

go tool pprof http://localhost:6060/debug/pprof/block

go tool pprof http://localhost:6060/debug/pprof/goroutine
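
One caveat: the block profile records nothing until it is switched on. A minimal sketch of enabling it in debug builds only (enableBlockProfile is a hypothetical helper, e.g. called from StartPprof):

// profiles_debug.go (sketch, hypothetical file name)
//go:build debug

package main

import "runtime"

// enableBlockProfile turns on block profiling; without this call
// /debug/pprof/block stays empty. A rate of 1 records every blocking
// event and adds overhead, so keep it to debug builds.
func enableBlockProfile() {
	runtime.SetBlockProfileRate(1)
}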

👉 Tomorrow: how to expose custom metrics (QPS, latency) so the service can be observed long-term with Prometheus + Grafana.

