Day 9: Setting Up Prometheus (1)

2021 iThome Ironman Contest

DevOps

Part 9 of the series "這個 site 就是遜啦 - SRE 30 天登大人之旅" (This Site Just Sucks: A 30-Day SRE Coming-of-Age Journey)

bogay

Team: NTNU-Unic0rn

2021-09-23 23:05:57

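Yesterday we got Prometheus to scrape some metrics successfully. To understand the state of our own service, though, we also need to provide metrics ourselves. For a web server, that probably means HTTP request metrics, hardware information about the machine, and database information, all of which matter for monitoring the service.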

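So how do we produce our own metrics? This step is called instrumentation (I haven't found a good Traditional Chinese translation yet; Simplified Chinese sources seem to render it as 「检测」). There are two routes: exporting through a client library, or going through an exporter / integration. An exporter is usually an executable that collects some information and converts it into a format Prometheus accepts. Some tools expose Prometheus metrics natively, in which case no extra exporter is needed; caddy, which I use a lot, is one example.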
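
To make "a format Prometheus accepts" a bit more concrete, here is a minimal sketch that renders a gauge in the Prometheus text exposition format. The function names and the `demo_temperature_celsius` metric are hypothetical, made up for illustration; a real service would normally use an official client library rather than hand-rolled formatting.

```python
from __future__ import annotations


def escape_label_value(value: str) -> str:
    # The text format escapes backslash, double quote, and newline in label values.
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_gauge(
    name: str,
    help_text: str,
    samples: list[tuple[dict[str, str], float]],
) -> str:
    """Render one gauge metric family as exposition-format text."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples:
        if labels:
            # Sort labels so the output is deterministic.
            pairs = ",".join(
                f'{key}="{escape_label_value(val)}"'
                for key, val in sorted(labels.items())
            )
            lines.append(f"{name}{{{pairs}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical metric, purely for illustration.
    print(render_gauge(
        "demo_temperature_celsius",
        "Hypothetical temperature reading.",
        [({"room": "lab"}, 21.5)],
    ), end="")
```

An exporter does essentially this on every scrape: it gathers values from somewhere else and serves them as text over HTTP.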

Setting up node exporter

Let's start by setting up an exporter. Here I'll document how I set up node exporter.

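node exporter mainly exports information about the machine itself, including CPU, memory, and disk usage. If my notes here aren't clear enough, you can also follow the tutorial in the official grafana docs (I point to that one because I deploy with docker; Prometheus also offers a version for deploying directly on the machine).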

Deployment is straightforward: just add this config under services in docker-compose.yml:

node-exporter:
  image: prom/node-exporter:v1.2.2
  restart: unless-stopped
  volumes:
    - /proc:/host/proc:ro
    - /sys:/host/sys:ro
    - /:/rootfs:ro
  command:
    - '--path.procfs=/host/proc'
    - '--path.rootfs=/rootfs'
    - '--path.sysfs=/host/sys'
    - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    - '--no-collector.arp'
    - '--no-collector.netstat'
    - '--no-collector.netdev'
    - '--no-collector.softnet'
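
The `$$` in the mount-points-exclude value is docker-compose escaping for a literal `$`. As a rough sketch of which mount points that pattern filters out, the snippet below applies the same regular expression in Python (node exporter itself uses Go's regexp engine, so treat this only as an approximation; `is_excluded` is a hypothetical helper name).

```python
import re

# Value as written in docker-compose.yml; "$$" becomes "$" before the flag
# reaches node exporter.
COMPOSE_VALUE = "^/(sys|proc|dev|host|etc)($$|/)"
PATTERN = re.compile(COMPOSE_VALUE.replace("$$", "$"))


def is_excluded(mount_point: str) -> bool:
    """Return True if the mount point matches the exclude pattern."""
    return PATTERN.search(mount_point) is not None


if __name__ == "__main__":
    for path in ["/", "/home", "/sys", "/proc/net", "/etc/hosts", "/system"]:
        print(path, is_excluded(path))
```

Note that `/system` is not excluded: the pattern requires the listed directory name to be followed by either the end of the string or a `/`.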

Then add node exporter as a new target in prometheus.yml:

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ["localhost:9090"]
  # Add this!
  - job_name: node
    static_configs:
      - targets: ["node-exporter:9100"]

Now restart the services (docker-compose up -d) and you should see the metrics exported by node exporter. For example, below I queried disk usage. (I'm on WSL2, which is why that path shows up.)
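
The post does not show the exact query, so the following is only a sketch of one way to ask Prometheus for disk usage through its HTTP API. The PromQL expression is an assumption (it uses the `node_filesystem_avail_bytes` and `node_filesystem_size_bytes` metrics from node exporter), and `build_query_url` / `parse_instant_vector` are hypothetical helper names.

```python
from __future__ import annotations

import json
from urllib.parse import urlencode

# Assumed PromQL: percentage of each filesystem that is in use.
DISK_USAGE_QUERY = (
    "100 * (1 - node_filesystem_avail_bytes / node_filesystem_size_bytes)"
)


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    params = urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + params


def parse_instant_vector(body: str) -> dict[str, float]:
    """Map each mountpoint label to its sample value."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError("query failed: " + str(payload.get("error")))
    usage = {}
    for series in payload["data"]["result"]:
        mountpoint = series["metric"].get("mountpoint", "")
        _timestamp, value = series["value"]  # the value arrives as a string
        usage[mountpoint] = float(value)
    return usage


if __name__ == "__main__":
    # Fetch this URL with any HTTP client, then pass the response body
    # to parse_instant_vector.
    print(build_query_url("http://localhost:9090", DISK_USAGE_QUERY))
```

Pasting the same expression into the Prometheus UI should give the equivalent result without any code.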

Some issues with deploying node exporter via the docker image

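Next, let's talk about the problems you run into when using docker. After today's experience, I really do think the README's advice against using docker is reasonable; I hit quite a few snags. Still, I personally prefer to manage all of these services with docker, so I went with it anyway.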

(Docker Desktop for Windows/Mac and Docker EE for Windows Server only) Host networking is not available

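The README in the repo says network_mode should be host, but once the container was running I could not connect to it no matter what, while the same setup worked fine on a Linux machine. After a long search I finally found this passage in the docker docs: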

The host networking driver only works on Linux hosts, and is not supported on Docker Desktop for Mac, Docker Desktop for Windows, or Docker EE for Windows Server.

So not every flavor of docker supports host mode... I had never noticed that before.

Multiple bind mount volumes are required

The example in the GitHub repo only mounts the / path, whereas the grafana one mounts three paths. I figured /proc and /sys would be included under / anyway, and I also saw this in the README:

The node_exporter will use path.rootfs as prefix to access host filesystem.

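So at first I only mounted / to /rootfs, but later I noticed that some collectors did not seem to be working. Only after rereading the docs carefully did I find this sentence: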

Be aware that any non-root mount points you want to monitor will need to be bind-mounted into the container.

So each path has to be mounted separately. Looking through the current source code, there appear to be only these three options:

path.procfs
path.sysfs
path.rootfs

var (
	// The path of the proc filesystem.
	procPath   = kingpin.Flag("path.procfs", "procfs mountpoint.").Default(procfs.DefaultMountPoint).String()
	sysPath    = kingpin.Flag("path.sysfs", "sysfs mountpoint.").Default("/sys").String()
	rootfsPath = kingpin.Flag("path.rootfs", "rootfs mountpoint.").Default("/").String()
)

As for why they need to be handled separately, anyone who is curious but doesn't know the reason can look up what procfs and sysfs are.
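
As a small starting point for that lookup, the sketch below parses text in the /proc/mounts format and picks out the proc and sysfs entries, showing that they are separate filesystems with their own types rather than ordinary directories on the root disk. The function name and the sample text are hypothetical, and this only illustrates the distinction; it does not by itself explain node exporter's behavior.

```python
from __future__ import annotations


def virtual_mounts(mounts_text: str) -> dict[str, str]:
    """Map mount point -> filesystem type for proc and sysfs entries."""
    found = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        _device, mount_point, fs_type = fields[:3]
        if fs_type in ("proc", "sysfs"):
            found[mount_point] = fs_type
    return found


if __name__ == "__main__":
    # Hypothetical sample in /proc/mounts format.
    sample = (
        "proc /proc proc rw,nosuid,nodev,noexec 0 0\n"
        "sysfs /sys sysfs rw,nosuid,nodev,noexec 0 0\n"
        "/dev/sda1 / ext4 rw,relatime 0 0\n"
    )
    print(virtual_mounts(sample))
```

On a Linux machine, passing the contents of /proc/mounts to `virtual_mounts` shows the real entries.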

Some collectors are unavailable (without host networking)

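Some of you may have noticed that I disabled a few collectors in the config above. The reason is that those collectors need to read /host/proc/net (that is, /proc/net on the host), and if you leave them enabled you may find error messages like these in node exporter's log: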

node-exporter_1  | level=error ts=2021-09-23T12:48:38.490Z caller=collector.go:169 msg="collector failed" name=netdev duration_seconds=2.49e-05 err="couldn't get netstats: open /host/proc/net/dev: no such file or directory"
node-exporter_1  | level=error ts=2021-09-23T12:48:38.491Z caller=collector.go:169 msg="collector failed" name=softnet duration_seconds=2.45e-05 err="could not get softnet statistics: open /host/proc/net/softnet_stat: no such file or directory"
node-exporter_1  | level=error ts=2021-09-23T12:48:38.492Z caller=collector.go:169 msg="collector failed" name=netstat duration_seconds=1.43e-05 err="couldn't get netstats: open /host/proc/net/netstat: no such file or directory"
node-exporter_1  | level=error ts=2021-09-23T12:48:38.492Z caller=collector.go:169 msg="collector failed" name=arp duration_seconds=0.0001197 err="could not get ARP entries: open /host/proc/net/arp: no such file or directory"
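
If the log is long, a short script can list which collectors are failing and print the matching `--no-collector.<name>` flags for the compose file. This is only a sketch that assumes the log lines look like the ones above; `disable_flags` is a hypothetical helper name.

```python
from __future__ import annotations

import re

# Matches the collector name in lines such as:
#   ... msg="collector failed" name=netdev duration_seconds=...
FAILED = re.compile(r'msg="collector failed" name=(\w+)')


def disable_flags(log_text: str) -> list[str]:
    """Return one --no-collector flag per distinct failing collector."""
    names = sorted({match.group(1) for match in FAILED.finditer(log_text)})
    return [f"--no-collector.{name}" for name in names]


if __name__ == "__main__":
    log = (
        'level=error msg="collector failed" name=netdev err="..."\n'
        'level=error msg="collector failed" name=arp err="..."\n'
        'level=error msg="collector failed" name=netdev err="..."\n'
    )
    for flag in disable_flags(log):
        print(flag)
```

Whether disabling a collector is the right response depends on why it fails; here the cause is the missing host network namespace, as explained next.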

Those files are visible on the host, yet they cannot be found inside the container. Later I found the following passage in the metricbeat docs:

The system network metricset uses data from /proc/net/dev, or /hostfs/proc/net/dev when using -system.hostfs=/hostfs. The only way to make this file contain the host’s network devices is to use the --net=host flag. This is due to Linux namespacing; simply bind mounting the host’s /proc to /hostfs/proc is not sufficient.

This is not about node exporter, but it shows that accessing /proc/net requires the container to use the host network.

Exporting caddy's metrics

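Besides machine metrics, we also want to know about the web server. I originally planned to export gunicorn's metrics through statsd exporter, but then I remembered that caddy already provides metrics itself, and it can record the frontend side along the way.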

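Following the official docs, I first tried hitting http://caddy:2019/metrics from a container on the same bridge network, only to find that caddy's admin API listens on localhost only. In that case, I'll just open my own endpoint for exporting metrics.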

Add the following block to the caddyfile, and the metrics become available at http://caddy:3939/metrics.

:3939 {
        metrics /metrics
}

A quick test with curl looks fine:
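
To check a response like that without reading it by eye, a sketch along these lines lists the metric families declared in the exposition text by reading its `# TYPE` lines. The helper name and the sample metric names are hypothetical; the real names come from whatever caddy actually returns.

```python
from __future__ import annotations


def metric_families(exposition_text: str, prefix: str = "") -> dict[str, str]:
    """Map metric family name -> type, taken from '# TYPE' lines."""
    families = {}
    for line in exposition_text.splitlines():
        parts = line.split()
        # A type line has the shape: # TYPE <name> <type>
        if len(parts) == 4 and parts[0] == "#" and parts[1] == "TYPE":
            name, kind = parts[2], parts[3]
            if name.startswith(prefix):
                families[name] = kind
    return families


if __name__ == "__main__":
    # Hypothetical sample; save the curl output and pass it in instead.
    sample = (
        "# HELP caddy_demo_requests_total Hypothetical counter.\n"
        "# TYPE caddy_demo_requests_total counter\n"
        "caddy_demo_requests_total 3\n"
        "# TYPE go_demo_threads gauge\n"
        "go_demo_threads 8\n"
    )
    print(metric_families(sample, prefix="caddy_"))
```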

Next, edit prometheus.yml so that it scrapes caddy's metrics:

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: node
    static_configs:
      - targets: ["node-exporter:9100"]
  # Add this!
  - job_name: caddy
    static_configs:
      - targets: ["caddy:3939"]

After restarting the services, open the Prometheus UI and type caddy; you should see some metrics: