Off topic again: more Prometheus troubleshooting

11th iThome Ironman Contest

DAY 11

DevOps

不知所云之 KK8s 實務記憶篇 (rambling K8s hands-on notes), part 11 of the series

Tags: 11th Ironman Contest, prometheus

小夫

2019-09-27 21:46:12


Background

This post builds on the previous one, in which **Prometheus** started using a PV and a PVC.
Previous post: https://ithelp.ithome.com.tw/articles/10221268

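Then, last night, the prometheus pod went unhealthy. Thinking it was a minor problem, I deleted the pod and expected it to recover on its own, but... mercilessly, it did not recover.
The Prometheus log showed: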

component=tsdb \
msg="last page of the wal is torn, filling it with zeros" \
segment=/prometheus/wal/00000012
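
To make that message a little more concrete, here is a minimal sketch that checks whether the newest WAL segment ends on a partial page. It assumes the Prometheus 2.x WAL layout as I understand it: segment files with all-digit names inside the `wal` directory, written in 32 KiB pages. The function name and the page-size constant are mine, not part of any Prometheus tooling.

```python
import os
import re

# Assumption: Prometheus 2.x writes the WAL in 32 KiB pages.
PAGE_SIZE = 32 * 1024
# Segment files have all-digit names such as 00000012; checkpoint directories do not match.
SEGMENT_NAME = re.compile(r"^\d+$")


def last_segment_torn_bytes(wal_dir):
    """Return (segment_name, leftover_bytes) for the newest segment, or None if there is none.

    leftover_bytes is the segment size modulo PAGE_SIZE; a non-zero value
    means the file does not end on a page boundary.
    """
    segments = sorted(
        (name for name in os.listdir(wal_dir) if SEGMENT_NAME.match(name)),
        key=int,
    )
    if not segments:
        return None
    last = segments[-1]
    size = os.path.getsize(os.path.join(wal_dir, last))
    return last, size % PAGE_SIZE
```

Called as `last_segment_torn_bytes("/prometheus/wal")` from somewhere the volume is readable, a non-zero second value would be consistent with the "torn" last page reported in the log. It only inspects file sizes and does not validate the records themselves.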

Searching the web for the "last page of the wal is torn" message eventually led me to these:
https://github.com/prometheus/prometheus/issues/3632
https://github.com/prometheus/tsdb/issues/590
https://github.com/prometheus/tsdb/pull/623
From these I learned that the WAL problem was fixed in v2.11.0-rc.0, so...

Upgrading the version

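I was running v2.10.0 at the time. Before upgrading, I checked the Prometheus GitHub repository for the current release status and settled on v2.12.0.
I then upgraded right away to verify whether the problem was really fixed.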

After the upgrade

After the upgrade I was greeted by a new problem... or was it?

level=error ts=2019-09-26T15:58:39.427Z caller=endpoints.go:131 component="discovery manager scrape" discovery=k8s role=endpoint msg="endpoints informer unable to sync cache"
level=error ts=2019-09-26T15:58:39.427Z caller=pod.go:85 component="discovery manager scrape" discovery=k8s role=pod msg="pod informer unable to sync cache"
level=error ts=2019-09-26T15:58:39.427Z caller=endpoints.go:131 component="discovery manager scrape" discovery=k8s role=endpoint msg="endpoints informer unable to sync cache"

Prometheus still would not start properly, and I had no answer.
Switching to v2.11.2 gave the same result.
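
When the log is long, it helps to tally which error messages repeat. The following is a rough sketch of a logfmt parser for lines like the ones above; the function names `parse_logfmt` and `tally_errors` are hypothetical helpers of mine, and the parsing relies on `shlex`, so an unquoted value containing a stray quote character would raise an error.

```python
import shlex
from collections import Counter


def parse_logfmt(line):
    """Parse one logfmt line (key=value pairs, values optionally double-quoted) into a dict."""
    fields = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:  # skip tokens that are not key=value
            fields[key] = value
    return fields


def tally_errors(lines):
    """Count error-level lines, grouped by (role, msg)."""
    counts = Counter()
    for line in lines:
        fields = parse_logfmt(line)
        if fields.get("level") == "error":
            counts[(fields.get("role"), fields.get("msg"))] += 1
    return counts
```

Fed the three lines above, this groups them into two distinct errors: the endpoints informer (twice) and the pod informer (once), both from the Kubernetes service discovery failing to sync its cache.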

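Then it occurred to me: when the Prometheus pod starts, it reads the data stored in the PV. So I decided to stop mounting the PVC and PV.
After the Prometheus pod was redeployed, it recovered quickly.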
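
For illustration only, here is one way the "drop the PVC" change could look when expressed on a pod spec held as a Python dict. The post does not say what replaced the mount; this sketch assumes an `emptyDir` volume, and the volume name, claim name, and helper `use_empty_dir` are all hypothetical. Only the `/prometheus` mount path is taken from the log above.

```python
import copy


def use_empty_dir(pod_spec, volume_name):
    """Return a copy of pod_spec where the named PVC-backed volume uses emptyDir instead."""
    spec = copy.deepcopy(pod_spec)
    for i, vol in enumerate(spec.get("volumes", [])):
        if vol.get("name") == volume_name and "persistentVolumeClaim" in vol:
            spec["volumes"][i] = {"name": volume_name, "emptyDir": {}}
    return spec


# Hypothetical pod spec fragment; the names are placeholders, not the author's manifest.
pod_spec = {
    "containers": [{
        "name": "prometheus",
        "image": "prom/prometheus:v2.12.0",
        "volumeMounts": [{"name": "prometheus-data", "mountPath": "/prometheus"}],
    }],
    "volumes": [{
        "name": "prometheus-data",
        "persistentVolumeClaim": {"claimName": "prometheus-pvc"},
    }],
}

patched = use_empty_dir(pod_spec, "prometheus-data")
print(patched["volumes"])  # [{'name': 'prometheus-data', 'emptyDir': {}}]
```

The trade-off is that an `emptyDir` volume is discarded when the pod is removed, so Prometheus starts with an empty data directory each time. That matches the quick recovery described here, at the cost of losing history.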

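Not ready to give up, I mounted the PVC and PV back. Sure enough, Prometheus could not run properly: it always stalled after reading through a large number of WAL entries.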

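So I tried one more thing: I removed the PV and mounted a fresh one.
Then I left it running, "hoping" to watch it for a few days and see whether any new problem came up.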

The cruel reality: one hour later, the prometheus pod had failed and stopped working again!
For now, I have had to give up on the PV/PVC storage approach.