Having covered Trino cluster monitoring, it makes sense to round things out with query-level metrics. Here are the Trino query metrics I have actually implemented:
(
sum(org_alluxio_client_cachebytesreadcache_count)/sum(org_alluxio_client_cachebyteswrittencache_count)
)
1. trino_execution_querymanager_runningqueries
2. trino_execution_querymanager_runningqueries offset 1d
3. trino_execution_querymanager_runningqueries offset 7d
1. trino_execution_querymanager_queuedqueries
2. trino_execution_QueryManager_QueuedQueries{app=~"${cluster}"} offset 1d
3. trino_execution_QueryManager_QueuedQueries{app=~"${cluster}"} offset 7d
# Submitted Queries in Five minutes
trino_execution_querymanager_submittedqueries_fiveminute_count
# Completed Queries in Five minutes
trino_execution_querymanager_completedqueries_fiveminute_count
# Failed Queries in Five minutes
trino_execution_querymanager_failedqueries_fiveminute_count
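Having a metric means there is something worth watching, and alerting is what decides whether watching leads to any follow-up action. As with the cluster, we implement a few alerts on top of the Trino query metrics:
## Low cache efficiency alert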
expr: (sum(org_alluxio_client_cachebytesreadcache_count) / sum(org_alluxio_client_cachebyteswrittencache_count)) < 0.5
for: 10m
## Cache barely used alert
expr: (sum(org_alluxio_client_cachebytesreadcache_count) / sum(org_alluxio_client_cachebyteswrittencache_count)) < 0.1
for: 30m
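The two cache thresholds can be tried out offline before wiring them into an alert rule. The following is a minimal sketch, not the author's implementation: the function and constant names are hypothetical, the inputs stand for the summed values of the two Alluxio counters, and the `for:` durations are not modeled. Note that in the real rules both alerts would be firing once the ratio drops below 0.1; the sketch only reports the more severe label.

```python
from typing import Optional

# Thresholds copied from the two alert expressions above.
LOW_EFFICIENCY_THRESHOLD = 0.5
BARELY_USED_THRESHOLD = 0.1


def cache_read_write_ratio(bytes_read: float, bytes_written: float) -> Optional[float]:
    """Bytes read from cache divided by bytes written to cache.

    Returns None when nothing has been written yet, to avoid dividing by zero.
    """
    if bytes_written <= 0:
        return None
    return bytes_read / bytes_written


def classify_cache_usage(bytes_read: float, bytes_written: float) -> str:
    """Map the ratio onto the most severe matching label (hypothetical labels)."""
    ratio = cache_read_write_ratio(bytes_read, bytes_written)
    if ratio is None:
        return "no-data"
    if ratio < BARELY_USED_THRESHOLD:
        return "barely-used"
    if ratio < LOW_EFFICIENCY_THRESHOLD:
        return "low-efficiency"
    return "ok"
```

For example, `classify_cache_usage(30, 100)` falls between the two thresholds and is labeled `"low-efficiency"`.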
queuedqueries > 0 for more than 10 minutes
queuedqueries 5-minute slope greater than 0 for more than 15 minutes
## Queue persistently non-empty
- alert: TrinoQueuePersistentlyNonZero
  expr: trino_execution_querymanager_queuedqueries > 0
  for: 10m
## Queue backlog keeps growing
- alert: TrinoQueueBacklogGrowing
  # queuedqueries is a gauge, so use deriv() for its slope; rate() is meant for counters
  expr: deriv(trino_execution_querymanager_queuedqueries[5m]) > 0
  for: 15m
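The "5-minute slope" condition can be read as a least-squares trend over the queue-length samples in the window, which is how PromQL's `deriv()` estimates the per-second slope of a gauge. The sketch below illustrates that idea under stated assumptions: `slope_per_second` is a hypothetical helper, and the samples are made-up `(timestamp_seconds, queued_queries)` pairs rather than real scrape data.

```python
from typing import Sequence, Tuple


def slope_per_second(samples: Sequence[Tuple[float, float]]) -> float:
    """Least-squares slope of value over time, in units per second."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    # Covariance of (time, value) and variance of time, both unnormalized.
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    if var == 0:
        raise ValueError("timestamps must not all be equal")
    return cov / var


# Made-up samples, one per minute.
growing = [(0, 2), (60, 4), (120, 6), (180, 8)]
draining = [(0, 8), (60, 6), (120, 4), (180, 2)]
sawtooth = [(0, 5), (60, 0), (120, 5), (180, 0)]

print(slope_per_second(growing) > 0)   # backlog is building up
print(slope_per_second(draining) > 0)  # backlog is shrinking
print(slope_per_second(sawtooth) > 0)  # oscillating, no upward trend
```

A trend-based check like this stays at or below zero for a queue that merely oscillates, whereas a counter-style rate would treat every drop as a reset and report growth.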
5min_FailedQueries >= 5 for 15 minutes
5min_FailedQueries more than twice yesterday's value for 10 minutes
## Persistent failures over a period
- alert: TrinoFailedQueriesPersistentlyHigh
  expr: sum(trino_execution_querymanager_failedqueries_fiveminute_count) >= 5
  for: 15m
## Abnormal spike relative to yesterday
- alert: TrinoFailedQueriesSpikeVsYesterday
  expr: sum(trino_execution_querymanager_failedqueries_fiveminute_count)
    > 2 * (sum(trino_execution_querymanager_failedqueries_fiveminute_count offset 1d) + 1)
  for: 10m
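The `+ 1` in the comparison against yesterday matters when yesterday's failure count was zero: without it, a single failed query today would already exceed "twice yesterday". A small sketch of the same comparison, with a hypothetical function name and parameters and without modeling the `for:` duration:

```python
def is_failure_spike(today: float, yesterday: float,
                     factor: float = 2.0, floor: float = 1.0) -> bool:
    """Mirror of: today > factor * (yesterday + floor).

    `floor` plays the role of the `+ 1` in the alert expression, so that a
    quiet day yesterday does not make any failure today look like a spike.
    """
    return today > factor * (yesterday + floor)


# With zero failures yesterday, today needs more than 2 failures to count.
print(is_failure_spike(today=2, yesterday=0))
print(is_failure_spike(today=3, yesterday=0))
# With 4 failures yesterday, the bar is 2 * (4 + 1) = 10.
print(is_failure_spike(today=10, yesterday=4))
print(is_failure_spike(today=11, yesterday=4))
```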
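For a first-time participant, I am fairly satisfied with how this went. From being inspired by a colleague to start writing down my work, to actually reading the documentation to fill in the content, to finally finishing 30 days of articles one by one, I received a great deal of advice and encouragement along the way. I am grateful that all of this happened, and grateful to myself for continuing to write.
Finally, I want to hold myself to something I heard 碼農高天 say at PyCon a few days ago: write code because it interests you, and share when you have learned something. I hope to keep that enthusiasm throughout my career as a data engineer.
Lastly, a plug for the series by my friend and fellow participant @wudihero2, 【知其然,更知其所以然】: https://ithelp.ithome.com.tw/articles/10376305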
My LinkedIn: https://www.linkedin.com/in/benny0624/
My Medium: https://hndsmhsu.medium.com/