iT邦幫忙

How to Use Grafana Loki Alert Rules and Send Alerts Through Alertmanager

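We send the alerts raised by Loki's alert rules to Alertmanager, which takes care of silencing, deduplication, and grouping, and routes them to the right receiver, such as email or LINE Notify.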

Setting up alerts and notifications takes three main steps:

  • Set up Alertmanager
  • Configure Loki to talk to Alertmanager
  • Create alert rules in Loki

Setting up Alertmanager
For how to install Alertmanager, see this article.

Edit the alertmanager.yml configuration file

sudo vi /opt/alertmanager/alertmanager-0.25.0.linux-amd64/alertmanager.yml

Add a receiver named team-infra-mails that sends alerts by email.

global:
  smtp_smarthost: 'your_smtp_ip:your_port'
  smtp_from: 'your_from_mail_address'
  smtp_require_tls: false

route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'team-infra-mails'

receivers:
  - name: 'team-infra-mails'
    email_configs:
      - to: 'your_to_mail_address'
        send_resolved: true

# Inhibition rules allow to mute a set of alerts given that another alert is firing.
# We use this to mute any warning-level notifications if the same alert is already critical.
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    # Apply inhibition if the alertname is the same.
    # CAUTION:
    #   If all label names listed in `equal` are missing
    #   from both the source and target alerts,
    #   the inhibition rule will apply!
    equal: ['alertname', 'dev', 'instance']
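
The CAUTION comment in the inhibit rule is easy to misread, so here is a minimal Python sketch of how one such rule could be evaluated. It is a simplified model and not Alertmanager's actual code: it assumes exact-match selectors only and treats a missing label as an empty value. The alert dictionaries are made-up examples.

```python
def is_inhibited(source, target, source_match, target_match, equal):
    """Return True if `source` would mute `target` under one inhibit rule."""
    # Both alerts must match their respective label selectors exactly.
    if any(source.get(k) != v for k, v in source_match.items()):
        return False
    if any(target.get(k) != v for k, v in target_match.items()):
        return False
    # A label missing from an alert is treated as an empty value, so a label
    # absent from BOTH alerts still counts as "equal".
    return all(source.get(name, "") == target.get(name, "") for name in equal)


# Hypothetical alerts, described only by their labels.
critical = {"alertname": "mssql-object-dropped", "severity": "critical"}
warning = {"alertname": "mssql-object-dropped", "severity": "warning"}
other = {"alertname": "disk-full", "severity": "warning"}

rule = dict(
    source_match={"severity": "critical"},
    target_match={"severity": "warning"},
    equal=["alertname", "dev", "instance"],
)

print(is_inhibited(critical, warning, **rule))  # same alertname; dev and instance absent on both
print(is_inhibited(critical, other, **rule))    # alertname differs
```

In this model the first call reports the warning as muted even though neither alert carries dev or instance, which is the situation the CAUTION comment warns about.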

Remember to restart the Alertmanager service.

sudo service alertmanager restart
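
Before wiring up Loki, it can help to confirm that the email receiver works by pushing a hand-made alert into Alertmanager. The sketch below builds a payload for Alertmanager's v2 alerts endpoint (POST /api/v2/alerts) using only the standard library. The address http://localhost:9093, the alert name smoke-test, and the helper names are assumptions for illustration.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone

# Assumption: Alertmanager listens on its default port on this host.
ALERTMANAGER_URL = "http://localhost:9093"


def build_test_alert(name="smoke-test", severity="warning", minutes=5):
    """Build a one-alert payload for Alertmanager's v2 API."""
    now = datetime.now(timezone.utc)
    return [{
        "labels": {"alertname": name, "severity": severity},
        "annotations": {"summary": "Test alert sent by hand"},
        "startsAt": now.isoformat(),
        # After endsAt passes, Alertmanager treats the alert as resolved.
        "endsAt": (now + timedelta(minutes=minutes)).isoformat(),
    }]


def send_test_alert(base_url=ALERTMANAGER_URL):
    """POST the payload and return the HTTP status code."""
    req = urllib.request.Request(
        base_url + "/api/v2/alerts",
        data=json.dumps(build_test_alert()).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status
```

Calling send_test_alert() against a running Alertmanager should lead to an email once group_wait (30s in the configuration above) has elapsed, and to a resolved notification some time after endsAt because send_resolved is enabled.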

Configuring Loki to talk to Alertmanager
Edit the Loki configuration file

sudo vi /opt/loki/loki-local-config.yaml

Point rules_directory at the folder where you store your alert rules.

common:
  path_prefix: /tmp/loki
  storage:
    filesystem:
      chunks_directory: /tmp/loki/chunks
      rules_directory: /tmp/loki/rules
  replication_factor: 1
  ring:
    instance_addr: 127.0.0.1
    kvstore:
      store: inmemory

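Note that chunks_directory and rules_directory under filesystem both point to /tmp, so the data is usually gone after a reboot. Change these paths if you need to keep the data.

Point alertmanager_url at the server where Alertmanager is installed.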

ruler:
  alertmanager_url: http://localhost:9093

Remember to restart the Loki service.

sudo service loki restart

Create a fake folder under /tmp/loki/rules

sudo mkdir /tmp/loki/rules/fake

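Why create a fake folder?
Loki supports multi-tenancy, and in single-tenant mode fake is the default tenant ID. If you enable multi-tenancy, keep each tenant's rules in a folder named after its tenant ID.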
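
To make the folder layout concrete, here is a small Python sketch that computes where a rule file goes for a given tenant. The helper name rule_file_path and the tenant team-a are made up for illustration; the directory and the fake tenant ID come from the configuration above.

```python
from pathlib import PurePosixPath

RULES_DIRECTORY = PurePosixPath("/tmp/loki/rules")  # value of rules_directory above
SINGLE_TENANT_ID = "fake"  # tenant ID Loki uses when multi-tenancy is off


def rule_file_path(filename, tenant_id=SINGLE_TENANT_ID):
    """Return where a rule file for the given tenant should be placed."""
    return RULES_DIRECTORY / tenant_id / filename


print(rule_file_path("mssql-ddl-alert.yml"))
# "team-a" is a hypothetical tenant ID for a multi-tenant setup.
print(rule_file_path("mssql-ddl-alert.yml", tenant_id="team-a"))
```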

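Creating alert rules
As a demonstration, we raise an alert whenever a CREATE, ALTER, or DROP is executed against a database or table.

sudo vi /tmp/loki/rules/fake/mssql-ddl-alert.yml

The file contents are as follows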

groups:
  - name: mssql-object-created
    rules:
      - alert: mssql-object-created
        expr: |
          count_over_time({computer=~"your_mssql_server", source="MSSQLSERVER", eventID="33205"}
            | pattern `<_>event_time:<event_time>\n<_>`
            | pattern `<_>action_id:<action_id>\n<_>`
            | label_format action_id=`{{.action_id | trim | replace "CR" "CREATE" | replace "AL" "ALTER" | replace "DR" "DROP"}}`
            | action_id ="CREATE"
            | pattern `<_>class_type:<class_type>\n<_>`
            | label_format class_type=`{{.class_type | trim | replace "DB" "DATABASE" | replace "U" "TABLE" | replace "V" "VIEW" | replace "P" "STORED PROCEDURE"}}`
            | pattern `<_>database_name:<database_name>\n<_>`
            | database_name !~`(tempdb)`
            | pattern `<_>object_name:<object_name>\n<_>`
            | pattern `<_>schema_name:<schema_name>\n<_>`
            | pattern `<_>server_instance_name:<server_instance_name>\n<_>`
            | pattern `<_>server_principal_name:<server_principal_name>\n<_>`
            | pattern `<_>statement:<statement>\nadditional_information<_>`
            | label_format statement=`{{.statement | replace "\\r\\n" " " | replace "\\r" " " | replace "\\n" " " | replace "\u005c\u005c" "\u005c" | replace "[" "" | replace "]" ""}}` [1m]) > 0
        for: 0m
        labels:
          severity: critical
        annotations:
          # Message labels: 主機名稱 = host name, 警示訊息 = alert message, 敘述句 = statement
          summary: "主機名稱: {{ $labels.computer }}\n警示訊息: {{ $labels.object_name }} has been created.\n敘述句: {{ $labels.statement }}\n"
  - name: mssql-object-altered
    rules:
      - alert: mssql-object-altered
        expr: |
          count_over_time({computer=~"your_mssql_server", source="MSSQLSERVER", eventID="33205"}
            | pattern `<_>event_time:<event_time>\n<_>`
            | pattern `<_>action_id:<action_id>\n<_>`
            | label_format action_id=`{{.action_id | trim | replace "CR" "CREATE" | replace "AL" "ALTER" | replace "DR" "DROP"}}`
            | action_id ="ALTER"
            | pattern `<_>class_type:<class_type>\n<_>`
            | label_format class_type=`{{.class_type | trim | replace "DB" "DATABASE" | replace "U" "TABLE" | replace "V" "VIEW" | replace "P" "STORED PROCEDURE"}}`
            | pattern `<_>database_name:<database_name>\n<_>`
            | database_name !~`(tempdb)`
            | pattern `<_>object_name:<object_name>\n<_>`
            | pattern `<_>schema_name:<schema_name>\n<_>`
            | pattern `<_>server_instance_name:<server_instance_name>\n<_>`
            | pattern `<_>server_principal_name:<server_principal_name>\n<_>`
            | pattern `<_>statement:<statement>\nadditional_information<_>`
            | label_format statement=`{{.statement | replace "\\r\\n" " " | replace "\\r" " " | replace "\\n" " " | replace "\u005c\u005c" "\u005c" | replace "[" "" | replace "]" ""}}` [1m]) > 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "主機名稱: {{ $labels.computer }}\n警示訊息: {{ $labels.object_name }} has been altered.\n敘述句: {{ $labels.statement }}\n"
  - name: mssql-object-dropped
    rules:
      - alert: mssql-object-dropped
        expr: |
          count_over_time({computer=~"your_mssql_server", source="MSSQLSERVER", eventID="33205"}
            | pattern `<_>event_time:<event_time>\n<_>`
            | pattern `<_>action_id:<action_id>\n<_>`
            | label_format action_id=`{{.action_id | trim | replace "CR" "CREATE" | replace "AL" "ALTER" | replace "DR" "DROP"}}`
            | action_id ="DROP"
            | pattern `<_>class_type:<class_type>\n<_>`
            | label_format class_type=`{{.class_type | trim | replace "DB" "DATABASE" | replace "U" "TABLE" | replace "V" "VIEW" | replace "P" "STORED PROCEDURE"}}`
            | pattern `<_>database_name:<database_name>\n<_>`
            | database_name !~`(tempdb)`
            | pattern `<_>object_name:<object_name>\n<_>`
            | pattern `<_>schema_name:<schema_name>\n<_>`
            | pattern `<_>server_instance_name:<server_instance_name>\n<_>`
            | pattern `<_>server_principal_name:<server_principal_name>\n<_>`
            | pattern `<_>statement:<statement>\nadditional_information<_>`
            | label_format statement=`{{.statement | replace "\\r\\n" " " | replace "\\r" " " | replace "\\n" " " | replace "\u005c\u005c" "\u005c" | replace "[" "" | replace "]" ""}}` [1m]) > 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "主機名稱: {{ $labels.computer }}\n警示訊息: {{ $labels.object_name }} has been dropped.\n敘述句: {{ $labels.statement }}\n"
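
The label_format steps in these rules chain trim with several replace calls. The Python sketch below mimics that chain to show what the labels end up as and why the order of replacements matters. It is only an approximation of the template functions, and the padded sample values are assumptions about what an audit record might contain.

```python
def normalize(value, replacements):
    """Mimic `{{ .x | trim | replace ... | replace ... }}`: trim, then replace in order."""
    value = value.strip()
    for old, new in replacements:
        # Substring replacement, applied to the result of the previous step.
        value = value.replace(old, new)
    return value


ACTION_IDS = [("CR", "CREATE"), ("AL", "ALTER"), ("DR", "DROP")]
CLASS_TYPES = [("DB", "DATABASE"), ("U", "TABLE"), ("V", "VIEW"), ("P", "STORED PROCEDURE")]

print(normalize("CR  ", ACTION_IDS))  # assumed padded value
print(normalize("V ", CLASS_TYPES))

# Order matters: if "P" were expanded before "U", the "U" inside
# "STORED PROCEDURE" would be rewritten as well.
print(normalize("P", list(reversed(CLASS_TYPES))))
```

With the order used in the rules, each expanded word contains none of the codes that are replaced after it, so the chain is safe as written.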

Use the following T-SQL script to trigger the alert rules

USE [Database_1]
GO

CREATE VIEW [dbo].[View_1118]  
AS  
SELECT *  
FROM [dbo].[Table_1]
GO

ALTER VIEW [dbo].[View_1118]  
AS  
SELECT *  
FROM [dbo].[Table_2]
GO

DROP VIEW [dbo].[View_1118]
GO

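Check whether Alertmanager received the alerts

Checking the mail server confirms that Alertmanager did send the alerts by email.

If send_resolved is set, Alertmanager also sends a notification when an alert is resolved.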

receivers:
  - name: 'team-infra-mails'
    email_configs:
      - to: 'your_to_mail_address'
        send_resolved: true

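Using LINE Notify
Our company still collaborates mainly through LINE, so let's push the alerts to LINE Notify.

For how to apply for a LINE Notify access token, see this article.

Unfortunately, Alertmanager's receivers do not currently support LINE Notify: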

# The unique name of the receiver.
name: <string>
# Configurations for several notification integrations.
email_configs:
  [ - <email_config>, ... ]
opsgenie_configs:
  [ - <opsgenie_config>, ... ]
pagerduty_configs:
  [ - <pagerduty_config>, ... ]
pushover_configs:
  [ - <pushover_config>, ... ]
slack_configs:
  [ - <slack_config>, ... ]
sns_configs:
  [ - <sns_config>, ... ]
victorops_configs:
  [ - <victorops_config>, ... ]
webhook_configs:
  [ - <webhook_config>, ... ]
wechat_configs:
  [ - <wechat_config>, ... ]
telegram_configs:
  [ - <telegram_config>, ... ]
webex_configs:
  [ - <webex_config>, ... ]

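Fortunately, LINE Notify can be hooked up through a webhook.
Thanks to a developer in Bangkok, Thailand, who has already done the groundwork for us:
https://github.com/be99inner/line-notify-gateway/

It is a pity, though, that the message format is hard-coded in app.py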

def firing_alert(request):
    if request.json['status'] == 'firing':
        icon = "⛔⛔⛔ 😡 ⛔⛔⛔"
        status = "Firing"
        time = reformat_datetime(request.json['alerts'][0]['startsAt'])
    else:
        icon = "🔷🔷🔷 😎 🔷🔷🔷"
        status = "Resolved"
        time = str(datetime.now().date()) + ' ' + str(datetime.now().time().strftime('%H:%M:%S'))
    header = {'Authorization':request.headers['AUTHORIZATION']}
    for alert in request.json['alerts']:
        msg = "Alertmanger: " + icon + "\nStatus: " + status + "\nSeverity: " + alert['labels']['severity'] + "\nTime: " + time + "\nSummary: " + alert['annotations']['summary'] + "\nDescription: " + alert['annotations']['description']
        msg = {'message': msg}
        response = requests.post(LINE_NOTIFY_URL, headers=header, data=msg)

I changed it so that only alert['annotations']['summary'] is passed in as the message body:
https://github.com/jieshiun/line-notify-gateway

This way, we only need to edit summary to change the alert message.

def firing_alert(request):
    if request.json['status'] == 'firing':
        status = "Firing"
        time = reformat_datetime(request.json['alerts'][0]['startsAt'])
    else:
        status = "Resolved"
        time = str(datetime.now().date()) + ' ' + str(datetime.now().time().strftime('%H:%M:%S'))
    header = {'Authorization':request.headers['AUTHORIZATION']}
    for alert in request.json['alerts']:
        # Message labels: 發生時間 = time of occurrence, 當前狀態 = current status
        msg = "\n發生時間: " + time + "\n" + alert['annotations']['summary'] + "當前狀態: " + status
        msg = {'message': msg}
        response = requests.post(LINE_NOTIFY_URL, headers=header, data=msg)
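
The message building can also be pulled out into a small pure function, which makes it easy to try without a running gateway. This is a sketch and not the code in the repository: the name build_message is made up, English labels are used for readability, and the .get() fallback for a missing summary is an added safeguard.

```python
def build_message(alert, status, time_text):
    """Format one alert for LINE Notify using only its `summary` annotation."""
    # .get() avoids a KeyError when a rule defines no summary annotation
    # (an added safeguard, not part of the gateway code shown above).
    summary = alert.get("annotations", {}).get("summary", "(no summary)\n")
    return "\nTime: " + time_text + "\n" + summary + "Status: " + status


# Hypothetical alert; "sql01" is a made-up host name.
alert = {"annotations": {"summary": "Host: sql01\nAlert: View_1118 has been created.\n"}}
print(build_message(alert, "Firing", "2023-01-01 12:00:00"))
```

Because each summary in the rule files ends with \n, the status lands on its own line at the end of the message.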

For how to install Docker and Docker Compose, see this article.

Start the container with Docker Compose

cd /opt
sudo git clone https://github.com/jieshiun/line-notify-gateway.git
cd line-notify-gateway
sudo docker compose up -d
[+] Building 15.7s (9/9) FINISHED
 => [internal] load build definition from Dockerfile  0.1s
 => => transferring dockerfile: 179B  0.0s
 => [internal] load .dockerignore  0.2s
 => => transferring context: 2B  0.0s
 => [internal] load metadata for docker.io/library/python:3.8-slim  0.0s
 => [1/4] FROM docker.io/library/python:3.8-slim  0.0s
 => [internal] load build context  0.1s
 => => transferring context: 10.85kB  0.0s
 => CACHED [2/4] WORKDIR /usr/app  0.0s
 => [3/4] COPY ./ /usr/app  0.6s
 => [4/4] RUN pip install -r requirements.txt  13.9s
 => exporting to image  1.0s
 => => exporting layers  1.0s
 => => writing image sha256:56e0ab49d94ae2b7fe9995b8d1e266780ed06dcd533806b6ff39f5096efae7d1  0.0s
 => => naming to docker.io/library/line-notify-gateway-line-notify-gateway  0.0s
[+] Running 1/1
 ⠿ Container line-notify-gateway-line-notify-gateway-1  Started

Confirm that the service is listening on port 5000

sudo docker compose ps
SERVICE               CREATED             STATUS              PORTS
line-notify-gateway   30 seconds ago      Up 28 seconds       0.0.0.0:5000->5000/tcp, :::5000->5000/tcp

I have also uploaded the image to Docker Hub, so feel free to use it directly.

sudo docker run -d -p 5000:5000 -v /etc/localtime:/etc/localtime:ro --restart always --name line-notify-gateway jieshiun/line-notify-gateway

Browse to http://your_host_ip:5000/webhook

Browse to http://your_host_ip:5000/logs

With that, our webhook receiver is up and running.
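
The gateway can also be exercised by hand before Alertmanager is connected to it. The sketch below builds a POST request shaped like an Alertmanager webhook notification, reduced to the fields the gateway code reads. The URL, the alert name manual-test, and the helper name are assumptions. The timestamp format accepted by reformat_datetime is not shown in the excerpt, so the startsAt value is a guess modeled on Alertmanager's RFC 3339 timestamps.

```python
import json
import urllib.request

# Assumption: the gateway runs on this host, on the port published above.
GATEWAY_URL = "http://localhost:5000/webhook"


def build_webhook_request(token, summary, status="firing", url=GATEWAY_URL):
    """Build a POST shaped like an Alertmanager webhook notification."""
    payload = {
        "status": status,
        "alerts": [{
            "status": status,
            "labels": {"alertname": "manual-test", "severity": "critical"},
            "annotations": {"summary": summary},
            "startsAt": "2023-01-01T12:00:00.000Z",  # assumed format
        }],
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # The gateway forwards this header to LINE Notify unchanged.
            "Authorization": "Bearer " + token,
        },
        method="POST",
    )


# To actually send it (needs a running gateway and a valid token):
# urllib.request.urlopen(build_webhook_request("your_line_notify_access_token", "Host: test\n"))
```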

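Configuring Alertmanager
Edit the alertmanager.yml configuration file

sudo vi /opt/alertmanager/alertmanager-0.25.0.linux-amd64/alertmanager.yml

Add a receiver named team-infra-line-notify that uses webhook_configs

  • url: the service endpoint of the webhook receiver
  • bearer_token: the LINE access token you obtained

Add a routing rule

  • Any alert whose source label matches MSSQLSERVER is sent through LINE Notify
  • Alerts that match no routing rule fall back to the default email receiver.
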
global:
  smtp_smarthost: 'your_smtp_ip:your_port'
  smtp_from: 'your_from_mail_address'
  smtp_require_tls: false

route:
  group_by: ['alertname']
  group_wait: 1m
  group_interval: 15m
  repeat_interval: 4h
  receiver: 'team-infra-mails'
  routes:
    - receiver: "team-infra-line-notify"
      group_wait: 10s
      match_re:
        source: MSSQLSERVER
      continue: true

receivers:
  - name: 'team-infra-mails'
    email_configs:
      - to: 'your_to_mail_address'
        send_resolved: true
  - name: 'team-infra-line-notify'
    webhook_configs:
      - url: 'http://localhost:5000/webhook'
        send_resolved: true
        http_config:
          bearer_token: 'your_line_notify_access_token'

# Inhibition rules allow to mute a set of alerts given that another alert is firing.
# We use this to mute any warning-level notifications if the same alert is already critical.
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    # Apply inhibition if the alertname is the same.
    # CAUTION:
    #   If all label names listed in `equal` are missing
    #   from both the source and target alerts,
    #   the inhibition rule will apply!
    equal: ['alertname', 'dev', 'instance']
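
To see how the routing tree above picks a receiver, here is a simplified Python model of a single level of child routes. It is a sketch and not Alertmanager's implementation: it only handles match_re, assumes regexes are fully anchored, and ignores grouping and timing.

```python
import re

DEFAULT_RECEIVER = "team-infra-mails"
CHILD_ROUTES = [
    {"receiver": "team-infra-line-notify", "match_re": {"source": "MSSQLSERVER"}, "continue": True},
]


def receivers_for(labels):
    """Return the receivers an alert with these labels would be routed to."""
    matched = []
    for route in CHILD_ROUTES:
        # Regex matchers are anchored: the whole label value must match.
        if all(re.fullmatch(pattern, labels.get(name, ""))
               for name, pattern in route["match_re"].items()):
            matched.append(route["receiver"])
            if not route["continue"]:
                break  # stop at the first match unless `continue` is set
    # The top-level receiver is only a fallback when no child route matched.
    return matched or [DEFAULT_RECEIVER]


print(receivers_for({"source": "MSSQLSERVER"}))
print(receivers_for({"source": "nginx"}))  # "nginx" is a made-up source
```

In this model, continue: true only lets matching carry on to later sibling routes. With a single child route it changes nothing, and an alert that matches the child is not also sent to the default email receiver.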

Remember to restart the Alertmanager service.

sudo service alertmanager restart

Use the following T-SQL script to trigger the alert rules

USE [Database_1]
GO

CREATE VIEW [dbo].[View_0240]  
AS  
SELECT *  
FROM [dbo].[Table_1]
GO

ALTER VIEW [dbo].[View_0240]  
AS  
SELECT *  
FROM [dbo].[Table_2]
GO

DROP VIEW [dbo].[View_0240]
GO

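Check whether Alertmanager received the alerts

Checking the LINE chat group confirms that Alertmanager sent the alerts successfully.

You can also check from the webhook receiver whether the calls to the Notify API succeeded

Browse to http://your_host_ip:5000/logs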

Create an alert rule for failed logins to test with

sudo vi /tmp/loki/rules/fake/mssql-login-alert.yml

The file contents are as follows

groups:
  - name: mssql-login-failed-alert
    rules:
      - alert: mssql-login-failed
        expr: |
          count_over_time({computer=~"your_mssql_server", source="MSSQLSERVER", eventID="33205"}
            | pattern `<_>action_id:<action_id>\n<_>`
            | label_format action_id =`{{.action_id | trim | replace "LGIF" "LOGIN FAILED"}}`
            | pattern `<_>statement:<statement>\nadditional_information<_>`
            | action_id ="LOGIN FAILED" [1m]) > 3
        for: 0m
        labels: 
          severity: critical
        annotations:
          # Message labels: 主機名稱 = host name, 警示訊息 = alert message, 敘述句 = statement
          summary: "主機名稱: {{ $labels.computer }}\n警示訊息: Too many failed logins in mssql.\n敘述句: {{ $labels.statement }}\n"
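
The threshold logic of this rule, count_over_time(... [1m]) > 3, can be pictured with a short Python sketch. It only models the counting: the timestamps are made-up seconds, and the exact handling of window boundaries is a simplification.

```python
def count_over_time(timestamps, now, window_seconds=60):
    """Count events whose timestamp falls in the window (now - window, now]."""
    return sum(1 for t in timestamps if now - window_seconds < t <= now)


def should_fire(timestamps, now, threshold=3):
    """The rule fires when the count is strictly greater than the threshold."""
    return count_over_time(timestamps, now) > threshold


failed_logins = [100, 110, 125, 150]  # hypothetical: four failures within one minute
print(should_fire(failed_logins, now=155))      # four events in the window
print(should_fire(failed_logins[:3], now=155))  # three events do not exceed the threshold
```

Keep in mind that without an aggregation such as sum by, the count is kept per distinct label set, so log lines that end up with different statement labels are counted separately.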

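Checking the LINE chat group confirms that Alertmanager sent the alert successfully.

Create an alert rule for deleted data to test with

sudo vi /tmp/loki/rules/fake/mssql-dml-alert.yml

The file contents are as follows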

groups:
  - name: mssql-object-deleted
    rules:
      - alert: mssql-object-deleted
        expr: |
          count_over_time({computer=~"your_mssql_server", source="MSSQLSERVER", eventID="33205"}
            | pattern `<_>event_time:<event_time>\n<_>`
            | pattern `<_>action_id:<action_id>\n<_>`
            | label_format action_id=`{{.action_id | trim | replace "SL" "SELECT" | replace "IN" "INSERT" | replace "UP" "UPDATE" | replace "DL" "DELETE"}}`
            | action_id ="DELETE"
            | pattern `<_>class_type:<class_type>\n<_>`
            | label_format class_type=`{{.class_type | trim | replace "DB" "DATABASE" | replace "U" "TABLE" | replace "V" "VIEW" | replace "P" "STORED PROCEDURE"}}`
            | pattern `<_>database_name:<database_name>\n<_>`
            | database_name !~`(tempdb)`
            | pattern `<_>object_name:<object_name>\n<_>`
            | pattern `<_>server_principal_name:<server_principal_name>\n<_>`
            | pattern `<_>statement:<statement>\nadditional_information<_>`
            | label_format statement=`{{.statement | replace "\\r\\n" " " | replace "\\r" " " | replace "\\n" " "}}` [1m]) > 0
        for: 0m
        labels:
          severity: critical
        annotations:
          # Message labels: 主機名稱 = host name, 警示訊息 = alert message, 使用者名稱 = user name, 敘述句 = statement
          summary: "主機名稱: {{ $labels.computer }}\n警示訊息: {{ $labels.object_name }} has been deleted.\n使用者名稱: {{ $labels.server_principal_name }}\n敘述句: {{ $labels.statement }}\n"

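Checking the LINE chat group confirms that Alertmanager sent the alert successfully.

By now you should know how to create Loki alert rules and send alerts to LINE Notify through Alertmanager.

That's all for today's post. I hope it helps.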

