[18] Why CloudWatch Container Insights can collect EKS cluster node and pod metrics


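Yesterday's post used Fluentd as an example of a log agent that is deployed to every node to collect the /var/log/containers directory. This post continues from there and looks at where the metrics come from, and how they are collected, once Container Insights has been deployed.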


  1. Deploy the CloudWatch agent and Fluentd with the Quick Start template provided in the official Container Insights on Amazon EKS and Kubernetes documentation [1].
$ curl | sed "s/{{cluster_name}}/ironman-2022/;s/{{region_name}}/eu-west-1/" | kubectl apply -f -

The template source can also be viewed at GitHub - CloudWatch Agent for Container Insights Kubernetes Monitoring [2].
  2. Both the CloudWatch agent and Fluentd need IAM permissions for CloudWatch Logs or Metrics, so the following example sets up IRSA [3] by attaching the CloudWatchAgentAdminPolicy.

Use the eksctl command to create an IAM role for each of the ServiceAccounts fluentd and cloudwatch-agent.

$ eksctl create iamserviceaccount \
--cluster=ironman-2022 \
--namespace=amazon-cloudwatch \
--name=cloudwatch-agent \
--attach-policy-arn=arn:aws:iam::aws:policy/CloudWatchAgentAdminPolicy \
--override-existing-serviceaccounts \
--region eu-west-1 \
--approve

$ eksctl create iamserviceaccount \
--cluster=ironman-2022 \
--namespace=amazon-cloudwatch \
--name=fluentd \
--attach-policy-arn=arn:aws:iam::aws:policy/CloudWatchAgentAdminPolicy \
--override-existing-serviceaccounts \
--region eu-west-1 \
--approve

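According to the Container Insights prerequisites [4], the CloudWatch agent additionally needs the EC2 DescribeVolumes permission. If the attached policy does not include it, it has to be granted separately.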

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "ec2:DescribeVolumes"
            ],
            "Resource": "*",
            "Effect": "Allow"
        }
    ]
}
  3. Restart the DaemonSets cloudwatch-agent and fluentd-cloudwatch with kubectl rollout.
$ kubectl -n amazon-cloudwatch rollout restart ds cloudwatch-agent
daemonset.apps/cloudwatch-agent restarted

$ kubectl -n amazon-cloudwatch rollout restart ds fluentd-cloudwatch
daemonset.apps/fluentd-cloudwatch restarted
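
The sed expression in step 1 fills two placeholders in the Quick Start manifest before it is applied. As a rough sketch of that substitution in Python (the two-line `sample` template below is a made-up stand-in, not the real manifest):

```python
def render_quickstart(template: str, cluster_name: str, region_name: str) -> str:
    """Fill the {{cluster_name}} and {{region_name}} placeholders.

    Like the sed expressions (which carry no /g flag), only the first
    occurrence of each placeholder on a line is replaced.
    """
    rendered = []
    for line in template.splitlines():
        line = line.replace("{{cluster_name}}", cluster_name, 1)
        line = line.replace("{{region_name}}", region_name, 1)
        rendered.append(line)
    return "\n".join(rendered)


# Hypothetical template fragment, used only for illustration.
sample = 'cluster: "{{cluster_name}}"\nregion: "{{region_name}}"'
print(render_quickstart(sample, "ironman-2022", "eu-west-1"))
```

The rendered text is what would then be piped to kubectl apply -f -.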


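To avoid confusing what Fluentd and the CloudWatch agent are each responsible for, the rest of this post walks through the Fluentd and the CloudWatch agent configuration separately.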

Once the Quick Start template above is installed with its defaults, the following log groups appear under CloudWatch Log groups:

  • /aws/containerinsights/ironman-2022/application
  • /aws/containerinsights/ironman-2022/dataplane
  • /aws/containerinsights/ironman-2022/host
  • /aws/containerinsights/ironman-2022/performance


      <match **>
        @type cloudwatch_logs
        @id out_cloudwatch_logs_systemd
        region "#{ENV.fetch('AWS_REGION')}"
        log_group_name "/aws/containerinsights/#{ENV.fetch('CLUSTER_NAME')}/dataplane"
        log_stream_name_key stream_name
        auto_create_stream true
        remove_log_stream_name_key true
        <buffer>
          flush_interval 5
          chunk_limit_size 2m
          queued_chunks_limit_size 32
          retry_forever true
        </buffer>
      </match>

      <match **>
        @type cloudwatch_logs
        @id out_cloudwatch_logs_containers
        region "#{ENV.fetch('AWS_REGION')}"
        log_group_name "/aws/containerinsights/#{ENV.fetch('CLUSTER_NAME')}/application"
        log_stream_name_key stream_name
        remove_log_stream_name_key true
        auto_create_stream true
        <buffer>
          flush_interval 5
          chunk_limit_size 2m
          queued_chunks_limit_size 32
          retry_forever true
        </buffer>
      </match>

      <match host.**>
        @type cloudwatch_logs
        @id out_cloudwatch_logs_host_logs
        region "#{ENV.fetch('AWS_REGION')}"
        log_group_name "/aws/containerinsights/#{ENV.fetch('CLUSTER_NAME')}/host"
        log_stream_name_key stream_name
        remove_log_stream_name_key true
        auto_create_stream true
        <buffer>
          flush_interval 5
          chunk_limit_size 2m
          queued_chunks_limit_size 32
          retry_forever true
        </buffer>
      </match>

The application group holds the Pod container logs covered on Day 17. Note that the Fluentd image used by this Quick Start template is already fairly old: fluent/fluentd-kubernetes-daemonset:v1.7.3-debian-cloudwatch-1.0. If you need a newer Fluentd version, see Fluentd Daemonset for Kubernetes [6].

The plugin versions can be checked from the Gemfile inside the fluentd container:

$ kubectl -n amazon-cloudwatch exec -it fluentd-cloudwatch-qkf9m -- cat /fluentd/Gemfile
Defaulted container "fluentd-cloudwatch" out of: fluentd-cloudwatch, copy-fluentd-config (init), update-log-driver (init)
# DO NOT EDIT THIS FILE DIRECTLY, USE /templates/Gemfile.erb

source ""

gem "fluentd", "1.7.3"
gem "oj", "3.8.1"
gem "fluent-plugin-multi-format-parser", "~> 1.0.0"
gem "fluent-plugin-concat", "~> 2.3.0"
gem "fluent-plugin-grok-parser", "~> 2.5.0"
gem "fluent-plugin-prometheus", "~> 1.5.0"
gem 'fluent-plugin-json-in-json-2', ">= 1.0.2"
gem "fluent-plugin-record-modifier", "~> 2.0.0"
gem "fluent-plugin-rewrite-tag-filter", "~> 2.2.0"
gem "aws-sdk-cloudwatchlogs", "~> 1.0"
gem "fluent-plugin-cloudwatch-logs", "~> 0.7.4"
gem "fluent-plugin-kubernetes_metadata_filter", "~> 2.3.0"
gem "ffi"
gem "fluent-plugin-systemd", "~> 1.0.1"

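From the Fluentd configuration we can confirm which paths feed each log group:

  • application: all application logs under the /var/log/containers directory.
  • host: the files /var/log/dmesg, /var/log/secure, and /var/log/messages on the node.
  • dataplane: the /var/log/journal directory on the node, mainly collecting kubelet.service, kubeproxy.service, and docker.service.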
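
The routing described in the list above can be summarized as a small lookup. This is only an illustrative sketch of the mapping, not how Fluentd itself matches tags; the function name `log_group_for` and the sample file name are hypothetical.

```python
from typing import Optional

# Host log files named in the list above.
HOST_FILES = {"/var/log/dmesg", "/var/log/secure", "/var/log/messages"}


def log_group_for(source: str, cluster_name: str) -> Optional[str]:
    """Return the Container Insights log group a node path ends up in."""
    prefix = f"/aws/containerinsights/{cluster_name}"
    if source.startswith("/var/log/containers/"):
        return f"{prefix}/application"
    if source in HOST_FILES:
        return f"{prefix}/host"
    if source.startswith("/var/log/journal"):
        return f"{prefix}/dataplane"
    return None  # not collected by this configuration


# "demo.log" is a made-up file name used only for illustration.
print(log_group_for("/var/log/containers/demo.log", "ironman-2022"))
print(log_group_for("/var/log/messages", "ironman-2022"))
```

The performance log group is absent from this mapping because it is not written by Fluentd.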

CloudWatch Agent

The ConfigMap cwagentconfig only sets metrics_collected to the kubernetes type and defines cluster_name:

$ kubectl -n amazon-cloudwatch get cm cwagentconfig -o yaml
apiVersion: v1
data:
  cwagentconfig.json: |
    {
      "agent": {
        "region": "eu-west-1"
      },
      "logs": {
        "metrics_collected": {
          "kubernetes": {
            "cluster_name": "ironman-2022",
            "metrics_collection_interval": 60
          }
        },
        "force_flush_interval": 5
      }
    }
kind: ConfigMap
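
To make the shape of that configuration concrete, here is a minimal sketch that parses the same JSON and pulls out the kubernetes settings. The helper name `kubernetes_settings` is hypothetical; the JSON values are the ones from the ConfigMap above.

```python
import json

# The cwagentconfig.json payload, with the values shown in the ConfigMap.
CWAGENT_CONFIG = """
{
  "agent": {"region": "eu-west-1"},
  "logs": {
    "metrics_collected": {
      "kubernetes": {
        "cluster_name": "ironman-2022",
        "metrics_collection_interval": 60
      }
    },
    "force_flush_interval": 5
  }
}
"""


def kubernetes_settings(raw: str):
    """Return the configured metric types and the kubernetes section."""
    config = json.loads(raw)
    collected = config["logs"]["metrics_collected"]
    return sorted(collected), collected["kubernetes"]


kinds, k8s = kubernetes_settings(CWAGENT_CONFIG)
print(kinds)
print(k8s["cluster_name"], k8s["metrics_collection_interval"])
```

Note that metrics_collected sits under the "logs" key, which fits with the agent shipping these metrics as log events.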

The CloudWatch agent logs show that this default config is translated into a TOML file:

$ kubectl -n amazon-cloudwatch logs cloudwatch-agent-fdhmb
Configuration validation first phase succeeded

2022/10/03 12:46:34 I! Config has been translated into TOML /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.toml
2022/10/03 12:46:34 D! toml config [agent]
  collection_jitter = "0s"
  debug = false
  flush_interval = "1s"
  flush_jitter = "0s"
  hostname = ""
  interval = "60s"
  logfile = ""
  logtarget = "lumberjack"
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  omit_hostname = true
  precision = ""
  quiet = false
  round_interval = false


[inputs]

  [[inputs.cadvisor]]
    container_orchestrator = "eks"
    interval = "60s"
    mode = "detail"
    [inputs.cadvisor.tags]
      metricPath = "logs"

  [[inputs.k8sapiserver]]
    interval = "60s"
    node_name = ""
    [inputs.k8sapiserver.tags]
      metricPath = "logs_k8sapiserver"

[outputs]

  [[outputs.cloudwatchlogs]]
    force_flush_interval = "5s"
    log_stream_name = ""
    region = "eu-west-1"
    tagexclude = ["metricPath"]
    [outputs.cloudwatchlogs.tagpass]
      metricPath = ["logs", "logs_k8sapiserver"]

[processors]

  [[processors.ec2tagger]]
    disk_device_tag_key = "device"
    ebs_device_keys = ["*"]
    ec2_instance_tag_keys = ["aws:autoscaling:groupName"]
    ec2_metadata_tags = ["InstanceId", "InstanceType"]
    [processors.ec2tagger.tagpass]
      metricPath = ["logs"]

  [[processors.k8sdecorator]]
    cluster_name = "ironman-2022"
    host_ip = ""
    node_name = ""
    order = 1
    prefer_full_pod_name = false
    tag_service = true
    [processors.k8sdecorator.tagpass]
      metricPath = ["logs", "logs_k8sapiserver"]
2022-10-03T12:46:34Z I! Starting AmazonCloudWatchAgent 1.247354.0
2022-10-03T12:46:34Z I! AWS SDK log level not set
2022-10-03T12:46:34Z I! Loaded inputs: cadvisor k8sapiserver
2022-10-03T12:46:34Z I! Loaded aggregators:
2022-10-03T12:46:34Z I! Loaded processors: ec2tagger k8sdecorator
2022-10-03T12:46:34Z I! Loaded outputs: cloudwatchlogs
2022-10-03T12:46:34Z I! Tags enabled:
2022-10-03T12:46:34Z I! [agent] Config: Interval:1m0s, Quiet:false, Hostname:"", Flush Interval:1s
2022-10-03T12:46:34Z I! [processors.ec2tagger] ec2tagger: Check ec2 metadata
2022-10-03T12:46:34Z I! [logagent] starting
2022-10-03T12:46:34Z I! [logagent] found plugin cloudwatchlogs is a log backend
2022-10-03T12:46:34Z I! [processors.ec2tagger] ec2tagger: EC2 tagger has started initialization.
2022-10-03T12:46:34Z I! [processors.ec2tagger] ec2tagger: Check ec2 metadata
2022-10-03T12:46:34Z I! [processors.ec2tagger] ec2tagger: EC2 tagger has started initialization.
I1003 12:46:34.492882       1 leaderelection.go:248] attempting to acquire leader lease amazon-cloudwatch/cwagent-clusterleader...
I1003 12:46:34.504069       1 leaderelection.go:258] successfully acquired lease amazon-cloudwatch/cwagent-clusterleader
2022-10-03T12:46:34Z I! k8sapiserver Switch New Leader:
2022-10-03T12:46:34Z I! k8sapiserver OnStartedLeading:
I1003 12:46:34.504480       1 event.go:285] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"amazon-cloudwatch", Name:"cwagent-clusterleader", UID:"63166128-5ea1-40c7-ae47-67ac27ed5830", APIVersion:"v1", ResourceVersion:"5330123", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' became leader
W1003 12:46:34.610597       1 manager.go:291] Could not configure a source for OOM detection, disabling OOM events: open /dev/kmsg: no such file or directory
2022-10-03T12:46:34Z I! [processors.ec2tagger] ec2tagger: Initial retrieval of tags succeded
2022-10-03T12:46:34Z I! [processors.ec2tagger] ec2tagger: EC2 tagger has started, finished initial retrieval of tags and Volumes
2022-10-03T12:46:34Z I! [processors.ec2tagger] ec2tagger: Initial retrieval of tags succeded
2022-10-03T12:46:34Z I! [processors.ec2tagger] ec2tagger: EC2 tagger has started, finished initial retrieval of tags and Volumes
2022-10-03T12:46:40Z I! [outputs.cloudwatchlogs] First time sending logs to /aws/containerinsights/ironman-2022/performance/ since startup so sequenceToken is nil, learned new token:(0xc000fbe130): The given sequenceToken is invalid. The next expected sequenceToken is: 49615097157874583574523473011765967269436169076596539730
2022-10-03T12:46:40Z W! [outputs.cloudwatchlogs] Retried 0 time, going to sleep 103.736985ms before retrying.
2022-10-03T12:47:34Z I! number of namespace to running pod num map[amazon-cloudwatch:8 default:1 demo:4 kube-system:10]


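The startup log shows which plugins the agent loaded:

  • inputs: cadvisor and k8sapiserver
  • processors: ec2tagger and k8sdecorator
  • outputs: cloudwatchlogs. The log stream name (not shown here) identifies the node on which this CloudWatch agent Pod runs.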
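
The plugin list can also be pulled out of the agent log mechanically. The following is a small sketch that scans for the "Loaded ..." lines seen in the log above; the regular expression is my own and assumes the line format stays as shown.

```python
import re

# Matches lines such as "... I! Loaded inputs: cadvisor k8sapiserver".
LOADED = re.compile(r"I! Loaded (inputs|aggregators|processors|outputs):\s*(.*)$")


def loaded_plugins(log_text: str) -> dict:
    """Map each plugin category to the list of plugin names the agent loaded."""
    plugins = {}
    for line in log_text.splitlines():
        match = LOADED.search(line)
        if match:
            plugins[match.group(1)] = match.group(2).split()
    return plugins


# Lines copied from the agent log above.
sample_log = """\
2022-10-03T12:46:34Z I! Loaded inputs: cadvisor k8sapiserver
2022-10-03T12:46:34Z I! Loaded aggregators:
2022-10-03T12:46:34Z I! Loaded processors: ec2tagger k8sdecorator
2022-10-03T12:46:34Z I! Loaded outputs: cloudwatchlogs
"""
print(loaded_plugins(sample_log))
```

In practice the input would be the output of kubectl logs for a cloudwatch-agent Pod.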

We can also confirm that log events are being written to a log stream under /aws/containerinsights/ironman-2022/performance/

Looking at the log events:

    "AutoScalingGroupName": "eks-ng1-public-ssh-62c19db5-f965-bdb7-373a-147e04d9f124",
    "CloudWatchMetrics": [
            "Metrics": [
                    "Unit": "Percent",
                    "Name": "pod_cpu_utilization"
                    "Unit": "Percent",
                    "Name": "pod_memory_utilization"
                    "Unit": "Bytes/Second",
                    "Name": "pod_network_rx_bytes"
                    "Unit": "Bytes/Second",
                    "Name": "pod_network_tx_bytes"
                    "Unit": "Percent",
                    "Name": "pod_memory_utilization_over_pod_limit"
            "Dimensions": [
            "Namespace": "ContainerInsights"
            "Metrics": [
                    "Unit": "Percent",
                    "Name": "pod_cpu_reserved_capacity"
                    "Unit": "Percent",
                    "Name": "pod_memory_reserved_capacity"
            "Dimensions": [
            "Namespace": "ContainerInsights"
            "Metrics": [
                    "Unit": "Count",
                    "Name": "pod_number_of_container_restarts"
            "Dimensions": [
            "Namespace": "ContainerInsights"
    "ClusterName": "ironman-2022",
    "InstanceId": "i-03b91195eefbde8e3",
    "InstanceType": "m5.large",
    "Namespace": "amazon-cloudwatch",
    "NodeName": "",
    "PodName": "fluentd-cloudwatch",
    "Sources": [
    "Timestamp": "1664806540622",
    "Type": "Pod",
    "Version": "0",
    "kubernetes": {
        "host": "",
        "labels": {
            "controller-revision-hash": "66c8bfcf8f",
            "k8s-app": "fluentd-cloudwatch",
            "pod-template-generation": "2"
        "namespace_name": "amazon-cloudwatch",
        "pod_id": "8e6ebccc-2136-4c79-80e4-1d023677a66a",
        "pod_name": "fluentd-cloudwatch-2tfpp",
        "pod_owners": [
                "owner_kind": "DaemonSet",
                "owner_name": "fluentd-cloudwatch"
    "pod_cpu_request": 100,
    "pod_cpu_reserved_capacity": 5,
    "pod_cpu_usage_system": 0.49339748943631484,
    "pod_cpu_usage_total": 2.067604974446889,
    "pod_cpu_usage_user": 1.6446582981210494,
    "pod_cpu_utilization": 0.10338024872234447,
    "pod_memory_cache": 0,
    "pod_memory_failcnt": 0,
    "pod_memory_hierarchical_pgfault": 0,
    "pod_memory_hierarchical_pgmajfault": 0,
    "pod_memory_limit": 419430400,
    "pod_memory_mapped_file": 0,
    "pod_memory_max_usage": 137388032,
    "pod_memory_pgfault": 0,
    "pod_memory_pgmajfault": 0,
    "pod_memory_request": 209715200,
    "pod_memory_reserved_capacity": 2.5818072348088124,
    "pod_memory_rss": 130351104,
    "pod_memory_swap": 0,
    "pod_memory_usage": 137129984,
    "pod_memory_utilization": 1.6016785781100062,
    "pod_memory_utilization_over_pod_limit": 31.0185546875,
    "pod_memory_working_set": 130101248,
    "pod_network_rx_bytes": 4.580653646182247,
    "pod_network_rx_dropped": 0,
    "pod_network_rx_errors": 0,
    "pod_network_rx_packets": 0.08268327881195393,
    "pod_network_total_bytes": 7.755691552561277,
    "pod_network_tx_bytes": 3.1750379063790306,
    "pod_network_tx_dropped": 0,
    "pod_network_tx_errors": 0,
    "pod_network_tx_packets": 0.06614662304956315,
    "pod_number_of_container_restarts": 0,
    "pod_number_of_containers": 1,
    "pod_number_of_running_containers": 1,
    "pod_status": "Running"

    "AutoScalingGroupName": "eks-ng1-public-ssh-62c19db5-f965-bdb7-373a-147e04d9f124",
    "CloudWatchMetrics": [
            "Metrics": [
                    "Unit": "Percent",
                    "Name": "node_filesystem_utilization"
            "Dimensions": [
            "Namespace": "ContainerInsights"
    "ClusterName": "ironman-2022",
    "EBSVolumeId": "aws://eu-west-1c/vol-0f2bbde78cd431365",
    "InstanceId": "i-05830139120bed5e0",
    "InstanceType": "m5.large",
    "NodeName": "",
    "Sources": [
    "Timestamp": "1664810189159",
    "Type": "NodeFS",
    "Version": "0",
    "device": "/dev/nvme0n1p1",
    "fstype": "vfs",
    "kubernetes": {
        "host": ""
    "node_filesystem_available": 82884431872,
    "node_filesystem_capacity": 85886742528,
    "node_filesystem_inodes": 41941952,
    "node_filesystem_inodes_free": 41858329,
    "node_filesystem_usage": 3002310656,
    "node_filesystem_utilization": 3.4956625057950177

Whether the event is of type Pod or NodeFS, it carries an additional CloudWatchMetrics array, which defines the namespace as ContainerInsights together with the corresponding Dimensions and Metrics.
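
The percentage fields in the two events are consistent with simple ratios of the raw fields. The sketch below recomputes three of them. The formulas are inferred from the numbers in the events rather than taken from the agent source, and the node CPU capacity of 2000 millicores is an assumption based on an m5.large having 2 vCPUs.

```python
import math


def percent(part: float, whole: float) -> float:
    """Express part as a percentage of whole."""
    return part / whole * 100


# Values copied from the Pod event above.
pod_memory_working_set = 130101248
pod_memory_limit = 419430400
pod_cpu_request = 100  # millicores

# Values copied from the NodeFS event above.
node_filesystem_usage = 3002310656
node_filesystem_capacity = 85886742528

# Assumption: m5.large exposes 2 vCPUs, i.e. 2000 millicores.
node_cpu_capacity = 2000

checks = {
    "pod_memory_utilization_over_pod_limit": (
        percent(pod_memory_working_set, pod_memory_limit), 31.0185546875),
    "node_filesystem_utilization": (
        percent(node_filesystem_usage, node_filesystem_capacity), 3.4956625057950177),
    "pod_cpu_reserved_capacity": (
        percent(pod_cpu_request, node_cpu_capacity), 5),
}
for name, (computed, reported) in checks.items():
    print(name, computed, math.isclose(computed, reported, rel_tol=1e-9))
```

Each computed value agrees with the figure reported in the event, which suggests the agent derives these percentages from the cadvisor readings and the Pod's requests and limits.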


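The metrics provided by Container Insights are all collected by the CloudWatch agent, which is deployed as a DaemonSet and gathers node and pod metrics through cadvisor. It then adds the CloudWatchMetrics structure to each log event so that CloudWatch Metrics can read the values.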
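
As a rough illustration of how the CloudWatchMetrics structure ties a log event to metrics, the sketch below lists only the fields an event declares under Metrics and looks up their values at the top level of the same event. The helper `declared_metrics` is hypothetical, and the sample event is a trimmed copy of the NodeFS event with the Dimensions entry left out because its content is not shown above.

```python
def declared_metrics(event: dict) -> list:
    """List the metrics an event declares in its CloudWatchMetrics array.

    Each declared metric name refers to a top-level field of the same event.
    """
    rows = []
    for directive in event.get("CloudWatchMetrics", []):
        for metric in directive.get("Metrics", []):
            name = metric["Name"]
            rows.append({
                "namespace": directive.get("Namespace"),
                "name": name,
                "unit": metric.get("Unit"),
                "value": event.get(name),
            })
    return rows


# Trimmed NodeFS event; values copied from the log event above.
sample_event = {
    "CloudWatchMetrics": [
        {
            "Metrics": [
                {"Unit": "Percent", "Name": "node_filesystem_utilization"}
            ],
            "Namespace": "ContainerInsights",
        }
    ],
    "ClusterName": "ironman-2022",
    "Type": "NodeFS",
    "node_filesystem_usage": 3002310656,
    "node_filesystem_utilization": 3.4956625057950177,
}

for row in declared_metrics(sample_event):
    print(row)
```

In this sample only node_filesystem_utilization is declared, so node_filesystem_usage stays a plain log field that is not listed as a metric.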

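The above uses the logs that EKS provides to verify how upstream Kubernetes works. If anything in this post is wrong, feel free to leave a comment or send me a message.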


