昨日 討論了以 FluentD 為例討論 log agent 部署至每個 node 上收集 /var/log/containers
目錄。接續討論 Container Insight 部署完之後,其這些 metrics 來源及收集方式。
$ curl https://raw.githubusercontent.com/aws-samples/amazon-cloudwatch-container-insights/latest/k8s-deployment-manifest-templates/deployment-mode/daemonset/container-insights-monitoring/quickstart/cwagent-fluentd-quickstart.yaml | sed "s/{{cluster_name}}/ironman-2022/;s/{{region_name}}/eu-west-1/" | kubectl apply -f -
其 source code template 也可於 GitHub - CloudWatch Agent for Container Insights Kubernetes Monitoring[2] 查看。
2. 由於 CloudWatch agent 及 Fluentd 皆需要具有 CloudWatch Logs 或 Metrics IAM 權限,因此以下範例透過關聯 CloudWatchAgentAdminPolicy
方式設定 IRSA [3]。
透過 eksctl
命令建立 IAM role 給與 ServiceAccount fluentd
及 cloudwatch-agent
使用。
$ eksctl create iamserviceaccount \
--cluster=ironman-2022 \
--namespace=amazon-cloudwatch \
--name=cloudwatch-agent \
--attach-policy-arn=arn:aws:iam::aws:policy/CloudWatchAgentAdminPolicy \
--override-existing-serviceaccounts \
--region eu-west-1 \
--approve
$ eksctl create iamserviceaccount \
--cluster=ironman-2022 \
--namespace=amazon-cloudwatch \
--name=fluentd \
--attach-policy-arn=arn:aws:iam::aws:policy/CloudWatchAgentAdminPolicy \
--override-existing-serviceaccounts \
--region eu-west-1 \
--approve
根據 CloudWatch Insight 需求[4],CloudWatch Agent 需要額外 EC2 DescribeVolumes
權限,若關聯的 Policy 沒有此權限則需要額外設定。
{
"Version": "2012-10-17",
"Statement": [
{
"Action": [
"ec2:DescribeVolumes"
],
"Resource": "*",
"Effect": "Allow"
}
]
}
kubectl rollout
重啟 DaemonSet cloudwatch-agent
及 fluentd-cloudwatch
。$ kubectl -n amazon-cloudwatch rollout restart ds cloudwatch-agent
daemonset.apps/cloudwatch-agent restarted
$ kubectl -n amazon-cloudwatch rollout restart ds fluentd-cloudwatch
daemonset.apps/fluentd-cloudwatch restarted
為避免混淆 FluentD 及 CloudWatch Agent 所負責事項,以下仍會依照 FluentD 及 CloudWatch Agent 設定檔進行解釋。
預設安裝好上述 QuickStart template ,則可以在 CloudWatch Log groups 查看到以下 Log group:
/aws/containerinsights/ironman-2022/application
/aws/containerinsights/ironman-2022/dataplane
/aws/containerinsights/ironman-2022/host
/aws/containerinsights/ironman-2022/performance
<match **>
@type cloudwatch_logs
@id out_cloudwatch_logs_systemd
region "#{ENV.fetch('AWS_REGION')}"
log_group_name "/aws/containerinsights/#{ENV.fetch('CLUSTER_NAME')}/dataplane"
log_stream_name_key stream_name
auto_create_stream true
remove_log_stream_name_key true
<buffer>
flush_interval 5
chunk_limit_size 2m
queued_chunks_limit_size 32
retry_forever true
</buffer>
</match>
<match **>
@type cloudwatch_logs
@id out_cloudwatch_logs_containers
region "#{ENV.fetch('AWS_REGION')}"
log_group_name "/aws/containerinsights/#{ENV.fetch('CLUSTER_NAME')}/application"
log_stream_name_key stream_name
remove_log_stream_name_key true
auto_create_stream true
<buffer>
flush_interval 5
chunk_limit_size 2m
queued_chunks_limit_size 32
retry_forever true
</buffer>
</match>
<match host.**>
@type cloudwatch_logs
@id out_cloudwatch_logs_host_logs
region "#{ENV.fetch('AWS_REGION')}"
log_group_name "/aws/containerinsights/#{ENV.fetch('CLUSTER_NAME')}/host"
log_stream_name_key stream_name
remove_log_stream_name_key true
auto_create_stream true
<buffer>
flush_interval 5
chunk_limit_size 2m
queued_chunks_limit_size 32
retry_forever true
</buffer>
</match>
其中 application 部分則是 Day 17 所提及的 Pod Container log 。另外,在此 QuickStart template 中所使用的 fluentd 版本已經是較舊版本: fluent/fluentd-kubernetes-daemonset:v1.7.3-debian-cloudwatch-1.0
。我們也可以透過查看 fluentd
container 內的 Gemfile
查看 plugin 版本:
若有需要使用較新 fluentd
版本 ,則可以參考 Fluentd Daemonset for Kubernetes[6]。
$ kubectl -n amazon-cloudwatch exec -it fluentd-cloudwatch-qkf9m -- cat /fluentd/Gemfile
Defaulted container "fluentd-cloudwatch" out of: fluentd-cloudwatch, copy-fluentd-config (init), update-log-driver (init)
# AUTOMATICALLY GENERATED
# DO NOT EDIT THIS FILE DIRECTLY, USE /templates/Gemfile.erb
source "https://rubygems.org"
gem "fluentd", "1.7.3"
gem "oj", "3.8.1"
gem "fluent-plugin-multi-format-parser", "~> 1.0.0"
gem "fluent-plugin-concat", "~> 2.3.0"
gem "fluent-plugin-grok-parser", "~> 2.5.0"
gem "fluent-plugin-prometheus", "~> 1.5.0"
gem 'fluent-plugin-json-in-json-2', ">= 1.0.2"
gem "fluent-plugin-record-modifier", "~> 2.0.0"
gem "fluent-plugin-rewrite-tag-filter", "~> 2.2.0"
gem "aws-sdk-cloudwatchlogs", "~> 1.0"
gem "fluent-plugin-cloudwatch-logs", "~> 0.7.4"
gem "fluent-plugin-kubernetes_metadata_filter", "~> 2.3.0"
gem "ffi"
gem "fluent-plugin-systemd", "~> 1.0.1"
根據 Fluentd 設定檔,可以確認以下目錄:
application
:所有於 /var/log/containers
目錄下的 application log。host
:Node 上的 /var/log/dmesg
、/var/log/secure
及 /var/log/messages
目錄。dataplane
:Node 上的 /var/log/journal
目錄,主要收集 kubelet.service
、kubeproxy.service
及 docker.service
。根據 ConfigMap cwagentconfig
,僅設定了 metrics_collected
屬於 kubernetes 類型,及定義 cluster_name
。
$ kubectl -n amazon-cloudwatch get cm cwagentconfig -o yaml
apiVersion: v1
data:
cwagentconfig.json: |
{
"agent": {
"region": "eu-west-1"
},
"logs": {
"metrics_collected": {
"kubernetes": {
"cluster_name": "ironman-2022",
"metrics_collection_interval": 60
}
},
"force_flush_interval": 5
}
}
kind: ConfigMap
...
...
查看 CloudWatch Agent logs,可以查看到預設 Config 會被轉譯成 TOML 文件格式:
$ kubectl -n amazon-cloudwatch logs cloudwatch-agent-fdhmb
...
...
...
Configuration validation first phase succeeded
2022/10/03 12:46:34 I! Config has been translated into TOML /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.toml
2022/10/03 12:46:34 D! toml config [agent]
collection_jitter = "0s"
debug = false
flush_interval = "1s"
flush_jitter = "0s"
hostname = "ip-192-168-42-19.eu-west-1.compute.internal"
interval = "60s"
logfile = ""
logtarget = "lumberjack"
metric_batch_size = 1000
metric_buffer_limit = 10000
omit_hostname = true
precision = ""
quiet = false
round_interval = false
[inputs]
[[inputs.cadvisor]]
container_orchestrator = "eks"
interval = "60s"
mode = "detail"
[inputs.cadvisor.tags]
metricPath = "logs"
[[inputs.k8sapiserver]]
interval = "60s"
node_name = "ip-192-168-42-19.eu-west-1.compute.internal"
[inputs.k8sapiserver.tags]
metricPath = "logs_k8sapiserver"
[outputs]
[[outputs.cloudwatchlogs]]
force_flush_interval = "5s"
log_stream_name = "ip-192-168-42-19.eu-west-1.compute.internal"
region = "eu-west-1"
tagexclude = ["metricPath"]
[outputs.cloudwatchlogs.tagpass]
metricPath = ["logs", "logs_k8sapiserver"]
[processors]
[[processors.ec2tagger]]
disk_device_tag_key = "device"
ebs_device_keys = ["*"]
ec2_instance_tag_keys = ["aws:autoscaling:groupName"]
ec2_metadata_tags = ["InstanceId", "InstanceType"]
[processors.ec2tagger.tagpass]
metricPath = ["logs"]
[[processors.k8sdecorator]]
cluster_name = "ironman-2022"
host_ip = "192.168.42.19"
node_name = "ip-192-168-42-19.eu-west-1.compute.internal"
order = 1
prefer_full_pod_name = false
tag_service = true
[processors.k8sdecorator.tagpass]
metricPath = ["logs", "logs_k8sapiserver"]
2022-10-03T12:46:34Z I! Starting AmazonCloudWatchAgent 1.247354.0
2022-10-03T12:46:34Z I! AWS SDK log level not set
2022-10-03T12:46:34Z I! Loaded inputs: cadvisor k8sapiserver
2022-10-03T12:46:34Z I! Loaded aggregators:
2022-10-03T12:46:34Z I! Loaded processors: ec2tagger k8sdecorator
2022-10-03T12:46:34Z I! Loaded outputs: cloudwatchlogs
2022-10-03T12:46:34Z I! Tags enabled:
2022-10-03T12:46:34Z I! [agent] Config: Interval:1m0s, Quiet:false, Hostname:"ip-192-168-42-19.eu-west-1.compute.internal", Flush Interval:1s
2022-10-03T12:46:34Z I! [processors.ec2tagger] ec2tagger: Check ec2 metadata
2022-10-03T12:46:34Z I! [logagent] starting
2022-10-03T12:46:34Z I! [logagent] found plugin cloudwatchlogs is a log backend
2022-10-03T12:46:34Z I! [processors.ec2tagger] ec2tagger: EC2 tagger has started initialization.
2022-10-03T12:46:34Z I! [processors.ec2tagger] ec2tagger: Check ec2 metadata
2022-10-03T12:46:34Z I! [processors.ec2tagger] ec2tagger: EC2 tagger has started initialization.
I1003 12:46:34.492882 1 leaderelection.go:248] attempting to acquire leader lease amazon-cloudwatch/cwagent-clusterleader...
I1003 12:46:34.504069 1 leaderelection.go:258] successfully acquired lease amazon-cloudwatch/cwagent-clusterleader
2022-10-03T12:46:34Z I! k8sapiserver Switch New Leader: ip-192-168-42-19.eu-west-1.compute.internal
2022-10-03T12:46:34Z I! k8sapiserver OnStartedLeading: ip-192-168-42-19.eu-west-1.compute.internal
I1003 12:46:34.504480 1 event.go:285] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"amazon-cloudwatch", Name:"cwagent-clusterleader", UID:"63166128-5ea1-40c7-ae47-67ac27ed5830", APIVersion:"v1", ResourceVersion:"5330123", Fi
eldPath:""}): type: 'Normal' reason: 'LeaderElection' ip-192-168-42-19.eu-west-1.compute.internal became leader
W1003 12:46:34.610597 1 manager.go:291] Could not configure a source for OOM detection, disabling OOM events: open /dev/kmsg: no such file or directory
2022-10-03T12:46:34Z I! [processors.ec2tagger] ec2tagger: Initial retrieval of tags succeded
2022-10-03T12:46:34Z I! [processors.ec2tagger] ec2tagger: EC2 tagger has started, finished initial retrieval of tags and Volumes
2022-10-03T12:46:34Z I! [processors.ec2tagger] ec2tagger: Initial retrieval of tags succeded
2022-10-03T12:46:34Z I! [processors.ec2tagger] ec2tagger: EC2 tagger has started, finished initial retrieval of tags and Volumes
2022-10-03T12:46:40Z I! [outputs.cloudwatchlogs] First time sending logs to /aws/containerinsights/ironman-2022/performance/ip-192-168-42-19.eu-west-1.compute.internal since startup so sequenceToken is nil, learned new token:(0xc000fbe130): Th
e given sequenceToken is invalid. The next expected sequenceToken is: 49615097157874583574523473011765967269436169076596539730
2022-10-03T12:46:40Z W! [outputs.cloudwatchlogs] Retried 0 time, going to sleep 103.736985ms before retrying.
2022-10-03T12:47:34Z I! number of namespace to running pod num map[amazon-cloudwatch:8 default:1 demo:4 kube-system:10]
格式上大致可以分為:
cadvisor
及 k8sapiserver
。ec2tagger
及 k8sdecorator
cloudwatchlogs
,log stream 為 ip-192-168-42-19.eu-west-1.compute.internal
。此為該 CloudWatch Agent Pod 所在 node。同時,我們也能確認 Logs 被更新 log stream /aws/containerinsights/ironman-2022/performance/ip-192-168-42-19.eu-west-1.compute.internal
。
查看 Log event:
{
"AutoScalingGroupName": "eks-ng1-public-ssh-62c19db5-f965-bdb7-373a-147e04d9f124",
"CloudWatchMetrics": [
{
"Metrics": [
{
"Unit": "Percent",
"Name": "pod_cpu_utilization"
},
{
"Unit": "Percent",
"Name": "pod_memory_utilization"
},
{
"Unit": "Bytes/Second",
"Name": "pod_network_rx_bytes"
},
{
"Unit": "Bytes/Second",
"Name": "pod_network_tx_bytes"
},
{
"Unit": "Percent",
"Name": "pod_memory_utilization_over_pod_limit"
}
],
"Dimensions": [
[
"PodName",
"Namespace",
"ClusterName"
],
[
"Namespace",
"ClusterName"
],
[
"ClusterName"
]
],
"Namespace": "ContainerInsights"
},
{
"Metrics": [
{
"Unit": "Percent",
"Name": "pod_cpu_reserved_capacity"
},
{
"Unit": "Percent",
"Name": "pod_memory_reserved_capacity"
}
],
"Dimensions": [
[
"PodName",
"Namespace",
"ClusterName"
],
[
"ClusterName"
]
],
"Namespace": "ContainerInsights"
},
{
"Metrics": [
{
"Unit": "Count",
"Name": "pod_number_of_container_restarts"
}
],
"Dimensions": [
[
"PodName",
"Namespace",
"ClusterName"
]
],
"Namespace": "ContainerInsights"
}
],
"ClusterName": "ironman-2022",
"InstanceId": "i-03b91195eefbde8e3",
"InstanceType": "m5.large",
"Namespace": "amazon-cloudwatch",
"NodeName": "ip-192-168-71-180.eu-west-1.compute.internal",
"PodName": "fluentd-cloudwatch",
"Sources": [
"cadvisor",
"pod",
"calculated"
],
"Timestamp": "1664806540622",
"Type": "Pod",
"Version": "0",
"kubernetes": {
"host": "ip-192-168-71-180.eu-west-1.compute.internal",
"labels": {
"controller-revision-hash": "66c8bfcf8f",
"k8s-app": "fluentd-cloudwatch",
"pod-template-generation": "2"
},
"namespace_name": "amazon-cloudwatch",
"pod_id": "8e6ebccc-2136-4c79-80e4-1d023677a66a",
"pod_name": "fluentd-cloudwatch-2tfpp",
"pod_owners": [
{
"owner_kind": "DaemonSet",
"owner_name": "fluentd-cloudwatch"
}
]
},
"pod_cpu_request": 100,
"pod_cpu_reserved_capacity": 5,
"pod_cpu_usage_system": 0.49339748943631484,
"pod_cpu_usage_total": 2.067604974446889,
"pod_cpu_usage_user": 1.6446582981210494,
"pod_cpu_utilization": 0.10338024872234447,
"pod_memory_cache": 0,
"pod_memory_failcnt": 0,
"pod_memory_hierarchical_pgfault": 0,
"pod_memory_hierarchical_pgmajfault": 0,
"pod_memory_limit": 419430400,
"pod_memory_mapped_file": 0,
"pod_memory_max_usage": 137388032,
"pod_memory_pgfault": 0,
"pod_memory_pgmajfault": 0,
"pod_memory_request": 209715200,
"pod_memory_reserved_capacity": 2.5818072348088124,
"pod_memory_rss": 130351104,
"pod_memory_swap": 0,
"pod_memory_usage": 137129984,
"pod_memory_utilization": 1.6016785781100062,
"pod_memory_utilization_over_pod_limit": 31.0185546875,
"pod_memory_working_set": 130101248,
"pod_network_rx_bytes": 4.580653646182247,
"pod_network_rx_dropped": 0,
"pod_network_rx_errors": 0,
"pod_network_rx_packets": 0.08268327881195393,
"pod_network_total_bytes": 7.755691552561277,
"pod_network_tx_bytes": 3.1750379063790306,
"pod_network_tx_dropped": 0,
"pod_network_tx_errors": 0,
"pod_network_tx_packets": 0.06614662304956315,
"pod_number_of_container_restarts": 0,
"pod_number_of_containers": 1,
"pod_number_of_running_containers": 1,
"pod_status": "Running"
}
{
"AutoScalingGroupName": "eks-ng1-public-ssh-62c19db5-f965-bdb7-373a-147e04d9f124",
"CloudWatchMetrics": [
{
"Metrics": [
{
"Unit": "Percent",
"Name": "node_filesystem_utilization"
}
],
"Dimensions": [
[
"NodeName",
"InstanceId",
"ClusterName"
],
[
"ClusterName"
]
],
"Namespace": "ContainerInsights"
}
],
"ClusterName": "ironman-2022",
"EBSVolumeId": "aws://eu-west-1c/vol-0f2bbde78cd431365",
"InstanceId": "i-05830139120bed5e0",
"InstanceType": "m5.large",
"NodeName": "ip-192-168-42-19.eu-west-1.compute.internal",
"Sources": [
"cadvisor",
"calculated"
],
"Timestamp": "1664810189159",
"Type": "NodeFS",
"Version": "0",
"device": "/dev/nvme0n1p1",
"fstype": "vfs",
"kubernetes": {
"host": "ip-192-168-42-19.eu-west-1.compute.internal"
},
"node_filesystem_available": 82884431872,
"node_filesystem_capacity": 85886742528,
"node_filesystem_inodes": 41941952,
"node_filesystem_inodes_free": 41858329,
"node_filesystem_usage": 3002310656,
"node_filesystem_utilization": 3.4956625057950177
}
不論是 Pod 或 NodeFS 資訊,皆被被多更新了 CloudWatchMetrics
Array 格式,其定義 namespace 為 ContainerInsights
、對應 Dimensions
及 Metrics
。
Container Insight 所提供的 metrics,皆由 Container Agent 以 DaemonSet 方式部署透過 cadvisor 收集 node 及 pod metrics,並於 log 增加 CloudWatchMetrics
格式使 CloudWatch Metrics 讀取。
以上資訊透過 EKS 所提供 Logs 來驗證上游 Kubernetes 運作原理,倘若上述內文有所錯誤,隨時可以留言或是私訊我。