Troubleshooting Container Insights on Amazon EKS - Amazon CloudWatch
Troubleshooting Container Insights on Amazon EKS

This section covers common issues that you might encounter when you set up or operate Container Insights on Amazon EKS. Use the following tables and diagnostic commands to identify and resolve problems regardless of whether you use the OTel or Classic approach.

For approach-specific setup guidance, see Quick start: OTel Container Insights on Amazon EKS or Setup guide (AWS CLI). To compare approaches, see Compare Container Insights approaches.

Metrics not appearing in CloudWatch

If you don't see metrics in the ContainerInsights namespace, use the following table to identify the cause.

Symptom | Cause | Resolution
No metrics in the ContainerInsights namespace | IAM role lacks cloudwatch:PutMetricData permission | Attach the CloudWatchAgentServerPolicy managed policy to the agent IAM role.
Metrics appear for some nodes but not others | Agent DaemonSet not scheduled on all nodes because of taints | Add tolerations to the agent DaemonSet to allow scheduling on tainted nodes.
Metrics stop appearing | Agent pod is OOMKilled or restarting | Increase the memory limits in the agent pod resource specification.
Metrics are stale or zero | Network connectivity is blocked | Check VPC security groups and verify that a CloudWatch VPC endpoint exists.
Enhanced metrics are missing | Agent not configured for Enhanced Observability | Set enhancedObservability: true in the agent configuration.
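
The following sketch shows one way to narrow down which row of the preceding table applies. It assumes that the agent runs in the amazon-cloudwatch namespace with the app.kubernetes.io/name=cloudwatch-agent label (both taken from the diagnostic commands in this topic) and that metrics carry a ClusterName dimension. The function names are hypothetical helpers, not part of any AWS tooling.

```shell
# Hypothetical helper functions; the names are illustrative only.

# List Container Insights metrics published for one cluster.
# An empty result points toward a permissions or connectivity problem.
ci_list_metrics() {
  aws cloudwatch list-metrics \
    --namespace ContainerInsights \
    --dimensions "Name=ClusterName,Value=$1"
}

# Show which node each agent pod runs on, to spot nodes without an agent.
ci_agent_nodes() {
  kubectl get pods -n amazon-cloudwatch -o wide
}

# Print each agent pod with the reason its container last terminated
# (for example, OOMKilled). The reason is empty if the container never terminated.
ci_last_termination() {
  kubectl get pods -n amazon-cloudwatch \
    -l app.kubernetes.io/name=cloudwatch-agent \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}'
}
```

For example, run ci_list_metrics my-cluster first. If it returns metrics for only some nodes, compare the output of ci_agent_nodes with kubectl get nodes to find nodes that have no agent pod.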

Agent pods not starting

If agent pods fail to start or remain in a non-running state, use the following table to diagnose the issue.

Symptom | Cause | Resolution
ImagePullBackOff | Amazon ECR is unreachable or the image tag is incorrect | Verify the image URI and confirm that your nodes can access Amazon ECR.
Pending | Insufficient CPU or memory on the node | Scale the node group or reduce resource requests in the agent pod specification.
CrashLoopBackOff | Invalid configuration or missing volume mount | Check pod logs for configuration errors by running kubectl logs on the affected pod.
FailedScheduling | Node affinity or taints prevent scheduling | Review the nodeSelector and tolerations in the DaemonSet spec.
Exit code 1 | Service account lacks IRSA annotation | Verify that the service account has the eks.amazonaws.com/role-arn annotation.
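
The checks in the preceding table can be sketched as the following commands. This is a minimal example that assumes the agent namespace is amazon-cloudwatch, as in the diagnostic commands in this topic. The function names are hypothetical, and the pod name is a placeholder that you supply.

```shell
# Hypothetical helper functions; the names are illustrative only.

# Show events in the agent namespace, oldest first. Scheduling failures
# and image pull errors are reported here.
ci_events() {
  kubectl get events -n amazon-cloudwatch --sort-by=.lastTimestamp
}

# Print the last 50 log lines from the previous (crashed) container
# instance of the given pod. Useful for CrashLoopBackOff.
ci_previous_logs() {
  kubectl logs -n amazon-cloudwatch "$1" --previous --tail=50
}

# Print each service account with its IRSA role annotation.
# The second column is empty if the annotation is missing.
ci_irsa_annotations() {
  kubectl get serviceaccount -n amazon-cloudwatch \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.annotations.eks\.amazonaws\.com/role-arn}{"\n"}{end}'
}
```

The --previous flag matters for crash loops because the current container instance might not have written any logs yet.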

Add-on installation failures

If the amazon-cloudwatch-observability add-on fails to install or reports an unhealthy status, use the following table to troubleshoot.

Symptom | Cause | Resolution
CREATE_FAILED | Conflicting resources from a previous installation | Delete conflicting resources and use --resolve-conflicts OVERWRITE when you create the add-on.
OIDC provider not found | No IAM OIDC identity provider exists for the cluster | Create the provider by running eksctl utils associate-iam-oidc-provider.
Version conflict | Add-on version is incompatible with the Kubernetes version | List compatible versions by running aws eks describe-addon-versions.
DEGRADED status | Health checks are failing because of missing permissions | Check pod logs and verify that the IRSA role has the required policies attached.
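
The resolutions in the preceding table map to the following commands. This is a sketch rather than a complete installation procedure. The function names are hypothetical, and the cluster name and Kubernetes version are values that you supply.

```shell
# Hypothetical helper functions; the names are illustrative only.

# List add-on versions that are compatible with a Kubernetes version.
# $1 = Kubernetes version of your cluster
ci_addon_versions() {
  aws eks describe-addon-versions \
    --addon-name amazon-cloudwatch-observability \
    --kubernetes-version "$1" \
    --query 'addons[].addonVersions[].addonVersion' \
    --output text
}

# Associate an IAM OIDC identity provider with the cluster.
# $1 = cluster name
ci_associate_oidc() {
  eksctl utils associate-iam-oidc-provider --cluster "$1" --approve
}

# Create the add-on and overwrite conflicting resources.
# $1 = cluster name
ci_create_addon() {
  aws eks create-addon \
    --cluster-name "$1" \
    --addon-name amazon-cloudwatch-observability \
    --resolve-conflicts OVERWRITE
}

# Show the health issues that Amazon EKS reports for the add-on.
# $1 = cluster name
ci_addon_issues() {
  aws eks describe-addon \
    --cluster-name "$1" \
    --addon-name amazon-cloudwatch-observability \
    --query 'addon.health.issues'
}
```

Because OVERWRITE replaces conflicting settings with the add-on defaults, review any customizations from the previous installation before you run ci_create_addon.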

Log delivery issues

If container logs don't appear in Amazon CloudWatch Logs, use the following table to identify the cause.

Symptom | Cause | Resolution
Log group doesn't exist | Missing logs:CreateLogGroup permission | Add Amazon CloudWatch Logs permissions to the agent IAM role.
Log group exists but is empty | Agent not configured for logs, or Region mismatch | Verify that the agent configuration includes log collection and that the Region matches your cluster Region.
Logs are delayed more than 5 minutes | Flush interval is too high or the node is under heavy load | Reduce the force_flush_interval value in the agent configuration.
Performance logs are missing | Agent is configured for application logs only | Verify that the Container Insights performance log section is present in the agent configuration.
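
To tell a missing log group apart from an empty one, you can query CloudWatch Logs in the Region of your cluster, as in the following sketch. It assumes the log group prefix /aws/containerinsights/cluster-name used elsewhere in this topic, and it assumes that log groups under that prefix end in a suffix such as performance or application. The function names are hypothetical.

```shell
# Hypothetical helper functions; the names are illustrative only.

# List Container Insights log groups for a cluster in a specific Region.
# $1 = cluster name, $2 = Region of the cluster
ci_log_groups() {
  aws logs describe-log-groups \
    --region "$2" \
    --log-group-name-prefix "/aws/containerinsights/$1"
}

# Show the five most recently written log streams in one log group.
# $1 = cluster name, $2 = log group suffix (assumed: performance or application),
# $3 = Region of the cluster
ci_latest_streams() {
  aws logs describe-log-streams \
    --region "$3" \
    --log-group-name "/aws/containerinsights/$1/$2" \
    --order-by LastEventTime \
    --descending \
    --max-items 5
}
```

If ci_log_groups returns nothing in the cluster Region but returns log groups in another Region, the agent is likely configured with the wrong Region.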

Migration-specific issues

If you experience issues while migrating between Container Insights approaches, use the following table. For the full migration workflow, see Migration guides.

Symptom | Cause | Resolution
Duplicate metrics during parallel run | Both approaches are publishing metrics simultaneously | This behavior is expected during a parallel run. Disable the legacy approach after you validate the new approach.
Different metric values between approaches | Different calculation methods | Small differences (less than 5%) are expected. Large differences indicate a configuration mismatch between approaches.
Rollback fails | Custom configuration was not reapplied | Re-apply your complete configuration values when you roll back.
Alarms fire during migration | Metric gaps during the switchover period | Temporarily set the missing data treatment to notBreaching on affected alarms.
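
The following sketch illustrates two of the preceding checks. The first function computes the relative difference between two metric values so that you can compare it against the 5% guideline. This topic doesn't state how that percentage is measured, so measuring it relative to the first value is an assumption. The second function reads the current missing data treatment of an alarm. Both function names are hypothetical.

```shell
# Hypothetical helper functions; the names are illustrative only.

# Print the absolute difference between two values as a percentage of
# the first value, rounded to one decimal place.
# $1 = value from one approach, $2 = value from the other approach
pct_diff() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    if (a == 0) { print "undefined"; exit }
    d = (b - a) / a * 100
    if (d < 0) d = -d
    printf "%.1f\n", d
  }'
}

# Print the current missing data treatment of an alarm.
# $1 = alarm name
ci_alarm_missing_data() {
  aws cloudwatch describe-alarms \
    --alarm-names "$1" \
    --query 'MetricAlarms[].TreatMissingData' \
    --output text
}
```

For example, pct_diff 200 207 prints 3.5, which is within the expected range, and pct_diff 100 120 prints 20.0, which suggests a configuration mismatch. Record the value that ci_alarm_missing_data returns before you change an alarm so that you can restore it after the migration.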

OTel Container Insights issues

The following issues are specific to the OTel Container Insights approach. For general setup guidance, see Quick start: OTel Container Insights on Amazon EKS.

Symptom | Cause | Resolution
403 Forbidden exporter error | IAM role is missing CloudWatch permissions | Verify that the CloudWatchAgentServerPolicy is attached to the agent role.
Connection refused on metrics endpoint | Collector cannot reach the kubelet | Verify that hostNetwork: true is set in the pod spec, or confirm that the service account has the required permissions.
High memory usage | Batch processor queue is too large | Reduce the batch/timeout and batch/send_batch_size values in the collector configuration.
Custom metrics not appearing | Receiver not configured for the application endpoint | Add a Prometheus receiver that targets your application metrics port in the collector configuration.
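
The first two rows of the preceding table can be checked from the command line, as in the following sketch. It assumes that the collector runs as a DaemonSet in the amazon-cloudwatch namespace, which might not match your deployment. The function names are hypothetical, and the role name is the IAM role that your agent service account assumes.

```shell
# Hypothetical helper functions; the names are illustrative only.

# Print the names of the managed policies attached to an IAM role.
# Look for CloudWatchAgentServerPolicy in the output.
# $1 = IAM role name (not the ARN)
ci_role_policies() {
  aws iam list-attached-role-policies \
    --role-name "$1" \
    --query 'AttachedPolicies[].PolicyName' \
    --output text
}

# Print each DaemonSet with its hostNetwork setting.
# The second column is empty if hostNetwork is not set.
ci_host_network() {
  kubectl get daemonset -n amazon-cloudwatch \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.hostNetwork}{"\n"}{end}'
}
```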

General diagnostic commands

Use the following commands to gather information about your Container Insights deployment.

To check agent pod status, run the following command.

kubectl get pods -n amazon-cloudwatch

To view agent pod logs, run the following command.

kubectl logs -n amazon-cloudwatch -l app.kubernetes.io/name=cloudwatch-agent --tail=50

To check the agent DaemonSet status, run the following command.

kubectl get daemonset -n amazon-cloudwatch

To verify the IAM role on a service account, run the following command. In the output, look for the eks.amazonaws.com/role-arn annotation on the agent service account.

kubectl get serviceaccount -n amazon-cloudwatch -o yaml

To check the cluster add-on status, run the following command. Replace cluster-name with the name of your Amazon EKS cluster.

aws eks describe-addon --cluster-name cluster-name --addon-name amazon-cloudwatch-observability

To list Container Insights log groups, run the following command. Replace cluster-name with the name of your Amazon EKS cluster.

aws logs describe-log-groups --log-group-name-prefix "/aws/containerinsights/cluster-name"
