AWSSupport-TroubleshootEKSDNSFailure
Description
The AWSSupport-TroubleshootEKSDNSFailure runbook helps troubleshoot
issues with CoreDNS pods and configuration in Amazon Elastic Kubernetes Service (Amazon EKS) when applications or
pods encounter DNS resolution failures. The runbook validates VPC DNS settings, inspects
the CoreDNS deployment and ConfigMap, checks Horizontal Pod Autoscaler (HPA)
configuration, collects CoreDNS logs, and performs DNS resolution checks on worker nodes.
Optionally, a probing Amazon Elastic Compute Cloud instance can be created in the same subnet as the
problematic worker node to perform DNS resolution checks without requiring direct access
to the node.
Important
The Amazon EKS cluster's authentication mode must be set to API or
API_AND_CONFIG_MAP. This runbook deploys a AWS Lambda (Lambda)
function as a proxy to make authenticated Kubernetes API calls and cleans up all
created resources at the end of execution.
Document type
Automation
Owner
Amazon
Platforms
Linux, macOS, Windows
Parameters
-
AutomationAssumeRole
Type: AWS::IAM::Role::Arn
Description: (Optional) The Amazon Resource Name (ARN) of the AWS Identity and Access Management (IAM) role that allows Systems Manager Automation to perform the actions on your behalf. If no role is specified, Systems Manager Automation uses the permissions of the user that starts this runbook.
-
EksClusterName
Type: String
Description: (Required) The name of the target Amazon EKS cluster.
-
DnsName
Type: String
Default: amazon.com
Description: (Optional) The stub domain suffix for the domain name that the application or pod is failing to resolve.
-
ProblematicNodeInstanceId
Type: AWS::EC2::Instance::Id
Description: (Optional) The instance ID of the worker node where the application experiencing DNS resolution errors is running. When provided, the runbook creates a probing Amazon EC2 instance in the same subnet to perform DNS resolution checks. Use this parameter if the worker node is not in a public subnet or if installing
bind-utilson the worker node is not permitted. -
CoreDnsNamespace
Type: String
Default: kube-system
Description: (Optional) The Kubernetes namespace for CoreDNS pods.
-
S3BucketName
Type: AWS::S3::Bucket::Name
Description: (Optional) The name of the Amazon Simple Storage Service bucket where CoreDNS troubleshooting logs are uploaded.
-
LambdaRoleArn
Type: AWS::IAM::Role::Arn
Description: (Optional) The ARN of the IAM role for the Lambda proxy function. If not provided, the runbook creates a role named
Automation-K8sProxy-Role-<ExecutionId>with theAWSLambdaBasicExecutionRoleandAWSLambdaVPCAccessExecutionRolemanaged policies. It is recommended to provide your own role.
Required IAM permissions
The AutomationAssumeRole parameter requires the following actions to
use the runbook successfully.
-
eks:DescribeCluster -
ec2:DescribeVpcs -
ec2:DescribeInstances -
ec2:DescribeSubnets -
ec2:DescribeSecurityGroups -
cloudformation:CreateStack -
cloudformation:DescribeStacks -
cloudformation:DeleteStack -
lambda:InvokeFunction -
s3:GetBucketPublicAccessBlock -
s3:GetBucketAcl -
s3:PutObject -
ssm:DescribeInstanceInformation -
ssm:SendCommand -
ssm:GetCommandInvocation -
ssm:StartAutomationExecution -
ssm:GetAutomationExecution
Document Steps
-
AssertIfTargetClusterExists- Verifies that the Amazon EKS cluster specified inEksClusterNameexists and is in theACTIVEstate. If the cluster is not found or not active, the runbook skips toGenerateReport. -
UpdateEksClusterExists- Sets the internaleksClusterExistsvariable totruefor use in report generation. -
GetVpcDnsSettings- Retrieves theenableDnsSupportandenableDnsHostnamessettings for the Amazon Virtual Private Cloud associated with the Amazon EKS cluster. -
BranchOnVpcDnsSettings- Checks whether both VPC DNS settings are enabled. If either is disabled, the runbook skips toGenerateReport. Otherwise, proceeds toDeployK8sAuthApisResources. -
DeployK8sAuthApisResources- ExecutesAWSSupport-SetupK8sApiProxyForEKSto deploy a Lambda function as a proxy for making authenticated Kubernetes API calls to the Amazon EKS cluster. -
RetrieveCoreDNSDeployment- Retrieves information about the CoreDNS deployment, including pod readiness, container status, and the readiness of nodes hosting CoreDNS pods. Also retrieves the CoreDNS cluster IP. -
RetrieveAndInspectCoreDNSConfigMap- Retrieves the CoreDNS ConfigMap from the Amazon EKS cluster and checks for configuration issues, including stub domain settings for the domain specified inDnsName. -
ValidateHpaConfiguration- Checks whether a Horizontal Pod Autoscaler (HPA) is configured for the CoreDNS deployment in the specified namespace. -
CheckS3BucketPublicStatus- Validates that the Amazon S3 bucket specified inS3BucketNamedoes not allow public or anonymous read or write access. -
CollectLogToS3- Collects CoreDNS pod logs and uploads them to the specified Amazon S3 bucket. -
BranchOnProblematicNodeInstanceId- Checks whetherProblematicNodeInstanceIdis provided and CoreDNS host nodes exist. If both conditions are met, proceeds toVerifyThatProblematicNodeInstanceBelongsToCluster. Otherwise, branches toBranchOnCoreDnsDeployment. -
VerifyThatProblematicNodeInstanceBelongsToCluster- Confirms that the instance specified inProblematicNodeInstanceIdis a worker node in the Amazon EKS cluster. -
UpdateProblematicNodeInstanceStanding- Sets the internalproblematicNodeInstanceStandingvariable totrue. -
GetProblematicInstanceDetails- Retrieves the AMI ID, instance type, subnet ID, and security group IDs of the problematic worker node for use when creating the probing Amazon EC2 instance. -
CreateProbingInfrastructure- Creates an instance profile and probing Amazon EC2 instance via an AWS CloudFormation stack in the same subnet as the problematic worker node. The stack is namedAWSSupport-TroubleshootEKSDNSFailure-<ExecutionId>. -
GetProbingInstanceId- Retrieves the probing Amazon EC2 instance ID from the CloudFormation stack outputs. -
WaitForProbingInstanceSSMAgentStateToBeOnline- Waits for the Amazon EC2 Systems Manager Agent on the probing Amazon EC2 instance to report anOnlinestatus before proceeding. -
RetrieveCoreDNSPodsIPFromProblematicNode- Retrieves the CoreDNS pod IP addresses as seen from the problematic worker node. -
PerformDNSResolutionOnProbingEC2Instance- Runs DNS resolution checks on the probing Amazon EC2 instance using the cluster IP and CoreDNS pod IP. -
DeleteCloudFormationStack- Deletes the CloudFormation stack that created the probing Amazon EC2 instance and instance profile. -
UpdateCfnStackDeleted- Sets the internalcfnStackDeletedvariable totrue. -
BranchOnCoreDnsDeployment- Checks whetherProblematicNodeInstanceIdwas not provided and CoreDNS host nodes exist. If both conditions are met, proceeds toPerformDNSResolutionOnCoreDnsWorkerNodes. Otherwise, proceeds toCleanupK8sAuthenticationInfrastructure. -
BranchOnCoreDnsNodesExistForRunCommandSteps- Checks whether CoreDNS host nodes exist before runningaws:runCommandsteps. If no nodes exist (for example, when CoreDNS has zero replicas), skips toCleanupK8sAuthenticationInfrastructure. -
PerformDNSResolutionOnCoreDnsWorkerNodes- Runs DNS resolution checks directly on the CoreDNS worker nodes using the cluster IP and CoreDNS pod IP. -
VerifyNameserverMatchAndKubeProxyLogsAndIPTableEntries- Verifies that the nameserver IP matches the cluster IP, checks kube-proxy pod access to the API server, and validates kube-dns iptables entries on the CoreDNS worker nodes. -
VerifyPPSThrottlingOnENIs- Checks the DNS packets-per-second (PPS) limit per Elastic Network Interface (ENI) on the CoreDNS worker nodes to identify potential throttling. -
UpdateChecksOnNodes- Sets the internalchecksOnNodesvariable totrueto indicate that node-level checks were performed. -
CleanupK8sAuthenticationInfrastructure- ExecutesAWSSupport-SetupK8sApiProxyForEKSwith theCleanupoperation to remove the Lambda proxy function and associated resources created during the automation. -
UpdateK8sInfrastructreDeleted- Sets the internalK8sInfrastructreDeletedvariable totrue. -
CleanUpAllResources- Performs a comprehensive cleanup of any remaining resources, including the CloudFormation stack and Lambda proxy function, in case earlier cleanup steps did not complete successfully. -
CollectOutputFromAllRunCommandSteps- Collects and consolidates the output from allaws:runCommandsteps that were executed during the automation. -
GenerateReport- Compiles the results from all preceding steps into a comprehensive evaluation report covering VPC DNS settings, CoreDNS deployment health, ConfigMap configuration, HPA configuration, log collection status, and DNS resolution check results.
Outputs
GenerateReport.EvalReport - A comprehensive report of all DNS
troubleshooting checks performed, including findings and recommended remediation
steps.