Jobs stuck in RUNNABLE due to capacity
- Insufficient instance capacity
  - All connected compute environments have insufficient capacity errors. AWS Batch detects when requested Amazon EC2 instances experience insufficient capacity errors. Manually canceling the job allows the subsequent job to move to the head of the queue.
  - statusReason message while the job is stuck: CAPACITY:INSUFFICIENT_INSTANCE_CAPACITY - Service cannot fulfill the capacity requested for instance type [instanceTypeName]
  - reason used for jobStateTimeLimitActions: CAPACITY:INSUFFICIENT_INSTANCE_CAPACITY
  - statusReason message after the job is canceled by jobStateTimeLimitActions: Canceled by JobStateTimeLimit action due to reason: CAPACITY:INSUFFICIENT_INSTANCE_CAPACITY
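As a minimal sketch, the reason string above can be paired with a jobStateTimeLimitActions entry on the job queue so that capacity-blocked jobs are canceled automatically. The queue name my-job-queue, the 1800-second limit, and the helper name build_capacity_timeout_update are hypothetical. The field names and the 600 to 86400 second range follow the AWS Batch UpdateJobQueue API as commonly documented; verify them against the current API reference before relying on them.

```python
import json

# Reason string taken from the list above.
CAPACITY_REASON = "CAPACITY:INSUFFICIENT_INSTANCE_CAPACITY"


def build_capacity_timeout_update(job_queue: str, max_time_seconds: int) -> dict:
    """Build an UpdateJobQueue-style payload that cancels capacity-blocked jobs.

    Hypothetical helper: it only assembles the request body and does not
    call AWS.
    """
    # Assumption: maxTimeSeconds must fall within 600 to 86400 seconds.
    if not 600 <= max_time_seconds <= 86400:
        raise ValueError("max_time_seconds must be between 600 and 86400")
    return {
        "jobQueue": job_queue,
        "jobStateTimeLimitActions": [
            {
                "reason": CAPACITY_REASON,
                "state": "RUNNABLE",
                "maxTimeSeconds": max_time_seconds,
                "action": "CANCEL",
            }
        ],
    }


# "my-job-queue" and 1800 are placeholder values for illustration.
payload = build_capacity_timeout_update("my-job-queue", 1800)
print(json.dumps(payload, indent=2))
```

With boto3 installed and credentials configured, a payload of this shape could be passed as keyword arguments to the Batch client's update_job_queue call. The sketch stops at building the dictionary so that it runs without AWS access.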
Note:
- The AWS Batch service role requires the autoscaling:DescribeScalingActivities permission for this detection to work. If you use the AWS Batch service-linked role (SLR) (see Using service-linked roles for AWS Batch) or the AWSBatchServiceRole managed policy (see AWS managed policy: AWSBatchServiceRole policy), you don't need to take any action because their permission policies are already updated.
- If you don't use the SLR or the managed policy, you must add the autoscaling:DescribeScalingActivities and ec2:DescribeSpotFleetRequestHistory permissions so that you can receive blocked job queue events and updated job status while jobs are in RUNNABLE. Without these permissions, AWS Batch also can't perform cancellation actions through the jobStateTimeLimitActions parameter, even if those actions are configured on the job queue.
- For a multi-node parallel (MNP) job, if the attached high-priority Amazon EC2 compute environment experiences insufficient capacity errors, it blocks the queue even if a lower-priority compute environment doesn't experience this error.
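The two permissions named in the note can be expressed as an IAM policy statement. The following is a minimal sketch rather than an official policy: the Sid value, the "*" resource scope, and the helper names build_detection_policy and missing_actions are assumptions made for illustration. The check matches exact action strings in Allow statements only and does not evaluate wildcards, Deny statements, or conditions.

```python
import json

# Permissions named in the note above.
REQUIRED_ACTIONS = [
    "autoscaling:DescribeScalingActivities",
    "ec2:DescribeSpotFleetRequestHistory",
]


def build_detection_policy() -> dict:
    """Return a policy document granting the two permissions (illustrative)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowBatchCapacityDetection",  # hypothetical Sid
                "Effect": "Allow",
                "Action": list(REQUIRED_ACTIONS),
                "Resource": "*",  # assumption: narrow this if your setup allows
            }
        ],
    }


def missing_actions(policy: dict) -> list:
    """List required actions not granted by an exact-match Allow statement."""
    granted = set()
    for statement in policy.get("Statement", []):
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):  # IAM allows a single string here
            actions = [actions]
        granted.update(actions)
    return [action for action in REQUIRED_ACTIONS if action not in granted]


policy = build_detection_policy()
print(json.dumps(policy, indent=2))
print(missing_actions(policy))
```

Running missing_actions against the JSON of a custom service role policy gives a quick first check of whether either permission still needs to be added.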
-