Track service job capacity utilization
AWS Batch provides multiple API operations that you can use together to track capacity utilization for service jobs in a queue. The monitoring workflow depends on the type of scheduling policy that is attached to your job queue.
For job queues that use a first-in, first-out (FIFO) scheduling policy:
- Check total queue utilization (GetJobQueueSnapshot).
- List jobs by status, such as SCHEDULED and RUNNING (ListServiceJobs).
- Examine any given job (DescribeServiceJob).
For job queues that use a fair-share (FSS) or quota-management (QM) scheduling policy:
- Check total queue utilization (GetJobQueueSnapshot).
- View per-share utilization (GetJobQueueSnapshot).
- List jobs by status and share that are actively contributing to utilization, such as SCHEDULED and RUNNING (ListServiceJobs).
- Examine any given job (DescribeServiceJob).
The following sections walk through each step in detail.
For information about tracking capacity utilization for ECS, EKS, and Fargate compute jobs, see Track compute job capacity utilization.
Check queue utilization
The queueUtilization field in the GetJobQueueSnapshot response provides a point-in-time view of how
much compute capacity is consumed by jobs dispatched from a queue. Capacity is measured in
instance count for service jobs.
For job queues that use a fair-share or quota-management scheduling policy, the response also includes a per-share breakdown so you can see how capacity is distributed across shares. For more information, see View per-share utilization.
View capacity utilization (AWS CLI)
Use the get-job-queue-snapshot command to retrieve a snapshot of the capacity utilization for a job queue.
aws batch get-job-queue-snapshot \
  --job-queue my-job-queue
The response varies depending on the scheduling policy that is attached to your job queue.
View per-share utilization
For job queues with a fair-share or quota-management scheduling policy, the
queueUtilization response from GetJobQueueSnapshot includes a
utilization object with a topCapacityUtilization array
that lists the top active shares by consumption.
This information helps you:
- Identify which shares consume the most resources.
- Verify that resources are distributed across shares as expected.
- Detect shares that may be saturating or under-utilizing their allocation.
- Determine whether to adjust your scheduling policy configuration.
For more information about fair-share scheduling policies, see Fair-share scheduling policies.
For more information about quota shares, see Quota shares.
List service jobs by status and share
After you identify the overall queue and per-share utilization, use the ListServiceJobs API operation to find the service jobs that are actively
contributing to utilization. You can filter by job status to see jobs that are
RUNNING, SCHEDULED, or in another state. For queues with a
fair-share or quota-management scheduling policy, you can also filter by share identifier to narrow
results to a specific share.
Note
The SHARE_IDENTIFIER and QUOTA_SHARE_NAME filters are the
only filters that can be combined with the
jobStatus parameter. When you use other filters, the jobStatus
parameter is ignored.
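The rule in the preceding note can be checked before a request is sent. The following is a minimal sketch in Python, not an official helper: build_list_service_jobs_args is a hypothetical function, and the keys jobQueue, jobStatus, and filters are assumed to mirror the --job-queue, --job-status, and --filters CLI options. It only builds a dictionary and makes no AWS call.

```python
# Filters that, per the note above, can be combined with jobStatus.
STATUS_COMPATIBLE_FILTERS = {"SHARE_IDENTIFIER", "QUOTA_SHARE_NAME"}


def build_list_service_jobs_args(job_queue, job_status=None,
                                 filter_name=None, filter_values=None):
    """Build request arguments for listing service jobs.

    Hypothetical helper: key names are assumed to mirror the CLI options.
    Raises ValueError for a combination in which jobStatus would be ignored.
    """
    args = {"jobQueue": job_queue}
    if filter_name is not None:
        if job_status is not None and filter_name not in STATUS_COMPATIBLE_FILTERS:
            raise ValueError(
                f"jobStatus is ignored when combined with the {filter_name} filter"
            )
        args["filters"] = [
            {"name": filter_name, "values": list(filter_values or [])}
        ]
    if job_status is not None:
        args["jobStatus"] = job_status
    return args


# Running jobs for the team-a share in a fair-share queue.
print(build_list_service_jobs_args(
    "my-job-queue", "RUNNING", "SHARE_IDENTIFIER", ["team-a"]
))
```

Failing early is a design choice for this sketch. A silently ignored jobStatus would return jobs in every state, which is easy to misread as a utilization spike.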
List service jobs (AWS CLI)
Use the list-service-jobs command
with the --job-status parameter to filter by status.
View running service jobs in your queue:
aws batch list-service-jobs \
  --job-queue my-job-queue \
  --job-status RUNNING
For queues with a fair-share scheduling policy, use the --filters parameter with
SHARE_IDENTIFIER to list jobs for a specific share.
For queues with a quota-management scheduling policy, use QUOTA_SHARE_NAME
to list jobs for a specific quota share. This is useful when you identify a share with high
capacity consumption and want to see which jobs are responsible.
List only RUNNING service jobs for a share from a fair-share queue:
aws batch list-service-jobs \
  --job-queue my-job-queue \
  --job-status RUNNING \
  --filters name=SHARE_IDENTIFIER,values="team-a"
For queues with a quota-management scheduling policy, use the
QUOTA_SHARE_NAME filter:
aws batch list-service-jobs \
  --job-queue my-job-queue \
  --job-status RUNNING \
  --filters name=QUOTA_SHARE_NAME,values="my-quota-share"
The following is an example response for listing running service jobs filtered by share identifier in a fair-share queue.
{
  "jobSummaryList": [
    {
      "jobArn": "arn:aws:batch:us-east-1:123456789012:service-job/a4d6c728-8ee8-4c65-8e2a-9a5e8f4b7c3d",
      "jobId": "a4d6c728-8ee8-4c65-8e2a-9a5e8f4b7c3d",
      "jobName": "my-training-job",
      "serviceJobType": "SAGEMAKER_TRAINING",
      "status": "RUNNING",
      "shareIdentifier": "team-a",
      "createdAt": 1700000000000,
      "scheduledAt": 1700000060000,
      "startedAt": 1700000120000,
      "capacityUsage": [
        {
          "capacityUnit": "ml.m5.large",
          "quantity": 5.0
        }
      ],
      "latestAttempt": {
        "serviceResourceId": {
          "name": "TrainingJobArn",
          "value": "arn:aws:sagemaker:us-east-1:123456789012:training-job/my-training-job"
        }
      }
    }
  ]
}
In this example, the response includes the shareIdentifier field showing
the job belongs to the team-a share, and the capacityUsage array
showing that the job consumes 5 ml.m5.large instances. The
latestAttempt object contains the service resource identifier that you can use
to get additional details from the target service.
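When a share has several running jobs, it can help to total the capacityUsage entries across the list. The following Python sketch assumes the response has already been parsed into a dictionary with the shape shown in the example. The function name total_capacity_by_unit and the second job in the sample data are made up for illustration. It processes a single response and does not follow pagination.

```python
from collections import defaultdict


def total_capacity_by_unit(response):
    """Sum capacityUsage quantities per capacityUnit across all listed jobs."""
    totals = defaultdict(float)
    for job in response.get("jobSummaryList", []):
        # Jobs without a capacityUsage array contribute nothing.
        for usage in job.get("capacityUsage", []):
            totals[usage["capacityUnit"]] += usage["quantity"]
    return dict(totals)


# Trimmed copy of the example response, plus one hypothetical job.
sample = {
    "jobSummaryList": [
        {
            "jobName": "my-training-job",
            "shareIdentifier": "team-a",
            "capacityUsage": [{"capacityUnit": "ml.m5.large", "quantity": 5.0}],
        },
        {
            "jobName": "hypothetical-second-job",
            "shareIdentifier": "team-a",
            "capacityUsage": [{"capacityUnit": "ml.m5.large", "quantity": 2.0}],
        },
    ]
}

print(total_capacity_by_unit(sample))  # {'ml.m5.large': 7.0}
```

Grouping by capacityUnit keeps different instance types separate, so the totals stay comparable to instance-count figures.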
Examine a specific service job
After you identify a service job of interest, use the DescribeServiceJob operation to get comprehensive information about
the job, including its current status, service resource identifiers, and detailed attempt
information.
View detailed information about a specific service job:
aws batch describe-service-job \
  --job-id a4d6c728-8ee8-4c65-8e2a-9a5e8f4b7c3d
This command returns comprehensive information about the job, including:
- Job ARN and current status
- Service resource identifiers (such as SageMaker Training job ARN)
- Scheduling priority and retry configuration
- Service request payload containing the original service parameters
- Detailed attempt information with start and stop times
- Status messages from the target service
Examine underlying SageMaker Training job
When monitoring SageMaker Training jobs through AWS Batch, you can access both AWS Batch job information and the underlying SageMaker Training job details.
The service resource identifier in the job details contains the SageMaker Training job ARN:
{
  "latestAttempt": {
    "serviceResourceId": {
      "name": "TrainingJobArn",
      "value": "arn:aws:sagemaker:us-east-1:123456789012:training-job/my-training-job"
    }
  }
}
You can use the training job name at the end of this ARN to get additional details directly from SageMaker:
aws sagemaker describe-training-job \
  --training-job-name my-training-job
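The describe-training-job command takes the job name, which is the segment after training-job/ in the ARN. The following Python sketch shows one way to derive it from the job details shown earlier. The function names are hypothetical, and the code assumes the serviceResourceId value is a standard six-field ARN.

```python
def training_job_name_from_arn(arn):
    """Return the name that follows 'training-job/' in a SageMaker Training job ARN."""
    prefix = "training-job/"
    # Split into at most six fields; the last one is the resource part.
    resource = arn.split(":", 5)[-1]
    if not resource.startswith(prefix):
        raise ValueError(f"not a training job ARN: {arn}")
    return resource[len(prefix):]


def training_job_name_from_details(job_details):
    """Read the ARN from latestAttempt.serviceResourceId and extract the name."""
    resource_id = job_details["latestAttempt"]["serviceResourceId"]
    return training_job_name_from_arn(resource_id["value"])


details = {
    "latestAttempt": {
        "serviceResourceId": {
            "name": "TrainingJobArn",
            "value": "arn:aws:sagemaker:us-east-1:123456789012:training-job/my-training-job",
        }
    }
}

print(training_job_name_from_details(details))  # my-training-job
```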
Monitor job progress by checking both AWS Batch status and SageMaker Training job status. The AWS Batch job status shows the overall job lifecycle, while the SageMaker Training job status provides service-specific details about the training process.