SageMaker HyperPod cluster events reference
This page provides a complete reference of all structured events emitted by Amazon SageMaker HyperPod clusters. Events provide visibility into cluster, instance group, and instance-level operations including provisioning, scaling, patching, and orchestrator-specific lifecycle changes.
Cluster events are available for HyperPod clusters with
NodeProvisioningMode set to Continuous. Events are accessible
through the ListClusterEvents and DescribeClusterEvent APIs, the
SageMaker AI console Events tab, and Amazon EventBridge.
Cluster event record
Each cluster event is represented as a structured record containing identification,
timing, scope, severity, and operation-specific metadata. The following example shows a
complete event record as delivered through Amazon EventBridge. The record itself is the
detail.EventDetails object, which is also what the DescribeClusterEvent API returns;
the surrounding fields are the EventBridge envelope:
{
  "version": "0",
  "id": "0bd4a141-0a02-9d8a-f977-3924c3fb259c",
  "detail-type": "SageMaker HyperPod Cluster Event",
  "source": "aws.sagemaker",
  "account": "111122223333",
  "time": "2026-06-01T17:20:25Z",
  "region": "us-west-2",
  "resources": [
    "arn:aws:sagemaker:us-west-2:111122223333:cluster/sample-cluster"
  ],
  "detail": {
    "EventDetails": {
      "EventId": "83ea0bb5-be77-45e8-a458-0a87f778a205",
      "ClusterArn": "arn:aws:sagemaker:us-west-2:111122223333:cluster/sample-cluster",
      "ClusterName": "sample-cluster",
      "InstanceGroupName": "p5Inst",
      "InstanceId": "i-0391f86fa0fe0d465",
      "ResourceType": "Instance",
      "EventTime": 1748794825350,
      "EventLevel": "Error",
      "Description": "Instance creation in Cluster sample-cluster and InstanceGroup p5Inst failed",
      "EventDetails": {
        "EventMetadata": {
          "Instance": {
            "FailureMessage": "We currently do not have sufficient capacity to launch new ml.p5.48xlarge instances. Please try again.",
            "NodeLogicalId": "df268d19-f035-4f28-9b80-b956b92ae21e"
          }
        }
      }
    }
  }
}
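As a minimal sketch of how a consumer might read this record, the following Python function pulls the headline fields and the nested failure message out of an EventBridge-delivered event. The function name `summarize_event` and the shape of its return value are illustrative assumptions, not part of any AWS SDK. The key paths are taken from the example above, and the sketch assumes the metadata key matches the ResourceType value, as it does in that example.

```python
from typing import Any, Optional


def summarize_event(event: dict[str, Any]) -> dict[str, Optional[str]]:
    """Return headline fields from an EventBridge-delivered cluster event."""
    record = event.get("detail", {}).get("EventDetails", {})
    # Operation-specific metadata sits under a second, nested EventDetails key.
    metadata = record.get("EventDetails", {}).get("EventMetadata", {})
    # Assumption: the metadata is keyed by the ResourceType value ("Instance" here).
    scoped = metadata.get(record.get("ResourceType", ""), {})
    return {
        "cluster": record.get("ClusterName"),
        "level": record.get("EventLevel"),
        "description": record.get("Description"),
        "failure_message": scoped.get("FailureMessage"),
    }


# Trimmed copy of the example record above.
sample = {
    "detail-type": "SageMaker HyperPod Cluster Event",
    "detail": {
        "EventDetails": {
            "ClusterName": "sample-cluster",
            "ResourceType": "Instance",
            "EventLevel": "Error",
            "Description": "Instance creation in Cluster sample-cluster and InstanceGroup p5Inst failed",
            "EventDetails": {
                "EventMetadata": {
                    "Instance": {
                        "FailureMessage": "We currently do not have sufficient capacity to launch new ml.p5.48xlarge instances. Please try again.",
                        "NodeLogicalId": "df268d19-f035-4f28-9b80-b956b92ae21e",
                    }
                }
            },
        }
    },
}

print(summarize_event(sample))
```

Every lookup uses `.get` with a default, so a record without metadata yields `None` for the failure message instead of raising.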
Event record fields
The detail.EventDetails object contains the following fields:
| Field | Type | Required | Description |
|---|---|---|---|
| EventId | String (UUID) | Yes | Unique identifier for the event. |
| ClusterArn | String | Yes | ARN of the HyperPod cluster. |
| ClusterName | String | Yes | Name of the HyperPod cluster. |
| EventTime | Timestamp | Yes | When the event occurred (epoch milliseconds). |
| ResourceType | String | Yes | Scope of the event: Cluster, InstanceGroup, or Instance. |
| EventLevel | String | Yes | Severity classification: Info, Warn, or Error. |
| Description | String | No | Human-readable summary of the event. |
| InstanceGroupName | String | No | Instance group name (present when ResourceType is InstanceGroup or Instance). |
| InstanceId | String | No | EC2 instance ID (present when ResourceType is Instance). |
| EventDetails | Object | No | Additional metadata specific to the resource type and operation. |
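The field table above can be mirrored in a small typed model. The sketch below is one possible approach, not an official SDK type: the names `ClusterEvent`, `parse_cluster_event`, and `REQUIRED_FIELDS` are hypothetical. It assumes EventTime arrives as epoch milliseconds, as in the EventBridge payload; an SDK may already deserialize the timestamp for you.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, Optional

# Fields marked "Yes" in the Required column of the table above.
REQUIRED_FIELDS = (
    "EventId", "ClusterArn", "ClusterName",
    "EventTime", "ResourceType", "EventLevel",
)


@dataclass(frozen=True)
class ClusterEvent:
    event_id: str
    cluster_arn: str
    cluster_name: str
    event_time: datetime
    resource_type: str
    event_level: str
    description: Optional[str] = None
    instance_group_name: Optional[str] = None
    instance_id: Optional[str] = None


def parse_cluster_event(record: dict[str, Any]) -> ClusterEvent:
    """Validate required fields and build a ClusterEvent from a raw record."""
    missing = [name for name in REQUIRED_FIELDS if name not in record]
    if missing:
        raise ValueError(f"missing required fields: {', '.join(missing)}")
    return ClusterEvent(
        event_id=record["EventId"],
        cluster_arn=record["ClusterArn"],
        cluster_name=record["ClusterName"],
        # Epoch milliseconds converted to a timezone-aware UTC datetime.
        event_time=datetime.fromtimestamp(record["EventTime"] / 1000, tz=timezone.utc),
        resource_type=record["ResourceType"],
        event_level=record["EventLevel"],
        description=record.get("Description"),
        instance_group_name=record.get("InstanceGroupName"),
        instance_id=record.get("InstanceId"),
    )


example_record = {
    "EventId": "83ea0bb5-be77-45e8-a458-0a87f778a205",
    "ClusterArn": "arn:aws:sagemaker:us-west-2:111122223333:cluster/sample-cluster",
    "ClusterName": "sample-cluster",
    "EventTime": 1748794825350,
    "ResourceType": "InstanceGroup",
    "EventLevel": "Info",
    "InstanceGroupName": "p5Inst",
}
print(parse_cluster_event(example_record))
```

Optional fields default to `None`, so the same model covers Cluster, InstanceGroup, and Instance scoped records.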
Event levels
| Level | Meaning |
|---|---|
| Info | Operation completed successfully or is progressing normally. |
| Warn | Operation completed with a non-critical issue or a condition that may require future attention. |
| Error | Operation failed or requires immediate attention. |
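Because the three levels form a natural severity order, a consumer can filter on a minimum level. The following is a minimal sketch; `LEVEL_ORDER` and `filter_by_level` are hypothetical names, and the numeric ranks are an assumption made for comparison only, since the service does not assign numbers to levels.

```python
from typing import Any

# Assumed ranking for comparison purposes: Info < Warn < Error.
LEVEL_ORDER = {"Info": 0, "Warn": 1, "Error": 2}


def filter_by_level(records: list[dict[str, Any]], minimum: str = "Warn") -> list[dict[str, Any]]:
    """Keep records whose EventLevel is at or above the given minimum."""
    threshold = LEVEL_ORDER[minimum]
    # Unrecognized levels rank highest so they are surfaced, not silently dropped.
    unknown_rank = len(LEVEL_ORDER)
    return [
        record for record in records
        if LEVEL_ORDER.get(record.get("EventLevel"), unknown_rank) >= threshold
    ]


records = [
    {"EventId": "a", "EventLevel": "Info"},
    {"EventId": "b", "EventLevel": "Warn"},
    {"EventId": "c", "EventLevel": "Error"},
]
print([record["EventId"] for record in filter_by_level(records, "Warn")])
```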
Resource types
| ResourceType | Scope | Example events |
|---|---|---|
| Cluster | Whole-cluster operations | Cluster creation/update started, cluster operation failed |
| InstanceGroup | Instance group operations | Scaling started/completed, patching scheduled, FSx lifecycle |
| Instance | Individual instance operations | EC2 provisioning, lifecycle script execution, ENI management, termination |
EventDetails metadata
Cluster events include an EventMetadata object within the
EventDetails field that provides operation-specific context
beyond what the event description conveys. The contents of
EventMetadata vary by resource type and event type. For the
complete schema and supported fields, see EventMetadata in the Amazon SageMaker AI API Reference.
EventBridge envelope fields
When delivered through Amazon EventBridge, the event record is wrapped in the standard EventBridge envelope:
| Field | Description |
|---|---|
| version | EventBridge schema version (always "0"). |
| id | Unique EventBridge event ID. |
| detail-type | SageMaker HyperPod Cluster Event |
| source | aws.sagemaker |
| account | AWS account ID that owns the cluster. |
| time | ISO 8601 timestamp of the event. |
| region | AWS Region where the cluster resides. |
| resources | Array containing the cluster ARN. |
| detail | Contains the EventDetails object described above. |
Common events (EKS and Slurm)
The following events are emitted for all HyperPod clusters regardless of
orchestrator. The Description column shows the value of
the Description field in the event record as it appears in the API response
and the console Events tab.
Cluster lifecycle
| Event | Description |
|---|---|
| Cluster operation started | Cluster <cluster-name> <operation> started successfully |
| Cluster operation start failed | Failed to start Cluster <cluster-name> <operation> |
| Cluster operation completed | Cluster <cluster-name> <operation> completed successfully |
| Cluster operation failed | Cluster <cluster-name> <operation> failed |
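The Description templates in these tables use angle-bracket placeholders, which can be turned into regular expressions when only the text is available. The sketch below shows one way to do that for the cluster lifecycle templates above; `template_to_regex`, `classify`, and the sample operation value `creation` are hypothetical. Description is a human-readable field and actual messages may differ from the templates, so prefer structured fields such as ClusterName and ResourceType where they are present.

```python
import re
from typing import Optional

PLACEHOLDER = re.compile(r"(<[a-z-]+>)")

# Templates copied from the cluster lifecycle table above.
TEMPLATES = {
    "Cluster operation started": "Cluster <cluster-name> <operation> started successfully",
    "Cluster operation start failed": "Failed to start Cluster <cluster-name> <operation>",
    "Cluster operation completed": "Cluster <cluster-name> <operation> completed successfully",
    "Cluster operation failed": "Cluster <cluster-name> <operation> failed",
}


def template_to_regex(template: str) -> re.Pattern:
    """Compile a template, turning each <placeholder> into a named group."""
    pieces = []
    for part in PLACEHOLDER.split(template):
        if PLACEHOLDER.fullmatch(part):
            # Group names cannot contain hyphens, so use underscores.
            name = part[1:-1].replace("-", "_")
            pieces.append(f"(?P<{name}>.+?)")
        else:
            pieces.append(re.escape(part))
    return re.compile("^" + "".join(pieces) + "$")


COMPILED = {event: template_to_regex(text) for event, text in TEMPLATES.items()}


def classify(description: str) -> Optional[tuple[str, dict[str, str]]]:
    """Return (event name, placeholder values) for the first matching template."""
    for event, regex in COMPILED.items():
        match = regex.match(description)
        if match:
            return event, match.groupdict()
    return None


# "creation" is an assumed operation value used only for illustration.
print(classify("Cluster sample-cluster creation failed"))
```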
Instance group lifecycle
| Event | Description |
|---|---|
| Instance group operation started | InstanceGroup <instance-group-name> <operation> started successfully in Cluster <cluster-name> |
| Instance group operation start failed | Failed to start InstanceGroup <instance-group-name> <operation> in Cluster <cluster-name> |
| Instance group operation completed | Instance Group <instance-group-name> <operation> in Cluster <cluster-name> completed successfully |
| Instance group operation failed | Instance Group <instance-group-name> <operation> in Cluster <cluster-name> failed |
Instance group network configuration
| Event | Description |
|---|---|
| Network configuration found | Found Subnet <subnet-id> in AZ <availability-zone> with SecurityGroupIds <security-group-ids> for IG <instance-group-name> in Cluster <cluster-name> |
| Network configuration failed | Failed to process Instance Group Network Configuration details for IG <instance-group-name> in Cluster <cluster-name> |
| Custom AMI override found | Found Custom AMI Override <ami-id> for IG <instance-group-name> in Cluster <cluster-name> |
| Custom AMI override failed | Failed to process Custom AMI Override details for IG <instance-group-name> in Cluster <cluster-name> |
| Platform network configuration used | Using HyperPod Platform provided network configuration for IG <instance-group-name> in Cluster <cluster-name> |
| Network configuration determined | Instance Group network configuration successfully determined for IG <instance-group-name> in Cluster <cluster-name> |
Instance creation
| Event | Description |
|---|---|
| Instance operation started | Instance <operation> started successfully in Cluster <cluster-name> and IG <instance-group-name> |
| Instance operation start failed | Failed to start Instance <operation> in Cluster <cluster-name> and IG <instance-group-name> |
| Capacity reservation found | Found CapacityReservation ID <reservation-id> for Cluster <cluster-name> and IG <instance-group-name>, using reserved capacity |
| Capacity reservation not found | No CapacityReservation found for Cluster <cluster-name> and IG <instance-group-name>, using on-demand pool |
| Instance payload setup failed | Failed to process CapacityReservationDetails for Cluster <cluster-name> and IG <instance-group-name> |
| Customer ENI created | Successfully created Customer ENI for instance in Cluster <cluster-name> and IG <instance-group-name> |
| Customer ENI creation failed | Failed to create Customer ENI for instance in Cluster <cluster-name> and IG <instance-group-name> |
| EC2 instance provisioned | EC2 Instance <instance-id> successfully provisioned in Cluster <cluster-name> and IG <instance-group-name> |
| EC2 instance creation failed | Failed to provision EC2 Instance in Cluster <cluster-name> and IG <instance-group-name> |
| Lifecycle script status updated | Instance lifecycle script execution for EC2 Instance <instance-id> has <status> |
| Lifecycle script status update failed | Failed to update Instance lifecycle script execution status for EC2InstanceId <instance-id> |
| Instance creation failed with lifecycle logs | Instance lifecycle script execution for EC2 Instance <instance-id> has Failed. To view lifecycle script logs, visit log group... |
| Unused ENI cleanup succeeded | Successfully deleted unused Customer ENIs for EC2 Instance <instance-id> in Cluster <cluster-name> and IG <instance-group-name> |
| Unused ENI cleanup failed | Failed to delete unused Customer ENIs for EC2 Instance <instance-id> in Cluster <cluster-name> and IG <instance-group-name> |
Instance deletion
| Event | Description |
|---|---|
| EC2 instance termination in progress | Termination of EC2 Instance <instance-id> is currently in progress in Cluster <cluster-name> and IG <instance-group-name> |
| EC2 instance termination failed | Failed to terminate EC2 Instance <instance-id> in Cluster <cluster-name> and IG <instance-group-name> |
| Customer ENI deleted | Customer ENI successfully deleted for EC2 Instance <instance-id> in Cluster <cluster-name> and IG <instance-group-name> |
| Customer ENI deletion failed | Failed to delete Customer ENI for EC2 Instance <instance-id> in Cluster <cluster-name> and IG <instance-group-name> |
Instance reboot
| Event | Description |
|---|---|
| EC2 instance reboot in progress | Reboot of EC2 Instance <instance-id> is currently in progress on Cluster <cluster-name> and IG <instance-group-name> |
| EC2 instance reboot request failed | Failed to submit reboot request for EC2 Instance <instance-id> in Cluster <cluster-name> and IG <instance-group-name> |
Instance operation (generic)
| Event | Description |
|---|---|
| Instance operation completed | Instance <operation> <instance-id> in Cluster <cluster-name> and IG <instance-group-name> completed successfully |
| Instance operation failed | Instance <operation> <instance-id> in Cluster <cluster-name> and IG <instance-group-name> failed |
Instance replacement
| Event | Description |
|---|---|
| Instance replacement started | Instance <instance-id> is starting as part of instance replacement in Cluster <cluster-name> and IG <instance-group-name> |
| Instance replacement start failed | Instance <instance-id> failed to start as part of instance replacement in Cluster <cluster-name> and IG <instance-group-name> |
| Instance replacement completed | Instance <instance-id> <operation> completed successfully as part of instance replacement in Cluster <cluster-name> and IG <instance-group-name> |
| Instance replacement failed | Instance <instance-id> <operation> failed as part of instance replacement in Cluster <cluster-name> and IG <instance-group-name> |
FSx filesystem lifecycle
| Event | Description |
|---|---|
| FSx creation started | FSx creation started for IG <instance-group-name> in Cluster <cluster-name> |
| FSx creation failed | Failed to create FSx for IG <instance-group-name> in Cluster <cluster-name> |
| FSx creation completed | FSx creation successfully completed for IG <instance-group-name> in Cluster <cluster-name> |
| FSx deletion started | FSx deletion started for IG <instance-group-name> in Cluster <cluster-name> |
| FSx deletion failed | Failed to delete FSx for IG <instance-group-name> in Cluster <cluster-name> |
| FSx deletion completed | FSx deletion successfully completed for IG <instance-group-name> in Cluster <cluster-name> |
| FSx update started | FSx update started for IG <instance-group-name> in Cluster <cluster-name> |
| FSx update failed | Failed to update FSx for IG <instance-group-name> in Cluster <cluster-name> |
| FSx update completed | FSx update successfully completed for IG <instance-group-name> in Cluster <cluster-name> |
Patching (common steps)
These patching events are emitted for both EKS and Slurm clusters during
UpdateClusterSoftware operations.
| Event | Description |
|---|---|
| Instance group patching scheduled | InstanceGroup <instance-group-name> in Cluster <cluster-name> has been scheduled for UpdateClusterSoftware to latest. |
| Instance group patching schedule failed | Failed to schedule UpdateClusterSoftware for IG <instance-group-name> in Cluster <cluster-name>. |
| Instance group patching started | UpdateClusterSoftware initiated for IG <instance-group-name> in Cluster <cluster-name> using <strategy> strategy. |
| Instance group patching start failed | Failed to initiate UpdateClusterSoftware for IG <instance-group-name> in Cluster <cluster-name>. |
| Next patching batch selected | Next update batch selected for IG <instance-group-name> in Cluster <cluster-name>. |
| Next patching batch selection failed | Failed to select the next update batch for IG <instance-group-name> in Cluster <cluster-name>. |
| Failed instances queued for replacement | Failed instances in IG <instance-group-name> in Cluster <cluster-name> queued for node replacement. |
| Failed instance replacement queueing failed | Failed to queue instances for node replacement in IG <instance-group-name> in Cluster <cluster-name>. |
| Instance group patching completed | UpdateClusterSoftware completed successfully for IG <instance-group-name> in Cluster <cluster-name>. |
| Instance group patching completion failed | Failed to complete UpdateClusterSoftware for IG <instance-group-name> in Cluster <cluster-name>. |
| Root volume replacement started | Root volume replacement started for Instance <instance-id> in IG <instance-group-name>. |
| Root volume replacement failed | Failed to start root volume replacement for Instance <instance-id> in IG <instance-group-name>. |
| Instance patching succeeded | Instance <instance-id> in IG <instance-group-name> updated successfully. |
EKS-specific events
The following events are emitted only for HyperPod clusters orchestrated with Amazon EKS.
Access entry management
| Event | Description |
|---|---|
| SLR access entry operation succeeded | SLR Access Entry <operation> successful for Cluster <cluster-name> |
| SLR access entry operation failed | SLR Access Entry <operation> failed for Cluster <cluster-name> |
| EKS access entries operation succeeded | EKS Access Entries <operation> successful for Cluster <cluster-name> |
| EKS access entries operation failed | EKS Access Entries <operation> failed for Cluster <cluster-name> |
Kubernetes configuration updates
| Event | Description |
|---|---|
| Kubernetes config update succeeded | Successfully updated Kubernetes config for instance <instance-id> in Cluster <cluster-name> and IG <instance-group-name> |
| Kubernetes config update failed | Failed to update Kubernetes config for instance <instance-id> in Cluster <cluster-name> and IG <instance-group-name> |
Karpenter autoscaling
| Event | Description |
|---|---|
| Autoscaling operation succeeded | AutoScaling <operation> <status> successfully in Cluster <cluster-name> |
| Autoscaling operation failed | Failed to <operation> AutoScaling in Cluster <cluster-name> |
| Karpenter CRD installation succeeded | CustomResourceDefinition installation completed successfully in EKS Cluster <cluster-name> |
| Karpenter CRD installation failed | CustomResourceDefinition installation failed for EKS Cluster <cluster-name> |
| Karpenter SLR access policy update succeeded | <operation> access policies with AmazonSageMakerHyperPodServiceRole access entry in EKS cluster <cluster-name> successfully |
| Karpenter SLR access policy update failed | Failed to <operation> access policies with AmazonSageMakerHyperPodServiceRole access entry in EKS cluster <cluster-name> |
Patching (EKS instance-level)
| Event | Description |
|---|---|
| Instance patching preparation succeeded | Instance <instance-id> in IG <instance-group-name> cordoned and pods evicted. |
| Instance patching skipped (PDB violation) | UpdateClusterSoftware for Instance <instance-id> in IG <instance-group-name> skipped due to PodDisruptionBudget constraint. |
| Instance patching preparation failed | Failed to prepare instance <instance-id> in IG <instance-group-name> for UpdateClusterSoftware. |
| Instance restored to schedulable state | Instance <instance-id> in IG <instance-group-name> restored to schedulable state. |
| Instance restore to schedulable failed | Failed to restore instance <instance-id> in IG <instance-group-name> to schedulable state. |
Patching (EKS rollback)
| Event | Description |
|---|---|
| Bake time started | Baking period started for IG <instance-group-name> in Cluster <cluster-name>. Monitoring alarms [<alarm-names>] for <duration> seconds. |
| Bake time completed | Baking period completed for IG <instance-group-name> in Cluster <cluster-name>. No alarms triggered during the <duration>-second baking period. |
| Bake time alarm triggered | Baking period failed for IG <instance-group-name> in Cluster <cluster-name>. Alarms [<alarm-names>] entered ALARM state. Initiating auto-rollback. |
| Bake time evaluation failed | Failed to evaluate alarms during baking period for IG <instance-group-name> in Cluster <cluster-name>. |
| Instance group patching rollback initiated | UpdateClusterSoftware failed for IG <instance-group-name> in Cluster <cluster-name>. Initiating rollback. |
| Instance group patching rollback failed | Rollback failed for IG <instance-group-name> in Cluster <cluster-name>. Some instances may be in FailedMaintenance state. |
| Instance patching rollback initiated | Instance <instance-id> in IG <instance-group-name> failed to update. Rollback initiated. |
| Instance patching rollback succeeded | Instance <instance-id> in IG <instance-group-name> rolled back successfully to previous AMI. |
| Instance patching rollback failed | UpdateClusterSoftware rollback failed for instance <instance-id> in IG <instance-group-name>. |
Slurm-specific events
The following events are emitted only for HyperPod clusters orchestrated with Slurm.
| Event | Description |
|---|---|
| Provisioning parameters found | Found provisioning_parameters.json in LifeCycleScript S3 Path for controller group <instance-group-name> |
| Provisioning parameters not found | No provisioning_parameters.json found in LifeCycleScript S3 Path for controller group <instance-group-name> |
| Slurm munge key created | Successfully created and stored munge key |
| Slurm drift validation passed | Slurm configuration drift validation passed |
| Slurm drift detected | Slurm configuration drift detected: <drift-details> |
| Slurm cluster rollback completed | Cluster creation failed: controller and login nodes did not become ready within the expected time |
| Slurm reconfiguration succeeded | Slurm was reconfigured successfully. Slurm config updated to match desired state |
EventBridge integration
HyperPod sends cluster events to Amazon EventBridge using three detail types:
| Detail type | Description |
|---|---|
| SageMaker HyperPod Cluster Event | Operational events for provisioning, scaling, patching, and orchestrator-specific operations. Includes EventLevel for severity filtering. |
| SageMaker HyperPod Cluster State Change | Cluster-level status transitions (for example, Creating to InService). Includes full cluster configuration. |
| SageMaker HyperPod Cluster Node Health Event | Health monitoring events from the HyperPod Health Monitoring Agent (HMA). Includes health status, reason, repair action, and recommendation. |
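A single EventBridge target that receives all three detail types can dispatch on the `detail-type` envelope field. The following is a minimal sketch of such a dispatcher. The handler functions and their return values are placeholders invented for illustration; only the three detail-type strings come from the table above.

```python
from typing import Any, Callable


def handle_cluster_event(detail: dict[str, Any]) -> str:
    # Placeholder: inspect detail["EventDetails"] here.
    return "cluster-event"


def handle_state_change(detail: dict[str, Any]) -> str:
    # Placeholder: react to cluster status transitions here.
    return "state-change"


def handle_node_health(detail: dict[str, Any]) -> str:
    # Placeholder: react to node health reports here.
    return "node-health"


HANDLERS: dict[str, Callable[[dict[str, Any]], str]] = {
    "SageMaker HyperPod Cluster Event": handle_cluster_event,
    "SageMaker HyperPod Cluster State Change": handle_state_change,
    "SageMaker HyperPod Cluster Node Health Event": handle_node_health,
}


def route(event: dict[str, Any]) -> str:
    """Dispatch an EventBridge event to the handler for its detail-type."""
    handler = HANDLERS.get(event.get("detail-type", ""))
    if handler is None:
        return "ignored"
    return handler(event.get("detail", {}))


print(route({"detail-type": "SageMaker HyperPod Cluster Node Health Event", "detail": {}}))
```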
Event pattern examples
All HyperPod cluster events:
{ "source": ["aws.sagemaker"], "detail-type": ["SageMaker HyperPod Cluster Event"] }
Error events only (for alerting):
{ "source": ["aws.sagemaker"], "detail-type": ["SageMaker HyperPod Cluster Event"], "detail": { "EventDetails": { "EventLevel": ["Error"] } } }
Events for a specific cluster:
{ "source": ["aws.sagemaker"], "detail-type": ["SageMaker HyperPod Cluster Event"], "resources": ["arn:aws:sagemaker:us-west-2:111122223333:cluster/my-cluster-id"] }
Node health events:
{ "source": ["aws.sagemaker"], "detail-type": ["SageMaker HyperPod Cluster Node Health Event"] }
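To check a pattern locally before creating a rule, the exact-value matching used in the examples above can be approximated in a few lines of Python. This is a deliberately simplified sketch and not a reimplementation of EventBridge: it covers only nested objects and lists of exact values, and ignores operators such as prefix or numeric matching. The function name `matches` is hypothetical.

```python
from typing import Any


def matches(pattern: dict[str, Any], event: dict[str, Any]) -> bool:
    """Return True if the event satisfies every field in the pattern."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested pattern: recurse into the matching nested object.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        else:
            # Leaf: a list of accepted values. An array in the event matches
            # if any of its elements is accepted.
            candidates = actual if isinstance(actual, list) else [actual]
            if not any(value in expected for value in candidates):
                return False
    return True


error_pattern = {
    "source": ["aws.sagemaker"],
    "detail-type": ["SageMaker HyperPod Cluster Event"],
    "detail": {"EventDetails": {"EventLevel": ["Error"]}},
}

event = {
    "source": "aws.sagemaker",
    "detail-type": "SageMaker HyperPod Cluster Event",
    "resources": ["arn:aws:sagemaker:us-west-2:111122223333:cluster/sample-cluster"],
    "detail": {"EventDetails": {"EventLevel": "Error", "ClusterName": "sample-cluster"}},
}

print(matches(error_pattern, event))
```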
API reference
- ListClusterEvents: List events with filtering, sorting, and pagination
- DescribeClusterEvent: Get full details for a specific event
- ClusterEventSummary: Event summary data type
- ClusterEventDetail: Event detail data type
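The pagination that ListClusterEvents supports can be wrapped in a generator. The sketch below takes the client as a parameter so it can be exercised without AWS credentials. The method name `list_cluster_events`, the `ClusterName` and `NextToken` parameters, and the `Events` response key are assumptions based on the usual boto3 naming conventions; confirm them against the API reference before relying on this.

```python
from typing import Any, Iterator


def iter_cluster_events(client: Any, cluster_name: str) -> Iterator[dict[str, Any]]:
    """Yield event summaries for a cluster, following NextToken pagination."""
    token = None
    while True:
        # Assumed request shape: ClusterName plus an optional NextToken.
        kwargs = {"ClusterName": cluster_name}
        if token:
            kwargs["NextToken"] = token
        page = client.list_cluster_events(**kwargs)
        # Assumed response shape: an Events list and an optional NextToken.
        yield from page.get("Events", [])
        token = page.get("NextToken")
        if not token:
            break
```

With boto3 installed and credentials configured, the client would typically be created with `boto3.client("sagemaker")` and passed in as the first argument.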
See also
- SageMaker HyperPod Slurm cluster events: Slurm cluster events with CLI usage and common scenarios
- SageMaker HyperPod EKS cluster events: EKS cluster events with CLI usage and common scenarios