
SageMaker HyperPod cluster events reference

This page provides a complete reference of all structured events emitted by Amazon SageMaker HyperPod clusters. Events provide visibility into operations at the cluster, instance group, and instance level, including provisioning, scaling, patching, and orchestrator-specific lifecycle changes.

Cluster events are available for HyperPod clusters with NodeProvisioningMode set to Continuous. Events are accessible through the ListClusterEvents and DescribeClusterEvent APIs, the SageMaker AI console Events tab, and Amazon EventBridge.
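The following is a minimal sketch of paging through ListClusterEvents from Python. It assumes a boto3-style SageMaker client whose `list_cluster_events` method accepts `ClusterName`, an optional `ResourceType` filter, and `NextToken`, and returns the summaries under an `Events` key. These names are assumptions based on the API name given above, so verify them against the API reference before relying on them. The client is passed in as an argument, which also lets the function run against a stub.

```python
from typing import Any, Dict, Iterator, Optional


def iter_cluster_events(
    client: Any,
    cluster_name: str,
    resource_type: Optional[str] = None,
) -> Iterator[Dict[str, Any]]:
    """Yield event summaries for one cluster, following NextToken pagination."""
    kwargs: Dict[str, Any] = {"ClusterName": cluster_name}
    if resource_type is not None:
        # Assumed filter name; expected values are Cluster, InstanceGroup, Instance.
        kwargs["ResourceType"] = resource_type
    while True:
        page = client.list_cluster_events(**kwargs)
        # "Events" is the assumed response key for the list of summaries.
        yield from page.get("Events", [])
        token = page.get("NextToken")
        if not token:
            break
        kwargs["NextToken"] = token


# Hypothetical usage (needs AWS credentials and a cluster in Continuous mode):
#   import boto3
#   for summary in iter_cluster_events(boto3.client("sagemaker"), "sample-cluster"):
#       print(summary.get("EventTime"), summary.get("Description"))
```

Writing the loop as a generator keeps memory use flat for clusters with long event histories, because each page is consumed before the next one is requested.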

Cluster event record

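Each cluster event is represented as a structured record containing identification, timing, scope, severity, and operation-specific metadata. The following example shows a complete event record as delivered through Amazon EventBridge. The record itself is the detail.EventDetails object, which carries the same fields exposed by the DescribeClusterEvent API; the outer fields are the EventBridge envelope: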


{
  "version": "0",
  "id": "0bd4a141-0a02-9d8a-f977-3924c3fb259c",
  "detail-type": "SageMaker HyperPod Cluster Event",
  "source": "aws.sagemaker",
  "account": "111122223333",
  "time": "2025-06-01T16:20:25Z",
  "region": "us-west-2",
  "resources": [
    "arn:aws:sagemaker:us-west-2:111122223333:cluster/sample-cluster"
  ],
  "detail": {
    "EventDetails": {
      "EventId": "83ea0bb5-be77-45e8-a458-0a87f778a205",
      "ClusterArn": "arn:aws:sagemaker:us-west-2:111122223333:cluster/sample-cluster",
      "ClusterName": "sample-cluster",
      "InstanceGroupName": "p5Inst",
      "InstanceId": "i-0391f86fa0fe0d465",
      "ResourceType": "Instance",
      "EventTime": 1748794825350,
      "EventLevel": "Error",
      "Description": "Instance creation in Cluster sample-cluster and InstanceGroup p5Inst failed",
      "EventDetails": {
        "EventMetadata": {
          "Instance": {
            "FailureMessage": "We currently do not have sufficient capacity to launch new ml.p5.48xlarge instances. Please try again.",
            "NodeLogicalId": "df268d19-f035-4f28-9b80-b956b92ae21e"
          }
        }
      }
    }
  }
}

Event record fields

The detail.EventDetails object contains the following fields:

Field	Type	Required	Description
`EventId`	String (UUID)	Yes	Unique identifier for the event.
`ClusterArn`	String	Yes	ARN of the HyperPod cluster.
`ClusterName`	String	Yes	Name of the HyperPod cluster.
`EventTime`	Timestamp	Yes	When the event occurred (epoch milliseconds).
`ResourceType`	String	Yes	Scope of the event: `Cluster`, `InstanceGroup`, or `Instance`.
`EventLevel`	String	Yes	Severity classification: `Info`, `Warn`, or `Error`.
`Description`	String	No	Human-readable summary of the event.
`InstanceGroupName`	String	No	Instance group name (present when `ResourceType` is `InstanceGroup` or `Instance`).
`InstanceId`	String	No	EC2 instance ID (present when `ResourceType` is `Instance`).
`EventDetails`	Object	No	Additional metadata specific to the resource type and operation.
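As an illustration of the field table above, the sketch below turns a detail.EventDetails object into a typed Python value. The `ClusterEvent` class and `parse_cluster_event` function are hypothetical names introduced here, not part of any AWS SDK. Required fields are read with indexing so a missing one raises `KeyError`, and optional fields default to `None`.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, Optional

_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


@dataclass(frozen=True)
class ClusterEvent:
    event_id: str
    cluster_arn: str
    cluster_name: str
    event_time: datetime
    resource_type: str
    event_level: str
    description: Optional[str] = None
    instance_group_name: Optional[str] = None
    instance_id: Optional[str] = None


def parse_cluster_event(record: Dict[str, Any]) -> ClusterEvent:
    """Build a ClusterEvent from a detail.EventDetails object."""
    return ClusterEvent(
        event_id=record["EventId"],
        cluster_arn=record["ClusterArn"],
        cluster_name=record["ClusterName"],
        # EventTime is epoch milliseconds; integer arithmetic avoids float rounding.
        event_time=_EPOCH + timedelta(milliseconds=record["EventTime"]),
        resource_type=record["ResourceType"],
        event_level=record["EventLevel"],
        description=record.get("Description"),
        instance_group_name=record.get("InstanceGroupName"),
        instance_id=record.get("InstanceId"),
    )


# Fields copied from the example record on this page.
SAMPLE = {
    "EventId": "83ea0bb5-be77-45e8-a458-0a87f778a205",
    "ClusterArn": "arn:aws:sagemaker:us-west-2:111122223333:cluster/sample-cluster",
    "ClusterName": "sample-cluster",
    "InstanceGroupName": "p5Inst",
    "InstanceId": "i-0391f86fa0fe0d465",
    "ResourceType": "Instance",
    "EventTime": 1748794825350,
    "EventLevel": "Error",
}

if __name__ == "__main__":
    print(parse_cluster_event(SAMPLE).event_time.isoformat())
    # prints 2025-06-01T16:20:25.350000+00:00
```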

Event levels

Level	Meaning
`Info`	Operation completed successfully or is progressing normally.
`Warn`	Operation completed with a non-critical issue or a condition that may require future attention.
`Error`	Operation failed or requires immediate attention.
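One way to use these levels is to map them onto the severities of an existing logging setup. The sketch below does this with Python's standard `logging` module. The function name and the fallback to `WARNING` for unrecognized values are choices made for this example, not documented HyperPod behavior.

```python
import logging

# EventLevel values from the table above mapped to stdlib logging levels.
_LEVELS = {
    "Info": logging.INFO,
    "Warn": logging.WARNING,
    "Error": logging.ERROR,
}


def to_logging_level(event_level: str) -> int:
    """Return the logging level for an EventLevel string.

    Unknown values fall back to WARNING so they stay visible instead of
    being dropped; this fallback is an assumption of the sketch.
    """
    return _LEVELS.get(event_level, logging.WARNING)


def log_event(logger: logging.Logger, record: dict) -> None:
    """Log the record's Description at the severity implied by EventLevel."""
    level = to_logging_level(record.get("EventLevel", ""))
    logger.log(level, "%s", record.get("Description", "(no description)"))
```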

Resource types

ResourceType	Scope	Example events
`Cluster`	Whole-cluster operations	Cluster creation/update started, cluster operation failed
`InstanceGroup`	Instance group operations	Scaling started/completed, patching scheduled, FSx lifecycle
`Instance`	Individual instance operations	EC2 provisioning, lifecycle script execution, ENI management, termination

EventDetails metadata

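Cluster events include an EventMetadata object within the nested EventDetails field (detail.EventDetails.EventDetails.EventMetadata in the EventBridge example above) that provides operation-specific context beyond what the event description conveys. The contents of EventMetadata vary by resource type and event type; in the example above, an Instance entry carries FailureMessage and NodeLogicalId. For the complete schema and supported fields, see EventMetadata in the Amazon SageMaker AI API Reference.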
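Because the contents of EventMetadata vary, consumers may want to read it defensively. The following sketch walks the nested path used in the example record on this page and returns the instance failure message when one is present. The helper name `failure_message` is hypothetical, and the path is taken from that single example rather than from the full schema.

```python
from typing import Any, Dict, Optional

# Path inside a detail.EventDetails record, as seen in the example on this page.
_FAILURE_PATH = ("EventDetails", "EventMetadata", "Instance", "FailureMessage")


def failure_message(record: Dict[str, Any]) -> Optional[str]:
    """Return Instance.FailureMessage from EventMetadata, or None if absent."""
    node: Any = record
    for key in _FAILURE_PATH:
        # Stop as soon as any level is missing or is not an object.
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node if isinstance(node, str) else None


if __name__ == "__main__":
    sample = {
        "EventLevel": "Error",
        "EventDetails": {
            "EventMetadata": {
                "Instance": {
                    "FailureMessage": "We currently do not have sufficient capacity "
                    "to launch new ml.p5.48xlarge instances. Please try again.",
                    "NodeLogicalId": "df268d19-f035-4f28-9b80-b956b92ae21e",
                }
            }
        },
    }
    print(failure_message(sample))
```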

EventBridge envelope fields

When delivered through Amazon EventBridge, the event record is wrapped in the standard EventBridge envelope:

Field	Description
`version`	EventBridge schema version (always `"0"`).
`id`	Unique EventBridge event ID.
`detail-type`	`SageMaker HyperPod Cluster Event`
`source`	`aws.sagemaker`
`account`	AWS account ID that owns the cluster.
`time`	ISO 8601 timestamp of the event.
`region`	AWS Region where the cluster resides.
`resources`	Array containing the cluster ARN.
`detail`	Contains the `EventDetails` object described above.

Common events (EKS and Slurm)

The following events are emitted for all HyperPod clusters regardless of orchestrator. The Description column shows the value of the Description field in the event record as it appears in the API response and the console Events tab.

Cluster lifecycle

Event	Description
Cluster operation started	Cluster <cluster-name> <operation> started successfully
Cluster operation start failed	Failed to start Cluster <cluster-name> <operation>
Cluster operation completed	Cluster <cluster-name> <operation> completed successfully
Cluster operation failed	Cluster <cluster-name> <operation> failed

Instance group lifecycle

Event	Description
Instance group operation started	InstanceGroup <instance-group-name> <operation> started successfully in Cluster <cluster-name>
Instance group operation start failed	Failed to start InstanceGroup <instance-group-name> <operation> in Cluster <cluster-name>
Instance group operation completed	Instance Group <instance-group-name> <operation> in Cluster <cluster-name> completed successfully
Instance group operation failed	Instance Group <instance-group-name> <operation> in Cluster <cluster-name> failed

Instance group network configuration

Event	Description
Network configuration found	Found Subnet <subnet-id> in AZ <availability-zone> with SecurityGroupIds <security-group-ids> for IG <instance-group-name> in Cluster <cluster-name>
Network configuration failed	Failed to process Instance Group Network Configuration details for IG <instance-group-name> in Cluster <cluster-name>
Custom AMI override found	Found Custom AMI Override <ami-id> for IG <instance-group-name> in Cluster <cluster-name>
Custom AMI override failed	Failed to process Custom AMI Override details for IG <instance-group-name> in Cluster <cluster-name>
Platform network configuration used	Using HyperPod Platform provided network configuration for IG <instance-group-name> in Cluster <cluster-name>
Network configuration determined	Instance Group network configuration successfully determined for IG <instance-group-name> in Cluster <cluster-name>

Instance creation

Event	Description
Instance operation started	Instance <operation> started successfully in Cluster <cluster-name> and IG <instance-group-name>
Instance operation start failed	Failed to start Instance <operation> in Cluster <cluster-name> and IG <instance-group-name>
Capacity reservation found	Found CapacityReservation ID <reservation-id> for Cluster <cluster-name> and IG <instance-group-name>, using reserved capacity
Capacity reservation not found	No CapacityReservation found for Cluster <cluster-name> and IG <instance-group-name>, using on-demand pool
Instance payload setup failed	Failed to process CapacityReservationDetails for Cluster <cluster-name> and IG <instance-group-name>
Customer ENI created	Successfully created Customer ENI for instance in Cluster <cluster-name> and IG <instance-group-name>
Customer ENI creation failed	Failed to create Customer ENI for instance in Cluster <cluster-name> and IG <instance-group-name>
EC2 instance provisioned	EC2 Instance <instance-id> successfully provisioned in Cluster <cluster-name> and IG <instance-group-name>
EC2 instance creation failed	Failed to provision EC2 Instance in Cluster <cluster-name> and IG <instance-group-name>
Lifecycle script status updated	Instance lifecycle script execution for EC2 Instance <instance-id> has <status>
Lifecycle script status update failed	Failed to update Instance lifecycle script execution status for EC2InstanceId <instance-id>
Instance creation failed with lifecycle logs	Instance lifecycle script execution for EC2 Instance <instance-id> has Failed. To view lifecycle script logs, visit log group...
Unused ENI cleanup succeeded	Successfully deleted unused Customer ENIs for EC2 Instance <instance-id> in Cluster <cluster-name> and IG <instance-group-name>
Unused ENI cleanup failed	Failed to delete unused Customer ENIs for EC2 Instance <instance-id> in Cluster <cluster-name> and IG <instance-group-name>

Instance deletion

Event	Description
EC2 instance termination in progress	Termination of EC2 Instance <instance-id> is currently in progress in Cluster <cluster-name> and IG <instance-group-name>
EC2 instance termination failed	Failed to terminate EC2 Instance <instance-id> in Cluster <cluster-name> and IG <instance-group-name>
Customer ENI deleted	Customer ENI successfully deleted for EC2 Instance <instance-id> in Cluster <cluster-name> and IG <instance-group-name>
Customer ENI deletion failed	Failed to delete Customer ENI for EC2 Instance <instance-id> in Cluster <cluster-name> and IG <instance-group-name>

Instance reboot

Event	Description
EC2 instance reboot in progress	Reboot of EC2 Instance <instance-id> is currently in progress on Cluster <cluster-name> and IG <instance-group-name>
EC2 instance reboot request failed	Failed to submit reboot request for EC2 Instance <instance-id> in Cluster <cluster-name> and IG <instance-group-name>

Instance operation (generic)

Event	Description
Instance operation completed	Instance <operation> <instance-id> in Cluster <cluster-name> and IG <instance-group-name> completed successfully
Instance operation failed	Instance <operation> <instance-id> in Cluster <cluster-name> and IG <instance-group-name> failed

Instance replacement

Event	Description
Instance replacement started	Instance <instance-id> is starting as part of instance replacement in Cluster <cluster-name> and IG <instance-group-name>
Instance replacement start failed	Instance <instance-id> failed to start as part of instance replacement in Cluster <cluster-name> and IG <instance-group-name>
Instance replacement completed	Instance <instance-id> <operation> completed successfully as part of instance replacement in Cluster <cluster-name> and IG <instance-group-name>
Instance replacement failed	Instance <instance-id> <operation> failed as part of instance replacement in Cluster <cluster-name> and IG <instance-group-name>

FSx filesystem lifecycle

Event	Description
FSx creation started	FSx creation started for IG <instance-group-name> in Cluster <cluster-name>
FSx creation failed	Failed to create FSx for IG <instance-group-name> in Cluster <cluster-name>
FSx creation completed	FSx creation successfully completed for IG <instance-group-name> in Cluster <cluster-name>
FSx deletion started	FSx deletion started for IG <instance-group-name> in Cluster <cluster-name>
FSx deletion failed	Failed to delete FSx for IG <instance-group-name> in Cluster <cluster-name>
FSx deletion completed	FSx deletion successfully completed for IG <instance-group-name> in Cluster <cluster-name>
FSx update started	FSx update started for IG <instance-group-name> in Cluster <cluster-name>
FSx update failed	Failed to update FSx for IG <instance-group-name> in Cluster <cluster-name>
FSx update completed	FSx update successfully completed for IG <instance-group-name> in Cluster <cluster-name>

Patching (common steps)

These patching events are emitted for both EKS and Slurm clusters during UpdateClusterSoftware operations.

Event	Description
Instance group patching scheduled	InstanceGroup <instance-group-name> in Cluster <cluster-name> has been scheduled for UpdateClusterSoftware to latest.
Instance group patching schedule failed	Failed to schedule UpdateClusterSoftware for IG <instance-group-name> in Cluster <cluster-name>.
Instance group patching started	UpdateClusterSoftware initiated for IG <instance-group-name> in Cluster <cluster-name> using <strategy> strategy.
Instance group patching start failed	Failed to initiate UpdateClusterSoftware for IG <instance-group-name> in Cluster <cluster-name>.
Next patching batch selected	Next update batch selected for IG <instance-group-name> in Cluster <cluster-name>.
Next patching batch selection failed	Failed to select the next update batch for IG <instance-group-name> in Cluster <cluster-name>.
Failed instances queued for replacement	Failed instances in IG <instance-group-name> in Cluster <cluster-name> queued for node replacement.
Failed instance replacement queueing failed	Failed to queue instances for node replacement in IG <instance-group-name> in Cluster <cluster-name>.
Instance group patching completed	UpdateClusterSoftware completed successfully for IG <instance-group-name> in Cluster <cluster-name>.
Instance group patching completion failed	Failed to complete UpdateClusterSoftware for IG <instance-group-name> in Cluster <cluster-name>.
Root volume replacement started	Root volume replacement started for Instance <instance-id> in IG <instance-group-name>.
Root volume replacement failed	Failed to start root volume replacement for Instance <instance-id> in IG <instance-group-name>.
Instance patching succeeded	Instance <instance-id> in IG <instance-group-name> updated successfully.

EKS-specific events

The following events are emitted only for HyperPod clusters orchestrated with Amazon EKS.

Access entry management

Event	Description
SLR access entry operation succeeded	SLR Access Entry <operation> successful for Cluster <cluster-name>
SLR access entry operation failed	SLR Access Entry <operation> failed for Cluster <cluster-name>
EKS access entries operation succeeded	EKS Access Entries <operation> successful for Cluster <cluster-name>
EKS access entries operation failed	EKS Access Entries <operation> failed for Cluster <cluster-name>

Kubernetes configuration updates

Event	Description
Kubernetes config update succeeded	Successfully updated Kubernetes config for instance <instance-id> in Cluster <cluster-name> and IG <instance-group-name>
Kubernetes config update failed	Failed to update Kubernetes config for instance <instance-id> in Cluster <cluster-name> and IG <instance-group-name>

Karpenter autoscaling

Event	Description
Autoscaling operation succeeded	AutoScaling <operation> <status> successfully in Cluster <cluster-name>
Autoscaling operation failed	Failed to <operation> AutoScaling in Cluster <cluster-name>
Karpenter CRD installation succeeded	CustomResourceDefinition installation completed successfully in EKS Cluster <cluster-name>
Karpenter CRD installation failed	CustomResourceDefinition installation failed for EKS Cluster <cluster-name>
Karpenter SLR access policy update succeeded	<operation> access policies with AmazonSageMakerHyperPodServiceRole access entry in EKS cluster <cluster-name> successfully
Karpenter SLR access policy update failed	Failed to <operation> access policies with AmazonSageMakerHyperPodServiceRole access entry in EKS cluster <cluster-name>

Patching — EKS instance-level

Event	Description
Instance patching preparation succeeded	Instance <instance-id> in IG <instance-group-name> cordoned and pods evicted.
Instance patching skipped (PDB violation)	UpdateClusterSoftware for Instance <instance-id> in IG <instance-group-name> skipped due to PodDisruptionBudget constraint.
Instance patching preparation failed	Failed to prepare instance <instance-id> in IG <instance-group-name> for UpdateClusterSoftware.
Instance restored to schedulable state	Instance <instance-id> in IG <instance-group-name> restored to schedulable state.
Instance restore to schedulable failed	Failed to restore instance <instance-id> in IG <instance-group-name> to schedulable state.

Patching — EKS rollback

Event	Description
Bake time started	Baking period started for IG <instance-group-name> in Cluster <cluster-name>. Monitoring alarms [<alarm-names>] for <duration> seconds.
Bake time completed	Baking period completed for IG <instance-group-name> in Cluster <cluster-name>. No alarms triggered during the <duration>-second baking period.
Bake time alarm triggered	Baking period failed for IG <instance-group-name> in Cluster <cluster-name>. Alarms [<alarm-names>] entered ALARM state. Initiating auto-rollback.
Bake time evaluation failed	Failed to evaluate alarms during baking period for IG <instance-group-name> in Cluster <cluster-name>.
Instance group patching rollback initiated	UpdateClusterSoftware failed for IG <instance-group-name> in Cluster <cluster-name>. Initiating rollback.
Instance group patching rollback failed	Rollback failed for IG <instance-group-name> in Cluster <cluster-name>. Some instances may be in FailedMaintenance state.
Instance patching rollback initiated	Instance <instance-id> in IG <instance-group-name> failed to update. Rollback initiated.
Instance patching rollback succeeded	Instance <instance-id> in IG <instance-group-name> rolled back successfully to previous AMI.
Instance patching rollback failed	UpdateClusterSoftware rollback failed for instance <instance-id> in IG <instance-group-name>.

Slurm-specific events

The following events are emitted only for HyperPod clusters orchestrated with Slurm.

Event	Description
Provisioning parameters found	Found provisioning_parameters.json in LifeCycleScript S3 Path for controller group <instance-group-name>
Provisioning parameters not found	No provisioning_parameters.json found in LifeCycleScript S3 Path for controller group <instance-group-name>
Slurm munge key created	Successfully created and stored munge key
Slurm drift validation passed	Slurm configuration drift validation passed
Slurm drift detected	Slurm configuration drift detected: <drift-details>
Slurm cluster rollback completed	Cluster creation failed: controller and login nodes did not become ready within the expected time
Slurm reconfiguration succeeded	Slurm was reconfigured successfully. Slurm config updated to match desired state

EventBridge integration

HyperPod sends cluster events to Amazon EventBridge using three detail types:

Detail type	Description
`SageMaker HyperPod Cluster Event`	Operational events for provisioning, scaling, patching, and orchestrator-specific operations. Includes `EventLevel` for severity filtering.
`SageMaker HyperPod Cluster State Change`	Cluster-level status transitions (for example, Creating to InService). Includes full cluster configuration.
`SageMaker HyperPod Cluster Node Health Event`	Health monitoring events from the HyperPod Health Monitoring Agent (HMA). Includes health status, reason, repair action, and recommendation.
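A single EventBridge target, such as an AWS Lambda function, can receive all three detail types and route on the `detail-type` field. The sketch below shows one possible dispatcher. The handler names and return strings are hypothetical. Only the Cluster Event handler reads fields from `detail`, because this page does not describe the payloads of the other two detail types.

```python
from typing import Any, Callable, Dict, Optional


def handle_cluster_event(event: Dict[str, Any]) -> str:
    # detail.EventDetails.EventLevel is documented on this page.
    record = event.get("detail", {}).get("EventDetails", {})
    return f"cluster-event:{record.get('EventLevel', 'unknown')}"


def handle_state_change(event: Dict[str, Any]) -> str:
    # Payload fields are not covered here, so nothing is read from detail.
    return "state-change"


def handle_node_health(event: Dict[str, Any]) -> str:
    # Payload fields are not covered here, so nothing is read from detail.
    return "node-health"


HANDLERS: Dict[str, Callable[[Dict[str, Any]], str]] = {
    "SageMaker HyperPod Cluster Event": handle_cluster_event,
    "SageMaker HyperPod Cluster State Change": handle_state_change,
    "SageMaker HyperPod Cluster Node Health Event": handle_node_health,
}


def lambda_handler(event: Dict[str, Any], context: Optional[Any] = None) -> str:
    """Route an EventBridge event to a handler based on its detail-type."""
    if event.get("source") != "aws.sagemaker":
        return "ignored"
    handler = HANDLERS.get(event.get("detail-type", ""))
    return handler(event) if handler else "ignored"
```

Keeping the routing table in a dictionary means a new detail type can be supported by adding one entry rather than another branch.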

Event pattern examples

All HyperPod cluster events:


{
  "source": ["aws.sagemaker"],
  "detail-type": ["SageMaker HyperPod Cluster Event"]
}

Error events only (for alerting):


{
  "source": ["aws.sagemaker"],
  "detail-type": ["SageMaker HyperPod Cluster Event"],
  "detail": {
    "EventDetails": {
      "EventLevel": ["Error"]
    }
  }
}

Events for a specific cluster:


{
  "source": ["aws.sagemaker"],
  "detail-type": ["SageMaker HyperPod Cluster Event"],
  "resources": ["arn:aws:sagemaker:us-west-2:111122223333:cluster/my-cluster-id"]
}

Node health events:


{
  "source": ["aws.sagemaker"],
  "detail-type": ["SageMaker HyperPod Cluster Node Health Event"]
}
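To check a pattern against sample events before creating a rule, a small local matcher can help. The sketch below covers only the subset of EventBridge pattern syntax used in the examples above: arrays of literal values and nested objects. It does not implement prefix, numeric, anything-but, or other operators and is not a substitute for the EventBridge service. The function name `matches` is hypothetical.

```python
from typing import Any, Dict


def matches(pattern: Dict[str, Any], event: Dict[str, Any]) -> bool:
    """Return True if event satisfies an exact-value pattern.

    Every key in the pattern must be present in the event. A pattern array
    matches when the event value (or any element of an event array, such as
    resources) equals one of the listed values.
    """
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested object: recurse into the matching event object.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        else:
            values = actual if isinstance(actual, list) else [actual]
            if not any(value in expected for value in values):
                return False
    return True


# The "Error events only" pattern from this page.
ERRORS_ONLY = {
    "source": ["aws.sagemaker"],
    "detail-type": ["SageMaker HyperPod Cluster Event"],
    "detail": {"EventDetails": {"EventLevel": ["Error"]}},
}

if __name__ == "__main__":
    event = {
        "source": "aws.sagemaker",
        "detail-type": "SageMaker HyperPod Cluster Event",
        "resources": ["arn:aws:sagemaker:us-west-2:111122223333:cluster/sample-cluster"],
        "detail": {"EventDetails": {"EventLevel": "Error"}},
    }
    print(matches(ERRORS_ONLY, event))  # prints True
```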

API reference

ListClusterEvents — List events with filtering, sorting, and pagination
DescribeClusterEvent — Get full details for a specific event
ClusterEventSummary — Event summary data type
ClusterEventDetail — Event detail data type