SageMaker HyperPod cluster metrics - Amazon SageMaker AI

SageMaker HyperPod cluster metrics

Amazon SageMaker HyperPod (SageMaker HyperPod) publishes metrics across nine distinct categories to your Amazon Managed Service for Prometheus workspace. Not all metrics are enabled by default or displayed in your Amazon Managed Grafana workspace. The following table shows which metrics are enabled by default when you install the observability add-on, which categories offer additional metrics for more granular cluster information, and where each category appears in the Amazon Managed Grafana workspace.

| Metric category | Enabled by default? | Additional advanced metrics available? | Grafana dashboards |
| --- | --- | --- | --- |
| Training metrics | Yes | Yes | Training |
| Inference metrics | Yes | No | Inference |
| Task governance metrics | No | Yes | None. Query your Amazon Managed Service for Prometheus workspace to build your own dashboard. |
| Scaling metrics | No | Yes | None. Query your Amazon Managed Service for Prometheus workspace to build your own dashboard. |
| Cluster metrics | Yes | Yes | Cluster |
| Instance metrics | Yes | Yes | Cluster |
| Accelerated compute metrics | Yes | Yes | Task, Cluster |
| Network metrics | No | Yes | Cluster |
| File system metrics | Yes | No | File system |

The following tables describe the metrics available for monitoring your SageMaker HyperPod cluster, organized by category.

Metrics availability on Restricted Instance Groups

When your cluster contains Restricted Instance Groups (RIGs), most metric categories are available on restricted nodes, with the following exceptions and considerations:

| Metric category | Available on RIG nodes? | Notes |
| --- | --- | --- |
| Training metrics | Yes | Kubeflow and Kubernetes pod metrics are collected. Advanced training KPI metrics (from the Training Metrics Agent) are not available from RIG nodes. |
| Inference metrics | No | Inference workloads are not supported on Restricted Instance Groups. |
| Task governance metrics | No | Kueue metrics are collected from standard nodes only, if any. |
| Scaling metrics | No | KEDA metrics are collected from standard nodes only, if any. |
| Cluster metrics | Yes | Kube State Metrics and API server metrics are available. Kube State Metrics is preferentially scheduled on standard nodes but can run on restricted nodes in RIG-only clusters. |
| Instance metrics | Yes | Node Exporter and cAdvisor metrics are collected on all nodes, including restricted nodes. |
| Accelerated compute metrics | Yes | DCGM Exporter runs on GPU-enabled restricted nodes. Neuron Monitor runs on Neuron-enabled restricted nodes when advanced mode is enabled. |
| Network metrics | Yes | EFA Exporter runs on EFA-enabled restricted nodes when advanced mode is enabled. |
| File system metrics | No | Not available for FSx volumes attached to Restricted Instance Groups. |
Note

Container log collection with Fluent Bit is not deployed on restricted nodes. Cluster logs from restricted nodes are available through the SageMaker HyperPod platform independently of the observability add-on. You can view these logs in the Cluster Logs dashboard.

Training metrics

Use these metrics to track the performance of training tasks executed on the SageMaker HyperPod cluster.

| Metric name or type | Description | Enabled by default? | Metric source |
| --- | --- | --- | --- |
| Kubeflow metrics | See https://github.com/kubeflow/trainer. | Yes | Kubeflow |
| Kubernetes pod metrics | See https://github.com/kubernetes/kube-state-metrics. | Yes | Kubernetes |
| training_uptime_percentage | Percentage of training time out of the total window size | No | SageMaker HyperPod training operator |
| training_manual_recovery_count | Total number of manual restarts performed on the job | No | SageMaker HyperPod training operator |
| training_manual_downtime_ms | Total time in milliseconds the job was down due to manual interventions | No | SageMaker HyperPod training operator |
| training_auto_recovery_count | Total number of automatic recoveries | No | SageMaker HyperPod training operator |
| training_auto_recovery_downtime | Total infrastructure overhead time in milliseconds during fault recovery | No | SageMaker HyperPod training operator |
| training_fault_count | Total number of faults encountered during training | No | SageMaker HyperPod training operator |
| training_fault_type_count | Distribution of faults by type | No | SageMaker HyperPod training operator |
| training_fault_recovery_time_ms | Recovery time in milliseconds for each type of fault | No | SageMaker HyperPod training operator |
| training_time_ms | Total time in milliseconds spent in actual training | No | SageMaker HyperPod training operator |
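The raw operator counters above can be combined into fleet-level KPIs such as uptime percentage and mean time to recover. The following sketch shows one way to derive them; the metric names match the table, but the sample values and the `training_kpis` helper are hypothetical.

```python
# Sketch: deriving training KPIs from SageMaker HyperPod training-operator
# counters. Metric names come from the table above; values are made up.

def training_kpis(samples: dict) -> dict:
    """Compute derived KPIs (uptime %, mean time to recover) from raw counters."""
    # Total observation window = productive training time plus all downtime.
    window_ms = (
        samples["training_time_ms"]
        + samples["training_manual_downtime_ms"]
        + samples["training_auto_recovery_downtime"]
    )
    uptime_pct = 100.0 * samples["training_time_ms"] / window_ms if window_ms else 0.0
    faults = samples["training_fault_count"]
    # Mean time to recover: total automatic-recovery overhead per fault.
    mttr_ms = samples["training_auto_recovery_downtime"] / faults if faults else 0.0
    return {"uptime_pct": round(uptime_pct, 2), "mttr_ms": mttr_ms}

samples = {
    "training_time_ms": 9_500_000,
    "training_manual_downtime_ms": 200_000,
    "training_auto_recovery_downtime": 300_000,
    "training_fault_count": 3,
}
print(training_kpis(samples))  # {'uptime_pct': 95.0, 'mttr_ms': 100000.0}
```

In a real dashboard you would compute the same ratios in PromQL over a time window rather than on point-in-time samples.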

Inference metrics

Use these metrics to track the performance of inference tasks on the SageMaker HyperPod cluster.

| Metric name or type | Description | Enabled by default? | Metric source |
| --- | --- | --- | --- |
| model_invocations_total | Total number of invocation requests to the model | Yes | SageMaker HyperPod inference operator |
| model_errors_total | Total number of errors during model invocation | Yes | SageMaker HyperPod inference operator |
| model_concurrent_requests | Active concurrent model requests | Yes | SageMaker HyperPod inference operator |
| model_latency_milliseconds | Model invocation latency in milliseconds | Yes | SageMaker HyperPod inference operator |
| model_ttfb_milliseconds | Model time to first byte latency in milliseconds | Yes | SageMaker HyperPod inference operator |
| TGI | Use these metrics to monitor the performance of TGI, auto-scale deployments, and to help identify bottlenecks. For a detailed list of metrics, see https://github.com/deepjavalibrary/djl-serving/blob/master/prometheus/README.md. | Yes | Model container |
| LMI | Use these metrics to monitor the performance of LMI and to help identify bottlenecks. For a detailed list of metrics, see https://github.com/deepjavalibrary/djl-serving/blob/master/prometheus/README.md. | Yes | Model container |
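Because `model_invocations_total` and `model_errors_total` are cumulative counters, an error rate is computed from the difference between two scrapes (dashboards typically use PromQL `rate()` for this). The following sketch diffs two hypothetical scrapes by hand; the sample values are illustrative.

```python
# Sketch: computing an invocation error rate from the cumulative counters
# listed above. Sample scrape values are hypothetical.

def error_rate(prev: dict, curr: dict) -> float:
    """Fraction of invocations that errored between two scrapes."""
    invocations = curr["model_invocations_total"] - prev["model_invocations_total"]
    errors = curr["model_errors_total"] - prev["model_errors_total"]
    return errors / invocations if invocations else 0.0

prev = {"model_invocations_total": 1000, "model_errors_total": 10}
curr = {"model_invocations_total": 1500, "model_errors_total": 20}
print(error_rate(prev, curr))  # 0.02
```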

Task governance metrics

Use these metrics to monitor task governance and resource allocation on the SageMaker HyperPod cluster.

| Metric name or type | Description | Enabled by default? | Metric source |
| --- | --- | --- | --- |
| Kueue | See https://kueue.sigs.k8s.io/docs/reference/metrics/. | No | Kueue |
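Because task governance metrics have no prebuilt Grafana dashboard, you query the Amazon Managed Service for Prometheus workspace directly. The sketch below only builds the HTTP query URL for the workspace's Prometheus-compatible query API; the Region and workspace ID are placeholders, and a real request must be SigV4-signed (for example with awscurl or botocore).

```python
# Sketch: building a query URL against an Amazon Managed Service for
# Prometheus workspace. Workspace ID and Region below are placeholders.
from urllib.parse import urlencode


def amp_query_url(region: str, workspace_id: str, promql: str) -> str:
    """Return the instant-query URL for a PromQL expression."""
    base = (
        f"https://aps-workspaces.{region}.amazonaws.com"
        f"/workspaces/{workspace_id}/api/v1/query"
    )
    return f"{base}?{urlencode({'query': promql})}"


# Pending workloads per Kueue ClusterQueue (a standard Kueue metric).
url = amp_query_url("us-west-2", "ws-EXAMPLE", "kueue_pending_workloads")
print(url)
```

The same helper works for any metric in this document once it lands in the workspace; only the PromQL expression changes.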

Scaling metrics

Use these metrics to monitor auto-scaling behavior and performance on the SageMaker HyperPod cluster.

Metric name or type Description Enabled by default? Metric source
KEDA Operator Metrics See https://keda.sh/docs/2.17/integrations/prometheus/#operator. No Kubernetes Event-driven Autoscaler (KEDA)
KEDA Webhook Metrics See https://keda.sh/docs/2.17/integrations/prometheus/#admission-webhooks. No Kubernetes Event-driven Autoscaler (KEDA)
KEDA Metrics server Metrics See https://keda.sh/docs/2.17/integrations/prometheus/#metrics-server. No Kubernetes Event-driven Autoscaler (KEDA)

Cluster metrics

Use these metrics to monitor overall cluster health and resource allocation.

| Metric name or type | Description | Enabled by default? | Metric source |
| --- | --- | --- | --- |
| Cluster health | Kubernetes API server metrics. See https://kubernetes.io/docs/reference/instrumentation/metrics/. | Yes | Kubernetes |
| Kube State Metrics | See https://github.com/kubernetes/kube-state-metrics/tree/main/docs#default-resources. | Limited | Kubernetes |
| Kube State Metrics (advanced) | See https://github.com/kubernetes/kube-state-metrics/tree/main/docs#optional-resources. | No | Kubernetes |

Instance metrics

Use these metrics to monitor individual instance performance and health.

| Metric name or type | Description | Enabled by default? | Metric source |
| --- | --- | --- | --- |
| Node metrics | See https://github.com/prometheus/node_exporter?tab=readme-ov-file#enabled-by-default. | Yes | Kubernetes |
| Container metrics | Container metrics exposed by cAdvisor. See https://github.com/google/cadvisor. | Yes | Kubernetes |
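Node Exporter and cAdvisor publish these metrics in the Prometheus text exposition format, which the collector scrapes over HTTP. The minimal parser below handles simple gauge and counter lines to show what a scraped payload looks like; the sample payload is illustrative, and a production scrape would use a real Prometheus client library instead.

```python
# Sketch: parsing simple lines of the Prometheus text exposition format,
# as emitted by Node Exporter / cAdvisor. Sample payload is illustrative.

def parse_exposition(text: str) -> dict:
    """Map 'metric{labels}' strings to float values, skipping comments."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip HELP/TYPE comment lines and blanks
        # The value is the token after the last space on the line.
        name_and_labels, _, value = line.rpartition(" ")
        metrics[name_and_labels] = float(value)
    return metrics

payload = """\
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
node_filesystem_avail_bytes{mountpoint="/"} 1.2e+10
"""
print(parse_exposition(payload))
```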

Accelerated compute metrics

Use these metrics to monitor the performance, health, and utilization of individual accelerated compute devices in your cluster.

Note

When GPU partitioning with MIG (Multi-Instance GPU) is enabled on your cluster, DCGM metrics automatically provide partition-level granularity for monitoring individual MIG instances. Each MIG partition is exposed as a separate GPU device with its own metrics for temperature, power, memory utilization, and compute activity. This allows you to track resource usage and health for each GPU partition independently, enabling precise monitoring of workloads running on fractional GPU resources. For more information about configuring GPU partitioning, see Using GPU partitions in Amazon SageMaker HyperPod.

| Metric name or type | Description | Enabled by default? | Metric source |
| --- | --- | --- | --- |
| NVIDIA GPU | DCGM metrics. See https://github.com/NVIDIA/dcgm-exporter/blob/main/etc/dcp-metrics-included.csv. | Limited | NVIDIA Data Center GPU Manager (DCGM) |
| NVIDIA GPU (advanced) | DCGM metrics that are commented out in https://github.com/NVIDIA/dcgm-exporter/blob/main/etc/dcp-metrics-included.csv. | No | NVIDIA Data Center GPU Manager (DCGM) |
| AWS Trainium | Neuron metrics. See https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-sys-tools/neuron-monitor-user-guide.html#neuron-monitor-nc-counters. | No | AWS Neuron Monitor |
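With MIG enabled, DCGM exposes each partition as its own device, so a per-GPU view requires aggregating across partitions. The sketch below sums a memory-usage value over the MIG instances of each physical GPU; the `gpu` and `GPU_I_ID` label names follow dcgm-exporter conventions, but the samples are made up.

```python
# Sketch: rolling hypothetical per-MIG-partition DCGM samples back up to
# the physical GPU. Label names are assumptions; samples are illustrative.
from collections import defaultdict


def memory_used_per_gpu(samples: list) -> dict:
    """Sum a per-partition value across MIG partitions of each physical GPU."""
    totals = defaultdict(float)
    for s in samples:
        totals[s["labels"]["gpu"]] += s["value"]
    return dict(totals)


samples = [
    {"labels": {"gpu": "0", "GPU_I_ID": "1"}, "value": 4096.0},  # MiB
    {"labels": {"gpu": "0", "GPU_I_ID": "2"}, "value": 2048.0},
    {"labels": {"gpu": "1", "GPU_I_ID": "1"}, "value": 1024.0},
]
print(memory_used_per_gpu(samples))  # {'0': 6144.0, '1': 1024.0}
```

In PromQL the equivalent would be a `sum by (gpu)` over the partition-level series.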

Network metrics

Use these metrics to monitor the performance and health of the Elastic Fabric Adapters (EFA) in your cluster.

| Metric name or type | Description | Enabled by default? | Metric source |
| --- | --- | --- | --- |
| EFA | See https://github.com/aws-samples/awsome-distributed-training/blob/main/4.validation_and_observability/3.efa-node-exporter/README.md. | No | Elastic Fabric Adapter |

File system metrics

Use these metrics to monitor the Amazon FSx for Lustre file systems attached to your cluster.

| Metric name or type | Description | Enabled by default? | Metric source |
| --- | --- | --- | --- |
| File system | Amazon FSx for Lustre metrics from Amazon CloudWatch. See Monitoring with Amazon CloudWatch. | Yes | Amazon FSx for Lustre |