Setting up the SageMaker HyperPod observability add-on - Amazon SageMaker AI

To send metrics from your Amazon SageMaker HyperPod (SageMaker HyperPod) cluster to an Amazon Managed Service for Prometheus workspace, and to optionally view them in Amazon Managed Grafana, first meet the following prerequisites. They include attaching managed policies and permissions to your console role.

  • To use Amazon Managed Grafana, enable AWS IAM Identity Center (IAM Identity Center) in an AWS Region where Amazon Managed Grafana is available. For instructions, see Getting started with IAM Identity Center in the AWS IAM Identity Center User Guide. For a list of AWS Regions where Amazon Managed Grafana is available, see Supported Regions in the Amazon Managed Grafana User Guide.

  • Create at least one user in IAM Identity Center.

  • Ensure that the Amazon EKS Pod Identity Agent add-on is installed in your Amazon EKS cluster. The Amazon EKS Pod Identity Agent add-on makes it possible for the SageMaker HyperPod observability add-on to get the credentials to interact with Amazon Managed Service for Prometheus and CloudWatch Logs. To check whether your Amazon EKS cluster has the add-on, go to the Amazon EKS console, and check your cluster's Add-ons tab. For information about how to install the add-on if it's not installed, see Create add-on (AWS Management Console) in the Amazon EKS User Guide.

    Note

    The Amazon EKS Pod Identity Agent is required for standard instance groups. For Restricted Instance Groups (RIG), the Pod Identity Agent is not available due to network isolation constraints. Instead, the cluster's instance group execution IAM role is used to interact with Amazon Managed Service for Prometheus. For information about how to configure that role, see Additional prerequisites for Restricted Instance Groups.

  • Ensure that you have at least one node in your SageMaker HyperPod cluster before installing the SageMaker HyperPod observability add-on. The smallest Amazon EC2 instance size that works in this case is 4xlarge. This minimum size ensures that the node can accommodate all the pods that the SageMaker HyperPod observability add-on creates, alongside any other pods already running on the cluster.

  • Add the following policies and permissions to your console role.

    • AWS managed policy: AmazonSageMakerHyperPodObservabilityAdminAccess

    • AWS managed policy: AWSGrafanaWorkspacePermissionManagementV2

    • AWS managed policy: AmazonSageMakerFullAccess

    • Additional permissions to set up required IAM roles for Amazon Managed Grafana and Amazon Elastic Kubernetes Service add-on access:

      JSON
      {
        "Version": "2012-10-17",
        "Statement": [
          {
            "Sid": "CreateRoleAccess",
            "Effect": "Allow",
            "Action": [
              "iam:CreateRole",
              "iam:CreatePolicy",
              "iam:AttachRolePolicy",
              "iam:ListRoles"
            ],
            "Resource": [
              "arn:aws:iam::*:role/service-role/AmazonSageMakerHyperPodObservabilityGrafanaAccess*",
              "arn:aws:iam::*:role/service-role/AmazonSageMakerHyperPodObservabilityAddonAccess*",
              "arn:aws:iam::*:policy/service-role/HyperPodObservabilityAddonPolicy*",
              "arn:aws:iam::*:policy/service-role/HyperPodObservabilityGrafanaPolicy*"
            ]
          }
        ]
      }

    • Additional permissions needed to manage IAM Identity Center users for Amazon Managed Grafana:

      JSON
      {
        "Version": "2012-10-17",
        "Statement": [
          {
            "Sid": "SSOAccess",
            "Effect": "Allow",
            "Action": [
              "sso:ListProfileAssociations",
              "sso-directory:SearchUsers",
              "sso-directory:SearchGroups",
              "sso:AssociateProfile",
              "sso:DisassociateProfile"
            ],
            "Resource": [
              "*"
            ]
          }
        ]
      }
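
The Amazon EKS Pod Identity Agent prerequisite in the preceding list can also be checked programmatically instead of in the console. The following Python code is a minimal sketch, not an official procedure. It assumes that the add-on is listed under the name `eks-pod-identity-agent` and that you pass in an object that behaves like a boto3 Amazon EKS client. The helper name `has_pod_identity_agent` is hypothetical.

```python
from typing import Any

# Assumption: the Amazon EKS Pod Identity Agent add-on is listed under this name.
POD_IDENTITY_ADDON = "eks-pod-identity-agent"


def has_pod_identity_agent(eks_client: Any, cluster_name: str) -> bool:
    """Return True if the Pod Identity Agent add-on is installed on the cluster.

    eks_client is expected to behave like boto3.client("eks"): its list_addons
    call returns a dict with an "addons" list and an optional "nextToken".
    """
    kwargs = {"clusterName": cluster_name}
    while True:
        page = eks_client.list_addons(**kwargs)
        if POD_IDENTITY_ADDON in page.get("addons", []):
            return True
        token = page.get("nextToken")
        if not token:
            # No more pages and the add-on was not found.
            return False
        kwargs["nextToken"] = token
```

With boto3 installed and credentials configured, a call might look like `has_pod_identity_agent(boto3.client("eks"), "my-eks-cluster")`, where the cluster name is a placeholder. Passing the client in as a parameter keeps the helper easy to exercise with a stub.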

Additional prerequisites for Restricted Instance Groups

If your cluster contains Restricted Instance Groups, the instance group execution role must have permissions to write metrics to Amazon Managed Service for Prometheus. When you use Quick setup to create your cluster with observability enabled, these permissions are added to the execution role automatically.

If you are using Custom setup or adding observability to an existing RIG cluster, ensure that the execution role for each Restricted Instance Group has the following permissions:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "PrometheusAccess",
      "Effect": "Allow",
      "Action": "aps:RemoteWrite",
      "Resource": "arn:aws:aps:us-east-1:account_id:workspace/workspace-ID"
    }
  ]
}

Replace us-east-1, account_id, and workspace-ID with your AWS Region, account ID, and Amazon Managed Service for Prometheus workspace ID.
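
If you manage several Restricted Instance Groups or workspaces, you might prefer to generate this policy document instead of editing it by hand. The following Python code is a minimal sketch under stated assumptions: the function name `build_remote_write_policy` and the `partition` parameter are hypothetical and not part of any AWS SDK, and the values in the usage section are placeholders.

```python
import json


def build_remote_write_policy(region: str, account_id: str, workspace_id: str,
                              partition: str = "aws") -> dict:
    """Build the aps:RemoteWrite policy document for one Prometheus workspace."""
    # Assumption: the ARN follows the layout shown in the policy above; the
    # partition defaults to "aws" to match that example.
    resource = f"arn:{partition}:aps:{region}:{account_id}:workspace/{workspace_id}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "PrometheusAccess",
                "Effect": "Allow",
                "Action": "aps:RemoteWrite",
                "Resource": resource,
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder values for illustration only.
    policy = build_remote_write_policy("us-east-1", "111122223333", "ws-EXAMPLE")
    print(json.dumps(policy, indent=2))
```

The resulting JSON can then be added to each Restricted Instance Group's execution role, for example as an inline policy in the IAM console.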

After you meet the preceding prerequisites, you can install the observability add-on by using either the quick installation or the custom installation procedure.

To quickly install the observability add-on
  1. Open the Amazon SageMaker AI console at https://eusc-de-east-1.console.amazonaws-eusc.eu/sagemaker/.

  2. Go to your cluster's details page.

  3. On the Dashboard tab, locate the add-on named HyperPod Monitoring & Observability, and choose Quick install.

To perform a custom installation of the observability add-on
  1. Go to your cluster's details page.

  2. On the Dashboard tab, locate the add-on named HyperPod Monitoring & Observability, and choose Custom install.

  3. Specify the metrics categories that you want to see. For more information about these metrics categories, see SageMaker HyperPod cluster metrics.

  4. Specify whether you want to enable Amazon CloudWatch Logs.

  5. Specify whether you want the service to create a new Amazon Managed Service for Prometheus workspace.

  6. To view the metrics in Amazon Managed Grafana dashboards, select the check box labeled Use an Amazon Managed Grafana workspace. You can specify your own workspace or let the service create a new one for you.

    Note

    Amazon Managed Grafana isn't available in all AWS Regions in which Amazon Managed Service for Prometheus is available. However, you can set up a Grafana workspace in any AWS Region and configure it to get metrics data from a Prometheus workspace that resides in a different AWS Region. For information, see Use AWS data source configuration to add Amazon Managed Service for Prometheus as a data source and Connect to Amazon Managed Service for Prometheus and open-source Prometheus data sources.