Using topology-aware scheduling in Amazon SageMaker HyperPod
Data transfer efficiency is a critical factor in high-performance computing (HPC) and machine learning workloads. When using UltraServers with Amazon SageMaker HyperPod, SageMaker HyperPod automatically applies topology labels to your resources. Topology-aware scheduling helps allocate resources to minimize data transfer overheads by considering both instance topology (how resources are connected within an instance) and network topology (how instances are connected with each other). For more information about instance topology, see Amazon EC2 instance topology.
Topology-aware scheduling works with both clusters on Slurm and Amazon EKS. For general information about
how topology works with Slurm, see the Topology
guide in the Slurm documentation
In Amazon SageMaker HyperPod, data transfer overheads typically come from three main sources:
-
GPU-to-GPU data transfer: Modern technologies like NVLink and NVLink switches allow high-throughput data transfer between GPUs without involving other compute resources. This is extremely efficient but usually limited to a single instance.
-
GPU-to-CPU data transfer: Non-uniform memory access (NUMA) systems have multiple system buses on a single motherboard. In a typical EC2 instance architecture like p5.48xlarge, there are two different system buses, each with a CPU and 4 GPUs. For optimal performance, processes that load or read data to/from GPUs should execute on a CPU connected to the same system bus as the GPU.
-
Network communications between instances: Instances transfer data through a chain of network switches. The shortest path typically corresponds to the lowest latency.
UltraServer architecture
SageMaker HyperPod supports UltraServer architecture with p6e-gb200.36xlarge instances. An UltraServer contains up to 18 p6e-gb200.36xlarge instances, with 4 GPUs on each instance. All GPUs across all nodes are interconnected through NVLink switches, enabling data transfer between any two GPUs without using network interfaces.
This architecture provides a significant performance boost compared to individual instances. To leverage this architecture effectively, jobs should be submitted to compute nodes from a single UltraServer.
EKS topology label
In accordance with EC2 instance topology, HyperPod automatically labels your nodes with the following labels:
-
topology.kubernetes.io/region - the AWS Region that the node resides in.
-
topology.kubernetes.io/zone - the Availability Zone that the node resides in.
-
topology.k8s.aws/network-node-layer - NetworkNodes describes the network node set of an instance. In each network node set, the network nodes are listed in a hierarchical order from top to bottom. The network node that is connected to the instance is the last network node in the list. There are up to four network node layers, and each node is tagged with a label. Available layers are
topology.k8s.aws/network-node-layer-1,topology.k8s.aws/network-node-layer-2,topology.k8s.aws/network-node-layer-3. -
topology.k8s.aws/ultraserver-id - An identifier used to label each of the instances belonging to the same NVLink domain in an Ultraserver. To learn more about using UltraServers with SageMaker HyperPod, see Using UltraServers in Amazon SageMaker HyperPod.
Using these labels, you can use topology-aware scheduling in HyperPod task governance to apply topology labels and annotations to optimize training efficiency of your workloads. For more information, see Using topology-aware scheduling in Amazon SageMaker HyperPod task governance.
Slurm network topology plugins
Slurm provides built-in plugins for network topology awareness. SageMaker HyperPod automatically selects and configures the appropriate topology plugin based on the instance types in your cluster.
Automatic topology selection
When you create a HyperPod Slurm cluster, the system inspects all instance groups and their associated instance types, identifies the GPU communication characteristics of each instance type, and configures Slurm with the appropriate topology plugin. This process runs automatically and does not require any configuration.
HyperPod manages topology through a dynamically generated topology.conf file.
As the cluster evolves through scaling operations or node replacements, HyperPod continuously
reconciles the topology configuration to reflect the current cluster state. For more information, see
Dynamic topology updates.
Using the topology/tree plugin
The topology/tree plugin models hierarchical communication structures with multiple
bandwidth tiers. Tree topology enables Slurm to place jobs in a way that minimizes cross-tier
communication and maximizes locality.
Tree topology is used for instance types with hierarchical interconnects, where distributed training
workloads benefit from locality-aware placement. This includes instance types such as
ml.p5.48xlarge, ml.p5e.48xlarge, and ml.p5en.48xlarge.
SageMaker HyperPod automatically configures the topology/tree plugin when your cluster uses
these instance types. The generated topology.conf maps nodes into a switch hierarchy
that reflects the communication tiers of your hardware.
Ensure your slurm.conf includes:
TopologyPlugin=topology/tree
Configuration
SageMaker HyperPod automatically configures the topology/tree plugin based on information
provided by Amazon EC2. For more details about Amazon EC2 topology, see
Amazon EC2 instance topology.
When the topology/tree plugin is used, the Slurm topology.conf
looks like the following:
SwitchName=nn-6fe9d8a965d34d181 Switches=nn-0b53107754517bf0e SwitchName=nn-0b53107754517bf0e Switches=nn-424c855d4ad825aa4,nn-95acd7c656329fc30 SwitchName=nn-424c855d4ad825aa4 Nodes=ip-10-1-111-198 SwitchName=nn-95acd7c656329fc30 Nodes=ip-10-1-53-231
Usage
When the topology/tree plugin is configured, Slurm tries to allocate machines that
are close to each other. You can force Slurm to allocate machines on a single switch by passing the
--switch command line parameter to sbatch or srun:
sbatch --switch=1 ....
Using the topology/block plugin
NVIDIA developed a topology/block plugin that provides hierarchical scheduling across
blocks of nodes with the following characteristics:
A block is a consecutive range of nodes
Blocks cannot overlap with each other
All nodes in a block are allocated to a job before the next block is used
The planning block size is the smallest block size configured
Every higher block level size is a power of two than the previous one
This plugin allocates nodes based on the defined network topology.
Block topology models uniform, high-bandwidth communication domains where all GPUs participate in a single high-speed domain with near-uniform latency. Block topology treats all nodes as part of a single cohesive communication unit. UltraServer architecture in SageMaker HyperPod supports the block plugin.
Block topology is used for instance types such as ml.p6e-gb200.NVL72 and
ml.p6e-gb300.NVL72.
Configuration
SageMaker HyperPod automatically configures the topology/block plugin. If you want to
configure the plugin manually, specify the following in the topology.conf file
in your Slurm configuration directory:
BlockName=us1 Nodes=ultraserver1-[0-17] BlockName=us2 Nodes=ultraserver2-[0-17] BlockSizes=18
Ensure your slurm.conf includes:
TopologyPlugin=topology/block
Usage
When submitting jobs, you can use the following additional arguments with sbatch
and srun commands:
--segment=N: Specify the number of nodes to group together. The size of the segment must be less than or equal to the planning block size.--exclusive=topo: Request that no other jobs be placed on the same block. This is useful for benchmarking and performance-sensitive applications.
The following are sample scenarios you might consider when thinking about allocating blocks.
Allocate a whole block of nodes on an empty system
sbatch -N18
Allocate two blocks of nodes on an empty system
sbatch -N36
Allocate 18 nodes on one block + 6 nodes on another block
sbatch -N24
Allocate 12 nodes on one block and 12 nodes on another block
sbatch -N24 --segment=12
With --exclusive=topo, job must be placed on block with no other jobs
sbatch -N12 --exclusive=topo
Topology selection for clusters with mixed instance types
HyperPod currently uses Slurm 24.11, which supports only a single topology configuration per cluster. This means that per-partition topology selection is not supported, multiple topology models cannot coexist within a single cluster, and all nodes must conform to a single topology definition.
When your cluster contains multiple instance types, HyperPod selects a topology that is compatible across all of them. The following table shows an example of how HyperPod resolves topology for a cluster with mixed instance types.
| Instance group | Instance type | Preferred topology |
|---|---|---|
IG-1 |
ml.p5.48xlarge |
Tree |
IG-2 |
ml.p6e-gb300.NVL72 |
Block |
In this example, block topology is optimal for ml.p6e-gb300.NVL72, but tree topology is compatible with both ml.p5.48xlarge and ml.p6e-gb300.NVL72. HyperPod selects tree topology as the cluster-wide topology to ensure that all nodes can participate in scheduling correctly and no instance type is excluded or misrepresented.
Important
For workloads where topology-aware scheduling is critical to performance, we recommend creating separate clusters for each instance type rather than combining different instance types in a single cluster. This ensures that each cluster uses the optimal topology for its hardware, delivering the best possible workload performance. For example, instead of combining ml.p5.48xlarge and ml.p6e-gb300.NVL72 instances in a single cluster where tree topology is selected as a compatible compromise, create a dedicated cluster for each instance type so that each uses its ideal topology model.
Disable or change topology plugin
When a Slurm cluster is created, HyperPod automatically selects the optimal topology plugin.
To manually change the topology plugin, update the TopologyPlugin value in
slurm.conf on the controller node.
Example:
# Set this value to disable topology plugin TopologyPlugin=topology/flat
Dynamic topology updates
Topology-aware scheduling continuously maintains topology correctness as your cluster changes.
The topology is automatically recalculated and the topology.conf file is regenerated
when any of the following events occur:
Scale-up: New nodes are added to the cluster.
Scale-down: Nodes are removed from the cluster.
Node replacement: Failed or unhealthy nodes are replaced, or nodes are manually replaced using the BatchReplaceClusterNodes API.
When the topology is updated, new nodes are incorporated into the correct topology structure, removed nodes are pruned, and the Slurm configuration is updated without requiring manual intervention. This ensures that the topology always reflects the actual cluster state.
Note
Advanced users can override the topology behavior by logging into the Slurm controller node and
manually modifying slurm.conf and topology.conf. However,
manual changes may be overwritten by HyperPod during subsequent cluster updates, including
scaling operations, node replacements, and other cluster lifecycle events. If you modify these files
manually, verify your changes after any cluster update.
Best practices for UltraServer topology
For optimal performance with UltraServer architecture in SageMaker HyperPod:
-
Set appropriate block sizes: Configure
BlockSizes=18(or 17 if one node is spare) to match the UltraServer architecture. -
Use segments for better availability: Use
--segment=16,--segment=8, or--segment=9withsrunandsbatchcommands to improve job scheduling flexibility. -
Consider job size and segment size:
If
BlockSizes=18, jobs with up to 18 instances will always run on a single UltraServer.If
BlockSizes=16, jobs with fewer than 16 instances will always run on a single UltraServer, while jobs with 18 instances may run on one or two UltraServers.
When thinking about segmenting, consider the following
With
--segment=1, each instance can run on a separate UltraServer.With
-N 18 --segment 9, 9 nodes will be placed on one UltraServer, and another 9 nodes can be placed on the same or another UltraServer.With
-N 24 --segment 8, the job can run on 2 or 3 UltraServers, with every 8 nodes placed together on the same server.
Limitations in SageMaker HyperPod topology aware scheduling
The topology/block plugin has limitations with heterogeneous clusters (clusters with different instance types):
Only nodes listed in blocks are schedulable by Slurm
Every block must have at least
BlockSizes[0]nodes
For heterogeneous clusters, consider these alternatives:
Do not use the block plugin with heterogeneous clusters. Instead, isolate UltraServer nodes in a different partition.
Create a separate cluster with UltraServers only in the same VPC and use Slurm's multicluster setup.