SageMaker HyperPod with Slurm orchestration provides the following cluster resiliency features:
Health monitoring agent
Automatic node recovery and auto-resume
Manually replace or reboot a node using Slurm