Tutorials - Amazon SageMaker HyperPod Checkpointless Pretraining or Finetuning Custom Models
The following sequence of steps is required to run checkpointless training with your custom model on HyperPod.
Prerequisites
Before you start setting up your environment, make sure you have:
- A shared storage location. It can be an Amazon FSx file system or an NFS system that's accessible from the cluster nodes.
- Data in one of the following formats:
  - JSON
  - JSONGZ (Compressed JSON)
  - ARROW
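As a rough illustration of the supported input formats, the sketch below loads newline-delimited samples from JSON or JSONGZ files using only the standard library. The function name and the one-JSON-object-per-line layout are assumptions for this example; ARROW loading is omitted because it requires pyarrow.

```python
import gzip
import json
from pathlib import Path

def load_samples(path):
    """Load training samples from a JSON or JSONGZ file,
    assuming one JSON object per line (JSONL-style)."""
    p = Path(path)
    if p.suffix == ".gz":
        # JSONGZ: gzip-compressed JSON lines
        with gzip.open(p, "rt", encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]
    if p.suffix == ".json":
        with open(p, "r", encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]
    raise ValueError(f"unsupported data format: {p.suffix}")
```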
- Model weights downloaded from Hugging Face and converted to a NeMo-supported format.
Set up your environment
Kubernetes environment setup
To set up your Kubernetes environment, do the following:
1. Set up the virtual environment. Make sure you're using a Python version greater than or equal to 3.10 and lower than 3.14.

   python3 -m venv ${PWD}/venv
   source venv/bin/activate

2. Connect to your Kubernetes cluster.

   aws eks update-kubeconfig --region "${CLUSTER_REGION}" --name "${CLUSTER_NAME}"

3. Install dependencies.

   # Install SageMaker HyperPod checkpointless training.
   git clone git@github.com:aws/sagemaker-hyperpod-checkpointless-training.git
   cd sagemaker-hyperpod-checkpointless-training
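The Python version requirement from step 1 can be expressed as a simple check. This is an illustrative helper (not part of the repository) that validates the 3.10 ≤ version < 3.14 window against an interpreter's version tuple.

```python
def venv_python_supported(version_info):
    """Return True if the interpreter version satisfies
    3.10 <= version < 3.14, the range this tutorial requires."""
    major_minor = (version_info[0], version_info[1])
    return (3, 10) <= major_minor < (3, 14)

if __name__ == "__main__":
    import sys
    if not venv_python_supported(sys.version_info):
        raise SystemExit("Python >= 3.10 and < 3.14 is required")
```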
Checkpointless training modification instructions
To incrementally adopt checkpointless training for custom models, follow the integration guide (this tutorial uses Llama 3 70B pretraining as an example), which involves:
Fast communicator creation
Memory-mapped dataloader (MMAP)
In-process & Checkpointless recovery
Component 1: Fast communicator creation
This optimizes the time needed to establish connections between the workers. No code changes are needed; it only requires setting environment variables.
# Enable Rootless features
export HPCT_USE_ROOTLESS=1 && \
sysctl -w net.ipv4.ip_local_port_range="20000 65535" && \
hyperpodrun --nproc_per_node=8 \
  ... \
  --inprocess-restart \
  ...
The full change can be found in the llama3 70b pretrain launch job config.
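To make the two environment settings above concrete, the sketch below shows what they control: a feature flag read from the environment, and the number of ephemeral client ports the widened `net.ipv4.ip_local_port_range` makes available for worker connections. Both helpers are hypothetical; the actual parsing inside hyperpodrun may differ.

```python
import os

def rootless_enabled(env=None):
    """Interpret HPCT_USE_ROOTLESS the way shell-style flags usually
    work: "1" means enabled, unset or anything else means disabled.
    (Illustrative only.)"""
    if env is None:
        env = os.environ
    return env.get("HPCT_USE_ROOTLESS", "0") == "1"

def ephemeral_port_count(range_str):
    """Number of local ports available given the value written to
    net.ipv4.ip_local_port_range, e.g. "20000 65535"."""
    low, high = (int(x) for x in range_str.split())
    return high - low + 1
```

Widening the range from the typical Linux default of "32768 60999" to "20000 65535" gives each node substantially more ports for the many simultaneous worker connections opened during fast communicator creation.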
Component 2: Memory-mapped dataloader (MMAP)
MMAP caches store pre-fetched data samples and enable training to start immediately, without waiting for data preprocessing. It requires minimal code changes to adopt: wrap the existing dataloader.
data_module = MMAPDataModule(
    data_module=base_data_module,
    mmap_config=CacheResumeMMAPConfig(cache_dir=…),
)
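The caching behavior behind this wrapper can be sketched in plain Python. The class below is a hypothetical stand-in for MMAPDataModule: it materializes the base module's preprocessed samples to a cache directory on the first pass, so a later run (or a recovered process) reads from the cache instead of repeating preprocessing. The real implementation uses memory-mapped files rather than JSON lines.

```python
import json
from pathlib import Path

class CachingDataModule:
    """Illustrative stand-in for MMAPDataModule: wraps a base data
    source and caches its preprocessed samples on disk so subsequent
    reads start immediately."""

    def __init__(self, base_factory, cache_dir):
        # base_factory: callable returning an iterable of samples
        # (the expensive preprocessing step)
        self.base_factory = base_factory
        self.cache_file = Path(cache_dir) / "samples.jsonl"

    def samples(self):
        if self.cache_file.exists():
            # Cache hit: skip preprocessing entirely.
            with open(self.cache_file, encoding="utf-8") as f:
                return [json.loads(line) for line in f]
        data = list(self.base_factory())  # expensive first pass
        self.cache_file.parent.mkdir(parents=True, exist_ok=True)
        with open(self.cache_file, "w", encoding="utf-8") as f:
            for sample in data:
                f.write(json.dumps(sample) + "\n")
        return data
```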
Components 3 and 4: In-process and checkpointless recovery
This enables failure recovery without restarting training processes or loading from checkpoints. Additional code changes are needed (a strategy and training config update, and wrapping the existing main function).
@HPWrapper(
    health_check=CudaHealthCheck(),
    hp_api_factory=HPAgentK8sAPIFactory(),
    abort_timeout=60.0,
    ...,
)
def run_main(cfg, caller: Optional[HPCallWrapper] = None):
    ...
    CheckpointlessMegatronStrategy(
        **self.cfg.strategy,
        ddp=self.ddp,
    )
The full change can be found in the llama3 70b pretrain entry script.
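The core idea of in-process recovery can be sketched as a decorator: on failure, run a health check and re-invoke the wrapped main in the same process rather than tearing down the job. Everything below is a simplified, hypothetical illustration; the real HPWrapper, CudaHealthCheck, and their restart policy are more involved.

```python
import functools

class HealthCheckStub:
    """Hypothetical stand-in for CudaHealthCheck: returns True when
    the worker is healthy enough to restart in-process."""
    def __call__(self):
        return True

def hp_wrapper(health_check, max_restarts=3):
    """Simplified sketch of an HPWrapper-style decorator: retry the
    wrapped main in-process after a recoverable failure, instead of
    restarting the training process or reloading a checkpoint."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapped(*args, **kwargs):
            for attempt in range(max_restarts + 1):
                try:
                    return fn(*args, **kwargs)
                except RuntimeError:
                    # Give up once restarts are exhausted or the
                    # worker is no longer healthy.
                    if attempt == max_restarts or not health_check():
                        raise
        return wrapped
    return decorate
```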
Launch training
You can now launch checkpointless training using kubectl.
kubectl apply -f your_job_config.yaml