Tutorials - Amazon SageMaker HyperPod Checkpointless PEFT-LoRA Llama 3 70b - Amazon SageMaker AI

The following sequence of steps is required to run checkpointless training recipes on HyperPod.

Prerequisites

Before you start setting up your environment, make sure you have:

Kubernetes environment setup

To set up your Kubernetes environment, do the following:

  1. Set up the virtual environment. Make sure you're using Python 3.10 or later, but earlier than 3.14.

    python3 -m venv ${PWD}/venv
    source venv/bin/activate
  2. Set up kubectl and eksctl

  3. Install Helm

  4. Connect to your Kubernetes cluster

    aws eks update-kubeconfig --region "${CLUSTER_REGION}" --name "${CLUSTER_NAME}"
  5. Install dependencies using one of the following methods:

    1. Method 1: SageMaker HyperPod recipes method:

      # Install SageMaker HyperPod recipes.
      git clone --recursive git@github.com:aws/sagemaker-hyperpod-recipes.git
      cd sagemaker-hyperpod-recipes
      pip3 install -r requirements.txt
    2. Method 2: kubectl with pre-defined job yaml method

      # Install SageMaker HyperPod checkpointless training.
      git clone git@github.com:aws/sagemaker-hyperpod-checkpointless-training.git
      cd sagemaker-hyperpod-checkpointless-training

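Before moving on, a quick sanity check can confirm that the toolchain from the steps above is in place. The following is an optional convenience sketch, not part of the official setup; it only reports what it finds and does not modify anything.

```shell
# Optional sanity check: report whether the interpreter and the CLI tools from
# the steps above are available. Informational only -- it does not stop the shell.
python3 -c 'import sys; v = sys.version_info[:2]; print("python3:", "ok" if (3, 10) <= v < (3, 14) else "unsupported")'
for tool in kubectl eksctl helm aws; do
    if command -v "$tool" >/dev/null 2>&1; then
        echo "$tool: found"
    else
        echo "$tool: MISSING"
    fi
done
```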
You can now launch the checkpointless training recipe using either the NeMo-style recipes launcher or kubectl.

Method 1: Launch the training job with the recipes launcher

You can use the SageMaker HyperPod recipes launcher to submit your training job. Using the recipes involves updating k8s.yaml and config.yaml, and then running the launch script.

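One way to supply the launch script's inputs is to export the environment variables it reads before running it, rather than editing the script in place. The sketch below assumes the variable names from the script in step 1; every value is a placeholder that you must replace with your own paths and image URI.

```shell
# Sketch: export the variables the launch script reads, then run it.
# All values below are placeholders -- substitute your own before launching.
export TRAIN_DIR="/data/train"                      # training dataset (placeholder)
export VAL_DIR="/data/val"                          # validation dataset (placeholder)
export EXP_DIR="/data/experiments"                  # experiment output dir (placeholder)
export LOG_DIR="/data/logs"                         # log dir (placeholder)
export CONTAINER="<checkpointless-training-image>"  # Deep Learning Container URI (placeholder)
export MODEL_NAME_OR_PATH="/data/llama3-70b-nemo"   # pretrained weights in NeMo format (placeholder)

bash launcher_scripts/llama/run_checkpointless_llama3_70b_lora.sh
```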
  1. Update launcher_scripts/llama/run_checkpointless_llama3_70b_lora.sh

    • CONTAINER: A Deep Learning Container image. To find the most recent release of the checkpointless training container, see the checkpointless training release notes.

    #!/bin/bash
    SAGEMAKER_TRAINING_LAUNCHER_DIR=${SAGEMAKER_TRAINING_LAUNCHER_DIR:-"$(pwd)"}

    TRAIN_DIR="${TRAIN_DIR}"                       # Training dataset location
    VAL_DIR="${VAL_DIR}"                           # Validation dataset location
    EXP_DIR="${EXP_DIR}"                           # Experiment output directory
    LOG_DIR="${LOG_DIR}"                           # Log directory
    CONTAINER_MOUNT="/data"                        # Mount path inside the container
    CONTAINER="${CONTAINER}"                       # Deep Learning Container image
    MODEL_NAME_OR_PATH="${MODEL_NAME_OR_PATH}"     # Pretrained model weights

    HYDRA_FULL_ERROR=1 python3 "${SAGEMAKER_TRAINING_LAUNCHER_DIR}/main.py" \
        recipes=fine-tuning/llama/checkpointless_llama3_70b_lora \
        recipes.dataset.dataset_path="${TRAIN_DIR}" \
        recipes.exp_manager.exp_dir="${EXP_DIR}" \
        recipes.log_dir="${LOG_DIR}" \
        recipes.resume.restore_config.path="${MODEL_NAME_OR_PATH}" \
        base_results_dir="${SAGEMAKER_TRAINING_LAUNCHER_DIR}/results" \
        git.use_default=false \
        cluster=k8s \
        cluster_type=k8s \
        container="${CONTAINER}" \
        +cluster.hostNetwork=true \
        +cluster.persistent_volume_claims.0.claimName=fsx-claim \
        +cluster.persistent_volume_claims.0.mountPath="${CONTAINER_MOUNT}" \
        +recipes.dataset.val_dataset_path="${VAL_DIR}" \
        ++recipes.callbacks.3.test_fault_config.fault_prob_between_lock=1
  2. Launch the training job

    bash launcher_scripts/llama/run_checkpointless_llama3_70b_lora.sh
  3. After you've submitted the training job, you can use the following command to verify that it was submitted successfully.

    kubectl get pods
    NAME                   READY   STATUS    RESTARTS   AGE
    llama-3-70b-worker-0   0/1     Running   0          36s
  4. If the STATUS is Pending or ContainerCreating, run the following command to get more details.

    kubectl describe pod <name of pod>
  5. After the job STATUS changes to Running, you can examine the log by using the following command.

    kubectl logs <name of pod>

    When the job finishes, the STATUS shown by kubectl get pods changes to Completed.
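The wait-and-inspect loop in steps 3 through 5 can also be scripted with standard kubectl commands. A minimal sketch, assuming POD defaults to the example worker pod name from step 3 (override it with your actual pod name):

```shell
# Block until the worker pod is Ready, then stream its logs.
# llama-3-70b-worker-0 is the example pod name from step 3.
POD="${POD:-llama-3-70b-worker-0}"
kubectl wait --for=condition=Ready "pod/${POD}" --timeout=600s
kubectl logs -f "${POD}"
```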

Method 2: Launch the training job with kubectl with pre-defined yaml

Another option is to launch the training through kubectl with a pre-defined job yaml.

  1. Update examples/llama3/launch/peft_llama3_70b_checkpointless_p5.yaml with the following values:

    • image: A Deep Learning container. To find the most recent release of the checkpointless training container, see checkpointless training release notes.

    • resume.restore_config.path=<path_to_pretrained_weights>: The path to the pretrained model weights, in NeMo format, that you downloaded in the Prerequisites step.

    • dataset.dataset_path=<path_to_dataset>: The path to the dataset stored in the shared storage.

  2. Submit the job using kubectl with peft_llama3_70b_checkpointless_p5.yaml

    kubectl apply -f examples/llama3/launch/peft_llama3_70b_checkpointless_p5.yaml
  3. After you've submitted the training job, you can use the following command to verify that it was submitted successfully.

    kubectl get pods
    NAME                                      READY   STATUS    RESTARTS   AGE
    llama3-70b-lora-checkpointless-worker-0   0/1     Running   0          36s
  4. If the STATUS is Pending or ContainerCreating, run the following command to get more details.

    kubectl describe pod <name of pod>
  5. After the job STATUS changes to Running, you can examine the log by using the following command.

    kubectl logs <name of pod>

    When the job finishes, the STATUS shown by kubectl get pods changes to Completed.
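When the job has completed, or if you need to edit the yaml and re-submit, the same manifest can be used to tear the job down. A minimal sketch using standard kubectl commands:

```shell
# Delete the job using the manifest applied in step 2, then confirm the
# worker pods are gone.
kubectl delete -f examples/llama3/launch/peft_llama3_70b_checkpointless_p5.yaml
kubectl get pods
```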