13 changes: 13 additions & 0 deletions labs/Hyperpod/Dockerfile
@@ -0,0 +1,13 @@
FROM 763104351884.dkr.ecr.us-east-2.amazonaws.com/pytorch-training-neuronx:2.7.0-neuronx-py310-sdk2.24.1-ubuntu22.04

RUN git clone https://github.com/aws-neuron/neuronx-distributed.git
COPY ./src /workspace
RUN cp -r neuronx-distributed/examples/training/llama/* /workspace/
RUN cp -r neuronx-distributed/examples/training/llama/tp_zero1_llama_hf_pretrain/* /workspace/
RUN cp -r neuronx-distributed/examples/training/llama/tp_zero1_llama_hf_pretrain/8B_config_llama3.1 /workspace/config_8b_llama3.1
RUN cp -r neuronx-distributed/examples/training/llama/tp_zero1_llama_hf_pretrain/8B_config_llama3 /workspace/config_8b_llama3
RUN mv /workspace/tp_zero1_llama_hf_pretrain.py /workspace/train.py

WORKDIR /workspace

RUN pip install -r requirements.txt
131 changes: 131 additions & 0 deletions labs/Hyperpod/README.md
@@ -0,0 +1,131 @@
# Build on Trainium Getting Started Guide Using SageMaker HyperPod

In this tutorial, we will use the NeuronX Distributed (NxD) library to train a Llama 3 model, similar to [this workshop](https://catalog.workshops.aws/sagemaker-hyperpod-eks/en-US/03-trainium-nxd).

If you want to use a SageMaker AI Studio space to run this workshop, and your account does not yet have a VPC or SageMaker domain, follow [the CloudFormation deployment here](https://catalog.workshops.aws/sagemaker-hyperpod-eks/en-US/00-setup/env-setup/01-env-sm-code-editor) to create the SageMaker AI domain and a VS Code editor space. The SageMaker domain is created in the default VPC. Once deployed, open SageMaker AI Studio and run the default Code Editor space.

### Step 1 Build the Container

Similar to [this workshop](https://catalog.workshops.aws/sagemaker-hyperpod-eks/en-US/03-trainium-nxd/01-setup), we first need to build the container image that runs the training job, using the latest Neuron SDK base container:

```bash
region=us-east-2
dlc_account_id=763104351884
aws ecr get-login-password --region $region | docker login --username AWS --password-stdin $dlc_account_id.dkr.ecr.$region.amazonaws.com

docker pull 763104351884.dkr.ecr.us-east-2.amazonaws.com/pytorch-training-neuronx:2.7.0-neuronx-py310-sdk2.24.1-ubuntu22.04
```

Clone the repo and change into the lab folder:
```bash
cd ~
git clone https://github.com/aws-neuron/neuron-workshops
cd neuron-workshops/labs/Hyperpod
```

We will build the Docker image using the Dockerfile in this directory.
```bash
export AWS_REGION=$(aws ec2 describe-availability-zones --output text --query 'AvailabilityZones[0].[RegionName]')
export ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
export REGISTRY=${ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com/
export IMAGE=llama3_trn
export TAG=:latest
# DOCKER_NETWORK is optional; e.g. it may be set to "--network sagemaker" when building inside SageMaker Studio
docker build $DOCKER_NETWORK -t ${REGISTRY}${IMAGE}${TAG} .
```
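
Before pushing, you can optionally sanity-check the image locally. A minimal sketch, assuming the build completed and Docker can run the image on your build host:
```bash
# List the training assets the Dockerfile copied into /workspace
docker run --rm ${REGISTRY}${IMAGE}${TAG} ls /workspace
```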

Then push the image to your private Amazon ECR registry:
```bash
# Create registry if needed
export REGISTRY_COUNT=$(aws ecr describe-repositories | grep \"${IMAGE}\" | wc -l)
if [ "${REGISTRY_COUNT//[!0-9]/}" == "0" ]; then
    echo "Creating repository ${REGISTRY}${IMAGE} ..."
    aws ecr create-repository --repository-name ${IMAGE}
else
    echo "Repository ${REGISTRY}${IMAGE} already exists"
fi

# Login to registry
echo "Logging in to $REGISTRY ..."
aws ecr get-login-password | docker login --username AWS --password-stdin $REGISTRY

# Push image to registry
docker image push ${REGISTRY}${IMAGE}${TAG}
```
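
To confirm the push succeeded, you can list the tags now stored in the repository. A hedged check, reusing the same ${IMAGE} variable as above:
```bash
# Show the image tags in the ECR repository; "latest" should appear after the push
aws ecr describe-images --repository-name ${IMAGE} --query 'imageDetails[].imageTags' --output text
```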

### Step 2 Create HyperPod Cluster
You can use [the CloudFormation deployment here](https://catalog.workshops.aws/sagemaker-hyperpod-eks/en-US/00-setup/00-workshop-infra-cfn) to create the HyperPod cluster with EKS.

Here are the parameters to change to use an ml.trn1.32xlarge instance in us-west-2:
1. Set AvailabilityZoneId to usw2-az4 to improve the chances of getting on-demand capacity
2. Set UsingSMCodeEditor to True if you want to access the cluster from the VS Code editor in the SageMaker AI domain
3. Set AcceleratedInstanceType to ml.trn1.32xlarge
4. Set the Kubernetes version to 1.32

Once the CloudFormation deployment finishes successfully, you can manually verify that the VPC, subnet, and security group match the CloudFormation deployment outputs, and then execute the [same commands in this workshop](https://catalog.workshops.aws/sagemaker-hyperpod-eks/en-US/00-setup/00-workshop-infra-cfn#environment-variables) to set up environment variables.
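
As a quick sketch of what that setup enables (assuming the workshop's environment variables, such as $EKS_CLUSTER_NAME, are exported), you can point kubectl at the new EKS cluster and confirm the Trainium nodes registered:
```bash
# Update kubeconfig for the HyperPod EKS cluster (EKS_CLUSTER_NAME comes from the workshop setup)
aws eks update-kubeconfig --name $EKS_CLUSTER_NAME --region $AWS_REGION
# List nodes with their instance type; trn1.32xlarge nodes should appear once the cluster is healthy
kubectl get nodes -L node.kubernetes.io/instance-type
```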

You will also need to set up an FSx for Lustre file system through [Dynamic Provisioning in this workshop](https://catalog.workshops.aws/sagemaker-hyperpod-eks/en-US/01-cluster/06-fsx-for-lustre). Note that this PVC is created in the default namespace, and your training job pods need to be in the same namespace.
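
Once provisioning completes, a quick check (assuming the claim is named fsx-claim, matching the FSX_CLAIM default in generate-jobspec.sh):
```bash
# The claim must show STATUS=Bound before the training job can mount it
kubectl get pvc fsx-claim -n default
```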


### Step 3 Start Training Job

Let us launch a training job that trains the Llama 3.1 8B model. First, update HF_ACCESS_TOKEN in the generate-jobspec.sh file, then execute it:
```bash
./generate-jobspec.sh
```
The script creates two YAML files, tokenize_data.yaml and llama3_train.yaml.
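
Before applying them, you can spot-check that envsubst filled in the template variables, for example:
```bash
# These fields should show your ECR image URI and PVC name, not ${...} placeholders
grep -E 'image:|claimName:' llama3_train.yaml
```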

Next, download the dataset from the Hugging Face Hub and tokenize it using the tokenize_data.yaml job. The job stores the tokenized dataset on FSx for Lustre for training the model in the next step.

```bash
kubectl apply -f ./tokenize_data.yaml
```

To list all pods across namespaces:
```bash
kubectl get pods --all-namespaces
```

The tokenize-data pod should run in the default namespace. To describe the pod:
```bash
kubectl describe pod tokenize-data
```

To check logs:
```bash
kubectl logs -f tokenize-data
```
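
While the pod is still running (a completed pod can no longer be exec'd into), you can optionally check that the tokenized dataset is landing on the FSx volume. A hedged check, assuming the pod mounts the volume at /fsx and using the TOKENIZED_DATA_PATH default from generate-jobspec.sh:
```bash
# List the tokenized dataset being written by the job
kubectl exec -it tokenize-data -- ls /fsx/tokenized_data
```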

Once the tokenize-data pod completes, you can use the llama3_train.yaml job spec file to train the Llama 3.1 8B model with the tokenized data from the previous step.

```bash
kubectl apply -f ./llama3_train.yaml
```

You should see two pods running: etcd and trn1-llama3-worker-0. Similarly, to check the logs:
```bash
kubectl logs -f trn1-llama3-worker-0
```
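
You can also inspect the job through the Kubeflow PyTorchJob resource (assuming the training operator installed by the workshop setup exposes the pytorchjobs CRD):
```bash
# Overall job status
kubectl get pytorchjob trn1-llama3
# Worker pods carry the app=trn1-llama3 label from the job template
kubectl get pods -l app=trn1-llama3
```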

If the pod is not in a running state, you can delete the job and reapply it:
```bash
kubectl delete -f ./llama3_train.yaml
```

Once the job is running successfully, you can run a command inside the container:
```bash
kubectl exec -it trn1-llama3-worker-0 -- neuron-top
```

You may see something similar to this:
<img src="figures/neuron-top.png" width="888">

Press Ctrl+C to exit the visualization.
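
If you prefer a one-shot device listing instead of the interactive view, neuron-ls from the same Neuron tools package prints the Neuron devices and cores visible to the container:
```bash
# Static listing of the Neuron devices allocated to the pod
kubectl exec -it trn1-llama3-worker-0 -- neuron-ls
```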

You can also check the running job status in HyperPod Task Governance:
<img src="figures/taskgovernance.png" width="888">

To clean up, you can delete all of the pods:
```bash
kubectl delete -f ./llama3_train.yaml
kubectl delete -f ./tokenize_data.yaml
```
Binary file added labs/Hyperpod/figures/neuron-top.png
Binary file added labs/Hyperpod/figures/taskgovernance.png
43 changes: 43 additions & 0 deletions labs/Hyperpod/generate-jobspec.sh
@@ -0,0 +1,43 @@
#!/bin/bash

export AWS_REGION=$(aws ec2 describe-availability-zones --output text --query 'AvailabilityZones[0].[RegionName]')
export ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
export REGISTRY=${ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com/
export IMAGE=llama3_trn
export TAG=:latest
export IMAGE_URI=${REGISTRY}${IMAGE}${TAG}

export JOB_NAME=trn1-llama3-training
export NUM_NODES=1
export INSTANCE_TYPE=ml.trn1.32xlarge
export EFA_PER_NODE=8
export NEURON_PER_NODE=16
export FI_PROVIDER=efa


export FSX_CLAIM=fsx-claim # Change this according to the pvc created.

# Tokenize_data configs

export HF_ACCESS_TOKEN=hf_xxxxxx
export TOKENIZED_DATA_PATH=/fsx/tokenized_data
export DATASET_NAME=wikicorpus
export DATASET_CONFIG_NAME=raw_en
export HF_MODEL_NAME=meta-llama/Meta-Llama-3-8B # change this to meta-llama/Meta-Llama-3.1-8B if you want to train the Llama 3.1 8B model

> **Review comment:** Can we change this to the NousResearch/Meta-Llama-3-8B-Instruct for the workshop so users do not need to provide the huggingface token ID and get approval from Meta?
export NEURON_CACHE_DIR=/fsx/neuron_cache
export CHECKPOINT_DIR=/fsx/checkpoints
export NUM_KEPT_CHECKPOINTS=2
export CHECKPOINT_FREQ=100
export NUM_NODES=1
export MAX_STEPS=1000
export STEPS_THIS_RUN=100
export BATCH_SIZE=1

export MODEL_PATH=config_8b_llama3


cat tokenize_data.yaml-template | envsubst > tokenize_data.yaml

cat llama3_train.yaml-template | envsubst > llama3_train.yaml

> **Review comment:** I remember that we need to create the compilation script too. Can you double check this?
175 changes: 175 additions & 0 deletions labs/Hyperpod/llama3_train.yaml-template
@@ -0,0 +1,175 @@
apiVersion: v1
kind: Service
metadata:
  name: etcd
spec:
  ports:
    - name: etcd-client-port
      port: 2379
      protocol: TCP
      targetPort: 2379
  selector:
    app: etcd

---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: etcd
  name: etcd
spec:
  replicas: 1
  selector:
    matchLabels:
      app: etcd
  template:
    metadata:
      labels:
        app: etcd
    spec:
      containers:
        - name: etcd
          command: ["/usr/local/bin/etcd"]
          args:
            - "--data-dir"
            - "/var/lib/etcd"
            - "--enable-v2"
            - "--listen-client-urls"
            - "http://0.0.0.0:2379"
            - "--advertise-client-urls"
            - "http://0.0.0.0:2379"
            - "--initial-cluster-state"
            - "new"
          image: quay.io/coreos/etcd:v3.5.19
          ports:
            - containerPort: 2379
              name: client
              protocol: TCP
            - containerPort: 2380
              name: server
              protocol: TCP
      restartPolicy: Always
---
apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
  name: trn1-llama3
spec:
  elasticPolicy:
    rdzvBackend: etcd
    rdzvHost: etcd
    rdzvPort: 2379
    minReplicas: 1
    maxReplicas: 64
    maxRestarts: 100
    metrics:
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 90
  pytorchReplicaSpecs:
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          labels:
            app: trn1-llama3
        spec:
          volumes:
            - name: shmem
              hostPath:
                path: /dev/shm
            - name: persistent-storage
              persistentVolumeClaim:
                claimName: ${FSX_CLAIM}
            - name: local
              hostPath:
                path: /dev
            - name: hyperpod
              hostPath:
                path: /var/log/aws/clusters
          nodeSelector:
            node.kubernetes.io/instance-type: ${INSTANCE_TYPE}
          containers:
            - name: pytorch
              image: ${IMAGE_URI}
              imagePullPolicy: Always
              resources:
                requests:
                  aws.amazon.com/neuron: ${NEURON_PER_NODE}
                  vpc.amazonaws.com/efa: ${EFA_PER_NODE}
                limits:
                  aws.amazon.com/neuron: ${NEURON_PER_NODE}
                  vpc.amazonaws.com/efa: ${EFA_PER_NODE}
              env:
                - name: LOGLEVEL
                  value: "DEBUG"
                - name: FI_PROVIDER
                  value: efa
                - name: FI_EFA_USE_DEVICE_RDMA
                  value: "1"
                - name: FI_EFA_FORK_SAFE
                  value: "1"
                - name: FI_LOG_LEVEL
                  value: "1"
                - name: FI_EFA_ENABLE_SHM_TRANSFER
                  value: "1"
                - name: NEURON_RT_NUM_CORES
                  value: "32"
                - name: NUM_NEURONCORES
                  value: "32"
                - name: TPU_NUM_DEVICES
                  value: "32"
                - name: TPU_CHIPS_PER_HOST_BOUNDS
                  value: "32"
                - name: TORCH_NCCL_DEBUG_INFO_TEMP_FILE
                  value: "/local/nccl_trace_rank_"
                - name: PYTORCH_CUDA_ALLOC_CONF
                  value: "expandable_segments:True"
                - name: MALLOC_ARENA_MAX
                  value: "64"
                - name: NCCL_SOCKET_IFNAME
                  value: "^lo"
                - name: NEURON_RT_ASYNC_EXEC_MAX_INFLIGHT_REQUESTS
                  value: "3"
                - name: NEURON_FUSE_SOFTMAX
                  value: "1"
                - name: NEURON_CC_FLAGS
                  value: "--model-type transformer --distribution-strategy=llm-training --cache_dir=${NEURON_CACHE_DIR}"
              command:
                - torchrun
                - --nproc_per_node=32
                - --nnodes=$NUM_NODES
                - train.py
                - --model_path=${MODEL_PATH}
                - --data_dir=${TOKENIZED_DATA_PATH}/${DATASET_NAME}_llama3_tokenized_8k
                - --tensor_parallel_size=32
                - --batch_size=${BATCH_SIZE}
                - --steps_this_run=${STEPS_THIS_RUN}
                - --max_steps=${MAX_STEPS}
                - --warmup_steps=100
                - --lr=1.5e-4
                - --grad_accum_usteps=16
                - --seq_len=8192
                - --sequence_parallel_enabled
                - --selective_checkpoint_enabled
                - --logging_interval=10
                - --qkv_linear
                - --kv_replicator=4
                - --use_flash_attention=1
                - --use_zero_1
                - --use_mix_precision
                - --checkpoint_freq=${CHECKPOINT_FREQ}
                - --num_kept_checkpoint=${NUM_KEPT_CHECKPOINTS}
                - --checkpoint_dir=${CHECKPOINT_DIR}
              volumeMounts:
                - name: shmem
                  mountPath: /dev/shm
                - name: persistent-storage
                  mountPath: /fsx
                - name: hyperpod
                  mountPath: /var/log/aws/clusters