13 changes: 13 additions & 0 deletions labs/Hyperpod/Dockerfile
@@ -0,0 +1,13 @@
FROM 763104351884.dkr.ecr.us-east-2.amazonaws.com/pytorch-training-neuronx:2.7.0-neuronx-py310-sdk2.24.1-ubuntu22.04

RUN git clone https://github.com/aws-neuron/neuronx-distributed.git
COPY ./src /workspace
RUN cp -r neuronx-distributed/examples/training/llama/* /workspace/
RUN cp -r neuronx-distributed/examples/training/llama/tp_zero1_llama_hf_pretrain/* /workspace/
RUN cp -r neuronx-distributed/examples/training/llama/tp_zero1_llama_hf_pretrain/8B_config_llama3.1 /workspace/config_8b_llama3.1
RUN cp -r neuronx-distributed/examples/training/llama/tp_zero1_llama_hf_pretrain/8B_config_llama3 /workspace/config_8b_llama3
RUN mv /workspace/tp_zero1_llama_hf_pretrain.py /workspace/train.py

WORKDIR /workspace

RUN pip install -r requirements.txt
131 changes: 131 additions & 0 deletions labs/Hyperpod/README.md
@@ -0,0 +1,131 @@
# Build on Trainium Getting Started Guide Using SageMaker HyperPod

In this tutorial, we will use the NeuronX Distributed (NxD) library to train a Llama 3 model, similar to [this workshop](https://catalog.workshops.aws/sagemaker-hyperpod-eks/en-US/03-trainium-nxd).

If you want to use a SageMaker AI Studio space to run this workshop, and your account does not yet have a VPC or SageMaker domain, follow [the CloudFormation deployment here](https://catalog.workshops.aws/sagemaker-hyperpod-eks/en-US/00-setup/env-setup/01-env-sm-code-editor) to create the SageMaker AI domain and a VS Code editor space. The SageMaker domain is created in the default VPC. Once deployed, open SageMaker AI Studio and run the default Code Editor space.

### Step 1 Build the Container

Similar to [this workshop](https://catalog.workshops.aws/sagemaker-hyperpod-eks/en-US/03-trainium-nxd/01-setup), we first need to build the container image that runs the training job, using the latest Neuron SDK base container:

```bash
region=us-east-2
dlc_account_id=763104351884
aws ecr get-login-password --region $region | docker login --username AWS --password-stdin $dlc_account_id.dkr.ecr.$region.amazonaws.com

docker pull 763104351884.dkr.ecr.us-east-2.amazonaws.com/pytorch-training-neuronx:2.7.0-neuronx-py310-sdk2.24.1-ubuntu22.04
```

Clone the repo and change into the lab folder:
```bash
cd ~
git clone https://github.com/aws-neuron/neuron-workshops
cd neuron-workshops/labs/Hyperpod
```

We will build the Docker image using the Dockerfile in this directory.
```bash
export AWS_REGION=$(aws ec2 describe-availability-zones --output text --query 'AvailabilityZones[0].[RegionName]')
export ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
export REGISTRY=${ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com/
export IMAGE=llama3_trn
export TAG=:latest
# DOCKER_NETWORK is optional; e.g. it may be set to "--network sagemaker" when building inside SageMaker Studio
docker build $DOCKER_NETWORK -t ${REGISTRY}${IMAGE}${TAG} .
```
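
Before pushing, you can optionally sanity-check the image locally. A minimal sketch, assuming the build completed and Docker can run the image on your build host:
```bash
# List the training assets the Dockerfile copied into /workspace
docker run --rm ${REGISTRY}${IMAGE}${TAG} ls /workspace
```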

Then push the image to your private Amazon ECR registry:
```bash
# Create registry if needed
export REGISTRY_COUNT=$(aws ecr describe-repositories | grep \"${IMAGE}\" | wc -l)
if [ "${REGISTRY_COUNT//[!0-9]/}" == "0" ]; then
    echo "Creating repository ${REGISTRY}${IMAGE} ..."
    aws ecr create-repository --repository-name ${IMAGE}
else
    echo "Repository ${REGISTRY}${IMAGE} already exists"
fi

# Login to registry
echo "Logging in to $REGISTRY ..."
aws ecr get-login-password | docker login --username AWS --password-stdin $REGISTRY

# Push image to registry
docker image push ${REGISTRY}${IMAGE}${TAG}
```
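
To confirm the push succeeded, you can list the tags now stored in the repository. A hedged check, reusing the same ${IMAGE} variable as above:
```bash
# Show the image tags in the ECR repository; "latest" should appear after the push
aws ecr describe-images --repository-name ${IMAGE} --query 'imageDetails[].imageTags' --output text
```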

### Step 2 Create HyperPod Cluster
You can use [the CloudFormation deployment here](https://catalog.workshops.aws/sagemaker-hyperpod-eks/en-US/00-setup/00-workshop-infra-cfn) to create the HyperPod cluster with EKS.

Here are the parameters to change to use an ml.trn1.32xlarge instance in us-west-2:
1. Set AvailabilityZoneId to usw2-az4 to improve the chances of getting on-demand capacity
2. Set UsingSMCodeEditor to True if you want to access the cluster from the VS Code editor in the SageMaker AI domain
3. Set AcceleratedInstanceType to ml.trn1.32xlarge
4. Set the Kubernetes version to 1.32

Once the CloudFormation deployment finishes successfully, you can manually verify that the VPC, subnet, and security group match the CloudFormation deployment outputs, and then execute the [same commands in this workshop](https://catalog.workshops.aws/sagemaker-hyperpod-eks/en-US/00-setup/00-workshop-infra-cfn#environment-variables) to set up environment variables.
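
As a quick sketch of what that setup enables (assuming the workshop's environment variables, such as $EKS_CLUSTER_NAME, are exported), you can point kubectl at the new EKS cluster and confirm the Trainium nodes registered:
```bash
# Update kubeconfig for the HyperPod EKS cluster (EKS_CLUSTER_NAME comes from the workshop setup)
aws eks update-kubeconfig --name $EKS_CLUSTER_NAME --region $AWS_REGION
# List nodes with their instance type; trn1.32xlarge nodes should appear once the cluster is healthy
kubectl get nodes -L node.kubernetes.io/instance-type
```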

You will also need to set up an FSx for Lustre file system through [Dynamic Provisioning in this workshop](https://catalog.workshops.aws/sagemaker-hyperpod-eks/en-US/01-cluster/06-fsx-for-lustre). Note that this PVC is created in the default namespace, and your training job pods need to be in the same namespace.
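
Once provisioning completes, a quick check (assuming the claim is named fsx-claim, matching the FSX_CLAIM default in generate-jobspec.sh):
```bash
# The claim must show STATUS=Bound before the training job can mount it
kubectl get pvc fsx-claim -n default
```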


### Step 3 Start Training Job

Let us launch a training job that trains the Llama 3.1 8B model. First, update HF_ACCESS_TOKEN in the generate-jobspec.sh file, then execute it:
```bash
./generate-jobspec.sh
```
The script creates two YAML files, tokenize_data.yaml and llama3_train.yaml.
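
Before applying them, you can spot-check that envsubst filled in the template variables, for example:
```bash
# These fields should show your ECR image URI and PVC name, not ${...} placeholders
grep -E 'image:|claimName:' llama3_train.yaml
```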

Next, download the dataset from the Hugging Face Hub and tokenize it using the tokenize_data.yaml job. The job stores the tokenized dataset on FSx for Lustre for training the model in the next step.

```bash
kubectl apply -f ./tokenize_data.yaml
```

To list all pods across namespaces:
```bash
kubectl get pods --all-namespaces
```

The tokenize-data pod should run in the default namespace. To describe the pod:
```bash
kubectl describe pod tokenize-data
```

To check logs:
```bash
kubectl logs -f tokenize-data
```
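
While the pod is still running (a completed pod can no longer be exec'd into), you can optionally check that the tokenized dataset is landing on the FSx volume. A hedged check, assuming the pod mounts the volume at /fsx and using the TOKENIZED_DATA_PATH default from generate-jobspec.sh:
```bash
# List the tokenized dataset being written by the job
kubectl exec -it tokenize-data -- ls /fsx/tokenized_data
```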

Once the tokenize-data pod completes, you can use the llama3_train.yaml job spec file to train the Llama 3.1 8B model with the tokenized data from the previous step.

```bash
kubectl apply -f ./llama3_train.yaml
```

You should see two pods running: etcd and trn1-llama3-worker-0. Similarly, to check the logs:
```bash
kubectl logs -f trn1-llama3-worker-0
```
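
You can also inspect the job through the Kubeflow PyTorchJob resource (assuming the training operator installed by the workshop setup exposes the pytorchjobs CRD):
```bash
# Overall job status
kubectl get pytorchjob trn1-llama3
# Worker pods carry the app=trn1-llama3 label from the job template
kubectl get pods -l app=trn1-llama3
```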

If the pod is not in a running state, you can delete the job and reapply it:
```bash
kubectl delete -f ./llama3_train.yaml
```

Once the job is running successfully, you can run a command inside the container:
```bash
kubectl exec -it trn1-llama3-worker-0 -- neuron-top
```

You may see something similar to this:
<img src="figures/neuron-top.png" width="888">

Press Ctrl+C to exit the visualization.
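
If you prefer a one-shot device listing instead of the interactive view, neuron-ls from the same Neuron tools package prints the Neuron devices and cores visible to the container:
```bash
# Static listing of the Neuron devices allocated to the pod
kubectl exec -it trn1-llama3-worker-0 -- neuron-ls
```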

You can also check the running job status in HyperPod Task Governance:
<img src="figures/taskgovernance.png" width="888">

To clean up, you can delete all of the pods:
```bash
kubectl delete -f ./llama3_train.yaml
kubectl delete -f ./tokenize_data.yaml
```
Binary file added labs/Hyperpod/figures/neuron-top.png
Binary file added labs/Hyperpod/figures/taskgovernance.png
43 changes: 43 additions & 0 deletions labs/Hyperpod/generate-jobspec.sh
@@ -0,0 +1,43 @@
#!/bin/bash

export AWS_REGION=$(aws ec2 describe-availability-zones --output text --query 'AvailabilityZones[0].[RegionName]')
export ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
export REGISTRY=${ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com/
export IMAGE=llama3_trn
export TAG=:latest
export IMAGE_URI=${REGISTRY}${IMAGE}${TAG}

export JOB_NAME=trn1-llama3-training
export NUM_NODES=1
export INSTANCE_TYPE=ml.trn1.32xlarge
export EFA_PER_NODE=8
export NEURON_PER_NODE=16
export FI_PROVIDER=efa


export FSX_CLAIM=fsx-claim # Change this according to the pvc created.

# Tokenize_data configs

export HF_ACCESS_TOKEN=hf_xxxxxx
export TOKENIZED_DATA_PATH=/fsx/tokenized_data
export DATASET_NAME=wikicorpus
export DATASET_CONFIG_NAME=raw_en
export HF_MODEL_NAME=meta-llama/Meta-Llama-3-8B # change this to meta-llama/Meta-Llama-3.1-8B if you want to train the Llama 3.1 8B model

> **Review comment:** Can we change this to the NousResearch/Meta-Llama-3-8B-Instruct for the workshop so users do not need to provide the huggingface token ID and get approval from Meta?
export NEURON_CACHE_DIR=/fsx/neuron_cache
export CHECKPOINT_DIR=/fsx/checkpoints
export NUM_KEPT_CHECKPOINTS=2
export CHECKPOINT_FREQ=100
export NUM_NODES=1
export MAX_STEPS=1000
export STEPS_THIS_RUN=100
export BATCH_SIZE=1

export MODEL_PATH=config_8b_llama3


cat tokenize_data.yaml-template | envsubst > tokenize_data.yaml

cat llama3_train.yaml-template | envsubst > llama3_train.yaml

> **Review comment:** I remember that we need to create the compilation script too. Can you double check this?
175 changes: 175 additions & 0 deletions labs/Hyperpod/llama3_train.yaml-template
@@ -0,0 +1,175 @@
apiVersion: v1
kind: Service
metadata:
  name: etcd
spec:
  ports:
    - name: etcd-client-port
      port: 2379
      protocol: TCP
      targetPort: 2379
  selector:
    app: etcd

---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: etcd
  name: etcd
spec:
  replicas: 1
  selector:
    matchLabels:
      app: etcd
  template:
    metadata:
      labels:
        app: etcd
    spec:
      containers:
        - name: etcd
          command: ["/usr/local/bin/etcd"]
          args:
            - "--data-dir"
            - "/var/lib/etcd"
            - "--enable-v2"
            - "--listen-client-urls"
            - "http://0.0.0.0:2379"
            - "--advertise-client-urls"
            - "http://0.0.0.0:2379"
            - "--initial-cluster-state"
            - "new"
          image: quay.io/coreos/etcd:v3.5.19
          ports:
            - containerPort: 2379
              name: client
              protocol: TCP
            - containerPort: 2380
              name: server
              protocol: TCP
      restartPolicy: Always
---
apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
  name: trn1-llama3
spec:
  elasticPolicy:
    rdzvBackend: etcd
    rdzvHost: etcd
    rdzvPort: 2379
    minReplicas: 1
    maxReplicas: 64
    maxRestarts: 100
    metrics:
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 90
  pytorchReplicaSpecs:
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          labels:
            app: trn1-llama3
        spec:
          volumes:
            - name: shmem
              hostPath:
                path: /dev/shm
            - name: persistent-storage
              persistentVolumeClaim:
                claimName: ${FSX_CLAIM}
            - name: local
              hostPath:
                path: /dev
            - name: hyperpod
              hostPath:
                path: /var/log/aws/clusters
          nodeSelector:
            node.kubernetes.io/instance-type: ${INSTANCE_TYPE}
          containers:
            - name: pytorch
              image: ${IMAGE_URI}
              imagePullPolicy: Always
              resources:
                requests:
                  aws.amazon.com/neuron: ${NEURON_PER_NODE}
                  vpc.amazonaws.com/efa: ${EFA_PER_NODE}
                limits:
                  aws.amazon.com/neuron: ${NEURON_PER_NODE}
                  vpc.amazonaws.com/efa: ${EFA_PER_NODE}
              env:
                - name: LOGLEVEL
                  value: "DEBUG"
                - name: FI_PROVIDER
                  value: efa
                - name: FI_EFA_USE_DEVICE_RDMA
                  value: "1"
                - name: FI_EFA_FORK_SAFE
                  value: "1"
                - name: FI_LOG_LEVEL
                  value: "1"
                - name: FI_EFA_ENABLE_SHM_TRANSFER
                  value: "1"
                - name: NEURON_RT_NUM_CORES
                  value: "32"
                - name: NUM_NEURONCORES
                  value: "32"
                - name: TPU_NUM_DEVICES
                  value: "32"
                - name: TPU_CHIPS_PER_HOST_BOUNDS
                  value: "32"
                - name: TORCH_NCCL_DEBUG_INFO_TEMP_FILE
                  value: "/local/nccl_trace_rank_"
                - name: PYTORCH_CUDA_ALLOC_CONF
                  value: "expandable_segments:True"
                - name: MALLOC_ARENA_MAX
                  value: "64"
                - name: NCCL_SOCKET_IFNAME
                  value: "^lo"
                - name: NEURON_RT_ASYNC_EXEC_MAX_INFLIGHT_REQUESTS
                  value: "3"
                - name: NEURON_FUSE_SOFTMAX
                  value: "1"
                - name: NEURON_CC_FLAGS
                  value: "--model-type transformer --distribution-strategy=llm-training --cache_dir=${NEURON_CACHE_DIR}"
              command:
                - torchrun
                - --nproc_per_node=32
                - --nnodes=$NUM_NODES
                - train.py
                - --model_path=${MODEL_PATH}
                - --data_dir=${TOKENIZED_DATA_PATH}/${DATASET_NAME}_llama3_tokenized_8k
                - --tensor_parallel_size=32
                - --batch_size=${BATCH_SIZE}
                - --steps_this_run=${STEPS_THIS_RUN}
                - --max_steps=${MAX_STEPS}
                - --warmup_steps=100
                - --lr=1.5e-4
                - --grad_accum_usteps=16
                - --seq_len=8192
                - --sequence_parallel_enabled
                - --selective_checkpoint_enabled
                - --logging_interval=10
                - --qkv_linear
                - --kv_replicator=4
                - --use_flash_attention=1
                - --use_zero_1
                - --use_mix_precision
                - --checkpoint_freq=${CHECKPOINT_FREQ}
                - --num_kept_checkpoint=${NUM_KEPT_CHECKPOINTS}
                - --checkpoint_dir=${CHECKPOINT_DIR}
              volumeMounts:
                - name: shmem
                  mountPath: /dev/shm
                - name: persistent-storage
                  mountPath: /fsx
                - name: hyperpod
                  mountPath: /var/log/aws/clusters