# add Hyperpod lab for NxD llama3 model training #29
flamingofugang wants to merge 2 commits into aws-neuron:main from flamingofugang:main
## Dockerfile

```dockerfile
FROM 763104351884.dkr.ecr.us-east-2.amazonaws.com/pytorch-training-neuronx:2.7.0-neuronx-py310-sdk2.24.1-ubuntu22.04

# Clone NxD and assemble the training scripts and model configs under /workspace
RUN git clone https://github.com/aws-neuron/neuronx-distributed.git
COPY ./src /workspace
RUN cp -r neuronx-distributed/examples/training/llama/* /workspace/
RUN cp -r neuronx-distributed/examples/training/llama/tp_zero1_llama_hf_pretrain/* /workspace/
RUN cp -r neuronx-distributed/examples/training/llama/tp_zero1_llama_hf_pretrain/8B_config_llama3.1 /workspace/config_8b_llama3.1
RUN cp -r neuronx-distributed/examples/training/llama/tp_zero1_llama_hf_pretrain/8B_config_llama3 /workspace/config_8b_llama3
RUN mv /workspace/tp_zero1_llama_hf_pretrain.py /workspace/train.py

WORKDIR /workspace

RUN pip install -r requirements.txt
```
# Build on Trainium Start Guide Using SageMaker HyperPod

In this tutorial, we will use the NeuronX Distributed (NxD) library to train a Llama 3 model, as in [this workshop](https://catalog.workshops.aws/sagemaker-hyperpod-eks/en-US/03-trainium-nxd).

If you want to use a SageMaker AI Studio space to run this workshop, and your account does not yet have a VPC or SageMaker domain, follow [the CloudFormation deployment here](https://catalog.workshops.aws/sagemaker-hyperpod-eks/en-US/00-setup/env-setup/01-env-sm-code-editor) to create the SageMaker AI domain and VS Code editor space. The SageMaker domain is created in the default VPC. Once deployed, open SageMaker AI Studio and run the default Code Editor space.

### Step 1: Build the Container

Similar to [this workshop](https://catalog.workshops.aws/sagemaker-hyperpod-eks/en-US/03-trainium-nxd/01-setup), we first need to build the container image that runs the training job, using the latest Neuron SDK base container:

```bash
region=us-east-2
dlc_account_id=763104351884
aws ecr get-login-password --region $region | docker login --username AWS --password-stdin $dlc_account_id.dkr.ecr.$region.amazonaws.com

docker pull 763104351884.dkr.ecr.us-east-2.amazonaws.com/pytorch-training-neuronx:2.7.0-neuronx-py310-sdk2.24.1-ubuntu22.04
```

Clone the repo and go to the folder:
```bash
cd ~
git clone https://github.com/aws-neuron/neuron-workshops
cd neuron-workshops/labs/Hyperpod
```

We will build the Docker image using the Dockerfile in this directory.
```bash
export AWS_REGION=$(aws ec2 describe-availability-zones --output text --query 'AvailabilityZones[0].[RegionName]')
export ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
export REGISTRY=${ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com/
export IMAGE=llama3_trn
export TAG=:latest
# DOCKER_NETWORK is optional (e.g. "--network sagemaker" when building inside a SageMaker Studio space); leave it unset otherwise
docker build $DOCKER_NETWORK -t ${REGISTRY}${IMAGE}${TAG} .
```

Then push the image to the ECR private registry:
```bash
# Create registry if needed
export REGISTRY_COUNT=$(aws ecr describe-repositories | grep \"${IMAGE}\" | wc -l)
if [ "${REGISTRY_COUNT//[!0-9]/}" == "0" ]; then
    echo "Creating repository ${REGISTRY}${IMAGE} ..."
    aws ecr create-repository --repository-name ${IMAGE}
else
    echo "Repository ${REGISTRY}${IMAGE} already exists"
fi

# Login to registry
echo "Logging in to $REGISTRY ..."
aws ecr get-login-password | docker login --username AWS --password-stdin $REGISTRY

# Push image to registry
docker image push ${REGISTRY}${IMAGE}${TAG}
```

### Step 2: Create HyperPod Cluster
You can use [the CloudFormation deployment here](https://catalog.workshops.aws/sagemaker-hyperpod-eks/en-US/00-setup/00-workshop-infra-cfn) to create the HyperPod cluster with EKS.

Here are the parameters to change to use the ml.trn1.32xlarge instance in us-west-2:
1. Set AvailabilityZoneId to usw2-az4 to improve the chances of getting on-demand instances.
2. Set UsingSMCodeEditor to True if you want to access the cluster from the VS Code editor in the SageMaker AI domain.
3. Set AcceleratedInstanceType to ml.trn1.32xlarge.
4. Set the Kubernetes version to 1.32.
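
You can confirm from the CLI that the stack reached CREATE_COMPLETE before proceeding. A minimal sketch; the stack name below is hypothetical, so substitute whatever you named the stack at deploy time:

```bash
# Hypothetical stack name - replace with the name you chose for the CloudFormation stack
STACK_NAME=hyperpod-eks-stack
aws cloudformation describe-stacks \
  --stack-name "$STACK_NAME" \
  --query 'Stacks[0].StackStatus' \
  --output text   # should print CREATE_COMPLETE
```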

Once the CloudFormation deployment finishes successfully, you can manually verify that the VPC, subnet, and security group match the CloudFormation deployment outputs, and you can execute the [same commands in this workshop](https://catalog.workshops.aws/sagemaker-hyperpod-eks/en-US/00-setup/00-workshop-infra-cfn#environment-variables) to set up environment variables.
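
With the environment variables set, point kubectl at the EKS cluster that orchestrates the HyperPod nodes. This is a minimal sketch, assuming the workshop's environment-variable script exported EKS_CLUSTER_NAME:

```bash
# Update the local kubeconfig so kubectl can talk to the EKS control plane
aws eks update-kubeconfig --region $AWS_REGION --name $EKS_CLUSTER_NAME

# The trn1 nodes should appear once the HyperPod cluster is up
kubectl get nodes -L node.kubernetes.io/instance-type
```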

You will also need to set up an FSx for Lustre file system through [Dynamic Provisioning in this workshop](https://catalog.workshops.aws/sagemaker-hyperpod-eks/en-US/01-cluster/06-fsx-for-lustre). Note that this PVC is created in the default namespace, and your training job pod will need to be in the same namespace.
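
Before moving on, you can verify that the claim is bound. The PVC name below matches the FSX_CLAIM default in generate-jobspec.sh; change it if you used a different name:

```bash
# STATUS should be "Bound"; note the PVC lives in the default namespace
kubectl get pvc fsx-claim -n default
```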

### Step 3: Start Training Job

Let us launch a training job to train the Llama 3.1 8B model. First, update HF_ACCESS_TOKEN in the generate-jobspec.sh file. Then execute it:
```bash
./generate-jobspec.sh
```
The script creates two YAML files: tokenize_data.yaml and llama3_train.yaml.

Next, download the dataset from the Hugging Face Hub and tokenize it using the tokenize_data.yaml job. The job stores the tokenized dataset on FSx for Lustre for training the model in the next step.

```bash
kubectl apply -f ./tokenize_data.yaml
```

To list all pods across all namespaces:
```bash
kubectl get pods --all-namespaces
```

The tokenize-data pod should run in the default namespace. To describe the pod:
```bash
kubectl describe pod tokenize-data
```

To check logs:
```bash
kubectl logs -f tokenize-data
```
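
Tokenization can take a while; the pod's STATUS column flips to Completed when it has finished:

```bash
# Wait for STATUS "Completed" before starting the training job
kubectl get pod tokenize-data
```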

Once the tokenize-data pod is complete, you can use the llama3_train.yaml job spec file to train the Llama 3.1 8B model with the tokenized data from the previous step.

```bash
kubectl apply -f ./llama3_train.yaml
```

You should see two pods (etcd and trn1-llama3-worker-0) running. Similarly, to check logs:
```bash
kubectl logs -f trn1-llama3-worker-0
```

If the pod is not in a running state, you can delete the job:
```bash
kubectl delete -f ./llama3_train.yaml
```

Once the job starts running successfully, you can run a command line inside the container:
```bash
kubectl exec -it trn1-llama3-worker-0 -- neuron-top
```
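
If you only need a static view of the devices rather than the live monitor, the Neuron SDK tooling in the container also includes neuron-ls:

```bash
# List the Neuron devices and cores visible inside the worker pod
kubectl exec -it trn1-llama3-worker-0 -- neuron-ls
```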

You may see something similar to this:
<img src="figures/neuron-top.png" width="888">

Ctrl+C to exit the visualization.

You can check the running job status in HyperPod Task Governance as well:
<img src="figures/taskgovernance.png" width="888">

To clean up, you can delete all of the pods:
```bash
kubectl delete -f ./llama3_train.yaml
kubectl delete -f ./tokenize_data.yaml
```
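
A quick check that the cleanup went through:

```bash
# The worker, etcd, and tokenize-data pods should all be gone
kubectl get pods
```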
## generate-jobspec.sh

```bash
#!/bin/bash

export AWS_REGION=$(aws ec2 describe-availability-zones --output text --query 'AvailabilityZones[0].[RegionName]')
export ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
export REGISTRY=${ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com/
export IMAGE=llama3_trn
export TAG=:latest
export IMAGE_URI=${REGISTRY}${IMAGE}${TAG}

export JOB_NAME=trn1-llama3-training
export NUM_NODES=1
export INSTANCE_TYPE=ml.trn1.32xlarge
export EFA_PER_NODE=8
export NEURON_PER_NODE=16
export FI_PROVIDER=efa

export FSX_CLAIM=fsx-claim # Change this according to the PVC created.

# Tokenize_data configs

export HF_ACCESS_TOKEN=hf_xxxxxx # Replace with your Hugging Face access token.
export TOKENIZED_DATA_PATH=/fsx/tokenized_data
export DATASET_NAME=wikicorpus
export DATASET_CONFIG_NAME=raw_en
export HF_MODEL_NAME=meta-llama/Meta-Llama-3-8B # To train the Llama 3.1 8B model instead, use the corresponding Llama 3.1 model name and set MODEL_PATH=config_8b_llama3.1 below.

export NEURON_CACHE_DIR=/fsx/neuron_cache
export CHECKPOINT_DIR=/fsx/checkpoints
export NUM_KEPT_CHECKPOINTS=2
export CHECKPOINT_FREQ=100
export MAX_STEPS=1000
export STEPS_THIS_RUN=100
export BATCH_SIZE=1

export MODEL_PATH=config_8b_llama3

cat tokenize_data.yaml-template | envsubst > tokenize_data.yaml

cat llama3_train.yaml-template | envsubst > llama3_train.yaml
```

> **Review comment:** I remember that we need to create the compilation script too. Can you double check this?
## llama3_train.yaml-template

```yaml
apiVersion: v1
kind: Service
metadata:
  name: etcd
spec:
  ports:
    - name: etcd-client-port
      port: 2379
      protocol: TCP
      targetPort: 2379
  selector:
    app: etcd

---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: etcd
  name: etcd
spec:
  replicas: 1
  selector:
    matchLabels:
      app: etcd
  template:
    metadata:
      labels:
        app: etcd
    spec:
      containers:
        - name: etcd
          command: ["/usr/local/bin/etcd"]
          args:
            - "--data-dir"
            - "/var/lib/etcd"
            - "--enable-v2"
            - "--listen-client-urls"
            - "http://0.0.0.0:2379"
            - "--advertise-client-urls"
            - "http://0.0.0.0:2379"
            - "--initial-cluster-state"
            - "new"
          image: quay.io/coreos/etcd:v3.5.19
          ports:
            - containerPort: 2379
              name: client
              protocol: TCP
            - containerPort: 2380
              name: server
              protocol: TCP
      restartPolicy: Always
---
apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
  name: trn1-llama3
spec:
  elasticPolicy:
    rdzvBackend: etcd
    rdzvHost: etcd
    rdzvPort: 2379
    minReplicas: 1
    maxReplicas: 64
    maxRestarts: 100
    metrics:
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 90
  pytorchReplicaSpecs:
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          labels:
            app: trn1-llama3
        spec:
          volumes:
            - name: shmem
              hostPath:
                path: /dev/shm
            - name: persistent-storage
              persistentVolumeClaim:
                claimName: ${FSX_CLAIM}
            - name: local
              hostPath:
                path: /dev
            - name: hyperpod
              hostPath:
                path: /var/log/aws/clusters
          nodeSelector:
            node.kubernetes.io/instance-type: ${INSTANCE_TYPE}
          containers:
            - name: pytorch
              image: ${IMAGE_URI}
              imagePullPolicy: Always
              resources:
                requests:
                  aws.amazon.com/neuron: ${NEURON_PER_NODE}
                  vpc.amazonaws.com/efa: ${EFA_PER_NODE}
                limits:
                  aws.amazon.com/neuron: ${NEURON_PER_NODE}
                  vpc.amazonaws.com/efa: ${EFA_PER_NODE}
              env:
                - name: LOGLEVEL
                  value: "DEBUG"
                - name: FI_PROVIDER
                  value: efa
                - name: FI_EFA_USE_DEVICE_RDMA
                  value: "1"
                - name: FI_EFA_FORK_SAFE
                  value: "1"
                - name: FI_LOG_LEVEL
                  value: "1"
                - name: FI_EFA_ENABLE_SHM_TRANSFER
                  value: "1"
                - name: NEURON_RT_NUM_CORES
                  value: "32"
                - name: NUM_NEURONCORES
                  value: "32"
                - name: TPU_NUM_DEVICES
                  value: "32"
                - name: TPU_CHIPS_PER_HOST_BOUNDS
                  value: "32"
                - name: TORCH_NCCL_DEBUG_INFO_TEMP_FILE
                  value: "/local/nccl_trace_rank_"
                - name: PYTORCH_CUDA_ALLOC_CONF
                  value: "expandable_segments:True"
                - name: MALLOC_ARENA_MAX
                  value: "64"
                - name: NCCL_SOCKET_IFNAME
                  value: "^lo"
                - name: NEURON_RT_ASYNC_EXEC_MAX_INFLIGHT_REQUESTS
                  value: "3"
                - name: NEURON_FUSE_SOFTMAX
                  value: "1"
                - name: NEURON_CC_FLAGS
                  value: "--model-type transformer --distribution-strategy=llm-training --cache_dir=${NEURON_CACHE_DIR}"
              command:
                - torchrun
                - --nproc_per_node=32
                - --nnodes=${NUM_NODES}
                - train.py
                - --model_path=${MODEL_PATH}
                - --data_dir=${TOKENIZED_DATA_PATH}/${DATASET_NAME}_llama3_tokenized_8k
                - --tensor_parallel_size=32
                - --batch_size=${BATCH_SIZE}
                - --steps_this_run=${STEPS_THIS_RUN}
                - --max_steps=${MAX_STEPS}
                - --warmup_steps=100
                - --lr=1.5e-4
                - --grad_accum_usteps=16
                - --seq_len=8192
                - --sequence_parallel_enabled
                - --selective_checkpoint_enabled
                - --logging_interval=10
                - --qkv_linear
                - --kv_replicator=4
                - --use_flash_attention=1
                - --use_zero_1
                - --use_mix_precision
                - --checkpoint_freq=${CHECKPOINT_FREQ}
                - --num_kept_checkpoint=${NUM_KEPT_CHECKPOINTS}
                - --checkpoint_dir=${CHECKPOINT_DIR}
              volumeMounts:
                - name: shmem
                  mountPath: /dev/shm
                - name: persistent-storage
                  mountPath: /fsx
                - name: hyperpod
                  mountPath: /var/log/aws/clusters
```
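
Since this spec is a template with ${...} placeholders, it only becomes a valid manifest after generate-jobspec.sh runs envsubst over it. You can sanity-check the rendered file before applying it:

```bash
# Render the template (generate-jobspec.sh already does this) and validate
# the manifest client-side without creating any resources
envsubst < llama3_train.yaml-template > llama3_train.yaml
kubectl apply --dry-run=client -f ./llama3_train.yaml
```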
> **Review comment** (on the HF_MODEL_NAME setting in generate-jobspec.sh): Can we change this to NousResearch/Meta-Llama-3-8B-Instruct for the workshop so users do not need to provide the Hugging Face token ID and get approval from Meta?