Added support for AWS Trainium accelerator #2690
Thanks for contributing to SkyPilot! It is exciting ; ) Left some comments 🙂
I tried
$ sky show-gpus Trainium
Resources 'Trainium' not found. Try 'sky show-gpus --all' to show available accelerators.
and got the error above (Inferentia seems to have the same problem). Could you resolve this?
Also, it seems the workflow has failed. Do we need to update the skypilot-catalog repo too? cc @Michaelvll for a look
@@ -287,6 +287,12 @@ def get_additional_columns(row) -> pd.Series:
if row['InstanceType'] == 'p4de.24xlarge':
    acc_name = 'A100-80GB'
    acc_count = 8
if row['InstanceType'] == 'trn1.2xlarge':
Shall we add a reference link for this instance type? Such as https://aws.amazon.com/ec2/instance-types/trn1/
yes, good idea
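For readers following the diff above, the chain of `if row['InstanceType'] == ...` overrides can be pictured as a lookup table. This is only a minimal sketch, not the code in fetch_aws.py: the table name `SPECIAL_ACCELERATORS` and the helper `accelerator_for` are hypothetical, and the accelerator name `'Trainium'` and the `trn1` counts are assumptions (one Trainium device on trn1.2xlarge and sixteen on trn1.32xlarge, per https://aws.amazon.com/ec2/instance-types/trn1/). Only the p4de.24xlarge entry is taken from the diff.

```python
from typing import Dict, Optional, Tuple

# Hypothetical override table: instance type -> (accelerator name, count).
# The p4de.24xlarge entry mirrors the diff; the trn1 entries are assumptions
# based on the AWS trn1 instance-type page.
SPECIAL_ACCELERATORS: Dict[str, Tuple[str, int]] = {
    'p4de.24xlarge': ('A100-80GB', 8),
    'trn1.2xlarge': ('Trainium', 1),
    'trn1.32xlarge': ('Trainium', 16),
}


def accelerator_for(instance_type: str) -> Optional[Tuple[str, int]]:
    """Return (name, count) for a special-cased instance type, else None."""
    return SPECIAL_ACCELERATORS.get(instance_type)


if __name__ == '__main__':
    for itype in ('p4de.24xlarge', 'trn1.2xlarge', 'm5.large'):
        print(itype, accelerator_for(itype))
```

A table like this keeps the reference link and the per-instance values in one place, which may make later additions (for example Inferentia) easier to review.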
Hi @mmcclean-aws, sorry for the delay! The catalog file is generated by the command: If you would like to update the information for Trainium, it would be nice to update the file https://github.com/skypilot-org/skypilot/blob/master/sky/clouds/service_catalog/data_fetchers/fetch_aws.py accordingly.
I just merged the fetch_aws.py change into master first so that our catalog can be automatically updated with Trainium. Merging this PR now. Thanks for adding the support for Trainium @mmcclean-aws!
Just tried sky launch --gpus trainium and installed neuron-ls by logging directly into the VM. It seems to be working!
ubuntu@ip-172-31-35-125:~$ /opt/aws/neuron/bin/neuron-ls
+--------+--------+--------+---------+
| NEURON | NEURON | NEURON | PCI |
| DEVICE | CORES | MEMORY | BDF |
+--------+--------+--------+---------+
| 0 | 2 | 32 GB | 00:1d.0 |
+--------+--------+--------+---------+
Thanks for the contribution @mmcclean-aws!
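If someone wants to script the manual check above, a rough sketch along these lines could parse the table that neuron-ls printed and report the device and core counts. It assumes the output keeps exactly the ASCII table layout shown in this thread; the function name `parse_neuron_ls` is hypothetical, and the sample string is copied from the output above rather than produced by calling the tool.

```python
from typing import Dict, List

# Sample copied from the neuron-ls output posted in this thread.
SAMPLE = """\
+--------+--------+--------+---------+
| NEURON | NEURON | NEURON | PCI     |
| DEVICE | CORES  | MEMORY | BDF     |
+--------+--------+--------+---------+
| 0      | 2      | 32 GB  | 00:1d.0 |
+--------+--------+--------+---------+
"""


def parse_neuron_ls(table: str) -> List[Dict[str, str]]:
    """Parse data rows (rows whose first cell is a device index)."""
    devices = []
    for line in table.splitlines():
        # Drop the outer pipes, then split the row into trimmed cells.
        cells = [c.strip() for c in line.strip().strip('|').split('|')]
        # Border and header rows do not start with a numeric device index.
        if len(cells) == 4 and cells[0].isdigit():
            devices.append({
                'device': cells[0],
                'cores': cells[1],
                'memory': cells[2],
                'pci_bdf': cells[3],
            })
    return devices


if __name__ == '__main__':
    rows = parse_neuron_ls(SAMPLE)
    total_cores = sum(int(r['cores']) for r in rows)
    print(f'{len(rows)} device(s), {total_cores} core(s)')
```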
Added support for the AWS Trainium Accelerator
Tested (run the relevant ones):
bash format.sh
pytest tests/test_smoke.py
pytest tests/test_smoke.py::test_fill_in_the_name
bash tests/backward_comaptibility_tests.sh