Added support for AWS Trainium accelerator #2690

Merged
4 commits merged into skypilot-org:master
Nov 17, 2023

Conversation

mmcclean-aws
Contributor

Added support for the AWS Trainium Accelerator

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: bash tests/backward_comaptibility_tests.sh

@Michaelvll Michaelvll requested a review from cblmemo October 18, 2023 07:36
Collaborator

@cblmemo cblmemo left a comment

Thanks for contributing to SkyPilot! It is exciting ; ) Left some comments 🙂
I tried

$ sky show-gpus Trainium
Resources 'Trainium' not found. Try 'sky show-gpus --all' to show available accelerators.

and got the error above (Inferentia seems to have the same problem). Could you resolve this?
Also, it seems the workflow has failed. Do we need to update the skypilot-catalog repo too? cc @Michaelvll for a look

@@ -287,6 +287,12 @@ def get_additional_columns(row) -> pd.Series:
     if row['InstanceType'] == 'p4de.24xlarge':
         acc_name = 'A100-80GB'
         acc_count = 8
     if row['InstanceType'] == 'trn1.2xlarge':
Collaborator

Shall we add a reference link for this instance type, such as https://aws.amazon.com/ec2/instance-types/trn1/ ?

Contributor Author

Yes, good idea.
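
As a side note on the hunk under review, the per-instance overrides in get_additional_columns could also be written as a lookup table. The following is only a sketch under stated assumptions: ACCELERATOR_OVERRIDES and lookup_accelerator are hypothetical names, the 'Trainium' label and the trn1 counts are taken from the public AWS trn1 instance page rather than from this PR's diff, and only the p4de.24xlarge entry comes from the hunk itself.

```python
from typing import Optional, Tuple

# Hypothetical override table, for illustration only.
# The p4de.24xlarge entry mirrors the context lines of the hunk; the trn1
# entries are assumptions based on the AWS trn1 instance page.
ACCELERATOR_OVERRIDES = {
    'p4de.24xlarge': ('A100-80GB', 8),
    'trn1.2xlarge': ('Trainium', 1),
    'trn1.32xlarge': ('Trainium', 16),
}


def lookup_accelerator(instance_type: str) -> Optional[Tuple[str, int]]:
    """Return (accelerator name, count) for a listed instance type, else None."""
    return ACCELERATOR_OVERRIDES.get(instance_type)


print(lookup_accelerator('trn1.2xlarge'))
```

A table like this keeps the reference link suggested above next to the data it documents, at the cost of diverging from the if-chain style the fetcher already uses.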

@mmcclean-aws
Contributor Author

@cblmemo I see the catalog CSV file here is not up to date with the Trainium accelerator type. How can this be regenerated?

@Michaelvll
Collaborator

> @cblmemo I see the catalog csv file here is not up to date with the Trainium accelerator type. How can this be regenerated ?

Hi @mmcclean-aws, sorry for the delay! The catalog file is generated by the command: python -m sky.clouds.service_catalog.data_fetchers.fetch_aws

If you would like to update the information for Trainium, it would be nice to update the file https://github.com/skypilot-org/skypilot/blob/master/sky/clouds/service_catalog/data_fetchers/fetch_aws.py accordingly.
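
After regenerating the catalog with the command above, one way to confirm that Trainium rows made it into the CSV is a small check like the one below. This is a sketch, not part of the PR: SAMPLE_CATALOG is a hypothetical excerpt, the AcceleratorName and AcceleratorCount column names are assumptions (only InstanceType appears in the diff), and instance_types_for is an illustrative helper.

```python
import csv
import io

# Hypothetical excerpt of a regenerated catalog. Real catalog files have more
# columns and rows, and the column names used here are assumptions.
SAMPLE_CATALOG = """InstanceType,AcceleratorName,AcceleratorCount
p4de.24xlarge,A100-80GB,8
trn1.2xlarge,Trainium,1
"""


def instance_types_for(catalog_csv: str, accelerator: str) -> list:
    """List instance types whose AcceleratorName matches, ignoring case."""
    reader = csv.DictReader(io.StringIO(catalog_csv))
    return [
        row['InstanceType']
        for row in reader
        if row['AcceleratorName'].lower() == accelerator.lower()
    ]


print(instance_types_for(SAMPLE_CATALOG, 'trainium'))
```

An empty result for an accelerator name would match the symptom reported earlier in this thread, where sky show-gpus Trainium found nothing.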

Collaborator

@Michaelvll Michaelvll left a comment

I just merged the fetch_aws.py change into master first so that our catalog can be automatically updated with Trainium. Merging this PR now. Thanks for adding the support for Trainium @mmcclean-aws!

I just tried sky launch --gpus trainium and installed neuron-ls by logging directly into the VM. It seems to be working!

ubuntu@ip-172-31-35-125:~$ /opt/aws/neuron/bin/neuron-ls
+--------+--------+--------+---------+
| NEURON | NEURON | NEURON |   PCI   |
| DEVICE | CORES  | MEMORY |   BDF   |
+--------+--------+--------+---------+
| 0      | 2      | 32 GB  | 00:1d.0 |
+--------+--------+--------+---------+

Thanks for the contribution @mmcclean-aws!
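
For anyone who wants to script the same sanity check, the neuron-ls table shown above can be parsed with a few lines of Python. This is a minimal sketch that assumes the four-column layout from this single sample; other neuron-ls versions may print different columns, and parse_neuron_ls is a hypothetical helper, not part of SkyPilot or the Neuron tools.

```python
# Sample copied from the neuron-ls output in the comment above.
SAMPLE_OUTPUT = """\
+--------+--------+--------+---------+
| NEURON | NEURON | NEURON |   PCI   |
| DEVICE | CORES  | MEMORY |   BDF   |
+--------+--------+--------+---------+
| 0      | 2      | 32 GB  | 00:1d.0 |
+--------+--------+--------+---------+
"""


def parse_neuron_ls(table: str) -> list:
    """Return one dict per device row of a neuron-ls style ASCII table."""
    devices = []
    for raw_line in table.splitlines():
        line = raw_line.strip()
        if not line.startswith('|'):
            continue  # skip the +----+ border lines and blank lines
        cells = [cell.strip() for cell in line.strip('|').split('|')]
        if len(cells) != 4 or not cells[0].isdigit():
            continue  # skip the two header lines
        devices.append({
            'device': int(cells[0]),
            'cores': int(cells[1]),
            'memory': cells[2],
            'pci_bdf': cells[3],
        })
    return devices


devices = parse_neuron_ls(SAMPLE_OUTPUT)
print(len(devices), sum(d['cores'] for d in devices))
```

On the sample above this finds a single device with two cores, which matches what the table shows for the launched VM.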

@Michaelvll Michaelvll linked an issue Nov 17, 2023 that may be closed by this pull request
@Michaelvll Michaelvll merged commit 623c10d into skypilot-org:master Nov 17, 2023
Development

Successfully merging this pull request may close these issues.

Add support for AWS Trainium and Inferentia