[BE] [CI] Set up 1xL4, 1xH100, 4xH100 CI workflows #2561
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2561
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures
As of commit 99e42f0 with merge base dd6a4f5.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
pytest test/float8/test_compile.py --verbose -s
pytest test/float8/test_numerics_integration.py --verbose -s
pytest test/float8/test_auto_filter.py --verbose -s
pytest test/integration --verbose -s
nit: These are not specific to float8, but I wanted to make sure they were run on an H100 node. It feels a little strange to have them in this file. Can we just keep all the tests closer to the workflow, so that when people want to expand the H100 tests they can update it there?
Also, in retrospect, the job name should be changed to "h100 tests" or whatever.
Hmm, I see. Yeah, I guess we can split workflows based on hardware type. That would also make scheduling/allocation faster, since all tests that need 1xH100 would share one job, all tests that need 4xH100 would share another, etc.
splitting tests by product and CI jobs by hardware type makes sense to me
Split the CI workflows into:
- 1xH100
- 4xH100
- 1xL4 (SM89)
(We could also name them SM89/SM90)
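To make the "one job per hardware type" idea concrete, here is a minimal Python sketch of the grouping, not the actual workflow configuration from this PR. The four `1xH100` commands are the ones from the diff above; the `4xH100` and `1xL4` entries use hypothetical placeholder paths, and the `TEST_HARDWARE` mapping and `group_by_hardware` helper are names invented for illustration.

```python
from collections import defaultdict

# Assumed mapping from test command to the hardware it needs.
# The 1xH100 commands are taken from the diff under review; the
# 4xH100 and 1xL4 paths are hypothetical placeholders.
TEST_HARDWARE = {
    "pytest test/float8/test_compile.py --verbose -s": "1xH100",
    "pytest test/float8/test_numerics_integration.py --verbose -s": "1xH100",
    "pytest test/float8/test_auto_filter.py --verbose -s": "1xH100",
    "pytest test/integration --verbose -s": "1xH100",
    "pytest test/hypothetical_multi_gpu --verbose -s": "4xH100",
    "pytest test/hypothetical_sm89 --verbose -s": "1xL4",
}


def group_by_hardware(test_hardware):
    """Collect test commands into one job per hardware type."""
    jobs = defaultdict(list)
    for command, hardware in test_hardware.items():
        jobs[hardware].append(command)
    return dict(jobs)


if __name__ == "__main__":
    # Print each hardware job followed by the commands it would run.
    for hardware, commands in sorted(group_by_hardware(TEST_HARDWARE).items()):
        print(f"{hardware} job:")
        for command in commands:
            print(f"  {command}")
```

With this layout, each hardware type is requested once per CI run, and expanding the H100 tests means adding one entry next to the others rather than editing a product-specific script.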
This is ready for another look.
Summary
Multi-GPU tests are not currently run in CI. This PR makes the following changes to support this:
As a next step, I'll add our MoE training tests to the 4xH100 workflow.