danielvegamyhre
Contributor

@danielvegamyhre danielvegamyhre commented Jul 16, 2025

Summary

Multi-GPU tests are not currently run in CI. This PR makes the following changes to support this:

  • Instead of splitting CI tests by feature (namely, float8), we split by hardware:
    • 1xH100 tests (single gpu integration, float8 tests)
    • 4xH100 tests (multi-gpu float8 distributed tests)
    • 1xL4 tests (single gpu integration, float8 tests)
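
To illustrate the hardware-based split, here is a minimal Python sketch (not the actual workflow configuration). `TESTS_BY_HARDWARE` and `pytest_commands` are hypothetical names. The 1xH100 paths are the pytest targets quoted in this thread, assumed here to belong to the 1xH100 group; the other groups are left empty because their test paths are not listed in this conversation.

```python
# Hypothetical grouping of test targets by the hardware they need.
# Assumption: the four paths below run in the 1xH100 group.
TESTS_BY_HARDWARE = {
    "1xH100": [
        "test/float8/test_compile.py",
        "test/float8/test_numerics_integration.py",
        "test/float8/test_auto_filter.py",
        "test/integration",
    ],
    "4xH100": [],  # multi-GPU float8 distributed tests (paths not listed here)
    "1xL4": [],    # single-GPU integration and float8 tests (paths not listed here)
}


def pytest_commands(hardware: str) -> list[str]:
    """Return one pytest invocation per test target in the given hardware group."""
    return [f"pytest {path} --verbose -s" for path in TESTS_BY_HARDWARE[hardware]]


if __name__ == "__main__":
    for command in pytest_commands("1xH100"):
        print(command)
```

Keying the mapping by hardware rather than by feature means that adding a new test to a runner type is a one-line change in that group's list.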

As a next step, I'll add our MoE training tests to the 4xH100 workflow.


pytorch-bot bot commented Jul 16, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2561

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 99e42f0 with merge base dd6a4f5 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jul 16, 2025
@danielvegamyhre danielvegamyhre added ci float8 topic: not user facing Use this tag if you don't want this PR to show up in release notes labels Jul 16, 2025
@danielvegamyhre danielvegamyhre requested review from vkuzo and drisspg July 16, 2025 16:49
@danielvegamyhre
Contributor Author

cc @vkuzo @drisspg for review

pytest test/float8/test_compile.py --verbose -s
pytest test/float8/test_numerics_integration.py --verbose -s
pytest test/float8/test_auto_filter.py --verbose -s
pytest test/integration --verbose -s
Contributor

nit: These are not specific to float8, but I wanted to make sure they were run on an H100 node. It feels a little strange to have them in this file. Can we keep all the tests closer to the workflow, so that when people want to expand the H100 tests they can update them there?

Also, in retrospect, the job name should be changed to "h100 tests" or whatever.

Contributor Author

Hmm, I see. Yeah, I guess we can split workflows based on hardware type. That would also make scheduling/allocation faster, since all tests that need 1xH100 would share one job, all tests that need 4xH100 would share another, etc.

Contributor

Splitting tests by product and CI jobs by hardware type makes sense to me.

Contributor Author

Split the CI workflows into:

  • 1xH100
  • 4xH100
  • 1xL4 (SM89)

(We could also name them SM89/SM90)
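
As a rough illustration of the two naming options, here is a small Python sketch. `sm_label` and `workflow_name` are hypothetical helpers, not part of the repo. The compute capabilities used are 8.9 for L4 and 9.0 for H100.

```python
# Compute capability (major, minor) for the two GPU types used in these workflows.
COMPUTE_CAPABILITY = {"L4": (8, 9), "H100": (9, 0)}


def sm_label(gpu: str) -> str:
    """Return the SM-style label for a GPU type, e.g. 'SM90' for H100."""
    major, minor = COMPUTE_CAPABILITY[gpu]
    return f"SM{major}{minor}"


def workflow_name(count: int, gpu: str, by_sm: bool = False) -> str:
    """Hypothetical naming helper: count-based ('4xH100') or SM-based ('SM90')."""
    return sm_label(gpu) if by_sm else f"{count}x{gpu}"


if __name__ == "__main__":
    print(workflow_name(1, "L4"), workflow_name(1, "L4", by_sm=True))
    print(workflow_name(4, "H100"), workflow_name(4, "H100", by_sm=True))
```

One thing the sketch makes visible: SM-based names alone do not distinguish 1xH100 from 4xH100, since both map to SM90, so a GPU count would still be needed in the name.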

This is ready for another look.

@danielvegamyhre danielvegamyhre changed the title [BE] [CI] Single and multi device float8 test workflows in ci [BE] [CI] Single and multi GPU CI workflows Jul 18, 2025
@danielvegamyhre danielvegamyhre changed the title [BE] [CI] Single and multi GPU CI workflows [BE] [CI] Set up single and multi GPU CI workflows Jul 18, 2025
@danielvegamyhre danielvegamyhre changed the title [BE] [CI] Set up single and multi GPU CI workflows [BE] [CI] Set up 1xL4, 1xH100, 4xH100 CI workflows Jul 18, 2025
@danielvegamyhre danielvegamyhre merged commit 3460951 into main Jul 18, 2025
20 checks passed
Labels
ci CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. float8 moe topic: not user facing Use this tag if you don't want this PR to show up in release notes

3 participants