
Conversation

@LucasWilkinson (Collaborator) commented May 14, 2025

323.20 MB -> 324.36 MB

To help with the growing wheel size due to Blackwell, allow shipping PTX for heavy kernels that don't take advantage of new hardware features. There are now enough different gencodes for certain kernels that it makes sense to ship a single PTX implementation instead of multiple SASS binaries. This currently grows the wheel size slightly, but should help keep it capped as the Blackwell gencodes are added.
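The PTX-vs-SASS tradeoff described above comes down to which `code=` targets are passed to nvcc: `code=sm_XX` embeds SASS for exactly one arch, while `code=compute_XX` embeds PTX that the driver can JIT-compile for that arch and anything newer. A minimal illustrative sketch of flag construction (not vLLM's actual build logic; the function name is hypothetical):

```python
def gencode_flags(archs, ship_ptx_only=False):
    """Build nvcc -gencode flags for CUDA arches given as strings like '8.0'.

    code=sm_XX embeds SASS for exactly that arch; code=compute_XX embeds
    PTX that the driver can JIT-compile for that arch and anything newer.
    """
    def cc(a):  # '8.0' -> '80'
        return a.replace(".", "")

    if ship_ptx_only:
        # A single PTX target at the lowest arch covers all newer hardware.
        lowest = min(archs, key=float)
        return [f"-gencode=arch=compute_{cc(lowest)},code=compute_{cc(lowest)}"]
    # Otherwise, one SASS cubin per arch (this is what grows the wheel).
    return [f"-gencode=arch=compute_{cc(a)},code=sm_{cc(a)}" for a in archs]

print(gencode_flags(["8.0", "9.0"]))
# ['-gencode=arch=compute_80,code=sm_80', '-gencode=arch=compute_90,code=sm_90']
print(gencode_flags(["8.0", "9.0"], ship_ptx_only=True))
# ['-gencode=arch=compute_80,code=compute_80']
```

The sketch shows why PTX caps the size: the SASS list grows linearly with supported arches, while the PTX variant stays a single entry.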

Signed-off-by: Lucas Wilkinson <[email protected]>

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly. You can run additional CI tests by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to be added to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the ci/build label May 14, 2025
@LucasWilkinson LucasWilkinson changed the title [WIP][Build] Allow shipping PTX on a per-file basis [Build] Allow shipping PTX on a per-file basis May 14, 2025
```diff
  endif()

- list(SORT SRC_CUDA_ARCHS COMPARE NATURAL ORDER ASCENDING)
+ list(SORT _SRC_CUDA_ARCHS COMPARE NATURAL ORDER ASCENDING)
```
Contributor
Do the _TGT_CUDA_ARCHS need to be sorted too?

@LucasWilkinson (Collaborator, Author) May 14, 2025
For a more general utility you're right, they should be! But the target arches come from extract_unique_cuda_archs_ascending, so they are already sorted. I can open a new PR to refactor some of this, though; I'd like to preserve current behavior as much as possible in this one.
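For reference, CMake's `COMPARE NATURAL` compares runs of digits by numeric value rather than lexicographically, which is what keeps an arch like "10.0" after "9.0". An equivalent key function in Python (illustrative sketch only, not part of the PR):

```python
import re

def natural_key(arch: str):
    # Split into alternating digit/non-digit runs and compare digit runs
    # numerically, so "10.0" sorts after "9.0" instead of before it.
    return [int(tok) if tok.isdigit() else tok for tok in re.split(r"(\d+)", arch)]

archs = ["9.0", "10.0", "8.0", "9.0a"]
print(sorted(archs, key=natural_key))
# ['8.0', '9.0', '9.0a', '10.0']
```

A plain lexicographic sort would place "10.0" before "8.0", which is why the natural comparator matters when picking the lowest/highest arch from the list.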

@bnellnm (Contributor) left a comment

Looks reasonable to me.

@simon-mo simon-mo enabled auto-merge (squash) May 14, 2025 22:44
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label May 14, 2025
@simon-mo simon-mo disabled auto-merge May 15, 2025 00:05
@mgoin (Member) left a comment

Seems relatively safe to me. There might be a regression for marlin because of slow bf16 convert on A100 (IIRC) that might transfer to newer hardware, but also might not. Ultimately shouldn't be that big of a deal. @jinzhen-lin please step in if you have concerns with this since you refactored marlin most recently.

@mgoin (Member) left a comment

The failures look closely related

@jinzhen-lin (Contributor)

> Seems relatively safe to me. There might be a regression for marlin because of slow bf16 convert on A100 (IIRC) that might transfer to newer hardware, but also might not. Ultimately shouldn't be that big of a deal. @jinzhen-lin please step in if you have concerns with this since you refactored marlin most recently.

Currently, I believe that the performance of use_atomic_add=True + bf16 may be impacted on sm90+. I will run some tests to verify.

@LucasWilkinson (Collaborator, Author)
Failures resolved

@mgoin (Member) left a comment

This is a nice middle ground, thanks for getting it working!

@simon-mo simon-mo merged commit c7852a6 into vllm-project:main May 15, 2025
87 of 90 checks passed
zzzyq pushed a commit to zzzyq/vllm that referenced this pull request May 24, 2025