
Conversation

mgoin
Member

@mgoin mgoin commented Jul 17, 2025

Summary

Speed up the cudagraph capture loop by calling gc.freeze before capture. This speeds up cudagraph capture by a huge amount, especially for small models: Qwen3-0.6B goes from 35s to 2s.
For the "proper" approach we could possibly use pytorch/pytorch#158193 in a future torch release.
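A compact sketch of the idea (illustrative only; the actual freeze_gc context manager from the merged diff appears later in this thread):

import gc
from contextlib import contextmanager

@contextmanager
def freeze_gc_for_capture():
    # Collect once up front so capture starts from a clean heap, then freeze
    # surviving objects so later collections skip them entirely.
    gc.collect()
    gc.freeze()
    try:
        yield
    finally:
        gc.unfreeze()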

Testing

Before

vllm serve Qwen/Qwen3-0.6B
...
Capturing CUDA graph shapes: 100%|███████| 67/67 [00:34<00:00,  1.92it/s]
INFO 07-17 22:13:03 [gpu_model_runner.py:2283] Graph capturing finished in 35 secs, took 0.59 GiB

After

vllm serve Qwen/Qwen3-0.6B
...
Capturing CUDA graph shapes: 100%|███████| 67/67 [00:02<00:00, 28.07it/s]
INFO 07-17 22:11:40 [gpu_model_runner.py:2294] Graph capturing finished in 2 secs, took 0.59 GiB

https://chatgpt.com/codex/tasks/task_e_687972e21944832987a7bb6219d4c65b


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, a small and essential subset of tests meant to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the v1 label Jul 17, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request aims to speed up CUDA graph capture by disabling garbage collection within the capture loop. This is achieved by introducing a new context manager. My review focuses on improving the implementation of this context manager to use more idiomatic and direct APIs for controlling garbage collection, which enhances code clarity and maintainability.
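The suggestion presumably points at direct toggles such as gc.disable()/gc.enable() rather than monkeypatching; a minimal sketch of that shape, as an assumption about the review's intent rather than the reviewed diff:

import gc
from contextlib import contextmanager

@contextmanager
def gc_disabled():
    # Illustrative only: turn off automatic collection for the duration of
    # the block, then restore the previous state.
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        yield
    finally:
        if was_enabled:
            gc.enable()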

Signed-off-by: mgoin <[email protected]>
@mgoin mgoin marked this pull request as ready for review July 17, 2025 22:20
Collaborator

@WoosukKwon WoosukKwon left a comment

While I agree that maybe we need to do gc.collect only once for the entire set of graphs (rather than once per graph), I think a more proper fix is to modify

stack.enter_context(patch("gc.collect", lambda: None))
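
For reference, a minimal sketch of how that no-op patch might wrap a capture loop (capture_all_graphs and capture_one are hypothetical names, not vLLM code):

import contextlib
from unittest.mock import patch

def capture_all_graphs(shapes, capture_one):
    # Hypothetical loop: suppress the gc.collect that torch's graph-capture
    # context manager runs on entry, for the duration of the whole loop.
    with contextlib.ExitStack() as stack:
        stack.enter_context(patch("gc.collect", lambda: None))
        for shape in shapes:
            capture_one(shape)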

cc @youkaichao @zou3519

@zou3519
Collaborator

zou3519 commented Jul 18, 2025

On the PyTorch side, the motivation for pytorch/pytorch#158193 is that PyTorch shouldn't force people to gc.collect when doing CUDA graph recording. Users should have the flexibility to choose.

For vLLM, doing GC at least once at the beginning of all piecewise captures seems good. The idea is that we want to free up memory so that we have enough memory to do the CUDAGraph captures.

Doing GC once per shape is a reasonable thing to do, though -- the capture of each shape is a forward pass through the model, and a reference cycle somewhere could hold onto memory. But maybe we should consider those situations to be bugs and fix them; the startup time savings are significant.

@yinghai
Contributor

yinghai commented Jul 19, 2025

But

stack.enter_context(patch("gc.collect", lambda: None))

is piecewise-cudagraph specific, isn't it? The solution here doesn't seem worse, and if PyTorch fixes this it'd be ideal.

@mgoin
Member Author

mgoin commented Jul 19, 2025

Maybe we could patch the function to only call gc.collect every N invocations, where N could be 10 or higher? Or use a timer like the PyTorch PR. I think the piecewise method certainly isn't working, and we should generalize it to full graphs too.

@mgoin
Member Author

mgoin commented Jul 19, 2025

I tested a version that only calls gc.collect every N calls. Here are my benchmarks across a range of Ns (on B200, which is why they are faster than in the original description) for vllm serve Qwen/Qwen3-0.6B:

# Main
Capturing CUDA graph shapes: 100%|██████████████| 67/67 [00:10<00:00,  6.23it/s]

# N=1
Capturing CUDA graph shapes: 100%|██████████████| 67/67 [00:10<00:00,  6.35it/s]

# N=2
Capturing CUDA graph shapes: 100%|██████████████| 67/67 [00:05<00:00, 11.42it/s]

# N=5
Capturing CUDA graph shapes: 100%|██████████████| 67/67 [00:03<00:00, 19.51it/s]

# N=10
Capturing CUDA graph shapes: 100%|██████████████| 67/67 [00:02<00:00, 24.23it/s]

# N=20
Capturing CUDA graph shapes: 100%|██████████████| 67/67 [00:01<00:00, 37.71it/s]

# N=50
Capturing CUDA graph shapes: 100%|██████████████| 67/67 [00:01<00:00, 43.28it/s]

# N=100
Capturing CUDA graph shapes: 100%|██████████████| 67/67 [00:01<00:00, 48.32it/s]

# N=200
Capturing CUDA graph shapes: 100%|██████████████| 67/67 [00:01<00:00, 48.60it/s]

Function:

import gc
from contextlib import contextmanager
from unittest.mock import patch

@contextmanager
def suppress_gc_collect(call_interval: int):
    """
    Reduce `gc.collect` frequency to speed up CUDA graph capture.
    Only calls the original gc.collect every `call_interval` invocations.
    """
    original_gc_collect = gc.collect
    call_count = 0

    def throttled_gc_collect():
        nonlocal call_count
        call_count += 1
        if call_count % call_interval == 0:
            return original_gc_collect()
        return None

    with patch("gc.collect", throttled_gc_collect):
        yield
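
A hypothetical usage, wrapping the capture loop (cudagraph_capture_sizes and capture_graph are illustrative names):

# Throttle gc.collect to every 10th invocation while capturing.
with suppress_gc_collect(call_interval=10):
    for shape in cudagraph_capture_sizes:
        capture_graph(shape)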


mergify bot commented Jul 19, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @mgoin.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jul 19, 2025
@mergify mergify bot removed the needs-rebase label Jul 21, 2025
@mgoin
Member Author

mgoin commented Jul 21, 2025

@WoosukKwon @zou3519 @njhill I implemented the collect-then-freeze approach, which seems to provide the same benefit. PTAL:

main:      Capturing CUDA graph shapes: 100%|█████████| 67/67 [00:10<00:00,  6.66it/s]
gc.freeze: Capturing CUDA graph shapes: 100%|█████████| 67/67 [00:01<00:00, 46.41it/s]

@mgoin mgoin changed the title [Core] disable gc during cuda graph capture [Core] Freeze gc during cuda graph capture to speed up init Jul 21, 2025
@njhill
Member

njhill commented Jul 21, 2025

@mgoin nice! Looks like it may still be slightly slower than disabling explicit gc.collect()s? I actually don't see a downside in doing both.

We had actually already intended to do a final gc.collect() + gc.freeze() upon startup completion to minimize any ongoing GC overhead (we already do this in the front-end proc where there's a lot more object churn). I can open a separate PR for that.
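
A rough sketch of that follow-up, assuming a hypothetical startup-completion hook (not code from this PR):

import gc

def on_startup_complete():
    # Collect whatever garbage startup produced, then freeze the survivors so
    # steady-state GC passes don't have to re-scan long-lived engine objects.
    gc.collect()
    gc.freeze()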

@mgoin mgoin added the ready label (ONLY add when PR is ready to merge/full CI is needed) Jul 21, 2025
@mgoin mgoin enabled auto-merge (squash) July 24, 2025 00:15
@simon-mo simon-mo disabled auto-merge July 24, 2025 00:20
@simon-mo simon-mo merged commit f3137cd into main Jul 24, 2025
67 of 69 checks passed
@simon-mo simon-mo deleted the codex/monkeypatch-gc.collect-during-cudagraph-capture branch July 24, 2025 00:20
Comment on lines +2367 to +2380
@contextmanager
def freeze_gc():
    # Optimize garbage collection during CUDA graph capture.
    # Clean up, then freeze all remaining objects from being included
    # in future collections.
    gc.collect()
    should_freeze = not envs.VLLM_ENABLE_CUDAGRAPH_GC
    if should_freeze:
        gc.freeze()
    try:
        yield
    finally:
        if should_freeze:
            gc.unfreeze()
Collaborator

Actually, I think we should have this in utils.py or something like that. The model runner is becoming bloated.

Member Author

Okay, I'll try to do a separate PR that consolidates this with the current implementation in the piecewise backend.
