
Conversation

mgoin
Member

@mgoin mgoin commented Jul 17, 2025

Summary

Speed up the cudagraph capture loop by calling gc.freeze before capture. This speeds up cudagraph capture by a huge amount, especially for small models: Qwen3-0.6B goes from 35s to 2s.
For the "proper" approach we could possibly use pytorch/pytorch#158193 in a future torch release.
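A compact sketch of the idea (illustrative only; the actual freeze_gc context manager from the merged diff appears later in this thread):

import gc
from contextlib import contextmanager

@contextmanager
def freeze_gc_for_capture():
    # Collect once up front so capture starts from a clean heap, then freeze
    # surviving objects so later collections skip them entirely.
    gc.collect()
    gc.freeze()
    try:
        yield
    finally:
        gc.unfreeze()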

Testing

Before

vllm serve Qwen/Qwen3-0.6B
...
Capturing CUDA graph shapes: 100%|███████| 67/67 [00:34<00:00,  1.92it/s]
INFO 07-17 22:13:03 [gpu_model_runner.py:2283] Graph capturing finished in 35 secs, took 0.59 GiB

After

vllm serve Qwen/Qwen3-0.6B
...
Capturing CUDA graph shapes: 100%|███████| 67/67 [00:02<00:00, 28.07it/s]
INFO 07-17 22:11:40 [gpu_model_runner.py:2294] Graph capturing finished in 2 secs, took 0.59 GiB

https://chatgpt.com/codex/tasks/task_e_687972e21944832987a7bb6219d4c65b


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, a small and essential subset of tests meant to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the v1 label Jul 17, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request aims to speed up CUDA graph capture by disabling garbage collection within the capture loop. This is achieved by introducing a new context manager. My review focuses on improving the implementation of this context manager to use more idiomatic and direct APIs for controlling garbage collection, which enhances code clarity and maintainability.
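The suggestion presumably points at direct toggles such as gc.disable()/gc.enable() rather than monkeypatching; a minimal sketch of that shape, as an assumption about the review's intent rather than the reviewed diff:

import gc
from contextlib import contextmanager

@contextmanager
def gc_disabled():
    # Illustrative only: turn off automatic collection for the duration of
    # the block, then restore the previous state.
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        yield
    finally:
        if was_enabled:
            gc.enable()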

Signed-off-by: mgoin <[email protected]>
@mgoin mgoin marked this pull request as ready for review July 17, 2025 22:20
Collaborator

@WoosukKwon WoosukKwon left a comment

While I agree that maybe we need to do gc.collect only once for the entire set of graphs (rather than once per graph), I think a more proper fix is to modify

stack.enter_context(patch("gc.collect", lambda: None))
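
For reference, a minimal sketch of how that no-op patch might wrap a capture loop (capture_all_graphs and capture_one are hypothetical names, not vLLM code):

import contextlib
from unittest.mock import patch

def capture_all_graphs(shapes, capture_one):
    # Hypothetical loop: suppress the gc.collect that torch's graph-capture
    # context manager runs on entry, for the duration of the whole loop.
    with contextlib.ExitStack() as stack:
        stack.enter_context(patch("gc.collect", lambda: None))
        for shape in shapes:
            capture_one(shape)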

cc @youkaichao @zou3519

@zou3519
Collaborator

zou3519 commented Jul 18, 2025

On the PyTorch side, the motivation for pytorch/pytorch#158193 is that PyTorch shouldn't force people to gc.collect when doing CUDA graph recording. Users should have the flexibility to choose.

For vLLM, doing GC at least once at the beginning of all piecewise captures seems good. The idea is that we want to free up memory so that we have enough memory to do the CUDAGraph captures.

Doing GC once per shape is a reasonable thing to do, though -- the capture of each shape is a forward pass through the model, and a reference cycle somewhere could hold onto memory. But maybe we should consider those situations to be bugs and fix them; the startup time savings are significant.

@yinghai
Contributor

yinghai commented Jul 19, 2025

But

stack.enter_context(patch("gc.collect", lambda: None))

is piecewise-cudagraph specific, isn't it? The solution here doesn't seem worse, and if PyTorch fixes this it'd be ideal.

@mgoin
Member Author

mgoin commented Jul 19, 2025

Maybe we could patch the function to only call gc.collect every N invocations, where N could be 10 or higher? Or use a timer like the PyTorch PR. I think the piecewise method certainly isn't working, and we should generalize it to full graphs too.

@mgoin
Member Author

mgoin commented Jul 19, 2025

I tested a version that only calls gc.collect every N calls. Here are my benchmarks across a range of Ns (on B200, which is why they are faster than in the original description) for vllm serve Qwen/Qwen3-0.6B:

# Main
Capturing CUDA graph shapes: 100%|██████████████| 67/67 [00:10<00:00,  6.23it/s]

# N=1
Capturing CUDA graph shapes: 100%|██████████████| 67/67 [00:10<00:00,  6.35it/s]

# N=2
Capturing CUDA graph shapes: 100%|██████████████| 67/67 [00:05<00:00, 11.42it/s]

# N=5
Capturing CUDA graph shapes: 100%|██████████████| 67/67 [00:03<00:00, 19.51it/s]

# N=10
Capturing CUDA graph shapes: 100%|██████████████| 67/67 [00:02<00:00, 24.23it/s]

# N=20
Capturing CUDA graph shapes: 100%|██████████████| 67/67 [00:01<00:00, 37.71it/s]

# N=50
Capturing CUDA graph shapes: 100%|██████████████| 67/67 [00:01<00:00, 43.28it/s]

# N=100
Capturing CUDA graph shapes: 100%|██████████████| 67/67 [00:01<00:00, 48.32it/s]

# N=200
Capturing CUDA graph shapes: 100%|██████████████| 67/67 [00:01<00:00, 48.60it/s]

Function:

import gc
from contextlib import contextmanager
from unittest.mock import patch

@contextmanager
def suppress_gc_collect(call_interval: int):
    """
    Reduce `gc.collect` frequency to speed up CUDA graph capture.
    Only calls the original gc.collect every `call_interval` invocations.
    """
    original_gc_collect = gc.collect
    call_count = 0

    def throttled_gc_collect():
        nonlocal call_count
        call_count += 1
        if call_count % call_interval == 0:
            return original_gc_collect()
        return None

    with patch("gc.collect", throttled_gc_collect):
        yield
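
A hypothetical usage, wrapping the capture loop (cudagraph_capture_sizes and capture_graph are illustrative names):

# Throttle gc.collect to every 10th invocation while capturing.
with suppress_gc_collect(call_interval=10):
    for shape in cudagraph_capture_sizes:
        capture_graph(shape)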


mergify bot commented Jul 19, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @mgoin.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jul 19, 2025
@mergify mergify bot removed the needs-rebase label Jul 21, 2025
@mgoin
Member Author

mgoin commented Jul 21, 2025

@WoosukKwon @zou3519 @njhill I implemented the collect-then-freeze approach, which seems to provide the same benefit. PTAL:

main:      Capturing CUDA graph shapes: 100%|█████████| 67/67 [00:10<00:00,  6.66it/s]
gc.freeze: Capturing CUDA graph shapes: 100%|█████████| 67/67 [00:01<00:00, 46.41it/s]

@mgoin mgoin changed the title [Core] disable gc during cuda graph capture [Core] Freeze gc during cuda graph capture to speed up init Jul 21, 2025
@njhill
Member

njhill commented Jul 21, 2025

@mgoin nice! Looks like it may still be slightly slower than disabling explicit gc.collect()s? I actually don't see a downside in doing both.

We had actually already intended to do a final gc.collect() + gc.freeze() upon startup completion to minimize any ongoing GC overhead (we already do this in the front-end proc where there's a lot more object churn). I can open a separate PR for that.
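
A rough sketch of that follow-up, assuming a hypothetical startup-completion hook (not code from this PR):

import gc

def on_startup_complete():
    # Collect whatever garbage startup produced, then freeze the survivors so
    # steady-state GC passes don't have to re-scan long-lived engine objects.
    gc.collect()
    gc.freeze()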

@mgoin mgoin added the ready label (ONLY add when PR is ready to merge/full CI is needed) Jul 21, 2025
@mgoin mgoin enabled auto-merge (squash) July 24, 2025 00:15
@simon-mo simon-mo disabled auto-merge July 24, 2025 00:20
@simon-mo simon-mo merged commit f3137cd into main Jul 24, 2025
67 of 69 checks passed
@simon-mo simon-mo deleted the codex/monkeypatch-gc.collect-during-cudagraph-capture branch July 24, 2025 00:20
Comment on lines +2367 to +2380
@contextmanager
def freeze_gc():
    # Optimize garbage collection during CUDA graph capture.
    # Clean up, then freeze all remaining objects from being included
    # in future collections.
    gc.collect()
    should_freeze = not envs.VLLM_ENABLE_CUDAGRAPH_GC
    if should_freeze:
        gc.freeze()
    try:
        yield
    finally:
        if should_freeze:
            gc.unfreeze()
Collaborator

Actually, I think we should have this in utils.py or something like that. The model runner is becoming bloated.

Member Author

Okay, I'll try to do a separate PR that consolidates this with the current implementation in the piecewise backend.
