@hebiao064 hebiao064 commented Jun 8, 2025

Co-authored with: MrAta ([email protected]) and @zhaochenyang20

Checklist Before Starting

  • Search for similar PR(s).

What does this PR do?

Motivation

In RL ecosystems that use a colocated design, such as verl, we need to offload the training model and load the serving model & KV cache frequently.

Background

  • Currently, SGLang uses torch_memory_saver to pause and resume GPU memory.
  • torch_memory_saver is an open-source library that provides an easy-to-use API to hook cudaMalloc and cudaFree so that virtual addresses stay consistent across pause and resume, which is critical for CUDA Graphs to keep working (see the sketch after this list).
  • CUDA Graphs are critical for SGLang to run fast in the decoding phase.
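
A minimal sketch of the existing singleton-style usage (API names follow my reading of the torch_memory_saver README and may differ between versions):

```python
import torch
from torch_memory_saver import torch_memory_saver  # global singleton instance

# Allocations made inside the region are managed by the saver.
with torch_memory_saver.region():
    kv_cache = torch.empty(1 << 20, dtype=torch.bfloat16, device="cuda")

torch_memory_saver.pause()   # release physical memory, keep the virtual addresses
# ... the freed GPU memory can be used by something else here ...
torch_memory_saver.resume()  # re-map physical memory at the same virtual addresses,
                             # so previously captured CUDA Graphs stay valid
```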

Here is the current behavior of VERL + SGLang

Image

  1. During training, the training model and optimizer state live in GPU memory. Once training is done, we offload the optimizer state to CPU and keep the model weights on GPU, since they are needed for the update-weight step.
  2. During the update-weight step, we wake the SGLang engine, so the paused memory for model weights and KV cache comes back. Then we copy the training model's weights into the serving model on the fly via the update_weights_in_tensor API.
  3. After the model is updated, we delete the training model from GPU memory (a minimal sketch of this flow follows).
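
To make the flow concrete, here is a minimal sketch of the single-stage behavior (assuming SGLang-style engine methods such as resume_memory_occupation and update_weights_from_tensor; the function and its arguments are illustrative, not verl's actual code):

```python
import torch

def update_weights_single_stage(engine, training_model):
    """Illustrative only: wake everything at once, then update and clean up."""
    # Wake the whole SGLang engine: model weights AND KV cache are resumed together.
    engine.resume_memory_occupation()
    # Copy the freshly trained weights into the serving model on the fly.
    engine.update_weights_from_tensor(
        named_tensors=list(training_model.named_parameters())
    )
    # Only now can the training copy be dropped; the KV cache was resident the
    # whole time, which is what forces the low mem-fraction ratio described below.
    del training_model
    torch.cuda.empty_cache()
```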

The design above has worked well so far; however, it wastes a big chunk of GPU memory during rollout, which can cause a few issues we have already seen:

  • Small KV Cache: We have to use a relatively low mem-fraction ratio (e.g. 0.6), so the KV cache holds fewer tokens. With fewer tokens in the KV cache, we hit RuntimeError: Prefill out of memory. Try to lower your batch size. when prefilling a large number of requests.
  • Out of Memory: With a mem-fraction ratio of 0.8, running RL for a 32B model on 8 H100s OOMs during the update-weight step.

Challenge

  • torch_memory_saver currently only supports a singleton, so SGLang pauses and resumes the KV Cache and Weights together; they are treated as one group of memory controlled by the single torch_memory_saver instance.

Proposal

Image

  1. During training, we do the same as today.
  2. During update-weight stage 1, we wake only the model weights in SGLang and then update the weights.
  3. During update-weight stage 2, we delete the training model weights from GPU memory.
  4. Finally, we wake SGLang's KV cache (see the sketch below).
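
A minimal sketch of the proposed multi-stage flow, under the same illustrative assumptions as the sketch above; the tag values passed to resume_memory_occupation are placeholders for whatever the tag-based API exposes:

```python
import torch

def update_weights_multi_stage(engine, training_model):
    """Illustrative only: wake weights first, free the training copy, then wake KV cache."""
    # Stage 1: resume only the serving model weights and sync them.
    engine.resume_memory_occupation(tags=["weights"])
    engine.update_weights_from_tensor(
        named_tensors=list(training_model.named_parameters())
    )
    # Stage 2: drop the training copy before the large KV cache is re-mapped,
    # so the peak no longer holds training weights + serving weights + KV cache at once.
    del training_model
    torch.cuda.empty_cache()
    engine.resume_memory_occupation(tags=["kv_cache"])
```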

Image

Benefit

With this feature, we can train a larger model on the same GPUs, and we can also make training/rollout more efficient because we can allocate a larger KV cache.

Solution: Keep using Singleton and provide tag based pause/resume
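
At the torch_memory_saver level, tag-based pause/resume could look roughly like the sketch below (assuming a tag parameter along the lines of fzyzcjy/torch_memory_saver#20; names and signatures are assumptions, not the library's confirmed API):

```python
import torch
from torch_memory_saver import torch_memory_saver

# Allocate weights and KV cache under different tags (tag names are illustrative).
with torch_memory_saver.region(tag="weights"):
    weights = torch.empty(1 << 20, device="cuda")
with torch_memory_saver.region(tag="kv_cache"):
    kv_cache = torch.empty(1 << 24, device="cuda")

# Pause both groups during training, then resume them in separate stages.
torch_memory_saver.pause(tag="weights")
torch_memory_saver.pause(tag="kv_cache")
torch_memory_saver.resume(tag="weights")   # stage 1: weights only
torch_memory_saver.resume(tag="kv_cache")  # stage 2: KV cache after the training copy is freed
```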

High-Level Design

Demonstrate the high-level design if this PR is complex.

Specific Changes

List the specific changes.

API

Demonstrate how the API changes if any.

Usage Example

Provide usage example(s) for easier usage.

# Add code snippet or script demonstrating how to use this 

Test

Screenshot 2025-06-19 at 12 16 19 PM
Screenshot 2025-06-19 at 12 13 14 PM

Additional Info.

  • Issue Number: Fixes issue # or discussion # if any.
  • Training: [Note which backend this PR will affect: FSDP, Megatron, both, or none]
  • Inference: [Note which backend this PR will affect: vLLM, SGLang, both, or none]

Checklist Before Submitting

  • Read the Contribute Guide.
  • Apply pre-commit checks.
  • Add [BREAKING] to the PR title if it breaks any API.
  • Update the documentation about your changes in the docs.
  • New CI unit test(s) are added to cover the code path.
  • Rely on existing unit tests on CI that covers the code path.

@hebiao064 hebiao064 changed the title Support Multi-stage Awake [SGLang] Support Multi-stage Awake Jun 8, 2025

hebiao064 commented Jun 9, 2025

Another finding: for the current GRPO training of Qwen 7B, mem-fraction 0.9 fails with:

(WorkerDict pid=3022678) [torch_memory_saver.cpp] CUresult error  result=2 file=csrc/torch_memory_saver.cpp func=cu_mem_create line=103

Our goal is to make GRPO on Qwen 7B work with 0.9 or even 0.95 mem-fraction by leveraging multi-stage awake.

@hebiao064 hebiao064 marked this pull request as ready for review June 11, 2025 21:56
@hebiao064 hebiao064 changed the title [SGLang] Support Multi-stage Awake [rollout] Support Multi-stage Awake for SGLang Jun 19, 2025
@hebiao064

Another finding: for the current GRPO training of Qwen 7B, mem-fraction 0.9 works (peak memory goes up to 97%) on H200.

When we set mem-fraction to 0.95, it fails with:

(WorkerDict pid=3022678) [torch_memory_saver.cpp] CUresult error  result=2 file=csrc/torch_memory_saver.cpp func=cu_mem_create line=103

Our goal is to make GRPO on Qwen 7B work with 0.95 mem-fraction by leveraging multi-stage awake.

Screenshot 2025-06-19 at 12 13 14 PM

Screenshot 2025-06-19 at 12 16 19 PM

@zhaochenyang20

Great work!

@zhaochenyang20

This feature is supported in sglang main and torch_memory_saver main. We will release 0.4.8 in the coming days.

@hebiao064 hebiao064 changed the title [rollout] Support Multi-stage Awake for SGLang [feat] Support Multi-stage Awake for SGLang Jun 20, 2025
@hebiao064 hebiao064 changed the title [feat] Support Multi-stage Awake for SGLang [rollout] feat: Support Multi-stage Awake for SGLang Jun 20, 2025
@zhaochenyang20 zhaochenyang20 left a comment

Awesome

@zhaochenyang20 zhaochenyang20 merged commit 2a62123 into volcengine:main Jun 23, 2025
35 of 37 checks passed
Sirius-L1 pushed a commit to Sirius-L1/verl that referenced this pull request Jun 24, 2025
Co-authored with: MrAta ([email protected])

### Checklist Before Starting

- [x] Search for similar PR(s).

### What does this PR do?

### Motivation

In RL ecosystems that use a colocated design, such as
[verl](https://github.com/volcengine/verl/tree/main), we need to offload
the training model and load the serving model & KV cache frequently.


#### Background
- Currently, SGLang uses
[torch_memory_saver](https://github.com/fzyzcjy/torch_memory_saver) to
pause and resume GPU memory.
- [torch_memory_saver](https://github.com/fzyzcjy/torch_memory_saver) is
an open-source library that provides an easy-to-use API to hook **cudaMalloc**
and **cudaFree** so that virtual addresses stay consistent across pause and
resume, which is critical for CUDA Graphs to keep working.
- CUDA Graphs are critical for SGLang to run fast in the decoding phase.


#### Here is the current behavior of VERL + SGLang


![Image](https://github.com/user-attachments/assets/e87e7dd6-f223-4de6-8f07-915eb2030ea8)

1. During training, the training model and optimizer state live in GPU
memory. Once training is done, we offload the optimizer state to CPU and
keep the model weights on GPU, since they are needed for the update-weight
step.
2. During the update-weight step, we wake the SGLang engine, so the paused
memory for model weights and KV cache comes back. Then we copy the training
model's weights into the serving model on the fly via the
`update_weights_in_tensor` API.
3. After the model is updated, we delete the training model from GPU
memory.


The design above has worked well so far; however, it wastes a big chunk of
GPU memory during rollout, which can cause a few issues we've already seen:
- **Small KV Cache**: We have to use a relatively low mem-fraction ratio
(e.g. 0.6), so the KV cache holds fewer tokens. With fewer tokens in the KV
cache, we hit `RuntimeError: Prefill out of memory. Try to lower your batch
size.` when prefilling a large number of requests.
- **Out of Memory**: With a mem-fraction ratio of 0.8, running RL for a 32B
model on 8 H100s OOMs during the update-weight step.


#### Challenge
- `torch_memory_saver` currently only supports a singleton, so SGLang
pauses and resumes the KV cache and weights together; they are treated as
one group of memory controlled by the single `torch_memory_saver` instance.

#### Proposal

![Image](https://github.com/user-attachments/assets/7fda9638-0dc2-4c14-bc64-cd20616f350f)

1. During training, we do the same as today.
2. During update-weight stage 1, we wake only the model weights in SGLang
and then update the weights.
3. During update-weight stage 2, we delete the training model weights from
GPU memory.
4. Finally, we wake SGLang's KV cache.



![Image](https://github.com/user-attachments/assets/f3dab327-dc2e-4ed8-88d7-15e383f77d25)


### Benefit
With this feature, we can train a larger model on the same GPUs, and we can
also make training/rollout more efficient because we can allocate a larger
KV cache.

### Solution: Keep using Singleton and provide tag based pause/resume

- [x] Support tag based resume/pause:
fzyzcjy/torch_memory_saver#20
- [x] Support Multiple Stage Awake in SGLang:
sgl-project/sglang#7099
- [ ] Support Multiple Stage Awake in verl:
volcengine#1911

### High-Level Design

> Demonstrate the high-level design if this PR is complex.

### Specific Changes

> List the specific changes.

### API

> Demonstrate how the API changes if any.

### Usage Example

> Provide usage example(s) for easier usage.

```python
# Add code snippet or script demonstrating how to use this 
```

### Test

![Screenshot 2025-06-19 at 12 16
19 PM](https://github.com/user-attachments/assets/a95dd57e-43e1-4f28-8a84-003ec5c043fc)
![Screenshot 2025-06-19 at 12 13
14 PM](https://github.com/user-attachments/assets/f1f4a8a8-1845-4fad-9424-5526d4154dd0)


### Additional Info.

- **Issue Number**: Fixes issue # or discussion # if any.
- **Training**: [Note which backend this PR will affect: FSDP, Megatron,
both, or none]
- **Inference**: [Note which backend this PR will affect: vLLM, SGLang,
both, or none]

### Checklist Before Submitting

- [ ] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [ ] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [ ] Add `[BREAKING]` to the PR title if it breaks any API.
- [ ] Update the documentation about your changes in the
[docs](https://github.com/volcengine/verl/tree/main/docs).
- [ ] New CI unit test(s) are added to cover the code path.
- [ ] Rely on existing unit tests on CI that covers the code path.

---------

Co-authored-by: Chayenne <[email protected]>
vermouth1992 pushed a commit that referenced this pull request Jun 28, 2025
### What does this PR do?

Fix issue #2189.

This bug was introduced in #1911, which moved `resume_memory_occupation`
in the resharding phase to before the call to
`get_torch_device().empty_cache()`.
`get_torch_device().empty_cache()`.

Calling `resume_memory_occupation` without emptying the cache first causes
an OOM in the resharding phase of the first iteration, which prevents the
example `run_qwen2.5-3b_gsm8k_multiturn` from running.

Re-adding `get_torch_device().empty_cache()` solves the problem and allows
the example to run again; a minimal ordering sketch follows.
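
For illustration only, the ordering the fix restores looks roughly like this (using `torch.cuda.empty_cache()` as a stand-in for verl's device-agnostic `get_torch_device().empty_cache()`; the function name is hypothetical):

```python
import torch

def reshard_and_resume(engine):
    # Release cached allocator blocks back to the driver first, so enough free
    # physical memory exists when SGLang's paused regions are re-mapped.
    torch.cuda.empty_cache()
    engine.resume_memory_occupation()
```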

### Checklist Before Starting

- [x] Search for similar PRs. Paste at least one query link here: ...
- [x] Format the PR title as `[{modules}] {type}: {description}` (This
will be checked by the CI)
- `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`,
`trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`,
`ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`,
`env`, `tool`, `ckpt`, `doc`, `data`
- If this PR involves multiple modules, separate them with `,` like
`[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

> For changes that can not be tested by CI (e.g., algorithm
implementation, new model support), validate by experiment(s) and show
results like training curve plots, evaluation results, etc.

### API and Usage Example

> Demonstrate how the API changes if any, and provide usage example(s)
if possible.

```python
# Add code snippet or script demonstrating how to use this
```

### High-Level Design

> Demonstrate the high-level design if this PR is complex.

### Specific Changes

> List the specific changes.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review,
otherwise the reviewer might deprioritize this PR for review.

- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting):
`pre-commit install && pre-commit run --all-files --show-diff-on-failure
--color=always`
- [x] Add / Update [the
documentation](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add unit or end-to-end test(s) to [the CI
workflow](https://github.com/volcengine/verl/tree/main/.github/workflows)
to cover all the code. If not feasible, explain why: ...
- [x] Once your PR is ready for CI, send a message in [the `ci-request`
channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the
`verl` Slack
workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).
Tyizhanshen pushed a commit to HyperdriveHustle/verl that referenced this pull request Jul 1, 2025
wuxibin89 pushed a commit that referenced this pull request Jul 7, 2025
### What does this PR do?

Fix a regression from #1911, which did not update the sglang async branch.

CI did not catch this error because it only runs 1 step, while the error
shows up at the second step, so I updated the test cases to run 2 steps.

To reproduce the bug, run the test:
TOTAL_TRAIN_STEPS=2 ENGINE=sglang ROLLOUT_MODE=async bash
tests/special_e2e/ppo_trainer/run_function_reward.sh

It fails with:
```
(WorkerDict pid=1257286) Total steps: 2, num_warmup_steps: 0
(WorkerDict pid=1257286) Actor use_remove_padding=True
(WorkerDict pid=1257286) Actor use_fused_kernels=False
(AsyncSglangServer pid=1260392) FastAPI listen on 192.168.111.48:40451
(WorkerDict pid=1257286) terminate called after throwing an instance of 'c10::Error'
(WorkerDict pid=1257286)   what():  CUDA error: an illegal memory access was encountered
(WorkerDict pid=1257286) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(WorkerDict pid=1257286) For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(WorkerDict pid=1257286) Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(WorkerDict pid=1257286)
(WorkerDict pid=1257286) Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
(WorkerDict pid=1257286) frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fbf6036c1b6 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
(WorkerDict pid=1257286) frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fbf60315a76 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
(WorkerDict pid=1257286) frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fbf6080d918 in
```



### Checklist Before Starting

- [X] Search for similar PRs. Paste at least one query link here:
https://github.com/volcengine/verl/issues?q=is%3Aissue%20state%3Aopen%20an%20illegal%20memory%20access%20was%20encountered
- [X] Format the PR title as `[{modules}] {type}: {description}` (This
will be checked by the CI)
- `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`,
`trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`,
`ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`,
`env`, `tool`, `ckpt`, `doc`, `data`
- If this PR involves multiple modules, separate them with `,` like
`[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test
```
(TaskRunner pid=1647269) step:2 - global_seqlen/min:13075 - global_seqlen/max:14837 - global_seqlen/minmax_diff:1762 - global_seqlen/balanced_min:14231 - global_seqlen/balanced_max:14232 - global_seqlen/mean:14231.5 - actor/entropy:2.0606913566589355 - critic/vf_loss:8.7157882153
```
### API and Usage Example

> Demonstrate how the API changes if any, and provide usage example(s)
if possible.

```python
# Add code snippet or script demonstrating how to use this
```

### High-Level Design

> Demonstrate the high-level design if this PR is complex.

### Specific Changes

> List the specific changes.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review,
otherwise the reviewer might deprioritize this PR for review.

- [X] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [X] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting):
`pre-commit install && pre-commit run --all-files --show-diff-on-failure
--color=always`
- [X] Add / Update [the
documentation](https://github.com/volcengine/verl/tree/main/docs).
- [X] Add unit or end-to-end test(s) to [the CI
workflow](https://github.com/volcengine/verl/tree/main/.github/workflows)
to cover all the code. If not feasible, explain why: ...
- [X] Once your PR is ready for CI, send a message in [the `ci-request`
channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the
`verl` Slack
workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).
yellowbee686 pushed a commit to yellowbee686/verl that referenced this pull request Jul 7, 2025
lkc233 pushed a commit to lkc233/verl that referenced this pull request Jul 10, 2025
ArronHZG pushed a commit to imh966/verl that referenced this pull request Jul 10, 2025
oseyosey pushed a commit to oseyosey/verl that referenced this pull request Jul 28, 2025
oseyosey pushed a commit to oseyosey/verl that referenced this pull request Jul 28, 2025
oseyosey pushed a commit to oseyosey/verl that referenced this pull request Jul 28, 2025
Juniper1021 pushed a commit to Juniper1021/verl that referenced this pull request Aug 7, 2025
whatadayG pushed a commit to whatadayG/verl that referenced this pull request Sep 5, 2025
torchtorchkimtorch pushed a commit to torchtorchkimtorch/verl_optimization_exp that referenced this pull request Oct 16, 2025
Co-authored with: MrAta ([email protected])

### Checklist Before Starting

- [x] Search for similar PR(s).

### What does this PR do?

### Motivation

In RL Ecosystem which use colocate design like
[verl](https://github.com/volcengine/verl/tree/main), we need to offload
training model and load serving model & KV Cache frequently.


#### Background
- Currently SGLang is using
[torch_memory_saver](https://github.com/fzyzcjy/torch_memory_saver) to
pause and resume.
- [torch_memory_saver](https://github.com/fzyzcjy/torch_memory_saver) is
a open source repo that provided easy to use api to hack **cudaMalloc**
and **cudaFree** to make sure the virtual address could be consistent
after pause and resume, which is critical to ensure CUDA Graph work.
- CUDA Graph is critical to make sure SGLang runs faster in decoding
phases.


#### Here is the current behavior of VERL + SGLang


![Image](https://github.com/user-attachments/assets/e87e7dd6-f223-4de6-8f07-915eb2030ea8)

1. During Training, we have training model and optimizer state in the
GPU Memory, and once training is done, we will offload optimizer state
to cpu and keep the model weights in GPU, which is needed in Update
Weight.
2. During Update Weight, we awake the SGLang engine, so those paused
memory of Model Weights and KV Cache will come back. Then we update
model from training model to serving model on the fly using the api:
`update_weights_in_tensor`
3. After Model being updated, we delete the training model from GPU
Memory.


The design above has worked well so far; however, it wastes a big chunk of GPU
memory during rollout, which has caused a few issues:
- **Small KV cache**: We have to use a relatively low memory fraction ratio
(e.g. 0.6), so the KV cache holds fewer tokens. With fewer KV cache tokens, we
hit `RuntimeError: Prefill out of memory. Try to lower your batch size.` when
prefilling a large number of requests (see the illustration after this list).
- **Out of memory**: With a memory fraction ratio of 0.8, running RL for a 32B
model on 8 H100s OOMs during the weight update.
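
As an illustration of the first point, the rollout engine today has to be
launched with a conservative static memory fraction. The snippet below is a
hypothetical example: the model path and values are placeholders, and the
argument names follow SGLang's offline engine API as we understand it.

```python
import sglang as sgl

# With single-stage wake-up, enough headroom must be reserved for the window in
# which both the training and serving weights are resident, so the KV cache is
# forced to stay small.
engine = sgl.Engine(
    model_path="Qwen/Qwen2.5-32B-Instruct",  # placeholder model
    mem_fraction_static=0.6,                 # small KV cache -> prefill OOM risk
    enable_memory_saver=True,                # required for pause/resume
)
```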


#### Challenge
- `torch_memory_saver` currently only supports a singleton, so SGLang pauses and
resumes the KV cache and weights together; they are treated as a single group of
memory controlled by the singleton `torch_memory_saver` instance.

#### Proposal

![Image](https://github.com/user-attachments/assets/7fda9638-0dc2-4c14-bc64-cd20616f350f)

1. During training, we do the same as before.
2. During weight update stage 1, we wake only SGLang's model weights and then
update them.
3. During weight update stage 2, we delete the training model weights from GPU
memory.
4. Finally, we wake SGLang's KV cache (see the sketch below).
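
A hypothetical sketch of the two-stage flow follows; tag names and method
signatures are assumptions based on this PR series and the linked SGLang and
torch_memory_saver changes, not a finalized API.

```python
import torch

def update_weights_two_stage(engine, training_model):
    # Stage 1: resume only the serving model weights, then update them.
    engine.resume_memory_occupation(tags=["weights"])
    engine.update_weights_in_tensor(list(training_model.named_parameters()))

    # Stage 2: drop the training copy first ...
    del training_model
    torch.cuda.empty_cache()

    # ... and only then re-materialize the (large) KV cache, so the training
    # weights and the KV cache never coexist on the GPU.
    engine.resume_memory_occupation(tags=["kv_cache"])
```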



![Image](https://github.com/user-attachments/assets/f3dab327-dc2e-4ed8-88d7-15e383f77d25)


### Benefit
With this feature we can train larger models on the same GPUs, and we can also
make training/rollout more efficient because we can allocate a larger KV cache.

### Solution: Keep using the singleton and provide tag-based pause/resume

- [x] Support tag-based resume/pause (sketched after this list):
fzyzcjy/torch_memory_saver#20
- [x] Support Multiple Stage Awake in SGLang:
sgl-project/sglang#7099
- [ ] Support Multiple Stage Awake in verl:
volcengine/verl#1911
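
At the torch_memory_saver level, the tag-based API keeps the singleton but lets
callers pause and resume memory groups independently. The sketch below is our
understanding of the interface added in fzyzcjy/torch_memory_saver#20; exact
argument names may differ.

```python
import torch
import torch_memory_saver

saver = torch_memory_saver.torch_memory_saver

# Allocations are grouped by tag at creation time.
with saver.region(tag="weights"):
    weights = torch.empty(1 << 20, dtype=torch.uint8, device="cuda")
with saver.region(tag="kv_cache"):
    kv_cache = torch.empty(1 << 24, dtype=torch.uint8, device="cuda")

# Each group can now be paused and resumed independently instead of
# all-or-nothing, which is what enables the multi-stage awake above.
saver.pause("kv_cache")
saver.pause("weights")
saver.resume("weights")   # stage 1: weights only
saver.resume("kv_cache")  # stage 2: KV cache, after the training weights are freed
```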

### High-Level Design

> Demonstrate the high-level design if this PR is complex.

### Specific Changes

> List the specific changes.

### API

> Demonstrate how the API changes if any.

### Usage Example

> Provide usage example(s) for easier usage.

```python
# Add code snippet or script demonstrating how to use this 
```

### Test

![Screenshot 2025-06-19 at 12 16
19 PM](https://github.com/user-attachments/assets/a95dd57e-43e1-4f28-8a84-003ec5c043fc)
![Screenshot 2025-06-19 at 12 13
14 PM](https://github.com/user-attachments/assets/f1f4a8a8-1845-4fad-9424-5526d4154dd0)


### Additional Info.

- **Issue Number**: Fixes issue # or discussion # if any.
- **Training**: [Note which backend this PR will affect: FSDP, Megatron,
both, or none]
- **Inference**: [Note which backend this PR will affect: vLLM, SGLang,
both, or none]

### Checklist Before Submitting

- [ ] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [ ] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [ ] Add `[BREAKING]` to the PR title if it breaks any API.
- [ ] Update the documentation about your changes in the
[docs](https://github.com/volcengine/verl/tree/main/docs).
- [ ] New CI unit test(s) are added to cover the code path.
- [ ] Rely on existing unit tests on CI that covers the code path.

---------

Co-authored-by: Chayenne <[email protected]>
torchtorchkimtorch pushed a commit to torchtorchkimtorch/verl_optimization_exp that referenced this pull request Oct 16, 2025
### What does this PR do?

Fix a regression from volcengine/verl#1911: that PR did not update the sglang
async branch.

CI did not catch this error because it only runs 1 step, while the error only
shows up on the second step, so I updated the test cases to run 2 steps.

To reproduce the bug, run the test:
`TOTAL_TRAIN_STEPS=2 ENGINE=sglang ROLLOUT_MODE=async bash tests/special_e2e/ppo_trainer/run_function_reward.sh`

It fails with:
```
(WorkerDict pid=1257286) Total steps: 2, num_warmup_steps: 0
(WorkerDict pid=1257286) Actor use_remove_padding=True
(WorkerDict pid=1257286) Actor use_fused_kernels=False
(AsyncSglangServer pid=1260392) FastAPI listen on 192.168.111.48:40451
(WorkerDict pid=1257286) terminate called after throwing an instance of 'c10::Error'
(WorkerDict pid=1257286)   what():  CUDA error: an illegal memory access was encountered
(WorkerDict pid=1257286) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(WorkerDict pid=1257286) For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(WorkerDict pid=1257286) Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(WorkerDict pid=1257286)
(WorkerDict pid=1257286) Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
(WorkerDict pid=1257286) frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fbf6036c1b6 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
(WorkerDict pid=1257286) frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fbf60315a76 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
(WorkerDict pid=1257286) frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fbf6080d918 in
```



### Checklist Before Starting

- [X] Search for similar PRs. Paste at least one query link here:
https://github.com/volcengine/verl/issues?q=is%3Aissue%20state%3Aopen%20an%20illegal%20memory%20access%20was%20encountered
- [X] Format the PR title as `[{modules}] {type}: {description}` (This
will be checked by the CI)
- `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`,
`trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`,
`ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`,
`env`, `tool`, `ckpt`, `doc`, `data`
- If this PR involves multiple modules, separate them with `,` like
`[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test
```
(TaskRunner pid=1647269) step:2 - global_seqlen/min:13075 - global_seqlen/max:14837 - global_seqlen/minmax_diff:1762 - global_seqlen/balanced_min:14231 - global_seqlen/balanced_max:14232 - global_seqlen/mean:14231.5 - actor/entropy:2.0606913566589355 - critic/vf_loss:8.7157882153
```
### API and Usage Example

> Demonstrate how the API changes if any, and provide usage example(s)
if possible.

```python
# Add code snippet or script demonstrating how to use this
```

### High-Level Design

> Demonstrate the high-level design if this PR is complex.

### Specific Changes

> List the specific changes.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review,
otherwise the reviewer might deprioritize this PR for review.

- [X] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [X] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting):
`pre-commit install && pre-commit run --all-files --show-diff-on-failure
--color=always`
- [X] Add / Update [the
documentation](https://github.com/volcengine/verl/tree/main/docs).
- [X] Add unit or end-to-end test(s) to [the CI
workflow](https://github.com/volcengine/verl/tree/main/.github/workflows)
to cover all the code. If not feasible, explain why: ...
- [X] Once your PR is ready for CI, send a message in [the `ci-request`
channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the
`verl` Slack
workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).