
Conversation

afeldman-nm
Contributor

@afeldman-nm afeldman-nm commented Aug 26, 2025

Purpose

This PR adds support for passing request-level logits processors into the vLLM V1 engine. This is accomplished via a wrapper that builds a batch-level logits processor class out of a Callable request-level logits processor complying with the interface described at https://docs.vllm.ai/en/v0.9.2/api/vllm/logits_process.html
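For reference, a request-level logits processor under that interface is a plain callable in one of two forms (a minimal sketch paraphrasing the linked docs; the toy `ban_token_42` function is illustrative only):

```python
from typing import Callable, Union

import torch

# Two accepted call signatures: (output_token_ids, logits) -> logits, or
# (prompt_token_ids, output_token_ids, logits) -> logits.
RequestLogitsProcessor = Union[
    Callable[[list[int], torch.Tensor], torch.Tensor],
    Callable[[list[int], list[int], torch.Tensor], torch.Tensor],
]


def ban_token_42(output_token_ids: list[int],
                 logits: torch.Tensor) -> torch.Tensor:
    """Toy request-level processor: never sample token id 42."""
    logits[42] = float("-inf")
    return logits
```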

Test Plan

A unit test wraps a request-level logits processor, passes it to the batch-level logits processor, and confirms correct behavior.

Test Result

Passes

Documentation update

Will be provided in #22919

@mergify mergify bot added the documentation and v1 labels Aug 26, 2025
Member

I think we should call it RequestLogitsProcessor

Member

The contents of this class look like the min-tokens LP ... is this just copy-pasted temporary state before updating it with the wrapper LP impl?


mergify bot commented Aug 28, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @afeldman-nm.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Aug 28, 2025
@WoosukKwon
Collaborator

@afeldman-nm @njhill Do we really want to support this feature? I thought we decided to deprecate it.

@afeldman-nm
Contributor Author

afeldman-nm commented Aug 28, 2025

@afeldman-nm @njhill Do we really want to support this feature? I thought we decided to deprecate it.

Hi Woosuk. This PR represents a compromise solution. New logits processors should subclass the V1 LogitsProcessor base class to create a new type of logits processor that operates at batch granularity. However, there are entire libraries of pre-existing V0-style logits processor implementations that operate at request granularity, for example:

https://github.com/NVIDIA/logits-processor-zoo

so the purpose of this PR is to provide the thinnest possible wrapper for plumbing these into the logits processor extensibility interface, without making any changes to the interface defined in #19912.

This way we are not committing to support V0-style processors in the engine internals/interface, but we are also still providing a solution for existing logits processor "zoos".

In the long term, anyone with a V0-style logits processor should upgrade their implementation for the perf benefits.
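To make the intended usage concrete, here is a minimal sketch of wrapping a V0-style callable for V1, assuming the wrapper landed as `AdapterLogitsProcessor` with a `new_req_logits_processor` hook (names and import path may differ slightly from the merged code):

```python
from typing import Optional

import torch
from vllm import SamplingParams
# Assumed import path for the wrapper and the request-level callable type:
from vllm.v1.sample.logits_processor import (AdapterLogitsProcessor,
                                             RequestLogitsProcessor)


def ban_eos(output_token_ids: list[int],
            logits: torch.Tensor) -> torch.Tensor:
    """Pre-existing V0-style processor: mask out a hypothetical EOS id (2)."""
    logits[2] = float("-inf")
    return logits


class BanEOSAdapter(AdapterLogitsProcessor):
    """Plumbs the request-level callable into the V1 batch-level interface."""

    def is_argmax_invariant(self) -> bool:
        # Masking a token can change the argmax, so this LP is not invariant.
        return False

    def new_req_logits_processor(
            self, params: SamplingParams) -> Optional[RequestLogitsProcessor]:
        # Called once per new request; return the per-request callable,
        # or None to leave this request unprocessed.
        return ban_eos
```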

@afeldman-nm
Contributor Author

@afeldman-nm @njhill Do we really want to support this feature? I thought we decided to deprecate it.

Once this PR and the logits processor documentation PR land, it might make sense for me to file an issue against https://github.com/NVIDIA/logits-processor-zoo and any other similar repo...

Member

@njhill njhill left a comment

Thanks @afeldman-nm

Signed-off-by: Andrew Feldman <[email protected]>
@njhill
Member

njhill commented Aug 28, 2025

@afeldman-nm @njhill Do we really want to support this feature? I thought we decided to deprecate it.

@WoosukKwon IMO we do want to support it for users that want to experiment and/or where performance isn't critical, but with the understanding that vectorized LPs should be used for performance. We also keep support for all the preexisting impls, as @afeldman-nm said.

It's much easier to implement something to mutate the LPs of a single request.

@njhill njhill marked this pull request as draft August 28, 2025 16:38
Signed-off-by: Andrew Feldman <[email protected]>
Signed-off-by: Andrew Feldman <[email protected]>
Member

@njhill njhill left a comment

Thanks @afeldman-nm, the class looks good to me now; just a couple of nits remaining.

Comment on lines 195 to 196
def __init__(self, vllm_config: VllmConfig, device: torch.device,
is_pin_memory: bool):
Member

I guess we don't need/want any args here, so that they can just call super().__init__().

But we want the contract to be that their subclass of this must have those args, since that's expected of our regular LogitsProcessors (and they can also then make use of them if they need to).

I'm just not sure how to reflect that in the abstract class apart from via comments/doc.

Contributor Author

If I am understanding your comment correctly, yes I think these arguments cannot be skipped unfortunately. I added a docstring explaining that these arguments may be used by the subclass but must be present regardless.

This is also worth mentioning in the docs; however, I think I will leave documentation of this per-request logits processor wrapper to the docs PR.

Member

I was saying that we don't actually need the args here if we require the subclasses to have such a constructor.

However, on second thought, maybe this is better, because then implementations don't need to define an init method at all if they don't need any of the args (like your dummy impls, for example).
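A quick sketch of the two subclassing options being discussed, under the constructor signature shown in the diff above (abstract-method implementations omitted for brevity; `LogitsProcessor` import path assumed):

```python
import torch
from vllm.config import VllmConfig
# Assumed import path for the V1 batch-level base class:
from vllm.v1.sample.logits_processor import LogitsProcessor


class NoInitNeeded(LogitsProcessor):
    # Option A: define no __init__ at all; the inherited
    # (vllm_config, device, is_pin_memory) constructor is used unchanged.
    ...


class DeviceAware(LogitsProcessor):
    # Option B: override __init__, but keep the exact signature the
    # engine uses to construct every LP, then use the args as needed.
    def __init__(self, vllm_config: VllmConfig, device: torch.device,
                 is_pin_memory: bool):
        super().__init__(vllm_config, device, is_pin_memory)
        self.device = device
```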

Signed-off-by: Andrew Feldman <[email protected]>
Signed-off-by: Andrew Feldman <[email protected]>
Signed-off-by: Andrew Feldman <[email protected]>
afeldman-nm and others added 10 commits September 1, 2025 00:07
Signed-off-by: Andrew Feldman <[email protected]>
Signed-off-by: Andrew Feldman <[email protected]>
Signed-off-by: Andrew Feldman <[email protected]>
Signed-off-by: Andrew Feldman <[email protected]>
Signed-off-by: Andrew Feldman <[email protected]>
Signed-off-by: Andrew Feldman <[email protected]>
Signed-off-by: Andrew Feldman <[email protected]>
Member

@njhill njhill left a comment

Thanks @afeldman-nm! LGTM, just a couple of tiny things.

Member

@njhill njhill left a comment

Thanks @afeldman-nm!

@njhill njhill enabled auto-merge (squash) September 2, 2025 23:45
@njhill njhill merged commit 136d853 into vllm-project:main Sep 3, 2025
39 checks passed
@afeldman-nm afeldman-nm deleted the lp_v0_plumb branch September 3, 2025 03:23
845473182 pushed a commit to 845473182/vllm that referenced this pull request Sep 3, 2025
* 'main' of https://github.com/845473182/vllm: (457 commits)
  [BugFix] Fix routed_scaling_factor double mul for dots1 and glm4 MoE models (vllm-project#24132)
  [Misc] Add check for dual_chunk_attention (vllm-project#24070)
  [Doc]: fix typos in Python comments (vllm-project#24115)
  [Doc]: fix typos in Python comments (vllm-project#24093)
  [Compile] Fix Compile Warning for `w4a8_mm_entry.cu` (vllm-project#23660)
  fix some typos (vllm-project#24071)
  [V1] Wrapper which plumbs request-level logits processors into vLLM batch-level logits processing (vllm-project#23656)
  Upgrade xgrammar to 0.1.23 (vllm-project#22988)
  Update release pipeline post PyTorch 2.8.0 update (vllm-project#24073)
  [XPU] Fix the bug of LoRA logits on the XPU platform (vllm-project#24081)
  [CI/Build] Disable SiluMul NVFP4 quant fusion tests (vllm-project#24121)
  [Bug] R1 Accuracy: Fix `routed_scaling_factor` Double Mul Issue (vllm-project#24119)
  [AMD][Kernel][Bugfix] Cast offsets tensor bn to tl.int64 to avoid GPU segfault (vllm-project#23692)
  [CI] Enable all hf transformers baselines in test_hybrid (vllm-project#23936)
  [Log] Only Print Profiler Results on Rank 0 (vllm-project#23370)
  Fix weights loading for Apertus (vllm-project#24100)
  [Metrics] Deprecate TPOT in favor of ITL (vllm-project#24110)
  [Bugfix] Fix packed_factor missing attribute error (vllm-project#23902)
  Run ruff format on a few files. (vllm-project#24075)
  [Bugfix] Fix transform_config parsing in Compressed Tensors (vllm-project#23945)
  ...
@jwkirchenbauer

jwkirchenbauer commented Sep 4, 2025

@afeldman-nm @njhill Do we really want to support this feature? I thought we decided to deprecate it.

Hi Woosuk. This PR represents a compromise solution. New logits processors should subclass the V1 LogitsProcessor base class to create a new type of logits processor that operates at batch granularity. However, there are entire libraries of pre-existing V0-style logits processor implementations that operate at request granularity, for example:

https://github.com/NVIDIA/logits-processor-zoo

so the purpose of this PR is to provide the thinnest possible wrapper for plumbing these into the logits processor extensibility interface, without making any changes to the interface defined in #19912.

This way we are not committing to support V0-style processors in the engine internals/interface, but we are also still providing a solution for existing logits processor "zoos".

In the long term, anyone with a V0-style logits processor should upgrade their implementation for the perf benefits.

Note that it is still a little opaque how to use logits processors that were developed in the V0 style.

If an old-school logits processor is run in version 0.10.1.1, you land at the export VLLM_USE_V1=0 workaround, because you get the message "vLLM V1 does not support per request user provided logits processors."
However, you then get the message "ValueError: Logits processors are not supported in multi-step decoding", which was added here: 37efc63#diff-c89ac25bd066e936e80260d21be63c7d2379cfedc371a9ff288fb5ba02ae1350R689-R1985

The fix reached on that zoo repo's side was a downgrade to 0.10.0, which also just "worked" for me.

All in all, though, I get the sense that something needs to be resolved here, but I have no experience with either library to understand why/how this is happening or whether it will get resolved.
The use case is actually this repo, where watermarking methods are implemented as logits processors: https://github.com/THU-BPM/MarkLLM
The current state of vLLM essentially breaks this kind of special LP use case, I think, and I assume there are more of them.
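For what it's worth, a sketch of what the post-PR path could look like for a zoo like MarkLLM, assuming the `logits_processors` engine argument described in the docs PR (#22919): wrap the request-level callable as sketched earlier in this thread and register the wrapper class at engine construction, rather than passing the callable per request.

```python
from vllm import LLM, SamplingParams

# `BanEOSAdapter` is the hypothetical wrapper subclass sketched earlier
# in this thread; `logits_processors` is the assumed engine-level knob.
llm = LLM(model="facebook/opt-125m", logits_processors=[BanEOSAdapter])
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=8))
```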

eicherseiji pushed a commit to eicherseiji/vllm that referenced this pull request Sep 9, 2025
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025

Labels

documentation (Improvements or additions to documentation), ready (ONLY add when PR is ready to merge/full CI is needed), v1


5 participants