[V1] Wrapper which plumbs request-level logits processors into vLLM batch-level logits processing #23656
Conversation
I think we should call it RequestLogitsProcessor
The contents of this class look like the min tokens LP ... is this just copy-pasted temporary state before updating it with the wrapper LP impl?
This pull request has merge conflicts that must be resolved before it can be merged.
@afeldman-nm @njhill Do we really want to support this feature? I thought we decided to deprecate it.
Hi Woosuk. This PR represents a compromise solution. New logits processors should subclass the V1 batch-level interface; however, existing collections of request-level logits processors such as https://github.com/NVIDIA/logits-processor-zoo were written against the old per-request interface. The purpose of this PR is to provide the thinnest possible wrapper for plumbing these into the logits processor extensibility interface, without making any changes to the interface defined in #19912. This way we are not committing to supporting V0-style processors in the engine internals/interface, but we are still providing a solution for existing logits processor "zoos". In the long term, anyone with a V0-style logits processor should upgrade their implementation for the perf benefits.
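For context, a V0-style request-level logits processor is just a callable applied to one request's logits row at each decode step. The sketch below assumes the two-argument form from the linked interface (the token ids generated so far, plus that request's logits row); the masking rule itself is hypothetical, purely for illustration.

```python
import torch

# A minimal sketch of a V0-style, request-level logits processor: the kind
# of callable this PR's wrapper adapts. The rule here is hypothetical.
def ban_token_zero(past_token_ids: list[int],
                   logits: torch.Tensor) -> torch.Tensor:
    # Once any tokens have been generated, forbid token id 0.
    if past_token_ids:
        logits[0] = float("-inf")
    return logits
```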
Once this PR and the logits processor documentation PR land, it might make sense for me to file an issue against https://github.com/NVIDIA/logits-processor-zoo and any other similar repo...
Thanks @afeldman-nm
@WoosukKwon IMO we do want to support it for users that want to experiment and/or where performance isn't critical, but with the understanding that vectorized LPs should be used for performance. As well as support for all the preexisting impls, as @afeldman-nm said. It's much easier to implement something to mutate the LPs of a single request.
Thanks @afeldman-nm, the class looks good to me now, just a couple of nits remaining
```python
def __init__(self, vllm_config: VllmConfig, device: torch.device,
             is_pin_memory: bool):
```
I guess we don't need/want any args here, so that they can call super().__init__(). But we want the contract to be that their subclass of this must have those args, since that's expected of our regular LogitsProcessors (and they can also then make use of them if they need to). I'm just not sure how to reflect that in the abstract class apart from via comments/doc.
If I am understanding your comment correctly, yes, I think these arguments cannot be skipped, unfortunately. I added a docstring explaining that these arguments may be used by the subclass but must be present regardless.
This is also worth mentioning in the docs; however, I think I will leave documentation of this per-request logits processor wrapper to the docs PR.
I was saying that we don't actually need the args here if we require the subclasses to have such a constructor.
However, on second thought, maybe this is better, because then implementations don't need to define an init method at all if they don't need any of the args (like your dummy impls, for example).
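A hedged sketch of the contract being discussed: the subclass keeps the full constructor signature (even if unused) so it matches what is expected of regular batch-level LogitsProcessors. The module path, the base class name AdapterLogitsProcessor, and the super().__init__ signature below are my assumptions about this PR and may differ from the merged code.

```python
import torch
from vllm.config import VllmConfig
# Assumed module path and class name for the wrapper from this PR.
from vllm.v1.sample.logits_processor import AdapterLogitsProcessor

class MyWrappedProcessor(AdapterLogitsProcessor):
    # Per the discussion above, subclasses must accept these args even if
    # they ignore them; they may also use them (e.g. to place buffers on
    # the right device).
    def __init__(self, vllm_config: VllmConfig, device: torch.device,
                 is_pin_memory: bool):
        super().__init__(vllm_config, device, is_pin_memory)
        self.device = device
```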
Thanks @afeldman-nm! LGTM just a couple of tiny things
Thanks @afeldman-nm !
Note that something here is still a little opaque about how to use logits processors that were developed in the V0 style. If an old-school logits processor is run in version 0.10.1.1, then you land at vllm/vllm/engine/llm_engine.py, line 689 (at commit 37efc63).
The fix reached on that zoo repo's side was a downgrade to 0.10.0, which also just "worked" for me as well. All in all, though, I get the sense that something needs to be resolved here, but I have no experience with either library to understand why/how this is happening or whether it will get resolved.
Purpose
This PR adds support for passing request-level logits processors into the vLLM V1 engine. This is accomplished using a wrapper which builds a batch-level logits processor class out of a Callable request-level logits processor that complies with the interface described here: https://docs.vllm.ai/en/v0.9.2/api/vllm/logits_process.html
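A hedged usage sketch based on my reading of this PR: the class name AdapterLogitsProcessor, the new_req_logits_processor hook, and the extra_args plumbing are assumptions and may differ from the merged code.

```python
from typing import Callable, Optional

import torch
from vllm import SamplingParams
# Assumed module path and class name for the wrapper from this PR.
from vllm.v1.sample.logits_processor import AdapterLogitsProcessor

class BanTokenAdapter(AdapterLogitsProcessor):
    """Adapts a per-request callable into a batch-level logits processor."""

    def new_req_logits_processor(
        self, params: SamplingParams,
    ) -> Optional[Callable[[list[int], torch.Tensor], torch.Tensor]]:
        # Hypothetical per-request argument carried in SamplingParams.
        banned = (params.extra_args or {}).get("banned_token_id")
        if banned is None:
            return None  # this request does not use the processor

        def process(past_token_ids: list[int],
                    logits: torch.Tensor) -> torch.Tensor:
            logits[banned] = float("-inf")
            return logits

        return process
```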
Test Plan
A unit test wraps a request-level logits processor, passes it to the batch-level logits processor abstraction, and confirms correct behavior.
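A minimal sketch of the behavioral check described above, assuming nothing beyond the request-level interface itself (it exercises a standalone per-request callable rather than the full engine path):

```python
import torch

def ban_token_three(past_token_ids: list[int],
                    logits: torch.Tensor) -> torch.Tensor:
    # Illustrative request-level rule: always mask token id 3.
    logits[3] = float("-inf")
    return logits

def test_request_level_masking():
    logits = torch.zeros(8)
    out = ban_token_three([1, 2], logits)
    assert out[3] == float("-inf")       # banned token is masked
    assert torch.isfinite(out[:3]).all() # other tokens are untouched
```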
Test Result
Passes
Documentation update
Will be provided in #22919