[V1] LogitsProcessor programming model #16728
Signed-off-by: Nick Hill <[email protected]>
Signed-off-by: Andrew Feldman <[email protected]>
We were previously reusing the GPU SamplingMetadata class, but there have been incompatible changes upstream (PR vllm-project/vllm#16728). Since it's not clear for now whether we want to, should, or can reuse the LogitsProcessor implementation as is, I'm making a copy of the old version of the class for the spyre backend. This won't affect any features for now, since the vllm change was an internal refactoring without UX impact. Signed-off-by: Max de Bayser <[email protected]>
We were previously reusing the GPU SamplingMetadata class, but there have been incompatible changes upstream (PR vllm-project/vllm#16728). Since it's not clear for now whether we want to, should, or can reuse the LogitsProcessor implementation as is, I'm temporarily making a copy of the old versions of the files that we need for the spyre backend. This won't affect any features for now, since the vllm change was an internal refactoring without UX impact. --------- Signed-off-by: Max de Bayser <[email protected]>
We were previously reusing the GPU Sampling classes but there have been incompatible changes upstream (PR vllm-project/vllm#16728) Since it's not clear for now whether we want, should or can reuse the LogitsProcessor implementation as is, I'm making a copy of the old version of the class for the spyre backend. This won't affect any features for now since the vllm change was an internal refactoring without UX impact. Signed-off-by: Max de Bayser <[email protected]> fix linting Signed-off-by: Max de Bayser <[email protected]> Actually more classes need to be duplicated Signed-off-by: Max de Bayser <[email protected]> import the right sampler Signed-off-by: Max de Bayser <[email protected]> fix tests Signed-off-by: Max de Bayser <[email protected]> fix tests Signed-off-by: Max de Bayser <[email protected]>
The changes introduced by PR vllm-project/vllm#16728 to the sampler architecture were incompatible with our spyre model runner. Initially, as a stopgap solution, I copied the old sampling classes into our vllm_spyre tree just so that we could keep working on the latest changes from main. This commit reverts that and makes the same logits processor logic work for the spyre input batch and model runner classes. The difference from the GPU model runner is that in spyre we don't condense the batch; instead, we keep a boolean mask that is used to calculate "dense" request indices. These indices must be used for the BatchUpdateBuilder because they are the right ones to slice the `logits` tensor that is passed to the Sampler. Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Chendi Xue <[email protected]>
At first it wasn't obvious whether it would be easy to integrate the changes from PR vllm-project/vllm#16728, so initially I added a PR that copies the sampler files from before that PR into vllm-spyre. But it's actually easier than I thought, because the sampler code is not compiled to the AIU; only the model forward is. Currently the MinP processor holds one tensor for the CPU and one for the device. Since only the model forward runs on the AIU, both tensors end up on the CPU, which means there is an unnecessary copy from one to the other, but the result is still correct. There is a future upstream PR that will generalize the logits processor to other sampling parameters: vllm-project/vllm#19912 Signed-off-by: Max de Bayser <[email protected]> Co-authored-by: Joe Runde <[email protected]>
Signed-off-by: Nick Hill <[email protected]> Signed-off-by: Andrew Feldman <[email protected]> Signed-off-by: Andrew Feldman <[email protected]> Co-authored-by: Nick Hill <[email protected]>
Signed-off-by: Nick Hill <[email protected]> Signed-off-by: Andrew Feldman <[email protected]> Signed-off-by: Andrew Feldman <[email protected]> Co-authored-by: Nick Hill <[email protected]> Signed-off-by: Jinzhen Lin <[email protected]>
This PR is a continuation of the draft PR #13360.
In the context of the vLLM v1 engine, this PR (1) defines the programming model for creating logits processors (via subclassing a base `LogitsProcessor` class), (2) converts hard-coded built-in logits processors (min-p, min token penalty and logits bias) into subclasses of `LogitsProcessor`, and (3) introduces the logic for applying a list of logits processors sequentially (in-place) to the logits input.
This PR does not
Additional description (from #13360):
"Proposed abstraction for how to handle sampling parameters in relation to the persistent batch. This interface could then be used as an extension point for custom logits processors (note: see #16862).
Key goals/ideas:
Logits processor implementations are configured globally; per-request configuration of implementations won't be supported
They apply at a batch level rather than per-request to allow for / encourage vectorized application
Each logits processor encapsulates its own state and is responsible for updating it as needed based on notification of persistent batch updates and new output tokens each step. This minimizes the number of times tensors need to be reconstructed and updated on the GPU.
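The goals above can be illustrated with a minimal sketch. It is not the interface added by this PR: the `BatchUpdate` shape, the method names `update_state` and `apply`, and the helper `apply_logits_processors` are hypothetical simplifications, and plain Python lists stand in for GPU tensors so that the example has no dependencies. It shows one processor owning its per-row state, being notified of persistent batch changes, and being applied in-place and in sequence with any other processors.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Optional

# Stand-in for a [num_requests, vocab_size] tensor: one row per request.
Logits = list[list[float]]


@dataclass
class BatchUpdate:
    """Hypothetical description of persistent-batch changes for one step."""
    added: dict[int, dict] = field(default_factory=dict)  # row index -> sampling params
    removed: list[int] = field(default_factory=list)      # row indices that left the batch


class LogitsProcessor(ABC):
    """Hypothetical base class: batch-level, stateful, applied in-place."""

    @abstractmethod
    def update_state(self, batch_update: Optional[BatchUpdate]) -> None:
        """Called every step; batch_update is None when the batch is unchanged."""

    @abstractmethod
    def apply(self, logits: Logits) -> Logits:
        """Modify the whole batch of logits in-place and return it."""


class LogitBiasProcessor(LogitsProcessor):
    """Adds per-request token biases; state is keyed by batch row index."""

    def __init__(self) -> None:
        self.biases: dict[int, dict[int, float]] = {}

    def update_state(self, batch_update: Optional[BatchUpdate]) -> None:
        if batch_update is None:
            return  # nothing changed, so the cached state is reused as-is
        for row in batch_update.removed:
            self.biases.pop(row, None)
        for row, params in batch_update.added.items():
            bias = params.get("logit_bias")
            if bias:
                self.biases[row] = dict(bias)
            else:
                self.biases.pop(row, None)

    def apply(self, logits: Logits) -> Logits:
        for row, bias in self.biases.items():
            for token_id, delta in bias.items():
                logits[row][token_id] += delta
        return logits


def apply_logits_processors(
    logits: Logits,
    processors: list[LogitsProcessor],
    batch_update: Optional[BatchUpdate] = None,
) -> Logits:
    """Notify each processor of batch changes, then apply them in order."""
    for proc in processors:
        proc.update_state(batch_update)
        logits = proc.apply(logits)
    return logits
```

In this sketch the processor only rebuilds its state when a `BatchUpdate` arrives, which mirrors the stated goal of minimizing how often state has to be reconstructed; a tensor-based version would keep the biases in a device tensor instead of a dictionary.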
...I've implemented LPs for min_tokens, logit_bias and min_p, but if we decide to go this route it should be straightforward to refactor the others similarly...
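As a rough illustration of what a batch-level min_p step computes, here is a dependency-free sketch. The function name `apply_min_p` and the list-based layout are assumptions for this example, not code from the PR. Min-p keeps a token only if its probability is at least `min_p` times the probability of the most likely token; because the ratio of two softmax probabilities equals the exponential of the difference of their logits, the same test can be done directly on logits as `logit >= max_logit + log(min_p)`, which is the form used below.

```python
import math


def apply_min_p(logits: list[list[float]], min_p: list[float]) -> list[list[float]]:
    """Mask tokens below the min-p threshold, in-place, one threshold per row.

    logits: one row of logits per request in the batch.
    min_p:  per-request threshold in [0, 1]; 0 disables filtering for that row.
    """
    for row, p in zip(logits, min_p):
        if p <= 0.0:
            continue  # min-p is disabled for this request
        # p_i < min_p * p_max  <=>  logit_i < max_logit + log(min_p)
        cutoff = max(row) + math.log(p)
        for token_id, value in enumerate(row):
            if value < cutoff:
                row[token_id] = -math.inf
    return logits
```

A tensor-based version would compute the per-row maximum and the comparison for the whole batch at once rather than looping, which is the vectorized application the goals above encourage.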
@WoosukKwon @AlpinDale @houseroad