[V1] LogitsProcessor programming model #16728

afeldman-nm · 2025-04-16T15:08:04Z

This PR is a continuation of the draft PR #13360.

In the context of the vLLM V1 engine, this PR (1) defines the programming model for creating logits processors (by subclassing a base LogitsProcessor class), (2) converts the hard-coded built-in logits processors (min-p, min-tokens penalty, and logits bias) into subclasses of LogitsProcessor, and (3) introduces the logic for applying a list of logits processors sequentially (in place) to the logits input.
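As a rough illustration of point (3), the sketch below applies a list of processors one after another to the same logits object. It is not the vLLM implementation: `AddBias`, `MaskToken` and `apply_all` are hypothetical names, and a nested Python list stands in for the `[batch, vocab]` `torch.Tensor` so that the example runs without any dependencies.

```python
from typing import List, Protocol

# Stand-in for a [batch, vocab] torch.Tensor (assumption made for this sketch).
Logits = List[List[float]]


class Processor(Protocol):
    def apply(self, logits: Logits) -> Logits: ...


class AddBias:
    """Hypothetical processor: adds a fixed bias to one token id in every row."""

    def __init__(self, token_id: int, bias: float) -> None:
        self.token_id = token_id
        self.bias = bias

    def apply(self, logits: Logits) -> Logits:
        for row in logits:
            row[self.token_id] += self.bias  # mutate in place
        return logits


class MaskToken:
    """Hypothetical processor: forbids one token id in every row."""

    def __init__(self, token_id: int) -> None:
        self.token_id = token_id

    def apply(self, logits: Logits) -> Logits:
        for row in logits:
            row[self.token_id] = float("-inf")
        return logits


def apply_all(processors: List[Processor], logits: Logits) -> Logits:
    """Apply each processor in list order; every one sees the previous result."""
    for proc in processors:
        logits = proc.apply(logits)
    return logits


logits = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]]
result = apply_all([AddBias(token_id=1, bias=10.0), MaskToken(token_id=2)], logits)
```

Because each processor mutates and returns the same object, `result` is the original `logits` list, which mirrors the in-place behavior described above.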

This PR does not:

Allow the user to extend the vLLM V1 engine with custom logits processors ([RFC]: Logits processor extensibility #17799)
Allow the user to pass in custom logits processor arguments via the OpenAI API ([Frontend] Expose custom args in OpenAI APIs #16862)

Additional description (from #13360):

"Proposed abstraction for how to handle sampling parameters in relation to the persistent batch. This interface could then be used as an extension point for custom logits processors (note: see #16862).

Key goals/ideas:

Logits processor implementations are configured globally; per-request configuration won't be supported
They apply at a batch level rather than per-request to allow for / encourage vectorized application
Each logits processor encapsulates its own state and is responsible for updating it as needed based on notification of persistent batch updates and new output tokens each step. This minimizes the number of times tensors need to be reconstructed and updated on the GPU.

...I've implemented LPs for min_tokens, logit_bias and min_p, but if we decide to go this route it should be straightforward to refactor the others similarly...

# Postponed annotation evaluation lets update_states() reference
# BatchUpdate before the class is defined below.
from __future__ import annotations

import dataclasses
from abc import ABC, abstractmethod
from typing import List, Optional, Tuple

import torch
from vllm import SamplingParams


class LogitsProcessor(ABC):
    @abstractmethod
    def apply(self, logits: torch.Tensor) -> torch.Tensor:
        raise NotImplementedError

    @abstractmethod
    def update_states(
        self,
        batch_update: Optional[BatchUpdate] = None,
    ) -> None:
        """Called when there are new output tokens, prior
        to each forward pass.

        Args:
            batch_update: non-None iff there have been
                changes to the batch makeup.
        """
        raise NotImplementedError


@dataclasses.dataclass
class BatchUpdate:
    # Batch indices of any removed requests.
    removed: List[int]
    # (from, to) batch indices of any requests
    # moved within the batch.
    moved: List[Tuple[int, int]]
    # (index, params, output_tok_ids) for new
    # requests added to the batch.
    # TODO: may need to include one or two other things here, like prompt token ids.
    added: List[Tuple[int, SamplingParams, List[int]]]
    # The current number of requests in the batch.
    batch_size: int
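To make the interface above more concrete, here is a minimal, dependency-free sketch of a min-tokens style processor that keeps its own state in sync with `BatchUpdate` notifications. It is not the code from this PR. Several things are assumptions made for illustration: `Params` is a hypothetical stand-in for `SamplingParams`, nested lists stand in for the logits tensor, the order in which `removed`, `added` and `moved` are handled is a guess, and each `output_tok_ids` list is assumed to be a live reference that the engine appends to as tokens are generated.

```python
import dataclasses
from typing import Dict, List, Optional, Tuple


@dataclasses.dataclass
class Params:
    # Hypothetical stand-in for SamplingParams.
    min_tokens: int
    eos_token_id: int


@dataclasses.dataclass
class BatchUpdate:
    removed: List[int]
    moved: List[Tuple[int, int]]
    added: List[Tuple[int, Params, List[int]]]
    batch_size: int


class MinTokensSketch:
    """Masks EOS for every request that has not yet produced min_tokens."""

    def __init__(self) -> None:
        # batch index -> (params, live list of output token ids)
        self.active: Dict[int, Tuple[Params, List[int]]] = {}

    def update_states(self, batch_update: Optional[BatchUpdate] = None) -> None:
        if batch_update is not None:
            # Assumed order: removals, then additions, then moves.
            for idx in batch_update.removed:
                self.active.pop(idx, None)
            for idx, params, out_ids in batch_update.added:
                if params.min_tokens > 0:
                    self.active[idx] = (params, out_ids)
            for src, dst in batch_update.moved:
                entry = self.active.pop(src, None)
                if entry is None:
                    self.active.pop(dst, None)  # untracked request now owns dst
                else:
                    self.active[dst] = entry
        # Runs every step: drop requests that have reached their minimum length.
        self.active = {
            idx: (params, out_ids)
            for idx, (params, out_ids) in self.active.items()
            if len(out_ids) < params.min_tokens
        }

    def apply(self, logits: List[List[float]]) -> List[List[float]]:
        for idx, (params, _) in self.active.items():
            logits[idx][params.eos_token_id] = float("-inf")
        return logits


proc = MinTokensSketch()
out0: List[int] = []
out2: List[int] = [7]
proc.update_states(BatchUpdate(
    removed=[], moved=[],
    added=[(0, Params(2, 3), out0), (1, Params(0, 3), []), (2, Params(2, 3), out2)],
    batch_size=3,
))
step1 = proc.apply([[0.0] * 4 for _ in range(3)])

out2.append(8)        # request 2 reaches min_tokens
proc.update_states()  # no change to the batch makeup, only new tokens
after_tokens = sorted(proc.active)

proc.update_states(BatchUpdate(removed=[], moved=[(0, 1)], added=[], batch_size=2))
step2 = proc.apply([[0.0] * 4 for _ in range(2)])
```

The point of the design is that the processor only touches its state when the batch makeup changes or a request crosses its threshold, which is what lets a tensor-based version avoid rebuilding GPU tensors on every step.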

@WoosukKwon @AlpinDale @houseroad

Signed-off-by: Nick Hill <[email protected]>

github-actions · 2025-04-16T15:08:14Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

mergify · 2025-04-16T15:08:43Z

⚠️ The sha of the head commit of this PR conflicts with #13360. Mergify cannot evaluate rules on this PR. ⚠️

Signed-off-by: Andrew Feldman <[email protected]>

mergify · 2025-04-23T15:57:16Z

⚠️ The sha of the head commit of this PR conflicts with #13360. Mergify cannot evaluate rules on this PR. ⚠️

mergify · 2025-04-30T17:11:22Z

⚠️ The sha of the head commit of this PR conflicts with #13360. Mergify cannot evaluate rules on this PR. ⚠️

Signed-off-by: Andrew Feldman <[email protected]>

mergify · 2025-05-01T14:24:04Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @afeldman-nm.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Signed-off-by: Andrew Feldman <[email protected]>

mergify · 2025-07-02T01:11:16Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @afeldman-nm.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Signed-off-by: Andrew Feldman <[email protected]>

We were previously reusing the GPU SamplingMetadata class, but there have been incompatible changes upstream (PR vllm-project/vllm#16728). Since it is not yet clear whether we want to, should, or can reuse the LogitsProcessor implementation as is, I'm making a copy of the old version of the class for the spyre backend. This won't affect any features for now, since the vLLM change was an internal refactoring without UX impact. Signed-off-by: Max de Bayser <[email protected]>

We were previously reusing the GPU SamplingMetadata class, but there have been incompatible changes upstream (PR vllm-project/vllm#16728). Since it is not yet clear whether we want to, should, or can reuse the LogitsProcessor implementation as is, I'm temporarily making a copy of the old versions of the files that we need for the spyre backend. This won't affect any features for now, since the vLLM change was an internal refactoring without UX impact. --------- Signed-off-by: Max de Bayser <[email protected]>

We were previously reusing the GPU Sampling classes, but there have been incompatible changes upstream (PR vllm-project/vllm#16728). Since it is not yet clear whether we want to, should, or can reuse the LogitsProcessor implementation as is, I'm making a copy of the old version of the class for the spyre backend. This won't affect any features for now, since the vLLM change was an internal refactoring without UX impact.
Signed-off-by: Max de Bayser <[email protected]>

fix linting
Signed-off-by: Max de Bayser <[email protected]>

Actually more classes need to be duplicated
Signed-off-by: Max de Bayser <[email protected]>

import the right sampler
Signed-off-by: Max de Bayser <[email protected]>

fix tests
Signed-off-by: Max de Bayser <[email protected]>

fix tests
Signed-off-by: Max de Bayser <[email protected]>

The changes introduced by PR vllm-project/vllm#16728 to the sampler architecture were incompatible with our spyre model runner. Initially, as a stopgap solution, I copied the old sampling classes into our vllm_spyre tree just so that we could keep working on the latest changes from main. This commit reverts that and makes the same logits processor logic work for the spyre input batch and model runner classes. The difference from the GPU model runner is that in spyre we don't condense the batch; instead, we have a boolean mask that is used to calculate "dense" request indices. These indices must be used for the BatchUpdateBuilder because they are the right ones to slice the `logits` tensor that is passed to the Sampler. Signed-off-by: Max de Bayser <[email protected]>
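The "dense" index idea in the commit message above can be sketched as follows. This is not the vllm-spyre code: `dense_indices` is a hypothetical helper, and it assumes that the `logits` tensor holds one row per active request, in slot order.

```python
from typing import Dict, List


def dense_indices(active_mask: List[bool]) -> Dict[int, int]:
    """Map each occupied batch slot to its row in the dense logits tensor.

    The dense row of a slot is the number of active slots before it,
    i.e. an exclusive prefix sum over the boolean mask.
    """
    mapping: Dict[int, int] = {}
    next_row = 0
    for slot, is_active in enumerate(active_mask):
        if is_active:
            mapping[slot] = next_row
            next_row += 1
    return mapping


# Slots 1 and 4 are empty, so the logits tensor would have three rows.
mapping = dense_indices([True, False, True, True, False])
```

Under these assumptions, a logits processor would be told about the dense row (for example, 1 for slot 2) and not the raw slot index, since the dense row is what addresses the request inside `logits`.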

Signed-off-by: Chendi Xue <[email protected]>

At first it wasn't obvious whether it would be easy to integrate the changes of PR vllm-project/vllm#16728, so initially I added a PR that copies the sampler files from before that PR into vllm-spyre. But it is actually easier than I thought, because the sampler code is not compiled to the AIU; only the model forward is. Currently the MinP processor keeps one tensor for the CPU and one for the device. Since only the model forward runs on the AIU, both tensors end up on the CPU, which means that there is an unnecessary copy from one to the other, but the result is still correct. There is a future upstream PR that will generalize the logits processor to other sampling parameters: vllm-project/vllm#19912 Signed-off-by: Max de Bayser <[email protected]> Co-authored-by: Joe Runde <[email protected]>

Signed-off-by: Nick Hill <[email protected]> Signed-off-by: Andrew Feldman <[email protected]> Signed-off-by: Andrew Feldman <[email protected]> Co-authored-by: Nick Hill <[email protected]>

Signed-off-by: Nick Hill <[email protected]> Signed-off-by: Andrew Feldman <[email protected]> Signed-off-by: Andrew Feldman <[email protected]> Co-authored-by: Nick Hill <[email protected]> Signed-off-by: Jinzhen Lin <[email protected]>

[RFC][V1] LogitsProcessor interface

b504b73

Signed-off-by: Nick Hill <[email protected]>

afeldman-nm mentioned this pull request Apr 16, 2025

[RFC][V1] LogitsProcessor interface #13360

Draft

afeldman-nm added 14 commits April 18, 2025 17:25

extra_args

55328d8

Signed-off-by: Andrew Feldman <[email protected]>

Merge branch 'main' into extra_args

cc44096

Merge branch 'main' into extra_args

876de25

rename

191b9e1

Signed-off-by: Andrew Feldman <[email protected]>

rename

1b658cd

Signed-off-by: Andrew Feldman <[email protected]>

Merge branch 'main' into extra_args

6c892d8

extra_body

6a0f87c

Signed-off-by: Andrew Feldman <[email protected]>

completion custom arg unit test

ac57a7f

Signed-off-by: Andrew Feldman <[email protected]>

Merge branch 'main' into extra_args

9753c75

Merge branch 'main' into extra_args

c2f39bd

tweak extra_args; test sampling params extra args via api

5c43609

Signed-off-by: Andrew Feldman <[email protected]>

Merge branch 'main' into extra_args

1f8d6d1

remove unnecessary extra_body field/breakout

368f907

Signed-off-by: Andrew Feldman <[email protected]>

removed transcription scenario

a90311a

Signed-off-by: Andrew Feldman <[email protected]>

Merge branch 'main' into extra_args

0e7809d

afeldman-nm mentioned this pull request Apr 25, 2025

[RFC]: Custom sampling params support in REST API #17191

Closed

1 task

small changes

42b0d31

Signed-off-by: Andrew Feldman <[email protected]>

mergify bot added the v1 and tpu (Related to Google TPUs) labels May 1, 2025

mergify bot added the needs-rebase label May 1, 2025

afeldman-nm added 3 commits May 2, 2025 14:25

spec decode min p

f1ef8ef

Signed-off-by: Andrew Feldman <[email protected]>

spec decode min p

b270ac4

Signed-off-by: Andrew Feldman <[email protected]>

wip TPU fix

49531cb

Signed-off-by: Andrew Feldman <[email protected]>

afeldman-nm added 4 commits July 1, 2025 10:20

Merge branch 'main' into logitsprocs_merge

d377a6b

memory util

6ae7574

Signed-off-by: Andrew Feldman <[email protected]>

Merge branch 'main' into logitsprocs_merge

5203324

Merge branch 'main' into logitsprocs_merge

68aab25

mergify bot added the needs-rebase label Jul 2, 2025

merge'

066736d

Signed-off-by: Andrew Feldman <[email protected]>

mergify bot removed the needs-rebase label Jul 2, 2025

aarnphm mentioned this pull request Jul 2, 2025

[Feature]: Limit thinking tokens #15418

Open

1 task

vllm-bot merged commit 48fb076 into vllm-project:main Jul 2, 2025
62 of 69 checks passed

github-project-automation bot moved this to Done in Tool Calling Jul 2, 2025

github-project-automation bot moved this to Done in Structured Output Jul 2, 2025

afeldman-nm deleted the logitsprocs branch July 2, 2025 16:46

maxdebayser mentioned this pull request Jul 3, 2025

Duplicate the SamplingMetadata class vllm-project/vllm-spyre#278

Merged

xuechendi mentioned this pull request Jul 3, 2025

Fix failing due to (#16728) vllm-project/vllm-gaudi#5

Merged

maxdebayser mentioned this pull request Jul 8, 2025

Integrate upstream logits processors vllm-project/vllm-spyre#290

Merged

kzawora-intel pushed a commit to HabanaAI/vllm-fork that referenced this pull request Jul 10, 2025

Fix failing due to (vllm-project#16728) (#5)

1755fdb

Signed-off-by: Chendi Xue <[email protected]>

llsj14 mentioned this pull request Jul 14, 2025

[Feature] limit thinking tokens (hard limit) #20859

Open

4 tasks

ZeroYuJie mentioned this pull request Aug 29, 2025

[WIP]: DRY sampling #16695

Closed
