
Conversation

afeldman-nm (Contributor) commented Apr 16, 2025

This PR is a continuation of draft PR #13360.

In the context of the vLLM v1 engine, this PR (1) defines the programming model for creating logits processors (by subclassing a base LogitsProcessor class), (2) converts the hard-coded built-in logits processors (min-p, min-tokens penalty, and logit bias) into subclasses of LogitsProcessor, and (3) introduces the logic for applying a list of logits processors sequentially, in place, to the logits input.

This PR does not add support for user-defined custom logits processors; that extension point is handled separately (see #16862).

Additional description (from #13360):

"Proposed abstraction for how to handle sampling parameters in relation to the persistent batch. This interface could then be used as an extension point for custom logits processors (note: see #16862 ).

Key goals/ideas:

- Logits processor implementations are configured globally; per-request logits processor configuration is not supported.
- They apply at the batch level rather than per-request, to allow for / encourage vectorized application.
- Each logits processor encapsulates its own state and is responsible for updating it as needed, based on notification of persistent-batch updates and of new output tokens each step. This minimizes the number of times tensors need to be reconstructed and updated on the GPU.

...I've implemented LPs for min_tokens, logit_bias and min_p, but if we decide to go this route it should be straightforward to refactor the others similarly...

import dataclasses
from abc import ABC, abstractmethod
from typing import List, Optional, Tuple

import torch

from vllm import SamplingParams


class LogitsProcessor(ABC):

    @abstractmethod
    def apply(self, logits: torch.Tensor) -> torch.Tensor:
        raise NotImplementedError

    @abstractmethod
    def update_states(
        self,
        batch_update: Optional["BatchUpdate"] = None,
    ) -> None:
        """Called when there are new output tokens, prior
        to each forward pass.

        Args:
            batch_update: non-None iff there have been
                changes to the batch makeup.
        """
        raise NotImplementedError


@dataclasses.dataclass
class BatchUpdate:
    # Batch indices of any removed requests.
    removed: List[int]
    # (from, to) batch indices of any requests
    # moved within the batch.
    moved: List[Tuple[int, int]]
    # (index, params, output_tok_ids) for new
    # requests added to the batch.
    # TODO: may need to include one or two other things here,
    # like prompt token ids.
    added: List[Tuple[int, SamplingParams, List[int]]]
    # The current number of requests in the batch.
    batch_size: int
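
As an illustration of the intended usage, here is a minimal sketch of what a built-in processor could look like under this interface: a logit-bias processor that keeps per-request bias state in sync with the persistent batch and applies it to the logits in place. The class name, the internal dict layout, and the reliance on SamplingParams.logit_bias are assumptions made for the example; this is not the actual vLLM implementation.

# Hypothetical sketch building on the interface above; not the actual
# vLLM logit-bias processor.
class ExampleLogitBiasLogitsProcessor(LogitsProcessor):

    def __init__(self) -> None:
        # Per-request state: batch index -> {token_id: bias}.
        self.biases = {}

    def update_states(
        self,
        batch_update: Optional["BatchUpdate"] = None,
    ) -> None:
        if batch_update is None:
            # Only new output tokens; logit bias does not depend on them.
            return
        for index in batch_update.removed:
            self.biases.pop(index, None)
        for index, params, _output_tok_ids in batch_update.added:
            if params.logit_bias:
                self.biases[index] = params.logit_bias
            else:
                self.biases.pop(index, None)
        for from_index, to_index in batch_update.moved:
            # Simplified one-directional move; a real implementation would
            # also need to handle swaps between two occupied slots.
            bias = self.biases.pop(from_index, None)
            if bias is None:
                self.biases.pop(to_index, None)
            else:
                self.biases[to_index] = bias

    def apply(self, logits: torch.Tensor) -> torch.Tensor:
        # Row-by-row, in-place application; a real implementation would fold
        # these updates into a single vectorized index_add_ / scatter.
        for index, bias in self.biases.items():
            token_ids = torch.tensor(list(bias.keys()),
                                     dtype=torch.long, device=logits.device)
            values = torch.tensor(list(bias.values()),
                                  dtype=logits.dtype, device=logits.device)
            logits[index].index_add_(0, token_ids, values)
        return logits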

@WoosukKwon @AlpinDale @houseroad


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀


mergify bot commented Apr 16, 2025

⚠️ The sha of the head commit of this PR conflicts with #13360. Mergify cannot evaluate rules on this PR. ⚠️


mergify bot commented Apr 23, 2025

⚠️ The sha of the head commit of this PR conflicts with #13360. Mergify cannot evaluate rules on this PR. ⚠️


mergify bot commented Apr 30, 2025

⚠️ The sha of the head commit of this PR conflicts with #13360. Mergify cannot evaluate rules on this PR. ⚠️

Signed-off-by: Andrew Feldman <[email protected]>
@mergify mergify bot added the v1 and tpu labels May 1, 2025

mergify bot commented May 1, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @afeldman-nm.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label May 1, 2025
Signed-off-by: Andrew Feldman <[email protected]>
Signed-off-by: Andrew Feldman <[email protected]>
Signed-off-by: Andrew Feldman <[email protected]>

mergify bot commented Jul 2, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @afeldman-nm.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jul 2, 2025
Signed-off-by: Andrew Feldman <[email protected]>
@mergify mergify bot removed the needs-rebase label Jul 2, 2025
@vllm-bot vllm-bot merged commit 48fb076 into vllm-project:main Jul 2, 2025
62 of 69 checks passed
@afeldman-nm afeldman-nm deleted the logitsprocs branch July 2, 2025 16:46
maxdebayser added a commit to vllm-project/vllm-spyre that referenced this pull request Jul 3, 2025
We were previously reusing the GPU SamplingMetadata
class but there have been incompatible changes upstream
(PR vllm-project/vllm#16728)

Since it's not clear for now whether we want, should
or can reuse the LogitsProcessor implementation as is,
I'm making a copy of the old version of the class for
the spyre backend.

This won't affect any features for now since the vllm
change was an internal refactoring without UX impact.

Signed-off-by: Max de Bayser <[email protected]>
joerunde pushed a commit to vllm-project/vllm-spyre that referenced this pull request Jul 8, 2025
We were previously reusing the GPU SamplingMetadata class but there have
been incompatible changes upstream (PR
vllm-project/vllm#16728)

Since it's not clear for now whether we want, should, or can reuse the
LogitsProcessor implementation as is, I'm temporarily making a copy of
the old versions of the files that we need for the spyre backend.

This won't affect any features for now since the vllm change was an
internal refactoring without UX impact.

---------

Signed-off-by: Max de Bayser <[email protected]>
maxdebayser added a commit to vllm-project/vllm-spyre that referenced this pull request Jul 8, 2025
We were previously reusing the GPU Sampling
classes but there have been incompatible changes upstream
(PR vllm-project/vllm#16728)

Since it's not clear for now whether we want, should
or can reuse the LogitsProcessor implementation as is,
I'm making a copy of the old version of the class for
the spyre backend.

This won't affect any features for now since the vllm
change was an internal refactoring without UX impact.

Signed-off-by: Max de Bayser <[email protected]>

fix linting

Signed-off-by: Max de Bayser <[email protected]>

Actually more classes need to be duplicated

Signed-off-by: Max de Bayser <[email protected]>

import the right sampler

Signed-off-by: Max de Bayser <[email protected]>

fix tests

Signed-off-by: Max de Bayser <[email protected]>

fix tests

Signed-off-by: Max de Bayser <[email protected]>
maxdebayser added a commit to vllm-project/vllm-spyre that referenced this pull request Jul 9, 2025
The changes introduced by PR
vllm-project/vllm#16728
to the sampler architecture were incompatible with
our spyre model runner.

Initially, as a stopgap solution, I copied the old sampling
classes into our vllm_spyre tree just so that we could keep
working on the latest changes from main.

Now this commit reverts that and makes the same logits
processor logic work for the spyre input batch and model
runner classes. The difference from the GPU model runner is
that in spyre we don't condense the batch, but instead keep a boolean
mask that is used to calculate "dense" request indices. These
indices must be used for the BatchUpdateBuilder because they
are the right ones to slice the `logits` tensor that is passed
to the Sampler.

Signed-off-by: Max de Bayser <[email protected]>
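
To make the dense-index idea above concrete, here is a rough sketch (illustrative only, not the actual vllm-spyre code): given a boolean occupancy mask over the padded batch slots, the dense index of an occupied slot is the number of occupied slots preceding it, and those dense indices are the ones that line up with rows of the logits tensor handed to the Sampler.

import torch

# Illustrative only: mapping padded batch-slot indices to "dense" indices.
mask = torch.tensor([True, False, True, True, False])  # occupancy per padded slot

# Dense index of each slot = (number of occupied slots up to and including it) - 1.
dense_of_slot = torch.cumsum(mask.int(), dim=0) - 1     # tensor([0, 0, 1, 2, 2])
occupied_slots = torch.nonzero(mask, as_tuple=True)[0]  # tensor([0, 2, 3])
dense_indices = dense_of_slot[occupied_slots]           # tensor([0, 1, 2])

# The request in padded slot 3 maps to dense index 2, i.e. row 2 of the
# logits tensor passed to the Sampler; these are the indices that would be
# reported to the batch-update logic rather than the padded slot indices.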
kzawora-intel pushed a commit to HabanaAI/vllm-fork that referenced this pull request Jul 10, 2025
joerunde added a commit to vllm-project/vllm-spyre that referenced this pull request Jul 14, 2025
At first it wasn't obvious whether it would be easy to integrate the changes
of PR vllm-project/vllm#16728, so initially I added a PR that copied the
sampler files from before that PR into vllm-spyre. But it turned out to be
easier than expected, because the sampler code is not compiled for the AIU;
only the model forward is.

Currently the MinP processor keeps one tensor for the CPU and one for
the device. Since only the model forward runs on the AIU, both tensors
end up on the CPU, which means there is an unnecessary copy from
one to the other, but the result is still correct.

There is a future upstream PR that will generalize the Logits processor
to other sampling parameters:

vllm-project/vllm#19912

Signed-off-by: Max de Bayser <[email protected]>
Co-authored-by: Joe Runde <[email protected]>
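
For context on the duplicated tensors mentioned above, a hedged sketch of the staging pattern (illustrative only, not the actual MinP processor code): per-request min_p values are staged in a CPU tensor and copied into the tensor used when applying the processor; when that "device" tensor also lives on the CPU, as on the AIU backend, the copy is redundant but the result is unchanged.

import torch

# Illustrative sketch of the CPU/device staging pattern described above.
max_num_reqs = 4
device = torch.device("cpu")  # on the spyre/AIU backend this ends up as CPU

# Staging tensor, updated on the CPU as the persistent batch changes.
min_p_cpu = torch.zeros(max_num_reqs, dtype=torch.float32)
# Tensor actually used when applying the processor to the logits.
min_p_device = torch.empty_like(min_p_cpu, device=device)

min_p_cpu[0] = 0.05  # e.g. the request in row 0 asked for min_p=0.05
# Redundant when both tensors are on the CPU, but still correct.
min_p_device.copy_(min_p_cpu, non_blocking=True)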
Pradyun92 pushed a commit to Pradyun92/vllm that referenced this pull request Aug 6, 2025
Signed-off-by: Nick Hill <[email protected]>
Signed-off-by: Andrew Feldman <[email protected]>
Signed-off-by: Andrew Feldman <[email protected]>
Co-authored-by: Nick Hill <[email protected]>
jinzhen-lin pushed a commit to jinzhen-lin/vllm that referenced this pull request Aug 9, 2025
Signed-off-by: Nick Hill <[email protected]>
Signed-off-by: Andrew Feldman <[email protected]>
Signed-off-by: Andrew Feldman <[email protected]>
Co-authored-by: Nick Hill <[email protected]>
Signed-off-by: Jinzhen Lin <[email protected]>
@ZeroYuJie ZeroYuJie mentioned this pull request Aug 29, 2025