Skip to content

Conversation

maxdebayser
Copy link
Contributor

@maxdebayser maxdebayser commented May 12, 2025

This is an alternative to #16188 . In that other PR, I implemented embedding model support on the same model runner as the decoder models. This had the advantage that the code changes were fairly minimal. The other advantage in my opinion is that a single model runner implementation is less likely to become stale as new features and bug fixes only need to be applied to one code base. However, there were concerns about the performance implications and code complexity of a single implementation that tries to handle all cases.

In this PR I started by reverting all changes to the GPUModelRunner and created a GPUPoolingModelRunner basically by deleting everything that was related to sampling. In this state it was already passing the embedding model unit tests but there was still a lot of duplicated or unnecessary code.

Now I'm finished with the refactoring. Basically there is now a GPUBaseModelRunner that contains the common code and the GPUModelRunner and the GPUPoolingModelRunner implement the missing pieces.

There were a few issues where @22quinn spent some time thinking about:

  • kv-cache management: For encode models this is unnecessary because the attention mask is not causal and therefore optimizations such as chunked prefill and prefix caching don't apply. However, there are encoder models with pooling based in the last hidden state where these optimizations are applicable. One example is the intfloat/e5-mistral-7b-instruct that we use in the unit tests.
  • handling of m-rope, sliding window, multi-modal...: these thing are mostly orthogonal to pooling or sampling and went into the abstract base class
  • input batch management: when chunked prefill is disabled, in pooling models each request only stays in the batch for one execute_model call. However, with chunked prefill execute_model is called several times and the same logic that is used in the sampling models applies.
  • cascade attention: this only applies to decoding.

cc: @mgoin, @WoosukKwon , @DarkLight1337

FIX #18052

maxdebayser and others added 30 commits March 24, 2025 15:59
Encoder-only models can also benefit from the prefix caching that is
enabled by the kv cache

Signed-off-by: Max de Bayser <[email protected]>
This is only passing mypy, it hasn't been tested
yet

Signed-off-by: Max de Bayser <[email protected]>
... and disable cuda graphs for these models.

Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
@mergify mergify bot removed the needs-rebase label Jun 1, 2025
Copy link

mergify bot commented Jun 3, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @maxdebayser.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jun 3, 2025
@mergify mergify bot removed the needs-rebase label Jun 3, 2025
Copy link

mergify bot commented Jun 3, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @maxdebayser.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jun 3, 2025
@mergify mergify bot removed the needs-rebase label Jun 3, 2025
Copy link

mergify bot commented Jun 3, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @maxdebayser.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jun 3, 2025
@mergify mergify bot removed the needs-rebase label Jun 4, 2025
Copy link

mergify bot commented Jun 5, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @maxdebayser.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jun 5, 2025
@maxdebayser maxdebayser requested a review from aarnphm as a code owner June 6, 2025 23:54
@mergify mergify bot removed the needs-rebase label Jun 6, 2025
Copy link

mergify bot commented Jun 9, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @maxdebayser.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jun 9, 2025
@maxdebayser
Copy link
Contributor Author

Closed in favor of PR #16188 .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

[RFC]: Support pooling in V1
3 participants