Support embedding models in V1 with a dedicated model_runner #18015
Conversation
Signed-off-by: Max de Bayser <[email protected]>
Encoder-only models can also benefit from the prefix caching that is enabled by the KV cache.
This is only passing mypy; it hasn't been tested yet.
... and disable CUDA graphs for these models.
This pull request has merge conflicts that must be resolved before it can be merged.
Closed in favor of PR #16188.
This is an alternative to #16188 . In that other PR, I implemented embedding model support on the same model runner as the decoder models. This had the advantage that the code changes were fairly minimal. The other advantage in my opinion is that a single model runner implementation is less likely to become stale as new features and bug fixes only need to be applied to one code base. However, there were concerns about the performance implications and code complexity of a single implementation that tries to handle all cases.
In this PR I started by reverting all changes to the `GPUModelRunner` and created a `GPUPoolingModelRunner`, basically by deleting everything that was related to sampling. In this state it was already passing the embedding model unit tests, but there was still a lot of duplicated or unnecessary code. Now I'm finished with the refactoring: there is now a `GPUBaseModelRunner` that contains the common code, and `GPUModelRunner` and `GPUPoolingModelRunner` implement the missing pieces. There were a few issues that @22quinn spent some time thinking about:
- `intfloat/e5-mistral-7b-instruct` that we use in the unit tests.
- the `execute_model` call. However, with chunked prefill `execute_model` is called several times and the same logic that is used in the sampling models applies.

cc: @mgoin, @WoosukKwon, @DarkLight1337
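The described split between a common base runner and two specialized runners can be sketched roughly as follows. This is a minimal illustration, not vLLM's actual code: only the class names `GPUBaseModelRunner`, `GPUModelRunner`, and `GPUPoolingModelRunner` come from the PR; the method names and string stand-ins are hypothetical.

```python
from abc import ABC, abstractmethod


class GPUBaseModelRunner(ABC):
    """Common code shared by both runners (input prep, forward pass)."""

    def execute_model(self, scheduler_output: str) -> str:
        # Shared path: run the model, then let the subclass decide what to
        # do with the hidden states (sample tokens vs. pool embeddings).
        hidden_states = self._run_forward(scheduler_output)
        return self._process_outputs(hidden_states)

    def _run_forward(self, scheduler_output: str) -> str:
        # Stand-in for the shared batching / attention / forward logic.
        return f"hidden_states({scheduler_output})"

    @abstractmethod
    def _process_outputs(self, hidden_states: str) -> str:
        """Each runner implements only this missing piece."""
        raise NotImplementedError


class GPUModelRunner(GPUBaseModelRunner):
    """Decoder runner: samples the next tokens."""

    def _process_outputs(self, hidden_states: str) -> str:
        return f"sampled_tokens({hidden_states})"


class GPUPoolingModelRunner(GPUBaseModelRunner):
    """Pooling runner: everything related to sampling is deleted."""

    def _process_outputs(self, hidden_states: str) -> str:
        return f"pooled_embedding({hidden_states})"


print(GPUPoolingModelRunner().execute_model("batch0"))
# -> pooled_embedding(hidden_states(batch0))
```

The design choice this sketches is the one argued for above: features and bug fixes to the shared path land once in the base class, while the sampling-specific and pooling-specific behavior stays out of each other's way.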
FIX #18052
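On the chunked-prefill point above: because `execute_model` runs several times for one long prompt, a pooling runner has to produce its embedding only on the call that finishes the prompt. A hedged sketch of that check, with hypothetical names (not vLLM's actual API):

```python
def is_final_chunk(num_computed_tokens: int,
                   num_scheduled_tokens: int,
                   num_prompt_tokens: int) -> bool:
    """Return True only for the execute_model call that finishes the prompt.

    With chunked prefill, a long prompt is split across several
    execute_model calls; intermediate chunks only populate the KV cache,
    and the pooled embedding should be produced exactly once, after the
    last prompt token has been computed.
    """
    return num_computed_tokens + num_scheduled_tokens >= num_prompt_tokens


# A 10-token prompt prefilled in chunks of 4 tokens: three calls, with
# (tokens already computed, tokens scheduled this step) per call.
chunks = [(0, 4), (4, 4), (8, 2)]
print([is_final_chunk(done, sched, 10) for done, sched in chunks])
# -> [False, False, True]
```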