Support embedding models in V1 with a dedicated model_runner #18015
Conversation
Signed-off-by: Max de Bayser <[email protected]>
Encoder-only models can also benefit from the prefix caching that is enabled by the KV cache.
This is only passing mypy; it hasn't been tested yet.
... and disable CUDA graphs for these models.
This pull request has merge conflicts that must be resolved before it can be merged.
Closed in favor of PR #16188.
This is an alternative to #16188 . In that other PR, I implemented embedding model support on the same model runner as the decoder models. This had the advantage that the code changes were fairly minimal. The other advantage in my opinion is that a single model runner implementation is less likely to become stale as new features and bug fixes only need to be applied to one code base. However, there were concerns about the performance implications and code complexity of a single implementation that tries to handle all cases.
In this PR I started by reverting all changes to the `GPUModelRunner` and created a `GPUPoolingModelRunner`, basically by deleting everything that was related to sampling. In this state it was already passing the embedding model unit tests, but there was still a lot of duplicated or unnecessary code. Now I'm finished with the refactoring: there is now a `GPUBaseModelRunner` that contains the common code, and `GPUModelRunner` and `GPUPoolingModelRunner` implement the missing pieces. There were a few issues that @22quinn spent some time thinking about:
- `intfloat/e5-mistral-7b-instruct` that we use in the unit tests.
- the `execute_model` call. However, with chunked prefill `execute_model` is called several times and the same logic that is used in the sampling models applies.

cc: @mgoin, @WoosukKwon, @DarkLight1337
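The described split between a common base runner and two specialized runners can be sketched roughly as follows. This is a minimal illustration, not vLLM's actual code: only the class names `GPUBaseModelRunner`, `GPUModelRunner`, and `GPUPoolingModelRunner` come from the PR; the method names and string stand-ins are hypothetical.

```python
from abc import ABC, abstractmethod


class GPUBaseModelRunner(ABC):
    """Common code shared by both runners (input prep, forward pass)."""

    def execute_model(self, scheduler_output: str) -> str:
        # Shared path: run the model, then let the subclass decide what to
        # do with the hidden states (sample tokens vs. pool embeddings).
        hidden_states = self._run_forward(scheduler_output)
        return self._process_outputs(hidden_states)

    def _run_forward(self, scheduler_output: str) -> str:
        # Stand-in for the shared batching / attention / forward logic.
        return f"hidden_states({scheduler_output})"

    @abstractmethod
    def _process_outputs(self, hidden_states: str) -> str:
        """Each runner implements only this missing piece."""
        raise NotImplementedError


class GPUModelRunner(GPUBaseModelRunner):
    """Decoder runner: samples the next tokens."""

    def _process_outputs(self, hidden_states: str) -> str:
        return f"sampled_tokens({hidden_states})"


class GPUPoolingModelRunner(GPUBaseModelRunner):
    """Pooling runner: everything related to sampling is deleted."""

    def _process_outputs(self, hidden_states: str) -> str:
        return f"pooled_embedding({hidden_states})"


print(GPUPoolingModelRunner().execute_model("batch0"))
# -> pooled_embedding(hidden_states(batch0))
```

The design choice this sketches is the one argued for above: features and bug fixes to the shared path land once in the base class, while the sampling-specific and pooling-specific behavior stays out of each other's way.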
FIX #18052
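On the chunked-prefill point above: because `execute_model` runs several times for one long prompt, a pooling runner has to produce its embedding only on the call that finishes the prompt. A hedged sketch of that check, with hypothetical names (not vLLM's actual API):

```python
def is_final_chunk(num_computed_tokens: int,
                   num_scheduled_tokens: int,
                   num_prompt_tokens: int) -> bool:
    """Return True only for the execute_model call that finishes the prompt.

    With chunked prefill, a long prompt is split across several
    execute_model calls; intermediate chunks only populate the KV cache,
    and the pooled embedding should be produced exactly once, after the
    last prompt token has been computed.
    """
    return num_computed_tokens + num_scheduled_tokens >= num_prompt_tokens


# A 10-token prompt prefilled in chunks of 4 tokens: three calls, with
# (tokens already computed, tokens scheduled this step) per call.
chunks = [(0, 4), (4, 4), (8, 2)]
print([is_final_chunk(done, sched, 10) for done, sched in chunks])
# -> [False, False, True]
```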