[Model] CLIP Embedding Support #26010
Conversation
Signed-off-by: DarkLight1337 <[email protected]>
Signed-off-by: DarkLight1337 <[email protected]>
Code Review
This pull request adds support for CLIP models, enabling both text and image embedding. The changes include adding causal attention support to `MultiHeadAttention`, implementing the `CLIPEmbeddingModel`, and providing new tests and examples. The overall implementation is solid and the new functionality is well-tested. However, I've identified a few areas for improvement in `vllm/model_executor/models/clip.py`. Specifically, the model's forward pass relies on a stateful flag, which is a fragile design, and there are several pieces of dead code, including unused classes and methods, that should be removed to improve maintainability. Please see my detailed comments.
self._is_text_input = (multimodal_embeddings is None
                       or len(multimodal_embeddings) == 0)
Using a stateful flag `self._is_text_input` set in `get_input_embeddings` and read in `forward` creates an implicit dependency between these two methods. This makes the code fragile and harder to reason about. If the calling order changes in the model runner, it could lead to incorrect behavior.

A better approach would be to pass the modality information explicitly to the `forward` method. This could be done by having the model runner inspect the inputs and pass a flag (e.g., `is_text_input`) as part of the kwargs to `forward`.
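For reference, a minimal sketch of the stateless alternative the reviewer describes. The class, method signatures, and the `is_text_input` keyword are illustrative assumptions, not the PR's actual code or vLLM's model-runner interface:

```python
from typing import Optional

import torch
import torch.nn as nn


class CLIPEmbeddingModelSketch(nn.Module):
    """Illustrative only: shows the stateless flag-passing idea."""

    def __init__(self, text_model: nn.Module) -> None:
        super().__init__()
        self.text_model = text_model  # assumed to expose token_embedding()

    def get_input_embeddings(
        self,
        input_ids: torch.Tensor,
        multimodal_embeddings: Optional[list[torch.Tensor]] = None,
    ) -> torch.Tensor:
        # No instance state is written here; the caller determines the
        # modality and passes it to forward() explicitly.
        if multimodal_embeddings:
            return torch.cat(multimodal_embeddings, dim=0)
        return self.text_model.token_embedding(input_ids)

    def forward(
        self,
        positions: torch.Tensor,
        inputs_embeds: torch.Tensor,
        *,
        is_text_input: bool = True,  # assumed kwarg supplied by the model runner
    ) -> torch.Tensor:
        if not is_text_input:
            # Image embeddings were fully computed in get_input_embeddings,
            # so they are returned as-is (there is no decoder to run).
            return inputs_embeds
        return self.text_model(inputs_embeds, positions)
```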
Signed-off-by: DarkLight1337 <[email protected]>
/gemini review
Signed-off-by: DarkLight1337 <[email protected]>
Code Review
This pull request introduces support for the CLIP model, enabling both text and image embedding capabilities. The changes are comprehensive, including modifications to the attention mechanism to support causal masking, a new `CLIPEmbeddingModel` implementation, and updates to examples and tests. The overall implementation appears solid. I've identified one high-severity issue related to the robustness of selecting a dummy token ID, which could cause problems with different tokenizers for CLIP-like models. The proposed fix improves the robustness of this logic.
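The review text above doesn't include the fix itself. Purely as an illustration, making dummy-token selection robust across tokenizers usually means deriving the ID from the tokenizer's own special tokens rather than hardcoding one; the function name and fallback order below are assumptions:

```python
from transformers import PreTrainedTokenizerBase


def pick_dummy_token_id(tokenizer: PreTrainedTokenizerBase) -> int:
    """Choose a token ID that is valid for this tokenizer (illustrative only)."""
    # Prefer special tokens that CLIP-like tokenizers define, then fall back
    # to the first ID in the vocabulary.
    for candidate in (tokenizer.pad_token_id, tokenizer.eos_token_id,
                      tokenizer.bos_token_id):
        if candidate is not None:
            return candidate
    return 0
```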
Signed-off-by: DarkLight1337 <[email protected]>
/gemini review
Code Review
This pull request adds support for CLIP models, including both text and image embedding capabilities. It introduces a new `CLIPEmbeddingModel` and adapts the attention mechanism to support causal masking for the text encoder. The changes also include new examples and tests for CLIP. My review focuses on a potential correctness issue related to state management in the new model implementation. Specifically, I've identified a fragile dependency between the `get_input_embeddings` and `forward` methods that relies on a shared instance variable, which could lead to issues. I've recommended a more robust, stateless approach, as suggested by the author in the PR description.
Actually, I just thought of using a separate attention module for the text encoder so we can use vLLM's
Signed-off-by: DarkLight1337 <[email protected]>
Signed-off-by: DarkLight1337 <[email protected]>
Signed-off-by: DarkLight1337 <[email protected]>
num_hidden_layers_override: Optional[int] = None,
*,
prefix: str = "",
attn_cls: Union[type[Attention], type[MultiHeadAttention]],
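For context on the signature above, a sketch of how an `attn_cls` parameter like this is typically consumed; the layer body and the stand-in classes are assumptions based on the signature, not the PR's actual code:

```python
from typing import Union

import torch.nn as nn


# Self-contained stand-ins so the sketch runs; vLLM's real Attention and
# MultiHeadAttention constructors take more arguments than this.
class MultiHeadAttention(nn.Module):
    def __init__(self, num_heads: int, head_dim: int, scale: float) -> None:
        super().__init__()
        self.num_heads, self.head_dim, self.scale = num_heads, head_dim, scale


class Attention(MultiHeadAttention):
    """Placeholder for vLLM's cache-backed attention."""


class CLIPAttentionSketch(nn.Module):
    def __init__(
        self,
        hidden_size: int,
        num_heads: int,
        *,
        prefix: str = "",
        attn_cls: Union[type[Attention], type[MultiHeadAttention]],
    ) -> None:
        super().__init__()
        head_dim = hidden_size // num_heads
        # The vision tower can keep passing MultiHeadAttention, while the text
        # tower passes Attention so that causal masking and the rest of the
        # vLLM attention machinery are handled for it.
        self.attn = attn_cls(num_heads, head_dim, head_dim**-0.5)


# Hypothetical call sites for the two towers:
text_attn = CLIPAttentionSketch(512, 8, attn_cls=Attention)
vision_attn = CLIPAttentionSketch(768, 12, attn_cls=MultiHeadAttention)
```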
IIRC, CLIP's text encoder's attention is prefill-only causal attention. Does the decoder's KV cache really work with this text tower?
Perhaps we should decouple the bidirectional mask from encoder-only attention to allow encoder-only causal attention. 🤔
Indeed it should be `ENCODER_ONLY` causal attention, let me fix that.
Actually, I see several other pooling models use `DECODER_ONLY` attention in the language backbone when `is_causal=True`. I think using `DECODER_ONLY` attention to indicate a causal mask should still work because pooling models don't have a decode phase anyway.
Maybe @maxdebayser @heheda12345 have a better idea about this?
But if the attention is causal, it wouldn't hurt to use `DECODER_ONLY`, right? In addition, it can enable chunked prefill and prefix caching if the pooling type supports it.
Yes, currently the PR is using `DECODER_ONLY`, which is the default attention type.
Would that cause problems since the text backbone is actually an encoder?
I could be mistaken, but for text the core difference between an encoder and a decoder is the attention mask. The other differences are mainly a consequence of this. For example, with bi-directional attention it doesn't make a lot of sense to use a KV cache, hence in vLLM we've refactored the BERT models and others to use the `EncoderOnlyAttention` class. So if the CLIP attention is causal, it should work fine with the decoder attention.
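To make the mask distinction concrete, here's a generic additive causal mask for a single prefill pass (plain PyTorch as an illustration, not vLLM's attention code):

```python
import torch


def causal_mask(seq_len: int) -> torch.Tensor:
    """Additive mask where position i may only attend to positions <= i."""
    mask = torch.full((seq_len, seq_len), float("-inf"))
    return torch.triu(mask, diagonal=1)


# For a prefill-only encoder like CLIP's text tower, this is the only mask it
# ever needs: there is no decode step that extends the sequence later.
print(causal_mask(4))
```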
Ok cool, let's merge this then
Signed-off-by: DarkLight1337 <[email protected]>
@Isotr0py can you approve this?
Signed-off-by: DarkLight1337 <[email protected]>
Signed-off-by: DarkLight1337 <[email protected]>
Signed-off-by: DarkLight1337 <[email protected]> Signed-off-by: Tomer Asida <[email protected]>
Signed-off-by: DarkLight1337 <[email protected]> Signed-off-by: Karan Goel <[email protected]>
Signed-off-by: DarkLight1337 <[email protected]>
Signed-off-by: DarkLight1337 <[email protected]> Signed-off-by: xuebwang-amd <[email protected]>
Purpose
Support CLIP text and image embedding in the same model.
- For text inputs, only `token_embedding` is applied when calling `get_input_embeddings`. The rest of the text embedding and the encoder logic are applied when calling `forward` on the model.
- For image inputs, the embeddings are computed inside `get_input_embeddings`. Since the model doesn't have a decoder, we directly return the embeddings inside the `forward` method.
- The `forward` method doesn't receive image inputs, so we cannot use the existence of `pixel_values` to determine whether `input_embeds` is from text or image inputs. To work around this, I have added a state `self._is_text_input` that is set inside `get_input_embeddings`. @ywang96 should we update `GPUModelRunner._dummy_mm_kwargs` to create the dummy multi-modal inputs for all multi-modal models?
- The pooling type is `LAST` for text, and `CLS` for image. But to simplify the code, we treat the `LAST` pooling type for image inputs as the `CLS` pooling type so that we don't have to define separate pooling types for now (see the sketch below).

After this PR, it should be relatively straightforward to extend to SigLIP.
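A generic illustration of the LAST-for-text / CLS-for-image convention mentioned above; this is a toy pooler with made-up shapes, not the PR's pooler implementation:

```python
import torch


def pool(hidden_states: torch.Tensor, is_text_input: bool) -> torch.Tensor:
    """Toy pooler: last-token pooling for text, CLS-token pooling for images."""
    # The causal text encoder summarizes the sequence at its last position,
    # while the vision tower puts its summary in the CLS token at index 0.
    return hidden_states[-1] if is_text_input else hidden_states[0]


text_hidden = torch.randn(7, 512)    # 7 text tokens
image_hidden = torch.randn(50, 512)  # e.g. 1 CLS token + 49 patch tokens
text_embedding = pool(text_hidden, is_text_input=True)
image_embedding = pool(image_hidden, is_text_input=False)
```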
cc @maxdebayser @noooop
FIX (partial) #25581
Todo list
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
Update `supported_models.md` and `examples` for a new model.