
Conversation

@DarkLight1337 (Member) commented Oct 1, 2025

Purpose

Support CLIP text and image embedding in the same model.

  • For text inputs, we only apply token_embedding when calling get_input_embeddings. The rest of the text embedding and the encoder logic are applied when calling forward on the model.
  • For image inputs, we apply vision embeddings when calling get_input_embeddings. Since the model doesn't have a decoder, we directly return the embeddings inside the forward method.
  • In the dummy run, the forward method doesn't receive image inputs, so we cannot use the existence of pixel_values to determine whether input_embeds comes from text or image inputs. To work around this, I have added a state self._is_text_input that is set inside get_input_embeddings (see the sketch after this list). @ywang96 should we update GPUModelRunner._dummy_mm_kwargs to create the dummy multi-modal inputs for all multi-modal models?
  • Strictly speaking, for CLIP the pooling type is LAST for text and CLS for image. To simplify the code, we treat the LAST pooling type for image inputs as CLS pooling so that we don't have to define separate pooling types for now.
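
To make the text/image split concrete, below is a minimal sketch of the control flow described above. It is illustrative only: get_input_embeddings, forward, and _is_text_input mirror the names used in this PR, but the layer sizes, the toy encoder, and the simplified signatures are assumptions, not the actual vLLM implementation.

# Illustrative sketch only: a simplified stand-in for the real CLIPEmbeddingModel.
# Layer sizes and the toy encoder are made up; only the control flow matters here.
import torch
import torch.nn as nn


class CLIPEmbeddingSketch(nn.Module):

    def __init__(self, vocab_size=49408, hidden_size=512, max_pos=77):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, hidden_size)
        self.position_embedding = nn.Embedding(max_pos, hidden_size)
        # Toy stand-in for the CLIP text encoder stack.
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(hidden_size, nhead=8, batch_first=True),
            num_layers=2)

    def get_input_embeddings(self, input_ids, multimodal_embeddings=None):
        # Remember the modality so forward() can branch on it later: in the
        # dummy run, forward() receives no image inputs, so pixel_values
        # cannot be used there to tell the two cases apart.
        self._is_text_input = (multimodal_embeddings is None
                               or len(multimodal_embeddings) == 0)
        if self._is_text_input:
            # Text: only token embeddings here; positions and the encoder
            # are applied later in forward().
            return self.token_embedding(input_ids)
        # Image: the vision tower has already produced final embeddings.
        return multimodal_embeddings

    def forward(self, positions, inputs_embeds):
        if self._is_text_input:
            # Finish the text path: add positional embeddings, run the encoder.
            return self.text_encoder(inputs_embeds +
                                     self.position_embedding(positions))
        # Image path: there is no decoder, so return the embeddings as-is
        # for pooling.
        return inputs_embeds


if __name__ == "__main__":
    model = CLIPEmbeddingSketch()
    input_ids = torch.randint(0, 49408, (1, 8))
    positions = torch.arange(8).unsqueeze(0)
    out = model(positions, model.get_input_embeddings(input_ids))
    print(out.shape)  # torch.Size([1, 8, 512])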

After this PR, it should be relatively straightforward to extend to SigLIP.

cc @maxdebayser @noooop

FIX (partial) #25581

Todo list

  • Add examples for offline inference and online serving
  • Benchmarks

Test Plan

  • Added tests specific to the CLIP model; they pass locally.

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@mergify bot added the documentation, multi-modality (#4194), and new-model labels on Oct 1, 2025
@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request adds support for CLIP models, enabling both text and image embedding. The changes include adding causal attention support to MultiHeadAttention, implementing the CLIPEmbeddingModel, and providing new tests and examples. The overall implementation is solid and the new functionality is well-tested. However, I've identified a few areas for improvement in vllm/model_executor/models/clip.py. Specifically, the model's forward pass relies on a stateful flag which is a fragile design, and there are several pieces of dead code, including unused classes and methods, that should be removed to improve maintainability. Please see my detailed comments.

Comment on lines +938 to +939
self._is_text_input = (multimodal_embeddings is None
                       or len(multimodal_embeddings) == 0)
@gemini-code-assist bot (Contributor):

Severity: high

Using a stateful flag self._is_text_input set in get_input_embeddings and read in forward creates an implicit dependency between these two methods. This makes the code fragile and harder to reason about. If the calling order changes in the model runner, it could lead to incorrect behavior.

A better approach would be to pass the modality information explicitly to the forward method. This could be done by having the model runner inspect the inputs and pass a flag (e.g., is_text_input) as part of the kwargs to forward.
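
For reference, a hedged sketch of what the suggested stateless alternative could look like, reusing the CLIPEmbeddingSketch class from the sketch in the PR description above; the is_text_input keyword is hypothetical and would have to be supplied explicitly by the model runner.

# Hypothetical stateless variant (not code from this PR): the modality flag is
# passed in explicitly instead of being stored on the instance.
def forward_stateless(model, positions, inputs_embeds, *, is_text_input: bool):
    if is_text_input:
        # Text path: finish the embedding and run the encoder.
        return model.text_encoder(inputs_embeds +
                                  model.position_embedding(positions))
    # Image path: the embeddings are already final.
    return inputs_embeds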

@DarkLight1337 changed the title from "Support clip embed" to "[Model] CLIP Embedding Support" on Oct 1, 2025
@DarkLight1337 (Member, Author):

/gemini review

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces support for the CLIP model, enabling both text and image embedding capabilities. The changes are comprehensive, including modifications to the attention mechanism to support causal masking, a new CLIPEmbeddingModel implementation, and updates to examples and tests. The overall implementation appears solid. I've identified one high-severity issue related to the robustness of selecting a dummy token ID, which could cause problems with different tokenizers for CLIP-like models. The proposed fix improves the robustness of this logic.

@DarkLight1337 (Member, Author):

/gemini review

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request adds support for CLIP models, including both text and image embedding capabilities. It introduces a new CLIPEmbeddingModel and adapts the attention mechanism to support causal masking for the text encoder. The changes also include new examples and tests for CLIP. My review focuses on a potential correctness issue related to state management in the new model implementation. Specifically, I've identified a fragile dependency between get_input_embeddings and forward methods that relies on a shared instance variable, which could lead to issues. I've recommended a more robust, stateless approach as suggested by the author in the PR description.

@DarkLight1337 (Member, Author) commented Oct 1, 2025

Actually, I just thought of using a separate attention module for the text encoder so we can use vLLM's Attention directly instead of MultiHeadAttention, allowing for KV cache to be used 😅

num_hidden_layers_override: Optional[int] = None,
*,
prefix: str = "",
attn_cls: Union[type[Attention], type[MultiHeadAttention]],
Member:

IIRC, CLIP's text encoder's attention is prefill-only causal attention. Does the decoder's KV cache really work with this text tower?

Perhaps we should decouple the bidirectional mask from encoder-only attention to allow encoder-only causal attention. 🤔

@DarkLight1337 (Member, Author), Oct 1, 2025:

Indeed it should be ENCODER_ONLY causal attention, let me fix that.

@DarkLight1337 (Member, Author), Oct 1, 2025:

Actually, I see several other pooling models use DECODER_ONLY attention in the language backbone when is_causal=True. I think using DECODER_ONLY attention to indicate a causal mask should still work because pooling models don't have a decode phase anyway.

@DarkLight1337 (Member, Author):

Maybe @maxdebayser @heheda12345 have a better idea about this?

Contributor:

But if the attention is causal, it wouldn't hurt to use DECODER_ONLY, right? In addition, it can enable chunked prefill and prefix caching if the pooling type supports it.

@DarkLight1337 (Member, Author):

Yes, currently the PR is using DECODER_ONLY, which is the default attention type.

@DarkLight1337 (Member, Author), Oct 2, 2025:

Would that cause problems since the text backbone is actually an encoder?

Contributor:

I could be mistaken, but for text the core difference between encoder and decoder is the attention mask; the other differences are mainly a consequence of this. For example, with bi-directional attention it doesn't make a lot of sense to use the KV cache, hence in vLLM we've refactored the BERT models and others to use the EncoderOnlyAttention class. So if the CLIP attention is causal, it should work fine with the decoder attention.
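
To illustrate that point with a toy PyTorch example (standalone scaled-dot-product attention rather than vLLM's Attention class, and made-up tensor sizes): the encoder vs. decoder distinction for text reduces to the mask that is applied.

# Toy illustration, not vLLM code: same Q/K/V, only the mask differs.
import torch
import torch.nn.functional as F

seq_len, dim = 4, 8
q = k = v = torch.randn(1, 1, seq_len, dim)  # (batch, heads, seq, head_dim)

# Bidirectional ("encoder-style"): every token attends to every other token.
bidirectional = F.scaled_dot_product_attention(q, k, v)

# Causal ("decoder-style"), which matches CLIP's text tower: each token
# attends only to itself and earlier positions.
causal = F.scaled_dot_product_attention(q, k, v, is_causal=True)

print(torch.equal(bidirectional, causal))  # False in general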

@DarkLight1337 (Member, Author), Oct 4, 2025:

Ok cool, let's merge this then

@DarkLight1337 (Member, Author):

@Isotr0py can you approve this?

@DarkLight1337 added the ready label (ONLY add when PR is ready to merge / full CI is needed) on Oct 4, 2025
@DarkLight1337 enabled auto-merge (squash) on October 4, 2025 06:39
@vllm-bot merged commit 4570535 into vllm-project:main on Oct 4, 2025
54 of 56 checks passed
@DarkLight1337 deleted the support-clip-embed branch on October 4, 2025 13:21
tomeras91 pushed a commit to tomeras91/vllm that referenced this pull request Oct 6, 2025
karan pushed a commit to karan/vllm that referenced this pull request Oct 6, 2025
southfreebird pushed a commit to southfreebird/vllm that referenced this pull request Oct 7, 2025
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 10, 2025