[Model] CLIP Embedding Support #26010
Conversation
Signed-off-by: DarkLight1337 <[email protected]>
Signed-off-by: DarkLight1337 <[email protected]>
Code Review
This pull request adds support for CLIP models, enabling both text and image embedding. The changes include adding causal attention support to `MultiHeadAttention`, implementing the `CLIPEmbeddingModel`, and providing new tests and examples. The overall implementation is solid and the new functionality is well-tested. However, I've identified a few areas for improvement in `vllm/model_executor/models/clip.py`. Specifically, the model's forward pass relies on a stateful flag, which is a fragile design, and there are several pieces of dead code, including unused classes and methods, that should be removed to improve maintainability. Please see my detailed comments.
self._is_text_input = (multimodal_embeddings is None
                       or len(multimodal_embeddings) == 0)
Using a stateful flag `self._is_text_input` set in `get_input_embeddings` and read in `forward` creates an implicit dependency between these two methods. This makes the code fragile and harder to reason about. If the calling order changes in the model runner, it could lead to incorrect behavior.

A better approach would be to pass the modality information explicitly to the `forward` method. This could be done by having the model runner inspect the inputs and pass a flag (e.g., `is_text_input`) as part of the kwargs to `forward`.
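For reference, a minimal sketch of the stateless alternative the reviewer describes. The class, method signatures, and the `is_text_input` keyword are illustrative assumptions, not the PR's actual code or vLLM's model-runner interface:

```python
from typing import Optional

import torch
import torch.nn as nn


class CLIPEmbeddingModelSketch(nn.Module):
    """Illustrative only: shows the stateless flag-passing idea."""

    def __init__(self, text_model: nn.Module) -> None:
        super().__init__()
        self.text_model = text_model  # assumed to expose token_embedding()

    def get_input_embeddings(
        self,
        input_ids: torch.Tensor,
        multimodal_embeddings: Optional[list[torch.Tensor]] = None,
    ) -> torch.Tensor:
        # No instance state is written here; the caller determines the
        # modality and passes it to forward() explicitly.
        if multimodal_embeddings:
            return torch.cat(multimodal_embeddings, dim=0)
        return self.text_model.token_embedding(input_ids)

    def forward(
        self,
        positions: torch.Tensor,
        inputs_embeds: torch.Tensor,
        *,
        is_text_input: bool = True,  # assumed kwarg supplied by the model runner
    ) -> torch.Tensor:
        if not is_text_input:
            # Image embeddings were fully computed in get_input_embeddings,
            # so they are returned as-is (there is no decoder to run).
            return inputs_embeds
        return self.text_model(inputs_embeds, positions)
```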
Signed-off-by: DarkLight1337 <[email protected]>
/gemini review
Signed-off-by: DarkLight1337 <[email protected]>
Code Review
This pull request introduces support for the CLIP model, enabling both text and image embedding capabilities. The changes are comprehensive, including modifications to the attention mechanism to support causal masking, a new `CLIPEmbeddingModel` implementation, and updates to examples and tests. The overall implementation appears solid. I've identified one high-severity issue related to the robustness of selecting a dummy token ID, which could cause problems with different tokenizers for CLIP-like models. The proposed fix improves the robustness of this logic.
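The review text above doesn't include the fix itself. Purely as an illustration, making dummy-token selection robust across tokenizers usually means deriving the ID from the tokenizer's own special tokens rather than hardcoding one; the function name and fallback order below are assumptions:

```python
from transformers import PreTrainedTokenizerBase


def pick_dummy_token_id(tokenizer: PreTrainedTokenizerBase) -> int:
    """Choose a token ID that is valid for this tokenizer (illustrative only)."""
    # Prefer special tokens that CLIP-like tokenizers define, then fall back
    # to the first ID in the vocabulary.
    for candidate in (tokenizer.pad_token_id, tokenizer.eos_token_id,
                      tokenizer.bos_token_id):
        if candidate is not None:
            return candidate
    return 0
```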
Signed-off-by: DarkLight1337 <[email protected]>
/gemini review
Code Review
This pull request adds support for CLIP models, including both text and image embedding capabilities. It introduces a new `CLIPEmbeddingModel` and adapts the attention mechanism to support causal masking for the text encoder. The changes also include new examples and tests for CLIP. My review focuses on a potential correctness issue related to state management in the new model implementation. Specifically, I've identified a fragile dependency between the `get_input_embeddings` and `forward` methods that relies on a shared instance variable, which could lead to issues. I've recommended a more robust, stateless approach, as suggested by the author in the PR description.
Actually, I just thought of using a separate attention module for the text encoder so we can use vLLM's
Signed-off-by: DarkLight1337 <[email protected]>
Signed-off-by: DarkLight1337 <[email protected]>
Signed-off-by: DarkLight1337 <[email protected]>
num_hidden_layers_override: Optional[int] = None,
*,
prefix: str = "",
attn_cls: Union[type[Attention], type[MultiHeadAttention]],
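For context on the signature above, a sketch of how an `attn_cls` parameter like this is typically consumed; the layer body and the stand-in classes are assumptions based on the signature, not the PR's actual code:

```python
from typing import Union

import torch.nn as nn


# Self-contained stand-ins so the sketch runs; vLLM's real Attention and
# MultiHeadAttention constructors take more arguments than this.
class MultiHeadAttention(nn.Module):
    def __init__(self, num_heads: int, head_dim: int, scale: float) -> None:
        super().__init__()
        self.num_heads, self.head_dim, self.scale = num_heads, head_dim, scale


class Attention(MultiHeadAttention):
    """Placeholder for vLLM's cache-backed attention."""


class CLIPAttentionSketch(nn.Module):
    def __init__(
        self,
        hidden_size: int,
        num_heads: int,
        *,
        prefix: str = "",
        attn_cls: Union[type[Attention], type[MultiHeadAttention]],
    ) -> None:
        super().__init__()
        head_dim = hidden_size // num_heads
        # The vision tower can keep passing MultiHeadAttention, while the text
        # tower passes Attention so that causal masking and the rest of the
        # vLLM attention machinery are handled for it.
        self.attn = attn_cls(num_heads, head_dim, head_dim**-0.5)


# Hypothetical call sites for the two towers:
text_attn = CLIPAttentionSketch(512, 8, attn_cls=Attention)
vision_attn = CLIPAttentionSketch(768, 12, attn_cls=MultiHeadAttention)
```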
IIRC, CLIP's text encoder's attention is prefill-only causal attention. Does the decoder's KV cache really work with this text tower?
Perhaps we should decouple the bidirectional mask from encoder-only attention to allow encoder-only causal attention. 🤔
Indeed it should be `ENCODER_ONLY` causal attention, let me fix that.
Actually, I see several other pooling models use `DECODER_ONLY` attention in the language backbone when `is_causal=True`. I think using `DECODER_ONLY` attention to indicate a causal mask should still work because pooling models don't have a decode phase anyway.
Maybe @maxdebayser @heheda12345 have a better idea about this?
But if the attention is causal, it wouldn't hurt to use `DECODER_ONLY`, right? In addition, it can enable chunked prefill and prefix caching if the pooling type supports it.
Yes, currently the PR is using `DECODER_ONLY`, which is the default attention type.
Would that cause problems since the text backbone is actually an encoder?
I could be mistaken, but for text the core difference between an encoder and a decoder is the attention mask. The other differences are mainly a consequence of this. For example, with bi-directional attention it doesn't make a lot of sense to use a KV cache, hence in vLLM we've refactored the BERT models and others to use the `EncoderOnlyAttention` class. So if the CLIP attention is causal, it should work fine with the decoder attention.
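To make the mask distinction concrete, here's a generic additive causal mask for a single prefill pass (plain PyTorch as an illustration, not vLLM's attention code):

```python
import torch


def causal_mask(seq_len: int) -> torch.Tensor:
    """Additive mask where position i may only attend to positions <= i."""
    mask = torch.full((seq_len, seq_len), float("-inf"))
    return torch.triu(mask, diagonal=1)


# For a prefill-only encoder like CLIP's text tower, this is the only mask it
# ever needs: there is no decode step that extends the sequence later.
print(causal_mask(4))
```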
Ok cool, let's merge this then
Signed-off-by: DarkLight1337 <[email protected]>
@Isotr0py can you approve this?
Signed-off-by: DarkLight1337 <[email protected]>
Signed-off-by: DarkLight1337 <[email protected]>
Signed-off-by: DarkLight1337 <[email protected]> Signed-off-by: Tomer Asida <[email protected]>
Signed-off-by: DarkLight1337 <[email protected]> Signed-off-by: Karan Goel <[email protected]>
Signed-off-by: DarkLight1337 <[email protected]>
Signed-off-by: DarkLight1337 <[email protected]> Signed-off-by: xuebwang-amd <[email protected]>
Purpose
Support CLIP text and image embedding in the same model.
- For text inputs, only `token_embedding` is applied when calling `get_input_embeddings`. The rest of the text embedding and the encoder logic are applied when calling `forward` on the model.
- For image inputs, the embeddings are computed inside `get_input_embeddings`. Since the model doesn't have a decoder, we directly return the embeddings inside the `forward` method.
- The `forward` method doesn't receive image inputs, so we cannot use the existence of `pixel_values` to determine whether `input_embeds` is from text or image inputs. To work around this, I have added a state `self._is_text_input` that is set inside `get_input_embeddings`. @ywang96 should we update `GPUModelRunner._dummy_mm_kwargs` to create the dummy multi-modal inputs for all multi-modal models?
- The pooling type is `LAST` for text, and `CLS` for image. But to simplify the code, we treat the `LAST` pooling type for image inputs as the `CLS` pooling type so that we don't have to define separate pooling types for now (see the sketch below).

After this PR, it should be relatively straightforward to extend to SigLIP.
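A generic illustration of the LAST-for-text / CLS-for-image convention mentioned above; this is a toy pooler with made-up shapes, not the PR's pooler implementation:

```python
import torch


def pool(hidden_states: torch.Tensor, is_text_input: bool) -> torch.Tensor:
    """Toy pooler: last-token pooling for text, CLS-token pooling for images."""
    # The causal text encoder summarizes the sequence at its last position,
    # while the vision tower puts its summary in the CLS token at index 0.
    return hidden_states[-1] if is_text_input else hidden_states[0]


text_hidden = torch.randn(7, 512)    # 7 text tokens
image_hidden = torch.randn(50, 512)  # e.g. 1 CLS token + 49 patch tokens
text_embedding = pool(text_hidden, is_text_input=True)
image_embedding = pool(image_hidden, is_text_input=False)
```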
cc @maxdebayser @noooop
FIX (partial) #25581
Todo list
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
Update `supported_models.md` and `examples` for a new model.