[Bug] Mismatch between get_multimodal_embedding output and PlaceholderRange #15144

@DarkLight1337

Description

In V1, we expect the output of get_multimodal_embedding to correspond to the PlaceholderRange, which is in turn constructed based on PromptUpdateDetails.features. However, the current V1 code doesn't validate this, which causes the model to crash during inference under high load (e.g. #14897, #14963).
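For reference, the invariant V1 relies on is roughly the following (a minimal sketch; the function and argument names are illustrative, not actual vLLM call sites):

```python
import torch

def check_embeds_match_placeholder(
    image_embeds: torch.Tensor,  # output of get_multimodal_embedding for one image
    placeholder_length: int,     # length of the corresponding PlaceholderRange
) -> None:
    # V1 assigns the returned embeddings to the positions covered by the
    # PlaceholderRange, so the row count must match exactly. A mismatch only
    # surfaces later as a crash when the embeddings are merged into the
    # input sequence.
    if image_embeds.shape[0] != placeholder_length:
        raise ValueError(
            f"get_multimodal_embedding returned {image_embeds.shape[0]} embeddings, "
            f"but the PlaceholderRange covers {placeholder_length} positions")
```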

From a quick look at the code, these models output embedding sizes which are inconsistent with the placeholder range:

(Basically, any model that emits image newline/column tokens after applying the HF processor needs a mask to map the image patch features to the image embeddings, as described below.)
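To make the mismatch concrete, here is a hypothetical example with a 2x3 patch grid where the HF processor appends a newline token after each row of patches: the prompt then contains 8 feature tokens while the vision encoder only produces 6 patch features, and the embed_is_patch mask records which placeholder positions are real patches.

```python
import torch

# Hypothetical layout: 2 rows x 3 columns of image patches, with an
# image-newline token appended after each row by the HF processor.
num_rows, num_cols = 2, 3

# Feature tokens per row: [patch, patch, patch, newline]
# -> 2 * 4 = 8 placeholder positions, but only 2 * 3 = 6 patch features.
embed_is_patch = torch.tensor(
    [[True] * num_cols + [False] for _ in range(num_rows)]
).flatten()
# tensor([ True,  True,  True, False,  True,  True,  True, False])

assert embed_is_patch.numel() == num_rows * (num_cols + 1)  # placeholder slots
assert int(embed_is_patch.sum()) == num_rows * num_cols     # vision encoder outputs
```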

To fix this, we can follow these steps:

  1. Update the multi-modal processor to output a mask indicating which positions in the PlaceholderRange-aligned embeddings the patch features (output by the vision encoder) should be assigned to. This mask can be called embed_is_patch.
  2. Use scatter_patch_features to scatter the patch features into the image embedding tensor.
  3. When merging multimodal embeddings, use select_patch_features to recover the patch features from the image embeddings. The number of patch features should correspond to the number of image tokens, which are a subset of the feature tokens in PromptUpdateDetails. (A sketch of steps 2 and 3 follows this list.)
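
A minimal sketch of what steps 2 and 3 amount to (illustrative reimplementations under the embed_is_patch assumption above, not the actual vLLM helper signatures):

```python
import torch

def scatter_patch_features_sketch(
    patch_features: torch.Tensor,  # (num_patches, hidden_size) from the vision encoder
    embed_is_patch: torch.Tensor,  # (num_placeholder_positions,) bool mask
) -> torch.Tensor:
    # Step 2: build an image embedding tensor aligned with the PlaceholderRange.
    # Patch positions receive the encoder outputs; non-patch positions
    # (newline/column tokens) are filled with NaN as a sentinel and are not
    # meant to be consumed directly.
    embeds = patch_features.new_full(
        (embed_is_patch.numel(), patch_features.shape[-1]), float("nan"))
    embeds[embed_is_patch] = patch_features
    return embeds

def select_patch_features_sketch(
    image_embeds: torch.Tensor,    # (num_placeholder_positions, hidden_size)
    embed_is_patch: torch.Tensor,
) -> torch.Tensor:
    # Step 3: recover only the patch features when merging multimodal
    # embeddings; their count matches the number of image tokens in the prompt.
    return image_embeds[embed_is_patch]
```

With the mask from the earlier example, scatter_patch_features_sketch would place the 6 encoder outputs into an 8-row tensor aligned with the PlaceholderRange, and select_patch_features_sketch would recover exactly those 6 rows.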

Follow-up work:

Labels

bug (Something isn't working), help wanted (Extra attention is needed), multi-modality (Related to multi-modality (#4194)), v1
