[Bug] Mismatch between get_multimodal_embedding output and PlaceholderRange #15144

@DarkLight1337

Description

In V1, we expect the output of get_multimodal_embedding to correspond to the PlaceholderRange, which is in turn constructed based on PromptUpdateDetails.features. However, the current V1 code doesn't validate this, which causes the model to crash during inference under high load (e.g. #14897, #14963).
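For reference, the invariant V1 relies on is roughly the following (a minimal sketch; the function and argument names are illustrative, not actual vLLM call sites):

```python
import torch

def check_embeds_match_placeholder(
    image_embeds: torch.Tensor,  # output of get_multimodal_embedding for one image
    placeholder_length: int,     # length of the corresponding PlaceholderRange
) -> None:
    # V1 assigns the returned embeddings to the positions covered by the
    # PlaceholderRange, so the row count must match exactly. A mismatch only
    # surfaces later as a crash when the embeddings are merged into the
    # input sequence.
    if image_embeds.shape[0] != placeholder_length:
        raise ValueError(
            f"get_multimodal_embedding returned {image_embeds.shape[0]} embeddings, "
            f"but the PlaceholderRange covers {placeholder_length} positions")
```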

From a quick look at the code, these models output embedding sizes which are inconsistent with the placeholder range:

(Basically, any model that emits image newline/column tokens after applying the HF processor needs a mask to map the image patch features to the image embeddings, as described below.)
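To make the mismatch concrete, here is a hypothetical example with a 2x3 patch grid where the HF processor appends a newline token after each row of patches: the prompt then contains 8 feature tokens while the vision encoder only produces 6 patch features, and the embed_is_patch mask records which placeholder positions are real patches.

```python
import torch

# Hypothetical layout: 2 rows x 3 columns of image patches, with an
# image-newline token appended after each row by the HF processor.
num_rows, num_cols = 2, 3

# Feature tokens per row: [patch, patch, patch, newline]
# -> 2 * 4 = 8 placeholder positions, but only 2 * 3 = 6 patch features.
embed_is_patch = torch.tensor(
    [[True] * num_cols + [False] for _ in range(num_rows)]
).flatten()
# tensor([ True,  True,  True, False,  True,  True,  True, False])

assert embed_is_patch.numel() == num_rows * (num_cols + 1)  # placeholder slots
assert int(embed_is_patch.sum()) == num_rows * num_cols     # vision encoder outputs
```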

To fix this, we can follow these steps:

  1. Update the multi-modal processor to output a mask indicating which positions in the PlaceholderRange-aligned embeddings the patch features (output by the vision encoder) should be assigned to. This mask can be called embed_is_patch.
  2. Use scatter_patch_features to scatter the patch features into the image embedding tensor.
  3. When merging multimodal embeddings, use select_patch_features to recover the patch features from the image embeddings. The number of patch features should correspond to the number of image tokens, which are a subset of the feature tokens in PromptUpdateDetails. (A sketch of steps 2 and 3 follows this list.)
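
A minimal sketch of what steps 2 and 3 amount to (illustrative reimplementations under the embed_is_patch assumption above, not the actual vLLM helper signatures):

```python
import torch

def scatter_patch_features_sketch(
    patch_features: torch.Tensor,  # (num_patches, hidden_size) from the vision encoder
    embed_is_patch: torch.Tensor,  # (num_placeholder_positions,) bool mask
) -> torch.Tensor:
    # Step 2: build an image embedding tensor aligned with the PlaceholderRange.
    # Patch positions receive the encoder outputs; non-patch positions
    # (newline/column tokens) are filled with NaN as a sentinel and are not
    # meant to be consumed directly.
    embeds = patch_features.new_full(
        (embed_is_patch.numel(), patch_features.shape[-1]), float("nan"))
    embeds[embed_is_patch] = patch_features
    return embeds

def select_patch_features_sketch(
    image_embeds: torch.Tensor,    # (num_placeholder_positions, hidden_size)
    embed_is_patch: torch.Tensor,
) -> torch.Tensor:
    # Step 3: recover only the patch features when merging multimodal
    # embeddings; their count matches the number of image tokens in the prompt.
    return image_embeds[embed_is_patch]
```

With the mask from the earlier example, scatter_patch_features_sketch would place the 6 encoder outputs into an 8-row tensor aligned with the PlaceholderRange, and select_patch_features_sketch would recover exactly those 6 rows.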

Follow-up work:

Labels

bug (Something isn't working), help wanted (Extra attention is needed), multi-modality (Related to multi-modality (#4194)), v1
