Description
I ran into an issue today with the RoPE embeddings for Qwen2VisionModel.
When the RoPE frequency embeddings are constructed here: https://github.com/NVIDIA-NeMo/NeMo/blob/main/nemo/collections/vlm/qwen2vl/model/vision.py#L416-L444, the resulting frequency tensor has length equal to the total number of tokens T across all images in the batch.
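For reference, here is a minimal sketch of that construction as I understand it (simplified: no spatial-merge permutation, and `rot_pos_emb_sketch` and its internals are illustrative names, not the actual NeMo code):

```python
import torch

def rot_pos_emb_sketch(grid_thw: torch.Tensor, head_dim: int) -> torch.Tensor:
    # Build one (h, w) position id per patch for every image, concatenated
    # across the batch, then look the ids up in a frequency table.
    pos_ids = []
    for t, h, w in grid_thw.tolist():
        hpos = torch.arange(h).unsqueeze(1).expand(-1, w)  # row index of each patch
        wpos = torch.arange(w).unsqueeze(0).expand(h, -1)  # col index of each patch
        ids = torch.stack([hpos, wpos], dim=-1).reshape(-1, 2)
        pos_ids.append(ids.repeat(t, 1))
    pos_ids = torch.cat(pos_ids, dim=0)  # [T, 2], T = total patches in the batch

    dim = head_dim // 4
    inv_freq = 1.0 / (10000.0 ** (torch.arange(dim).float() / dim))
    max_grid = int(grid_thw[:, 1:].max())
    table = torch.outer(torch.arange(max_grid).float(), inv_freq)  # [max_grid, dim]
    return table[pos_ids].flatten(1)  # [T, head_dim // 2] -- one tensor for the whole batch

# Two images, 640 and 960 patches: freqs covers all 1600 rows in one tensor.
freqs = rot_pos_emb_sketch(torch.tensor([[1, 20, 32], [1, 24, 40]]), head_dim=80)
print(freqs.shape)  # torch.Size([1600, 40])
```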
This can lead to incorrect results in the global attention layers for mixed-resolution batches in the downstream Megatron repo. For example, let's say we have a batch where the first image is 640 tokens and the second is 960 tokens. When indexing into the RoPE embeddings for the second image, the code selects the first 960 rows of the frequency tensor rather than the last 960 (rows 640-1599), which is a mismatch: https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/core/models/common/embeddings/rope_utils.py#L140
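Here is a toy reproduction of the indexing, assuming the thd-format helper slices `freqs[: x.size(0)]` for every split as on the linked line (the tensors below are stand-ins, not the real Megatron call):

```python
import torch

# Mixed-resolution batch: image 0 has 640 tokens, image 1 has 960.
cu_seqlens = torch.tensor([0, 640, 1600])
freqs = torch.arange(1600).float()  # stand-in: value i marks row i of the freq tensor
tokens = torch.zeros(1600, 1)       # dummy activations in packed (thd) layout

bounds = cu_seqlens.tolist()
seqlens = (cu_seqlens[1:] - cu_seqlens[:-1]).tolist()
for i, x in enumerate(torch.split(tokens, seqlens)):
    used = freqs[: x.size(0)]                  # what the helper selects
    needed = freqs[bounds[i] : bounds[i + 1]]  # rows image i actually owns
    print(f"image {i}: match = {torch.equal(used, needed)}")
# image 0: match = True
# image 1: match = False  (gets rows 0..959 instead of 640..1599)
```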
The local attention layers are fine because the patch lengths are the same, but this makes the global attention layers wrong whenever you have a batch of images of mixed resolutions. The issue seems to occur for both the fused and unfused RoPE implementations in Megatron.
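One possible direction (a sketch, not a vetted fix): slice the packed frequency tensor per image using the cumulative sequence lengths, so each split is rotated with its own rows. `split_freqs_by_image` is a hypothetical helper:

```python
import torch

def split_freqs_by_image(freqs: torch.Tensor, cu_seqlens: torch.Tensor):
    # Hypothetical helper: yield the frequency rows belonging to each image,
    # so every split of the packed sequence is rotated with freqs[start:end]
    # rather than freqs[:seqlen].
    for start, end in zip(cu_seqlens[:-1].tolist(), cu_seqlens[1:].tolist()):
        yield freqs[start:end]

# The second slice is rows 640..1599, matching the second image's positions.
slices = list(split_freqs_by_image(torch.arange(1600), torch.tensor([0, 640, 1600])))
print(slices[1][0].item(), slices[1][-1].item())  # 640 1599
```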