
Potential Issue with Qwen2VisionModel RoPE #14611

@xiaoxiao26

Description

I ran into an issue today with the RoPE embeddings for Qwen2VisionModel.

When the RoPE frequency embeddings are constructed here: https://github.com/NVIDIA-NeMo/NeMo/blob/main/nemo/collections/vlm/qwen2vl/model/vision.py#L416-L444, the resulting tensor has length equal to the total number of tokens T across the whole batch of images.
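For context, here is a minimal sketch of that construction (names and details are illustrative, not the actual NeMo code; the real implementation differs, but the point is that the per-image grid-position embeddings get concatenated into one length-T tensor):

```python
import torch

def build_rope_freqs(image_grid_thw, head_dim=64):
    # Illustrative sketch only: encode each patch's (row, column) grid
    # position, then concatenate the per-image embeddings, so the first
    # dimension of the result is the total token count T across the batch.
    dim = head_dim // 2
    inv_freq = 1.0 / (10000 ** (torch.arange(0, dim, 2).float() / dim))
    per_image = []
    for (t, h, w) in image_grid_thw:
        hpos = torch.arange(h).repeat_interleave(w).repeat(t).float()  # row index per patch
        wpos = torch.arange(w).repeat(h * t).float()                   # column index per patch
        per_image.append(torch.cat([torch.outer(hpos, inv_freq),
                                    torch.outer(wpos, inv_freq)], dim=-1))
    return torch.cat(per_image, dim=0)  # shape (T, head_dim // 2)
```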

This can lead to problems in the global attention layers for mixed-resolution batches in the downstream Megatron repo. For example, let's say we have a batch where the first image is 640 tokens and the second image is 960 tokens. When indexing into the RoPE embeddings for the second image, Megatron will take the first 960 tokens of the frequency tensor rather than the last 960, which is a mismatch: https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/core/models/common/embeddings/rope_utils.py#L140
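Here is a toy reproduction using the sketch above; the `freqs[:960]` slice stands in for the first-N truncation of the frequency tensor in the linked rope_utils code:

```python
# Toy numbers matching the example: a 640-token image (20x32 grid) followed
# by a 960-token image (24x40 grid), so freqs has 1600 rows in total.
freqs = build_rope_freqs([(1, 20, 32), (1, 24, 40)])  # 640 + 960 = 1600 rows

expected = freqs[640:1600]  # the second image's own slice of the embeddings
actual = freqs[:960]        # what taking "the first seq_len rows" yields instead

print(torch.allclose(expected, actual))  # False: the positions are misaligned
```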

The local attention layers are fine because the patch lengths are the same, but the global attention layers are wrong whenever the batch contains images of mixed resolutions. This issue seems to occur for both the fused and unfused RoPE implementations in Megatron.
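One possible direction for a fix (a sketch only, not something either repo has adopted) would be to offset each image's slice of the frequency tensor by the cumulative token counts, cu_seqlens-style, so each image indexes its own segment:

```python
# Sketch: cumulative boundaries let each image index its own segment
# of the concatenated frequency tensor instead of the first seq_len rows.
seqlens = [640, 960]
cu_seqlens = torch.tensor([0] + seqlens).cumsum(0)  # tensor([0, 640, 1600])
per_image_freqs = [freqs[cu_seqlens[i]:cu_seqlens[i + 1]]
                   for i in range(len(seqlens))]
# per_image_freqs[1] now corresponds to tokens 640..1599, matching `expected`.
```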
