Description
I ran into an issue today with the RoPE embeddings for Qwen2VisionModel.
When the RoPE frequency embeddings are constructed here: https://github.com/NVIDIA-NeMo/NeMo/blob/main/nemo/collections/vlm/qwen2vl/model/vision.py#L416-L444, the resulting frequency tensor has length equal to the total number of tokens T across all images in the batch.
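For reference, here is a minimal sketch of that construction as I understand it (simplified: no spatial-merge permutation, and `rot_pos_emb_sketch` and its internals are illustrative names, not the actual NeMo code):

```python
import torch

def rot_pos_emb_sketch(grid_thw: torch.Tensor, head_dim: int) -> torch.Tensor:
    # Build one (h, w) position id per patch for every image, concatenated
    # across the batch, then look the ids up in a frequency table.
    pos_ids = []
    for t, h, w in grid_thw.tolist():
        hpos = torch.arange(h).unsqueeze(1).expand(-1, w)  # row index of each patch
        wpos = torch.arange(w).unsqueeze(0).expand(h, -1)  # col index of each patch
        ids = torch.stack([hpos, wpos], dim=-1).reshape(-1, 2)
        pos_ids.append(ids.repeat(t, 1))
    pos_ids = torch.cat(pos_ids, dim=0)  # [T, 2], T = total patches in the batch

    dim = head_dim // 4
    inv_freq = 1.0 / (10000.0 ** (torch.arange(dim).float() / dim))
    max_grid = int(grid_thw[:, 1:].max())
    table = torch.outer(torch.arange(max_grid).float(), inv_freq)  # [max_grid, dim]
    return table[pos_ids].flatten(1)  # [T, head_dim // 2] -- one tensor for the whole batch

# Two images, 640 and 960 patches: freqs covers all 1600 rows in one tensor.
freqs = rot_pos_emb_sketch(torch.tensor([[1, 20, 32], [1, 24, 40]]), head_dim=80)
print(freqs.shape)  # torch.Size([1600, 40])
```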
This can lead to incorrect results in the global attention layers for mixed-resolution batches in the downstream Megatron repo. For example, let's say we have a batch where the first image is 640 tokens and the second is 960 tokens. When indexing into the RoPE embeddings for the second image, the code selects the first 960 rows of the frequency tensor rather than the last 960 (rows 640-1599), which is a mismatch: https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/core/models/common/embeddings/rope_utils.py#L140
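Here is a toy reproduction of the indexing, assuming the thd-format helper slices `freqs[: x.size(0)]` for every split as on the linked line (the tensors below are stand-ins, not the real Megatron call):

```python
import torch

# Mixed-resolution batch: image 0 has 640 tokens, image 1 has 960.
cu_seqlens = torch.tensor([0, 640, 1600])
freqs = torch.arange(1600).float()  # stand-in: value i marks row i of the freq tensor
tokens = torch.zeros(1600, 1)       # dummy activations in packed (thd) layout

bounds = cu_seqlens.tolist()
seqlens = (cu_seqlens[1:] - cu_seqlens[:-1]).tolist()
for i, x in enumerate(torch.split(tokens, seqlens)):
    used = freqs[: x.size(0)]                  # what the helper selects
    needed = freqs[bounds[i] : bounds[i + 1]]  # rows image i actually owns
    print(f"image {i}: match = {torch.equal(used, needed)}")
# image 0: match = True
# image 1: match = False  (gets rows 0..959 instead of 640..1599)
```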
The local attention layers are fine because the patch lengths are the same, but this makes the global attention layers wrong whenever you have a batch of images of mixed resolutions. The issue seems to occur for both the fused and unfused RoPE implementations in Megatron.
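One possible direction (a sketch, not a vetted fix): slice the packed frequency tensor per image using the cumulative sequence lengths, so each split is rotated with its own rows. `split_freqs_by_image` is a hypothetical helper:

```python
import torch

def split_freqs_by_image(freqs: torch.Tensor, cu_seqlens: torch.Tensor):
    # Hypothetical helper: yield the frequency rows belonging to each image,
    # so every split of the packed sequence is rotated with freqs[start:end]
    # rather than freqs[:seqlen].
    for start, end in zip(cu_seqlens[:-1].tolist(), cu_seqlens[1:].tolist()):
        yield freqs[start:end]

# The second slice is rows 640..1599, matching the second image's positions.
slices = list(split_freqs_by_image(torch.arange(1600), torch.tensor([0, 640, 1600])))
print(slices[1][0].item(), slices[1][-1].item())  # 640 1599
```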