@jiminha jiminha commented Oct 14, 2025

This PR optimizes Gemma3 multimodal memory usage and performance.

  • Bucket the vision tower on the batch bucket size to reduce recompilation overhead.
  • Modify merge_multimodal to use torch.where instead of masked_scatter, which caused a performance issue.
  • Add multimodal bucket warmup to precompile the vision tower.
  • Port the PT_HPU_SDPA_QKV_SLICE_MODE_FWD feature from vllm-fork v0; this is necessary to reduce memory at longer sequence lengths.
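The first two bullets can be sketched as follows. This is a minimal illustration, not the PR's actual code: the bucket sizes, helper names, and the assumption that the vision embeddings are already laid out in the same `[batch, seq, hidden]` shape as the text embeddings are all hypothetical.

```python
import torch

# Hypothetical bucket sizes for the vision tower; the real values would
# come from the serving configuration, not this sketch.
VISION_BUCKETS = (1, 2, 4, 8, 16)

def pad_images_to_bucket(pixel_values: torch.Tensor) -> torch.Tensor:
    """Pad the image batch up to the nearest bucket so the vision tower
    compiles one HPU graph per bucket instead of one per distinct batch
    size, reducing recompilation overhead."""
    n = pixel_values.shape[0]
    bucket = next((b for b in VISION_BUCKETS if n <= b), n)
    if bucket > n:
        pad = pixel_values.new_zeros((bucket - n, *pixel_values.shape[1:]))
        pixel_values = torch.cat([pixel_values, pad], dim=0)
    return pixel_values

def merge_multimodal(inputs_embeds: torch.Tensor,
                     vision_embeds: torch.Tensor,
                     image_token_mask: torch.Tensor) -> torch.Tensor:
    """Select vision embeddings at image-token positions with a single
    elementwise torch.where rather than masked_scatter. Assumes
    vision_embeds already has the same [batch, seq, hidden] shape as
    inputs_embeds; image_token_mask is [batch, seq] and broadcasts over
    the hidden dimension."""
    return torch.where(image_token_mask.unsqueeze(-1),
                       vision_embeds, inputs_embeds)
```

Because `torch.where` is a static elementwise select, it avoids the dynamic-shape behavior of `masked_scatter`, which is the kind of op that performs poorly on compiled HPU graphs.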

jiminha and others added 6 commits October 13, 2025 08:03
Signed-off-by: Jimin Ha <[email protected]>
Reduces memory usage for long sequences by eliminating dual attention
mask creation. Improves capacity from 150 to 400 images with 8K prompts
by avoiding OOM issues.
Limitation: Only available when block_list is None.

Signed-off-by: Jimin Ha <[email protected]>
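The block_list limitation in the commit message above suggests gating of the following shape. This is a hypothetical sketch inferred from the commit message, not the ported implementation; only the environment-variable name and the "block_list is None" condition come from the source.

```python
import os

def qkv_slice_enabled(block_list) -> bool:
    """Hypothetical gating sketch: per the commit message, the SDPA QKV
    slicing path ported from vllm-fork v0 is switched on via the
    PT_HPU_SDPA_QKV_SLICE_MODE_FWD environment variable and only applies
    when block_list is None."""
    flag_on = os.environ.get("PT_HPU_SDPA_QKV_SLICE_MODE_FWD", "0") == "1"
    return flag_on and block_list is None
```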

🚧 CI Blocked

The main CI workflow was not started for the following reason:

Your branch is behind the base branch. Please merge or rebase to get the latest changes.

@jiminha jiminha marked this pull request as draft October 14, 2025 14:16
