
Conversation

kzawora-intel
Collaborator

I've noticed two accuracy issues in unified attention:

  1. We weren't updating the persistent request states and the batch in the unified_execute_model method.
  2. We were overextending non-aligned prefix_prefill context lengths by one token.

The first one had a major impact. I suspect we were malforming batches as generation went on, since self.input_batch.num_tokens and req_state.output_token_ids were not updated correctly. In Granite GSM8K, fixing that yielded a +10 percentage point improvement.
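
To make the missing bookkeeping concrete, here's a minimal, self-contained sketch of the kind of per-step update involved. The class and field names below are simplified stand-ins for illustration, not the actual vLLM code:

```python
from dataclasses import dataclass, field

# Simplified stand-ins for the persistent structures; names and shapes are
# illustrative only, not the real vLLM classes.
@dataclass
class RequestState:
    output_token_ids: list[int] = field(default_factory=list)

@dataclass
class InputBatch:
    req_ids: list[str]
    num_tokens: dict[str, int]

def advance_after_sampling(batch: InputBatch,
                           requests: dict[str, RequestState],
                           sampled: dict[str, int]) -> None:
    """Advance persistent per-request state after each execute step.

    Skipping this is the kind of bug described above: the next iteration
    then rebuilds the batch from stale num_tokens / output_token_ids.
    """
    for req_id in batch.req_ids:
        requests[req_id].output_token_ids.append(sampled[req_id])
        batch.num_tokens[req_id] += 1
```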
The second one had a negligible impact: I didn't notice any accuracy improvement in the tests I ran, but we should be masking anything above the context length regardless.
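
For the masking point, here's a rough illustration, assuming a scores tensor of shape [num_reqs, num_heads, q_len, max_context_len]; it only sketches the masking idea and is not the actual prefix_prefill kernel:

```python
import torch

def mask_beyond_context(scores: torch.Tensor,
                        context_lens: torch.Tensor) -> torch.Tensor:
    # scores:       [num_reqs, num_heads, q_len, max_context_len]
    # context_lens: [num_reqs], true context length per request
    max_context_len = scores.shape[-1]
    positions = torch.arange(max_context_len, device=scores.device)
    # Key positions at or beyond a request's context length must not
    # contribute to attention, so set them to -inf before the softmax.
    invalid = positions[None, :] >= context_lens[:, None]  # [num_reqs, max_context_len]
    return scores.masked_fill(invalid[:, None, None, :], float("-inf"))
```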

I've added a GSM8K accuracy test to CI with this PR, which should now pass as well.

Signed-off-by: Konrad Zawora <[email protected]>

🚧 CI Blocked

The main CI workflow was not started for the following reason:

Your branch is behind the base branch. Please merge or rebase to get the latest changes.


✅ CI Passed

All checks passed successfully against the following vllm commit:
577d498212022f95dc3a59746b1da1c6ed23eaba

adobrzyn merged commit 09e4a68 into main on Oct 15, 2025
36 checks passed