fix: fix accuracy and illegal memory access issues when using mtp + attention dp #4379
/bot run
PR_Github #5431 [ run ] triggered by Bot
/bot kill
PR_Github #5441 [ kill ] triggered by Bot
PR_Github #5431 [ run ] completed with state
PR_Github #5441 [ kill ] completed with state
/bot run
PR_Github #5442 [ run ] triggered by Bot
PR_Github #5442 [ run ] completed with state
7072ea3 to 7437b2b
0a2ffda to 3ee3361
/bot run --disable-fail-fast
PR_Github #6297 [ run ] triggered by Bot
PR_Github #6297 [ run ] completed with state
@lfr-0531 have all the illegal memory access issues been resolved, or does the one you mentioned still exist? Thanks.
No, we haven't fixed it. But it looks like we didn't cover this special case in our test list before. I mean the setting @PerkzZheng, what do you think?
/bot run
That makes sense to me.
PR_Github #6382 [ run ] triggered by Bot
PR_Github #6382 [ run ] completed with state
Signed-off-by: Fanrong Li <[email protected]>
…anatory comments. Signed-off-by: Fanrong Li <[email protected]>
PR_Github #7154 [ run ] completed with state
7175803 to bead11c
/bot run
Got another conflict. Rerun the pipeline.
PR_Github #7160 [ run ] triggered by Bot
PR_Github #7160 [ run ] completed with state
/bot run
PR_Github #7170 [ run ] triggered by Bot
PR_Github #7170 [ run ] completed with state
…ttention dp (NVIDIA#4379) Signed-off-by: Fanrong Li <[email protected]>
…ttention dp (NVIDIA#4379) Signed-off-by: Fanrong Li <[email protected]> Signed-off-by: darraghdog <[email protected]>
Description
Fixed two issues:
The illegal memory access is caused by the token_num in add_dummy_requests. With token_num=1, the prompt_len will be zero. Our attention kernel cannot handle this special case, so we got an illegal memory access. The fix in this PR is a workaround to make Attention DP + MTP work.
The accuracy drop occurred because we didn't correctly prepare the inputs for the dummy requests when the overlap scheduler and speculative decoding are enabled.
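To illustrate the first issue, here is a minimal sketch, not the code from this PR. It assumes that a dummy generation request reserves one token for the decode step, so prompt_len = token_num - 1; that relation is only inferred from the statement that token_num=1 gives a prompt_len of zero. The helper names dummy_prompt_len and safe_dummy_token_num are hypothetical and do not exist in TensorRT-LLM.

```python
# Hypothetical sketch; these helpers are NOT part of TensorRT-LLM.

def dummy_prompt_len(token_num: int) -> int:
    """Assumed relation: one token is reserved for the generation step."""
    if token_num < 1:
        raise ValueError("token_num must be at least 1")
    return token_num - 1


def safe_dummy_token_num(requested: int, min_prompt_len: int = 1) -> int:
    """Raise token_num just enough that the dummy prompt is non-empty."""
    return max(requested, min_prompt_len + 1)


# token_num=1 yields an empty prompt, the case the attention kernel cannot handle.
assert dummy_prompt_len(1) == 0

# Clamping the token count keeps the dummy prompt at one token or more.
assert dummy_prompt_len(safe_dummy_token_num(1)) == 1
```

Clamping the dummy token count is only one possible way to avoid the zero-length prompt; the actual workaround in this PR may differ.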
Other changes:
NOTE: there is still an illegal memory access issue when running DeepSeek-V3-Lite MTP with tp + autotuner + cuda graph + overlap scheduler. We track this issue in nvbug 5304040.
Test Coverage
The waived tests are added back.