
Conversation

@lfr-0531 (Collaborator) commented May 16, 2025

Description

Fixed two issues:

  • MTP + attention dp illegal memory access
  • MTP + attention dp + overlap scheduler accuracy drop

The illegal memory access is caused by the token_num used in add_dummy_requests: with token_num=1, the prompt_len becomes zero, and our attention kernel cannot handle that special case, so we hit an illegal memory access. The fix in this PR is a workaround to make attention DP + MTP work.
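
The workaround can be sketched roughly as follows. This is a hypothetical illustration, not the actual TensorRT-LLM code; the names make_dummy_prompt and pad_id are made up for the example, following the token_num / prompt_len relationship described above:

```python
# Hypothetical sketch of the workaround: make sure dummy requests never
# produce a zero-length prompt. With token_num=1 the prompt would be empty
# (the single token is the one being generated), which the attention
# kernel cannot handle, so we clamp to at least 2 tokens.
def make_dummy_prompt(token_num: int, pad_id: int = 0) -> list:
    safe_token_num = max(token_num, 2)
    # prompt_len = safe_token_num - 1, which is now guaranteed >= 1
    return [pad_id] * (safe_token_num - 1)
```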

The accuracy drop happened because we didn't correctly prepare the inputs for the dummy requests when the overlap scheduler and speculative decoding were both enabled.
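
As an illustrative sketch (again, not the actual TensorRT-LLM code; build_input_ids and the request dict layout are invented for the example): with the overlap scheduler plus speculative decoding, each request contributes 1 + num_draft_tokens input ids per step, so dummy requests must be padded the same way or the batched input tensor gets mis-shaped and real tokens land in the wrong slots:

```python
# Illustrative sketch: dummy requests must contribute the same number of
# input ids as real requests when speculative decoding is enabled.
def build_input_ids(requests, num_draft_tokens, pad_id=0):
    input_ids = []
    for req in requests:
        if req.get("is_dummy"):
            # If dummy requests only contributed one token here, every
            # request after them would be shifted, corrupting accuracy.
            input_ids += [pad_id] * (1 + num_draft_tokens)
        else:
            input_ids += [req["new_token"]] + req["draft_tokens"]
    return input_ids
```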

Other changes:

  • Change the default py_draft_tokens to an empty list.
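
The empty-list default can be illustrated like this (a hypothetical sketch assuming a dataclass-style request object; the real field lives on the runtime request class). Using default_factory avoids the classic shared-mutable-default trap while letting callers iterate or extend py_draft_tokens without a None check:

```python
from dataclasses import dataclass, field

# Hypothetical illustration of defaulting py_draft_tokens to an empty list.
@dataclass
class Request:
    # default_factory gives each instance its own fresh list, so mutating
    # one request's draft tokens never leaks into another's.
    py_draft_tokens: list = field(default_factory=list)

a, b = Request(), Request()
a.py_draft_tokens.append(1)
# b's list is unaffected because each instance gets its own empty list.
```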

NOTE: there is still an illegal memory access issue when running DeepSeek-V3-Lite MTP with tp + autotuner + cuda graph + overlap scheduler. This issue is tracked in nvbug 5304040.

Test Coverage

The waived tests are added back.

@lfr-0531 (Collaborator, Author)

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #5431 [ run ] triggered by Bot

@lfr-0531 (Collaborator, Author)

/bot kill

@tensorrt-cicd (Collaborator)

PR_Github #5441 [ kill ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #5431 [ run ] completed with state ABORTED

@tensorrt-cicd (Collaborator)

PR_Github #5441 [ kill ] completed with state SUCCESS
Successfully killed previous jobs for commit f10d532

@lfr-0531 (Collaborator, Author)

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #5442 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #5442 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #3972 completed with status: 'FAILURE'

@lfr-0531 (Collaborator, Author)

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #6297 [ run ] triggered by Bot

@lfr-0531 changed the title from "fix: fix input ids when adding dummy requests" to "fix: fix accuracy and illegal memory access issues when using mtp + attention dp" May 23, 2025
@lfr-0531 lfr-0531 requested a review from PerkzZheng May 23, 2025 12:39
@tensorrt-cicd (Collaborator)

PR_Github #6297 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #4602 completed with status: 'FAILURE'

@PerkzZheng (Collaborator)

@lfr-0531 have all the illegal memory access issues been resolved, or does the one you mentioned still exist? Thanks.

@lfr-0531 (Collaborator, Author)

> @lfr-0531 have all the illegal memory access issues been resolved, or does the one you mentioned still exist? Thanks.

No, we haven't fixed it yet. But it looks like we didn't cover this special case in our test list before, i.e. the setting mtp_nextn=2-attention_dp=False-cuda_graph=True-overlap_scheduler=True. I'd like to merge this PR first so that we can get the other waived tests back, and we can track this special case in a new nvbug and fix it later.

@PerkzZheng, what do you think?

@lfr-0531 (Collaborator, Author)

/bot run

@PerkzZheng (Collaborator)

> @lfr-0531 have all the illegal memory access issues been resolved, or does the one you mentioned still exist? Thanks.
>
> No, we haven't fixed it yet. But it looks like we didn't cover this special case in our test list before, i.e. the setting mtp_nextn=2-attention_dp=False-cuda_graph=True-overlap_scheduler=True. I'd like to merge this PR first so that we can get the other waived tests back, and we can track this special case in a new nvbug and fix it later.
>
> @PerkzZheng, what do you think?

that makes sense to me.

@tensorrt-cicd (Collaborator)

PR_Github #6382 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #6382 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #4664 completed with status: 'FAILURE'

@tensorrt-cicd (Collaborator)

PR_Github #7154 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #5170 completed with status: 'SUCCESS'

@lfr-0531 lfr-0531 force-pushed the user/fanrongl/fix_input_ids_in_dummy_req branch from 7175803 to bead11c Compare June 1, 2025 08:24
@lfr-0531 (Collaborator, Author) commented Jun 1, 2025

/bot run

@lfr-0531 (Collaborator, Author) commented Jun 1, 2025

Got another conflict; rerunning the pipeline.

@tensorrt-cicd (Collaborator)

PR_Github #7160 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #7160 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #5175 completed with status: 'FAILURE'

@lfr-0531 (Collaborator, Author) commented Jun 1, 2025

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #7170 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #7170 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #5185 completed with status: 'SUCCESS'

@lfr-0531 lfr-0531 merged commit 7d356ef into NVIDIA:main Jun 1, 2025
3 checks passed
lfr-0531 added a commit to lfr-0531/TensorRT-LLM that referenced this pull request Jun 2, 2025
lfr-0531 added a commit that referenced this pull request Jun 3, 2025
Signed-off-by: Fanrong Li <[email protected]>
darraghdog pushed a commit to darraghdog/TensorRT-LLM that referenced this pull request Jun 3, 2025
k-l-lambda pushed a commit to k-l-lambda/TensorRT-LLM that referenced this pull request Jun 23, 2025
@lfr-0531 lfr-0531 deleted the user/fanrongl/fix_input_ids_in_dummy_req branch June 27, 2025 12:43
7 participants