
Conversation

@lfr-0531 lfr-0531 commented Jun 17, 2025

Description

This PR updates the first draft forward in the MTP Eagle path to take all accepted tokens as input instead of just the last one, so that the KV cache for the draft layer is refreshed.

Before this fix, in each iteration the KV cache did not store a correct key/value pair for the last draft token (call it D), because D was the output of the final draft forward pass and its entry was never recomputed. In the next iteration, if all draft tokens were accepted, the newly generated tokens were used as inputs for the first draft forward. But since the KV cache held incorrect key/value data for token D, identical inputs could produce different draft tokens.

With this fix, the KV cache is updated in the first draft forward of each iteration, making MTP Eagle deterministic.
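To illustrate the mechanism, here is a toy sketch (not the actual TensorRT-LLM code; `draft_forward` and the cache layout are hypothetical, modeling KV entries as `(token, position)` pairs so a stale entry is visible). Before the fix, only the newly generated token enters the first draft forward, so the stale entry for D survives; after the fix, all accepted tokens are re-run, overwriting it:

```python
def draft_forward(kv_cache, tokens, start_pos):
    """Toy draft-layer forward: writes one fresh KV entry per input token."""
    for i, tok in enumerate(tokens):
        pos = start_pos + i
        # In a real model the K/V depend on the full prefix; here we just
        # record the position seen when the entry was written.
        kv_cache[pos] = (tok, pos)
    return kv_cache

# Draft token D sits at position 2 with a stale entry (marked -1) left over
# from the previous iteration's last draft forward pass.
kv_before = {0: ("a", 0), 1: ("b", 1), 2: ("D", -1)}
kv_after = dict(kv_before)

# Before the fix: only the newly generated token is fed, position 2 is untouched.
draft_forward(kv_before, ["e"], 3)

# After the fix: all accepted tokens (including D) go through the first
# draft forward, so the stale entry is recomputed.
draft_forward(kv_after, ["D", "e"], 2)

assert kv_before[2] == ("D", -1)  # stale entry survives -> nondeterminism
assert kv_after[2] == ("D", 2)    # entry refreshed -> deterministic drafts
```

The point of the toy: determinism requires that the cached K/V for D match what a forward pass over the verified prefix would produce, which only happens if D is re-processed.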

I tested the DS-R1-FP4 model with the same dataset on the same node (BS=1). The acceptance rate increases slightly:

Before the changes:

===========================================================
= PERFORMANCE OVERVIEW 
===========================================================
Request Throughput (req/sec):                     0.1530
Total Output Throughput (tokens/sec):             313.0214
Total Token Throughput (tokens/sec):              472.7368
Total Latency (ms):                               65372.5274
Average request latency (ms):                     6537.2043
Per User Output Throughput [w/ ctx] (tps/user):   313.3934
Per GPU Output Throughput (tps/gpu):              39.1277

-- Acceptance Rate Details --------------------------------
[AR] MINIMUM: 2.69
[AR] MAXIMUM: 3.08
[AR] AVERAGE: 2.81
[AR] P50    : 2.80
[AR] P90    : 3.08
[AR] P95    : 3.08
[AR] P99    : 3.08

After:

===========================================================
= PERFORMANCE OVERVIEW 
===========================================================
Request Throughput (req/sec):                     0.1544
Total Output Throughput (tokens/sec):             316.0127
Total Token Throughput (tokens/sec):              477.2701
Total Latency (ms):                               64747.4007
Average request latency (ms):                     6474.6948
Per User Output Throughput [w/ ctx] (tps/user):   316.3113
Per GPU Output Throughput (tps/gpu):              39.5016

-- Acceptance Rate Details --------------------------------
[AR] MINIMUM: 2.75
[AR] MAXIMUM: 3.07
[AR] AVERAGE: 2.83
[AR] P50    : 2.84
[AR] P90    : 3.07
[AR] P95    : 3.07
[AR] P99    : 3.07

For model accuracy, with this fix I enabled MTP Eagle and ran GPQA diamond twice with the same random seed. Both runs produced the same result, 70.202 ± 3.2586, as expected.

@lfr-0531 lfr-0531 requested a review from yweng0828 June 17, 2025 12:12
@lfr-0531 lfr-0531 requested a review from a team as a code owner June 17, 2025 12:12
@lfr-0531 lfr-0531 requested review from byshiue and yilin-void June 17, 2025 12:12
@lfr-0531 lfr-0531 force-pushed the user/fanrongl/fix_mtp_eagle_deterministic branch from 3883280 to cc397aa Compare June 17, 2025 12:12
@lfr-0531

/bot run --disable-fail-fast

@tensorrt-cicd

PR_Github #9204 [ run ] triggered by Bot

@lfr-0531

/bot run

@lfr-0531 lfr-0531 force-pushed the user/fanrongl/fix_mtp_eagle_deterministic branch from cc397aa to 9fcb06c Compare June 18, 2025 01:17
@lfr-0531

/bot kill

@tensorrt-cicd

PR_Github #9264 [ run ] triggered by Bot

@tensorrt-cicd

PR_Github #9265 [ kill ] triggered by Bot

@tensorrt-cicd

PR_Github #9264 [ run ] completed with state ABORTED

@tensorrt-cicd

PR_Github #9265 [ kill ] completed with state SUCCESS
Successfully killed previous jobs for commit 9fcb06c

@lfr-0531

/bot run

@tensorrt-cicd

PR_Github #9277 [ run ] triggered by Bot

@lfr-0531

/bot kill

@lfr-0531 lfr-0531 force-pushed the user/fanrongl/fix_mtp_eagle_deterministic branch from 9fcb06c to 0bf3101 Compare June 18, 2025 04:29
@tensorrt-cicd

PR_Github #9307 [ kill ] triggered by Bot

@tensorrt-cicd

PR_Github #9277 [ run ] completed with state ABORTED

@lfr-0531 lfr-0531 force-pushed the user/fanrongl/fix_mtp_eagle_deterministic branch from 0bf3101 to df46b63 Compare June 18, 2025 04:29
@lfr-0531

/bot run

@tensorrt-cicd

PR_Github #9307 [ kill ] completed with state SUCCESS
Successfully killed previous jobs for commit 0bf3101

@tensorrt-cicd

PR_Github #9308 [ run ] triggered by Bot

@tensorrt-cicd

PR_Github #9308 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #6830 completed with status: 'FAILURE'

@lfr-0531 lfr-0531 force-pushed the user/fanrongl/fix_mtp_eagle_deterministic branch from df46b63 to d6276c0 Compare June 18, 2025 07:35
@lfr-0531

/bot run --disable-fail-fast

@tensorrt-cicd

PR_Github #9334 [ run ] triggered by Bot

@tensorrt-cicd

PR_Github #9334 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #6851 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

@lfr-0531 lfr-0531 merged commit c7af650 into NVIDIA:main Jun 19, 2025
3 checks passed
k-l-lambda pushed a commit to k-l-lambda/TensorRT-LLM that referenced this pull request Jun 23, 2025
@lfr-0531 lfr-0531 deleted the user/fanrongl/fix_mtp_eagle_deterministic branch June 27, 2025 12:43
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Jul 9, 2025
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Jul 10, 2025
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Jul 10, 2025
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Jul 10, 2025
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Jul 10, 2025
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Jul 11, 2025
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Jul 11, 2025
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Jul 11, 2025