
Conversation

jlonge4 (Contributor) commented Apr 30, 2025

Issue #, if available:
N/A
Description of changes:
Add Qwen3 model file and inference notebook. Tested with Qwen/Qwen3-8B

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
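
For a quick CPU-side sanity check of the reference model before tracing, here is a minimal sketch using plain Hugging Face transformers. This is an illustration only, not part of the PR; it assumes transformers>=4.51, which added Qwen3 support:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-8B"  # the checkpoint tested in this PR
tok = AutoTokenizer.from_pretrained(model_id)
# bfloat16 mirrors the --torch-dtype used for the Neuron runs later in this thread.
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

inputs = tok("To be, or not to be", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=25, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))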

jlonge4 commented May 14, 2025

Logit Validation Benchmark Code:

!inference_demo \
    --model-type qwen3 \
    --task-type causal-lm \
    run \
    --model-path /home/ubuntu/model_hf_qwen/qwen/ \
    --compiled-model-path /home/ubuntu/traced_model_qwen/qwen/logit \
    --torch-dtype bfloat16 \
    --tp-degree 8 \
    --batch-size 1 \
    --max-context-length 16 \
    --seq-len 32 \
    --enable-bucketing \
    --pad-token-id 151645 \
    --prompt "To be, or not to be" \
    --check-accuracy-mode logit-matching \
    --benchmark
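
For reference, --pad-token-id 151645 is Qwen3's <|im_end|> end-of-turn token. A quick check with the Hugging Face tokenizer (assumes Hub access and transformers>=4.51):

from transformers import AutoTokenizer  # transformers>=4.51 required for Qwen3

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
# Qwen3 uses <|im_end|> (id 151645) as its end-of-turn/eos token,
# which is the value passed above as --pad-token-id.
print(tok.convert_tokens_to_ids("<|im_end|>"))  # 151645
print(tok.eos_token, tok.eos_token_id)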

Results:

Expected Output:  [", that is the question. Whether 'tis nobler in the mind to suffer the slings and arrows of outrageous fortune"] tensor([[   11,   429,   374,   279,  3405,    13, 13139,   364,    83,   285,
         13049,  1536,   304,   279,  3971,   311,  7676,   279,  1739,   819,
           323, 36957,   315, 54488, 32315]])
Expected Logits Shape:  torch.Size([25, 1, 151936])
Actual Output:  [", that is the question. Whether 'tis nobler in the mind to suffer the slings and arrows of outrageous fortune"] tensor([[   11,   429,   374,   279,  3405,    13, 13139,   364,    83,   285,
         13049,  1536,   304,   279,  3971,   311,  7676,   279,  1739,   819,
           323, 36957,   315, 54488, 32315]])
Actual Logits Shape:  torch.Size([25, 1, 151936])
Passed logits validation!

Generating outputs...
Prompts: ['To be, or not to be']
Generated outputs:
Output 0: To be, or not to be, that is the question. Whether 'tis nobler in the mind to suffer the slings and arrows of outrageous fortune

Benchmark completed and its result is as following
{
    "e2e_model": {
        "latency_ms_p50": 169.31116580963135,
        "latency_ms_p90": 172.9245901107788,
        "latency_ms_p95": 174.3390679359436,
        "latency_ms_p99": 174.82486009597778,
        "latency_ms_p100": 174.94630813598633,
        "latency_ms_avg": 169.6009874343872,
        "throughput": 188.67814677305284
    },
    "context_encoding_model": {
        "latency_ms_p50": 13.715386390686035,
        "latency_ms_p90": 13.958406448364258,
        "latency_ms_p95": 13.969480991363525,
        "latency_ms_p99": 13.981258869171143,
        "latency_ms_p100": 13.984203338623047,
        "latency_ms_avg": 13.787257671356201,
        "throughput": 1160.4918382892702
    },
    "token_generation_model": {
        "latency_ms_p50": 8.931398391723633,
        "latency_ms_p90": 9.162139892578125,
        "latency_ms_p95": 9.23851728439331,
        "latency_ms_p99": 9.780135154724094,
        "latency_ms_p100": 12.94398307800293,
        "latency_ms_avg": 9.013524055480957,
        "throughput": 118.34069117705926
    }
}
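
For readers unfamiliar with these fields: the percentiles and throughput are standard aggregates over repeated timed runs. A hedged sketch of the arithmetic (not inference_demo's actual internals; the latency values below are hypothetical):

import numpy as np

# Hypothetical per-iteration end-to-end latencies in milliseconds.
latencies_ms = np.array([169.3, 168.8, 170.1, 172.9, 174.3, 174.9])

for p in (50, 90, 95, 99, 100):
    print(f"latency_ms_p{p}: {np.percentile(latencies_ms, p):.3f}")

# Throughput = tokens processed per second of average latency. This is
# consistent with the report above: 32 (--seq-len) / 0.1696 s ≈ 188.7 for
# e2e_model, and 16 (--max-context-length) / 0.013787 s ≈ 1160.5 for
# context_encoding_model.
print("throughput:", 32 / (latencies_ms.mean() / 1000.0))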

"cell_type": "markdown",
"metadata": {},
"source": [
"# Thinking example"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@EmilyWebber that should do it : )
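
For context, the notebook's thinking example relies on Qwen3's chat-template switch for reasoning traces. A hedged sketch of the usual invocation per the Qwen3 model card (not the exact notebook cell):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
messages = [{"role": "user", "content": "To be, or not to be?"}]
# Qwen3's chat template accepts enable_thinking; when True, the model is
# prompted to emit its reasoning inside <think> ... </think> tags.
prompt = tok.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,
)
print(prompt)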

ValkyriaLenneth commented:

@jlonge4
Thanks for your great work.
But I'm confused about the transformers version: neuronx-distributed needs transformers==4.48, while Qwen3 needs transformers>=4.51.
How did you resolve this conflict?
Thanks

jlonge4 commented May 30, 2025

@ValkyriaLenneth Thanks for the kind words. At the time of creating this PR, I got this working on an older SDK version (Neuron SDK 2.17), AMI ID ami-04faec134fd67f201.
With that version/AMI I had no issues using transformers==4.51.3.
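
A quick way to confirm the environment actually resolves Qwen3 after pinning transformers (a sketch; the pin worked on that SDK/AMI combination but may conflict on others):

# Run inside the Neuron venv after: pip install "transformers==4.51.3"
import transformers
print(transformers.__version__)  # expect 4.51.x; Qwen3 support landed in 4.51

from transformers import AutoConfig
# This only resolves to model_type "qwen3" on transformers>=4.51.
print(AutoConfig.from_pretrained("Qwen/Qwen3-8B").model_type)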

jlonge4 commented Jun 2, 2025

Updated the model file and re-ran the test notebook on the latest AMI (ami-0d0a2d26f80b645c2) with the associated package versions:

libneuronxla                  2.2.3493.0+78c3e78c
neuronx-cc                    2.18.121.0+9e31e41a
neuronx-distributed           0.12.12111+cdd84048
neuronx-distributed-inference 0.3.5591+f50feae2
torch-neuronx                 2.6.0.2.7.5413+113e6810

AMI venv used: aws_neuronx_venv_pytorch_2_6_nxd_inference (Neuron SDK v2.23.0).
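
To reproduce this version listing on another instance, one option is importlib.metadata (the package names are the pip distribution names shown above):

from importlib.metadata import version

for pkg in ("libneuronxla", "neuronx-cc", "neuronx-distributed",
            "neuronx-distributed-inference", "torch-neuronx"):
    print(pkg, version(pkg))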

jlonge4 commented Jun 2, 2025

Logit validation with seq_length=1024, context_length=512:

Result: Minimal logit divergence.

Test failed at batch 0 token 103. Top k = 5 error 0.01682760939002037 > 0.01.
Test failed at batch 0 token 108. Top k = 5 error 0.016880331560969353 > 0.01.
Divergence at index 204. Validating 1 tokens in each batch.
Divergence at index 319. Validating 115 tokens in each batch.
Test failed at batch 0 token 286. Top k = None error 0.07318327575922012 > 0.05. Top k = 1000 error 0.07318327575922012 > 0.03. Top k = 50 error 0.07318327575922012 > 0.02. Top k = 5 error 0.07318327575922012 > 0.01.
No divergence. Validating the remaining 81 tokens in each batch.
Test failed at batch 0 token 360. Top k = None error 0.06745750457048416 > 0.05. Top k = 1000 error 0.05250008776783943 > 0.03. Top k = 50 error 0.03233567625284195 > 0.02. Top k = 5 error 0.03233567625284195 > 0.01.
Test failed at batch 0 token 364. Top k = None error 0.37251684069633484 > 0.05. Top k = 1000 error 0.35812416672706604 > 0.03. Top k = 50 error 0.35812416672706604 > 0.02. Top k = 5 error 0.35812416672706604 > 0.01.
Summary: Max divergence difference = 0 at index (batch 0 token 0), Top k = None max error = 0.37251684069633484 at index (batch 0 token 364), Top k = 1000 max error = 0.35812416672706604 at index (batch 0 token 364), Top k = 50 max error = 0.35812416672706604 at index (batch 0 token 364), Top k = 5 max error = 0.35812416672706604 at index (batch 0 token 364)
Test fails logit validation.
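
For readers parsing these failures: each line reports an error over the top-k logits at a given token position, checked against per-k thresholds visible in the log (0.01 for k=5 up to 0.05 for k=None). A hedged sketch of one plausible top-k comparison; the checker's actual metric may differ:

import torch

def topk_logit_error(expected: torch.Tensor, actual: torch.Tensor, k: int) -> float:
    """Hypothetical metric: max relative error at the expected top-k token ids."""
    top_vals, top_idx = expected.topk(k)
    diff = (actual[top_idx] - top_vals).abs()
    return (diff / top_vals.abs().clamp_min(1e-6)).max().item()

# e.g. a token would fail the k=5 check above when
# topk_logit_error(expected_logits[t], actual_logits[t], k=5) > 0.01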

jlonge4 closed this Sep 4, 2025