[Kernel] Add triton.autotune to address the high latency overhead of punica kernels #14272

congcongchen123 · 2025-03-05T09:45:14Z

Now that Phi-4-multimodal-instruct is merged into main, we would like to open another PR to address the high latency overhead we have observed for Phi-4-multimodal when using LoRA.
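For context on the technique named in the title: `triton.autotune` benchmarks a list of candidate kernel configurations and caches the fastest one for each distinct tuning key. The sketch below is a toy model of that idea in plain Python. It is not Triton's implementation and not the code in this PR; `TinyAutotuner`, `fake_cost`, and `BLOCK_SIZE` are hypothetical names used only for illustration.

```python
from typing import Callable, Dict, Hashable, List

Config = Dict[str, int]


class TinyAutotuner:
    """Toy model of autotuning: pick the cheapest config per key, then cache it."""

    def __init__(self, configs: List[Config],
                 measure: Callable[[Config, Hashable], float]):
        self.configs = configs
        self.measure = measure          # returns a cost (e.g. runtime) for a config
        self.cache: Dict[Hashable, Config] = {}
        self.tuning_runs = 0            # how many times we had to benchmark

    def best_config(self, key: Hashable) -> Config:
        # Benchmark every candidate only the first time a key is seen.
        if key not in self.cache:
            self.tuning_runs += 1
            self.cache[key] = min(self.configs,
                                  key=lambda cfg: self.measure(cfg, key))
        return self.cache[key]


def fake_cost(cfg: Config, key: Hashable) -> float:
    # Stand-in for a real benchmark (assumption): block sizes closest to the
    # problem size are treated as "fastest".
    return abs(cfg["BLOCK_SIZE"] - key)


tuner = TinyAutotuner(
    [{"BLOCK_SIZE": 16}, {"BLOCK_SIZE": 64}, {"BLOCK_SIZE": 256}],
    fake_cost,
)
first = tuner.best_config(50)    # benchmarks all candidates for key 50
second = tuner.best_config(50)   # served from the cache, no re-tuning
```

Under this model, the first call for each new key pays the tuning cost and later calls with the same key reuse the cached choice.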

Benchmark results with this PR:
ttft: time to first token (in seconds)
tbt: time between tokens (in seconds)
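As a rough illustration of the two metrics, here is a minimal sketch of how they could be computed from per-token timestamps. The function name `latency_metrics` and its inputs are assumptions for this example and are not taken from the benchmark used for this PR; tbt is reported here as the mean gap between consecutive tokens.

```python
from statistics import mean
from typing import List, Tuple


def latency_metrics(request_start: float,
                    token_times: List[float]) -> Tuple[float, float]:
    """Return (ttft, mean tbt) in seconds.

    request_start: time the request was sent.
    token_times:   arrival time of each generated token, in order.
    """
    if not token_times:
        raise ValueError("at least one token timestamp is required")
    # Time to first token: delay between sending the request and the first token.
    ttft = token_times[0] - request_start
    # Time between tokens: gaps between consecutive token arrivals.
    gaps = [later - earlier
            for earlier, later in zip(token_times, token_times[1:])]
    tbt = mean(gaps) if gaps else 0.0
    return ttft, tbt


# Example with made-up timestamps (seconds).
ttft, tbt = latency_metrics(10.0, [10.5, 10.75, 11.0, 11.25])
```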


github-actions · 2025-03-05T09:45:24Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

jeejeelee · 2025-03-05T10:16:02Z

Thanks for your contribution, have you tested the performance of autotune on models like llama?

congcongchen123 · 2025-03-05T21:26:43Z

> Thanks for your contribution, have you tested the performance of autotune on models like llama?

Nope, I am not familiar with the llama family of models that use LoRA. Do you know any of those?

lhcavalcanti · 2025-03-06T01:03:03Z

@DarkLight1337 could you take a look at this one?

Thanks!

jeejeelee · 2025-03-06T01:55:00Z

@congcongchen123 you can refer to https://github.com/vllm-project/vllm/blob/main/tests/lora/test_llama_tp.py

DarkLight1337 · 2025-03-06T06:22:01Z

I'll hand it over to @jeejeelee as he is more familiar with these kernels.

amarflybot · 2025-04-03T10:05:16Z

Hi @congcongchen123, could you please rebase these changes on master again? I think the files no longer exist.

mergify · 2025-06-08T08:12:39Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @congcongchen123.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

github-actions · 2025-09-07T02:08:52Z

This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!

Add triton.autotune to address the high latency overheads from punica…

59a261d

… kernels Signed-off-by: Congcong Chen <[email protected]>

congcongchen123 mentioned this pull request Mar 5, 2025

[New Model]: Phi-4 Multimodal Instruct #13936

Closed

1 task

DarkLight1337 requested a review from jeejeelee March 6, 2025 06:21

mergify bot added the needs-rebase label Jun 8, 2025

github-actions bot added the stale (Over 90 days of inactivity) label Sep 7, 2025
