
perf(CuBLAS): explore reduction in launch overhead via CUDA graphs #1192

Closed · opened by @jon-chuang

Description

See https://developer.nvidia.com/blog/cuda-graphs/ for reference.

One can take one of two approaches (see the capture sketch after this list):

  1. Within a single operator.
  2. Spanning multiple operators (operator fusion).
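For context, here is a minimal sketch of the stream-capture mechanism described in the NVIDIA post. This is hypothetical code, not llama.cpp code, and the function and buffer names are made up; the point is that the launches of one iteration are recorded into a graph once and then replayed with a single cudaGraphLaunch, so per-kernel launch overhead is paid only at capture time.

```cpp
// Hypothetical sketch of graph capture around cuBLAS calls (CUDA 12 API).
#include <cublas_v2.h>
#include <cuda_runtime.h>

void run_with_graph(cublasHandle_t handle, cudaStream_t stream,
                    const float *A, const float *B, float *C, int n, int iters) {
    const float alpha = 1.0f, beta = 0.0f;
    cublasSetStream(handle, stream);

    // Record one iteration's launches into a graph instead of executing them.
    cudaGraph_t graph;
    cudaGraphExec_t exec;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, A, n, B, n, &beta, C, n);
    // ... further kernels/cuBLAS calls, within one operator or spanning several ...
    cudaStreamEndCapture(stream, &graph);
    cudaGraphInstantiate(&exec, graph, 0);  // CUDA 12 signature

    // Replay: one launch per iteration instead of one per captured kernel.
    for (int i = 0; i < iters; ++i) {
        cudaGraphLaunch(exec, stream);
    }
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
}
```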

Activity

jon-chuang (Contributor, Author) commented on Apr 26, 2023

Any thoughts, @slaren?

jon-chuang (Contributor, Author) commented on Apr 26, 2023

Looking at #1129 (comment), it seems that inter-operator fusion is required.

This means we need a concept of a device tensor. Looks like we are slowly reimplementing PyTorch...
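As a rough illustration of what such a device tensor might look like (hypothetical names only, none of this exists in ggml): the operator's output buffer stays resident in GPU memory so the next cuBLAS call can consume it without a host round-trip.

```cpp
// Hypothetical sketch only; ggml has no such type today.
#include <cstdint>
#include <cuda_runtime.h>

struct device_tensor {
    float  *data;   // device pointer, allocated once with cudaMalloc
    int64_t ne[2];  // rows, cols (ggml-style ne[] for a 2-D tensor)
};

static device_tensor device_tensor_alloc(int64_t rows, int64_t cols) {
    device_tensor t{nullptr, {rows, cols}};
    cudaMalloc(&t.data, rows * cols * sizeof(float));  // no host mirror needed
    return t;
}
```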

slaren (Member) commented on Apr 26, 2023

I don't think that we launch enough kernels for this to make a meaningful difference.

dfyz (Collaborator) commented on Apr 27, 2023

Using CUDA graphs would make sense if the duration of our kernels were comparable with the launch overhead (a couple of microseconds). As far as I understand, we intentionally use GPU only for large GEMMs that take at least a couple of milliseconds.
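To put rough numbers on this: with a launch overhead of about 2 µs against a GEMM of about 2 ms, the overhead is on the order of 0.1% of the kernel's runtime. A minimal timing sketch (hypothetical code, not from llama.cpp) that measures one cublasSgemm with CUDA events for such a comparison:

```cpp
// Hypothetical sketch: time a single cublasSgemm and compare the result
// against the ~2 µs cost of a kernel launch.
#include <cublas_v2.h>
#include <cuda_runtime.h>

float time_sgemm_ms(cublasHandle_t handle,
                    const float *A, const float *B, float *C, int n) {
    const float alpha = 1.0f, beta = 0.0f;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, A, n, B, n, &beta, C, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // e.g. ~2 ms for a large GEMM
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```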

github-actions (Contributor) commented on Apr 9, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.
