
perf(CuBLAS): explore reduction in launch overhead via CUDA graphs #1192

Closed · opened by @jon-chuang

Description

See https://developer.nvidia.com/blog/cuda-graphs/ for reference.

One can take one of two approaches (see the capture sketch after this list):

  1. Within a single operator.
  2. Spanning multiple operators (operator fusion).
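For context, here is a minimal sketch of the stream-capture mechanism described in the NVIDIA post. This is hypothetical code, not llama.cpp code, and the function and buffer names are made up; the point is that the launches of one iteration are recorded into a graph once and then replayed with a single cudaGraphLaunch, so per-kernel launch overhead is paid only at capture time.

```cpp
// Hypothetical sketch of graph capture around cuBLAS calls (CUDA 12 API).
#include <cublas_v2.h>
#include <cuda_runtime.h>

void run_with_graph(cublasHandle_t handle, cudaStream_t stream,
                    const float *A, const float *B, float *C, int n, int iters) {
    const float alpha = 1.0f, beta = 0.0f;
    cublasSetStream(handle, stream);

    // Record one iteration's launches into a graph instead of executing them.
    cudaGraph_t graph;
    cudaGraphExec_t exec;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, A, n, B, n, &beta, C, n);
    // ... further kernels/cuBLAS calls, within one operator or spanning several ...
    cudaStreamEndCapture(stream, &graph);
    cudaGraphInstantiate(&exec, graph, 0);  // CUDA 12 signature

    // Replay: one launch per iteration instead of one per captured kernel.
    for (int i = 0; i < iters; ++i) {
        cudaGraphLaunch(exec, stream);
    }
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
}
```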

Activity

jon-chuang (Contributor, Author) commented on Apr 26, 2023

Any thoughts, @slaren?

jon-chuang (Contributor, Author) commented on Apr 26, 2023

Looking at #1129 (comment), it seems that inter-operator fusion is required.

This means we need a concept of a device tensor. Looks like we are slowly reimplementing PyTorch...
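As a rough illustration of what such a device tensor might look like (hypothetical names only, none of this exists in ggml): the operator's output buffer stays resident in GPU memory so the next cuBLAS call can consume it without a host round-trip.

```cpp
// Hypothetical sketch only; ggml has no such type today.
#include <cstdint>
#include <cuda_runtime.h>

struct device_tensor {
    float  *data;   // device pointer, allocated once with cudaMalloc
    int64_t ne[2];  // rows, cols (ggml-style ne[] for a 2-D tensor)
};

static device_tensor device_tensor_alloc(int64_t rows, int64_t cols) {
    device_tensor t{nullptr, {rows, cols}};
    cudaMalloc(&t.data, rows * cols * sizeof(float));  // no host mirror needed
    return t;
}
```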

slaren (Member) commented on Apr 26, 2023

I don't think that we launch enough kernels for this to make a meaningful difference.

dfyz (Collaborator) commented on Apr 27, 2023

Using CUDA graphs would make sense if the duration of our kernels were comparable with the launch overhead (a couple of microseconds). As far as I understand, we intentionally use GPU only for large GEMMs that take at least a couple of milliseconds.
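To put rough numbers on this: with a launch overhead of about 2 µs against a GEMM of about 2 ms, the overhead is on the order of 0.1% of the kernel's runtime. A minimal timing sketch (hypothetical code, not from llama.cpp) that measures one cublasSgemm with CUDA events for such a comparison:

```cpp
// Hypothetical sketch: time a single cublasSgemm and compare the result
// against the ~2 µs cost of a kernel launch.
#include <cublas_v2.h>
#include <cuda_runtime.h>

float time_sgemm_ms(cublasHandle_t handle,
                    const float *A, const float *B, float *C, int n) {
    const float alpha = 1.0f, beta = 0.0f;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, A, n, B, n, &beta, C, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // e.g. ~2 ms for a large GEMM
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```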

github-actions (Contributor) commented on Apr 9, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.
