Closed
Description
See https://developer.nvidia.com/blog/cuda-graphs/ for reference.
One can take one of two approaches:
- Within a single operator.
- Spanning multiple operators (operator fusion).
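To illustrate the second approach, here is a minimal sketch of capturing two back-to-back kernel launches into one CUDA graph via stream capture, then replaying it with a single launch call. The kernels `scale` and `add_bias` are hypothetical stand-ins for two consecutive operators and are not taken from this project. The sketch assumes CUDA 11.4 or newer (for `cudaGraphInstantiateWithFlags`) and omits error checking for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-ins for two consecutive operators.
__global__ void scale(float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

__global__ void add_bias(float *x, float b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += b;
}

int main() {
    const int n = 1 << 20;
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    float *d_x = nullptr;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMemset(d_x, 0, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // While capture is active, launches on `stream` are recorded, not executed.
    cudaGraph_t graph;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    scale<<<blocks, threads, 0, stream>>>(d_x, 2.0f, n);
    add_bias<<<blocks, threads, 0, stream>>>(d_x, 1.0f, n);
    cudaStreamEndCapture(stream, &graph);

    // Instantiate once, then replay both kernels with one launch call each time.
    cudaGraphExec_t exec;
    cudaGraphInstantiateWithFlags(&exec, graph, 0);
    for (int iter = 0; iter < 3; ++iter) {
        cudaGraphLaunch(exec, stream);
    }
    cudaStreamSynchronize(stream);

    float first = 0.0f;
    cudaMemcpy(&first, d_x, sizeof(float), cudaMemcpyDeviceToHost);
    printf("x[0] after 3 graph launches: %f\n", first);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    cudaFree(d_x);
    return 0;
}
```

The benefit comes from paying the instantiation cost once and amortizing it over many replays, so it only matters when the same sequence of small kernels is launched repeatedly.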
Activity
jon-chuang commented on Apr 26, 2023
Any thoughts @slaren ?
jon-chuang commented on Apr 26, 2023
Looking at #1129 (comment)
It seems that inter-operator fusion is required.
This means we need a concept of a device tensor. Looks like we are slowly reimplementing PyTorch...
slaren commented on Apr 26, 2023
I don't think that we launch enough kernels for this to make a meaningful difference.
dfyz commented on Apr 27, 2023
Using CUDA graphs would make sense if the duration of our kernels were comparable to the launch overhead (a couple of microseconds). As far as I understand, we intentionally use the GPU only for large GEMMs that take at least a couple of milliseconds.
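As a rough worked estimate of that comparison, take 2 µs of launch overhead and 2 ms of kernel time as illustrative readings of "a couple of" (assumed values, not measurements):

```latex
\frac{t_{\text{launch}}}{t_{\text{kernel}}}
  \approx \frac{2 \times 10^{-6}\,\text{s}}{2 \times 10^{-3}\,\text{s}}
  = 10^{-3}
```

Under those assumptions launch overhead is about 0.1% of kernel time, so eliminating it entirely would save at most roughly that much.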
github-actions commented on Apr 9, 2024
This issue was closed because it has been inactive for 14 days since being marked as stale.