[Enhancement] Save compilation time from cute templates #272

NVIDIA recently published a blog post for CUDA Toolkit 12.8 about optimizing compile times with nvcc: [Optimizing Compile Times for CUDA C++](https://developer.nvidia.com/blog/optimizing-compile-times-for-cuda-c).

We hit a similar issue because we use cute as our backend, and its heavy use of templates can add significant compilation overhead. We might benefit from adopting their approach: use compilation time traces to find where the time goes, then reduce it.

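To reproduce and measure the compilation time:

```python
import subprocess
import time

# nvcc command for one generated kernel; the paths come from the original machine.
cmd = (
    "nvcc -std=c++17 -w -Xcudafe --diag_suppress=177 --compiler-options '-fPIC' "
    "-lineinfo --shared /tmp/tmp5hql14m2.cu -lcuda "
    "-gencode arch=compute_89,code=sm_89 "
    "-I/root/tilelang/tilelang/../src "
    "-I/root/tilelang/tilelang/../3rdparty/cutlass/include "
    "-diag-suppress=20013 -o /tmp/tmp5hql14m2.so"
)

start = time.perf_counter()
# check=True raises if nvcc fails, so a failed build is not timed by mistake.
subprocess.run(cmd, shell=True, check=True)
end = time.perf_counter()
print(f"Time taken: {end - start:.2f} seconds")
```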
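
Once a compile-time trace is available (the blog post describes nvcc's `--fdevice-time-trace` option for producing one), a small script can rank where the time goes. The sketch below is only an illustration: it assumes the trace is a Chrome-trace-style JSON file whose complete events carry `"ph": "X"`, a `name`, and a `dur` in microseconds. The helper names `load_trace_events` and `summarize_trace` are hypothetical and not part of tilelang or nvcc.

```python
import json
from collections import defaultdict


def load_trace_events(path):
    """Load events from a Chrome-trace-style JSON file (assumed layout)."""
    with open(path) as f:
        data = json.load(f)
    # Assumption: events are either wrapped in {"traceEvents": [...]} or a bare list.
    return data["traceEvents"] if isinstance(data, dict) else data


def summarize_trace(events, top_n=10):
    """Sum the duration of complete ("X") events by name, largest first."""
    totals = defaultdict(int)
    for event in events:
        # Skip metadata and other non-duration events.
        if event.get("ph") == "X":
            totals[event["name"]] += event.get("dur", 0)
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]
```

Calling `summarize_trace(load_trace_events("trace.json"))` on a trace from the command above would list the event names with the largest total duration, which should make it easier to see whether template instantiation from cute dominates. The file name `trace.json` is a placeholder.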

We need volunteers with a stable machine and CUDA 12.8 installed to help :)
