
CUDA ML Kernels

🚀 Motivation

This repository implements custom CUDA kernels for common ML operations, benchmarked against PyTorch's highly optimized cuBLAS/cuDNN kernels. The goal is to:

  • Understand GPU parallelization patterns
  • Compare naive kernel performance vs. library implementations
  • Build intuition for ML Systems performance engineering

📁 Repository Structure

├── kernels/
│   ├── matrix_multiply.cu
│   ├── vector_add.cu
│   ├── relu.cu
│   ├── dot_product.cu
│   └── intro.cu
├── benchmarks/
│   └── benchmark.py
└── .gitignore
  • kernels/: CUDA C++ kernel implementations
  • benchmarks/: Python script to benchmark kernels vs. PyTorch

⚡ Kernels Implemented

| Kernel          | Description                       |
|-----------------|-----------------------------------|
| matrix_multiply | Matrix multiplication (1024x1024) |
| vector_add      | Elementwise vector addition       |
| relu            | ReLU activation function          |
| dot_product     | Vector dot product reduction      |
| intro           | 4x4 matrix multiplication demo    |
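For orientation, the elementwise kernels follow the usual one-thread-per-element pattern. The sketch below shows what a naive ReLU kernel of this kind typically looks like; it is illustrative only, and the actual kernels/relu.cu may differ in naming and launch configuration.

#include <cuda_runtime.h>

// Naive elementwise ReLU: one thread per output element (illustrative sketch,
// not necessarily identical to kernels/relu.cu).
__global__ void relu_kernel(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = in[i] > 0.0f ? in[i] : 0.0f;
    }
}

// Typical launch: 256 threads per block, enough blocks to cover n elements.
// relu_kernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);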

🔧 Build Instructions

  1. Ensure NVIDIA CUDA Toolkit is installed.

  2. Compile each .cu file:

cd kernels

nvcc -o matrix_multiply.exe matrix_multiply.cu
nvcc -o vector_add.exe vector_add.cu
nvcc -o relu.exe relu.cu
nvcc -o dot_product.exe dot_product.cu
nvcc -o intro.exe intro.cu

On Linux/macOS, omit the .exe extension (for example, nvcc -o matrix_multiply matrix_multiply.cu).


🧪 Running Benchmarks

From the repo root:

cd benchmarks
python benchmark.py
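
As background on what such a benchmark has to get right: GPU kernels launch asynchronously, so host-side wall-clock timing can under-report kernel time unless the device is synchronized. On the CUDA side this is usually handled with CUDA events. The helper below is a minimal sketch of that technique and is an assumption about the general approach, not code taken from benchmark.py.

#include <cuda_runtime.h>

// Times one launch of a kernel using CUDA events (illustrative sketch;
// benchmark.py may rely on PyTorch-side timing instead).
template <typename LaunchFn>
float time_kernel_ms(LaunchFn launch) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    launch();                       // enqueue the kernel under test
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);     // wait for the kernel to finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}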

📊 Results Summary

| Kernel             | PyTorch Time (ms) | Custom CUDA Time (ms) | Speedup |
|--------------------|-------------------|-----------------------|---------|
| matrix_multiply    | 14.28             | 6.95                  | 2.06x   |
| vector_add         | 2.39              | 1.35                  | 1.77x   |
| relu               | 2.97              | 0.56                  | 5.35x   |
| dot_product        | ~0                | 1.33                  | Slower  |
| intro (4x4 matmul) | 2.25              | 1.08                  | 2.09x   |

💡 Key Insights

  • Matrix multiplication and ReLU kernels show significant speedups, demonstrating effective GPU thread parallelization.
  • Vector addition gains are modest: elementwise addition is memory-bandwidth-bound, and PyTorch's built-in kernels already run close to peak bandwidth.
  • Dot product is slower because the naive reduction cannot compete with PyTorch's warp-level optimized reductions (see the sketch after this list).
  • For the tiny 4x4 matmul (intro), runtime is dominated by kernel-launch and framework overhead rather than arithmetic, so the lightweight custom kernel comes out ahead.
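
For reference, the warp-level reduction mentioned above typically combines partial sums within a warp using register shuffles before touching global memory. The kernel below is a sketch of that technique under my own naming; it is not the current kernels/dot_product.cu implementation.

#include <cuda_runtime.h>

// Dot product with a grid-stride loop and a warp-shuffle reduction
// (illustrative sketch of the planned optimization, not existing repo code).
__global__ void dot_product_warp(const float* a, const float* b,
                                 float* result, int n) {
    float sum = 0.0f;
    // Each thread accumulates a partial sum over a strided range.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x) {
        sum += a[i] * b[i];
    }
    // Reduce within the warp using shuffles; no shared memory needed.
    for (int offset = warpSize / 2; offset > 0; offset /= 2) {
        sum += __shfl_down_sync(0xffffffff, sum, offset);
    }
    // Lane 0 of each warp adds its partial result to the global accumulator.
    if ((threadIdx.x % warpSize) == 0) {
        atomicAdd(result, sum);
    }
}

// *result must be zero-initialized before launch, e.g. with cudaMemset.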

📝 Future Improvements

  • Implement warp-level reductions for dot product
  • Integrate unit tests comparing kernel outputs with PyTorch for correctness validation (a host-side variant of this check is sketched after this list)
  • Extend to batched kernels relevant for end-to-end ML pipeline acceleration
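
The correctness checks above would ultimately compare against PyTorch on the Python side; as a self-contained stand-in, the same idea can be sketched on the C++ side against a plain CPU reference. The helper name and tolerances below are hypothetical.

#include <cmath>
#include <cstdio>

// Compares a device result (already copied back to the host) against a CPU
// reference within a tolerance. Illustrative sketch; the planned tests would
// compare against PyTorch outputs instead.
bool allclose(const float* got, const float* expected, int n,
              float atol = 1e-5f, float rtol = 1e-4f) {
    for (int i = 0; i < n; ++i) {
        float diff = std::fabs(got[i] - expected[i]);
        float tol  = atol + rtol * std::fabs(expected[i]);
        if (diff > tol) {
            std::printf("mismatch at %d: got %f, expected %f\n",
                        i, got[i], expected[i]);
            return false;
        }
    }
    return true;
}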

👤 Author

Gauri Sharan


📜 License

MIT License
