This repository contains a series of puzzle-style Jupyter notebooks that guide you through implementing common GPU kernels, first in raw PyTorch, then with torch.compile, and finally with Triton.
Slides: https://docs.google.com/presentation/d/1VooJ3gQlhbPSyG5F08JTgvo7FCm4TyVivvdLl_gZ7-c
First, install dependencies:
pip install -r requirements.txt
If you have a GPU, install Triton directly via pip:
pip install triton==3.3.1
If you do not have a GPU, you need to build Triton CPU from source (this takes some time, about 15 minutes on my Mac):
git clone https://github.com/triton-lang/triton-cpu.git
cd triton-cpu
git submodule update --init --recursive
cd python
pip install -r requirements.txt
pip install -e .
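Before opening the notebooks, it may be worth confirming that the installation worked. The snippet below is a minimal sketch, not part of the repository: the helper name `check_packages` is hypothetical, and it only checks that the packages can be located by Python, not that Triton can actually compile and run a kernel on your machine.

```python
import importlib.util


def check_packages(names=("torch", "triton")):
    """Return a dict mapping each top-level package name to True if Python can locate it."""
    # find_spec returns None when a top-level package is not installed.
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in check_packages().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

If either package is reported as missing, rerun the corresponding install step above in the same Python environment that Jupyter uses.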
Now open the notebooks in Jupyter, step through each cell, and solve the puzzles:
- Vector Addition
- Fused Softmax
- Fused Entmax
- Matrix Multiplication
- Layer Normalization
- Cross-Entropy Loss
- Softmax Attention - Forward Pass
- Sparsemax Attention - Forward Pass (BONUS)
Happy puzzling!
- Triton Docs
- Triton Tutorial
- GPU Mode (lecture 14)
- Christian Mills' Notes
- Stanford CS336: Language modeling from scratch (lecture 5)
Kernels: