Commit 441e40d

fix: improve phrasing, add gist links, update authors for all reviewers/core contributors and syntax edits

1 parent 36e1829 commit 441e40d

File tree: 1 file changed

hello-hf-kernels.md (37 additions, 36 deletions)

@@ -4,8 +4,10 @@ thumbnail: /blog/assets/hello-hf-kernels/kernel-hub-five-mins-short.png
 authors:
 - user: drbh
 - user: danieldk
+- user: narsil
 - user: pcuenca
 - user: pagezyhf
+- user: merve
 date: 2025-03-28
 ---
 
@@ -51,9 +53,9 @@ Spend more time building great models and less time fighting build systems!
 
 Using the Kernel Hub is designed to be straightforward. The `kernels` library provides the main interface. Here's a quick example that loads an optimized GELU activation function kernel. (Later on, we'll see another example of how to integrate a kernel into a model.)
 
-File: `activation_validation_example.py`
+File: [`activation_validation_example.py`](https://gist.github.com/drbh/aa4b8cfb79597e98be6cf0108644ce16)
 
-~~~python
+```python
 # /// script
 # dependencies = [
 # "numpy",
@@ -102,7 +104,7 @@ print(expected)
 # List available functions in the loaded kernel module
 print("\nAvailable functions in 'kernels-community/activation':")
 print(dir(activation_kernels))
-~~~
+```
 
 **(Note:** If you have [`uv`](https://github.com/astral-sh/uv) installed, you can save this script as `script.py` and run `uv run script.py` to automatically handle dependencies.)
 
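The gist linked above contains the full validation script; stripped to its core, the pattern looks like this minimal sketch (it assumes a CUDA device and that the activation kernel exposes a `gelu_fast(out, input)` function writing into a preallocated output tensor; use `dir(activation_kernels)` as shown to confirm what your kernel actually provides):

```python
# Minimal sketch: load a kernel from the Hub and validate it against PyTorch.
# Assumes a CUDA device; `gelu_fast(out, input)` is the out-parameter style
# used here (confirm with dir(activation_kernels)).
import torch
from kernels import get_kernel

activation_kernels = get_kernel("kernels-community/activation")

x = torch.randn(16, 64, device="cuda", dtype=torch.float16)
out = torch.empty_like(x)
activation_kernels.gelu_fast(out, x)  # kernel writes its result into `out`

# Compare against PyTorch's tanh-approximate GELU as a sanity check
expected = torch.nn.functional.gelu(x, approximate="tanh")
print("max abs diff:", (out - expected).abs().max().item())
```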
@@ -123,9 +125,9 @@ Let's integrate an optimized **RMS Normalization** kernel into a basic model. We
 First, define a simple RMSNorm module in PyTorch and a baseline model using it:
 
 
-File: `rmsnorm_baseline.py`
+File: [`rmsnorm_baseline.py`](https://gist.github.com/drbh/96621d9eafec5dfa0ca9ca59f6fc1991)
 
-~~~python
+```python
 # /// script
 # dependencies = [
 # "numpy",
@@ -198,13 +200,13 @@ baseline_model = (
 dummy_input = torch.randn(32, input_size, device=DEVICE, dtype=DTYPE) # Batch of 32
 output = baseline_model(dummy_input)
 print("Baseline RMSNorm model output shape:", output.shape)
-~~~
+```
 
 Now, let's create a version using the `LlamaRMSNorm` kernel loaded via `kernels`.
 
-File: `rmsnorm_kernel.py`
+File: [`rmsnorm_kernel.py`](https://gist.github.com/drbh/141373363e83ea0345807d6525e1fb64)
 
-~~~python
+```python
 # /// script
 # dependencies = [
 # "numpy",
@@ -334,22 +336,21 @@ except NameError:
     print("\nSkipping output comparison as kernel model output was not generated.")
 
 
-~~~
+```
 
 **Important Notes on the `KernelModel`:**
 
 * **Kernel Inheritance:** The `KernelRMSNorm` class inherits from `layer_norm_kernel_module.layers.LlamaRMSNorm`, which is the RMSNorm implementation in the kernel. This allows us to use the optimized kernel directly.
 * **Accessing the Function:** The exact way to access the RMSNorm function (`layer_norm_kernel_module.layers.LlamaRMSNorm.forward`, `layer_norm_kernel_module.rms_norm_forward`, or something else) **depends entirely on how the kernel creator structured the repository on the Hub.** You may need to inspect the loaded `layer_norm_kernel_module` object (e.g., using `dir()`) or check the kernel's documentation on the Hub to find the correct function/method and its signature. I've used `rms_norm_forward` as a plausible placeholder and added error handling.
 * **Parameters:** We now only define `rms_norm_weight` (no bias), consistent with RMSNorm.
 
-## 4. Review Performance Impact
-
-Does using the optimized Triton RMSNorm kernel provide a speedup compared to the basic PyTorch version? Let's benchmark the forward pass again.
+## 4. Benchmarking the Performance Impact
 
+How much faster is the optimized Triton RMSNorm kernel compared to the standard PyTorch version? Let’s benchmark the forward pass to find out.
 
-File: `rmsnorm_benchmark.py`
+File: [`rmsnorm_benchmark.py`](https://gist.github.com/drbh/c754a4ba52bcc46190ae4a45516fb190)
 
-~~~python
+```python
 # /// script
 # dependencies = [
 # "numpy",
@@ -467,42 +468,42 @@ for batch_size in batch_sizes:
         speedup = f"{kernel_time / baseline_time:.2f}x slower"
     print(f"{batch_size:<12} | {baseline_time:<18} | {kernel_time:<18} | {speedup}")
 
-~~~
+```
 
 **Expected Outcome:**
-Similar to LayerNorm, optimized RMSNorm kernels (especially those using Triton) implemented for specific hardware can offer significant speedups over basic PyTorch implementations, particularly for memory-bound operations on suitable hardware (e.g., NVIDIA Ampere/Hopper GPUs) and data types (`float16`/`bfloat16`).
+As with LayerNorm, a well-tuned RMSNorm implementation using Triton can deliver substantial speedups over PyTorch’s default version—especially for memory-bound workloads on compatible hardware (e.g., NVIDIA Ampere or Hopper GPUs) and with low-precision types like `float16` or `bfloat16`.
+
 
-**Important Caveats (Remain Applicable):**
-* Microbenchmark limitations.
-* Dependence on Hardware, Input Size, Dtype.
-* Quality of the specific kernel implementation.
-* Potential overhead for small inputs.
+**Keep in Mind:**
+* Results may vary depending on your GPU, input size, and data type.
+* Microbenchmarks can misrepresent real-world performance.
+* Performance hinges on the quality of the kernel implementation.
+* Optimized kernels might not benefit small batch sizes due to overhead.
 
 
 Actual results will depend on your hardware and the specific kernel implementation. Here's an example of what you might see (on an L4 GPU):
 
-```txt
-Batch Size   | Baseline Time (ms) | Kernel Time (ms)   | Speedup
---------------------------------------------------------------------------
-256          | 0.2122             | 0.2911             | 1.37x slower
-512          | 0.4748             | 0.3312             | 1.43x
-1024         | 0.8946             | 0.6864             | 1.30x
-2048         | 2.0289             | 1.3889             | 1.46x
-4096         | 4.4318             | 2.2467             | 1.97x
-8192         | 9.2438             | 4.8497             | 1.91x
-16384        | 18.6992            | 9.8805             | 1.89x
-32768        | 37.079             | 19.9461            | 1.86x
-65536        | 73.588             | 39.593             | 1.86x
-```
+
+| Batch Size | Baseline Time (ms) | Kernel Time (ms) | Speedup |
+| ---------- | ------------------ | ---------------- | ------- |
+| 256        | 0.2122             | 0.2911           | 0.72x   |
+| 512        | 0.4748             | 0.3312           | 1.43x   |
+| 1024       | 0.8946             | 0.6864           | 1.30x   |
+| 2048       | 2.0289             | 1.3889           | 1.46x   |
+| 4096       | 4.4318             | 2.2467           | 1.97x   |
+| 8192       | 9.2438             | 4.8497           | 1.91x   |
+| 16384      | 18.6992            | 9.8805           | 1.89x   |
+| 32768      | 37.079             | 19.9461          | 1.86x   |
+| 65536      | 73.588             | 39.593           | 1.86x   |
 
 ## Get Started and Next Steps!
 
 You've seen how easy it is to fetch and use optimized kernels with the Hugging Face Kernel Hub. Ready to try it yourself?
 
 1. **Install the library:**
-   ~~~bash
+   ```bash
    pip install kernels torch numpy
-   ~~~
+   ```
    Ensure you have a compatible PyTorch version and GPU driver installed.
 
 2. **Browse the Hub:** Explore available kernels on the Hugging Face Hub under the [`kernels` tag](https://huggingface.co/kernels) or within organizations like [`kernels-community`](https://huggingface.co/kernels-community). Look for kernels relevant to your operations (activations, attention, normalization like LayerNorm/RMSNorm, etc.).
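
As a quick smoke test after installing, you can re-run the activation example from earlier in miniature (assumes a CUDA device and the `gelu_fast(out, input)` convention noted above):

```python
# Smoke test: fetch a kernel from the Hub and run a single op on the GPU.
import torch
from kernels import get_kernel

activation = get_kernel("kernels-community/activation")
x = torch.randn(4, 4, device="cuda", dtype=torch.float16)
out = torch.empty_like(x)
activation.gelu_fast(out, x)
print("Kernel Hub is working; output shape:", out.shape)
```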
