Spend more time building great models and less time fighting build systems!
Using the Kernel Hub is designed to be straightforward. The `kernels` library provides the main interface. Here's a quick example that loads an optimized GELU activation kernel. (Later on, we'll see another example showing how to integrate a kernel into a model.)
```python
from kernels import get_kernel

# Download the optimized activation kernels from the Hub
activation_kernels = get_kernel("kernels-community/activation")

# List available functions in the loaded kernel module
print("\nAvailable functions in 'kernels-community/activation':")
print(dir(activation_kernels))
```
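To actually run the kernel rather than just inspect it, you call one of its exposed functions. The sketch below is hedged: the function name `gelu_fast` and its `(out, input)` calling convention are assumptions about this particular kernel, so verify them with `dir(activation_kernels)` or the repository's documentation on the Hub.

```python
import torch

# Assumes a CUDA-capable GPU and the `activation_kernels` module loaded above.
x = torch.randn(16, 1024, dtype=torch.float16, device="cuda")
y = torch.empty_like(x)

# Hypothetical entry point; check dir(activation_kernels) for the real name and signature.
activation_kernels.gelu_fast(y, x)

print(y.shape)  # same shape as the input, with GELU applied element-wise
```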
**(Note:** If you have [`uv`](https://github.com/astral-sh/uv) installed, you can save this script as `script.py` and run `uv run script.py` to automatically handle dependencies.)
Let's integrate an optimized **RMS Normalization** kernel into a basic model.
First, define a simple RMSNorm module in PyTorch and a baseline model using it:
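The full listing isn't reproduced here, but a minimal sketch of such a baseline might look like the following; the class and parameter names (`RMSNorm`, `BaselineModel`, `hidden_size`) are illustrative rather than taken verbatim from the original:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Plain PyTorch RMSNorm: scale activations by the reciprocal RMS of the last dim."""
    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        variance = x.pow(2).mean(-1, keepdim=True)
        x = x * torch.rsqrt(variance + self.eps)
        return self.weight * x

class BaselineModel(nn.Module):
    """Tiny model that applies the PyTorch RMSNorm above ahead of a linear layer."""
    def __init__(self, hidden_size: int = 1024):
        super().__init__()
        self.norm = RMSNorm(hidden_size)
        self.linear = nn.Linear(hidden_size, hidden_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear(self.norm(x))
```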
print("\nSkipping output comparison as kernel model output was not generated.")
335
337
336
338
337
-
~~~
339
+
```
338
340
339
341
**Important Notes on the `KernelModel`:**
* **Kernel Inheritance:** The `KernelRMSNorm` class inherits from `layer_norm_kernel_module.layers.LlamaRMSNorm`, which is the RMSNorm implementation in the kernel. This allows us to use the optimized kernel directly.
* **Accessing the Function:** The exact way to access the RMSNorm function (`layer_norm_kernel_module.layers.LlamaRMSNorm.forward`, `layer_norm_kernel_module.rms_norm_forward`, or something else) **depends entirely on how the kernel creator structured the repository on the Hub.** You may need to inspect the loaded `layer_norm_kernel_module` object (e.g., using `dir()`) or check the kernel's documentation on the Hub to find the correct function/method and its signature, as shown in the sketch after this list. I've used `rms_norm_forward` as a plausible placeholder and added error handling.
* **Parameters:** We now define only `rms_norm_weight` (no bias), consistent with RMSNorm.
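Because the layout differs from kernel to kernel, it pays to inspect the loaded module before wiring it into your model. Here is a hedged sketch: the repository name and the attribute names below are assumptions and may not match what the kernel author actually exposes.

```python
from kernels import get_kernel

# Assumed repository name; substitute the kernel you actually want to use.
layer_norm_kernel_module = get_kernel("kernels-community/triton-layer-norm")

# Inspect what the kernel exposes at the top level and under `layers`, if present.
print(dir(layer_norm_kernel_module))
if hasattr(layer_norm_kernel_module, "layers"):
    print(dir(layer_norm_kernel_module.layers))  # e.g. LlamaRMSNorm, if the author provides it

# Fall back gracefully if the expected entry point is missing.
rms_norm_fn = getattr(layer_norm_kernel_module, "rms_norm_forward", None)
if rms_norm_fn is None:
    print("rms_norm_forward not found; check the kernel's docs on the Hub for the right entry point.")
```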
## 4. Benchmarking the Performance Impact

How much faster is the optimized Triton RMSNorm kernel compared to the standard PyTorch version? Let's benchmark the forward pass to find out.
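A simple way to measure this is to time many forward passes with CUDA events after a warm-up. The harness below is a minimal sketch; the model and input names (`baseline_model`, `kernel_model`, `hidden_size`) are placeholders for the modules defined earlier.

```python
import torch

def benchmark_forward(model: torch.nn.Module, x: torch.Tensor,
                      n_warmup: int = 10, n_iters: int = 100) -> float:
    """Return the average forward-pass time in milliseconds (assumes a CUDA device)."""
    with torch.no_grad():
        for _ in range(n_warmup):
            model(x)
        torch.cuda.synchronize()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(n_iters):
            model(x)
        end.record()
        torch.cuda.synchronize()
    return start.elapsed_time(end) / n_iters

# Example usage (placeholder names):
# x = torch.randn(batch_size, hidden_size, device="cuda", dtype=torch.float16)
# print(benchmark_forward(baseline_model, x), benchmark_forward(kernel_model, x))
```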
As with LayerNorm, a well-tuned RMSNorm implementation using Triton can deliver substantial speedups over PyTorch's default version, especially for memory-bound workloads on compatible hardware (e.g., NVIDIA Ampere or Hopper GPUs) and with low-precision types like `float16` or `bfloat16`.
**Keep in Mind:**

* Results may vary depending on your GPU, input size, and data type.
* Microbenchmarks can misrepresent real-world performance.
* Performance hinges on the quality of the kernel implementation.
* Optimized kernels might not benefit small batch sizes due to overhead.
Actual results will depend on your hardware and the specific kernel implementation. Here's an example of what you might see (on an L4 GPU):
```txt
Batch Size | Baseline Time (ms) | Kernel Time (ms) | Speedup
```
You've seen how easy it is to fetch and use optimized kernels with the Hugging Face Kernel Hub. Ready to try it yourself?
1. **Install the library:**
```bash
pip install kernels torch numpy
```
Ensure you have a compatible PyTorch version and GPU driver installed (a quick environment check is sketched after this list).
2. **Browse the Hub:** Explore available kernels on the Hugging Face Hub under the [`kernels` tag](https://huggingface.co/kernels) or within organizations like [`kernels-community`](https://huggingface.co/kernels-community). Look for kernels relevant to your operations (activations, attention, normalization like LayerNorm/RMSNorm, etc.).
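As a quick sanity check of your environment (referenced in step 1 above), you can confirm that the installed PyTorch build actually sees your GPU. This is a generic check, not something specific to the `kernels` library:

```python
import torch

print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```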