kurisu6912 (Collaborator)

Tilelang JITv2

In this PR we introduce Tilelang JITv2, a new frontend for Tilelang with modern and attractive features.

Features

Kernel Declaration

Function declaration has been simplified:

[Screenshot: kernel declaration, before vs. after]
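
The screenshot is not reproduced here; as a rough sketch (the "before" factory style and the exact annotations are assumptions, based on the examples later in this description), the comparison looks roughly like this:

# Before (sketch of the classic factory style; names hypothetical):
# shapes and dtype must be bound when the kernel is built.
def matmul(M, N, K, dtype):
    @T.prim_func
    def main(A: T.Tensor((M, K), dtype), B: T.Tensor((K, N), dtype)):
        ...
    return main

# After (JITv2 sketch, using the annotation style from the examples
# below): a single decorated function; shapes and dtypes come from the
# arguments at call time.
@tl.jit
def gemm(A: tl.Tensor[int, int], B: tl.Tensor[int, int]):
    ...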

Kernel Call

When calling functions, tensor shapes, strides, and dtypes are automatically inferred:

# before
ker_1 = matmul(1024, 1024, 1024, 'float32')
c1 = ker_1(a1, b1)
ker_2 = matmul(1024, 1024, 512, 'float32')
c2 = ker_2(a2, b2)

# after
gemm(a1, b1)
gemm(a2, b2)

Auto Tuning

Auto tuning can be done via default arguments:

@tl.jit
def add(
    A: tl.Tensor[int],
    B: tl.Tensor[int],
    block: int = tune([128, 256, 512])
):
    ...

Or on-the-fly:

add(A, B, tune([64, 128]))

Smarter Static Evaluation

JITv2 preserves as much Python code as possible, allowing calls to custom Python functions or conditional kernel generation:

@tl.jit
def gemm(
    ...
    split_k: bool = False
):
    block_size = my_super_block_size_heuristic(M, N, K)
    if split_k: # split_k is a constant value
        with tl.Kernel(...) as ...:
            ...
    else:
        with tl.Kernel(...) as ...:
            ...

    return C

Smarter Type Hinting

JITv2 not only eliminates annoying type warnings but also adds extensive type annotations. These let you see each Tensor's dimensionality at a glance and mark whether a value lives on the Python side or the kernel side. Even generated functions and the JIT-compiled kernels get friendly type hints:

[Screenshot: editor type hints for a JITv2 function]
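
The screenshot is not reproduced here; as a hedged illustration (the explicit hints below are assumptions, not taken from the PR), the idea is that the annotations carry enough structure for a checker or editor to surface:

# Sketch: the earlier 1-D `add` example with explicit hints. A type
# checker can see that A and B are 1-D integer tensors (kernel-side
# values), while `block` is a plain Python int (host-side constant).
@tl.jit
def add(
    A: tl.Tensor[int],    # 1-D tensor; dims visible in the hint
    B: tl.Tensor[int],
    block: int = 128,     # Python-side constant
) -> tl.Tensor[int]:
    ...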

Extremely Low Overhead

JITv2's Python overhead has been optimized aggressively. On the fast path, only dynamic parameters are checked, which brings the overhead in line with calling a torch function (e.g., torch.add):

A = torch.randn(128, dtype=torch.float16, device="cuda")
B = torch.randn(128, dtype=torch.float16, device="cuda")

# torch.add:  ~6.5 µs
C_1 = A + B
# jit kernel: ~7.5 µs (cached)
C_2 = add(A, B)

Architecture

The Tilelang JIT workflow:

  1. A Python-to-Python pass generates two pieces of code: an argument parser and a JIT function generator
  2. Fast path (~1.5 µs): on a kernel call, the argument parser separates static and dynamic parameters; on a static-cache hit, the C++ library function is called directly
  3. Slow path: on a static-cache miss, the kernel is recompiled (see the sketch below)
[Whiteboard diagram: the JITv2 compilation workflow]
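
In pseudocode, the dispatch above looks roughly like this (a sketch only; parse_args, static_cache, and compile_kernel are illustrative names, not the actual implementation):

def call_jit_kernel(*args):
    # Generated fast-path parser: splits arguments into static (const)
    # and dynamic (dyn) parts (~1.5 µs total on the fast path).
    const_args, dyn_args = parse_args(*args)
    kernel = static_cache.get(const_args)
    if kernel is None:
        # Slow path: static cache miss, (re)compile the kernel.
        kernel = compile_kernel(const_args)
        static_cache[const_args] = kernel
    # Fast path: static cache hit, call into the C++ library directly.
    return kernel(*dyn_args)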

Static & Dynamic Arguments

JITv2 inspects function signatures to determine which parameters are const and which are dyn (a sketch follows the diagram below):

  • dyn supports only int, float, and ptr; these values are treated as tir.Var
  • const can be any type (simple types are preferred; prefer Tuple over List)
  • dyn types must be annotated explicitly; Tensor must always be annotated, because its data_ptr is always dynamic
  • const arguments may deviate from their annotation (e.g., annotating int but passing a dict); note that fully validating them is hard (it would amount to writing something like pydantic)
[Whiteboard diagram: classifying static (const) vs. dynamic (dyn) arguments]
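
A sketch of these rules, borrowing the dyn[...] syntax from the generated-parser example in the next section (the function itself is hypothetical):

_N = dyn[int, '_N']                  # dynamic int, treated as a tir.Var

@tl.jit
def scale(
    A: Tensor[float, _N],            # Tensor must be annotated: data_ptr is always dyn
    factor: dyn[float, 'factor'],    # dyn must be annotated explicitly
    block: int = 128,                # const: baked into the compiled kernel
    dims: tuple = (0, 1),            # const: any type works; prefer Tuple over List
):
    ...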

Argument Parser

JITv2 generates Python code for the fast path, which unpacks const and dyn arguments and then invokes the kernel:

  • Optimized Python statements: each statement in the fast path is carefully chosen to compile to fast bytecode; the total overhead is minimal, even slightly lower than torch.to_dlpack
  • Static check cache: the fast path performs no type checks on const variables; these are checked at compile time instead (e.g., a wrong tensor shape causes a cache miss, the kernel is compiled, and value ranges are checked there)
  • Dynamic type checks: the fast path performs only simple dynamic checks, e.g., asserting that two K values are equal; more complex asserts may be compiled into host code (not yet supported)
_K = dyn[int, '_K']
def foo(
    a: Tensor[int, _K],
    b: Tensor[int, _K],
    c: int,
):
    pass
# generated code
def foo_fastpath(a, b, c):
    # 1. Unpack type info
    # 1.1 Unpacking a tensor ~600 ns; each of the following lines takes ~200 ns, heavily optimized
    assert a.device != __device_cpu__, "Expected a non-CPU tensor"
    a__shape_0, a__shape_1 = a.shape
    a__stride_0, a__stride_1 = a.stride()
    assert b.device != __device_cpu__, "Expected a non-CPU tensor"
    #                  ^- note: torch.device('cpu') costs 200+ ns; using closure trick, __device_cpu__ costs 5 ns
    b__shape_0, b__shape_1 = b.shape
    b__stride_0, b__stride_1 = b.stride()
    # 2. Construct argument lists ~20–50 ns
    __const_args__ = (
        a.dtype, a__shape_0, a__shape_1, a__stride_0, a__stride_1,
        b.dtype, b__shape_0, b__shape_1, b__stride_0, b__stride_1,
        c)
    __dyn_args__ = (a.data_ptr(), b.data_ptr())
    return __const_args__, __dyn_args__

Memory Allocation & Return Values

Inside functions, use T.alloc_global to create global buffers:

  • T.alloc_global is friendlier for type linting — it’s translated into torch.empty
  • T.alloc_xxx must be assigned to a variable (x = T.alloc_xxx()), not passed directly as a function parameter (e.g., foo(T.alloc_shared(...)) is not allowed)
  • Return objects must be global buffers; returning Python objects is not supported (e.g., returning BLOCK_M + BLOCK_N is not allowed):
@T.prim_func
def gemm(
    A: T.Tensor[int, int],
    B: T.Tensor[int, int],
    out_ty  = torch.half,
    BLOCK_M = T.tune([64, 128, 256]),
    BLOCK_N = T.tune([64, 128, 256]),
):
    # Quickly get dimensions
    (N, K), (M, K2) = A.shape, B.shape
    assert K == K2, "Expect 2 matrices with identical K dimension"
    # Allocate memory for output
    out = T.alloc_global((N, M), dtype=out_ty)
    with T.Kernel((T.ceildiv(M, BLOCK_M), T.ceildiv(N, BLOCK_N)), threads=128) as (bx, by):
        pass
    return out
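
A hypothetical call site for the kernel above: shapes and dtypes are inferred from the arguments, and the returned value is the global buffer allocated inside gemm.

A = torch.randn(1024, 512, device="cuda", dtype=torch.half)   # (N, K)
B = torch.randn(2048, 512, device="cuda", dtype=torch.half)   # (M, K)
C = gemm(A, B)   # C is the `out` buffer: shape (N, M) == (1024, 2048)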

TODOs

  • Integrate with tl.language
  • Add auto tuner


LeiWang1999 (Member)

This is huge!

kurisu6912 closed this on Oct 13, 2025.