Improve matmul instruction scheduling with loop rotation #2488
assert(ind >= 0);
assert(ind <= max_ind);
Unrelated to this PR, but asserting different conditions separately provides a better error message. (The line number in the error message will tell me which is violated).
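A minimal sketch of the point above, assuming a hypothetical helper named `checked_index` (not part of this PR). With one condition per `assert`, the failure report names the exact line and expression that was violated, whereas a combined `assert(ind >= 0 && ind <= max_ind)` would not say which half failed.

```cpp
#include <cassert>

// Hypothetical helper mirroring the two assertions in the diff.
int checked_index(int ind, int max_ind) {
  // One condition per assert: a failure points at exactly one line and one expression.
  assert(ind >= 0);
  assert(ind <= max_ind);
  return ind;
}
```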
LGTM
- apply improvement in matmul instruction scheduling with loop rotation
Introduction
Loop rotation is a lowering pass that transforms
into
In the matmul kernel, both the `cp.async` and the `ld.matrix` are circular/double buffered. This PR applies loop rotation to the matmul main loop to pull the first iteration's `ld.matrix` out of the main loop of `cp.async`.

That is, to change the code from
to
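As a rough, generic illustration of loop rotation (a sketch under stated assumptions, not the kernel code generated by this PR): `load_stage` and `compute_stage` are hypothetical stand-ins for `ld.matrix` and `mma`, and each one only records that it ran. Rotation hoists the first iteration's load in front of the loop and moves the load for iteration `i + 1` to the end of the loop body, so the order of operations is unchanged while the loop body now starts with the compute.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-ins: each stage just appends a record of its call.
using Trace = std::vector<std::string>;

void load_stage(int i, Trace& t) { t.push_back("load(" + std::to_string(i) + ")"); }
void compute_stage(int i, Trace& t) { t.push_back("compute(" + std::to_string(i) + ")"); }

// Original form: load and compute for iteration i sit in the same loop body.
Trace original_loop(int n) {
  Trace t;
  for (int i = 0; i < n; ++i) {
    load_stage(i, t);
    compute_stage(i, t);
  }
  return t;
}

// Rotated form: the first load is pulled out of the loop, and each
// iteration ends by issuing the load for the next iteration.
Trace rotated_loop(int n) {
  Trace t;
  if (0 < n) {
    load_stage(0, t);
  }
  for (int i = 0; i < n; ++i) {
    compute_stage(i, t);
    if (i + 1 < n) {
      load_stage(i + 1, t);
    }
  }
  return t;
}

// Both forms execute the same stages in the same order for every trip count.
void check_rotation_preserves_order(int max_n) {
  for (int n = 0; n <= max_n; ++n) {
    assert(original_loop(n) == rotated_loop(n));
  }
}
```

Calling `check_rotation_preserves_order(5)` passes because the two traces are identical; only the placement of the load relative to the loop boundary changes.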
To do so, I need to reorder the matmul schedule, changing it from
to
Because in the first schedule, the loop structure is
where inside the `cp.async` circular buffer loop, the entire `ld.matrix->mma` is contained in the `threadIdx` trivial loop, and the `ld.matrix` is not separable.

In contrast, for the second schedule, we have
The `blockIdx` and `threadIdx` loops are trivial loops, so this schedule change actually doesn't affect the generated CUDA kernel. However, it does make kernel IR easier to deal with.

Benchmark
Using command
Before this PR:
After this PR: