-
Notifications
You must be signed in to change notification settings - Fork 24k
request for faster inductor kernels for blockwise reduction across dim1 -> write #149982
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Labels
module: inductor
oncall: pt2
triaged
This issue has been looked at a team member, and triaged and prioritized into an appropriate module
Comments
This was referenced Mar 27, 2025
eellison
added a commit
that referenced
this issue
Apr 28, 2025
Fix for #149982. Summary: This PR does two main things: 1. Rewrites the tiling heuristics. The previous tiling heuristic would have each dependency generate a tiling. Then, we sum up the score for each generated tiling, preferring any 2d tiling over the default. The new tiling heuristics scores each tiling by its global coalesced memory. This gives both a potentially better tiling (especially for more complicated, 3d patterns) as well as information we can use in generating block sizes. 2. Analyses memory dependencies for accesses that would be coalesced with additional tiling. The motivating kernel is in #149982 which is a 32 element reduction. A smaller version of it is [here](https://gist.github.com/eellison/0fa9396f5479eb4dba09756e3bf6ff2a). We need to run this kernel once in the forward per linear layer on a contiguous tensor, and once in the backward on a transposed tensor. While the contiguous kernel has coalesced accesses, and is performant on master, the transposed version accesses uncoalesced memory on main and is ~2.8x slower. See, this [full log](https://gist.github.com/eellison/fa644bfd9d0ae11dadb62e17a5d48a83) from the above repro. Now, with this PR, it is only ~1.15x slower. See the [updated log](https://gist.github.com/eellison/0b2b653309494d28cf7b48929a022075). We analyse memory addresses that are not coalesced by any iteration variable. For this following dependency: `(((32*n0 + n1)//2048)) + 4096*(ModularIndexing(32*n0 + n1, 1, 2048))` we infer that tiling `n0` by 64 makes the first term coalesced. I'm sure there are still some CI failures to debug.. cc vkuzo cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy chenyang78 kadeng muchulee8 amjames chauhang aakhundov [ghstack-poisoned]
eellison
added a commit
that referenced
this issue
Apr 28, 2025
Fix for #149982. Summary: This PR does two main things: 1. Rewrites the tiling heuristics. The previous tiling heuristic would have each dependency generate a tiling. Then, we sum up the score for each generated tiling, preferring any 2d tiling over the default. The new tiling heuristics scores each tiling by its global coalesced memory. This gives both a potentially better tiling (especially for more complicated, 3d patterns) as well as information we can use in generating block sizes. 2. Analyses memory dependencies for accesses that would be coalesced with additional tiling. The motivating kernel is in #149982 which is a 32 element reduction. A smaller version of it is [here](https://gist.github.com/eellison/0fa9396f5479eb4dba09756e3bf6ff2a). We need to run this kernel once in the forward per linear layer on a contiguous tensor, and once in the backward on a transposed tensor. While the contiguous kernel has coalesced accesses, and is performant on master, the transposed version accesses uncoalesced memory on main and is ~2.8x slower. See, this [full log](https://gist.github.com/eellison/fa644bfd9d0ae11dadb62e17a5d48a83) from the above repro. Now, with this PR, it is only ~1.15x slower. See the [updated log](https://gist.github.com/eellison/0b2b653309494d28cf7b48929a022075). We analyse memory addresses that are not coalesced by any iteration variable. For this following dependency: `(((32*n0 + n1)//2048)) + 4096*(ModularIndexing(32*n0 + n1, 1, 2048))` we infer that tiling `n0` by 64 makes the first term coalesced. I'm sure there are still some CI failures to debug.. cc vkuzo cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy chenyang78 kadeng muchulee8 amjames chauhang aakhundov [ghstack-poisoned]
eellison
added a commit
that referenced
this issue
Apr 28, 2025
Fix for #149982. Summary: This PR does two main things: 1. Rewrites the tiling heuristics. The previous tiling heuristic would have each dependency generate a tiling. Then, we sum up the score for each generated tiling, preferring any 2d tiling over the default. The new tiling heuristics scores each tiling by its global coalesced memory. This gives both a potentially better tiling (especially for more complicated, 3d patterns) as well as information we can use in generating block sizes. 2. Analyses memory dependencies for accesses that would be coalesced with additional tiling. The motivating kernel is in #149982 which is a 32 element reduction. A smaller version of it is [here](https://gist.github.com/eellison/0fa9396f5479eb4dba09756e3bf6ff2a). We need to run this kernel once in the forward per linear layer on a contiguous tensor, and once in the backward on a transposed tensor. While the contiguous kernel has coalesced accesses, and is performant on master, the transposed version accesses uncoalesced memory on main and is ~2.8x slower. See, this [full log](https://gist.github.com/eellison/fa644bfd9d0ae11dadb62e17a5d48a83) from the above repro. Now, with this PR, it is only ~1.15x slower. See the [updated log](https://gist.github.com/eellison/0b2b653309494d28cf7b48929a022075). We analyse memory addresses that are not coalesced by any iteration variable. For this following dependency: `(((32*n0 + n1)//2048)) + 4096*(ModularIndexing(32*n0 + n1, 1, 2048))` we infer that tiling `n0` by 64 makes the first term coalesced. I'm sure there are still some CI failures to debug.. cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy chenyang78 kadeng muchulee8 amjames chauhang aakhundov vkuzo cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy chenyang78 kadeng muchulee8 amjames chauhang aakhundov [ghstack-poisoned]
eellison
added a commit
that referenced
this issue
Apr 28, 2025
Fix for #149982. Summary: This PR does two main things: 1. Rewrites the tiling heuristics. The previous tiling heuristic would have each dependency generate a tiling. Then, we sum up the score for each generated tiling, preferring any 2d tiling over the default. The new tiling heuristics scores each tiling by its global coalesced memory. This gives both a potentially better tiling (especially for more complicated, 3d patterns) as well as information we can use in generating block sizes. 2. Analyses memory dependencies for accesses that would be coalesced with additional tiling. The motivating kernel is in #149982 which is a 32 element reduction. A smaller version of it is [here](https://gist.github.com/eellison/0fa9396f5479eb4dba09756e3bf6ff2a). We need to run this kernel once in the forward per linear layer on a contiguous tensor, and once in the backward on a transposed tensor. While the contiguous kernel has coalesced accesses, and is performant on master, the transposed version accesses uncoalesced memory on main and is ~2.8x slower. See, this [full log](https://gist.github.com/eellison/fa644bfd9d0ae11dadb62e17a5d48a83) from the above repro. Now, with this PR, it is only ~1.15x slower. See the [updated log](https://gist.github.com/eellison/0b2b653309494d28cf7b48929a022075). We analyse memory addresses that are not coalesced by any iteration variable. For this following dependency: `(((32*n0 + n1)//2048)) + 4096*(ModularIndexing(32*n0 + n1, 1, 2048))` we infer that tiling `n0` by 64 makes the first term coalesced. I'm sure there are still some CI failures to debug.. cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy chenyang78 kadeng muchulee8 amjames chauhang aakhundov vkuzo cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy chenyang78 kadeng muchulee8 amjames chauhang aakhundov [ghstack-poisoned]
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
module: inductor
oncall: pt2
triaged
This issue has been looked at a team member, and triaged and prioritized into an appropriate module
🐛 Describe the bug
We should make the following kernel be fast in compile + inductor. This is important to be able to generate the dim1 cast to MX formats.
Currently, I am only hitting 0.6 to 0.7 TB/s on NVIDIA H100. If the reduction and write is across dim0 instead of dim1, I see 2.0-2.2 TB/s. From discussions with @eellison , this is due to uncoalesced reads and we can fix this.
Repro script: https://gist.github.com/vkuzo/9eff0d27691be483e45bb10edf66d82c
Repro results on NVIDIA H100:
TORCH_LOGS=output_code results: https://gist.github.com/vkuzo/4420c5b508ddd560e5d4620758b5936a
Versions
main branch
cc @chauhang @penguinwu @voznesenskym @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @aakhundov
The text was updated successfully, but these errors were encountered: