-
Notifications
You must be signed in to change notification settings - Fork 128
[inductor] Revise #956 to avoid a perf regression #981
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What was the issue?
torchinductor/codegen/triton.py
Outdated
index, mask = self.indexing(index) | ||
if is_index_0 and "tl.zeros" not in index: | ||
# Need dense_indexing when index == 0 | ||
index = f"{index} + tl.zeros({self.dense_size_str()}, tl.int32)" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do you need this?
if need_dense and not have_dense: | ||
mask = dense_mask | ||
index_str = f"{index_str} + tl.zeros({self.dense_size_str()}, tl.int32)" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if need_dense and not have_dense: | |
mask = dense_mask | |
index_str = f"{index_str} + tl.zeros({self.dense_size_str()}, tl.int32)" | |
if need_dense and not have_dense or index == 0: | |
index_str = f"{index_str} + tl.zeros({self.dense_size_str()}, tl.int32)" | |
if index == 0: | |
mask = ["None"] | |
else: | |
mask = dense_mask |
and remove changes to load
. If index is 0, you don't need mask at all, a single element load is always valid.
396d8a1
to
4689b59
Compare
The perf regression is because my previous change will cause some torchdynamo/torchinductor/codegen/triton.py Line 721 in fb2fb47
I am actually surprised to see how perf measurement on AWS is stable enough to capture a 3% regression correctly. I will look into how stable the numbers are on the CI and consider add perf checking for CI tasks. |
Summary: #981 introduced a misaligned address bug which relates how tl.load from index 0 should be written in triton.
Summary: #981 introduced a misaligned address bug which relates how tl.load from index 0 should be written in triton.
No description provided.