Make contiguity ignore broadcasts #2517
Conversation
Quick question on the `TensorView` constructor.
A few comments here and there.
```cpp
int no_broadcast_i = 0;
for (const auto i : c10::irange(ids.size())) {
  if (!ids.at(i)->isBroadcast()) {
    full2nob_map.at(i) = no_broadcast_i++;
  }
}
```
nitpick: `std::vector<size_t>` zero-initializes its entries, right? I'm a little uncomfortable leaving them at zero here, since a stale entry might accidentally map to axis 0. Can we default-initialize everything to -1?
Actually, given the way we are accessing these, I think an `unordered_map` makes more sense in terms of readability. It'll be easier to catch bugs in accessing the mapping.
I agree that `unordered_map` provides better readability and error checking, but isn't `unordered_map` much slower compared to `vector`? I changed the code to initialize the vector with `std::numeric_limits<size_t>::max()`, so if an unexpected item is accessed, it will almost always lead to an out-of-bounds error or a segfault.
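For what it's worth, here is a minimal self-contained sketch of the sentinel approach being discussed (the names `buildFull2NobMap` and `kInvalid` are made up for illustration, not from the PR):

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Sentinel for positions that have no non-broadcast counterpart.
// Using it as an index lands far out of bounds, so a stale or
// accidental access fails loudly instead of silently aliasing axis 0.
constexpr size_t kInvalid = std::numeric_limits<size_t>::max();

// Map each position in the full domain to its position among the
// non-broadcast domains only; broadcast positions keep the sentinel.
std::vector<size_t> buildFull2NobMap(const std::vector<bool>& is_broadcast) {
  std::vector<size_t> full2nob_map(is_broadcast.size(), kInvalid);
  size_t no_broadcast_i = 0;
  for (size_t i = 0; i < is_broadcast.size(); ++i) {
    if (!is_broadcast[i]) {
      full2nob_map[i] = no_broadcast_i++;
    }
  }
  return full2nob_map;
}
```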
It seems that many of the changes stem from the contiguity vector now holding flags only for non-broadcast domains. I wonder if it could be simpler to keep the contiguity vector with flags for all domains and just change the definition of the flag. IIRC, if the flag is true, it means the stride of the domain can be calculated as the stride of the next inner domain multiplied by the extent of that inner domain. If we change "the next inner domain" to mean "the next non-broadcast inner domain", I think we should be able to get the same benefit as this PR without doing this. Just my two cents.
I agree that keeping a flag for broadcast domains could still give the benefit, with broadcasts ignored in the flag's definition. But I don't think it would make this diff easier; the only saving would be a few …
Hahaha, I expected this answer.
```cpp
if ((*it)->isBroadcast()) {
  if (inner_most_id == nullptr) {
    inner_most_id = *it;
  }
}
```
It seems this function ignores broadcast domains if there are any non-broadcast domains. If the tensor only has broadcast domains, it returns the innermost broadcast domain. I don't remember why we're doing this.
I am not sure either. But this would only happen for all-broadcast tensors, and this function is only used in vectorization and transpose detection. Loosely speaking, if a tensor is all-broadcast, it should be a no-op in vectorization/transpose analysis, so I would speculate this change is safe.
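For clarity, a hypothetical self-contained sketch of the behavior described in this thread (the `IterDomain` stub and function name stand in for the real nvfuser types):

```cpp
#include <vector>

// Minimal stand-in for nvfuser's IterDomain, just for this sketch.
struct IterDomain {
  bool broadcast = false;
  bool isBroadcast() const { return broadcast; }
};

// Return the innermost non-broadcast domain; if every domain is a
// broadcast, fall back to the innermost broadcast domain.
IterDomain* innerMostNonBroadcast(const std::vector<IterDomain*>& ids) {
  IterDomain* inner_most_id = nullptr;
  for (auto it = ids.rbegin(); it != ids.rend(); ++it) {
    if ((*it)->isBroadcast()) {
      if (inner_most_id == nullptr) {
        inner_most_id = *it; // remember only as a fallback
      }
      continue;
    }
    return *it; // first non-broadcast from the inner end wins
  }
  return inner_most_id;
}
```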
Looks like my issues are all resolved. I'm leaving it to @naoyam to stamp on this one.
```diff
 from nvfuser import compute_contiguity

-return compute_contiguity(shape, strides)
+return tuple(compute_contiguity(shape, strides))
```
Looks like I need to change this. @jjsjann123
Should I upstream this?
LGTM
@naoyam Indeed, instead of making contiguity store redundant unused booleans, we could still make contiguity the same size as the rfactor domain by making contiguity a vector of optional flags, with broadcast dimensions left unset.
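A rough sketch of what that could look like (hypothetical, not the code merged in this PR):

```cpp
#include <optional>
#include <vector>

// One entry per rfactor domain; broadcast dimensions carry no flag
// (std::nullopt) instead of a misleading `false`.
using Contiguity = std::vector<std::optional<bool>>;

// nob_contiguity holds the flags for non-broadcast domains only,
// outer to inner, as introduced by this PR.
Contiguity withBroadcastPlaceholders(
    const std::vector<bool>& is_broadcast,
    const std::vector<bool>& nob_contiguity) {
  Contiguity result;
  size_t j = 0;
  for (bool bcast : is_broadcast) {
    if (bcast) {
      result.push_back(std::nullopt);
    } else {
      result.push_back(static_cast<bool>(nob_contiguity[j++]));
    }
  }
  return result;
}
```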
In csarofeen#2517 the return value of `compute_contiguity` is changed from tuple to list. This PR handles that change. Pull Request resolved: #96218 Approved by: https://github.com/jjsjann123, https://github.com/davidberard98
Sounds good to me.
Briefly brought up this conversation with the frontend team. There are some opinions on how we should expose the contiguity flag on the frontend. Pointing @kevinstephano here for visibility.
The change to the Python frontend is linked above in #2561.
Currently, contiguity has the same size as the rfactor domain of a tensor. In this PR, I am changing it to the size of `TensorDomain::noBroadcasts(rfactor_dom)`. With this change, "the contiguity of a broadcast/expand dimension" is no longer meaningful by definition, and `contiguity[i]` is `true` if and only if that dimension is memory dense with the next non-broadcast dimension.

For example, given the tensor `torch.zeros(4, 1, 3).expand(-1, 10, -1)`, before this change its contiguity is `(false, false, true)`, and after this change it is `(true, true)`.

The reason for this change is that we are more interested in whether a non-broadcast dimension is memory dense with its next non-broadcast dimension. In the example above, we are interested in whether the dimensions of extent 4 and 3 are memory dense. We are not interested in whether 10 and 3 are memory dense, because by definition they trivially are not. In this example we want to vectorize the extent-4 dimension, but the current contiguity design blocks us from doing so.
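To make the new definition concrete, here is a minimal self-contained sketch (not the actual nvfuser implementation; the function name is made up, and it assumes a dimension counts as a broadcast when its extent is 1 or its stride is 0):

```cpp
#include <cstdint>
#include <vector>

// New-style contiguity: one flag per non-broadcast dimension, true iff
// the dimension's stride equals the stride of the next non-broadcast
// inner dimension times that inner dimension's extent.
std::vector<bool> computeContiguityIgnoringBroadcasts(
    const std::vector<int64_t>& sizes,
    const std::vector<int64_t>& strides) {
  std::vector<bool> inner_to_outer;
  int64_t expected_stride = 1; // stride a dense next-outer dim must have
  for (int64_t i = static_cast<int64_t>(sizes.size()) - 1; i >= 0; --i) {
    if (sizes[i] == 1 || strides[i] == 0) {
      continue; // broadcast/expanded dimension: no flag recorded
    }
    inner_to_outer.push_back(strides[i] == expected_stride);
    expected_stride = strides[i] * sizes[i];
  }
  // Flags were collected inner to outer; report them outer to inner.
  return {inner_to_outer.rbegin(), inner_to_outer.rend()};
}

// torch.zeros(4, 1, 3).expand(-1, 10, -1) has sizes (4, 10, 3) and
// strides (3, 0, 1): the stride-0 dimension is skipped and the result
// is (true, true), matching the example above.
```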
Currently, our definition of the contiguity of broadcast dimensions, and of the dimension before a broadcast dimension, is vague and not well formalized. For example, given shape `(4, 1, 6)` and stride `(4*999999, 999999, 1)`, on the one hand our system calculates the contiguity as `(true, false, true)`, but on the other hand our indexing collapses the index of dim 0 with dim 2 because it ignores broadcasts (this is the root cause of #2169). I do not consider this an indexing bug. Instead, I consider it an ambiguity in the definition of contiguity, and this design change is an effort to remove that ambiguity.

See also: #2169, #2049
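As a quick check, feeding this example to the sketch above (under the same assumption that extent-1 dimensions count as broadcasts) gives an unambiguous answer:

```cpp
#include <cassert>
#include <vector>

int main() {
  // shape (4, 1, 6), stride (4*999999, 999999, 1): dim 1 is skipped,
  // the innermost flag is (1 == 1) -> true, and the outer flag is
  // (4*999999 == 1 * 6) -> false. Dims 0 and 2 are not memory dense,
  // so indexing must not collapse them.
  auto flags = computeContiguityIgnoringBroadcasts(
      {4, 1, 6}, {4 * 999999, 999999, 1});
  assert((flags == std::vector<bool>{false, true}));
  return 0;
}
```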
Fixes #2169