Conversation

Collaborator

@zasdfgbnm zasdfgbnm commented Feb 24, 2023

Currently, contiguity has the same size as the rfactor domain of a tensor. In this PR, I am changing it to the size of TensorDomain::noBroadcasts(rfactor_dom). With this change, "the contiguity of a broadcast/expand dimension" is no longer meaningful by definition, and contiguity[i] is true if and only if dimension i is memory-dense with the next non-broadcasting dimension.

For example, for the tensor torch.zeros(4, 1, 3).expand(-1, 10, -1), the contiguity before this change is (false, false, true); after the change, it is (true, true).

The reason for this change is that we care about whether a non-broadcasting dimension is memory-dense with its next non-broadcasting dimension. In the example above, we are interested in whether the dimensions of extent 4 and 3 are memory-dense; we are not interested in whether 10 and 3 are memory-dense, because by definition they trivially are not. In this example, we want to vectorize the dimension of extent 4, but the current contiguity design blocks us from doing so.
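
As an illustration, here is a minimal Python sketch of the new definition (not nvFuser's actual implementation; the helper name and the stride-0/extent-1 broadcast test are assumptions for this example):

```python
import torch

def compute_contiguity_no_broadcast(shape, strides):
    """Sketch of the new definition: one flag per non-broadcast dimension.

    A dimension is treated as broadcast here if its extent is 1 or its
    stride is 0 (an expanded dimension). Hypothetical helper only.
    """
    dims = [(sz, st) for sz, st in zip(shape, strides) if sz != 1 and st != 0]
    flags = []
    for i, (sz, st) in enumerate(dims):
        if i == len(dims) - 1:
            flags.append(st == 1)  # innermost non-broadcast dimension
        else:
            nxt_sz, nxt_st = dims[i + 1]
            # True iff this dim is memory-dense with the next
            # non-broadcast dimension.
            flags.append(st == nxt_st * nxt_sz)
    return tuple(flags)

t = torch.zeros(4, 1, 3).expand(-1, 10, -1)  # shape (4, 10, 3), strides (3, 0, 1)
assert compute_contiguity_no_broadcast(t.shape, t.stride()) == (True, True)
```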

Currently, our definition of the contiguity of broadcasting dimensions, and of the dimension before a broadcasting dimension, is vague and not well formalized. For example, given shape (4, 1, 6) and stride (4*999999, 999999, 1), our system will compute the contiguity as (true, false, true); on the other hand, our indexing will collapse the index of dim 0 with dim 2 because it ignores broadcasts (this is the root cause of #2169). I do not consider this an indexing bug; rather, it is an ambiguity in the definition of contiguity, and this design change is an effort to remove that ambiguity.
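
Under the same sketch as above, this ambiguous example resolves cleanly, since dims 0 and 2 are simply not memory-dense:

```python
# shape (4, 1, 6), strides (4*999999, 999999, 1): 3999996 != 1 * 6,
# so the new contiguity is (False, True) and no collapsing is allowed.
assert compute_contiguity_no_broadcast((4, 1, 6), (4 * 999999, 999999, 1)) == (False, True)
```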

See also: #2169, #2049

Fixes #2169

@zasdfgbnm zasdfgbnm marked this pull request as ready for review February 28, 2023 08:28
Collaborator

@jjsjann123 jjsjann123 left a comment

Quick question on TensorView constructor.

Collaborator

@jjsjann123 jjsjann123 left a comment

A few comments here and there.

// Map each rfactor axis index to its position among non-broadcast axes.
int no_broadcast_i = 0;
for (const auto i : c10::irange(ids.size())) {
  if (!ids.at(i)->isBroadcast()) {
    full2nob_map.at(i) = no_broadcast_i++;
  }
}

Collaborator

nitpick: std::vector<size_t> gives zero-initialized values? I'm a little uncomfortable leaving them like this, since an unmapped entry could accidentally map to axis 0. Can we default-initialize everything to -1?

Collaborator

Actually, given the way we are accessing these, I think an unordered_map makes more sense in terms of readability... It would be easier to catch bugs in accessing the mapping.

Collaborator Author

I agree that unordered_map provides better readability and error checking, but isn't unordered_map much slower than vector? I changed the code to initialize the vector with std::numeric_limits<size_t>::max(), so if an unexpected item is accessed, it will almost always lead to an out-of-bounds error or a segfault.

@zasdfgbnm zasdfgbnm requested a review from jjsjann123 March 1, 2023 09:15
@naoyam
Collaborator

naoyam commented Mar 6, 2023

It seems that many of the changes are because the contiguity vector now holds flags only for non-broadcast domains. I wonder if it could be simpler to keep the contiguity vector with flags for all domains and just change the definition of the flag. IIRC, if the flag is true, it means the stride of the domain can be calculated as the stride of the next inner domain multiplied by the extent of that inner domain. If we change the definition of "the next inner domain" to the next non-broadcast inner domain, I think we should get the same benefit as this PR without needing noBroadcastDomains.

Just my two cents.

@zasdfgbnm
Collaborator Author

> It seems that many of the changes are because the contiguity vector now holds flags only for non-broadcast domains. I wonder if it could be simpler to keep the contiguity vector with flags for all domains and just change the definition of the flag. […]
>
> Just my two cents.

I agree that keeping flags for broadcast dimensions could still give the benefit of ignoring broadcasts in the definition, but I don't think it would make this diff easier. The only saving is a few noBroadcastDomains calls, and we would need to change our caching system to ignore the flag values at broadcast dimensions. I also don't like making contiguity store redundant unused values, because the definition of contiguity is not trivial and we have already made mistakes by writing wrong indexing code; storing these extra values would make it easy to make similar mistakes in the future. Making the size of contiguity deviate from the size of the rfactor domain leads to much louder errors when we make a mistake, and therefore helps avoid hard-to-catch bugs.

@naoyam
Collaborator

naoyam commented Mar 6, 2023

Hahaha, I expected this answer:

> I also don't like making contiguity store redundant unused values

Comment on lines -1155 to -1158

if ((*it)->isBroadcast()) {
  if (inner_most_id == nullptr) {
    inner_most_id = *it;
  }

Collaborator

It seems this function ignored broadcast domains if there were any non-broadcast domains; if the tensor only had broadcast domains, it would have returned the innermost broadcast domain. I don't remember why we were doing this.

Collaborator Author

I am not sure either. But this would only happen for all-broadcast tensors, and this function is only used in vectorization and transpose detection. Loosely speaking, an all-broadcast tensor should be a no-op in vectorization/transpose analysis, so I would speculate this change is safe.

@jjsjann123
Collaborator

jjsjann123 commented Mar 7, 2023

Looks like my issues are all resolved. I'm leaving it to @naoyam to stamp on this one.

 from nvfuser import compute_contiguity

-return compute_contiguity(shape, strides)
+return tuple(compute_contiguity(shape, strides))

Collaborator Author

Looks like I need to change this. @jjsjann123

Collaborator Author

Should I upstream this?
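
For context, a hedged sketch of what the PyTorch-side fix in the diff above amounts to, assuming compute_contiguity now returns a list and follows the new definition from this PR:

```python
from nvfuser import compute_contiguity  # returns a list after this PR

shape, strides = (4, 10, 3), (3, 0, 1)
contiguity = tuple(compute_contiguity(shape, strides))  # -> (True, True)
assert isinstance(contiguity, tuple)  # e.g., keeps it usable as a dict key
```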

Collaborator

@naoyam naoyam left a comment

LGTM

@zasdfgbnm
Collaborator Author

> Hahaha, I expected this answer:
>
> > I also don't like making contiguity store redundant unused values

@naoyam Indeed, instead of making contiguity store redundant unused booleans, we can still make contiguity the same size as the rfactor domain by making it a std::vector<c10::optional<bool>> instead of std::vector<bool>. For the tensor torch.zeros(8, 1, 3).expand(8, 9, 3), the contiguity would then be (true, None, true). I disliked (true, true, true) and (true, false, true) in favor of (true, true) because I wanted a hard error when reading the contiguity of a broadcast domain (which indicates a bug), and because I don't want to recompile if the contiguity changes from (true, true, true) to (true, false, true). I think (true, None, true) has all the benefits of (true, true) while being more straightforward and more convenient. I was discussing frontend design with @jjsjann123 this morning, and we both like this approach. Does this make sense to you? If so, I will write a PR to change it.
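
For illustration, a Python stand-in for the proposed std::vector<c10::optional<bool>> flavor (hypothetical helper, not an actual API; broadcast detected via extent 1 or stride 0 as in the earlier sketch):

```python
from typing import Optional, Tuple

def compute_contiguity_optional(shape, strides) -> Tuple[Optional[bool], ...]:
    """One entry per dimension: None for broadcast dims, otherwise the
    memory-dense-with-next-non-broadcast-dim flag. Hypothetical sketch."""
    is_bcast = [sz == 1 or st == 0 for sz, st in zip(shape, strides)]
    out = []
    for i, (sz, st) in enumerate(zip(shape, strides)):
        if is_bcast[i]:
            out.append(None)  # reading this entry would indicate a bug
            continue
        # Find the next non-broadcast dimension to the right, if any.
        nxt = next((j for j in range(i + 1, len(shape)) if not is_bcast[j]), None)
        out.append(st == 1 if nxt is None else st == strides[nxt] * shape[nxt])
    return tuple(out)

# torch.zeros(8, 1, 3).expand(8, 9, 3) has shape (8, 9, 3), strides (3, 0, 1)
assert compute_contiguity_optional((8, 9, 3), (3, 0, 1)) == (True, None, True)
```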

pytorchmergebot pushed a commit to pytorch/pytorch that referenced this pull request Mar 8, 2023
In csarofeen#2517 the return value of `compute_contiguity` is changed from tuple to list. This PR handles that change.

Pull Request resolved: #96218
Approved by: https://github.com/jjsjann123, https://github.com/davidberard98
@naoyam
Collaborator

naoyam commented Mar 8, 2023

> @naoyam Indeed, instead of making contiguity store redundant unused booleans, we can still make contiguity the same size as the rfactor domain by making it a std::vector<c10::optional<bool>> instead of std::vector<bool>. […] Does this make sense to you? If so, I will write a PR to change it.

Sounds good to me.

@jjsjann123
Collaborator

Briefly brought up this conversation with the frontend team. There are some opinions on how we should expose the contiguity flags on the frontend. Pointing @kevinstephano here for visibility.

@jjsjann123
Collaborator

The change to the Python frontend is linked above in #2561.

cyyever pushed a commit to cyyever/pytorch_private that referenced this pull request Mar 12, 2023
In csarofeen/pytorch#2517 the return value of `compute_contiguity` is changed from tuple to list. This PR handles that change.

Pull Request resolved: pytorch/pytorch#96218
Approved by: https://github.com/jjsjann123, https://github.com/davidberard98
ydwu4 added a commit to ydwu4/pytorch that referenced this pull request Mar 13, 2023
In csarofeen#2517 the return value of `compute_contiguity` is changed from tuple to list. This PR handles that change.

Pull Request resolved: pytorch#96218
Approved by: https://github.com/jjsjann123, https://github.com/davidberard98