Add computeWith to interleave gmem accesses and computations #2156
Conversation
Very clean solution for computing and allocating at different positions! I have just started, posting a few comments for now.
As a concrete data point, in one of the outer welford tests in PR #1772,
posting more
@@ -0,0 +1,451 @@
#if defined(USE_CUDA)
Thank you for creating a new test file :)
": ", | ||
consumer_pos); | ||
|
||
// Find the corresponding position in this tensor |
In terms of interface, which do you think makes more sense? `computeAt` uses the consumer position, but `inlineAt` uses this tensor's position.
`computeAt` as well as `computeWith` have a consumer parameter, and I thought it'd be more intuitive to have the position parameter relative to the consumer. `inlineAt`, on the other hand, doesn't reference any particular consumer, so it makes sense that the position references its own domain. That said, the difference does seem confusing. If we wanted a uniform position semantic, I'd change `computeWith` to take a position in its own domain.
I have no idea on which will be more convenient to use in the scheduler, and I have no strong opinion about which should be used here.
Most likely, the best-effort option should just work, so I think this difference won't matter much. I'm leaning toward changing this to the position in its own tensor, as that's the position we ultimately save in each tensor (i.e., `compute_with_pos_`).
My main question here is what would be the disadvantage of making compute at relative to the producer instead of a consumer. Right now it seems we're trying to make sure it gets inlined directly into a specific consumer, but is there a practical limitation of making it inlined relative to all consumers, and opportunistically picking the first consumer that lower expr sort returns during lowering?
I'm generally concerned about adding a sorting constraint (as you've identified) without having a mechanism to guarantee at scheduling time it can be sorted. If we instead let it sort the same way it would without the computeWith then just inline into the first consumer, it seems like it would achieve what you're looking for without opportunity to break sorting.
That's interesting to think about. I'd feel reluctant to inline everything opportunistically. Maybe that's just what makes the most sense, but I'd lean toward making it opt-in or opt-out. Someday we may want to port nvFuser to CPUs, where we may want to do software pipelining, so it may not always be optimal to inline as much as possible. I'm thinking about something like this.
Yeah, this sounds fine. I was just thinking the same, except not needing to specify a set of consumers (so if you set `computeWith`, it's just with any consumer that matches that position).
Ah, that seems like a safer and easier approach. Thanks for the suggestion. I'll update the PR.
What is the status of this PR? I guess the only remaining two things are #2156 (comment) and #2156 (comment)?
Yeah, I'm going to change the interface. I'll ask for reviews again when ready.
Changed the `computeWith` interface.
It now has just:
Previously, it was:
The target consumer tensor is automatically picked when the fusion is lowered: it is the first consumer tensor appearing in the sorted expression list.

One of the major changes I needed to make is to run the expression sorting twice when `computeWith` is used: first to resolve the `computeWith` targets, and once more for lowering. The `computeWith` resolution is done at the very beginning of lowering, before examining the fusion for validations and other analyses. Resolving `computeWith`, i.e., finding the right target of `computeWith`, affects the computeAt LOOP mappings as well as the max producer positions, so anything depending on that information must be done after the resolution.

It is also possible to reuse the sorted expression list, but not always, as we also replace some expressions in the Fusion container at the beginning of lowering, so the sorted list also needs to be updated. That doesn't mean we would need to do the complete sorting again; we could just update the list as necessary. I didn't try that, but I'd revisit it if sorting turns out to be really costly.

Remaining TODO:
This is ready for review again @zasdfgbnm @csarofeen |
Reviewed everything but the tests and only have one sticking point of confusion. I'll rely on @zasdfgbnm and @naoyam to make sure there's sufficient testing at this point.
// If it's already set to be computed with the consumer and the
// position is higher, nothing to change
if (getComputeWithPosition() >= pos) {
Do we need to update consumers in some way in this call to make sure that consumers don't transform in an inconsistent way after the computeWith of their producers is set?
It seems to me `resolveComputeWith` will simply fail if we're in an inconsistent state?
Ah, I see. Since at this point no consumer is aware of the potential computeWith into it, there's nothing to prevent the consumers from being transformed in an inconsistent way. I need to think about this more carefully.
I don't think this needs to be a blocker for this PR, we might just want to more explicitly validate in lowering.
Would be nice in the future to error on transformation attempt of the consumer.
Added `maybe_max_producer_pos_` (1c83e69). Don't like the name, but don't have any other idea. Modifying consumer domains where a producer may be computed at should now throw an error.
}
}
bool TensorView::resolveComputeWith(const std::vector<Expr*>& sorted_exprs) { |
The sorted expressions are post expr sorting, not just the topologically sorted DAG, correct?
Yes. Will add a comment
}
// First use found. Set it as the computeWith target tensor
std::cerr << "Resolve the computeWith target as: " << expr->toString();
Does this need to be cleaned up?
}
for (auto consumer_tv : compute_with_consumers_) {
  consumer_tv->updateMaxProducerPosition();
Okay, this function was a little unclear on my first read, but I think I got it. It seems you're not trying to enforce that all consumers can support the given `computeWith` from all their producers. Instead you only care whether the first consumer in the sorted list actually supports it.

This seems more permissive than the conservative approach I was thinking of, where we would simply enforce that all consumers support the `computeWith` position of all producers. That would have made the behavior more conservative but consistent with `computeAt` behavior during scheduling.
Updating the max producer position of all consumers doesn't seem quite right to me. Except for the one consumer where the producer is inlined, the other consumers still see the producer produced as if no computeWith were done. I don't remember which parts of the system would be affected by a change of the max producer position, but the expr sorting is one of them. Thinking about the normalization pattern, I don't think it would be able to resolve the dependencies of the consumer and producer positions if we updated the max producer position of all the consumers.
I would assume it would have to be `maxProducerAtPosition` and `maxProducerWithPosition`.
Yes, I'm thinking about adding something like `maybe_max_produce_position_`, which should work as a guard to prevent further transformations; it shouldn't be used for the expr sorting.
// Don't know which consumer would be computed with at this
// point. Just make sure all the grouped reduction outputs have the
// same set of consumers. This is not necessarily a required
// condition and could be made more flexible
Should reduction grouping be done after scheduling then? I'm confused what you're checking here.
Right now reduction grouping is provided as a scheduling primitive (`groupReductions(std::vector<TensorView*>)`) so that schedulers can opt in to grouping particular reductions. This part of the code is part of the validations done before converting multiple `ReductionOp` exprs to a single `GroupedReductionOp` expr.

Everything happens before lowering, so there can be unresolved computeWith. Specifically, what this validation checks is that if the output of one of the grouped `ReductionOp` exprs is set to be computed-with, all of the other reduction outputs must have non-conflicting compute-with settings. That means all of the outputs have the same computeWith position. At this point, they are unresolved, so we don't know which consumer of each output will be picked as the target of the computeWith, so I simply enforce that all of them must have the same consumers, which is more than enough to keep the computeWith setting valid after grouping.
@@ -347,7 +349,7 @@ void GpuLower::lower(Fusion* fusion, DataType index_type) {

   // Reorder expressions for loop-nest generation respecting computeAt
   // relationships
-  const auto exprs_sorted = reorderExprsForComputeAt();
+  auto exprs_sorted = reorderExprsForComputeAt();
A little strange to me that `resolveComputeWith(fusion_);` above calls `reorderExprsForComputeAt`, and then it's called again here. Can the expression ordering change between the two calls based on the result of the resolution?
It can, but may not as is. That's what I meant to say in the above:

> It is also possible to reuse the sorted expression list, but not always, as we also replace some expressions in the Fusion container at the beginning of lowering, so the sorted list also needs to be updated. It doesn't mean we would need to do the complete sorting again; we could just update the list as necessary. I didn't try that, but I'd revisit if sorting is really costly.
//! which also means its computeWith needs to have been resolved, the
//! computeWith position is returned. Otherwise, the default computeAt
//! position is returned.
unsigned int getMaybeComputeAtPosition(const TensorView* consumer) const;
Nit: `getComputePosition`
//! computeWith position is returned. Otherwise, the default computeAt
//! position is returned.
unsigned int getMaybeComputeAtPosition(const TensorView* consumer) const;
Should we always be using the new get functions for the compute-at position? Do we want to hide the `getComputeAtPosition` call now, or is it still needed?
I don't think we can hide `getComputeAtPosition` entirely, but I want to rename it to something like `getStoreAtPosition`. We still need that information, for example, for finding allocation points.
"Invalid tensor: ", | ||
compute_with_tv->toString()); | ||
|
||
// Can use any consumer this tensor is computed with |
I thought compute with is resolved after compute at map is built but this reads to me like we need to resolve the compute with before building the compute at map. I'm a little confused on the logical dependencies.
Okay, I think I understand now:

1. Compute at map is built
2. Expressions are sorted
3. Compute with (the producer) is resolved based on the first consumer found in expression sorting
4. Compute at map updates the loop mappings based on compute with
5. Expressions are sorted again with the compute with information

The only step that concerns me is step (5). It seems to me we should assert the expressions are sorted the same way as in (2); otherwise I think that would be a logical inconsistency in the approach. The loop map being updated in (4), I believe, is to generate the right loop structure, not to modify the sorted expression order.
Didn't specify, @naoyam this is my only "sticking point" before approving.
Yes, that is correct. I'm not sure how the second sort could result in an inconsistency. As long as the sorting works as intended, I believe we should be fine. The second sorting could result in a different ordering as some more IterDomains are loop-mapped with updated max producer positions, but as long as the sorting algorithm completes, isn't the sorted list valid?
Yeah, I'm okay with this. The expectation is that the second sorting wouldn't have a tangible difference. Some things, I guess, could be reordered, but I would think that shouldn't actually be the case, because the loop map would be updated based on the first consumer after a producer, and updating the loop map should only reinforce them being next to each other.

I'm just wondering if we should assert that the first and second sorting match, but given the algorithm has some degrees of freedom, it might just coincidentally trigger some expr "group" to reorder with another expr "group", which is just that degree of freedom; i.e., there may be more than one unique valid ordering.

Yeah, I'll change this to approve; I'm okay with this.
Yes, there's some freedom, and updated loop mapping and max producer position might affect the decision about which ordering to use within the freedom. Otherwise, everything should be deterministic (except for bugs of course). I'm reasonably confident that the expression sorting should be able to handle both the first and second sorting consistently.
LGTM, didn't review the tests.
This PR adds a new scheduling primitive called `computeWith`. It's different from what we previously had as `computeWith`. The main motivation is that in persistent fusions we often have code like this:
Here, when `F` is some non-trivial computation, like the serial Welford computation, I observed a 10-15% perf gain by converting it to the code shown below. Note that `N` is a compile-time constant, so all the loops are completely unrolled. There should be nothing blocking nvcc from interleaving the initial memory read and the first use automatically, but apparently that doesn't seem to happen. The existing primitives in nvFuser don't allow us to express this code pattern.
`computeAt` or `inlineAt` can be used to merge the first two loops in the first example above, but they also mean the allocation of the buffer is inlined inside the merged loop, so the code would become invalid as there's another use of the buffer, `G(buffer[i])`. For this persistent pattern, we always need to make sure the buffer is allocated in a scope common to all of its uses, but somehow we would like to move only the expression reading the buffer into the loop where it is first used. This effectively means we would need something similar to `store_at` in Halide. Our `computeAt` is less flexible, as it dictates both the allocation point and the computation point.

The `computeWith` primitive addresses the lack of the store-at concept in a limited manner, without heavily modifying the existing primitives. It does not allow specifying the allocation point of a tensor; rather, it can be used to inline the computation of an expression at a consumer without changing the allocation point. So, for example, in the above case, `buffer->computeWith(F, -1)` would move the buffer loading into the `F` loop.

More specifically, `TensorView` has these new fields:

- `compute_with_consumers_` holds a list of consumers that this tensor is computed with. It's usually just one consumer tensor, but there can be multiple tensors when a consumer has siblings.
- `compute_with_pos_` is the position where this tensor is computed with the consumers. Obviously, this must always be greater than or equal to `compute_at_pos_`.

Note that at this point I'm still keeping `compute_at_pos_` to mean the allocation point. We may want to rename it to `store_at_pos_`.

I also added several functions to `TensorView`. Namely:

- `computeWith(TensorView* consumer, int consumer_pos, bool best_effort)` to do the computeWith setting with some error checking
- `hasComputedWith` to query if this tensor is computed with any consumer
- `isComputedWith(TensorView* consumer)` to query if it's computed with `consumer`
- `getMaxComputeAtPosition()` to get the maximum of the computeAt and computeWith positions
- `getMaybeComputeAtPosition(TensorView* consumer)` to get the position where this tensor is computed from the perspective of the given consumer. If this tensor is computed with the consumer, it returns the compute-with position. Otherwise, it returns the compute-at position, which effectively means the store-at position.

The actual implementation of `computeWith` is mostly borrowed from `inlineAt`. Unlike `computeAt` but similar to `inlineAt`, it does not transform itself nor the consumer; it just sets the computeWith position and the consumer tensors it's computed with. Unlike `inlineAt`, it does not need the constraint on allocation of persistent buffers, since it does not change allocation points, which means we can skip building the unmappable domains using `ComputeAtRootDomainMap`.

For concrete examples of how it's used, see the new tests in `test_gpu_computed_with.cpp`. In particular, `FusionComputeWith7` reproduces the original motivating case of this transformation.

Some more notes:

- The expression sorting is done twice when `computeWith` is used.
- I went through existing uses of `getComputeAtPosition()` and changed them to `getMaxComputeAtPosition()` or `getMaybeComputeAtPosition()` when I thought necessary. I'm not super confident, but nothing seems broken.
- An invalid computeWith setting is currently only detected during lowering; see `FusionComputeWith4` for a concrete example. I believe this error check would effectively require almost the same analysis as the expression sorting, and in fact the test fails at the expression sorting. Ideally, an invalid computeWith like this should be detected at the time a tensor is computed with, not when its fusion is lowered, but at this point I don't see that as a must for making progress on the outer persistent problem.