Add computeWith to interleave gmem accesses and computations #2156


Merged
merged 24 commits into devel from prelim_store_at on Nov 17, 2022

Conversation

@naoyam naoyam commented Nov 2, 2022

This PR adds a new scheduling primitive called computeWith. It's different from what we previously had as computeWith.

The main motivation is that in persistent fusions we often have code like this:

float buffer[N];
// Load everything we need
for i in N:
  buffer[i] = load_from_gmem();
// Use the buffer
for i in N:
  F(buffer[i]);
...
// Use the buffer again
for i in N:
  G(buffer[i]);

Here, when F is some non-trivial computation, like the serial Welford computation, I observed a 10-15% perf gain by converting it to the code shown below:

float buffer[N];
// Load and use the data
for i in N:
  buffer[i] = load_from_gmem();
  F(buffer[i]);
...
// Use the buffer again
for i in N:
  G(buffer[i]);

Note that N is a compile-time constant, so all the loops are completely unrolled. There should be nothing to prevent nvcc from interleaving the initial memory read and the first use automatically, but apparently that doesn't happen.

The existing primitives in nvFuser don't allow us to express this code pattern. computeAt or inlineAt can be used to merge the first two loops in the first example above, but they also mean the allocation of the buffer is inlined inside the merged loop, so the code would become invalid as there's another use of the buffer, G(buffer[i]). For this persistent pattern, we always need to make sure the buffer is allocated in a scope common to all of its uses, while somehow moving only the expression that reads into the buffer into the loop where it is first used.

This effectively means we would need something similar to store_at in Halide. Our computeAt is less flexible as it dictates both the allocation point and the computation point.

The computeWith primitive addresses the lack of the store-at concept in a limited manner without heavily modifying the existing primitives. It does not allow specifying the allocation point of a tensor; rather, it can be used to inline the computation of an expression at a consumer without changing the allocation point. So, for example, in the above case, buffer->computeWith(F, -1) would move the buffer loading into the loop of F.

More specifically, TensorView has these new fields:

//! Direct consumer tensors that this tensor is computed with
std::vector<TensorView*> compute_with_consumers_;
//! Position where this tensor is computed with the compute-with
//! consumer tensors. It should always be greater than or equal to
//! the computeAt position
unsigned int compute_with_pos_ = 0;

compute_with_consumers_ holds a list of consumers that this tensor is computed with. It's usually just one consumer tensor, but there can be multiple tensors when a consumer has siblings. compute_with_pos_ is the position where this tensor is computed with the consumers. Obviously, this must always be greater than or equal to compute_at_pos_.

Note that at this point I'm still keeping compute_at_pos_ to mean the allocation point. We may want to rename it to store_at_pos_.

I also added several functions to TensorView. Namely:

  • computeWith(TensorView* consumer, int consumer_pos, bool best_effort) to do the computeWith setting with some error checking
  • hasComputedWith to query if this tensor is computed with any consumer
  • isComputedWith(TensorView* consumer) to query if it's computed with consumer
  • getMaxComputeAtPosition() to get the maximum of the computeAt and computeWith positions
  • getMaybeComputeAtPosition(TensorView* consumer) to get the position where this tensor is computed from the perspective of the given consumer. If this tensor is computed with the consumer, it returns the compute-with position. Otherwise, it returns the compute-at position, which effectively means the store-at position. A usage sketch follows this list.
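
As a rough usage sketch (the surrounding variables are hypothetical; only the member functions listed above are from this PR, and the interface changes later in this thread):

// buffer is a persistent tensor; consumer is its first use (e.g., F).
// Inline the computation of buffer into consumer's loop nest at the
// innermost position, without changing where buffer is allocated.
buffer->computeWith(consumer, /*consumer_pos=*/-1, /*best_effort=*/true);

// Queries:
bool any = buffer->hasComputedWith();          // computed with some consumer?
bool with = buffer->isComputedWith(consumer);  // computed with this consumer?
unsigned int max_pos = buffer->getMaxComputeAtPosition();
// From consumer's perspective: the compute-with position if computed with
// it, otherwise the compute-at (effectively store-at) position.
unsigned int pos = buffer->getMaybeComputeAtPosition(consumer);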

The actual implementation of computeWith is mostly borrowed from inlineAt. Unlike computeAt but similar to inlineAt, it does not transform itself or the consumer; it just sets the computeWith position and the consumer tensors it's computed with. Unlike inlineAt, it does not need the constraint on allocation of persistent buffers since it does not change allocation points, which means we can skip building the unmappable domains using ComputeAtRootDomainMap.

For concrete examples of how it's used, see the new tests in test_gpu_computed_with.cpp. In particular, FusionComputeWith7 reproduces the original motivating case of this transformation.

Some more notes:

  • Nothing should change unless computeWith is used.
  • I looked over all the uses of getComputeAtPosition() and changed them to getMaxComputeAtPosition() or getMaybeComputeAtPosition() when I thought necessary. I'm not super confident but nothing seems broken.
  • The LOOP computeAt map is modified to map all leaf IDs that are left of the compute-with position when the consumer is the one this tensor is computed with.
  • One of the major limitations is that there's no error check against the computed-with consumer. For example, in the above case, we can move the memory read to the first use but not to the second use, since that would break the data dependency between the read and the first use (see the pseudocode sketch after this list, and FusionComputeWith4 for a concrete example). I believe this error check would effectively require almost the same analysis as the expression sorting, and in fact the test fails at the expression sorting. Ideally, an invalid computeWith like this should be detected at the time a tensor is set to be computed with, not when its fusion is lowered, but at this point I don't see it as a must for making progress on the outer persistent problem.
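
A minimal pseudocode sketch of this invalid case, mirroring the earlier examples (moving the read to the second use would leave the first use consuming unwritten data):

float buffer[N];
// First use: buffer[i] is consumed before it is written
for i in N:
  F(buffer[i]);
...
// Read moved to the second use: breaks the dependency above
for i in N:
  buffer[i] = load_from_gmem();
  G(buffer[i]);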

@naoyam naoyam requested review from csarofeen and zasdfgbnm November 2, 2022 22:18
@zasdfgbnm zasdfgbnm left a comment

Very clean solution for computing and allocating at different positions! I have just started, posting a few comments for now.


naoyam commented Nov 2, 2022

As a concrete data point, in one of the outer welford tests in PR #1772, FusionGridPersistWelfordOuterNormLike256x14x512_CUDA, it's around 330-340 GB/s on Titan RTX:

kernel1 run in 0.316224 ms, achieved: 324.961 GB/s
kernel1 run in 0.303104 ms, achieved: 339.027 GB/s
kernel1 run in 0.302304 ms, achieved: 339.924 GB/s
kernel1 run in 0.305152 ms, achieved: 336.752 GB/s
kernel1 run in 0.301056 ms, achieved: 341.333 GB/s
kernel1 run in 0.303264 ms, achieved: 338.848 GB/s
kernel1 run in 0.296768 ms, achieved: 346.265 GB/s
kernel1 run in 0.300992 ms, achieved: 341.406 GB/s
kernel1 run in 0.293824 ms, achieved: 349.735 GB/s
kernel1 run in 0.30048 ms, achieved: 341.988 GB/s
kernel1 run in 0.300416 ms, achieved: 342.061 GB/s

With computeWith used as shown here, I got around 390 GB/s:

kernel1 run in 0.278368 ms, achieved: 369.153 GB/s
kernel1 run in 0.263232 ms, achieved: 390.38 GB/s
kernel1 run in 0.261792 ms, achieved: 392.527 GB/s
kernel1 run in 0.262144 ms, achieved: 392 GB/s
kernel1 run in 0.262912 ms, achieved: 390.855 GB/s
kernel1 run in 0.262144 ms, achieved: 392 GB/s
kernel1 run in 0.262112 ms, achieved: 392.048 GB/s
kernel1 run in 0.261696 ms, achieved: 392.671 GB/s
kernel1 run in 0.265312 ms, achieved: 387.319 GB/s
kernel1 run in 0.258336 ms, achieved: 397.778 GB/s
kernel1 run in 0.26368 ms, achieved: 389.716 GB/s

@zasdfgbnm zasdfgbnm left a comment

posting more

@@ -0,0 +1,451 @@
#if defined(USE_CUDA)
Collaborator

Thank you for creating a new test file :)

": ",
consumer_pos);

// Find the corresponding position in this tensor
Collaborator

In terms of interface, which do you think makes more sense? computeAt uses a position in the consumer, but inlineAt uses a position in this tensor.

Collaborator Author

computeAt as well as computeWith have a consumer parameter, and I thought it'd be more intuitive to have the position parameter relative to the consumer.

inlineAt, on the other hand, doesn't reference any particular consumer, and it makes sense that the position references its domain.

That said, the difference does seem confusing. If we wanted to have a uniform position semantic, I'd change computeWith to take a position in its domain.
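
For illustration only (hypothetical producer and consumer tensors; conventions as discussed in this thread):

// computeAt: the position indexes into the consumer's domain
producer->computeAt(consumer, /*consumer_pos=*/-1);
// inlineAt: the position indexes into the producer's own domain
producer->inlineAt(/*pos=*/-1);
// computeWith as currently proposed follows the computeAt convention; a
// uniform semantic would have it take a position in the producer's domain
producer->computeWith(consumer, /*consumer_pos=*/-1);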

Collaborator

I have no idea on which will be more convenient to use in the scheduler, and I have no strong opinion about which should be used here.

Collaborator Author

Most likely, the best-effort option should just work, so I think this difference won't matter much. I'm leaning toward changing this to the position in its own tensor, as that's the position we ultimately save in each tensor (i.e., compute_with_pos_).

@csarofeen csarofeen left a comment

My main question here is what would be the disadvantage of making compute at relative to the producer instead of a consumer. Right now it seems we're trying to make sure it gets inlined directly into a specific consumer, but is there a practical limitation of making it inlined relative to all consumers, and opportunistically picking the first consumer that lower expr sort returns during lowering?

I'm generally concerned about adding a sorting constraint (as you've identified) without having a mechanism to guarantee at scheduling time it can be sorted. If we instead let it sort the same way it would without the computeWith then just inline into the first consumer, it seems like it would achieve what you're looking for without opportunity to break sorting.


naoyam commented Nov 3, 2022

My main question here is what would be the disadvantage of making compute at relative to the producer instead of a consumer. Right now it seems we're trying to make sure it gets inlined directly into a specific consumer, but is there a practical limitation of making it inlined relative to all consumers, and opportunistically picking the first consumer that lower expr sort returns during lowering?

I'm generally concerned about adding a sorting constraint (as you've identified) without having a mechanism to guarantee at scheduling time it can be sorted. If we instead let it sort the same way it would without the computeWith then just inline into the first consumer, it seems like it would achieve what you're looking for without opportunity to break sorting.

That's interesting to think about. I'd feel reluctant to inline everything opportunistically. Maybe that's just what makes the most sense, but I'd lean toward making it opt-in or opt-out. Someday we may want to port nvFuser to CPUs, where we may want to do software pipelining, so it may not always be optimal to inline as much as possible.

I'm thinking about something like this.

  • Let the user specify which tensors to inline with computeWith (opt-in inlining)
  • When lowering, sort the expressions WITHOUT considering computeWith and validate if computed-with tensors can be moved without invalidating the ordering determined by the sorting.

@csarofeen

I'm thinking about something like this.
Let the user specify which tensors to inline with computeWith (opt-in inlining)
When lowering, sort the expressions WITHOUT considering computeWith and validate if computed-with tensors can be moved without invalidating the ordering determined by the sorting.

Yeah, this sounds fine. I was just thinking the same except not needing to specify a set of consumers (so if you set computeWith it's just with any consumer that matches that position).


naoyam commented Nov 3, 2022

I'm thinking about something like this.
Let the user specify which tensors to inline with computeWith (opt-in inlining)
When lowering, sort the expressions WITHOUT considering computeWith and validate if computed-with tensors can be moved without invalidating the ordering determined by the sorting.

Yeah, this sounds fine. I was just thinking the same except not needing to specify a set of consumers (so if you set computeWith it's just with any consumer that matches that position).

Ah, that seems like a safer and easier approach. Thanks for the suggestion. I'll update the PR.

@zasdfgbnm

What is the status of this PR? I guess the only two remaining things are: #2156 (comment) and #2156 (comment)?


naoyam commented Nov 10, 2022

Yeah, I'm going to change the interface. I'll ask for reviews again when ready.

@naoyam naoyam marked this pull request as draft November 10, 2022 23:03

naoyam commented Nov 15, 2022

Changed the computeWith interface as we discussed. Here's an example with a softmax-like pattern:

TEST_F(NVFuserTest, FusionComputeWith2_CUDA) {
  Fusion fusion;
  FusionGuard fg(&fusion);

  const int vec = 4;
  const int tidx = 128;
  const int dimx = 1000;

  auto input_tv0 = makeContigTensor(1);
  fusion.addInput(input_tv0);

  auto exp_tv1 = unaryOp(UnaryOpType::Exp, input_tv0);
  auto sum_exp_tv2 = sum(exp_tv1, {-1});
  auto bcast_sum_tv3 = broadcast(sum_exp_tv2, {true});

  auto exp_tv1_copy = unaryOp(UnaryOpType::Exp, input_tv0);

  auto output_tv4 = div(exp_tv1_copy, bcast_sum_tv3);

  fusion.addOutput(output_tv4);

  auto input_tv0_cache = input_tv0->cacheAfter();

  input_tv0->split(-1, vec);
  input_tv0->split(-2, tidx);
  MaxRootDomainInfoSpanningTree tree(input_tv0);
  TransformPropagatorWithCheck tp(input_tv0);
  tree.traverse(&tp);

  auto sum_exp_rf_tv5 = sum_exp_tv2->rFactor({-1});

  inlineMost();
  input_tv0_cache->computeWith(-1);

  input_tv0_cache->axis(0)->parallelize(ParallelType::BIDx);
  input_tv0_cache->axis(1)->parallelize(ParallelType::TIDx);
  scheduler_utils::parallelizeAllLike(input_tv0_cache);

  GpuLower gpulw(&fusion);
  // Lowering should automatically pick the first consumer of the
  // computed-with tensor as its target
  checkComputeWith(
      gpulw.kernel(), input_tv0_cache, input_tv0_cache->nDims(), {exp_tv1});

  auto options = at::TensorOptions().dtype(at::kFloat).device(at::kCUDA, 0);
  at::Tensor t0 = at::randn({dimx}, options);

  FusionExecutor fe;
  fe.compileFusion(&fusion, {t0});
  auto cg_outputs = fe.runFusion({t0});

  auto aten_output = at::_softmax(t0.to(at::kDouble), -1, false);

  testValidate(&fusion, cg_outputs, {t0}, {aten_output}, __LINE__, __FILE__);
}

It now has just:

input_tv0_cache->computeWith(-1);

Previously, it was:

input_tv0_cache->computeWith(-1, exp_tv1);

The target consumer tensor is automatically picked when the fusion is lowered: it is the first consumer tensor appearing in the sorted expression list.
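
The resolution rule itself is simple. As a standalone sketch (hypothetical ExprSketch type; the actual implementation works on nvFuser's Expr and TensorView):

#include <string>
#include <vector>

struct ExprSketch {
  std::vector<std::string> inputs;
  std::vector<std::string> outputs;
};

// Given expressions in the order produced by expression sorting, the
// computeWith target is the first expression that consumes the tensor;
// its outputs become the compute-with consumers.
std::vector<std::string> resolveComputeWithTarget(
    const std::vector<ExprSketch>& sorted_exprs,
    const std::string& tensor) {
  for (const auto& expr : sorted_exprs) {
    for (const auto& input : expr.inputs) {
      if (input == tensor) {
        return expr.outputs;
      }
    }
  }
  return {}; // no use found; computeWith stays unresolved
}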

One of the major changes I needed to make is to do the expression sorting twice when computeWith is used: once to resolve computeWith targets and once more for lowering. The computeWith resolution is done at the very beginning of lowering, before examining the fusion for validations and other analyses. Resolving computeWith, i.e., finding the right target of computeWith, affects the computeAt LOOP mappings as well as the max producer positions, so anything depending on that information must be done after computeWith resolution.

It is also possible to reuse the sorted expression list, but not always, as we also replace some expressions in the Fusion container at the beginning of lowering, so the sorted list needs to be updated as well. That doesn't mean we would need to do the complete sorting again; we could just update the list as necessary. I didn't try that, but I'd revisit it if sorting turns out to be really costly.

Remaining TODO:

  • Don't create the computeAt map from scratch again after computeWith resolution.

@naoyam naoyam marked this pull request as ready for review November 16, 2022 07:48

naoyam commented Nov 16, 2022

This is ready for review again @zasdfgbnm @csarofeen

@csarofeen csarofeen left a comment

Reviewed everything but the tests and only have one sticking point of confusion. I'll rely on @zasdfgbnm and @naoyam to make sure there's sufficient testing at this point.


// If it's already set to be computed with the consumer and the
// position is higher, nothing to change
if (getComputeWithPosition() >= pos) {
Owner

Do we need to update consumers in some way in this call to make sure that consumers don't transform in an inconsistent way after the computeWith of their producers is set?

Owner

It seems to me resolveComputeWith will simply fail if we're in an inconsistent state?

Collaborator Author

Ah, I see. Since at this point no consumer is aware of the potential computeWith into it, there's nothing to prevent the consumers from being transformed in an inconsistent way. I need to think about this more carefully.

Owner

I don't think this needs to be a blocker for this PR, we might just want to more explicitly validate in lowering.

Owner

Would be nice in the future to error on transformation attempt of the consumer.

Collaborator Author

Added maybe_max_producer_pos_ (1c83e69). Don't like the name, but don't have any other idea.

Modifying consumer domains where producer may be computed at should now throw an error.
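
A standalone sketch of that guard (hypothetical types and names; only maybe_max_producer_pos_ is from this PR):

#include <stdexcept>

struct TensorViewSketch {
  // Rightmost position at which some producer may be computed with this
  // tensor once computeWith is resolved.
  unsigned int maybe_max_producer_pos_ = 0;

  void split(unsigned int axis, int factor) {
    // Transforming an axis to the left of the maybe-max producer position
    // could invalidate an unresolved computeWith, so reject it.
    if (axis < maybe_max_producer_pos_) {
      throw std::runtime_error(
          "Cannot transform an axis that a producer may be computed with");
    }
    // ... perform the actual split ...
    (void)factor;
  }
};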

}
}

bool TensorView::resolveComputeWith(const std::vector<Expr*>& sorted_exprs) {
Owner

Sorted expressions is post expr sorting, not just the topologically sorted DAG, correct?

Collaborator Author

Yes. Will add a comment

}

// First use found. Set it as the computeWith target tensor
std::cerr << "Resolve the computeWith target as: " << expr->toString();
Owner

Does this need to be cleaned up?

}

for (auto consumer_tv : compute_with_consumers_) {
consumer_tv->updateMaxProducerPosition();
Owner

Okay, this function was a little unclear in my first read, but think I got it. It seems you're not trying to enforce that all consumers can support the given computeWith from all its producers. Instead you only care if the first consumer in the sorted list actually supports it.

This seems more permissive than the conservative approach I was thinking of, where we would simply enforce that all consumers support the computeWith position of all producers, which would have made the behavior more conservative but consistent with computeAt behavior during scheduling.

Collaborator Author

Updating the max producer position of all consumers doesn't seem quite right to me. Except for the one consumer where the producer is inlined, the other consumers still see the producer as if no computeWith were done.

I don't remember exactly which parts of the system are affected by a change of the max producer position, but the expr sorting is one of them. Thinking about the normalization pattern, I don't think the sorting would be able to resolve the dependencies of the consumer and producer positions if we updated the max producer position of all the consumers.

Owner

I would assume it would have to be maxProducerAtPosition and maxProducerWithPosition.

Collaborator Author

Yes, I'm thinking about adding something like maybe_max_producer_position_, which should work as a guard to prevent further transformations, but it shouldn't be used for the expr sorting.

// Don't know which consumer would be computed with at this
// point. Just make sure all the grouped reduction outputs have the
// same set of consumers. This is not necessarily a required
// condition and could be made more flexible
Owner

Should reduction grouping be done after scheduling then? I'm confused what you're checking here.

Collaborator Author

Right now reduction grouping is provided as a scheduling primitive (groupReductions(std::vector<TensorView*>)) so that schedulers can opt in to grouping particular reductions. This part of the code is part of the validations done before converting multiple ReductionOp exprs to a single GroupedReductionOp expr.

Everything happens before lowering, so there can be unresolved computeWith. Specifically, what this validation checks is that if the output of one of the grouped ReductionOp exprs is set to be computed with, all of the other reduction outputs must have non-conflicting compute-with settings, meaning all of the outputs have the same computeWith position. At this point they are unresolved, so we don't know which consumer of each output will be picked as the computeWith target, so I simply enforced that all of them must have the same consumers, which is more than enough to keep the computeWith setting valid after grouping.
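
As a standalone sketch of that check (hypothetical types; the actual validation works on TensorView):

#include <set>
#include <vector>

struct ReductionOutputSketch {
  unsigned int compute_with_pos = 0;
  std::set<const void*> consumers; // identities of consumer tensors
};

// All outputs of the reductions being grouped must have the same
// (unresolved) computeWith position and the same set of consumers.
bool computeWithSettingsMatch(
    const std::vector<ReductionOutputSketch>& outputs) {
  for (size_t i = 1; i < outputs.size(); ++i) {
    if (outputs[i].compute_with_pos != outputs[0].compute_with_pos ||
        outputs[i].consumers != outputs[0].consumers) {
      return false;
    }
  }
  return true;
}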

@@ -347,7 +349,7 @@ void GpuLower::lower(Fusion* fusion, DataType index_type) {

// Reorder expressions for loop-nest generation respecting computeAt
// relationships
-const auto exprs_sorted = reorderExprsForComputeAt();
+auto exprs_sorted = reorderExprsForComputeAt();
Owner

A little strange to me that resolveComputeWith(fusion_) above calls reorderExprsForComputeAt and then it's called again here. Can the expression ordering change between the two calls based on the result of the resolution?

Collaborator Author

It can, and the first list may not be reusable as is. That's what I meant to say above:

It is also possible to reuse the sorted expression list, but not always, as we also replace some expressions in the Fusion container at the beginning of lowering, so the sorted list needs to be updated as well. That doesn't mean we would need to do the complete sorting again; we could just update the list as necessary. I didn't try that, but I'd revisit it if sorting turns out to be really costly.

//! which also means its computeWith needs to have been resolved, the
//! computeWith position is returned. Otherwise, the default computeAt
//! position is returned.
unsigned int getMaybeComputeAtPosition(const TensorView* consumer) const;
Owner

Nit: getComputePosition

//! computeWith position is returned. Otherwise, the default computeAt
//! position is returned.
unsigned int getMaybeComputeAtPosition(const TensorView* consumer) const;

Owner

Should we always be using the new get functions for the compute at position? Do we want to hide the getComputeAtPosition call now or is it still needed?

Collaborator Author

I don't think we can hide getComputeAtPosition entirely, but I want to rename it to something like getStoreAtPosition. We still need that information, for example, for finding allocation points.

"Invalid tensor: ",
compute_with_tv->toString());

// Can use any consumer this tensor is computed with
Owner

I thought compute with is resolved after the compute at map is built, but this reads to me like we need to resolve the compute with before building the compute at map. I'm a little confused about the logical dependencies.

Owner

Okay, think I understood now:

  1. Compute at map is built
  2. Expressions are sorted
  3. Compute with (the producer) is resolved based on first consumer found in expression sorting
  4. Compute at map updates the loop mappings based on compute with
  5. Expressions are sorted again with the compute with information

The only step that concerns me, I think, is step (5). It seems to me we should assert the expressions are sorted the same way as in (2); otherwise, I think that would be a logical inconsistency with the approach.

Loop map being updated in (4) I believe is to generate the right loop structure not to modify the sorted expression order.

Owner

I didn't specify earlier: @naoyam, this is my only "sticking point" before approving.

Collaborator Author

Yes, that is correct. I'm not sure how the second sort could result in an inconsistency. As long as the sorting works as intended, I believe we should be fine. The second sorting could result in a different ordering as some more IterDomains are loop-mapped with updated max producer positions, but as long as the sorting algorithm completes, isn't the sorted list valid?

Owner

Yeah, I'm okay with this. The expectation is that the second sorting wouldn't make a tangible difference; some things I guess could be reordered, but I would think that shouldn't actually be the case, because the loop map would be updated based on the first consumer after a producer, and updating the loop map should only reinforce them being next to each other.

I'm just wondering if we should assert that the first and second sorting match, but given the algorithm has some degrees of freedom, such an assert might just coincidentally trigger when some expr "group" reorders with another expr "group", which is just that "degree of freedom"; i.e., there may be more than one unique valid ordering.

Yeah, I'll change this to approve, I'm okay with this.

Collaborator Author

Yes, there's some freedom, and the updated loop mapping and max producer position might affect the decision about which ordering to use within that freedom. Otherwise, everything should be deterministic (except for bugs, of course). I'm reasonably confident that the expression sorting can handle both the first and second sorting consistently.

@csarofeen csarofeen left a comment

LGTM, didn't review the tests.

@naoyam naoyam merged commit bf1596c into devel Nov 17, 2022
@naoyam naoyam deleted the prelim_store_at branch November 17, 2022 03:13