Add computeWith to interleave gmem accesses and computations #2156
Conversation
Very clean solution for computing and allocating at different positions! I have just started, posting a few comments for now.
As a concrete data point, in one of the outer welford tests in PR #1772,
posting more
@@ -0,0 +1,451 @@
#if defined(USE_CUDA)
Thank you for creating a new test file :)
": ", | ||
consumer_pos); | ||
|
||
// Find the corresponding position in this tensor |
In terms of interface, which do you think makes more sense? `computeAt` uses the consumer position, but `inlineAt` uses this tensor's position.
`computeAt` as well as `computeWith` have a consumer parameter, and I thought it'd be more intuitive to have the position parameter relative to the consumer. `inlineAt`, on the other hand, doesn't reference any particular consumer, so it makes sense that the position references its own domain. That said, the difference does seem confusing. If we wanted a uniform position semantic, I'd change `computeWith` to take a position in its own domain.
I have no idea on which will be more convenient to use in the scheduler, and I have no strong opinion about which should be used here.
Most likely, the best-effort option should just work, so I think this difference won't matter much. I'm leaning toward changing this to the position in its own tensor, as that's the position we ultimately save in each tensor (i.e., `compute_with_pos_`).
My main question here is what would be the disadvantage of making compute at relative to the producer instead of a consumer. Right now it seems we're trying to make sure it gets inlined directly into a specific consumer, but is there a practical limitation of making it inlined relative to all consumers, and opportunistically picking the first consumer that lower expr sort returns during lowering?
I'm generally concerned about adding a sorting constraint (as you've identified) without having a mechanism to guarantee at scheduling time it can be sorted. If we instead let it sort the same way it would without the computeWith then just inline into the first consumer, it seems like it would achieve what you're looking for without opportunity to break sorting.
That's interesting to think about. I'd feel reluctant to inline everything opportunistically. Maybe that's just what makes the most sense, but I'd lean toward making it opt-in or opt-out. Someday we may want to port nvFuser to CPUs, where we may want to do software pipelining, so it may not always be optimal to inline as much as possible. I'm thinking about something like this.
Yeah, this sounds fine. I was just thinking the same, except not needing to specify a set of consumers (so if you set `computeWith`, it's just with any consumer that matches that position).
Ah, that seems like a safer and easier approach. Thanks for the suggestion. I'll update the PR.
What is the status of this PR? I guess the only remaining two things are #2156 (comment) and #2156 (comment)?
Yeah, I'm going to change the interface. I'll ask for reviews again when ready.
Changed the `computeWith` interface.
It now has just:
Previously, it was:
The target consumer tensor is automatically picked when the fusion is lowered: it is the first consumer tensor appearing in the sorted expression list.

One of the major changes I needed to make is to run the expression sorting twice when `computeWith` is used: first to resolve the `computeWith` targets, and once more for lowering. The `computeWith` resolution is done at the very beginning of lowering, before examining the fusion for validations and other analyses. Resolving `computeWith`, i.e., finding the right target of `computeWith`, affects the computeAt LOOP mappings as well as the max producer positions, so anything depending on that information must be done after the resolution.

It is also possible to reuse the sorted expression list, but not always, as we also replace some expressions in the Fusion container at the beginning of lowering, so the sorted list also needs to be updated. That doesn't mean we would need to do the complete sorting again; we could just update the list as necessary. I didn't try that, but I'd revisit it if sorting turns out to be really costly.

Remaining TODO:
This is ready for review again @zasdfgbnm @csarofeen |
Reviewed everything but the tests and only have one sticking point of confusion. I'll rely on @zasdfgbnm and @naoyam to make sure there's sufficient testing at this point.
// If it's already set to be computed with the consumer and the
// position is higher, nothing to change
if (getComputeWithPosition() >= pos) {
Do we need to update consumers in some way in this call to make sure that consumers don't transform in an inconsistent way after the computeWith of their producers is set?
It seems to me `resolveComputeWith` will simply fail if we're in an inconsistent state?
Ah, I see. Since at this point no consumer is aware of the potential computeWith into it, there's nothing to prevent the consumers from being transformed in an inconsistent way. I need to think about this more carefully.
I don't think this needs to be a blocker for this PR, we might just want to more explicitly validate in lowering.
Would be nice in the future to error on transformation attempt of the consumer.
Added `maybe_max_producer_pos_` (1c83e69). Don't like the name, but don't have any other idea. Modifying consumer domains where a producer may be computed at should now throw an error.
}
}
bool TensorView::resolveComputeWith(const std::vector<Expr*>& sorted_exprs) { |
The sorted expressions are post expr sorting, not just the topologically sorted DAG, correct?
Yes. Will add a comment
}
// First use found. Set it as the computeWith target tensor
std::cerr << "Resolve the computeWith target as: " << expr->toString();
Does this need to be cleaned up?
}
for (auto consumer_tv : compute_with_consumers_) {
  consumer_tv->updateMaxProducerPosition();
Okay, this function was a little unclear on my first read, but I think I got it. It seems you're not trying to enforce that all consumers can support the given `computeWith` from all their producers. Instead you only care whether the first consumer in the sorted list actually supports it.

This seems more permissive than the conservative approach I was thinking of, where we would simply enforce that all consumers support the `computeWith` position of all producers. That would have made the behavior more conservative but consistent with `computeAt` behavior during scheduling.
Updating the max producer position of all consumers doesn't seem quite right to me. Except for the one consumer where the producer is inlined, the other consumers still see the producer produced as if no computeWith were done. I don't remember which parts of the system would be affected by a change of the max producer position, but the expr sorting is one of them. Thinking about the normalization pattern, I don't think it would be able to resolve the dependencies of the consumer and producer positions if we updated the max producer position of all the consumers.
I would assume it would have to be `maxProducerAtPosition` and `maxProducerWithPosition`.
Yes, I'm thinking about adding something like `maybe_max_produce_position_`, which should work as a guard to prevent further transformations; it shouldn't be used for the expr sorting.
// Don't know which consumer would be computed with at this
// point. Just make sure all the grouped reduction outputs have the
// same set of consumers. This is not necessarily a required
// condition and could be made more flexible
Should reduction grouping be done after scheduling then? I'm confused what you're checking here.
Right now reduction grouping is provided as a scheduling primitive (`groupReductions(std::vector<TensorView*>)`) so that schedulers can opt in to grouping particular reductions. This part of the code is part of the validations done before converting multiple `ReductionOp` exprs to a single `GroupedReductionOp` expr.

Everything happens before lowering, so there can be unresolved computeWith. Specifically, what this validation checks is that if the output of one of the grouped `ReductionOp` exprs is set to be computed-with, all of the other reduction outputs must have non-conflicting compute-with settings. That means all of the outputs have the same computeWith position. At this point, they are unresolved, so we don't know which consumer of each output will be picked as the target of the computeWith, so I simply enforce that all of them must have the same consumers, which is more than enough to keep the computeWith setting valid after grouping.
@@ -347,7 +349,7 @@ void GpuLower::lower(Fusion* fusion, DataType index_type) {

   // Reorder expressions for loop-nest generation respecting computeAt
   // relationships
-  const auto exprs_sorted = reorderExprsForComputeAt();
+  auto exprs_sorted = reorderExprsForComputeAt();
A little strange to me that `resolveComputeWith(fusion_);` above calls `reorderExprsForComputeAt`, and then it's called again here. Can the expression ordering change between the two calls based on the result of the resolution?
It can, but may not as is. That's what I meant to say in the above:

> It is also possible to reuse the sorted expression list, but not always, as we also replace some expressions in the Fusion container at the beginning of lowering, so the sorted list also needs to be updated. It doesn't mean we would need to do the complete sorting again; we could just update the list as necessary. I didn't try that, but I'd revisit if sorting is really costly.
//! which also means its computeWith needs to have been resolved, the
//! computeWith position is returned. Otherwise, the default computeAt
//! position is returned.
unsigned int getMaybeComputeAtPosition(const TensorView* consumer) const;
Nit: `getComputePosition`
//! computeWith position is returned. Otherwise, the default computeAt
//! position is returned.
unsigned int getMaybeComputeAtPosition(const TensorView* consumer) const;
Should we always be using the new get functions for the compute-at position? Do we want to hide the `getComputeAtPosition` call now, or is it still needed?
I don't think we can hide `getComputeAtPosition` entirely, but I want to rename it to something like `getStoreAtPosition`. We still need that information, for example, for finding allocation points.
"Invalid tensor: ", | ||
compute_with_tv->toString()); | ||
|
||
// Can use any consumer this tensor is computed with |
I thought compute with is resolved after compute at map is built but this reads to me like we need to resolve the compute with before building the compute at map. I'm a little confused on the logical dependencies.
Okay, I think I understand now:

1. Compute at map is built
2. Expressions are sorted
3. Compute with (the producer) is resolved based on the first consumer found in expression sorting
4. Compute at map updates the loop mappings based on compute with
5. Expressions are sorted again with the compute with information

The only step that concerns me is step (5). It seems to me we should assert the expressions are sorted the same way as in (2); otherwise I think that would be a logical inconsistency in the approach. The loop map being updated in (4), I believe, is to generate the right loop structure, not to modify the sorted expression order.
Didn't specify, @naoyam this is my only "sticking point" before approving.
Yes, that is correct. I'm not sure how the second sort could result in an inconsistency. As long as the sorting works as intended, I believe we should be fine. The second sorting could result in a different ordering as some more IterDomains are loop-mapped with updated max producer positions, but as long as the sorting algorithm completes, isn't the sorted list valid?
Yeah, I'm okay with this. The expectation is that the second sorting wouldn't have a tangible difference. Some things, I guess, could be reordered, but I would think that shouldn't actually be the case, because the loop map would be updated based on the first consumer after a producer, and updating the loop map should only reinforce them being next to each other.

I'm just wondering if we should assert that the first and second sorting match, but given the algorithm has some degrees of freedom, it might just coincidentally trigger some expr "group" to reorder with another expr "group", which is just that degree of freedom; i.e., there may be more than one unique valid ordering.

Yeah, I'll change this to approve; I'm okay with this.
Yes, there's some freedom, and updated loop mapping and max producer position might affect the decision about which ordering to use within the freedom. Otherwise, everything should be deterministic (except for bugs of course). I'm reasonably confident that the expression sorting should be able to handle both the first and second sorting consistently.
LGTM, didn't review the tests.
This PR adds a new scheduling primitive called `computeWith`. It's different from what we previously had as `computeWith`. The main motivation is that in persistent fusions we often have code like this:
Here, when `F` is some non-trivial computation, like the serial Welford computation, I observed a 10-15% perf gain by converting it to the code shown below. Note that `N` is a compile-time constant, so all the loops are completely unrolled. There should be nothing blocking nvcc from interleaving the initial memory read and the first use automatically, but apparently that doesn't seem to happen. The existing primitives in nvFuser don't allow us to express this code pattern.
`computeAt` or `inlineAt` can be used to merge the first two loops in the first example above, but they also mean the allocation of the buffer is inlined inside the merged loop, so the code would become invalid as there's another use of the buffer, `G(buffer[i])`. For this persistent pattern, we always need to make sure the buffer is allocated in a scope common to all of its uses, but somehow we would like to move only the expression reading the buffer into the loop where it is first used. This effectively means we would need something similar to `store_at` in Halide. Our `computeAt` is less flexible, as it dictates both the allocation point and the computation point.

The `computeWith` primitive addresses the lack of the store-at concept in a limited manner, without heavily modifying the existing primitives. It does not allow specifying the allocation point of a tensor; rather, it can be used to inline the computation of an expression at a consumer without changing the allocation point. So, for example, in the above case, `buffer->computeWith(F, -1)` would move the buffer loading into the `F` loop.

More specifically, `TensorView` has these new fields:

- `compute_with_consumers_` holds a list of consumers that this tensor is computed with. It's usually just one consumer tensor, but there can be multiple tensors when a consumer has siblings.
- `compute_with_pos_` is the position where this tensor is computed with the consumers. Obviously, this must always be greater than or equal to `compute_at_pos_`.

Note that at this point I'm still keeping `compute_at_pos_` to mean the allocation point. We may want to rename it to `store_at_pos_`.

I also added several functions to `TensorView`. Namely:

- `computeWith(TensorView* consumer, int consumer_pos, bool best_effort)` to do the computeWith setting with some error checking
- `hasComputedWith` to query if this tensor is computed with any consumer
- `isComputedWith(TensorView* consumer)` to query if it's computed with `consumer`
- `getMaxComputeAtPosition()` to get the maximum of the computeAt and computeWith positions
- `getMaybeComputeAtPosition(TensorView* consumer)` to get the position where this tensor is computed from the perspective of the given consumer. If this tensor is computed with the consumer, it returns the compute-with position. Otherwise, it returns the compute-at position, which effectively means the store-at position.

The actual implementation of `computeWith` is mostly borrowed from `inlineAt`. Unlike `computeAt` but similar to `inlineAt`, it does not transform itself nor the consumer; it just sets the computeWith position and the consumer tensors it's computed with. Unlike `inlineAt`, it does not need the constraint on allocation of persistent buffers, since it does not change allocation points, which means we can skip building the unmappable domains using `ComputeAtRootDomainMap`.

For concrete examples of how it's used, see the new tests in `test_gpu_computed_with.cpp`. In particular, `FusionComputeWith7` reproduces the original motivating case of this transformation.

Some more notes:

- The expression sorting is done twice when `computeWith` is used.
- I went through existing uses of `getComputeAtPosition()` and changed them to `getMaxComputeAtPosition()` or `getMaybeComputeAtPosition()` when I thought necessary. I'm not super confident, but nothing seems broken.
- An invalid computeWith setting is currently only detected during lowering; see `FusionComputeWith4` for a concrete example. I believe this error check would effectively require almost the same analysis as the expression sorting, and in fact the test fails at the expression sorting. Ideally, an invalid computeWith like this should be detected at the time a tensor is computed with, not when its fusion is lowered, but at this point I don't see that as a must for making progress on the outer persistent problem.