Conversation

@tlemo tlemo commented Oct 16, 2020

This PR introduces a hard split between the Fusion IR and the Kernel IR: each form has a dedicated class hierarchy. This means
that we're free to specialize and evolve each IR without having to worry about the internal details of the "other side".

Separate class hierarchies also make the C++ static type system work for us: accidental mixes are detected early, at compile time (see the sketch after the list below).

The PR touches a lot of code since the new types triggered a cascading set of changes. A lot of the changes are simple, but there are a few notable differences:

  • The Kernel IR is owned by the Kernel object and, aside from a few minor details (kir::TensorView::fuserTv), it is largely decoupled from the Fusion IR

  • After the initial lowering pass (LoopNestGenerator::loweredExprs), everything is Kernel IR

  • No more `TensorView::unsafeClone()`; replaced with a somewhat smaller hack

  • Dedicated Kernel IR visitor (kir::IrVisitor)

  • There's a dedicated expression evaluator for the Kernel IR (kir::ExpressionEvaluator)

  • GpuLower::lowerExpr() can be used to automatically lower a Fusion IR expression node
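
A rough sketch of what the hard split looks like in practice. This is illustrative only: the namespace names are made up for the sketch, the class bodies elide almost everything the real nodes carry, and only the kir::TensorView::fuserTv back-reference is taken from the PR itself.

  // Fusion IR and Kernel IR as separate hierarchies: code that expects a
  // Kernel IR node can no longer accidentally receive a Fusion IR node.
  namespace fusion_ir {
  class Val { /* symbolic value used while building and scheduling a fusion */ };
  class TensorView : public Val { /* ... */ };
  } // namespace fusion_ir

  namespace kir {
  class Val { /* value owned by the Kernel, produced by lowering */ };
  class TensorView : public Val {
   public:
    // Back-reference into the Fusion IR (the kir::TensorView::fuserTv detail above)
    fusion_ir::TensorView* fuserTv() const { return fuser_tv_; }
   private:
    fusion_ir::TensorView* fuser_tv_ = nullptr;
  };
  } // namespace kir

  // Passing a fusion_ir::TensorView* here is now a compile-time error rather
  // than a silent mix of the two IRs.
  void emitLoad(kir::TensorView* tv);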

jjsjann123 and others added 30 commits August 18, 2020 11:53
Co-authored-by: Christian Sarofeen <[email protected]>
* Fix #306

* Reenable smem block gemm cache test.
Fixes #230
removing WAR of contig flag for broadcasting
removing unnecessary tests for the WAR
Add an lstm cell c++ test for convenience.
removing graph copy from critical code path;
cache hasReduction result
Splits the origin (definition) links between Fusion IR and Kernel IR. This will allow moving the nodes into different containers (as well as cleaning up parts which are not really needed for the Kernel IR, ex. cloning)

Also fixing isConstScalar() and a couple of build warnings in kernel_cache.cpp
Fixes #305
sys env to disable fma and specify optimization level for jit compilation
Removing support for cloning Kernel IR nodes, which is not needed today.
Kernel IR expressions must call Fusion::registerLoweredExpr() instead of Fusion::registerExpr()
* Add an IRPrinter handler for kir::TensorView

This is considered a temporary workaround as IRPrinter is meant to be
exclusive to the fusion IR.

* Add a comment
* Initial Dynamic Shared Memory

Check if shared memory usage is within limits for current GPU

Gather buffers in a single pass

Use single dynamic shared memory for reduction/broadcast workspace

Align dynamic shared memory by data type

Co-authored-by: Ryan Spring <[email protected]>
An example of this error happens with tv4 of
testGPU_FusionComputeAtMultiBCast.
* Add computeAt tests with minor cleanup

* Print names of IterDomains for better debugging experience
(#333)

Add Executor method to compile from a string for debug usage.  Fix Reduction Scheduler to have TI level perf for FP16 inner dimension reductions. Fix tests to use randn() so large reductions aren't matching on inf.
Move IterVisitor derived classes from fusion.h to iter_visitor.h
Implement hasBlockBroadcast like hasGrid/BlockReduction, cache results of these functions in executor during compilation. Improves average latency on LSTMCell 77.5us -> 20.5us.
Removing support for Kernel IR nodes from IrGraphGenerator
While our kernels handle dynamic input sizes, we are now caching kernel selection and launch parameters on static sizes. This improves kernel launch latency for repeated input sizes.

The encoding from the input array to a unique_id is done at the `GraphCache` level, where we record and encode every set of inputs seen. We plumb the unique_id through `FusionExecutorCache` and `FusionExecutor`, so we do not repeatedly infer launch parameters / cache entry selections.
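
As a rough illustration of the idea (hypothetical code, not the actual GraphCache / FusionExecutorCache implementation): the concrete input sizes are encoded into a key, the key is mapped to a stable unique_id, and that id is used to look up previously inferred launch parameters.

  // Hypothetical sketch: map repeated input-size combinations to a unique_id
  // and reuse cached launch parameters instead of re-inferring them.
  #include <cstdint>
  #include <string>
  #include <unordered_map>
  #include <vector>

  struct LaunchParams { /* grid/block dims, dynamic smem size, ... */ };

  class InputsIdLookup {
   public:
    int64_t lookupId(const std::vector<std::vector<int64_t>>& input_sizes) {
      // Encode all input sizes into a single string key.
      std::string key;
      for (const auto& sizes : input_sizes) {
        for (int64_t s : sizes) {
          key += std::to_string(s);
          key += ',';
        }
        key += ';';
      }
      auto it = encoding_.find(key);
      if (it != encoding_.end()) {
        return it->second; // seen these sizes before: reuse the id
      }
      const int64_t id = next_id_++;
      encoding_.emplace(std::move(key), id);
      return id;
    }

   private:
    std::unordered_map<std::string, int64_t> encoding_;
    int64_t next_id_ = 0;
  };

  // The executor side can then keep, e.g.,
  //   std::unordered_map<int64_t, LaunchParams> launch_params_by_id;
  // and skip launch-parameter inference whenever the id already has an entry.
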
Owner

@csarofeen csarofeen left a comment

Changes overall look really good, nice work :-)
Only one significant comment in predicate compute.

return expr_eval;
}

StatefulExpressionEvaluator bindFusionInputs(
Owner

Should we start pulling out StatefulExpressionEvaluator, or, more generically, expression evaluation on the Fusion IR altogether?

Collaborator Author

I think that evaluation at the Fusion level is still valuable - creating the schedules is one use-case where we need it.

We may be able to simplify a few things, but I tried to minimize the number of changes in this PR (the size of the PR makes this a bit hard to believe, I know)

auto already_concrete_val = getValue(value);

// TODO(kir): do we need this anymore?
Owner

Seems like we can pull stateful expression evaluation out soon. Looks like in the new implementation you've pulled this logic outside the evaluation class which seems reasonable.

@@ -165,10 +165,6 @@ class TORCH_CUDA_API Statement : public NonCopyable, public PolymorphicBase {
*/
class TORCH_CUDA_API Val : public Statement {
public:
virtual ~Val() = default;

Val() = delete;
Owner

Why remove this line?

Collaborator Author

We get the virtual destructor from Statement (actually from PolymorphicBase). And the default constructor is implicitly deleted since we have other user-declared constructors.
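
For reference, a small standalone example of both language rules (illustrative only, not the actual Statement/Val code):

  #include <memory>

  struct PolymorphicBase {
    virtual ~PolymorphicBase() = default; // virtual destructor declared once, here
  };

  struct Statement : PolymorphicBase {};

  struct Val : Statement {
    explicit Val(int vtype) : vtype_(vtype) {} // a user-declared constructor...
    int vtype_;
  };

  int main() {
    // Val v; // ...so no implicit default constructor is generated: this line
    //        // would not compile
    std::unique_ptr<Statement> s = std::make_unique<Val>(1);
    // Deleting through the base pointer is safe because the destructor
    // inherited from PolymorphicBase is virtual.
  }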

c10::optional<ValType> getValType() const override {
return vtype_;
}

// Throws if no DataType is found. Vals must have a DataType
//
// TODO: why is this optional?
Owner

Simply because we're overriding the definition in Statement, which is inherited by Expr as well:

  virtual c10::optional<ValType> getValType() const {
    return c10::nullopt;
  }
  virtual c10::optional<DataType> getDataType() const {
    return c10::nullopt;
  }
  virtual c10::optional<ExprType> getExprType() const {
    return c10::nullopt;
  }

I'm fine with changing this if you want to.

Collaborator Author

Makes sense, thanks for the explanation. It may be worth revisiting, but it's not in scope for this PR.

// dependency model to allow for initailziation of reduction buffers. The only
// reason we can get away with this for now is because we don't use dependency
// analysis for the IR after we call this.
TensorView* unsafeClone() const;
Owner

Woo 👍

Collaborator Author

This is one of the highlights of this PR :)

pushBack(ir_builder_.create<kir::BinaryOp>(
rop->getReductionOpType(), out, out, in));
// TODO(kir): this breaks our "SSA" form
pushBack(ir_builder_.create<kir::BinaryOp>(rop->operation(), out, out, in));
Owner

Do you want to do this only at printing, leaving it as a reduction op until then?

Collaborator Author

That would be one option. It doesn't break anything at this point, and I'd like to revisit it in a follow-up iteration.

@@ -21,19 +23,20 @@ LoopNestGenerator::LoopNestGenerator(
ThreadPredicateMap& thread_predicates,
Owner

are we no longer using these?

Collaborator Author

yep, no longer needed (it was used only for unsafe clone). I'll remove the parameter too, thanks for pointing it out.

}
return expr->as<kir::ForLoop>()->iter_domain()->getParallelType() ==
ParallelType::Unroll;
// TODO(kir): revisit, is it really needed?
Owner

Looks like this wouldn't be hard to remove.

Collaborator Author

Indeed, in the final version there are no more uses so I'm removing it.

// find the first (and only) TensorView output
//
// TODO(kir): same question as ir_utils::getTvOutput():
// why do we assume a single TV output?
Owner

There are implicit assumptions in the approach we use that exprs only have one output TV. I haven't come across any instances yet where there would have to be multiple output TVs.

Owner

Thinking about it, I'm not sure we have any exprs with more than one output.

@@ -131,20 +154,6 @@ kir::Bool* PredicateCompute::getInlinePredicate(
auto root_indices = pred_inds.first;
bool use_maybe_rfactor = pred_inds.second;

if (out_tv->getMemoryType() == MemoryType::Local && out_tv->hasReduction() &&
Owner

Did this logic move somewhere? This is trying to catch instances where we're initializing reduction buffers in global and shared memory. We don't want to generate predicates in this case.

Collaborator Author

This seems strange indeed. I can't remember the reason; it could've been accidental (I was making deeper changes in this file).

Added back, thanks.


tlemo commented Oct 20, 2020

> Yes, I was referring to IrVisitor. I understand your point, but at the same time I'd prefer consistency within our codebase.

I agree that we should aim for consistency. I don't think my previous reply was very clear though, sorry: the different vocabulary in this iteration is intentional. The semantics of kir::IrVisitor and the old xxxDispatch are different, so even if we wanted to keep the established vocabulary, using "handle" instead of "visit" for kir::IrVisitor would lead to deeper and more subtle confusion. For example, calling "visit" directly (as in the Dispatch implementation) would be a bug, since it would bypass the double dispatch. Another difference is the handling of abstract base classes.

I think it's worth discussing options for consolidating the "dispatch" implementation, but I'd prefer to do that separately from this PR (which, I hope, illustrates a simpler alternative that may make sense for the Fusion IR too).
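
To make the double-dispatch point concrete, here is a minimal generic visitor sketch (illustrative only, not the actual kir::IrVisitor interface): the entry point is the node's accept()-style dispatch, and calling a type-specific visit overload directly would skip that step.

  #include <iostream>

  class ForLoop;
  class BinaryOp;

  class Visitor {
   public:
    virtual ~Visitor() = default;
    virtual void visit(const ForLoop*) {}
    virtual void visit(const BinaryOp*) {}
  };

  class Node {
   public:
    virtual ~Node() = default;
    // Dispatch entry point: the node's dynamic type selects the visit overload.
    virtual void accept(Visitor* v) const = 0;
  };

  class ForLoop : public Node {
   public:
    void accept(Visitor* v) const override { v->visit(this); }
  };

  class BinaryOp : public Node {
   public:
    void accept(Visitor* v) const override { v->visit(this); }
  };

  class Printer : public Visitor {
   public:
    void visit(const ForLoop*) override { std::cout << "for-loop\n"; }
    void visit(const BinaryOp*) override { std::cout << "binary-op\n"; }
  };

  int main() {
    Printer p;
    const Node* n = new ForLoop();
    n->accept(&p); // double dispatch ends up in Printer::visit(const ForLoop*)
    // Calling a concrete visit() overload directly would bypass accept() and
    // therefore the dynamic-type dispatch.
    delete n;
  }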


naoyam commented Oct 20, 2020

OK, that makes sense to me.


tlemo commented Oct 20, 2020

> Only one significant comment in predicate compute.

Fixed, thankfully both you and @naoyam caught it!


naoyam commented Oct 21, 2020

LGTM. Some concerns on naming remain, but they can be addressed in the future if really necessary.

@tlemo tlemo force-pushed the kernel_ir_part9 branch 2 times, most recently from 6d549fd to 2382b7b Compare October 21, 2020 20:51
@tlemo tlemo changed the base branch from 20_9_25_devel to 20_10_20_devel October 21, 2020 21:31
@tlemo tlemo merged commit 4e9a55c into 20_10_20_devel Oct 22, 2020
@tlemo tlemo deleted the kernel_ir_part9 branch October 22, 2020 00:49