Conversation

@tlemo tlemo commented Oct 16, 2020

This PR introduces a hard split between the Fusion IR and the Kernel IR: each form has a dedicated class hierarchy. This means
that we're free to specialize and evolve each IR without having to worry about the internal details of the "other side".

Separate class hierarchies also make the C++ static type system work for us: accidental mixes are detected early, at compile time (see the sketch after the list below).

The PR touches a lot of code since the new types triggered a cascading set of changes. A lot of the changes are simple, but there are a few notable differences:

  • The Kernel IR is owned by the Kernel object and, aside from a few minor details (kir::TensorView::fuserTv), it is largely decoupled from the Fusion IR

  • After the initial lowering pass (LoopNestGenerator::loweredExprs), everything is Kernel IR

  • No more `TensorView::unsafeClone()`; replaced with a somewhat smaller hack

  • Dedicated Kernel IR visitor (kir::IrVisitor)

  • There's a dedicated expression evaluator for the Kernel IR (kir::ExpressionEvaluator)

  • GpuLower::lowerExpr() can be used to automatically lower a Fusion IR expression node
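
A rough sketch of what the hard split looks like in practice. This is illustrative only: the namespace names are made up for the sketch, the class bodies elide almost everything the real nodes carry, and only the kir::TensorView::fuserTv back-reference is taken from the PR itself.

  // Fusion IR and Kernel IR as separate hierarchies: code that expects a
  // Kernel IR node can no longer accidentally receive a Fusion IR node.
  namespace fusion_ir {
  class Val { /* symbolic value used while building and scheduling a fusion */ };
  class TensorView : public Val { /* ... */ };
  } // namespace fusion_ir

  namespace kir {
  class Val { /* value owned by the Kernel, produced by lowering */ };
  class TensorView : public Val {
   public:
    // Back-reference into the Fusion IR (the kir::TensorView::fuserTv detail above)
    fusion_ir::TensorView* fuserTv() const { return fuser_tv_; }
   private:
    fusion_ir::TensorView* fuser_tv_ = nullptr;
  };
  } // namespace kir

  // Passing a fusion_ir::TensorView* here is now a compile-time error rather
  // than a silent mix of the two IRs.
  void emitLoad(kir::TensorView* tv);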

jjsjann123 and others added 30 commits August 18, 2020 11:53
Co-authored-by: Christian Sarofeen <[email protected]>
* Fix #306

* Reenable smem block gemm cache test.
Fixes #230
removing WAR of contig flag for broadcasting
removing unnecessary tests for the WAR
Add an lstm cell c++ test for convenience.
removing graph copy from critical code path;
cache hasReduction result
Splits the origin (definition) links between Fusion IR and Kernel IR. This will allow moving the nodes into different containers (as well as cleaning up parts which are not really needed for the Kernel IR, ex. cloning)

Also fixing isConstScalar() and a couple of build warnings in kernel_cache.cpp
Fixes #305
sys env to disable fma and specify optimization level for jit compilation
Removing support for cloning Kernel IR nodes, which is not needed today.
Kernel IR expressions must call Fusion::registerLoweredExpr() instead of Fusion::registerExpr()
* Add an IRPrinter handler for kir::TensorView

This is considered a temporary workaround as IRPrinter is meant to be
exclusive to the fusion IR.

* Add a comment
* Initial Dynamic Shared Memory

Check if shared memory usage is within limits for current GPU

Gather buffers in a single pass

Use single dynamic shared memory for reduction/broadcast workspace

Align dynamic shared memory by data type

Co-authored-by: Ryan Spring <[email protected]>
An example of this error happens with tv4 of
testGPU_FusionComputeAtMultiBCast.
* Add computeAt tests with minor cleanup

* Print names of IterDomains for better debugging experience
(#333)

Add Executor method to compile from a string for debug usage.  Fix Reduction Scheduler to have TI level perf for FP16 inner dimension reductions. Fix tests to use randn() so large reductions aren't matching on inf.
Move IterVisitor derived classes from fusion.h to iter_visitor.h
Implement hasBlockBroadcast like hasGrid/BlockReduction, cache results of these functions in executor during compilation. Improves average latency on LSTMCell 77.5us -> 20.5us.
Removing support for Kernel IR nodes from IrGraphGenerator
While our kernels handle dynamic input sizes, we are now caching kernel selection and launch parameters on static sizes. This improves kernel launch latency for repeated input sizes.

The encoding from the input array to a unique_id is done at the `GraphCache` level, where we record and encode every set of inputs seen. We plumb the unique_id through `FusionExecutorCache` and `FusionExecutor`, so we do not repeatedly infer launch parameters / cache entry selections.
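
As a rough illustration of the idea (hypothetical code, not the actual GraphCache / FusionExecutorCache implementation): the concrete input sizes are encoded into a key, the key is mapped to a stable unique_id, and that id is used to look up previously inferred launch parameters.

  // Hypothetical sketch: map repeated input-size combinations to a unique_id
  // and reuse cached launch parameters instead of re-inferring them.
  #include <cstdint>
  #include <string>
  #include <unordered_map>
  #include <vector>

  struct LaunchParams { /* grid/block dims, dynamic smem size, ... */ };

  class InputsIdLookup {
   public:
    int64_t lookupId(const std::vector<std::vector<int64_t>>& input_sizes) {
      // Encode all input sizes into a single string key.
      std::string key;
      for (const auto& sizes : input_sizes) {
        for (int64_t s : sizes) {
          key += std::to_string(s);
          key += ',';
        }
        key += ';';
      }
      auto it = encoding_.find(key);
      if (it != encoding_.end()) {
        return it->second; // seen these sizes before: reuse the id
      }
      const int64_t id = next_id_++;
      encoding_.emplace(std::move(key), id);
      return id;
    }

   private:
    std::unordered_map<std::string, int64_t> encoding_;
    int64_t next_id_ = 0;
  };

  // The executor side can then keep, e.g.,
  //   std::unordered_map<int64_t, LaunchParams> launch_params_by_id;
  // and skip launch-parameter inference whenever the id already has an entry.
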
Owner

@csarofeen csarofeen left a comment

Changes overall look really good, nice work :-)
Only one significant comment in predicate compute.

return expr_eval;
}

StatefulExpressionEvaluator bindFusionInputs(
Owner

Should we start pulling out StatefulExpressionEvaluator, or, more generically, expression evaluation on the Fusion IR altogether?

Collaborator Author

I think that evaluation at the Fusion level is still valuable - creating the schedules is one use-case where we need it.

We may be able to simplify a few things, but I tried to minimize the number of changes in this PR (the size of the PR makes this a bit hard to believe, I know)

auto already_concrete_val = getValue(value);

// TODO(kir): do we need this anymore?
Owner

Seems like we can pull stateful expression evaluation out soon. Looks like in the new implementation you've pulled this logic outside the evaluation class which seems reasonable.

@@ -165,10 +165,6 @@ class TORCH_CUDA_API Statement : public NonCopyable, public PolymorphicBase {
*/
class TORCH_CUDA_API Val : public Statement {
public:
virtual ~Val() = default;

Val() = delete;
Owner

Why remove this line?

Collaborator Author

We get the virtual destructor from Statement (actually from PolymorphicBase). And the default constructor is implicitly deleted since we have other user-declared constructors.
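
For reference, a small standalone example of both language rules (illustrative only, not the actual Statement/Val code):

  #include <memory>

  struct PolymorphicBase {
    virtual ~PolymorphicBase() = default; // virtual destructor declared once, here
  };

  struct Statement : PolymorphicBase {};

  struct Val : Statement {
    explicit Val(int vtype) : vtype_(vtype) {} // a user-declared constructor...
    int vtype_;
  };

  int main() {
    // Val v; // ...so no implicit default constructor is generated: this line
    //        // would not compile
    std::unique_ptr<Statement> s = std::make_unique<Val>(1);
    // Deleting through the base pointer is safe because the destructor
    // inherited from PolymorphicBase is virtual.
  }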

c10::optional<ValType> getValType() const override {
return vtype_;
}

// Throws if no DataType is found. Vals must have a DataType
//
// TODO: why is this optional?
Owner

Simply because we're overriding the definition in Statement, which is inherited by Expr as well:

  virtual c10::optional<ValType> getValType() const {
    return c10::nullopt;
  }
  virtual c10::optional<DataType> getDataType() const {
    return c10::nullopt;
  }
  virtual c10::optional<ExprType> getExprType() const {
    return c10::nullopt;
  }

I'm fine with changing this if you want to.

Collaborator Author

Makes sense, thanks for the explanation. It may be worth revisiting, but it's not in scope for this PR.

// dependency model to allow for initailziation of reduction buffers. The only
// reason we can get away with this for now is because we don't use dependency
// analysis for the IR after we call this.
TensorView* unsafeClone() const;
Owner

Woo 👍

Collaborator Author

This is one of the highlights of this PR :)

pushBack(ir_builder_.create<kir::BinaryOp>(
rop->getReductionOpType(), out, out, in));
// TODO(kir): this breaks our "SSA" form
pushBack(ir_builder_.create<kir::BinaryOp>(rop->operation(), out, out, in));
Owner

Do you want to do this only at printing, leaving it as a reduction op until then?

Collaborator Author

That would be one option. It doesn't break anything at this point, and I'd like to revisit it in a follow-up iteration.

@@ -21,19 +23,20 @@ LoopNestGenerator::LoopNestGenerator(
ThreadPredicateMap& thread_predicates,
Owner

are we no longer using these?

Collaborator Author

yep, no longer needed (it was used only for unsafe clone). I'll remove the parameter too, thanks for pointing it out.

}
return expr->as<kir::ForLoop>()->iter_domain()->getParallelType() ==
ParallelType::Unroll;
// TODO(kir): revisit, is it really needed?
Owner

Looks like this wouldn't be hard to remove.

Collaborator Author

Indeed, in the final version there are no more uses so I'm removing it.

// find the first (and only) TensorView output
//
// TODO(kir): same question as ir_utils::getTvOutput():
// why do we assume a single TV output?
Owner

There are implicit assumptions in the approach we use that exprs only have one output TV. I haven't come across any instances yet where there would have to be multiple output TVs.

Owner

Thinking about it, I'm not sure we have any exprs with more than one output.

@@ -131,20 +154,6 @@ kir::Bool* PredicateCompute::getInlinePredicate(
auto root_indices = pred_inds.first;
bool use_maybe_rfactor = pred_inds.second;

if (out_tv->getMemoryType() == MemoryType::Local && out_tv->hasReduction() &&
Owner

Did this logic move somewhere? This is trying to catch instances where we're initializing reduction buffers in global and shared memory. We don't want to generate predicates in this case.

Collaborator Author

This seems strange indeed. I can't remember the reason; it could've been accidental (I was making deeper changes in this file).

Added back, thanks.


tlemo commented Oct 20, 2020

> Yes, I was referring to IrVisitor. I understand your point, but at the same time I'd prefer consistency within our codebase.

I agree that we should aim for consistency. I don't think my previous reply was very clear though, sorry: the different vocabulary in this iteration is intentional. The semantics of kir::IrVisitor and the old xxxDispatch are different, so even if we wanted to keep the established vocabulary, using "handle" instead of "visit" for kir::IrVisitor would lead to deeper and more subtle confusion. For example, calling "visit" directly (as in the Dispatch implementation) would be a bug, since it would bypass the double dispatch. Another difference is the handling of abstract base classes.

I think it's worth discussing options for consolidating the "dispatch" implementation, but I'd prefer to do that separately from this PR (which, I hope, illustrates a simpler alternative that may make sense for the Fusion IR too).
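
To make the double-dispatch point concrete, here is a minimal generic visitor sketch (illustrative only, not the actual kir::IrVisitor interface): the entry point is the node's accept()-style dispatch, and calling a type-specific visit overload directly would skip that step.

  #include <iostream>

  class ForLoop;
  class BinaryOp;

  class Visitor {
   public:
    virtual ~Visitor() = default;
    virtual void visit(const ForLoop*) {}
    virtual void visit(const BinaryOp*) {}
  };

  class Node {
   public:
    virtual ~Node() = default;
    // Dispatch entry point: the node's dynamic type selects the visit overload.
    virtual void accept(Visitor* v) const = 0;
  };

  class ForLoop : public Node {
   public:
    void accept(Visitor* v) const override { v->visit(this); }
  };

  class BinaryOp : public Node {
   public:
    void accept(Visitor* v) const override { v->visit(this); }
  };

  class Printer : public Visitor {
   public:
    void visit(const ForLoop*) override { std::cout << "for-loop\n"; }
    void visit(const BinaryOp*) override { std::cout << "binary-op\n"; }
  };

  int main() {
    Printer p;
    const Node* n = new ForLoop();
    n->accept(&p); // double dispatch ends up in Printer::visit(const ForLoop*)
    // Calling a concrete visit() overload directly would bypass accept() and
    // therefore the dynamic-type dispatch.
    delete n;
  }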


naoyam commented Oct 20, 2020

OK, that makes sense to me.


tlemo commented Oct 20, 2020

> Only one significant comment in predicate compute.

Fixed, thankfully both you and @naoyam caught it!


naoyam commented Oct 21, 2020

LGTM. Some concerns on naming remain, but they can be addressed in the future if really necessary.

@tlemo tlemo force-pushed the kernel_ir_part9 branch 2 times, most recently from 6d549fd to 2382b7b Compare October 21, 2020 20:51
@tlemo tlemo changed the base branch from 20_9_25_devel to 20_10_20_devel October 21, 2020 21:31
@tlemo tlemo merged commit 4e9a55c into 20_10_20_devel Oct 22, 2020
@tlemo tlemo deleted the kernel_ir_part9 branch October 22, 2020 00:49