Separate the class hierarchies for Fusion IR and Kernel IR #428
Conversation
Co-authored-by: Christian Sarofeen <[email protected]>
Fixes #230: removing WAR of contig flag for broadcasting; removing unnecessary tests for the WAR
Add an LSTM cell C++ test for convenience.
Removing graph copy from the critical code path; cache hasReduction result
Splits the origin (definition) links between Fusion IR and Kernel IR. This will allow moving the nodes into different containers (as well as cleaning up parts which are not really needed for the Kernel IR, e.g. cloning). Also fixes isConstScalar() and a couple of build warnings in kernel_cache.cpp
Fixes #305: environment variables to disable FMA and to specify the optimization level for JIT compilation
Removing support for cloning Kernel IR nodes, which is not needed today.
Kernel IR expressions must call Fusion::registerLoweredExpr() instead of Fusion::registerExpr()
* Add an IRPrinter handler for kir::TensorView. This is considered a temporary workaround, as IRPrinter is meant to be exclusive to the Fusion IR.
* Add a comment
* Initial Dynamic Shared Memory
  * Check if shared memory usage is within limits for the current GPU
  * Gather buffers in a single pass
  * Use a single dynamic shared memory buffer for the reduction/broadcast workspace
  * Align dynamic shared memory by data type
  Co-authored-by: Ryan Spring <[email protected]>
An example of this error happens with tv4 of testGPU_FusionComputeAtMultiBCast.
* Add computeAt tests with minor cleanup
* Print names of IterDomains for better debugging experience
(#333) Add Executor method to compile from a string for debug usage. Fix Reduction Scheduler to have TI-level perf for FP16 inner dimension reductions. Fix tests to use randn() so large reductions aren't matching on inf.
Move IterVisitor derived classes from fusion.h to iter_visitor.h
Implement hasBlockBroadcast like hasGrid/BlockReduction; cache results of these functions in the executor during compilation. Improves average latency on LSTMCell from 77.5us to 20.5us.
Removing support for Kernel IR nodes from IrGraphGenerator
While our kernels handle dynamic input sizes, we are now caching kernel selection and launch parameters on static sizes. This improves kernel launch latency for repeated input sizes. The encoding from input array to a unique_id is done at the `GraphCache` level, where we record and encode every seen input set. We plumb the unique_id through the `FusionExecutorCache` and `FusionExecutor`, so we do not repeatedly infer launch parameters / cache entry selections.
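To make the encoding concrete, here is a minimal, self-contained sketch of the shape-to-unique_id idea described above. The names (InputsIdLookup, lookupId) are illustrative stand-ins, not the actual `GraphCache` implementation:

// Sketch of encoding input shapes to a stable unique_id (illustrative only).
#include <cstdint>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

class InputsIdLookup {
 public:
  // Returns the same id for the same sequence of input shapes, so repeated
  // shapes can reuse previously inferred launch parameters / cache entries.
  int64_t lookupId(const std::vector<std::vector<int64_t>>& input_sizes) {
    std::ostringstream key;
    for (const auto& sizes : input_sizes) {
      for (auto s : sizes) {
        key << s << ',';
      }
      key << ';';
    }
    auto it = encoding_.find(key.str());
    if (it != encoding_.end()) {
      return it->second;  // cache hit
    }
    const int64_t id = next_id_++;
    encoding_.emplace(key.str(), id);
    return id;  // first time we see this combination of sizes
  }

 private:
  std::unordered_map<std::string, int64_t> encoding_;
  int64_t next_id_ = 0;
};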
Changes overall look really good, nice work :-)
Only one significant comment in predicate compute.
return expr_eval;
}

StatefulExpressionEvaluator bindFusionInputs(
Should we start pulling out StatefulExpressionEvaluation, or, more generally, expression evaluation on the Fusion IR altogether?
I think that evaluation at the Fusion level is still valuable - creating the schedules is one use case where we need it.
We may be able to simplify a few things, but I tried to minimize the number of changes in this PR (the size of the PR makes this a bit hard to believe, I know).
auto already_concrete_val = getValue(value);

// TODO(kir): do we need this anymore?
Seems like we can pull stateful expression evaluation out soon. Looks like in the new implementation you've pulled this logic outside the evaluation class, which seems reasonable.
@@ -165,10 +165,6 @@ class TORCH_CUDA_API Statement : public NonCopyable, public PolymorphicBase {
 */
class TORCH_CUDA_API Val : public Statement {
 public:
virtual ~Val() = default;

Val() = delete;
Why remove this line?
We get the virtual destructor from Statement (actually from PolymorphicBase). And the default constructor is implicitly deleted since we have other user-declared constructors.
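To illustrate both points with simplified stand-in classes (not the actual fuser types): the virtual destructor is inherited from the polymorphic base, and declaring any other constructor already makes the class non-default-constructible.

#include <type_traits>

struct PolymorphicBase {
  virtual ~PolymorphicBase() = default;  // the one virtual destructor we need
};

struct Statement : PolymorphicBase {};

struct Val : Statement {
  explicit Val(int vtype) : vtype_(vtype) {}  // user-declared constructor
  int vtype_;
};

// The destructor is virtual without redeclaring it in Val...
static_assert(std::has_virtual_destructor<Val>::value, "inherited from base");
// ...and there is no implicit default constructor once another one is declared.
static_assert(!std::is_default_constructible<Val>::value, "no Val()");

int main() {
  Val v(1);
  Statement* s = &v;  // deleting through a base pointer would be safe
  (void)s;
  return 0;
}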
c10::optional<ValType> getValType() const override {
  return vtype_;
}

// Throws if no DataType is found. Vals must have a DataType
//
// TODO: why is this optional?
Simply because we're overriding the definition in Statement, which is inherited by Expr as well:
virtual c10::optional<ValType> getValType() const {
return c10::nullopt;
}
virtual c10::optional<DataType> getDataType() const {
return c10::nullopt;
}
virtual c10::optional<ExprType> getExprType() const {
return c10::nullopt;
}
I'm fine with changing this if you want to.
Makes sense, thanks for the explanation. It may be worth revisiting, but that's not in scope for this PR.
// dependency model to allow for initialization of reduction buffers. The only
// reason we can get away with this for now is because we don't use dependency
// analysis for the IR after we call this.
TensorView* unsafeClone() const;
Woo 👍
This is one of the highlights of this PR :)
pushBack(ir_builder_.create<kir::BinaryOp>(
    rop->getReductionOpType(), out, out, in));
// TODO(kir): this breaks our "SSA" form
pushBack(ir_builder_.create<kir::BinaryOp>(rop->operation(), out, out, in));
Do you want to do this only at printing, leaving it as a reduction op until then?
That would be one option. It doesn't break anything at this point, and I'd like to revisit it in a follow-up iteration.
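As a side note, here is a tiny standalone illustration (plain C++, not generated fuser code) of the "breaks SSA" remark in the diff above: lowering the reduction to a binary op of the form out = out op in means the output value is both read and written, i.e. assigned more than once.

#include <iostream>

int main() {
  const int n = 8;
  float in[n] = {1, 2, 3, 4, 5, 6, 7, 8};

  float out = 0.0f;  // reduction init value
  for (int i = 0; i < n; ++i) {
    out = out + in[i];  // 'out' is both an input and an output of the op
  }

  std::cout << "sum = " << out << "\n";  // prints 36
  return 0;
}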
@@ -21,19 +23,20 @@ LoopNestGenerator::LoopNestGenerator(
    ThreadPredicateMap& thread_predicates,
are we no longer using these?
yep, no longer needed (it was used only for unsafe clone). I'll remove the parameter too, thanks for pointing it out.
}
return expr->as<kir::ForLoop>()->iter_domain()->getParallelType() ==
    ParallelType::Unroll;
// TODO(kir): revisit, is it really needed?
Looks like this wouldn't be hard to remove.
Indeed, in the final version there are no more uses so I'm removing it.
// find the first (and only) TensorView output
//
// TODO(kir): same question as ir_utils::getTvOutput():
// why do we assume a single TV output?
There are implicit assumptions in the approach we use that exprs only have one output TV. Haven't come across any instances yet where there would have to be multiple output TVs.
Thinking about it, I'm not sure we have any exprs with more than one output.
@@ -131,20 +154,6 @@ kir::Bool* PredicateCompute::getInlinePredicate(
auto root_indices = pred_inds.first;
bool use_maybe_rfactor = pred_inds.second;

if (out_tv->getMemoryType() == MemoryType::Local && out_tv->hasReduction() &&
Did this logic move somewhere? This is trying to catch instances where we're initializing reduction buffers in global and shared memory. We don't want to generate predicates in this case.
This seems strange indeed. I can't remember the reason; it could've been accidental (I was making deeper changes in this file).
Added back, thanks.
I agree that we should aim for consistency. I don't think my previous reply was very clear though, sorry: the different vocabulary in this iteration is intentional, since the semantics differ. I think it's worth discussing options for consolidating the "dispatch" implementation, but I'd prefer to do it separately from this PR (which I hope illustrates a simpler alternative that may make sense for the Fusion IR too).
OK, that makes sense to me.
Fixed, thankfully both you and @naoyam caught it!
LGTM. Some concerns on naming remain, but they can be addressed in the future if really necessary.
This PR introduces a hard split between the Fusion IR and the Kernel IR: each form has a dedicated class hierarchy. This means that we're free to specialize and evolve each IR without having to worry about the internal details of the "other side".
Separate class hierarchies also make the C++ static type system work for us: accidental mixes would be detected early, at compile time.
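A minimal, self-contained sketch of that compile-time benefit (simplified stand-in types, not the actual fuser classes): a visitor written against the Kernel IR hierarchy simply cannot be handed a Fusion IR node.

#include <iostream>
#include <memory>
#include <vector>

namespace fusion_ir {
// Stand-in for a Fusion IR node type.
struct TensorView {};
} // namespace fusion_ir

namespace kir {

struct TensorView;
struct ForLoop;

// Visitor defined only over Kernel IR node types.
struct IrVisitor {
  virtual ~IrVisitor() = default;
  virtual void visit(TensorView&) {}
  virtual void visit(ForLoop&) {}
};

// Base class of the (separate) Kernel IR hierarchy.
struct Node {
  virtual ~Node() = default;
  virtual void accept(IrVisitor& visitor) = 0;
};

struct TensorView : Node {
  void accept(IrVisitor& visitor) override { visitor.visit(*this); }
};

struct ForLoop : Node {
  void accept(IrVisitor& visitor) override { visitor.visit(*this); }
};

} // namespace kir

// Example Kernel IR visitor.
struct Printer : kir::IrVisitor {
  void visit(kir::TensorView&) override { std::cout << "kir::TensorView\n"; }
  void visit(kir::ForLoop&) override { std::cout << "kir::ForLoop\n"; }
};

int main() {
  std::vector<std::unique_ptr<kir::Node>> kernel_ir;
  kernel_ir.push_back(std::make_unique<kir::TensorView>());
  kernel_ir.push_back(std::make_unique<kir::ForLoop>());

  Printer printer;
  for (auto& node : kernel_ir) {
    node->accept(printer);
  }

  // fusion_ir::TensorView fusion_tv;
  // printer.visit(fusion_tv);  // would not compile: the hierarchies are separate
  return 0;
}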
The PR touches a lot of code since the new types triggered a cascading set of changes. A lot of the changes are simple, but there are a few notable differences:
* The Kernel IR is owned by the Kernel object, and with a few minor details (`kir::TensorView::fuserTv`) it is largely decoupled from the Fusion IR
* After the initial lowering pass (`LoopNestGenerator::loweredExprs`), everything is Kernel IR
* No more `TensorView::unsafeClone()`. Replaced with a bit smaller hack.
* Dedicated Kernel IR visitor (`kir::IrVisitor`)
* There's a dedicated expression evaluator for the Kernel IR (`kir::ExpressionEvaluator`) - see the sketch after this list
* `GpuLower::lowerExpr()` can be used to automatically lower a Fusion IR expression node
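As a rough illustration of the dedicated Kernel IR expression evaluator mentioned in the list above, here is a simplified, self-contained sketch; the types and method names are stand-ins, and the real kir::ExpressionEvaluator API may differ.

#include <cstdint>
#include <iostream>
#include <optional>
#include <unordered_map>

// Simplified stand-ins for Kernel IR values and expressions (not the real classes).
struct Val {
  int id;
};

struct Add {
  const Val* lhs;
  const Val* rhs;
};

// Minimal evaluator: bind concrete values to symbolic Vals, then evaluate.
class ExpressionEvaluator {
 public:
  void bind(const Val* value, int64_t concrete) {
    bindings_[value] = concrete;
  }

  std::optional<int64_t> evaluate(const Val* value) const {
    auto it = bindings_.find(value);
    if (it == bindings_.end()) {
      return std::nullopt;  // unknown until runtime
    }
    return it->second;
  }

  std::optional<int64_t> evaluate(const Add& add) const {
    auto lhs = evaluate(add.lhs);
    auto rhs = evaluate(add.rhs);
    if (!lhs || !rhs) {
      return std::nullopt;
    }
    return *lhs + *rhs;
  }

 private:
  std::unordered_map<const Val*, int64_t> bindings_;
};

int main() {
  Val extent0{0};
  Val extent1{1};
  Add total{&extent0, &extent1};

  ExpressionEvaluator evaluator;
  evaluator.bind(&extent0, 128);
  evaluator.bind(&extent1, 64);

  if (auto result = evaluator.evaluate(total)) {
    std::cout << "total extent = " << *result << "\n";  // prints 192
  }
  return 0;
}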