JIT: limitations in hoisting (loop invariant code motion)

I have been looking into #13811 and have found that the current implementation of loop invariant code motion has some awkward limitations.

In particular, if the invariant computations are distributed across statements connected by temps, only the first computation in the chain ends up getting hoisted. In the example from #13811, the invariant chain was:
```C#
         Vector128<byte> result = CreateScalarUnsafe(value);
         return Avx2.BroadcastScalarToVector128(result);
```
where `value` was constant. This ended up in a loop after some inlining. Only the `CreateScalarUnsafe` gets hoisted.
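For illustration, here is a minimal sketch of the kind of source that could produce this shape. The names `HoistingShape`, `Broadcast` and `CountEqual` are hypothetical and are not taken from #13811; the non-AVX2 fallback is only there so the sketch runs everywhere. The assumption is that `Broadcast` is small enough to be inlined, which leaves the two-statement chain inside the loop body.

```C#
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class HoistingShape
{
    // Hypothetical helper; assumed small enough for the JIT to inline.
    static Vector128<byte> Broadcast(byte value)
    {
        if (!Avx2.IsSupported)
            return Vector128.Create(value); // portable fallback, not part of the chain

        // The two-link chain: the temp 'result' connects the statements.
        Vector128<byte> result = Vector128.CreateScalarUnsafe(value);
        return Avx2.BroadcastScalarToVector128(result);
    }

    public static int CountEqual(Vector128<byte>[] data)
    {
        int count = 0;
        for (int i = 0; i < data.Length; i++)
        {
            // Constant argument: after inlining, both links are loop invariant.
            Vector128<byte> needle = Broadcast(42);
            if (data[i].Equals(needle))
                count++;
        }
        return count;
    }
}
```

Both statements in `Broadcast` depend only on the constant, so ideally the whole chain would be computed once before the loop.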

Note that the chains can involve arbitrary computation and can span more than two statements.

When hoisting, we walk statement by statement looking for hoistable subtrees. Local assignments are not considered hoistable -- only their right hand sides. If we hoist a tree, we produce an unconsumed copy in the preheader and let CSE come along later and clean things up.

When the analysis gets to the second statement in a dependent chain, it sees the def for the local conveying the value from the first statement as loop varying, and so does not hoist.
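As a rough source-level analogy (not the JIT's actual IR, and with hypothetical names throughout), the effect can be pictured with plain integer arithmetic. `FirstLinkOnly` mimics the outcome described above, where the first link is computed once but the temp's def stays in the loop and blocks the second link. `WholeChain` is what we would like to get instead.

```C#
using System;

static class ChainAnalog
{
    // Analogy for the current outcome: only the first link leaves the loop.
    public static int FirstLinkOnly(int[] data, int a, int b, int c)
    {
        int hoisted = a * b;       // first link, computed once before the loop
        int sum = 0;
        for (int i = 0; i < data.Length; i++)
        {
            int t = hoisted;       // the def of t is still inside the loop
            int u = t + c;         // second link: recomputed on every iteration
            sum += data[i] ^ u;
        }
        return sum;
    }

    // Analogy for the desired outcome: the whole invariant chain is hoisted.
    public static int WholeChain(int[] data, int a, int b, int c)
    {
        int u = a * b + c;         // both links computed once
        int sum = 0;
        for (int i = 0; i < data.Length; i++)
        {
            sum += data[i] ^ u;
        }
        return sum;
    }
}
```

The two methods compute the same result; the only difference is how much invariant work remains in the loop body.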

We could try fixing this in a variety of ways:
* forward substitution might be able to glue together trees connected by single-def, single-use temps; however, it is a big hammer, potentially tricky to get right, and costly to run in full generality
* we could try to fuse these trees in the importer, say if we see back-to-back stloc/ldloc and no other references to the local
* we could fix hoisting to handle this case, with a few options:
  * we could check if the subtree's VN is already hoisted, and so effectively do forward sub for the temp -- then let CSE clean all this up like we do now; this would potentially end up with quadratic amounts of cloning, though in practice it might be acceptable
  * we could hoist assignments; this requires some care and rewiring of SSA, which might be risky
  * we could introduce new temps and/or modify the unconsumed hoisted tree to write to a temp, and use that to propagate the hoisted value from the first clone to the second clone
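To make the difference concrete, here is a toy model of the statement walk and of the first hoisting option (treating a temp as invariant once its defining tree has been hoisted). This is a sketch under heavy assumptions: `Stmt`, `ToyHoister` and the `seeThroughHoistedTemps` flag are invented for illustration, operands are plain names rather than trees with value numbers, and nothing here reflects the JIT's real data structures.

```C#
using System;
using System.Collections.Generic;
using System.Linq;

// Toy statement "Target = f(Operands)"; not the JIT's IR.
sealed class Stmt
{
    public string Target { get; }
    public string[] Operands { get; }

    public Stmt(string target, params string[] operands)
    {
        Target = target;
        Operands = operands;
    }
}

static class ToyHoister
{
    // Returns the targets whose right-hand sides would be hoisted.
    public static List<string> Hoist(List<Stmt> loopBody, bool seeThroughHoistedTemps)
    {
        // Any local assigned inside the loop is treated as loop varying.
        var definedInLoop = new HashSet<string>(loopBody.Select(s => s.Target));
        // Single-def temps whose right-hand side has already been hoisted.
        var hoistedTemps = new HashSet<string>();
        var hoisted = new List<string>();

        foreach (Stmt s in loopBody)
        {
            bool invariant = s.Operands.All(o =>
                !definedInLoop.Contains(o) ||
                (seeThroughHoistedTemps && hoistedTemps.Contains(o)));
            if (!invariant)
                continue;

            hoisted.Add(s.Target);
            // Only a temp with exactly one def in the loop can stand in for its value.
            if (loopBody.Count(d => d.Target == s.Target) == 1)
                hoistedTemps.Add(s.Target);
        }
        return hoisted;
    }
}
```

For a loop body `t1 = f(value); t2 = g(t1); elem = load(data, i); t3 = h(t2, elem); i = add(i, one)`, the walk with `seeThroughHoistedTemps: false` hoists only `t1`, matching the limitation described here. With `seeThroughHoistedTemps: true` it hoists `t1` and `t2`, while `t3` stays in the loop because `elem` really is loop varying.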

I am trying to assess how often we see this. It is a bit tricky: while I can spot the second link being blocked, I can't easily tell how long the chains are, so anything beyond the second link is harder to spot.

A rough guess, based on some crude prototyping, is that there are around 2700 hoistable expressions that are second links in the usual FX diff set. There are 152 in the crossgen of SPC, including some in sort and span methods.

I'm encouraged enough that I will build a more realistic prototype.

category:cq
theme:loop-opt
skill-level:expert
cost:large

JIT: limitations in hoisting (loop invariant code motion) #35735
