
JIT: limitations in hoisting (loop invariant code motion) #35735

@AndyAyersMS


Have been looking into #13811 and have found that the current implementation of loop invariant code motion has some awkward limitations.

In particular, if the invariant computations are distributed across statements connected by temps, only the first computation in the chain gets hoisted. In the example from #13811, the invariant chain was:

         Vector128<byte> result = CreateScalarUnsafe(value);
         return Avx2.BroadcastScalarToVector128(result);

where value was constant. This ended up in a loop after some inlining. Only the CreateScalarUnsafe gets hoisted.

Note that the chains can involve arbitrary computations and more than two statements.

When hoisting, we walk statement by statement looking for hoistable subtrees. Local assignments are not considered hoistable -- only their right-hand sides. If we hoist a tree, we produce an unconsumed copy in the preheader and let CSE come along later and clean things up.

When the analysis gets to the second statement in a dependent chain, it sees the def for the local conveying the value from the first statement as loop varying, and so does not hoist.
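To make the blocking concrete, here is a toy model in Python (not JIT code; `Stmt` and `hoist_naive` are made-up names, and the real analysis works on value numbers and SSA rather than local names). It only hoists a right-hand side when every operand is a constant or a local with no def in the loop, which is roughly the behavior described above:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stmt:
    dest: str    # local written by this statement
    op: str      # operation name (opaque to the model)
    args: tuple  # operand names: locals or constants


def hoist_naive(loop_body, constants):
    """Return the statements whose right-hand side would be copied to the preheader."""
    # The assignments themselves stay in the loop, so every dest counts as loop-defined.
    defined_in_loop = {s.dest for s in loop_body}
    hoisted = []
    for s in loop_body:
        # An operand is invariant if it is a constant or a local with no def in the loop.
        if all(a in constants or a not in defined_in_loop for a in s.args):
            hoisted.append(s)
    return hoisted


# The two-statement chain from #13811, with `value` constant.
body = [
    Stmt("result", "CreateScalarUnsafe", ("value",)),
    Stmt("ret", "BroadcastScalarToVector128", ("result",)),
]
hoisted = hoist_naive(body, constants={"value"})
print([s.op for s in hoisted])  # ['CreateScalarUnsafe']
```

The second statement is rejected only because `result` still has a def inside the loop, even though the value it carries is invariant.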

We could try fixing this in a variety of ways:

  • forward substitution might be able to glue together trees connected by single-def, single-use temps; however, it is a big hammer, potentially tricky to get right, and costly to run in full generality
  • we could try and fuse these trees in the importer, say if we see back to back stloc/ldloc and no other references to the local
  • we could fix hoisting to handle this case, with a few options:
    • we could check if the subtree's VN is already hoisted, and so effectively do forward sub for the temp -- then let CSE clean all this up like we do now; this would potentially end up with quadratic amounts of cloning, though in practice, it might be acceptable;
    • we could hoist assignments; this requires some care and rewiring of SSA which might be risky
    • we could introduce new temps and/or modify the unconsumed hoisted tree to write to a temp, and use that to propagate the hoisted value from the first clone to second clone.
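A rough sketch of the first hoisting option (check whether the temp's value is already hoisted, then effectively forward-substitute it), again as a Python toy model rather than JIT code. The names `hoist_with_vn` and `tree_size` are hypothetical, the dictionary keyed by local name stands in for the VN check, and the model assumes each temp has a single def. It also shows where the quadratic cloning comes from:

```python
def hoist_with_vn(loop_body, constants):
    """loop_body: ordered list of (dest, op, args) tuples.
    Returns the expression trees cloned into the preheader."""
    defined_in_loop = {dest for dest, _, _ in loop_body}
    hoisted_tree = {}  # local -> hoisted tree that computes its value
    preheader = []
    for dest, op, args in loop_body:
        operands = []
        for a in args:
            if a in hoisted_tree:
                # Value already hoisted: substitute the hoisted tree for the temp.
                operands.append(hoisted_tree[a])
            elif a in constants or a not in defined_in_loop:
                operands.append(a)
            else:
                break  # genuinely loop-varying operand, give up on this statement
        else:
            tree = (op, *operands)
            hoisted_tree[dest] = tree
            preheader.append(tree)  # unconsumed copy; CSE is expected to clean up
    return preheader


def tree_size(tree):
    """Count operator nodes in a cloned tree (leaves are plain strings)."""
    if isinstance(tree, str):
        return 0
    return 1 + sum(tree_size(t) for t in tree[1:])


# A dependent chain of n statements: t0 = op0(value), t1 = op1(t0), ...
n = 4
chain = [(f"t{i}", f"op{i}", ("value",) if i == 0 else (f"t{i-1}",)) for i in range(n)]
clones = hoist_with_vn(chain, constants={"value"})
print(sum(tree_size(t) for t in clones))  # 10 = 1 + 2 + 3 + 4
```

Each link re-clones everything before it, so a chain of n statements produces n(n+1)/2 operator nodes in the preheader before CSE runs. For short chains that may well be tolerable.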

I am trying to assess how often we see this. It is a bit tricky: while I can spot the second link being blocked, I can't easily tell how long the chains are, so anything beyond that is harder to spot.

A rough guess, based on some crude prototyping, is around 2700 hoistable expressions that are second links in the usual FX diff set. There are 152 in the crossgen of SPC (System.Private.CoreLib), including some sort and span methods.

I'm encouraged enough that I will build a more realistic prototype.

category:cq
theme:loop-opt
skill-level:expert
cost:large

Labels: area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI)