Description
Hi, several months ago I submitted a patch to implement LICM (#59194), but the performance improvement in that patch was not compelling. After some research we think we need to hoist the high-yield instructions (Load/Store/etc.) to achieve a positive performance improvement. Since these high-yield instructions usually cannot be speculatively executed, we may need to implement a full LICM with the following dependencies:
- LICM: hoist Load/Store/NilCheck, etc.
- Loop Rotation: transforms a while loop into a do-while loop; it creates a home for hoistable Values.
- LCSSA: Loop Closed SSA form; it ensures that all values defined inside the loop are only used within the loop. This significantly simplifies the implementation of Loop Rotation and is the basis for future loop optimizations such as loop unswitching and loop vectorization.
- Def-Use Utilities
- Type-based Alias Analysis: determines whether two memory operations may point to the same location, so we can safely hoist Load/Store, etc.
Overall, LCSSA opens the door to future loop optimizations, and LICM is expected to deliver attractive performance improvements on some benchmarks. Benchmark results (time/op; a negative delta is an improvement):
package | arch | baseline | opt | delta |
---|---|---|---|---|
unicode/utf16 | amd64 | 12.83n | 10.59n | -17.45% |
math/bits | yitian | 0.8228n | 0.7231n | -12.11% |
regexp/syntax | amd64 | 175.3n | 157.5n | -10.17% |
unicode/utf16 | yitian | 8.191n | 7.582n | -7.43% |
math/rand | amd64 | 54.91n | 50.90n | -7.31% |
expvar | yitian | 162.7n | 154.0n | -5.36% |
runtime/internal/atomic | yitian | 6.194n | 5.889n | -4.92% |
mime | amd64 | 48.96n | 46.73n | -4.57% |
encoding/binary | amd64 | 28.94n | 27.83n | -3.84% |
runtime/cgo | yitian | 801.9n | 771.6n | -3.79% |
encoding/binary | yitian | 23.65n | 22.76n | -3.75% |
sync/atomic | yitian | 4.748n | 4.577n | -3.61% |
archive/zip | amd64 | 326.9µ | 316.0µ | -3.33% |
time | yitian | 489.9n | 474.3n | -3.18% |
image | amd64 | 8.690n | 8.427n | -3.03% |
fmt | yitian | 79.12n | 76.77n | -2.97% |
encoding/csv | amd64 | 2.189µ | 2.132µ | -2.62% |
net/rpc | amd64 | 10.62µ | 10.35µ | -2.61% |
encoding/base64 | amd64 | 373.4n | 363.8n | -2.56% |
time | amd64 | 745.2n | 726.3n | -2.54% |
text/template/parse | yitian | 17.42µ | 17.00µ | -2.41% |
bufio | amd64 | 319.2n | 311.5n | -2.40% |
encoding/gob | amd64 | 7.626µ | 7.445µ | -2.37% |
encoding/base32 | yitian | 17.63µ | 17.22µ | -2.36% |
crypto/rc4 | amd64 | 1.967µ | 1.921µ | -2.30% |
math/bits | amd64 | 1.377n | 1.346n | -2.28% |
encoding/xml | amd64 | 3.507µ | 3.430µ | -2.18% |
fmt | amd64 | 167.0n | 163.5n | -2.14% |
log/slog/internal/benchmarks | amd64 | 178.6n | 174.7n | -2.14% |
crypto/x509 | amd64 | 121.9µ | 119.3µ | -2.12% |
log/slog/internal/benchmarks | yitian | 76.34n | 74.75n | -2.08% |
encoding/gob | yitian | 3.439µ | 3.368µ | -2.07% |
internal/poll | amd64 | 140.6n | 137.7n | -2.04% |
image/jpeg | yitian | 252.0µ | 247.4µ | -1.82% |
compress/lzw | yitian | 946.3µ | 929.7µ | -1.75% |
text/template/parse | amd64 | 21.72µ | 21.34µ | -1.72% |
slices | amd64 | 144.1n | 141.7n | -1.67% |
compress/lzw | amd64 | 1.446m | 1.422m | -1.66% |
crypto/x509 | yitian | 98.15µ | 96.54µ | -1.64% |
io | amd64 | 7.584µ | 7.461µ | -1.62% |
crypto/des | amd64 | 236.3n | 232.5n | -1.60% |
log/slog | amd64 | 452.1n | 445.0n | -1.59% |
crypto/elliptic | amd64 | 57.17µ | 56.33µ | -1.48% |
crypto/internal/nistec | yitian | 131.1µ | 129.1µ | -1.48% |
net/rpc | yitian | 8.009µ | 7.896µ | -1.42% |
crypto/internal/nistec/fiat | amd64 | 89.35n | 88.15n | -1.34% |
crypto/internal/nistec/fiat | yitian | 51.92n | 51.25n | -1.30% |
slices | yitian | 99.47n | 98.17n | -1.30% |
encoding/asn1 | yitian | 2.581µ | 2.549µ | -1.24% |
crypto/internal/nistec | amd64 | 208.6µ | 206.2µ | -1.15% |
sort | yitian | 528.5µ | 522.7µ | -1.08% |
reflect | amd64 | 36.59n | 36.20n | -1.06% |
compress/flate | amd64 | 1.597m | 1.582m | -0.92% |
internal/fuzz | amd64 | 9.127µ | 9.043µ | -0.92% |
image/gif | yitian | 4.505m | 4.464m | -0.92% |
reflect | yitian | 23.30n | 23.09n | -0.90% |
debug/elf | yitian | 5.180µ | 5.134µ | -0.88% |
debug/gosym | yitian | 4.722µ | 4.683µ | -0.83% |
regexp | amd64 | 10.57µ | 10.48µ | -0.81% |
crypto/elliptic | yitian | 37.49µ | 37.18µ | -0.81% |
encoding/json | amd64 | 3.434µ | 3.407µ | -0.80% |
image | yitian | 6.054n | 6.005n | -0.80% |
encoding/pem | amd64 | 170.6µ | 169.4µ | -0.71% |
runtime/cgo | amd64 | 845.7n | 839.9n | -0.69% |
debug/elf | amd64 | 5.933µ | 5.894µ | -0.67% |
crypto/internal/bigmod | yitian | 8.593µ | 8.538µ | -0.64% |
hash/crc32 | amd64 | 287.9n | 286.1n | -0.63% |
go/constant | amd64 | 40.93µ | 40.67µ | -0.62% |
image/png | amd64 | 1.220m | 1.212m | -0.60% |
math/big | amd64 | 3.557µ | 3.536µ | -0.59% |
crypto/aes | amd64 | 27.64n | 27.48n | -0.58% |
net/netip | yitian | 44.12n | 43.87n | -0.57% |
math/big | yitian | 2.412µ | 2.399µ | -0.51% |
math/cmplx | amd64 | 39.62n | 39.43n | -0.46% |
crypto/subtle | yitian | 8.622n | 8.583n | -0.46% |
math | yitian | 5.104n | 5.083n | -0.41% |
sort | amd64 | 825.4µ | 822.0µ | -0.40% |
bytes | amd64 | 1.733µ | 1.727µ | -0.37% |
image/color | yitian | 3.669n | 3.655n | -0.36% |
image/gif | amd64 | 4.927m | 4.910m | -0.35% |
html/template | yitian | 494.9n | 493.3n | -0.32% |
hash/crc64 | yitian | 10.65µ | 10.63µ | -0.27% |
internal/poll | yitian | 62.14n | 61.99n | -0.25% |
crypto/internal/edwards25519 | yitian | 28.41µ | 28.34µ | -0.24% |
mime/multipart | yitian | 35.87µ | 35.79µ | -0.22% |
crypto/md5 | amd64 | 4.305µ | 4.298µ | -0.17% |
crypto/sha1 | amd64 | 1.381µ | 1.378µ | -0.17% |
crypto/cipher | amd64 | 1.182µ | 1.180µ | -0.16% |
crypto/ecdh | amd64 | 483.4µ | 482.7µ | -0.14% |
unicode/utf8 | yitian | 55.44n | 55.38n | -0.12% |
html/template | amd64 | 664.7n | 664.0n | -0.10% |
bytes | yitian | 1.154µ | 1.153µ | -0.09% |
crypto/internal/bigmod | amd64 | 9.538µ | 9.532µ | -0.07% |
compress/bzip2 | yitian | 5.270m | 5.266m | -0.07% |
crypto/rc4 | yitian | 1.501µ | 1.500µ | -0.06% |
archive/tar | yitian | 3.656µ | 3.654µ | -0.05% |
crypto/ecdsa | yitian | 187.3µ | 187.2µ | -0.05% |
crypto/sha1 | yitian | 483.7n | 483.7n | -0.01% |
crypto/internal/edwards25519/field | yitian | 34.96n | 34.97n | 0.02% |
hash/fnv | amd64 | 2.075µ | 2.075µ | 0.03% |
crypto/md5 | yitian | 3.746µ | 3.747µ | 0.03% |
crypto/cipher | yitian | 816.4n | 816.9n | 0.06% |
crypto/hmac | amd64 | 1.677µ | 1.678µ | 0.08% |
crypto/ed25519 | amd64 | 42.51µ | 42.55µ | 0.09% |
encoding/asn1 | amd64 | 3.390µ | 3.394µ | 0.10% |
crypto/sha256 | amd64 | 3.200µ | 3.204µ | 0.11% |
crypto/ecdh | yitian | 307.9µ | 308.2µ | 0.11% |
crypto/sha512 | yitian | 1.036µ | 1.037µ | 0.11% |
hash/crc32 | yitian | 166.1n | 166.3n | 0.11% |
image/png | yitian | 787.7µ | 788.5µ | 0.11% |
crypto/hmac | yitian | 473.6n | 474.1n | 0.12% |
sync | amd64 | 85.76n | 85.89n | 0.15% |
image/jpeg | amd64 | 417.5µ | 418.2µ | 0.17% |
archive/zip | yitian | 205.1µ | 205.5µ | 0.20% |
go/printer | amd64 | 232.9µ | 233.5µ | 0.23% |
strconv | yitian | 42.90n | 43.00n | 0.23% |
log/slog | yitian | 335.1n | 335.9n | 0.24% |
image/color | amd64 | 5.199n | 5.213n | 0.26% |
crypto/ed25519 | yitian | 32.43µ | 32.51µ | 0.27% |
crypto/internal/edwards25519/field | amd64 | 44.30n | 44.42n | 0.28% |
crypto/internal/edwards25519 | amd64 | 36.84µ | 36.95µ | 0.30% |
compress/flate | yitian | 990.4µ | 993.5µ | 0.32% |
debug/gosym | amd64 | 5.990µ | 6.011µ | 0.36% |
strings | yitian | 816.3n | 819.3n | 0.36% |
go/constant | yitian | 39.23µ | 39.37µ | 0.37% |
math/rand/v2 | yitian | 5.932n | 5.954n | 0.37% |
crypto/rsa | yitian | 734.8µ | 738.4µ | 0.49% |
crypto/ecdsa | amd64 | 312.1µ | 313.7µ | 0.51% |
log | amd64 | 175.0n | 175.9n | 0.53% |
regexp | yitian | 7.101µ | 7.138µ | 0.53% |
crypto/sha256 | yitian | 581.6n | 585.0n | 0.58% |
encoding/csv | yitian | 1.870µ | 1.881µ | 0.62% |
crypto/sha512 | amd64 | 2.774µ | 2.792µ | 0.65% |
runtime | yitian | 34.38n | 34.61n | 0.65% |
sync | yitian | 56.61n | 56.98n | 0.65% |
image/draw | amd64 | 626.6µ | 631.0µ | 0.69% |
runtime/trace | amd64 | 4.299n | 4.329n | 0.69% |
strings | amd64 | 1.152µ | 1.160µ | 0.70% |
internal/fuzz | yitian | 6.616µ | 6.663µ | 0.71% |
crypto/subtle | amd64 | 12.80n | 12.90n | 0.74% |
crypto/aes | yitian | 17.14n | 17.27n | 0.76% |
text/tabwriter | amd64 | 388.0µ | 391.0µ | 0.77% |
go/parser | amd64 | 1.634m | 1.647m | 0.80% |
net/textproto | yitian | 1.647µ | 1.661µ | 0.80% |
crypto/des | yitian | 162.4n | 163.7n | 0.81% |
internal/intern | yitian | 128.0n | 129.1n | 0.85% |
encoding/hex | amd64 | 5.941µ | 5.993µ | 0.86% |
compress/bzip2 | amd64 | 8.345m | 8.419m | 0.89% |
go/parser | yitian | 1.233m | 1.244m | 0.89% |
encoding/base64 | yitian | 240.5n | 242.8n | 0.92% |
log | yitian | 180.6n | 182.4n | 0.96% |
encoding/hex | yitian | 4.120µ | 4.161µ | 0.97% |
encoding/base32 | amd64 | 27.58µ | 27.85µ | 0.98% |
net/http | yitian | 3.268µ | 3.301µ | 1.03% |
encoding/json | yitian | 1.572µ | 1.589µ | 1.06% |
math/cmplx | yitian | 26.43n | 26.74n | 1.17% |
mime | yitian | 24.37n | 24.65n | 1.17% |
strconv | amd64 | 67.41n | 68.25n | 1.24% |
runtime | amd64 | 51.06n | 51.71n | 1.26% |
runtime/trace | yitian | 2.411n | 2.443n | 1.31% |
math/rand | yitian | 40.09n | 40.64n | 1.38% |
html | yitian | 2.562µ | 2.599µ | 1.46% |
bufio | yitian | 267.9n | 272.0n | 1.50% |
expvar | amd64 | 264.3n | 268.5n | 1.60% |
net/netip | amd64 | 64.72n | 65.76n | 1.62% |
os | yitian | 1.387µ | 1.414µ | 1.94% |
encoding/xml | yitian | 3.475µ | 3.544µ | 2.00% |
go/types | yitian | 497.5µ | 507.8µ | 2.07% |
io | yitian | 1.994µ | 2.036µ | 2.12% |
regexp/syntax | yitian | 106.3n | 108.6n | 2.14% |
math/rand/v2 | amd64 | 8.660n | 8.854n | 2.24% |
context | yitian | 953.0n | 975.0n | 2.31% |
encoding/pem | yitian | 106.5µ | 108.9µ | 2.31% |
mime/multipart | amd64 | 42.46µ | 43.55µ | 2.57% |
database/sql | amd64 | 853.1µ | 875.5µ | 2.62% |
go/scanner | amd64 | 241.1µ | 247.4µ | 2.62% |
hash/maphash | yitian | 28.44n | 29.21n | 2.71% |
runtime/internal/math | yitian | 0.9047n | 0.9295n | 2.74% |
go/scanner | yitian | 164.8µ | 169.6µ | 2.94% |
errors | yitian | 132.0n | 136.1n | 3.16% |
image/draw | yitian | 370.5µ | 382.8µ | 3.30% |
unicode/utf8 | amd64 | 76.28n | 78.91n | 3.45% |
text/tabwriter | yitian | 284.5µ | 294.3µ | 3.45% |
runtime/internal/atomic | amd64 | 13.41n | 13.95n | 4.01% |
html | amd64 | 2.673µ | 2.783µ | 4.12% |
errors | amd64 | 166.2n | 173.3n | 4.25% |
net/url | yitian | 254.0n | 264.9n | 4.29% |
go/printer | yitian | 159.7µ | 166.6µ | 4.36% |
hash/crc64 | amd64 | 12.87µ | 13.43µ | 4.38% |
math | amd64 | 9.082n | 9.480n | 4.39% |
archive/tar | amd64 | 4.891µ | 5.107µ | 4.42% |
index/suffixarray | yitian | 46.87m | 49.02m | 4.60% |
go/types | amd64 | 685.3µ | 718.3µ | 4.81% |
net/textproto | amd64 | 2.142µ | 2.246µ | 4.87% |
hash/fnv | yitian | 1.453µ | 1.526µ | 5.00% |
sync/atomic | amd64 | 8.594n | 9.055n | 5.37% |
database/sql | yitian | 698.3µ | 735.9µ | 5.38% |
runtime/internal/math | amd64 | 1.390n | 1.465n | 5.40% |
crypto/rsa | amd64 | 970.1µ | 1.025m | 5.67% |
net/url | amd64 | 385.5n | 412.0n | 6.86% |
internal/intern | amd64 | 135.3n | 145.1n | 7.24% |
os | amd64 | 1.671µ | 1.799µ | 7.67% |
hash/maphash | amd64 | 45.60n | 49.25n | 8.02% |
index/suffixarray | amd64 | 79.15m | 85.55m | 8.08% |
net/http | amd64 | 4.357µ | 4.712µ | 8.16% |
context | amd64 | 1.236µ | 1.350µ | 9.18% |
Because loop rotation significantly changes the control flow, it also affects block layout and register allocation, leading to regressions in some benchmarks. I have fixed some of these, but completely eliminating the regressions remains a long-term task. To keep the patch from growing too large and mixing in changes that would make review harder, such as fixes to the register allocation algorithm and adjustments to the block layout algorithm, and to keep features independent, I have introduced GOEXPERIMENT=loopopts
to gate these optimizations, with the expectation of gradually fixing all the regressions and reducing compilation time in follow-up patches.
Below is a more detailed explanation of the changes, which may be helpful to the reviewer. Thank you for your patience!
1. Loop Invariant Code Motion
The main idea behind LICM is to move loop invariant values outside of the loop
so that they are only executed once, instead of being repeatedly executed with
each iteration of the loop. In the context of LICM, if a loop invariant can be
speculatively executed, then it can be freely hoisted to the loop entry.
However, if it cannot be speculatively executed, there is still a chance that
it can be hoisted outside the loop under a few prerequisites:
#1 Instruction is guaranteed to execute unconditionally
#2 Instruction does not access memory locations that may alias with other memory operations inside the loop
For #1, this is guaranteed by loop rotation, where the loop is guaranteed to
execute at least once after rotation. But that's not the whole story. If the
instruction is guarded by a conditional expression (e.g., a load from a memory
address is usually guarded by an IsInBounds check), we try to hoist it only if
the loop invariant dominates all loop exits, which implies that it is executed
unconditionally once the loop is entered.
For #2, we rely on a simple but efficient type-based alias analysis to determine
whether two memory-access instructions may access the same memory location.
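To make this concrete, here is a hand-written Go illustration (not code from the patch; config, scaleAll, and scaleAllHoisted are made-up names). The load of c.scale is loop invariant but not speculatable, so hoisting it needs both prerequisites above: rotation provides #1, and alias analysis provides #2 by proving the stores to dst[i] cannot overwrite c.scale.

```go
// config is a hypothetical type used only for this illustration.
type config struct {
	scale int
}

// Before LICM: c.scale is reloaded on every iteration even though neither c
// nor c.scale changes inside the loop. The load cannot simply be speculated
// before the loop: c might be nil and the loop might run zero times.
func scaleAll(dst []int, c *config) {
	for i := 0; i < len(dst); i++ {
		dst[i] *= c.scale // loop-invariant load inside the loop
	}
}

// What rotation + LICM effectively achieve, written out by hand: the nil
// check and the load of c.scale happen once, outside the do-while body.
func scaleAllHoisted(dst []int, c *config) {
	if len(dst) > 0 { // loop guard: the body runs at least once
		s := c.scale // hoisted invariant load
		for i := 0; ; {
			dst[i] *= s
			i++
			if i >= len(dst) { // rotated test in the loop latch
				break
			}
		}
	}
}
```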
2. Type-based Alias Analysis
Type-based Alias Analysis (TBAA) is described in Amer Diwan, Kathryn S. McKinley, J. Eliot B. Moss: Type-Based Alias Analysis, PLDI 1998.
TBAA relies on the fact that Go is a type-safe language, i.e. pointers of
different types cannot be converted to each other in Go. Under this assumption,
TBAA attempts to determine whether two pointers may point to the same memory
based on their types and value semantics. The rules can be summarized as follows:
#0 an unsafe pointer may alias with anything, even if the types differ
#1 a must alias with b if a == b
#2 a.f aliases with b.g if f == g and a aliases with b
#3 a.f aliases with *b if they have the same type
#4 a[i] aliases with *b if they have the same type
#5 a.f never aliases with b[i]
#6 a[i] aliases with b[j] if a == b
#7 a aliases with b if they have the same type
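As a rough sketch of how such a check might look (my own simplification for illustration, not the code in the CL; Kind, Ref, and mayAlias are made-up names, and the real pass works on SSA Values and types rather than strings):

```go
// Kind describes how a memory location is addressed in this toy model.
type Kind int

const (
	Unsafe Kind = iota // pointer derived from unsafe.Pointer
	Direct             // *p
	Field              // p.f
	Index              // p[i]
)

// Ref is a simplified description of one memory access: Base identifies the
// base pointer value (equal numbers mean provably the same SSA value), Type
// is the static type of the accessed location, FieldName is set for fields.
type Ref struct {
	Kind      Kind
	Type      string
	Base      int
	FieldName string
}

// mayAlias applies rules #0-#7 above to this toy representation. It is
// conservative: returning true means the two accesses may alias.
func mayAlias(a, b Ref) bool {
	if a.Kind == Unsafe || b.Kind == Unsafe {
		return true // #0: unsafe pointers may alias anything
	}
	if (a.Kind == Field && b.Kind == Index) || (a.Kind == Index && b.Kind == Field) {
		return false // #5: a.f never aliases b[i]
	}
	if a.Base == b.Base {
		// #1/#6: same base pointer; for field accesses the fields must
		// also match (rule #2 with a == b), otherwise they must alias.
		if a.Kind == Field && b.Kind == Field {
			return a.FieldName == b.FieldName
		}
		return true
	}
	// #2/#3/#4/#7: different or unknown bases; fall back to type identity.
	return a.Type == b.Type
}
```

LICM would then refuse to hoist a Load past a Store inside the loop whenever such a check reports that the two accesses may alias.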
3. Loop Closed SSA Form
Loop closed SSA form (LCSSA) is a special form of SSA, used to simplify loop
optimizations. It ensures that all values defined inside the loop are only
used within the loop. The transformation looks for uses of loop-defined values
outside the loop and inserts an appropriate "proxy phi" at the loop exit;
afterwards, code outside the loop no longer uses the loop def directly but the proxy phi.
Before:
loop header:
v3 = Phi(0, v4)
If cond -> loop latch, loop exit
loop latch:
v4 = Add(v3, 1)
Plain -> loop header
loop exit:
v5 = Add(5, v3)
Ret v18
After:
loop header:
v3 = Phi(0, v4)
If cond -> loop latch, loop exit
loop latch:
v4 = Add(v3, 1)
Plain -> loop header
loop exit:
v6 = Phi(v3) <= proxy phi
v5 = Add(5, v6)
Ret v18
Previously, v5 used v3 directly, and v5 lives in the loop exit, which is outside
the loop. After the LCSSA transformation, v5 uses v6, which in turn uses v3;
v6 is the proxy phi. In the context of LCSSA, we can consider v6's use of v3
to occur in the loop header rather than in the loop exit. This way, all values
defined in the loop are loop "closed", i.e. only used within the loop.
Any further changes to the loop definition only need to update the proxy phi,
rather than iterating through all of its uses.
This significantly simplifies the implementation of Loop Rotation.
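For reference, the IR above roughly corresponds to Go source like the following (a hand-written approximation; count and cond are made-up names):

```go
// i corresponds to the v3 phi; the use of i after the loop (in the return
// expression) is the out-of-loop use that LCSSA rewires through the v6
// proxy phi at the loop exit.
func count(cond func(int) bool) int {
	i := 0
	for cond(i) {
		i++ // v4 = Add(v3, 1)
	}
	return 5 + i // v5 = Add(5, v3) in the loop exit
}
```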
4. Loop Rotation
4.1 CFG transformation
Loop rotation, also known as loop inversion, transforms a while/for loop into
a do-while style loop, e.g.
// Before
for i := 0; i < cnt; i++ {
...
}
// After
if 0 < cnt {
i := 0
do {
...
} while(i < cnt)
}
The original natural loop has the following IR form:
entry
│
│ ┌───loop latch
▼ ▼ ▲
loop header │
│ │ │
│ └──►loop body
▼
loop exit
In this terminology, the loop entry dominates the entire loop, the loop header
contains the loop conditional test, the loop body is the code that is repeated,
and the loop latch contains the backedge to the loop header; for simple loops,
the loop body is the same as the loop latch. The loop exit is the block to which
control flows when the loop terminates.
We move the conditional test from the loop header to the loop latch; the
incoming backedge argument of the conditional test must be updated as well,
otherwise we would lose one update. Also note that any other uses of the moved
Values must be updated, because the moved Values now live in the loop latch and
may no longer dominate their uses. At this point, the loop latch decides whether
the loop continues or exits based on the rotated test.
entry
│
│
▼
loop header◄──┐
│ │
│ │
▼ │
loop body │
│ │
│ │
▼ │
loop latch────┘
│
│
▼
loop exit
Now the loop header and loop body are executed unconditionally, which may change
the program semantics, since the original program executes them only if the test succeeds.
A so-called loop guard is inserted to preserve the semantics: the loop is entered only if it would execute at least once.
entry
│
│
▼
┌──loop guard
│ │
│ │
│ ▼
│ loop header◄──┐
│ │ │
│ │ │
│ ▼ │
│ loop body │
│ │ │
│ │ │
│ ▼ │
│ loop latch────┘
│ │
│ │
│ ▼
└─►loop exit
Loop header no longer dominates entire loop, loop guard dominates it instead.
If Values defined in the loop were used outside loop, all these uses should be
replaced by a new Phi node at loop exit which merges control flow from loop
header and loop guard. Based on Loop Closed SSA Form, these Phis have already
been created. All we need to do is simply reset their operands to accurately
reflect the fact that loop exit is a merge point now.
One of the main purposes of Loop Rotation is to assist other optimizations
such as LICM. They may require that the rotated loop has a proper and safe
block in which to place new Values, so an optional loop land block is created to
give these optimizations a chance to keep hoisted Values from being homeless.
entry
│
│
▼
┌──loop guard
│ │
│ │
│ ▼
| loop land <= safe land to place Values
│ │
│ │
│ ▼
│ loop header◄──┐
│ │ │
│ │ │
│ ▼ │
│ loop body │
│ │ │
│ │ │
│ ▼ │
│ loop latch────┘
│ │
│ │
│ ▼
└─► loop exit
The detailed loop rotation algorithm is summarized in the following steps:
- Transform the loop to Loop Closed SSA Form
1.1 All uses of loop-defined Values are replaced by uses of proxy phis
- Check whether the loop can be rotated
2.1 The loop must be a natural loop and have a single exit, and so on.
- Rotate the loop conditional test and rewire loop edges
3.1 Rewire the loop header to the loop body unconditionally.
3.2 Rewire the loop latch to the header and the exit based on the new conditional test.
3.3 Create a new loop guard block and rewire the loop entry to the loop guard.
3.4 Clone the conditional test from the loop header to the loop guard.
3.5 Rewire the loop guard to the original loop header and the loop exit.
- Reconcile broken data dependencies after the CFG transformation
4.1 Move the conditional test from the loop header to the loop latch.
4.2 Update uses of the moved Values, because these defs no longer dominate their uses after being moved to the loop latch.
4.3 Add the corresponding argument to phis at loop exits, since a new edge from the loop guard to the loop exit has been created.
4.4 Update proxy phis to use the loop phi's incoming argument that comes from the loop latch, since the loop latch may now terminate the loop.
Some gory details of the data dependencies after the CFG transformation deviate
from intuition and need to be taken into account. This makes the loop rotation
implementation more complex and less intuitive.
4.2 Fix Data Dependencies
The most challenging part of Loop Rotation is the update of data dependencies
(refer to steps 3 and 4 of the algorithm).
Update conditional test operands
In the original while/for loop, the Phi values are updated at the
end of each iteration, when control takes the backedge, and all
subsequent uses of a Phi rely on the updated values. However, when
the loop is converted to a do-while loop, the Phi nodes may be used
at the end of each iteration before they are updated. Therefore, we
need to replace all such uses of a Phi with uses of the Phi's
backedge parameter; this is equivalent to using the updated Phi
value. Here is a simple example:
In the normal case, if v2 uses the v1 phi, and the backedge operand
v4 of the v1 phi is located in the loop latch block, we only need to
change v2's use of v1 into a use of v4. This prevents losing an
update, and the dominance relationship is not broken even after v2
is moved to the loop latch.
Before:
loop header:
v1 = phi(0, v4)
v2 = v1 + 1
If v2 < 3 -> loop body, loop exit
loop latch:
v4 = const 512
After:
loop header:
v1 = phi(0, v4)
loop latch:
v4 = const 512
v2 = v4 + 1
If v2 < 3 -> loop header, loop exit
After updating uses of val, we may create yet another cyclic
dependency, i.e.
loop header:
v1 = phi(0, v4)
v2 = v1 + 1
If v2 < 3 -> loop body, loop exit
loop latch:
v4 = v2 + 1
After updating iarg of val to newUse, it becomes
loop header:
v1 = phi(0, v4)
loop latch:
v2 = v4 + 1 ;;; cyclic dependency
v4 = v2 + 1
If v2 < 3 -> loop header, loop exit
This is similar to the case below, and it would be properly handled
by updateMovedUses. For now, we just skip it to avoid infinite
recursion.
If the v1 phi in the loop header takes a backedge argument v2 that is
defined in the same basic block, and v2 in turn uses the v1 phi,
there is a cyclic dependency, i.e.
loop header:
v1 = phi(0, v2)
v2 = v1 + 1
If v2 < 3 -> loop body, loop exit
In this case, we need to first convert the v1 phi into its
normal form, where its back edge parameter uses the value defined
in the loop latch.
loop header:
v1 = phi(0, v3)
v2 = v1 + 1
If v2 < 3 -> loop body, loop exit
loop latch:
v3 = Copy v2
After this, the strange v1 phi is treated in the same way as
other phis. After moving the conditional test to the loop latch,
the relevant parameters will also be updated, i.e., v2 will
use v3 instead of v1 phi:
loop header:
v1 = phi(0, v3)
loop latch:
v3 = Copy v2
v2 = v3 + 1
If v2 < 3 -> loop header, loop exit
Finally, since v3 is a use of v2, after moving v2 to the loop
latch, updateMovedUses updates these uses and inserts a
new v4 Phi.
loop header:
v1 = phi(0, v3)
v4 = phi(v2', v2) ;;; v2' lives in loop guard
loop latch:
v3 = Copy v4
v2 = v3 + 1
If v2 < 3 -> loop header, loop exit
Update uses of moved Values because these defs no longer dominate their uses after they were moved to the loop latch
If the loop conditional test is "trivial", we move the whole chain of Values
feeding the conditional test to the loop latch; after that, they may no longer
dominate their in-loop uses:
loop header
v1 = phi(0, ...)
v2 = v1 + 1
If v2 < 3 ...
loop body:
v4 = v2 - 1
So we need to create a new phi v5 at the loop header to merge the control flow
from the loop guard to the loop header and from the loop latch to the loop header,
and use this phi to replace v2's in-loop use in v4, e.g.
loop header:
v1 = phi(0, ...)
v5 = phi(v2', v2) ;;; v2' lives in loop guard
loop body:
v4 = v5 - 1
loop latch:
v2 = v1 + 1
If v2 < 3 ...
Add the corresponding argument to phis at loop exits since a new edge from the loop guard to the loop exit has been created
The loop header no longer dominates the loop exit, and a new edge from the loop
guard to the loop exit has been created. This is not yet reflected in the proxy
phis in the loop exits, i.e. these proxy phis are missing the argument that comes
from the loop guard, and we need to reconcile the divergence:
Before:
loop exit:
v1 = phi(v1)
After (new edge from loop guard to loop exit):
loop exit:
v1 = phi(v1, v1') <= add missing g2e argument v1'
Since LCSSA ensures that all loop uses are closed, i.e. any out-of-loop uses
are replaced by proxy phis in the loop exit, we only need to add the missing
argument v1' to the v1 proxy phi.
Update proxy phis to use the loop phi's incoming argument that comes from the loop latch since the loop latch may now terminate the loop
The loop latch now terminates the loop. If a proxy phi uses a loop phi that lives
in the loop header, it should instead use the loop phi's incoming argument that
comes from the loop latch; this avoids losing one update.
Before:
loop header:
v1 = phi(0, v4)
loop latch:
v4 = v1 + 1
loop exit
v3 = phi(v1, ...)
After:
loop header:
v1 = phi(0, v4)
loop latch:
v4 = v1 + 1
loop exit
v3 = phi(v4, ...) ;; use v4 instead of v1
5. Bug Fix
Loop exits created by BlockJumpTable will not be discovered by findExits().
The current loop exits are only b23 and b43, but the expected ones should also include b10 and b13.
6. Performance Regression
6.1 block layout
In the original block layout algorithm, regardless of whether the successors of the current basic block carry the likely
attribute, the layout algorithm places each successor right after the current basic block, one by one. The improved algorithm employs a greedy approach, aiming to lay out the basic blocks of a "trace" together as much as possible in order to minimize excessive jumps, i.e. for the given IR:
b1:
BlockIf -> b2 b3 (likely)
b2:
BlockPlain->b4
b4:
...
The final block orders are as follows:
baseline:
b1 b2 b3 b4
opt:
b1 b2 b4 b3
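To make the difference concrete, here is a toy sketch of the greedy trace idea (my own illustration, not the compiler's layout pass; block, traceOrder, and the likely field are made-up names):

```go
// block is a toy CFG node: likely is the successor marked as likely (empty
// if none), succs lists all successors in order.
type block struct {
	likely string
	succs  []string
}

// traceOrder starts from the first unplaced block and keeps following the
// likely (or first) successor until the trace ends, then starts a new trace.
func traceOrder(names []string, blocks map[string]block) []string {
	placed := make(map[string]bool)
	var order []string
	grow := func(name string) {
		for name != "" && !placed[name] {
			placed[name] = true
			order = append(order, name)
			b := blocks[name]
			next := b.likely
			if next == "" && len(b.succs) > 0 {
				next = b.succs[0]
			}
			name = next
		}
	}
	for _, name := range names { // names in original order; entry first
		grow(name)
	}
	return order
}
```

For the IR above, following the likely edge b1→b2 and then b2→b4 yields the trace b1 b2 b4, and the remaining block b3 is appended last, matching the opt order.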
6.2 register allocation
After loop rotation, there are some significant performance regressions. Taking the strconv.Atoi benchmark as an example, performance deteriorates by about 30%, as in the following example:
;; baseline
b12
00025 JMP 30
00026 LEAQ (BX)(BX*4), R8
00027 MOVBLZX DI, DI
00028 INCQ AX
00029 LEAQ (DI)(R8*2), BX
00030 CMPQ AX, SI
b18
00031 JGE 38
00032 MOVBLZX (DX)(AX*1), DI
00033 ADDL $-48, DI
00034 CMPB DI, $9
b19
00035 JLS 26
;; opt
b27
00027 JMP 29
00028 MOVQ R8, AX
00029 MOVBLZX (DX)(AX*1), DI
00030 ADDL $-48, DI
00031 CMPB DI, $9
b19
00032 JHI 40
00033 LEAQ 1(AX), R8
00034 LEAQ (BX)(BX*4), BX
00035 MOVBLZX DI, DI
00036 LEAQ (DI)(BX*2), BX
00037 CMPQ SI, R8
b20
00038 JGT 28
b20
00039 JMP 43
In the code generated by the baseline, the loop consists of 10 instructions [26-35], whereas the optimized code contains 11 instructions [28-38]. This is because the baseline's loop increment i++ produced incq ax, while the optimized code generates:
00028 MOVQ R8, AX
...
00033 LEAQ 1(AX), R8
The culprit is the empty basic block created by splitting the critical edge after rotation, which affects the logic of register allocation.
During register allocation, for the instruction v82 = ADDQconst <int> [1] v59, the register allocator checks whether v59 is in the next basic block and examines the expected register for v59. For example, if v59 is in the next basic block and its expected register is R8, then the register allocator also allocates R8 for v82 = ADDQconst <int> [1] v59; because the input and result are then in the same register, an incq can be produced. However, after loop rotation, the next basic block for v82 is the newly inserted empty basic block from the critical-edge split, which prevents the register allocator from knowing the expected register for v59, so v82 and v59 end up in different registers, R8 and RAX, resulting in an extra mov instruction. The proposed fix is to skip empty basic blocks and check the next non-empty basic block directly for the expected register.
6.3 high branch miss
BenchmarkSrcsetFilterNoSpecials shows a noticeable performance regression due to high branch mispredictions after the optimization.
The program execution flow is b16->b19->b103->b17->b16 (loop), but b103 is placed below the loop latch b17, leading to high branch misses as illustrated above (ja 8b (goto b103), jmp 7f (goto b17), and jg 30 (goto b16)). I propose to always place the loop header (or the loop latch after real rotation) at the bottom of the loop.
7. Test fix
7.1 codegen/issue58166.go
codegen/issue58166.go fails: it expects the loop increment to generate incq, but it ends up generating leaq+mov instead.
During register allocation, v114 couldn't be allocated to the same register R8 as v149, because there were two uses of v114 (v136 and v146) right after v149. As a result, unexpected instructions were produced.
b26:
v149 = ADDQconst <int> [1] v114
v135 = MULSD <float64> v87 v123
v136 = ADDSDloadidx8 <float64> v135 v50 v114 v160 ;;use of v114
v146 = MOVSDstoreidx8 <mem> v50 v114 v136 v160 ;; use of v114
v91 = CMPQ <flags> v15 v149
After loop rotation, the schedule pass calculated scores for b26, resulting in the following order:
Before schedule:
b26:
v135 (12) = MULSD <float64> v87 v123
v136 (12) = ADDSDloadidx8 <float64> v135 v50 v114 v160
v146 (12) = MOVSDstoreidx8 <mem> v50 v114 v136 v160
v149 (+11) = ADDQconst <int> [1] v114
v91 (+11) = CMPQ <flags> v15 v149
After schedule:
b26:
v149 = ADDQconst <int> [1] v114
v135 = MULSD <float64> v87 v123
v136 = ADDSDloadidx8 <float64> v135 v50 v114 v160
v146 = MOVSDstoreidx8 <mem> v50 v114 v136 v160
v91 = CMPQ <flags> v15 v149
An optimization was made as per https://go-review.googlesource.com/c/go/+/463751, which prioritizes values that are used within their live blocks. Since v149 is used by v91 in the same block, it has a higher priority.
The proposed fix is that if a value is used by a control value, it should get the lowest priority. This would allow v149 and v91 to be scheduled closely together. After this scheduling, v149 is the last use of v114, allowing v149 and v114 to be allocated to the same register and ultimately generating an incq instruction.
7.2 test/nilptr3.go
func f4(x *[10]int) {
...
_ = x[9] // ERROR "generated nil check"
for {
if x[9] != 0 { // ERROR "removed nil check"
break
}
}
}
Loop rotation duplicates the conditional test Values into the loop guard block:
...
loopGuard:
v6 (+159) = NilCheck <*[10]int> v5 v1
v15 (159) = OffPtr <*int> [72] v6
v9 (159) = Load <int> v15 v1
v7 (159) = Neq64 <bool> v9 v13
If v7 → b5 loopHeader
loopHeader:
v8 (+159) = NilCheck <*[10]int> v5 v1
v11 (159) = OffPtr <*int> [72] v8
v12 (159) = Load <int> v11 v1
v14 (159) = Neq64 <bool> v12 v13
Plain → b3
So we need to add a loop nilcheckelim pass after all other loop opts to remove the v6 NilCheck in the loop guard, but that's not the whole story.
The code _ = x[9] is expected to generate a nil check, but the tighten pass moves the LoweredNilCheck from the loop entry b1 to the loop guard. At that point, there happens to be a CMPQconstload in the loop guard that meets the conditions for late nilcheckelim, resulting in the elimination of the LoweredNilCheck as well. Therefore, we also need to modify the test to no longer expect the generated code to contain a nil check.
7.3 test/writebarrier.go
func f4(x *[2]string, y [2]string) {
*x = y // ERROR "write barrier"
z := y // no barrier
*x = z // ERROR "write barrier"
}
We added the loop opt pass after the writebarrier pass, so *x = y becomes a dead store; simply removing the ERROR directive fixes this case.