Prompted by the recent change to support the proposed Zibi extension (PR #146858 and PR #127463), I looked into whether the same code size reduction could be achieved with compressed add+branch sequences. At least in some cases we should be able to emit `c.addi a0, -CMP; c.beqz a0, ...` (4 bytes) instead of the current `c.li a1, CMP; beq a0, a1, ...` (6 bytes).
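Concretely, for a comparison against an immediate `CMP` that fits the `c.addi` range, the two sequences look like this (a sketch; the sizes assume the assembler picks the compressed encodings shown):

```asm
# current: 6 bytes, needs a scratch register but preserves a0
c.li    a1, CMP           # 2 bytes
beq     a0, a1, .Ltaken   # 4 bytes (there is no compressed reg-reg beq)

# proposed: 4 bytes, but clobbers a0
c.addi  a0, -CMP          # 2 bytes
c.beqz  a0, .Ltaken       # 2 bytes
```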
Interestingly, this allows representing all immediates that Zibi will handle, but it of course comes with register pressure and codegen challenges that Zibi avoids.
Looking at the most common immediates in SPEC 2017, the distribution does seem quite random, but a 6-bit immediate (the signed range `c.addi` accepts) covers many of them (https://gist.github.com/arichardson/b88cc3d3cac1a7fec85ee1d24b463d99).
I made a draft change in https://github.com/arichardson/upstream-llvm-project/tree/2025-compressed-branch-imm, but doing this in TableGen does not seem to be particularly useful.
If I emit the pattern unconditionally, we end up needing a `c.mv $TMP, a0` in many cases, since `c.addi` requires the input and output registers to be the same. In that case we are better off with the `c.li`.
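For example, when `a0` is still live after the branch, the unconditional pattern degenerates to the same 6 bytes as today (a sketch):

```asm
# a0 still needed later: must copy first, so nothing is saved
c.mv    a1, a0            # 2 bytes
c.addi  a1, -CMP          # 2 bytes
c.beqz  a1, .Ltaken       # 2 bytes  -> 6 bytes total, same as c.li + beq
```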
However, in cases where the register is dead after the comparison, using `c.addi`+`c.beqz` avoids the need for an extra register (same as Zibi), so it would be good to have an optimization that uses this pattern whenever possible, to assess how much Zibi actually helps code size and register pressure. Of course, Zibi can still be beneficial for simple cores even with identical code size, since there is no need to modify a register and/or do somewhat complex macro-op fusion.
Given the following code (https://godbolt.org/z/fn4rx1vr8):
```c
int foo();
int bar();

int test(int num) {
  if (num == 11) {
    return foo();
  } else if (num == 16) {
    return bar();
  }
  return 1;
}
```
Both clang and GCC emit two `li`+`bne`/`beq` pairs:
```asm
test:
        li      a1, 16
        beq     a0, a1, .LBB0_3
        li      a1, 11
        bne     a0, a1, .LBB0_4
        tail    foo
.LBB0_3:
        tail    bar
.LBB0_4:
        li      a0, 1
        ret
```
But since `a0` is dead after the branch, we should be able to save at least two bytes for the final branch quite easily:
```asm
test:
        c.li    a1, 11
        beq     a0, a1, .LBB0_3
        c.addi  a0, -16
        c.beqz  a0, .LBB0_4
        c.li    a0, 1
        ret
.LBB0_3:
        tail    foo
.LBB0_4:
        tail    bar
```
With some more effort we could track the value of `a0` and save another 2 bytes by generating:
```asm
test:
        c.addi  a0, -11
        c.beqz  a0, .LBB0_3
        c.addi  a0, -5        # subtract another 5 to get to the expected -16
        c.beqz  a0, .LBB0_4
        c.li    a0, 1
        ret
.LBB0_3:
        tail    foo
.LBB0_4:
        tail    bar
```
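The same value-tracking idea generalizes to longer comparison chains: as long as the register is dead afterwards and each delta between consecutive compare values fits in a nonzero 6-bit immediate, every comparison costs only 4 bytes (a sketch; `C1`, `C2`, `C3` are hypothetical compare values):

```asm
# compare num against C1, C2, C3 in turn; a0 is dead afterwards
c.addi  a0, -C1         # a0 = num - C1
c.beqz  a0, .Lcase1
c.addi  a0, -(C2-C1)    # a0 = num - C2; delta must be a nonzero 6-bit imm
c.beqz  a0, .Lcase2
c.addi  a0, -(C3-C2)    # a0 = num - C3
c.beqz  a0, .Lcase3
```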