
ARM/AArch64 backend aggressively pessimizes code with broadcasted constants #102195


Open

dsharlet opened this issue Aug 6, 2024 · 20 comments

@dsharlet

dsharlet commented Aug 6, 2024

I'm having a lot of trouble with the ARM (32- and 64-bit) backends de-optimizing code related to broadcasted constants. There are several issues:

  • LLVM attempts to observe constants through memory, and propagate them.
  • LLVM moves broadcasts into loops.
  • LLVM spills broadcasts by redoing the broadcast, rather than spilling and reloading a vector.

Here's an example that demonstrates several issues: https://godbolt.org/z/chjx4d4vh

If the compiler compiled the code as written, there would be no register spills, because the constants would occupy half as many registers. I included a commented-out call to make_opaque as one attempted workaround, to trick the compiler into not treating these values as constants (at the expense of a function call...). It does accomplish that, but the compiler still moves the broadcasts (dup instructions) out of the loop and spills some of the registers.

I run into this issue very frequently. Any suggested workarounds, e.g. some annotation to force the compiler to keep a broadcast outside of the loop, or possible fixes to LLVM, would be very welcome. As it stands, I find vmla_lane_X intrinsics to be almost useless because of this issue.
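For reference, here is a stripped-down sketch of the kind of loop I mean (not the actual godbolt code; the constants, names, and shapes are made up, but the structure is the same: constants packed into vectors and indexed by lane):

#include <arm_neon.h>

// Sketch only: the four constants live in two d registers and are indexed by
// lane, so as written they should not need one broadcast register each.
void mla_lanes(float* out, const float* in, int n) {
  const float k_vals[4] = {0.25f, 0.5f, 0.75f, 1.0f};  // hypothetical constants
  const float32x2_t k_lo = vld1_f32(k_vals);
  const float32x2_t k_hi = vld1_f32(k_vals + 2);
  for (int i = 0; i + 4 <= n; i += 4) {
    float32x4_t x = vld1q_f32(in + i);
    float32x4_t acc = vld1q_f32(out + i);
    acc = vmlaq_lane_f32(acc, x, k_lo, 0);
    acc = vmlaq_lane_f32(acc, x, k_lo, 1);
    acc = vmlaq_lane_f32(acc, x, k_hi, 0);
    acc = vmlaq_lane_f32(acc, x, k_hi, 1);
    vst1q_f32(out + i, acc);
  }
}

What LLVM does to code shaped like this is re-materialize each used lane as a full-width dup, and the extra registers are what spill.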

@llvmbot
Member

llvmbot commented Aug 6, 2024

@llvm/issue-subscribers-backend-aarch64

Author: Dillon (dsharlet)


@pinskia

pinskia commented Aug 6, 2024

Note vmlaq_lane_f32 should not be using the fused multiply-add instruction either ...
I noticed GCC fixed that for GCC 12, while LLVM still has not been changed.
Note: if you want the fused multiply-add, you should use the vfmaq_lane_f32 intrinsic instead.
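Roughly, the swap is (sketch; as far as I know the operand order is the same for both intrinsics):

#include <arm_neon.h>

// Non-fused semantics: separate multiply and add, which is what vmlaq_lane_f32 means.
float32x4_t mla_lane(float32x4_t acc, float32x4_t x, float32x2_t k) {
  return vmlaq_lane_f32(acc, x, k, 0);
}

// Fused multiply-add by element (fmla); needs an FMA-capable target.
float32x4_t fma_lane(float32x4_t acc, float32x4_t x, float32x2_t k) {
  return vfmaq_lane_f32(acc, x, k, 0);
}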

@fbarchard

There are about 4 bugs here (see below).
I've reported (2), (3) and (4), but (1) is new and mainly what dsharlet is encountering.
When an immediate constant (e.g. 1) is duplicated and put into a vector at the top of a function, and there is register pressure, clang puts a broadcast inside the loop.
I've tried replacing the constant with a vector load of an array outside the loop, and it copies it to the stack: movdqa+movdqu to put it on the stack (now unaligned), and then movdqu inside the loop instead of a broadcast.

  1. Vectors are broadcast/loaded inside loops. I see this on x86 as well (clang 16).
  2. Lanes are replaced with pre-duplicated values, which causes register spills: the vectors are loaded, dup'ed to 4 vectors, stored to the stack, then loaded inside the loop.
  3. Constants, and code sequences that generate constants, are replaced with constants in memory plus a load, which is bigger and slower, especially on x86-32 with -fpic, where a call/pop/lea is needed to get the address, with a ret branch mispredict.
    This article http://0x80.pl/notesen/2023-01-19-avx512-consts.html shows ways to generate simple constants on x86, typically with 2 or 3 vector instructions. But clang replaces the instructions with a static constant in the text segment and a code sequence to load it, which is slow and what I'm trying to avoid (an intrinsics sketch of this is at the end of this comment).
    My preferred solution is for the clang x86 assembler to handle a movi pseudo-op that generates a code sequence, similar to arm. For example, to move the immediate value 1 into the bytes of a register in avx512/avx10:
    VPTERNLOGD $0xff, Z1, Z1, Z1
    VPABSB Z1, Z4 // 0x01010101
  4. Broadcast is slow on newer cpus, typically 5 cycles. I tried using a shuffle, which is faster, but clang replaced it with an extract + broadcast (inside the main loop). Embedded broadcasting has equally slow latency.

P.S. There was a bug in vld21_dup_f32, but it's fixed in head.
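For (3), a concrete intrinsics version of that sequence would be roughly the following (sketch; this is exactly the kind of thing clang currently folds back into a constant-pool load):

#include <immintrin.h>

// ternlog with imm 0xff produces all-ones; byte abs of -1 then gives 1 in every byte.
__m512i splat_one_epi8(void) {
  __m512i ones = _mm512_ternarylogic_epi32(_mm512_undefined_epi32(),
                                           _mm512_undefined_epi32(),
                                           _mm512_undefined_epi32(), 0xff);
  return _mm512_abs_epi8(ones);  // requires AVX-512BW
}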

@dsharlet
Author

dsharlet commented Aug 6, 2024

Note vmlaq_lane_f32 should not be using the fused multiply-add instruction either ...
I noticed GCC fixed that for GCC 12, while LLVM still has not been changed.
Note: if you want the fused multiply-add, you should use the vfmaq_lane_f32 intrinsic instead.

Thanks for pointing that out, I was unaware of this, especially because it generated the instruction I expected!

That said, I corrected the example (and added -ffast-math for good measure), and it still has the issue I reported: https://godbolt.org/z/6EM9drsc6

edit: I forgot to check; my workaround does work in this case now! However, it has the cost of a function call, so I would still really appreciate a fix for this bug, and also any workarounds that don't add overhead, if you can think of any.

@davemgreen davemgreen changed the title ARM backend aggressively pessimizes code with broadcasted constants ARM/AArch64 backend aggressively pessimizes code with broadcasted constants Aug 7, 2024
@davemgreen
Collaborator

Note vmlaq_lane_f32 should not be using the fused multiply-add instruction either ...
I noticed GCC fixed that for GCC 12, while LLVM still has not been changed.
Note: if you want the fused multiply-add, you should use the vfmaq_lane_f32 intrinsic instead.

It is apparently controlled by -ffp-contract, which defaults to on. The fmuladd intrinsics don't have the same optimizations for sinking splats into the loop BB as fma - I can add a quick fix for that.

For an actual fix, I agree it would be nice if the compiler understood and performed this optimization. It is not very obvious where that would happen considering the way llvm canonicalizes constants. In the meantime adding volatile to the array manages to address it somewhat, but leaves some extra stores in the preheader: https://godbolt.org/z/c87ejfo9T. There might be an alternative where the value is passed into a nop inline-asm block which the compiler cannot see through.
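Something along these lines is what I mean by the inline-asm alternative (untested sketch; "w" should be the NEON/FP register constraint on both AArch64 and 32-bit ARM):

#include <arm_neon.h>

// An empty asm that claims to read and modify v in a SIMD register. The compiler
// can no longer prove v is a splatted constant, so it should stop folding or
// re-materializing it, without forcing the value through memory.
static inline float32x4_t opaque(float32x4_t v) {
  __asm__("" : "+w"(v));
  return v;
}

The splat would then be wrapped once outside the loop, e.g. float32x4_t k = opaque(vdupq_n_f32(0.5f));.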

davemgreen added a commit to davemgreen/llvm-project that referenced this issue Aug 7, 2024
A fmuladd can be treated as a fma when sinking operands to the intrinsic,
similar to D126234.

Addresses a part of llvm#102195
@dsharlet
Author

dsharlet commented Aug 8, 2024

Thanks for the suggestion. I've been experimenting with volatile to work around this, and I've run into a few issues.

First off, ARM is not the only target affected by this general class of issues; it's just the one I was looking at when I worked up the motivation to file a bug.

I'm trying a pattern like this:

volatile float16x8_t constant = vreinterpretq_f16_u16(vmovq_n_u16(0x1234));
...
// Use in loop

This causes the compiler to reload the constant from memory every time it's used inside the loop, which I think is what I should expect.

The interesting thing is that if I use this same attempted workaround on x86, it causes the compiler to use the stack-spilled broadcast, just how I want!

What I'm confused about is why the x86 and ARM backends treat this so very differently. And I have to admit, I'm frustrated trying to chase down a workaround for this class of problems... it's such a clean, simple solution that works well on x86, but fails completely on ARM.

There might be an alternative where the value is passed into a nop inline-asm block which the compiler cannot see through.

I tried this, and the compiler generates pretty messy code that is noticeably slower, for example:

        @APP
        @NO_APP
        vmov.32 d17[0], r4
        cmp     r0, #16
        @APP
        @NO_APP
        vmov.32 d19[0], r6
        vmov.32 d18[0], r4
        @APP
        @NO_APP
        str     r7, [sp, #12]                   @ 4-byte Spill
        vmov.32 d21[0], r4
        vmov.32 d20[0], r6
        @APP
        @NO_APP
        str     r5, [sp, #4]                    @ 4-byte Spill
        str     r7, [sp, #20]                   @ 4-byte Spill
        vmov.32 d23[0], r4
        vmov.32 d22[0], r6
        @APP
        @NO_APP
        str     r5, [sp, #8]                    @ 4-byte Spill
        str     r7, [sp, #24]                   @ 4-byte Spill
        vmov.32 d25[0], r6
        vmov.32 d16[0], r12
        vmov.32 d24[0], r4
        @APP
        @NO_APP
        str     r7, [sp, #16]                   @ 4-byte Spill
        str     r4, [sp, #28]                   @ 4-byte Spill
        add     r7, r3, #4
        vmov.32 d26[0], r6
        ldr     r6, [sp, #12]                   @ 4-byte Reload
        vld1.16 {d0[], d1[]}, [r3:16]!
        vld1.16 {d28[], d29[]}, [r7:16]
        vld1.16 {d30[], d31[]}, [r3:16]
        vmov.32 d21[1], r11
        vmov.32 d20[1], r6
        ldr     r6, [sp, #4]                    @ 4-byte Reload
        vmov.32 d27[0], r12
        vmov.32 d23[1], r6
        ldr     r6, [sp, #20]                   @ 4-byte Reload
        vmov.32 d17[1], lr
        vmov.32 d22[1], r6
        ldr     r6, [sp, #8]                    @ 4-byte Reload
        vmov.32 d19[1], r8
        vmov.32 d25[1], r6
        ldr     r6, [sp, #24]                   @ 4-byte Reload
        vmov.32 d16[1], r9
        vmov.32 d24[1], r6
        ldr     r6, [sp, #16]                   @ 4-byte Reload
        vmov.32 d18[1], r10
        vmov.32 d27[1], r6
        ldr     r6, [sp, #28]                   @ 4-byte Reload
        vmov.32 d26[1], r6
        blo     .LBB0_2

I think that adding the inline asm to force storing the vectors also causes the compiler to spill and reload all the scalars that are live at the time too...?

@dsharlet
Author

dsharlet commented Aug 8, 2024

What I'm confused about is why the x86 and ARM backends treat this so very differently. And I have to admit, I'm frustrated trying to chase down a workaround for this class of problems... it's such a clean, simple solution that works well on x86, but fails completely on ARM.

To expand on this, the thing that works for x86 is:

volatile const __m128 vx = _mm_set1_ps(1.0f);

But the thing that works on ARM is:

volatile const float x = 1.0f;
const float32x4_t vx = vmovq_n_f32(x);

AFAICT, they really are achieving the same thing: forcing the compiler to broadcast, store the broadcast to the stack, and then reload that broadcasted value/keep it in a register. The confusing thing is that I don't actually expect either one to work: the x86 one seems like it would force the compiler to reload it every time it's used, rather than keep it in a register. And the ARM one seems like it shouldn't matter at all, but it does.

@dsharlet
Author

dsharlet commented Aug 8, 2024

Actually, I'm wrong: on x86, it does just reload the vector every time, as expected when it is volatile. It was tricky to notice because:

  • x86 can do many ops with memory operands, so the loads are free in such cases (good)
  • There are so few registers on x86 that the difference between reloading every time explicitly vs. spills was actually not that big :) (bad)

davemgreen added a commit that referenced this issue Aug 9, 2024
A fmuladd can be treated as a fma when sinking operands to the
intrinsic, similar to D126234.

Addresses a small part of #102195
@dsharlet
Author

I found a less destructive workaround. The inline assembly I was using in this comment above was __asm volatile("" : "+r"(x));.

__asm volatile("" : : "m"(x)) generates much cleaner code and is the solution I'm using now. I'd still love to see a proper fix for LLVM to treat broadcasts with a more reasonable cost, but this workaround is serviceable, despite resulting in an unnecessary store-load sequence in cases when no stack spill would have been necessary.
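In context, the pattern is roughly this (sketch; the kernel and constant are made up, the point is just where the asm goes):

#include <arm_neon.h>

void kernel(float* out, const float* in, int n) {
  float32x4_t k = vdupq_n_f32(0.5f);  // hypothetical broadcast constant
  // The "m" constraint forces k into a stack slot (the unnecessary store-load
  // mentioned above), and in practice that keeps the broadcast from being
  // redone or propagated inside the loop.
  __asm volatile("" : : "m"(k));
  for (int i = 0; i + 4 <= n; i += 4)
    vst1q_f32(out + i, vmulq_f32(vld1q_f32(in + i), k));
}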

@alexander-shaposhnikov
Collaborator

alexander-shaposhnikov commented Aug 17, 2024

cc @nikic , @efriedma-quic , @arsenm

@alexander-shaposhnikov
Collaborator

alexander-shaposhnikov commented Aug 27, 2024

I've looked into this issue a bit and created a small prototype for a MIR pass that collects broadcasts (whose users can be switched to indexed forms, e.g., FMLAv4i32_indexed) and attempts to perform this replacement & "combine" broadcasts. I don't see a good existing place for such a transformation, but I might be missing something. Any suggestions or advice would be greatly appreciated, maybe there is a simpler/better approach. @MatzeB @davemgreen @efriedma-quic @RKSimon

@fbarchard

We're not expecting the compiler to combine multiple dup's.
When there is an opportunity to 'combine broadcast' values, we do it on our end, combining several scalars into a structure loaded into a vector. For the scaling values, we use vmulq_lane_f32 to isolate the constant. For the others, we can vdupq_n_s32 the field before using it, etc.

We're not expecting dup'ed vectors to be made into scalars
If a single value is vld1q_dup_f32'ed or vdupq_n_f32'ed and then used with a mul or fma that supports lanes, there may be an opportunity to simplify the load/dup to only fill in a single element, e.g. use ldr instead of ld1_dup, and then use a lane.
But the initial set1 is usually outside the loop, while the mul-by-lane costs a micro-op and a vector unit to do the dup inside the loop, which is likely not a performance win on modern cpus.

Lanes are currently being converted to dups, which is problematic.
When constants are loaded outside the main loop and used with fma by lane inside the loop, clang sometimes vdups the values into multiple registers, causing register spills.
Outside the loop, each register gets dup'ed 4 times (4 floats) and 4 vectors are saved to the stack, and inside the loop the 4 fma lanes become 4 loads from the stack + 4 fmas.

On x86 I've tried replacing broadcast/set1 with a full vector and using shuffle instructions (faster than broadcast) to isolate the lanes I want, and clang replaced the shuffle with extract+broadcast.
On x86 clang underestimates the cost of broadcast/embedded broadcast: broadcast is 3 to 5 cycles and requires a register, and embedded broadcast has the same latency.

On x86 we'd expect memory arguments and embedded broadcasts to be used, to avoid register spills.
In one of our kernels there is an aligned vector of constants (the value 8 as a byte), pre-duplicated into a 32-byte array aligned in memory, and then loaded. The constants are loaded with alignment, but due to a spill they are saved to the stack unaligned, then loaded unaligned into a register inside the loop. The value is used for vsubb. If I use set1_epi8(8) it puts a broadcastb instead... both use a register, causing the main loop to save the register before the broadcast and then restore it, for a total of 4 instructions: save/broadcast/subb/restore.

On x86, I'd like to see set1() generate a code sequence to create vectors from immediates, instead of loading from memory. There are well-known techniques for many immediates; a simple one is to mov the immediate constant into a GPR and then broadcast it to a vector. Many constants are simple masks or powers of 2 that can be generated with 2 or 3 instructions.
See also http://0x80.pl/notesen/2023-01-19-avx512-consts.html
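For example, for set1_epi8(8) I'd hope for something roughly like this (sketch, AVX2 forms; in my experience clang constant-folds even this back into a load from memory, which is the problem):

#include <immintrin.h>

// immediate -> GPR -> xmm -> broadcast: no constant pool, no memory traffic.
__m256i splat8(void) {
  __m128i x = _mm_cvtsi32_si128(8);   // mov eax, 8 ; vmovd xmm0, eax
  return _mm256_broadcastb_epi8(x);   // vpbroadcastb ymm0, xmm0
}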

@dzaima

dzaima commented Aug 28, 2024

For the constant array, the important thing is to make it static; otherwise it'll be copied to the stack, which is where the stack stores/loads come from in the workaround-ful versions. (Compilers could conceivably improve on this if they can determine that there are no stores to the buffer, but for whatever reason neither gcc nor clang does; that can't apply with the asm optimization-fence though.) So here's a workaround version with ideal codegen.
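Roughly, the difference is just this (illustrative sketch, not the exact code from the link; constants are made up):

#include <arm_neon.h>

void kernel(float* out, const float* in, int n) {
  // 'static' keeps the array in .rodata; without it, the local array is copied
  // into this frame's stack and the loop reloads it from there.
  static const float k[4] = {1.0f, 2.0f, 3.0f, 4.0f};
  const float32x4_t vk = vld1q_f32(k);
  for (int i = 0; i + 4 <= n; i += 4)
    vst1q_f32(out + i, vmulq_laneq_f32(vld1q_f32(in + i), vk, 0));
}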

@davemgreen
Collaborator

I've looked into this issue a bit and created a small prototype for a MIR pass that collects broadcasts (whose users can be switched to indexed forms, e.g., FMLAv4i32_indexed) and attempts to perform this replacement & "combine" broadcasts. I don't see a good existing place for such a transformation, but I might be missing something. Any suggestions or advice would be greatly appreciated, maybe there is a simpler/better approach. @MatzeB @davemgreen @efriedma-quic @RKSimon

Does it need to combine the various ways we generate constants into a constant pool? It sounds like a sensible approach, hopefully. We sometimes do similar things pre-ISel by hoisting the constants and hiding them behind a bitcast to make sure they stay in another block. Doing it after ISel would have the advantage that any optimizations based on the values can happen first, though. And it sounds like it is more general than just constants? If you have a prototype, let's see how it does in the backend.

For the constant array, the important thing is to make it static; otherwise it'll be copied to the stack, which is where the stack stores/loads come from in the workaround-ful versions. (Compilers could conceivably improve on this if they can determine that there are no stores to the buffer, but for whatever reason neither gcc nor clang does; that can't apply with the asm optimization-fence though.) So here's a workaround version with ideal codegen.

It's good to hear you found a workaround. I'm not sure what your real case looks like, but you might be able to use vmlaq_laneq_f32 to index more lanes and use fewer registers. This is unrelated and you might already be aware of it, but depending on what you need to calculate, it might be beneficial to reassociate the operations into multiple chains that operate in parallel. Some CPUs have multiple vector units that can perform multiple operations per cycle if there is enough instruction-level parallelism in the code. One big long chain will make it more difficult to get the best performance.
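For the reassociation point, the difference is something like this (illustrative sketch; whether it helps depends on the CPU and on being allowed to reassociate the floating-point math):

#include <arm_neon.h>

// One long chain: every fma waits on the previous one.
float32x4_t dot_one_chain(const float* a, const float* b, int n) {
  float32x4_t acc = vdupq_n_f32(0.0f);
  for (int i = 0; i + 4 <= n; i += 4)
    acc = vfmaq_f32(acc, vld1q_f32(a + i), vld1q_f32(b + i));
  return acc;
}

// Two independent chains: the fmas can issue on separate vector pipes.
float32x4_t dot_two_chains(const float* a, const float* b, int n) {
  float32x4_t acc0 = vdupq_n_f32(0.0f);
  float32x4_t acc1 = vdupq_n_f32(0.0f);
  for (int i = 0; i + 8 <= n; i += 8) {
    acc0 = vfmaq_f32(acc0, vld1q_f32(a + i),     vld1q_f32(b + i));
    acc1 = vfmaq_f32(acc1, vld1q_f32(a + i + 4), vld1q_f32(b + i + 4));
  }
  return vaddq_f32(acc0, acc1);
}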

@qcolombet
Collaborator

I had a quick look after @alexander-shaposhnikov pinged me offline and I am wondering if I am looking at the right thing.
I do not see any spill within the loop from the compiler explorer links.

Alternatively could you share a .ll and the llc command line to reproduce the issue?

One thing that we won't get around, though, is the fact that instcombine propagates constants aggressively. Like @davemgreen said, we have a pass that mitigates some of this after the fact, pre-ISel (llvm/lib/Transforms/Scalar/ConstantHoisting.cpp).

@efriedma-quic
Collaborator

We could potentially add an "aarch64_fma_lane" intrinsic to LLVM, and make clang call it instead of using the generic fma intrinsic. That wouldn't really solve anything for generic code, but it would block the constant propagation optimization that's causing trouble here.

The general problem of packing arbitrary values into vectors registers to reduce register pressure is potentially interesting, but hard to solve well.

@qcolombet
Collaborator

Talked to @alexander-shaposhnikov offline and I understand what's left to fix now.
Anyhow, I agree with @efriedma-quic!

@davemgreen
Collaborator

I know we have added lane-wise intrinsics in the past, but I don't love it when we have had to do that, especially for something like fma, which is so widely used. The loss of performance from not constant folding / other optimizations would worry me.

There are always two types of users for the intrinsics (or a spectrum of people between those two extremes). There are expert users who know exactly which instructions they want where, and really just want the compiler to do register allocation and maybe a bit of scheduling for them. On the other end there are users who know much less about the architecture, let alone the micro-architecture. They often use higher-level SIMD libraries that are built up out of lower-level intrinsics and expect the compiler to do a lot of optimization to get them into the best shape possible. We need to consider both.

My vote would be to try and optimize this case in the backend if we have a patch to do it. It might not be perfect but we can make it better as we find more cases where it doesn't work and improve it over time.

@RKSimon
Collaborator

RKSimon commented Aug 30, 2024

On X86 we've added the X86FixupVectorConstants pass that detects constant vector loads / folded instructions that can be converted to broadcasts/extload/avx512-folded-broadcasts etc.

The next step is to remove the DAG folds of vector constants to VBROADCAST_LOAD/SUBV_BROADCAST_LOAD nodes and let the pass handle it entirely: https://github.com/RKSimon/llvm-project/tree/perf/broadcast-avx512 - but untangling the regressions isn't fun and I've gotten distracted with other things recently.

I've also been considering an unfold pass (#86669) - a bit like MachineLICM, but it could be used to help x86 cases where we might be able to save constant pool space, pack scalar constants into a single vector register, create constants without memory access, etc., depending on register pressure.

@alexander-shaposhnikov
Collaborator

Thanks everyone for the feedback.
I'm going to spend a few more days doing experiments and then will send a PR (~early next week).
