ARM/AArch64 backend aggressively pessimizes code with broadcasted constants #102195
@llvm/issue-subscribers-backend-aarch64 Author: Dillon (dsharlet)
I'm having a lot of trouble with the ARM (32- and 64-bit) backends de-optimizing code related to broadcasted constants. There are several issues.

Here's an example that demonstrates several of them: https://godbolt.org/z/chjx4d4vh

If the compiler compiled the code as written, there would be no register spills, because the constants would occupy half as many registers. I included a commented call to `make_opaque`, one attempted workaround meant to trick the compiler into not treating these values as constants (at the expense of a function call). It does do that, but the compiler still moves the broadcasts (`dup` instructions) out of the loop and spills some of the registers.

I run into this issue very frequently. Any suggested workarounds, e.g. some annotation to force the compiler to keep a broadcast outside of the loop, or possible fixes to LLVM, would be very welcome. As it stands, I find the `vmla_lane_X` intrinsics to be almost useless because of this issue.
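To make the shape of the problem concrete, here is a minimal sketch of this kind of kernel (the names, constants, and loop body are illustrative assumptions, not the code from the godbolt link). The constants are packed two per `d` register and selected by lane, so as written they should only need half as many registers as one-splat-per-register code:

```c
#include <arm_neon.h>

// Illustrative sketch only (not the original example): constants packed
// two-per-register and selected by lane inside the loop.
void kernel(float *out, const float *in, int n) {
  static const float cst[4] = {1.1f, 2.2f, 3.3f, 4.4f};  // made-up constants
  const float32x2_t c01 = vld1_f32(cst + 0);  // holds constants 0 and 1
  const float32x2_t c23 = vld1_f32(cst + 2);  // holds constants 2 and 3
  for (int i = 0; i + 4 <= n; i += 4) {
    float32x4_t x = vld1q_f32(in + i);
    float32x4_t acc = vld1q_f32(out + i);
    acc = vmlaq_lane_f32(acc, x, c01, 0);
    acc = vmlaq_lane_f32(acc, x, c01, 1);
    acc = vmlaq_lane_f32(acc, x, c23, 0);
    acc = vmlaq_lane_f32(acc, x, c23, 1);
    vst1q_f32(out + i, acc);
  }
}
```

The complaint is that instead of keeping `c01`/`c23` as two packed registers, the backend constant-folds each lane and hoists a separate full-width splat per constant ahead of the loop, doubling the register footprint and causing spills.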
Note: `vmlaq_lane_f32` should not be using the fused multiply-add instruction either ...
There's about 4 bugs here (see below).
P.S. there was a bug in vld21_dup_f32, but it's fixed at HEAD.
Thanks for pointing that out, I was unaware of this, especially because it generated the instruction I expected! That said, I corrected the example (and added ...).

Edit: I forgot to check; my workaround does work in this case now! However, my workaround has the cost of a function call, so I would still really appreciate a fix for this bug, and also any workarounds that don't add overhead, if you can think of any.
It is apparently controlled by -ffp-contract, which defaults to on. The fmuladd intrinsics don't have the same optimizations for sinking splats into the loop BB as fma; I can add a quick fix for that. For an actual fix, I agree it would be nice if the compiler understood and performed this optimization. It is not very obvious where that would happen, considering the way LLVM canonicalizes constants. In the meantime, adding volatile to the array manages to address it somewhat, but leaves some extra stores in the preheader: https://godbolt.org/z/c87ejfo9T. There might be an alternative where the value is passed into a nop inline-asm block which the compiler cannot see through.
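For reference, the nop inline-asm idea could look something like this (an untested sketch; the helper name is made up). The empty asm claims to read and rewrite the value through the "w" (SIMD/FP register) constraint, so the compiler can no longer treat the lanes as known constants, yet no instruction is emitted and no call is needed:

```c
#include <arm_neon.h>

// Hypothetical helper: hide a packed-constant vector from the optimizer.
// The empty asm "uses and redefines" v, which blocks constant folding across it,
// while v stays in a SIMD/FP register ("w" constraint) at zero instruction cost.
static inline float32x2_t make_opaque_f32x2(float32x2_t v) {
  __asm__("" : "+w"(v));
  return v;
}
```

Calling this on each packed constant just before the loop should, in principle, keep the pair in a single register without the overhead of an out-of-line function call.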
A fmuladd can be treated as a fma when sinking operands to the intrinsic, similar to D126234. Addresses a part of llvm#102195
Thanks for the suggestion. I've been experimenting with volatile to work around this, and I've run into a few issues. First off, ARM is not the only target affected by this general class of issues; it's just the one I was looking at and worked up the motivation to file a bug about. I'm trying a pattern like this:
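A hypothetical sketch of such a pattern (made-up names and values; the volatile stack slot is what keeps the compiler from treating the values as known constants):

```c
#include <arm_neon.h>

// Illustrative only: route two constants through a volatile stack slot so the
// compiler must treat them as unknown values, then load them as a packed pair.
static inline float32x2_t opaque_pair(float a, float b) {
  volatile float cst[2] = {a, b};
  return vld1_f32((const float *)cst);
}
// e.g. before the loop: const float32x2_t c01 = opaque_pair(1.1f, 2.2f);
```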
This causes the compiler to reload ...

The interesting thing is that if I use this same attempted workaround on x86, it causes the compiler to use the stack-spilled broadcast, just how I want! What I'm confused about is why the x86 and ARM backends treat this so very differently. And I have to admit, I'm frustrated trying to chase down a workaround for this class of problems... it's such a clean, simple solution that works well on x86, but fails completely on ARM.
I tried this, and the compiler generates pretty messy code that is noticeably slower, for example:
I think that adding the inline asm to force storing the vectors also causes the compiler to spill and reload all the scalars that are live at the time too...?
To expand on this, the thing that works for x86 is:
But the thing that works on ARM is:
AFAICT, they really are achieving the same thing: forcing the compiler to broadcast and store the broadcast to the stack, and then reload that broadcasted value / keep it in a register. The confusing thing is that I don't actually expect either one to work: the x86 one seems like it would force the compiler to reload it every time it is used, rather than keep it in a register. And the ARM one seems like it shouldn't matter at all, but it does.
Actually, I'm wrong: on x86, it does just reload the vector every time, as expected when it is volatile. It was tricky because:
A fmuladd can be treated as a fma when sinking operands to the intrinsic, similar to D126234. Addresses a small part of #102195
I found a less destructive workaround. The inline assembly I was using above in this comment was using ...
cc @nikic, @efriedma-quic, @arsenm
I've looked into this issue a bit and created a small prototype of a MIR pass that collects broadcasts (whose users can be switched to indexed forms, e.g. FMLAv4i32_indexed), attempts to perform this replacement, and "combines" the broadcasts. I don't see a good existing place for such a transformation, but I might be missing something. Any suggestions or advice would be greatly appreciated; maybe there is a simpler/better approach. @MatzeB @davemgreen @efriedma-quic @RKSimon
- We're not expecting the compiler to combine multiple dup's.
- We're not expecting dup'ed vectors to be made into scalars.
- Lanes are currently being converted to dups, which is problematic.
- On x86 I've tried replacing broadcast/set1 with a full vector and using shuffle instructions (faster than broadcast) to isolate the lanes I want, and clang replaced the shuffle with extract+broadcast.
- On x86 we'd expect memory arguments and embedded broadcasts to be used, to avoid register spills.
- On x86, I'd like to see set1() generate a code sequence to create vectors with immediates, instead of loading from memory. There are well-known techniques for many immediates; a simple one is to mov the constant into a GPR as an immediate and then broadcast it to a vector. Many constants are simple masks or powers of 2 that can be generated with 2 or 3 instructions (see the sketch below).
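As one concrete illustration of that last point (a sketch of well-known tricks with assumed example values, not something clang is claimed to do automatically today):

```c
#include <immintrin.h>

// Classic register-only materialization of a mask constant with AVX2:
// vpcmpeqd (all ones) followed by vpslld gives e.g. the float sign-bit mask
// 0x80000000 in every lane in two instructions, with no constant-pool load.
// Note: a compiler may still constant-fold this back into a memory constant,
// which is exactly the kind of behavior being discussed here.
static inline __m256 signbit_mask_ps(void) {
  __m256i ones = _mm256_cmpeq_epi32(_mm256_setzero_si256(), _mm256_setzero_si256());
  return _mm256_castsi256_ps(_mm256_slli_epi32(ones, 31));
}

// For an arbitrary 32-bit value, set1 of a runtime value already lowers to
// GPR/XMM moves plus vpbroadcastd; the suggestion is to use the same
// mov-immediate + broadcast sequence for compile-time constants as well.
static inline __m256i broadcast_value(int v) {
  return _mm256_set1_epi32(v);
}
```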
For the constant array, the important thing is to make it ...
Does it need to combine the various ways we generate constants into a constant pool? It sounds like it is hopefully a sensible approach. We sometimes do similar things pre-ISel by hoisting the constants and hiding them behind a bitcast to make sure they stay in another block. Doing it after ISel would have the advantage that any optimizations based on the values can happen first, though. And it sounds like it is more general than just constants? If you have a prototype, let's see how it does in the backend.
It's good to hear you found a workaround. I'm not sure what your real case looks like, but you might be able to use vmlaq_laneq_f32 to index more lanes and use fewer registers. This is unrelated and you might already be aware, but depending on what you need to calculate it might be beneficial to reassociate the operations into multiple chains that operate in parallel. Some CPUs have multiple vector units that can perform multiple operations per cycle if there is enough instruction-level parallelism in the code. One big long chain will make it harder to get the best performance.
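A sketch of both suggestions together (hypothetical names and constants): four coefficients packed into a single q register and selected with vmlaq_laneq_f32, plus two independent accumulators to expose instruction-level parallelism.

```c
#include <arm_neon.h>

// Hypothetical sketch (AArch64 only: the *_laneq_* intrinsics are A64):
// one q register holds four coefficients, selected by lane, and two
// accumulator chains keep multiple MLA/FMA units busy.
void kernel(float *out, const float *a, const float *b, int n) {
  static const float cf[4] = {1.1f, 2.2f, 3.3f, 4.4f};  // made-up coefficients
  const float32x4_t coeffs = vld1q_f32(cf);
  for (int i = 0; i + 8 <= n; i += 8) {
    float32x4_t acc0 = vld1q_f32(out + i);
    float32x4_t acc1 = vld1q_f32(out + i + 4);
    acc0 = vmlaq_laneq_f32(acc0, vld1q_f32(a + i),     coeffs, 0);
    acc1 = vmlaq_laneq_f32(acc1, vld1q_f32(a + i + 4), coeffs, 1);
    acc0 = vmlaq_laneq_f32(acc0, vld1q_f32(b + i),     coeffs, 2);
    acc1 = vmlaq_laneq_f32(acc1, vld1q_f32(b + i + 4), coeffs, 3);
    vst1q_f32(out + i, acc0);
    vst1q_f32(out + i + 4, acc1);
  }
}
```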
I had a quick look after @alexander-shaposhnikov pinged me offline, and I am wondering if I am looking at the right thing. Alternatively, could you share a .ll file and the llc command line to reproduce the issue? One thing that we won't get around, though, is the fact that ...
We could potentially add an "aarch64_fma_lane" intrinsic to LLVM, and make clang call it instead of using the generic fma intrinsic. That wouldn't really solve anything for generic code, but it would block the constant propagation optimization that's causing trouble here. The general problem of packing arbitrary values into vector registers to reduce register pressure is potentially interesting, but hard to solve well.
Talked to @alexander-shaposhnikov offline and understand what's left to fix now.
I know we have added lane-wise intrinsics in the past, but I don't love it when we have had to do it, especially for something like fma which is so widely used. The loss of performance from not constant folding / doing other optimizations would worry me. There are always two types of users for the intrinsics (or a spectrum of people between the two extremes). There are expert users who know exactly the instructions they want where, and really just want the compiler to do register allocation and maybe a bit of scheduling for them. On the other end there are users who know much less about the architecture, let alone the micro-architecture. They often use higher-level SIMD libraries that are built up out of lower-level intrinsics and expect the compiler to do a lot of optimization to get them into the best shape possible. We need to consider both. My vote would be to try to optimize this case in the backend if we have a patch to do it. It might not be perfect, but we can make it better as we find more cases where it doesn't work and improve it over time.
On x86 we've added the X86FixupVectorConstants pass, which detects constant vector loads / folded instructions that can be converted to broadcasts/extloads/AVX512 folded broadcasts, etc. The next step is to remove the DAG folds of vector constants to VBROADCAST_LOAD/SUBV_BROADCAST_LOAD nodes and let the pass handle it entirely: https://github.com/RKSimon/llvm-project/tree/perf/broadcast-avx512 - but untangling the regressions isn't fun and I've gotten distracted with other things recently. I've also been considering an unfold pass (#86669) - a bit like MachineLICM, but it could be used to help x86 cases where we might be able to save constant pool space, pack scalar constants into a single vector register, create constants without memory accesses, etc., depending on register pressure.
Thanks everyone for the feedback, ...