It's clear that the generated assembly first sets the index to zero and then compares it against 29. While the index is lower, the loop body runs and the index is increased by one (add 1).
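For readers without the original snippet at hand, the loop being described can be sketched roughly as follows. This is a reconstruction from the description only: the function name print_indices, the format string, and the iteration counter are assumptions, and the reported code presumably lived directly in main().

```c
#include <stdio.h>

/* Hypothetical reconstruction of the reported loop: the index starts at 0,
 * is compared against 29, and is incremented after each fprintf() call.
 * The function name and the format string are assumptions. */
int print_indices(FILE *out)
{
    int count = 0;
    for (int i = 0; i < 29; i++) {
        fprintf(out, "%d\n", i);  /* loop body: a single fprintf() call */
        count++;                  /* only used to report the iteration count */
    }
    return count;
}
```

Calling print_indices(stdout) prints the indices 0 through 28 and returns the number of iterations.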
However, with optimization enabled (performance-focused rather than size-focused, e.g. -O2, -O3, -Ofast, etc.), clang generates:
    movq stdout@GOTPCREL(%rip), %r14
    movq (%r14), %rdi
    leaq .L.str(%rip), %rbx
    movq %rbx, %rsi
    xorl %edx, %edx
    xorl %eax, %eax
    callq fprintf@PLT
    movq (%r14), %rdi
    movq %rbx, %rsi
    movl $1, %edx
    xorl %eax, %eax
    callq fprintf@PLT
    movq (%r14), %rdi
    movq %rbx, %rsi
    movl $2, %edx
    xorl %eax, %eax
    [ ... similar copies with different index value ... ]
    movq (%r14), %rdi
    movq %rbx, %rsi
    movl $28, %edx
    xorl %eax, %eax
    callq fprintf@PLT
    xorl %eax, %eax
In this assembly output, clang emits the entire loop body (the fprintf() call) once per iteration, with each copy loading the index value it would have after incrementing. Note that this behavior only happens if the loop is > 0 and <= 28. In comparison, GCC generates the following assembly output (with -O2, -Ofast, etc.):
This is quite similar to Clang's -O0 output. With -Os and -Ofast, GCC generates nearly identical assembly with a few changes.
I've tested it with Clang 17.0.1 and GCC 13.2. Here's the Godbolt link: https://godbolt.org/z/1xq4qfGzd
I've only tested this on x86-64 and RISC-V 64-bit platforms (although I don't think the platform matters, since it only happens when performance-oriented optimizations are enabled).
Clang by default aggressively unrolls, so in a way this is working as expected at the moment. See #42332 for a bit more in-depth discussion
Thanks for the update! Issue #42332 seems to have stalled; its last update was more than a year ago. Does the LLVM team have any plans to address this issue? In most cases unrolling such loops seems absolutely unnecessary and rather problematic (e.g. when there are 10 more loops that will be unrolled), and I'd prefer an explicit -funroll-all-loops (...which Clang doesn't support) when needed.
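In the meantime, unrolling can be turned off per loop. Below is a minimal sketch using Clang's loop pragma on a hypothetical 29-iteration loop matching the description in the report. The function name and format string are assumptions, and whether the resulting assembly matches GCC's is something to check on Godbolt rather than take from this snippet.

```c
#include <stdio.h>

/* Hypothetical 29-iteration loop with a hint asking Clang to keep it rolled.
 * The function name and the format string are assumptions. */
int print_indices_rolled(FILE *out)
{
    int count = 0;
    /* Clang-specific hint: do not unroll the loop that follows. */
#pragma clang loop unroll(disable)
    for (int i = 0; i < 29; i++) {
        fprintf(out, "%d\n", i);
        count++;
    }
    return count;
}
```

The pragma only affects the loop directly after it. For a whole translation unit, passing -fno-unroll-loops to Clang should have a similar effect, though it is worth verifying against the generated assembly. Other compilers typically ignore the pragma, possibly with an unknown-pragma warning.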
Hello,
As the title implies, for loops of a certain size, clang misleadingly generates a separate copy of the entire loop body for each iteration.
For example:
With optimization level (-O0) clang generates: (Ignore other labels)