
[clang] On a fixed-size loop clang generates an individual copy of the body of the loop when specific optimization is enabled #73456


Open
rilysh opened this issue Nov 26, 2023 · 2 comments


@rilysh
Contributor

rilysh commented Nov 26, 2023

Hello,
As the title implies, for a loop with a small fixed size, clang generates a separate copy of the entire loop body for each iteration instead of an actual loop.

For example:

#include <stdio.h>

int main(void)
{
	unsigned int i;

	for (i = 0; i < 29; i++)
		fprintf(stdout, "i: %u\n", i);
}

With optimization disabled (-O0), clang generates:

movl    $0, -4(%rbp)
movl    $0, -8(%rbp)
.LBB0_1:
cmpl    $29, -8(%rbp)
jae     .LBB0_4
movq    stdout@GOTPCREL(%rip), %rax
movq    (%rax), %rdi
movl    -8(%rbp), %edx
leaq    .L.str(%rip), %rsi
movb    $0, %al
callq   fprintf@PLT
movl    -8(%rbp), %eax
addl    $1, %eax
movl    %eax, -8(%rbp)
jmp .LBB0_1

(Ignore other labels)

The generated assembly first sets the index to zero and then compares it against 29. If the index is lower, it calls fprintf(), increases the index by one (add 1), and jumps back to the comparison; otherwise it leaves the loop.
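The control flow described above can be sketched at the source level with explicit gotos. This is only an illustration: the function name `print_with_gotos`, the labels, and the `calls` counter are hypothetical and not part of the original reproducer, and the comments map each statement to the -O0 instruction it roughly corresponds to.

```c
#include <stdio.h>

/* Hypothetical helper mirroring the -O0 control flow with explicit gotos.
 * Returns how many times fprintf() was called. */
int print_with_gotos(FILE *out)
{
	unsigned int i = 0;          /* movl $0, -8(%rbp) */
	int calls = 0;               /* illustration only, not in the original */

check:
	if (i >= 29)                 /* cmpl $29, -8(%rbp) ; jae .LBB0_4 */
		goto done;
	fprintf(out, "i: %u\n", i);  /* callq fprintf@PLT */
	calls++;
	i = i + 1;                   /* addl $1, %eax ; movl %eax, -8(%rbp) */
	goto check;                  /* jmp .LBB0_1 */
done:
	return calls;
}
```

Calling `print_with_gotos(stdout)` prints the same 29 lines as the original loop.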

However, with optimization levels that favor performance over size (e.g. -O2, -O3, -Ofast), clang generates:

movq	stdout@GOTPCREL(%rip), %r14
movq	(%r14), %rdi
leaq	.L.str(%rip), %rbx
movq	%rbx, %rsi
xorl	%edx, %edx
xorl	%eax, %eax
callq	fprintf@PLT
movq	(%r14), %rdi
movq	%rbx, %rsi
movl	$1, %edx
xorl	%eax, %eax
callq	fprintf@PLT
movq	(%r14), %rdi
movq	%rbx, %rsi
movl	$2, %edx
xorl	%eax, %eax
callq	fprintf@PLT

[ ... similar copies with different index value ... ]

movq	(%r14), %rdi
movq	%rbx, %rsi
movl	$28, %edx
xorl	%eax, %eax
callq	fprintf@PLT
xorl	%eax, %eax

In this assembly output, clang emits a separate copy of the loop body (the fprintf() call) for each iteration and loads the index as a constant (the value it would have after being incremented). Note that this behavior only happens when the loop size is > 0 and <= 28.

In comparison, GCC generates the following assembly output (with -O2, -Ofast, etc.):

xorl	%ebx, %ebx
.L2:
movq	stdout(%rip), %rdi
movl	%ebx, %edx
movq	%rbp, %rsi
xorl	%eax, %eax
addl	$1, %ebx
call	fprintf@PLT
cmpl	$29, %ebx
jne	.L2
addq	$8, %rsp
xorl	%eax, %eax

This is quite similar to Clang's -O0 output in that the loop is kept as a loop. With -Os and -Ofast, GCC generates nearly identical assembly with a few changes.
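To make the difference between the two outputs concrete at the source level, here is a minimal sketch of a rolled loop next to its fully unrolled equivalent. It assumes a trip count of 4 instead of 29 purely to keep the example short, and the function names `print_rolled4` and `print_unrolled4` are hypothetical. Both return the total number of characters written so the two forms can be compared.

```c
#include <stdio.h>

/* Rolled form: one copy of the body plus a counter, a compare, and a branch.
 * This is the shape of the GCC output and of Clang's -O0 output. */
int print_rolled4(FILE *out)
{
	int written = 0;
	unsigned int i;

	for (i = 0; i < 4; i++)
		written += fprintf(out, "i: %u\n", i);
	return written;
}

/* Fully unrolled form: one copy of the body per iteration, with the index
 * folded into a constant. This is the shape of Clang's -O2 output. */
int print_unrolled4(FILE *out)
{
	int written = 0;

	written += fprintf(out, "i: %u\n", 0u);
	written += fprintf(out, "i: %u\n", 1u);
	written += fprintf(out, "i: %u\n", 2u);
	written += fprintf(out, "i: %u\n", 3u);
	return written;
}
```

The two functions produce identical output; the unrolled one trades the compare-and-branch per iteration for code size that grows with the trip count.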

I've tested it with Clang 17.0.1 and GCC 13.2. Here's the Godbolt link: https://godbolt.org/z/1xq4qfGzd
I've only tested this on x86-64 and RISC-V 64-bit platforms (although I don't think the behavior depends on the platform, since it only happens when performance-oriented optimizations are enabled).

@github-actions github-actions bot added the clang Clang issues not falling into any other category label Nov 26, 2023
@EugeneZelenko EugeneZelenko added loopoptim and removed clang Clang issues not falling into any other category labels Nov 27, 2023
@fhahn
Contributor

fhahn commented Nov 27, 2023

Clang by default aggressively unrolls, so in a way this is working as expected at the moment. See #42332 for a bit more in-depth discussion

@rilysh
Contributor Author

rilysh commented Nov 27, 2023

> Clang by default aggressively unrolls, so in a way this is working as expected at the moment. See #42332 for a bit more in-depth discussion

Thanks for the update! Issue #42332 seems to have stalled, and its last update was more than a year ago. Does the LLVM team have any plans to address this issue? In most cases unrolling such loops seems absolutely unnecessary and rather problematic (e.g. when there are 10 more such loops that will also be unrolled), and I'd prefer an explicit -funroll-all-loops (...which Clang doesn't support) when needed.
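In the meantime, a possible per-loop workaround is a loop hint pragma. The sketch below assumes Clang's `#pragma clang loop unroll(disable)` and, for GCC 8 or later, `#pragma GCC unroll 1`; the function name `print_no_unroll` and the `calls` counter are hypothetical. I have not checked the resulting assembly on every compiler version, so treat this as something to verify on Godbolt rather than a guaranteed fix.

```c
#include <stdio.h>

/* Hypothetical variant of the reproducer that asks the compiler to keep
 * the loop rolled. Returns how many times fprintf() was called. */
int print_no_unroll(FILE *out)
{
	unsigned int i;
	int calls = 0;

	/* Clang also defines __GNUC__, so it has to be checked first. */
#if defined(__clang__)
#pragma clang loop unroll(disable)
#elif defined(__GNUC__)
#pragma GCC unroll 1
#endif
	for (i = 0; i < 29; i++) {
		fprintf(out, "i: %u\n", i);
		calls++;
	}
	return calls;
}
```

For a whole translation unit, compiling with -fno-unroll-loops is another option to try; the pragma has the advantage of affecting only the loop it precedes.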
