[clang] On a fixed-size loop clang generates an individual copy of the body of the loop when specific optimization is enabled #73456

rilysh · 2023-11-26T19:04:20Z

Hello,
As the title implies, for a loop with a certain fixed size, clang unexpectedly generates a separate copy of the entire loop body for each iteration.

For example:

#include <stdio.h>

int main(void)
{
	unsigned int i;

	for (i = 0; i < 29; i++)
		fprintf(stdout, "i: %u\n", i);
}

With optimization disabled (-O0), clang generates:

movl    $0, -4(%rbp)
movl    $0, -8(%rbp)
.LBB0_1:
cmpl    $29, -8(%rbp)
jae     .LBB0_4
movq    stdout@GOTPCREL(%rip), %rax
movq    (%rax), %rdi
movl    -8(%rbp), %edx
leaq    .L.str(%rip), %rsi
movb    $0, %al
callq   fprintf@PLT
movl    -8(%rbp), %eax
addl    $1, %eax
movl    %eax, -8(%rbp)
jmp     .LBB0_1

(Ignore other labels)

The generated assembly first sets the index to zero and then compares it against 29. If the index is lower, it calls fprintf() and increases the index by one (add 1); otherwise it leaves the loop.

However, with performance-focused (rather than size-focused) optimization levels, e.g. -O2, -O3, or -Ofast, clang generates:

movq	stdout@GOTPCREL(%rip), %r14
movq	(%r14), %rdi
leaq	.L.str(%rip), %rbx
movq	%rbx, %rsi
xorl	%edx, %edx
xorl	%eax, %eax
callq	fprintf@PLT
movq	(%r14), %rdi
movq	%rbx, %rsi
movl	$1, %edx
xorl	%eax, %eax
callq	fprintf@PLT
movq	(%r14), %rdi
movq	%rbx, %rsi
movl	$2, %edx
xorl	%eax, %eax
callq	fprintf@PLT

[ ... similar copies with different index value ... ]

movq	(%r14), %rdi
movq	%rbx, %rsi
movl	$28, %edx
xorl	%eax, %eax
callq	fprintf@PLT
xorl	%eax, %eax

In this assembly output, clang emits a separate copy of the loop body (the fprintf() call) for every iteration and loads the index as a constant each time (the value it would have after incrementing). Note that this behavior only happens if the loop is > 0 and <= 28. In comparison, GCC generates the following assembly output (with -O2, -Ofast, etc.):

xorl	%ebx, %ebx
.L2:
movq	stdout(%rip), %rdi
movl	%ebx, %edx
movq	%rbp, %rsi
xorl	%eax, %eax
addl	$1, %ebx
call	fprintf@PLT
cmpl	$29, %ebx
jne	.L2
addq	$8, %rsp
xorl	%eax, %eax

This is quite similar in structure to Clang's -O0 output: the loop is kept as a single body with a compare and a branch. With -Os and -Ofast, GCC generates nearly identical assembly with a few changes.
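As a source-level illustration of the difference between the two outputs, here is a minimal sketch of the same loop in rolled and fully unrolled form. It is not compiler output. The function names `print_rolled` and `print_unrolled`, the `calls` counter, and the trip count of 4 (instead of 29, to keep it short) are assumptions made for this example only.

```c
#include <stdio.h>

/* Rolled form: one copy of the body plus a compare and a branch,
 * which is the shape of the -O0 clang output and the GCC output. */
static int print_rolled(FILE *out)
{
	unsigned int i;
	int calls = 0; /* hypothetical counter, only here for checking */

	for (i = 0; i < 4; i++) {
		fprintf(out, "i: %u\n", i);
		calls++;
	}
	return calls;
}

/* Fully unrolled form: one copy of the body per iteration, with the
 * index passed as a constant. This mirrors the shape of clang's -O2
 * output, where each fprintf() call gets an immediate in %edx. */
static int print_unrolled(FILE *out)
{
	int calls = 0;

	fprintf(out, "i: %u\n", 0u); calls++;
	fprintf(out, "i: %u\n", 1u); calls++;
	fprintf(out, "i: %u\n", 2u); calls++;
	fprintf(out, "i: %u\n", 3u); calls++;
	return calls;
}
```

Both functions print the same lines and make the same number of fprintf() calls; the unrolled one trades the compare and branch for more code.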

I've tested it with Clang 17.0.1 and GCC 13.2. Here's the Godbolt link: https://godbolt.org/z/1xq4qfGzd
I've only tested this on x86-64 and RISC-V 64-bit platforms (although I don't think the behavior varies by platform, since it only happens when performance-oriented optimizations are enabled).

fhahn · 2023-11-27T10:14:43Z

Clang by default aggressively unrolls, so in a way this is working as expected at the moment. See #42332 for a bit more in-depth discussion.

rilysh · 2023-11-27T15:25:19Z

> Clang by default aggressively unrolls, so in a way this is working as expected at the moment. See #42332 for a bit more in-depth discussion.

Thanks for the update! Issue #42332 seems to have had no activity, and its last update was more than a year ago. Does the LLVM team have any plans to address this issue? In most cases unrolling such loops seems absolutely unnecessary and rather problematic (e.g. when there are 10 more such loops that will also be unrolled), and I'd rather have an explicit -funroll-all-loops (which Clang doesn't support) when needed.
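For anyone who wants to opt out per loop in the meantime, one possible workaround is Clang's loop pragma. This is a minimal sketch, not something confirmed in this thread as the recommended fix. The function name `print_indices`, the `LOOP_COUNT` macro, and the `calls` counter are assumptions made for the example; the `__clang__` guard is only there so other compilers do not warn about an unknown pragma.

```c
#include <stdio.h>

#define LOOP_COUNT 29 /* same fixed bound as the loop in the report */

/* Hypothetical helper: prints the indices and returns the number of
 * fprintf() calls made, so the behavior can be checked. */
static int print_indices(FILE *out)
{
	unsigned int i;
	int calls = 0;

	/* Ask clang not to unroll the loop that follows. */
#if defined(__clang__)
#pragma clang loop unroll(disable)
#endif
	for (i = 0; i < LOOP_COUNT; i++) {
		fprintf(out, "i: %u\n", i);
		calls++;
	}
	return calls;
}
```

Calling `print_indices(stdout)` prints the same 29 lines as the original program; only the generated code should differ. Clang also accepts -fno-unroll-loops on the command line, which applies to the whole translation unit rather than a single loop, so it is worth checking the resulting assembly before relying on either.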

github-actions bot added the clang Clang issues not falling into any other category label Nov 26, 2023

EugeneZelenko added loopoptim and removed clang Clang issues not falling into any other category labels Nov 27, 2023
