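After exhausting the vector registers inside a loop, clang hoists a broadcast out of the loop and spills its result to the stack. This is inefficient, since broadcasting from memory is as fast as loading the spilled vector back from the stack, so the spill only adds a store and a stack slot.
Consider the following pseudocode: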
float *restrict arr = ...; // prevent aliasing
loop {
exhaust vector registers
__m256 x = _mm256_set1_ps(arr[0]);
use x
}
When clang compiles this, arr[0] is broadcast once outside the loop, and x is then spilled to the stack and reloaded on every iteration:
vbroadcastss ymm0, dword ptr [rdx]
vmovups ymmword ptr [rsp - 72], ymm0
loop:
...
load x from stack
use x
jmp loop
The expected behavior is:
loop:
...
vbroadcastss x, dword ptr [rdx]
use x
jmp loop
Obligatory Godbolt Sample: https://godbolt.org/z/v7MYcefxY (Sorry if my method of stressing register allocation results in too much asm/bytecode.)
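For anyone who wants a smaller compilable starting point than the Godbolt sample, here is a rough sketch of the same pattern. It is not the code behind the link. The names `scaled_sum`, `src`, and `NUM_ACC` are made up for illustration, and the choice of 16 accumulators assumes the 16 ymm registers available with AVX on x86-64. Whether this particular shape actually triggers the spill depends on the clang version and flags, so treat the Godbolt link as the real reproducer.

```c
#include <immintrin.h>
#include <stddef.h>

/* Assumption: 16 live accumulators are enough to occupy every ymm
   register on x86-64 AVX, leaving no free register for x. */
#define NUM_ACC 16

/* Hypothetical kernel: sums arr[0] * src[j] over iters * NUM_ACC * 8 floats.
   The target attribute lets this build without passing -mavx globally. */
__attribute__((target("avx")))
float scaled_sum(const float *restrict arr, const float *restrict src,
                 size_t iters)
{
    __m256 acc[NUM_ACC];
    for (int k = 0; k < NUM_ACC; ++k)
        acc[k] = _mm256_setzero_ps();

    for (size_t i = 0; i < iters; ++i) {
        /* Loop-invariant broadcast: the value the issue is about. */
        __m256 x = _mm256_set1_ps(arr[0]);
        for (int k = 0; k < NUM_ACC; ++k) {
            __m256 v = _mm256_loadu_ps(src + 8 * (i * NUM_ACC + k));
            acc[k] = _mm256_add_ps(acc[k], _mm256_mul_ps(x, v));
        }
    }

    /* Combine the accumulators, then sum the eight lanes. */
    __m256 total = acc[0];
    for (int k = 1; k < NUM_ACC; ++k)
        total = _mm256_add_ps(total, acc[k]);

    float lanes[8];
    _mm256_storeu_ps(lanes, total);
    float sum = 0.0f;
    for (int j = 0; j < 8; ++j)
        sum += lanes[j];
    return sum;
}
```

Compiling this with optimizations enabled and inspecting the loop body should show whether x is reloaded from a stack slot or rebroadcast from `[arr]`.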