AVX mem broadcasts are cached on the stack #120015

Open
@KyleSiefring

After exhausting the vector registers inside a loop, clang spills the result of a broadcast to the stack and reloads it on every iteration. This is inefficient, since broadcasting from memory is as fast as loading.

Consider the following pseudocode:

float *restrict arr = ...; // prevent aliasing
loop {
     exhaust vector registers
     __m256 x = _mm256_set1_ps(arr[0]);
     use x
}

When clang compiles this, arr[0] is broadcast once outside the loop, and x is then spilled to the stack:

        vbroadcastss    ymm0, dword ptr [rdx]
        vmovups ymmword ptr [rsp - 72], ymm0
loop:
        ...
        load x from stack
        use x
        jmp loop

The expected behavior is:

loop:
        ...
        vbroadcastss    x, dword ptr [rdx]
        use x
        jmp loop

Obligatory Godbolt Sample: https://godbolt.org/z/v7MYcefxY (Sorry if my method of stressing register allocation results in too much asm/bytecode.)
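
For anyone who wants a compilable starting point rather than the pseudocode above, here is a minimal sketch of the same pattern in C. The names `pressure_kernel`, `NACC`, `in`, and `out` are hypothetical and are not taken from the Godbolt sample. Sixteen live accumulators are assumed to be enough to pressure the 16 ymm registers, but whether this particular kernel actually triggers the spill depends on the clang version and flags, so treat it as an illustration rather than a confirmed reproducer. The `target("avx")` attribute is a GCC/Clang extension, used here only so the file builds without `-mavx`.

```c
#include <immintrin.h>
#include <stddef.h>

/* Assumption: 16 live accumulators to put pressure on the 16 ymm registers. */
#define NACC 16

/* Hypothetical kernel: `in` and `out` must each hold 8 * NACC floats. */
__attribute__((target("avx")))
void pressure_kernel(const float *restrict arr, const float *restrict in,
                     float *restrict out, size_t iters)
{
    __m256 acc[NACC];
    for (int k = 0; k < NACC; ++k)
        acc[k] = _mm256_setzero_ps();

    for (size_t i = 0; i < iters; ++i) {
        /* Loop-invariant broadcast of arr[0]; restrict rules out aliasing. */
        __m256 x = _mm256_set1_ps(arr[0]);
        for (int k = 0; k < NACC; ++k) {
            __m256 v = _mm256_loadu_ps(in + 8 * k);
            acc[k] = _mm256_add_ps(acc[k], _mm256_mul_ps(v, x));
        }
    }

    /* Write every accumulator back so none of them is dead. */
    for (int k = 0; k < NACC; ++k)
        _mm256_storeu_ps(out + 8 * k, acc[k]);
}
```

Compiling with optimizations and inspecting the assembly for a `vbroadcastss` before the loop followed by a stack reload of `x` inside it is one way to check whether a given compiler shows the behavior described in this issue.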
