Aarch64 performance: vld1q_u8 intrinsic can cause single-byte loads #1148

Closed
@hkratz

Description


While adding aarch64 support to simdutf8, I encountered an unexpected eightfold slowdown when hand-unrolling a loop. The slowdown occurred because the compiler suddenly started loading 128-bit uint8x16_t values with single-byte load instructions instead of 128-bit loads.

It turns out that the vld1q_u8 intrinsic is at fault. The code generator thinks it can "optimize" loads by loading bytes individually if a SIMD shuffle instruction follows. According to the ARM docs, this intrinsic should always compile to a single instruction. I fixed it by doing the load similarly to how it is currently done for SSE2.
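For readers without the Godbolt link at hand, here is a rough sketch of what such a workaround could look like. It is not the exact code from the testcase. It assumes the fix is a plain unaligned pointer read of the whole 128-bit value, so that the load is seen as one memory access. The helper names `load_u8x16` and `load_16_bytes` are hypothetical. The second one is only a portable stand-in so the idea can be tried on non-aarch64 machines.

```rust
#[cfg(target_arch = "aarch64")]
use core::arch::aarch64::uint8x16_t;

/// Hypothetical replacement for `vld1q_u8`: read all 16 bytes as one
/// unaligned 128-bit value instead of going through the intrinsic.
///
/// Safety: `ptr` must be valid for reading 16 bytes.
#[cfg(target_arch = "aarch64")]
#[inline]
pub unsafe fn load_u8x16(ptr: *const u8) -> uint8x16_t {
    // Assumption: a single typed read of the whole vector.
    unsafe { core::ptr::read_unaligned(ptr as *const uint8x16_t) }
}

/// Portable stand-in with the same shape, returning a byte array.
/// It only illustrates the pointer read and is not a SIMD load.
///
/// Safety: `ptr` must be valid for reading 16 bytes.
#[inline]
pub unsafe fn load_16_bytes(ptr: *const u8) -> [u8; 16] {
    unsafe { core::ptr::read_unaligned(ptr as *const [u8; 16]) }
}
```

Whether this actually produces a single 128-bit load depends on the compiler version and the surrounding code, so the generated assembly should be checked.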

Testcase and proposed fix on Godbolt

The same issue likely applies to the other vld1q intrinsics.
