Closed
Description
While adding aarch64 support to simdutf8 I encountered an unexpected eightfold slowdown when hand-unrolling a loop. The slowdown was caused by the compiler suddenly loading 128-bit uint8x16_t values with single-byte load instructions instead of 128-bit loads.
It turns out that the vld1q_u8
intrinsic is at fault. The code generator thinks it can "optimize" loads by loading bytes individually if a SIMD shuffle instruction follows. According to the ARM docs, this intrinsic should always be compiled to a single instruction. I fixed it by performing the load similarly to how it is currently done for SSE2.
Testcase and proposed fix on Godbolt
The same issue likely applies to the other vld1q
intrinsics as well.