This repository was archived by the owner on Dec 22, 2021. It is now read-only.
There is no efficient way to represent loading a narrow-type vector and extending it to a wide-type vector, e.g. loading 4 uint16_t values and extending them to a 4 x uint32_t vector. To simulate such an operation with the current API, we'd need to load the values as a 64-bit scalar (potentially spilling into two registers on 32-bit architectures), transfer it to a SIMD register (expensive!), and then use shuffles to move the values into the proper lanes. With the native SIMD ISA it can be implemented far more efficiently.
I've run into this while vectorizing the dav1d video codec (expanding i8 to i16 lanes); shuffles work but feel awkward.
In theory the runtime could detect the shuffle pattern as an interleave/de-interleave and optimize it, I guess, but I'm not sure I want to rely on that.
With the native SIMD ISA, it can be implemented more efficiently:

- `PMOVZXWD xmm, [mem]` on x86 with SSE4.1
- `MOVQ xmm, [mem]` + `PXOR xmm0, xmm0` + `PUNPCKLWD xmm, xmm0` on x86 with SSE2
- `VLD1.16 {dX}, [rAddr]` + `VMOVL.U16 qX, dX` on ARMv7 + NEON
- `LD1 {Vx.4H}, [xAddr]` + `UXTL Vx.4S, Vx.4H` on ARM64