This repository was archived by the owner on Dec 22, 2021. It is now read-only.
There is no efficient way to represent loading a narrow-type vector and extending it to a wide-type vector, e.g. loading 4 uint16_t values and extending them to a 4 x uint32_t vector. To simulate such an operation with the current API, we'd need to load the values as a 64-bit scalar (potentially spilling into two registers on 32-bit architectures), transfer it to a SIMD register (expensive!), and then use shuffles to move the values into the proper lanes. With the native SIMD ISA it can be implemented far more efficiently.
I've run into this while vectorizing the dav1d video codec (expanding i8 to i16 lanes); shuffles work but feel awkward.
In theory the runtime could detect the shuffle pattern as an interleave/de-interleave and optimize it, I guess, but I'm not sure I want to rely on that.
With the native SIMD ISA, it can be implemented more efficiently:

- `PMOVZXWD xmm, [mem]` on x86 with SSE4.1
- `MOVQ xmm, [mem]` + `PXOR xmm0, xmm0` + `PUNPCKLWD xmm, xmm0` on x86 with SSE2
- `VLD1.16 {dX}, [rAddr]` + `VMOVL.U16 qX, dX` on ARMv7 + NEON
- `LD1 {Vx.4H}, [xAddr]` + `UXTL Vx.4S, Vx.4H` on ARM64