This repository was archived by the owner on Dec 22, 2021. It is now read-only.

Load with extension operation #23

Closed
Maratyszcza opened this issue Nov 5, 2017 · 2 comments

Comments

@Maratyszcza
Contributor

There is no efficient way to express loading a narrow-type vector with extension to a wide-type vector, e.g. loading 4 uint16_t values and extending them to a 4 x uint32_t vector. To simulate such an operation with the current API, we'd need to load the values as a 64-bit scalar (potentially spilling to two registers on 32-bit architectures), transfer it to a SIMD register (expensive!), and then use shuffles to move the lanes into their proper places. With a native SIMD ISA, it can be implemented more efficiently:

  • PMOVZXWD xmm, [mem] on x86 with SSE4.1
  • MOVQ xmm, [mem] + PXOR xmm0, xmm0 + PUNPCKLWD xmm, xmm0 on SSE2
  • VLD1.16 {dX}, [rAddr] + VMOVL.U16 qX, dX on ARMv7+NEON
  • LD1 {Vx.4H}, [xAddr] + UXTL Vx.4S, Vx.4H on ARM64
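
As a rough illustration of the SSE2 sequence listed above, here is a minimal C sketch using the standard SSE2 intrinsics from `<emmintrin.h>`. The helper name `load4_u16_extend_u32` is hypothetical and is not part of any proposed API; the scalar fallback is an assumption added only so the sketch compiles on non-x86 targets.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h> /* SSE2 intrinsics */
#define HAVE_SSE2_PATH 1
#else
#define HAVE_SSE2_PATH 0
#endif

/* Hypothetical helper (illustrative name only): loads 4 x uint16_t from
 * src and zero-extends them into 4 x uint32_t written to dst. */
static void load4_u16_extend_u32(const uint16_t *src, uint32_t dst[4]) {
    assert(src != NULL && dst != NULL);
#if HAVE_SSE2_PATH
    /* MOVQ xmm, [mem]: load 64 bits into the low half, zero the high half. */
    __m128i narrow = _mm_loadl_epi64((const __m128i *)src);
    /* PXOR xmm0, xmm0: materialize an all-zero register. */
    __m128i zero = _mm_setzero_si128();
    /* PUNPCKLWD xmm, xmm0: interleave the low 16-bit lanes with zeros,
     * which zero-extends each lane to 32 bits (little-endian lane layout). */
    __m128i wide = _mm_unpacklo_epi16(narrow, zero);
    _mm_storeu_si128((__m128i *)dst, wide);
#else
    /* Portable fallback (assumption, not from the issue): scalar widening. */
    for (size_t i = 0; i < 4; i++) {
        dst[i] = (uint32_t)src[i];
    }
#endif
}
```

With SSE4.1 available, the load and the unpack would presumably collapse into a single `PMOVZXWD` with a memory operand, which is the kind of fused operation this issue asks for.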
@bvibber

bvibber commented Mar 4, 2019

I've run into this while vectorizing the dav1d video codec (expanding i8 lanes to i16); shuffles work but seem awkward.

In theory the runtime could detect the shuffle pattern as an interleave/de-interleave and optimize it, I guess, but I'm not sure I want to rely on that.

@dtig
Member

dtig commented Sep 13, 2019

Closing as #98 is merged.

@dtig closed this as completed Sep 13, 2019