Alternative to Swizzle / Shuffle #8
Description
These operations will be very difficult to implement efficiently.
Even on Intel (SSE, AVX), which has native shuffle instructions, the code generation for this is very complex. The Intel instruction selection lowering in LLVM contains quite a lot of code just to implement the shuffle intrinsic.
The ARM version is worse, since many cases require use of 2 'vtbl' instructions, in addition to materializing the two constant shuffle masks in d-registers. LLVM uses a pre-generated table of 26K entries to generate fast instruction sequences for the 32x4 shuffles.
This is a lot of complexity for any compiler, and too much for WebAssembly translators. Most of these swizzles and shuffles will never be used. It is a hazard to provide this feature if we can't guarantee that all shuffles will be fast on all platforms.
An alternative is to implement a small set of primitive permutations that we know can be implemented efficiently, without lots of work in the translator. I think these should cover most real-world cases, and can be composed by the programmer or compiler for other shuffles. By being similar to a real ISA, it should also be straightforward to modify toolchains to support WASM SIMD.
I'm recommending a set along the lines of the ARM permutation instructions:
- Interleave(low, high) (merge elements from two source vectors into a single vector, with low and high modifiers so we only have a single result vector.)
- De-interleave(low, high) (inverse shuffle from interleave)
- Transpose(low, high) (swap even elements from first source with odd elements from second source, low and high modifiers.)
- Concatenate(k) (concatenate two source vectors, top k bytes from first, bottom 16-k bytes from second, 0 < k < 16, AKA "slide" or "window" shuffle.)
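To make the intended lane movement concrete, here is a rough reference model of the four primitives in Python, operating on plain lists of lanes. This is only a sketch: the function names, the `high` flag, and the convention that index 0 is the "bottom" lane are my assumptions, and the semantics are modelled loosely on the ARM zip/unzip/transpose/extract family rather than on any agreed WebAssembly definition.

```python
# Reference model of the proposed primitive permutations.
# Assumptions (not from any spec): vectors are Python lists of lanes,
# index 0 is the lowest lane, and all names here are hypothetical.

def interleave(a, b, high=False):
    """Merge lanes of a and b pairwise, from the low or the high half."""
    half = len(a) // 2
    start = half if high else 0
    out = []
    for i in range(start, start + half):
        out += [a[i], b[i]]
    return out

def deinterleave(a, b, high=False):
    """Inverse of interleave: even (low) or odd (high) lanes of a followed by b."""
    return (a + b)[(1 if high else 0)::2]

def transpose(a, b, high=False):
    """Pair up the even (low) or odd (high) lanes of a and b."""
    out = []
    for i in range(1 if high else 0, len(a), 2):
        out += [a[i], b[i]]
    return out

def concatenate(a, b, k):
    """Top k lanes of a followed by the bottom len(a)-k lanes of b."""
    n = len(a)
    assert 0 < k < n
    return a[n - k:] + b[:n - k]

def transpose_4x4(r0, r1, r2, r3):
    """Example of composition: a 4x4 matrix transpose from eight interleaves."""
    t0 = interleave(r0, r2)
    t1 = interleave(r1, r3)
    t2 = interleave(r0, r2, high=True)
    t3 = interleave(r1, r3, high=True)
    return [
        interleave(t0, t1),
        interleave(t0, t1, high=True),
        interleave(t2, t3),
        interleave(t2, t3, high=True),
    ]
```

With `a = [0, 1, 2, 3]` and `b = [4, 5, 6, 7]`, `interleave(a, b)` gives `[0, 4, 1, 5]` and `deinterleave(a, b)` gives `[0, 2, 4, 6]`. The `transpose_4x4` helper is meant to illustrate the "composed by the programmer or compiler" point: a common shuffle-heavy kernel falls out of the primitives without needing a general shuffle.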
Additionally, we may want to have shuffles that reverse the lanes in various patterns like the ARM vrev instructions.
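As a rough illustration of what "reverse the lanes in various patterns" could mean, here is a small Python model that reverses lanes within fixed-size groups, in the spirit of the ARM vrev family. The function name and the `group` parameter are hypothetical, chosen only for this sketch.

```python
# Sketch of a grouped lane reversal (hypothetical name and signature).
# Assumption: the vector is a Python list and its length is a multiple
# of the group size.

def reverse_in_groups(lanes, group):
    """Reverse the order of lanes inside each consecutive group."""
    assert group > 0 and len(lanes) % group == 0
    out = []
    for start in range(0, len(lanes), group):
        out.extend(reversed(lanes[start:start + group]))
    return out
```

For example, `reverse_in_groups([0, 1, 2, 3], 2)` swaps neighbouring lanes, and applying it to 16 byte lanes with `group=4` models a byte swap within each 32-bit element.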
On Intel these can be implemented using the pshuf instructions. We're assuming SSE 4.1 as a baseline for SIMD support right now. POWER and MIPS have similar primitive shuffles.