Do The SIMD Shuffle #11
Comments
As I mentioned before, I'm OK with this particular API being unstable a little longer if that's what it takes, rather than having two ways to do such a basic thing. I recognize the importance of the shuffle API, and that's exactly why I would want it to be transparently recognizable even if that means it comes a little later. But also, even a delayed schedule will probably not hold this up forever after our initial stabilizations.
Yeah, a person can do plenty of useful work without shuffles. It's "necessary eventually", but not "required immediately".
Note that some ISAs also support non-const shuffles. Some of those ISAs require the vector to be in memory (x86 --
Dynamic shuffles will probably have to be part of the final API, but we probably will want to provide an additional function rather than attempting to do implicit const folding on a single shuffle function, so that our API obeys the principle of least surprise: if you use a const shuffle, you get a const shuffle. If you use a dynamic shuffle, you usually get a dynamic shuffle (and sometimes you get a treat when it is possible to const-fold it despite what you thought and your code runs faster than you expected).
For some problems, dynamic shuffles are absolutely crucial and hard to replace, so it would definitely be nice to have them sooner rather than later. One common use case worth mentioning is when the shuffled argument is a compile-time constant but the index is computed at runtime, so it acts like a lookup table of sorts (but that will always be a dynamic shuffle from the SIMD perspective). Having two separate shuffle APIs (const and dynamic) is definitely a good idea; if your algorithm relies on the shuffle being const, why not spell it out explicitly? It would also be nice if the two shuffle APIs looked closer to each other than they do in packed_simd, so they would be more discoverable and easier to grok.
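To make the lookup-table use case concrete, here is a minimal sketch in plain Rust, with arrays standing in for SIMD vectors. The names `swizzle_dyn`, `HEX`, and `nibbles_to_hex` are hypothetical and not part of any proposed API, and returning 0 for out-of-range indices is an assumption made for this sketch only.

```rust
/// Dynamic byte shuffle over 16 lanes, modeled with plain arrays.
/// Output lane `i` is `table[idx[i]]`.
/// Assumption for this sketch: an out-of-range index yields 0.
fn swizzle_dyn(table: [u8; 16], idx: [u8; 16]) -> [u8; 16] {
    let mut out = [0u8; 16];
    for i in 0..16 {
        let j = idx[i] as usize;
        if j < 16 {
            out[i] = table[j];
        }
    }
    out
}

/// The table is a compile-time constant...
const HEX: [u8; 16] = *b"0123456789abcdef";

/// ...but the indices are only known at runtime, so this is still a
/// dynamic shuffle: each nibble (0..=15) selects its ASCII hex digit.
fn nibbles_to_hex(nibbles: [u8; 16]) -> [u8; 16] {
    swizzle_dyn(HEX, nibbles)
}
```

With this model, passing nibbles `0xd, 0xe, 0xa, 0xd` in the first four lanes produces the bytes of `"dead"` in the first four output lanes.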
One possibility we discussed is using const generics for the const shuffles (hinted at in some of the comments above). I'm not sure how close that would look to the dynamic API but I imagine it would be better than packed_simd's macros.
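For illustration, a const-generic shuffle could look roughly like the sketch below. It uses a plain array in place of a SIMD vector and one `const usize` parameter per output lane; `shuffle_const` is a hypothetical name, and this is not the API under discussion, just one shape that plain integer const generics allow.

```rust
/// Const shuffle over 4 lanes, modeled with a plain array.
/// The lane indices are const generic parameters, so they are fixed
/// at compile time for each instantiation of the function.
/// Note: this sketch's signature does not reject out-of-range indices.
fn shuffle_const<const I0: usize, const I1: usize, const I2: usize, const I3: usize>(
    v: [i32; 4],
) -> [i32; 4] {
    [v[I0], v[I1], v[I2], v[I3]]
}
```

A call site spells the pattern out in the turbofish, for example `shuffle_const::<3, 2, 1, 0>(v)` to reverse the lanes, which makes it visible that the shuffle is const.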
@aldanor I am curious what you mean by "closer"? Do you mean using a more similar name? I was under the impression that choosing For this case, it's looking like const-generic shuffles will be unpretty (especially at first) but viable.
An initial implementation was landed in #62 but we are considering other possible APIs in addition to that one, or future modifications of that one, including ones that are both more and less flexible.
I looked into dynamic shuffles a little bit more and found WebAssembly/simd#68, which sums up architecture support pretty well. The bottom line is that many (most?) architectures support some sort of dynamic byte shuffle, so there isn't any good reason for us not to support it. The only limiting factor is LLVM not exposing a dynamic shuffle instruction. Wasm SIMD does support dynamic shuffles, so LLVM is generating them somehow; it's just a matter of exposing that as an instruction.
This is great. I'm used to dynamic shuffles from the PowerPC/SPU architecture (vperm). They are very powerful tools, especially
See Gather -- an actual SIMD table lookup.
Sadly, gather is too slow to use for this purpose.
https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=gather&expand=2983 Performance
Reading more into wasm, I think dynamic shuffles were drafted but didn't have much momentum. So I don't think it ever made it into a spec and there isn't any compiler support for generic dynamic shuffles.
Haswell Gather was terribly naive, essentially a stub instruction that allowed saving code size but not speed. It has improved in speed significantly on x86 since even Broadwell, so it's now reasonable to use on Lake and Zen architectures. Of course, I still wouldn't recommend it when a load and shuffle would do.
@calebzulawski Byte-wise "swizzle" is supported, see https://doc.rust-lang.org/core/arch/wasm32/fn.i8x16_swizzle.html
Chiming in: dynamic swizzle is definitely still missing from portable simd.
A "shuffle", in SIMD terms, takes a SIMD vector (or possibly two vectors) and a pattern of source lane indexes (usually as an immediate), and then produces a new SIMD vector where the output is the source lane values in the pattern given.
Example (pseudo-code):
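A minimal sketch of these semantics in plain Rust, with arrays standing in for SIMD vectors. The names `shuffle1` and `shuffle2` are hypothetical, and the indices are ordinary runtime arguments here purely to show the lane selection; a real SIMD shuffle would typically take them as an immediate.

```rust
/// One-input shuffle: output lane `i` is `v[idx[i]]`.
/// Plain arrays stand in for SIMD vectors in this sketch.
fn shuffle1<const N: usize>(v: [i32; N], idx: [usize; N]) -> [i32; N] {
    let mut out = [0; N];
    for i in 0..N {
        out[i] = v[idx[i]]; // panics if an index is out of range
    }
    out
}

/// Two-input shuffle: indices 0..N select lanes from `a`,
/// indices N..2N select lanes from `b`.
fn shuffle2<const N: usize>(a: [i32; N], b: [i32; N], idx: [usize; N]) -> [i32; N] {
    let mut out = [0; N];
    for i in 0..N {
        let j = idx[i];
        out[i] = if j < N { a[j] } else { b[j - N] };
    }
    out
}
```

For instance, `shuffle1([5, 6, 7, 8], [2, 2, 1, 0])` gives `[7, 7, 6, 5]`, and `shuffle2([5, 6, 7, 8], [9, 10, 11, 12], [0, 1, 6, 7])` gives `[5, 6, 11, 12]`.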
Shuffles are very important for particular SIMD tasks, but the requirement that the input be a compile time constant complicates the API:
Still, min_const_generics is slated to be stable by the end of the year, and most likely that will be enough to do shuffle basics on stable.