Skip to content

Do The SIMD Shuffle #11

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
Lokathor opened this issue Sep 28, 2020 · 17 comments
Open

Do The SIMD Shuffle #11

Lokathor opened this issue Sep 28, 2020 · 17 comments

Comments

@Lokathor
Copy link
Contributor

A "shuffle", in SIMD terms, takes a SIMD vector (or possibly two vectors) and a pattern of source lane indexes (usually as an immediate), and then produces a new SIMD vector where the output is the source lane values in the pattern given.

Example (pseudo-code):

let a = simd(5, 6, 7, 8);
assert_eq!(shuffle(a, [2, 2, 1, 0]), simd(7, 7, 6, 5));

Shuffles are very important for particular SIMD tasks, but the requirement that the input be a compile time constant complicates the API:

  • Rust doesn't have a stable way for users to express this, so they never expect it.
  • Rather than just having a const-required arg (which totally works) a lot of people have said "you should use const generics for that!" (which actually doesn't work as well on current rust).

Still, min_const_generics is aimed to be stable by the end of the year and most likely that'll be enough to do shuffle basics on stable.

@workingjubilee
Copy link
Member

workingjubilee commented Sep 28, 2020

As I mentioned before, I'm OK with this particularly API being unstable a little longer if that's what it takes, rather than having two ways to do such a basic thing. I recognize the importance of the shuffle API, and also that's exactly why I would want it to be transparently recognizable even if that means it comes a little later. But also, even a delayed schedule will probably not hold this up forever after our initial stabilizations.

@Lokathor
Copy link
Contributor Author

Lokathor commented Sep 28, 2020

Yeah, a person can do plenty of useful work without shuffles.

It's "necessary eventually", but not "required immediately".

@programmerjake
Copy link
Member

Note that some ISAs also support non-const shuffles. Some of those ISAs require the vector to be in memory (x86 -- VPGATHERDD), others also support non-const shuffles just in registers (RISC-V V extension; Libre-SOC's SimpleV; probably more).

@workingjubilee
Copy link
Member

Dynamic shuffles will probably have to be part of the final API, but we probably will want to provide an additional function rather than attempting to do implicit const folding on a single shuffle function, so that our API obeys the principle of least surprise: if you use a const shuffle, you get a const shuffle. If you use a dynamic shuffle, you usually get a dynamic shuffle (and sometimes you get a treat when it is possible to const-fold it despite what you thought and your code runs faster than you expected).

@aldanor
Copy link

aldanor commented Dec 25, 2020

For some problems, dynamic shuffles are absolutely crucial and hard to replace, so it would be definitely nice to have them rather sooner than later. One common use case to mention - is when in the shuffle the argument is compile-time const but the index is runtime, so it acts like a lookup table of sorts (but that will always be a dynamic shuffle from simd perspective).

Having two separate shuffle apis (const and dynamic) is definitely a good idea; if your algorithm relies on the shuffle being const, why not spell it out explicitly. It would also be nice if the two shuffle APIs looked closer to each other than they do in packed_simd so it would be more discoverable and easier to grok.

@calebzulawski
Copy link
Member

calebzulawski commented Dec 25, 2020

One possibility we discussed is using const generics for the const shuffles (hinted at in some of the comments above). I'm not sure how close that would look to the dynamic API but I imagine it would be better than packed_simd's macros.

@workingjubilee
Copy link
Member

workingjubilee commented Feb 3, 2021

@aldanor I am curious what you mean by "closer"? Do you mean using a more similar name? I was under the impression that choosing shuffle_dyn as a name was fine. If you mean the fact that one's a macro, yeah, as Caleb said, that's definitely something we're going to try to avoid in the public API. Ideally both should be functions that are implemented on the same struct or trait, and it will be very obvious which you are using because one will have its argument in spikes (<>) and the other will have it as a regular function parameter.

For this case, it's looking like const-generic shuffles will be unpretty (especially at first) but viable.

@workingjubilee
Copy link
Member

An initial implementation was landed in #62 but we are considering other possible APIs in addition to that one, or future modifications of that one, including ones that are both more and less flexible.

@calebzulawski
Copy link
Member

I looked into dynamic shuffles a little bit more and found WebAssembly/simd#68 which sums up architecture support pretty well, the bottom line is that many (most?) architectures support some sort of dynamic byte shuffle so there isn't any good reason for us not to support it. The only limiting factor is LLVM not exposing a dynamic shuffle instruction.

Wasm SIMD does support dynamic shuffles, so LLVM is generating them somehow, it's just a matter of exposing that as an instruction.

@andy-thomason
Copy link

This is great. I'm used to dynamic shuffles from the PowerPC/SPU architecture (vperm). They are very powerful tools, especially
for doing parsing with SIMD used for character classification. They are as close as we get to a table lookup in SIMD.

@programmerjake
Copy link
Member

They are as close as we get to a table lookup in SIMD.

See Gather -- an actual SIMD table lookup.

@andy-thomason
Copy link

Sadly, gather is too slow to use for this purpose.

@andy-thomason
Copy link

https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=gather&expand=2983

Performance
Architecture Latency Throughput (CPI)
Broadwell ~19 ~6
Haswell ~20 ~10

@calebzulawski
Copy link
Member

Reading more into wasm, I think dynamic shuffles were drafted but didn't have much momentum. So I don't think it ever made it into a spec and there isn't any compiler support for generic dynamic shuffles.

@workingjubilee
Copy link
Member

Haswell Gather was terribly naive, essentially a stub instruction that allowed saving code size but not speed. It has improved in speed significantly on x86 since even Broadwell, so it's now reasonable to use on Lake and Zen architectures. But I of course wouldn't recommend it when a load and shuffle would do, however.

@hkratz
Copy link
Contributor

hkratz commented Sep 22, 2021

Reading more into wasm, I think dynamic shuffles were drafted but didn't have much momentum. So I don't think it ever made it into a spec and there isn't any compiler support for generic dynamic shuffles.

@calebzulawski Byte-wise "swizzle" is supported, see https://doc.rust-lang.org/core/arch/wasm32/fn.i8x16_swizzle.html

@Ygg01
Copy link

Ygg01 commented Jul 5, 2024

Chiming in, that dynamic swizzle is definitely still missing from portable simd.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

8 participants