Wasm SIMD intrinsics #8559
Conversation
TODO: truncation, conversion, and shufflevector
The names and type signatures of all of these intrinsics are entirely up for discussion. Nothing is set in stone. Comments welcome.
system/include/simd128.h (outdated diff)
 * WebAssembly SIMD128 Intrinsics
 */

#include <stdint.h>
Is it bad to have this header file include other header files? We could use the same types as the builtins (vector of char, short, int, etc.) instead of using the C99 stdint types. This would be slightly less nice from a typing point of view, but would be simpler and mostly invisible to users.
It is probably OK to include other headers; I see a few system headers in Clang's sources that include stdint.h, and quite a few also include each other. On the other hand, converting between external and internal types might not be necessary, as it would work without it, and it is generally a good idea to reduce the amount of C-style casting :)
Unfortunately I was getting type errors when trying to use the user-facing integer vector types with the builtin functions, so those casts are necessary. They have no runtime impact, though.
Yes, that makes sense. What I was thinking is that the user-facing type would be the same as the internal one, so casts would not be needed. I don't feel strongly either way, though.
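To make the casting question concrete, here is a minimal sketch of the pattern under discussion (the typedef and function names are illustrative, not necessarily the header's actual ones): the public type uses C99 stdint lane types, each intrinsic casts to an internal lane type, operates, and casts back, and the casts are pure reinterpretations with no runtime cost.

#include <stdint.h>

typedef int8_t i8x16   __attribute__((__vector_size__(16)));  // public, stdint-based type
typedef int8_t __i8x16 __attribute__((__vector_size__(16)));  // internal lane type

static __inline__ i8x16 demo_i8x16_add(i8x16 a, i8x16 b) {
  // Reinterpret, add lane-wise, reinterpret back; no code is generated for the casts.
  return (i8x16)((__i8x16)a + (__i8x16)b);
}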
The main header file here isn't specific to Emscripten; would it make sense for it to live in upstream clang, in lib/Headers, instead of in Emscripten? That way it could be used by the wasm32-wasi and wasm32-unknown-unknown targets as well. Also, keeping it in clang is similar to what other targets do, such as altivec.h for Power and xmmintrin.h for x86.
@sunfishcode Yes, I agree that this should ultimately be in clang. The reason I am trying to get this into emscripten first is so that it can be developed alongside executable tests.
In WebAssembly/tool-conventions#108 @Maratyszcza suggested that …
system/include/simd128.h (outdated diff)
// wasm_v128_store(v128 *mem, v128 a)
static __inline__ void __DEFAULT_FN_ATTRS wasm_v128_store(v128* mem, v128 a) {
  *mem = a;
}
How do unaligned loads and stores work?
Also, if someone does, e.g.,
float* ptr = /*...not aligned to a v128 boundary...*/;
v128 val = wasm_v128_load((v128*)ptr); // UB
the behavior is undefined, because wasm_v128_load will perform an aligned load on an unaligned address. Given that the WASM instruction does support unaligned loads just fine, this feels like a footgun.
It may be worth calling this out, at least in a documentation comment, and maybe also, e.g., by using _aligned/_unaligned suffixes to make that clearer.
Currently they do not. The underlying WebAssembly architecture does support unaligned loads and stores, of course, but using them may have silent and very large performance problems on some platforms, so we definitely don't want to encourage their use. Given that, I wonder how important it is to allow for unaligned loads and stores in this API.
I think it is reasonable to match the builtin, and document this somehow, unless you are going to change the builtins before merging this in.
By the way, what happens with the alignment hint in practice? Do some implementations do something different for different alignment (aside from checking it)?
@penzn Are implementations required to check the alignment?
It does not make sense that one cannot pass alignments smaller than the natural alignment in the hint. That has to actually be one of the main reasons for there to be a hint.
@tlively, the problem isn't that unaligned accesses are "slow". The problem is that unaligned accesses that have an incorrect alignment hint, e.g., one indicating that they are aligned when they are not, can be very slow.
These intrinsics prevent the slowness by never letting users do an unaligned access. Rust (e.g. packed_simd) prevents the slowness by passing a correct alignment hint.
I think what happens is that memarg passes an alignment hint, and the WASM SIMD spec states that for the loads and stores, that's the natural alignment of v128, which is 16.
Yep!
The "front-end" (e.g. emscripten in this case) always has to pass 16 as the alignment hint for
v128.load/store
, even if it knows that the alignment could be less (e.g. if we were to add an_unaligned
intrinsic here).
Not quite. "Normal" loads and stores will get the natural alignment (16 in this case), but if you explicitly mark at the source level that some other alignment could be used then that will be reflected in the LLVM IR and respected in the alignment hints generated by the wasm backend.
This results in the WASM machine code generator having no idea that a load could be unaligned, and therefore not being able to use an unaligned load instruction, which would be much faster than doing an aligned load, trapping, recovering, etc.
Since the alignment hints are only hints, engines have to be able to handle any load being potentially unaligned. But they might handle unaligned addresses in the generated code if the alignment hint says a non-natural alignment will be used, and handle them only in trap handlers (which would be faster in the common case) if the hint says natural alignment will be used.
Or is the alignment hint allowed to specify an alignment lower than the natural alignment?
Yes it is.
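As an illustration of "marking the alignment at the source level" mentioned above (a sketch with made-up names, not code from this PR): declaring a second vector type with 1-byte alignment makes clang record alignment 1 in the IR, so the wasm backend can emit the load with an alignment hint of 1 instead of 16.

#include <stdint.h>

typedef int32_t v128 __attribute__((__vector_size__(16)));
// Same 128-bit shape, but declared as 1-byte aligned (and may_alias so it can
// safely read bytes belonging to objects of other types).
typedef int32_t v128_align1
    __attribute__((__vector_size__(16), __may_alias__, __aligned__(1)));

static __inline__ v128 load_with_align1_hint(const void *mem) {
  // Valid for any address; the wasm backend sees alignment 1 on this load.
  return (v128)*(const v128_align1 *)mem;
}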
+1 for having wasm_v128_load() and wasm_v128_loadu() for 16-byte aligned and 1-byte aligned (unaligned) accesses, or carrying the alignment hint (1-byte/4-byte/16-byte) via an __attribute__. Unaligned loads and stores are very important from developers' point of view; there are times when one has to pack/unpack SIMD data to save memory, which equates to needing to do unaligned ops. One common use case is fast loading/saving of files, where one does not want to carry any padding bytes around.
@juj we currently have wasm_v128_load do unaligned loads, and if users want aligned loads they can just dereference a v128_t*. Do you think that is sufficient, or would you like to see separate functions for aligned and unaligned accesses? If the latter, can you explain more about why that would be better?
That sounds good!
They are important. A common use case is doing SIMD on a slice of an …
The elements are typically aligned, e.g. at 4-byte boundaries for f32x4, …
The current load/store intrinsics here are just wrappers around plain …
For an example of how to arrange for unaligned accesses, see the use of the …
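One common way to arrange for an unaligned vector access in C (a sketch with hypothetical names; not necessarily the code the comment above refers to) is to load through a packed, may_alias wrapper struct, which drops the access's assumed alignment to 1:

#include <stdint.h>

typedef int32_t v128 __attribute__((__vector_size__(16)));

static __inline__ v128 demo_v128_load_unaligned(const void *mem) {
  // packed reduces the member's alignment requirement to 1, so this load is
  // valid for any address and carries an alignment hint of 1.
  struct __demo_unaligned_load {
    v128 __v;
  } __attribute__((__packed__, __may_alias__));
  return ((const struct __demo_unaligned_load *)mem)->__v;
}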
Thanks, @sunfishcode! That's helpful. I will generalize the load and store intrinsics to enable unaligned loads and stores. In other news, clang is being changed to disable int-to-float and float-to-int vector conversions by default. This will force the user to either change their build flags or add explicit casts wherever they want to reinterpret vectors in this way. Does this change the calculus of whether we want to provide a type for each possible vector interpretation or provide just a single v128 type?
system/include/simd128.h (outdated diff)
#define __DEFAULT_FN_ATTRS __attribute__((__always_inline__, __nodebug__, __target__("simd128"), __min_vector_width__(128)))

// v128 wasm_v128_load(void* mem)
static __inline__ v128 __DEFAULT_FN_ATTRS wasm_v128_load(void* __mem) {
should be const void* __mem to allow loads via const pointers
Done, thanks!
FWIW I still personally feel that choosing a single …
The …
@gnzlbg yes. @alexcrichton I could get on board with just exposing one vector type. @rrwinterton, do you have thoughts on that?
@tlively I think one goal of the headers is to allow code to fully use the WASM SIMD spec, so the headers must allow doing everything the spec allows (e.g. unaligned loads). Another goal of the headers is to allow people to write "portable" code. Ideally, this "C" header would become part of the WASM SIMD spec "somehow" (e.g. as an appendix) and toolchains would be encouraged to implement this header "as is".

Different toolchains have different levels of C extension support (e.g. when it comes to portable packed SIMD types like i8x16). C toolchains might also want to expose the header from C++, or via other frontends like D, Nim, Rust, etc. with varying levels of C FFI support. From this POV, I think it makes sense to keep the header as simple and close to standard C as possible.

The current PR uses clang's portable packed SIMD vector types in the APIs in these headers directly. This makes these types public, and makes the API of these types part of the API of the headers as well. This means that other toolchains not only need to have types with these names, but these types need to support all operations that they support in clang (e.g. overloaded …). Using an opaque …

Those wanting to expose a portable packed vector type library on top of this header can do so in a library. That library can be "clang-specific" or "gcc-specific" or can work around toolchain differences using macros.
There is value in exposing compiler-independent types; it would make it easier for different compilers to implement the same interface (though we only have one C/C++ toolchain at the moment). On the other hand, definitions in those headers can be somewhat different between different compilers, since those have different builtin support; see, for example, …
Right. The interface this header exposes should be standardized, but its implementation is necessarily clang-specific. The question is whether we should expose a single … It sounds like most people lean toward having just a single type. Is there anyone who thinks exposing multiple types as the current PR does is valuable?
I think the issue I am raising might be being misunderstood. Right now, because of how …
In fact, what should …
I think that, at least initially (we can always relax this later), we should not only limit the API to …
How each compiler decides to implement that is up to them, but otherwise we are kind of making "whatever methods clang happens to support today for …"
Maybe in other words: the API exposed by these headers should be the same for all compilers. I agree that the implementation details will necessarily be different, but this implementation, which is the reference implementation, is making the details of this compiler public and part of the exported API, and that appears to be more accidental than by design.
My apologies if I misunderstood the comment. Are you suggesting making … ? For reference, GCC and Clang define __m128 differently:

// GCC
typedef float __m128 __attribute__ ((__vector_size__ (16), __may_alias__));
// Clang
typedef float __m128 __attribute__((__vector_size__(16), __aligned__(16)));

We can try to expose a reference header with only the prototypes and then include it from a compiler-specific header.

Sorry, a separate question: for the use of alignment which @sunfishcode pointed out, should there be alignment-specific versions of loads and stores?
@gnzlbg Thanks for clarifying, your concern makes sense to me now. I think it's not an issue that the …
@penzn Alignment-specific loads and stores might make sense, given that WebAssembly can express them. We now have a way to do naturally-aligned loads and stores (dereferencing a …).
@tlively For completeness, another option is to wrap the types in a struct:

typedef char __v128 __attribute__((__vector_size__(16), __aligned__(16)));
typedef struct {
  __v128 __raw;
} v128;

This effectively makes the contents private, because accessing fields that start with __ is off-limits for user code (such names are reserved for the implementation).
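Building on the typedefs in the snippet above, a hypothetical sketch of how an intrinsic could be written against the struct-wrapped type, so that __raw stays an implementation detail of the header:

typedef signed char __demo_i8x16 __attribute__((__vector_size__(16)));

static __inline__ v128 demo_i8x16_add(v128 a, v128 b) {
  // Unwrap, operate on a lane-typed view, and re-wrap; user code never
  // touches the __raw member directly.
  v128 r;
  r.__raw = (__v128)((__demo_i8x16)a.__raw + (__demo_i8x16)b.__raw);
  return r;
}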
system/include/simd128.h (outdated diff)

// v128_t wasm_i8x16_neg(v128_t a)
static __inline__ v128_t __DEFAULT_FN_ATTRS wasm_i8x16_neg(v128_t a) {
  return (v128_t)(-(__i8x16)a);
}
It is better to cast to __u8x16. Signed overflow is UB in C/C++, and this applies to SIMD too.
Thanks, good catch!
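For concreteness, the suggested fix applied to the snippet above would look roughly like this (a sketch; the typedefs are spelled out here so the example stands alone):

#include <stdint.h>

typedef int32_t v128_t  __attribute__((__vector_size__(16)));
typedef uint8_t __u8x16 __attribute__((__vector_size__(16)));

static __inline__ v128_t demo_i8x16_neg(v128_t a) {
  // Negate via the unsigned lane type, where wrap-around is well-defined,
  // then reinterpret back to the public type.
  return (v128_t)(-(__u8x16)a);
}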
system/include/simd128.h (outdated diff)

// v128_t wasm_i8x16_mul(v128_t a, v128_t b)
static __inline__ v128_t __DEFAULT_FN_ATTRS wasm_i8x16_mul(v128_t a, v128_t b) {
  return (v128_t)((__i8x16)a * (__i8x16)b);
}
Cast to __u8x16 to avoid UB.
Any chance there would exist a concise summary somewhere of how the wasm simd API differs from the earlier SIMD.js API? Also, has there been further charting as to how existing SSE* and NEON code would port over and target hardware SSE* and NEON instructions? E.g. which instructions will be unavailable on both architectures?
@juj Unfortunately I do not know of any such documentation, but you can find a concise list of available wasm SIMD instructions here. The SIMD proposal is a direct successor to SIMD.js, so it should be largely similar. I don't think anyone has done a detailed analysis of how much of SSE* or NEON will be missing from the current SIMD proposal either; it will be quite a lot. We've generally deferred ISA-specific considerations to a future SIMD v2 that does not exist yet.
- fix copy-paste error where intrinsics weren't used in test fns
- fix some inconsistent return types
- change some names and comments
Note that they do not enforce that their arguments are constant because there is no way to do that for floats. Builtin functions could be used to enforce this requirement for ints, but without float support that would be a half-baked solution.
Along with https://reviews.llvm.org/D63615, this fixes the issues with i64x2 shifts.
Since we can't enforce that both integer and float …
IIUC the issue seems to be that clang has support for adding intrinsics that take integer arguments and is able to require these integer arguments to be constant expressions, but it does not have support for allowing those builtins to take floating-point arguments that are required to be constant expressions. Is that correct?
If non-constant values are passed it would emit a bunch of replace lane instructions, right? I think it is probably better to rename the intrinsics; for example, one of the SSE analogs is called … Not very high priority, but this would help if any other toolchains that lack watertight constant detection decide to adopt this header as well.
@gnzlbg Yes, that is the situation. The capability to require constant floating-point arguments could of course be added to clang, but that seems out of scope for our particular effort here.
I think one should separate "clang limitations" from the "specification" of which headers implementations should provide. The problem of having to add multiple constructors is one that all implementations that support C are going to run into, so I think it makes sense to add them to the header "spec", requiring there that the arguments to the constructors must be constant expressions. This PR with the clang implementation should enforce that for the constructors taking integers, but it cannot enforce that for the constructors taking floats due to a limitation in clang. That's ok, I agree that this is not the place to fix that, so maybe this can be documented as a "bug" here, and that's it?
I agree with @gnzlbg. The intrinsic specification must provide a reliable way to generate … Probably we could work around the limitation on constantness specification in Clang intrinsic arguments using …
I didn't know about …
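For reference, a sketch of how __builtin_constant_p can be combined with clang's diagnose_if attribute to reject non-constant arguments at compile time (hypothetical names; this is one possible approach, not necessarily what the PR ended up doing):

#include <stdint.h>

typedef int32_t v128_t  __attribute__((__vector_size__(16)));
typedef float   __f32x4 __attribute__((__vector_size__(16)));

// diagnose_if is a clang extension; the condition is checked at each call site.
#define __DEMO_REQUIRE_CONSTANT(x)                                         \
  __attribute__((__diagnose_if__(!__builtin_constant_p(x),                 \
                                 #x " must be a constant expression",      \
                                 "error")))

static __inline__ v128_t
demo_f32x4_const(float c0, float c1, float c2, float c3)
    __DEMO_REQUIRE_CONSTANT(c0) __DEMO_REQUIRE_CONSTANT(c1)
    __DEMO_REQUIRE_CONSTANT(c2) __DEMO_REQUIRE_CONSTANT(c3) {
  return (v128_t)(__f32x4){c0, c1, c2, c3};
}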
I don't think there are any large outstanding issues left, so how does everyone feel about merging what we have so far and starting to document it in the tool-conventions repo? Of course we can still make changes as new issues and additions come up. It would also be good to get these merged so people can start experimenting with them in real projects.
I have a small concern about the filename: when this header is standardized and directly supported by compilers, it will live in a compiler-level … Other than the filename, the header LGTM.
Thanks, everyone! Looking forward to getting these tested out in the wild and developed further!
* Initial commit of intrinsics and test
* Get intrinsic tests compiling with -msimd128
* Fix tests and unimplemented-simd128 build
  TODO: truncation, conversion, and shufflevector
* Finish implementing instructions and clean up
* Add explicit alignments and make load and store unaligned
* Add const to loaded pointer
* Rewrite intrinsics to expose only v128_t
* Address recent comments
  - fix copy-paste error where intrinsics weren't used in test fns
  - fix some inconsistent return types
  - change some names and comments
* Add v128.const intrinsics for all types
  Note that they do not enforce that their arguments are constant because there is no way to do that for floats. Builtin functions could be used to enforce this requirement for ints, but without float support that would be a half-baked solution.
* Fix some codegen inefficiencies
  Along with https://reviews.llvm.org/D63615 fixes the issues with i64x2 shifts.
* Use __builtin_constant_p in wasm_*_const
* Add documentation
* Add stability disclaimer
* Rename to wasm_simd128.h
* Add wasm_*_make convenience functions
* Fix whitespace
cc @gdeepti @rrwinterton @jing-bao