Wasm SIMD intrinsics #8559

Merged: 19 commits, Jun 26, 2019
Conversation

@tlively (Member) commented May 8, 2019

@tlively (Member Author) left a comment:

The names and type signatures of all of these intrinsics are entirely up for discussion. Nothing is set in stone. Comments welcome.

WebAssembly SIMD128 Intrinsics
*/

#include <stdint.h>
@tlively (Member Author):

Is it bad to have this header file include other header files? We could use the same types as the builtins (vector of char, short, int, etc.) instead of using the C99 stdint types. This would be slightly less nice from a typing point of view, but would be simpler and mostly invisible to users.

Contributor:

It is probably OK to include other headers; I see that a few system headers in Clang's sources include stdint.h, and quite a few also include each other. On the other hand, converting between external and internal types might not be necessary, as it would work without it, and it is generally a good idea to reduce the amount of C-style casting :)

@tlively (Member Author):

Unfortunately I was getting type errors when trying to use the user-facing integer vector types with the builtin functions, so those casts are necessary. They have no runtime impact, though.

Contributor:

Yes, that makes sense. What I was thinking is that the user-facing type would be the same as the internal one, so casts would not be needed. I don't feel strongly either way, though.

@sunfishcode (Collaborator):

The main header file here isn't specific to Emscripten; would it make sense for it to live in upstream clang, in lib/Headers, instead of in Emscripten? That way it could be used by the wasm32-wasi and wasm32-unknown-unknown targets as well.

Also, keeping it in clang is similar to what other targets do, such as altivec.h for Power and xmmintrin.h for x86.

@tlively (Member Author) commented May 8, 2019:

@sunfishcode Yes, I agree that this should ultimately be in clang. The reason I am trying to get this into emscripten first is so that it can be developed alongside executable tests.

@tlively (Member Author) commented May 8, 2019:

In WebAssembly/tool-conventions#108 @Maratyszcza suggested that wasm_v128_load and wasm_v128_store take void* instead of v128* to enable unaligned loads and stores without triggering undefined behavior. In order to actually implement those intrinsics without triggering the same undefined behavior we would have to implement new clang builtins that took void* or char* and returned a v128. This would be nontrivial, so I'm wondering how important it is to allow unaligned loads and stores in this API. It seems to me that it wouldn't be a problem to enforce natural alignment.

// wasm_v128_store(v128 *mem, v128 a)
static __inline__ void __DEFAULT_FN_ATTRS wasm_v128_store(v128* mem, v128 a) {
*mem = a;
}

How do unaligned loads and stores work?

@gnzlbg commented May 8, 2019:

Also, if someone does, e.g.,

float* ptr = /*...not aligned to a v128 boundary...*/;
v128 val = wasm_v128_load((v128*)ptr); // UB

the behavior is undefined, because wasm_v128_load will perform an aligned load on an unaligned address. Given that the WASM instruction does support unaligned loads just fine, this feels like a footgun.

It may be worth calling this out, at least in a documentation comment, and perhaps also by using _aligned/_unaligned suffixes to make it clearer.

@tlively (Member Author):

Currently they do not. The underlying WebAssembly architecture does support unaligned loads and stores, of course, but using them may have silent and very large performance problems on some platforms, so we definitely don't want to encourage their use. Given that, I wonder how important it is to allow for unaligned loads and stores in this API.

Contributor:

I think it is reasonable to match the builtin, and document this somehow, unless you are going to change the builtins before merging this in.

By the way, what happens with the alignment hint in practice? Do some implementations do something different for different alignment (aside from checking it)?


@penzn Are implementations required to check the alignment?

@gnzlbg commented May 8, 2019:

It does not make sense that one cannot pass alignments smaller than the natural alignment in the hint. That has to be one of the main reasons for the hint to exist in the first place.

@tlively, the problem isn't that unaligned accesses are "slow". The problem is that unaligned accesses that have an incorrect alignment hint, e.g., one indicating that they are aligned when they are not, can be very slow.

These intrinsics prevent the slowness by never letting users do an unaligned access. Rust (e.g. packed_simd) prevents the slowness by passing a correct alignment hint.

@tlively (Member Author):

> I think what happens is that memarg passes an alignment hint, and the WASM SIMD spec states that for the loads and stores, that's the natural alignment of v128, which is 16.

Yep!

> The "front-end" (e.g. emscripten in this case) always has to pass 16 as the alignment hint for v128.load/store, even if it knows that the alignment could be less (e.g. if we were to add an _unaligned intrinsic here).

Not quite. "Normal" loads and stores will get the natural alignment (16 in this case), but if you explicitly mark at the source level that some other alignment could be used, then that will be reflected in the LLVM IR and respected in the alignment hints generated by the wasm backend.

> This results in the WASM machine code generator having no idea that a load could be unaligned, and therefore not being able to use an unaligned load instruction, which would be much faster than doing an aligned load, trapping, recovering, etc.

Since the alignment hints are only hints, engines have to be able to handle any load being potentially unaligned. But they might handle this in the generated code if the alignment hint specifies a non-natural alignment, and handle it only in traps (which would be faster in the common case) if the alignment hint specifies natural alignment.

> Or is the alignment hint allowed to specify an alignment lower than the natural alignment?

Yes, it is.

juj (Collaborator):

+1 for having wasm_v128_load() and wasm_v128_loadu() for 16-byte-aligned and 1-byte-aligned (unaligned) accesses, or for carrying the 1-byte/4-byte/16-byte alignment hint via an __attribute__. Unaligned loads and stores are very important from developers' point of view: there are times when one has to pack/unpack SIMD data to save memory, which equates to needing unaligned ops. One common use case is fast loading/saving of files, where one does not want to carry any padding bytes around.

@tlively (Member Author):

@juj we currently have wasm_v128_load do unaligned loads, and if users want aligned loads they can just dereference a v128_t*. Do you think that is sufficient or would you like to see separate functions for aligned and unaligned accesses? If the latter, can you explain more why that would be better?

juj (Collaborator):

That sounds good!

@sunfishcode (Collaborator):

> I wonder how important it is to allow for unaligned loads and stores in this API.

They are important. A common use case is doing SIMD on a slice of an array. Another common use case is vectorizing over an array which is passed as a parameter to a function and whose alignment is not known.

The elements are typically aligned, e.g. at 4-byte boundaries for f32x4; however, whole SIMD accesses are frequently not 16-byte aligned. This is sufficiently important that hardware designers make it fast -- for example, movups is the same speed as movaps on most x86 chips these days.

The current load/store intrinsics here are just wrappers around plain dereferences, which aren't especially useful -- if users want that behavior, they can just write *mem themselves. wasm_v128_store and wasm_v128_load should be the wasm store and load instructions and have defined behavior on unaligned accesses.

For an example of how to arrange for unaligned accesses, see the use of the __aligned__ attribute in clang's xmmintrin.h.

@tlively (Member Author) commented May 9, 2019:

Thanks, @sunfishcode! That's helpful. I will generalize the load and store intrinsics to enable unaligned loads and stores.

In other news, clang is being changed to disable int-to-float and float-to-int vector conversions by default. This will force the user to either change their build flags or add explicit casts wherever they want to reinterpret vectors in this way. Does this change the calculus of whether we want to provide a type for each possible vector interpretation or provide just a single v128 type?

#define __DEFAULT_FN_ATTRS __attribute__((__always_inline__, __nodebug__, __target__("simd128"), __min_vector_width__(128)))

// v128 wasm_v128_load(void* mem)
static __inline__ v128 __DEFAULT_FN_ATTRS wasm_v128_load(void* __mem) {
Contributor:

should be const void* __mem to allow loads via const pointers

@tlively (Member Author):

Done, thanks!

@alexcrichton:

FWIW I still personally feel that choosing a single v128 type across the board is a good way to go. I had some previous comments, but with the planned changes to LLVM that @tlively mentions, it seems good to divorce the headers from facets of the current implementation (which allows implicit casting) and instead match what wasm gives us (only one type).

@gnzlbg commented May 15, 2019:

@tlively (Member Author) commented May 15, 2019:

@gnzlbg, yes v8x16.shuffle1 is missing from the entire toolchain so there is no way to implement it right now. I will implement it eventually, but I have a lot of other stuff on my plate right now as well. Note that the entire toolchain, including these intrinsics, also uses the name v8x16.shuffle for the instruction that is currently called v8x16.shuffle2_imm in the spec proposal. To avoid duplicate work we should resolve the name bikeshedding before updating the toolchain. I opened WebAssembly/simd#78 so we can get started on that.

@alexcrichton I could get on board for just exposing one vector type. @rrwinterton, do you have thoughts on that?

@gnzlbg commented May 16, 2019:

@tlively I think one goal of the headers is to allow code to fully use the WASM SIMD spec, so the headers must allow doing everything the spec allows (e.g. unaligned loads).

Another goal of the headers is to allow people to write "portable" code. Ideally, this "C" header would become part of the WASM SIMD spec "somehow" (e.g. as an appendix) and toolchains will be encouraged to implement this header "as is". Different toolchains have different levels of C extension support (e.g. when it comes to portable packed SIMD types like i8x16). C toolchains might also want to expose the header from C++, or via other frontends like D, Nim, Rust, etc. with varying levels of C FFI support. From this POV, I think it makes sense to keep the header as simple and close to standard C as possible.

The current PR uses clang portable packed SIMD vector types in the APIs in these headers directly. This makes these types public, and makes the API of these types part of the API of the headers as well. This means that other toolchains do not only need to have types with these names, but these types need to support all operations that they support in clang (e.g. overloaded operators, etc.).

Using an opaque v128 type without any extra APIs would make this header much simpler to implement portably by the different toolchains.

Those wanting to expose a portable packed vector type library on top of this header, can do so in a library. That library can be "clang-specific" or "gcc-specific" or can work around toolchain differences using macros.

@penzn (Contributor) commented May 20, 2019:

There is value in exposing compiler-independent types, it would make it easier for different compilers to implement the same interface (though we only have one C/C++ toolchain at the moment). On the other hand, definitions in those headers can be somewhat different between different compilers, since those have different builtin support, see, for example, xmmintrin.h in GCC and Clang.

@tlively (Member Author) commented May 20, 2019:

Right. The interface this header exposes should be standardized, but its implementation is necessarily clang-specific. The question is whether we should expose a single v128 type to the user or a separate type for each lane interpretation. How those types happen to be represented is an implementation detail that might be different for each compiler, so is unimportant to this discussion.

It sounds like most people lean toward having just a single type. Is there anyone who thinks exposing multiple types as the current PR does is valuable?

@gnzlbg commented May 20, 2019:

> The interface this header exposes should be standardized, but its implementation is necessarily clang-specific.

I think the issue I am raising might be being misunderstood.

Right now, because of how v128 is implemented, it comes with methods like operator+ such that code like v128 a; v128 b; a + b; compiles. Because v128 is part of the public API of this header, these methods are also part of the public API being proposed here for all compilers to implement, yet these methods have received little discussion.

In fact, what should a + b do for v128? This header has a using v128 = i8x16, but GCC could have using v128 = i64x2, such that the two could do different things. Or are GCC and all other compilers required to implement v128 with i8x16 arithmetic operators?

I think that, at least initially (we can always relax this later), we should not only limit the API to v128, but also require that v128 have no methods, neither in C (through language extensions) nor in C++. That is, a program doing a + b on v128 should be ill-formed (i.e., a compilation error would be required).

How each compiler decides to implement that is up to them, but otherwise we are kind of making "whatever methods clang happens to support today for i8x16" part of the API exposed in these headers for v128. Clang supports a lot of methods on this particular vector type, so that's quite the API surface.

@gnzlbg commented May 20, 2019:

Maybe in other words: the API exposed by these headers should be the same for all compilers. I agree that the implementation details will necessarily be different, but this implementation, which is the reference implementation, is making the details of this compiler public and part of the exported API, and that appears to be more accidental than by design.

@penzn (Contributor) commented May 20, 2019:

My apologies if I misunderstood the comment. Are you suggesting making v128 an anonymous (forward-declared) type? I think any header that has implementations of exposed intrinsics would expose something compiler-dependent. For example, the XMM headers that I mentioned before operate on specific types, like the ones in the simd128 header here (__m128 is f32x4 instead of i8x16, but still explicit):

// GCC
typedef float __m128 __attribute__ ((__vector_size__ (16), __may_alias__));

// Clang
typedef float __m128 __attribute__((__vector_size__(16), __aligned__(16)));

We can try to expose a reference header with only the prototypes and then include it from a compiler-specific header.

Sorry, a separate question: given the use of alignment that @sunfishcode pointed out, should there be alignment-specific versions of loads and stores?

@tlively (Member Author) commented May 20, 2019:

@gnzlbg Thanks for clarifying, your concern makes sense to me now. I think it's not an issue that v128 (or i32x4, etc.) supports operators like +, -, etc. This is a side effect of the implementation and not part of the documented interface, so users who use such operators with the types defined by the header have stumbled into non-standard, unportable behavior, just like with any undocumented API. The extra operators are explicitly not part of the portable interface we are defining and implementing here. Since C has no privacy boundaries in headers, there is always some hidden, undocumented functionality no matter how the documented functionality is implemented.

@tlively (Member Author) commented May 20, 2019:

@penzn Alignment-specific loads and stores might make sense, given that WebAssembly can express them. We now have a way to do naturally-aligned loads and stores (dereferencing a v128*) and 1-byte aligned loads and stores (the intrinsics). What are the benefits of using other alignments, and are they worth adding functions to the header?

@sunfishcode (Collaborator):

@tlively For completeness, another option is to wrap types like v128 in structs, which would strongly discourage users from reaching around the API boundaries:

typedef char __v128 __attribute__((__vector_size__(16), __aligned__(16)));

typedef struct {
    __v128 __raw;
} v128;

This effectively makes the contents private, because accessing fields that start with __ in C/C++ is UB. And the wasm ABI even does a pretty good job passing and returning single-element structs in registers.


// v128_t wasm_i8x16_neg(v128_t a)
static __inline__ v128_t __DEFAULT_FN_ATTRS wasm_i8x16_neg(v128_t a) {
return (v128_t)(-(__i8x16)a);
}
Contributor:

It is better to cast to __u8x16. Signed overflow is UB in C/C++, and this applies to SIMD too.

@tlively (Member Author):

Thanks, good catch!


// v128_t wasm_i8x16_mul(v128_t a, v128_t b)
static __inline__ v128_t __DEFAULT_FN_ATTRS wasm_i8x16_mul(v128_t a, v128_t b) {
return (v128_t)((__i8x16)a * (__i8x16)b);
}
Contributor:

Cast to __u8x16 to avoid UB

@juj (Collaborator) commented Jun 16, 2019:

Any chance there would exist a concise summary somewhere of how the wasm simd API differs from the earlier SIMD.js API?

Also, has there been further charting of how existing SSE* and NEON code would port over and target hardware SSE* and NEON instructions? E.g. which instructions will be unavailable on each hardware architecture?

@tlively (Member Author) commented Jun 18, 2019:

@juj Unfortunately I do not know of any such documentation, but you can find a concise list of available wasm SIMD instructions here. The SIMD proposal is a direct successor to SIMD.js, so it should be largely similar.

I don't think anyone has done a detailed analysis of how much of SSE* or NEON will be missing from the current SIMD proposal either; it will be quite a lot. We've generally deferred ISA-specific considerations to a future SIMD v2 that does not exist yet.

tlively added 3 commits June 19, 2019 16:26
 - fix copy-paste error where intrinsics weren't used in test fns
 - fix some inconsistent return types
 - change some names and comments
Note that they do not enforce that their arguments are constant
because there is no way to do that for floats. Builtin functions could
be used to enforce this requirement for ints, but without float
support that would be a half-baked solution.
Along with https://reviews.llvm.org/D63615 fixes the issues with i64x2
shifts.
@tlively (Member Author) commented Jun 21, 2019:

Since we can't enforce that both integer and float const intrinsics have constant arguments, what do people think about providing a slightly higher level API for constructing vectors that is guaranteed to lower to a single v128.const if and only if all of its arguments are constant? We would create a new name for these intrinsics, maybe something like wasm_make_*.

@gnzlbg commented Jun 21, 2019:

IIUC the issue seems to be that clang has support for adding intrinsics that take integer arguments and is able to require these integer arguments to be constant expressions, but it does not have support for allowing those builtins to take floating-point arguments that are required to be constant expressions. Is that correct?

@penzn (Contributor) commented Jun 21, 2019:

If non-constant values are passed, it would emit a bunch of replace_lane instructions, right? I think it is probably better to rename the intrinsics; for example, one of the SSE analogs is called _mm_set_ps, but "make" is probably OK too.

Not very high priority, but this would help if any other toolchain that lacks watertight constant detection decides to adopt this header as well.

@tlively (Member Author) commented Jun 22, 2019:

@gnzlbg Yes, that is the situation. The capability to require constant floating-point arguments could of course be added to clang, but that seems out of scope for our particular effort here.

@gnzlbg commented Jun 22, 2019:

@tlively

> Yes, that is the situation. The capability to require constant floating-point arguments could of course be added to clang, but that seems out of scope for our particular effort here.

I think one should separate "clang limitations" from the "specification" of which headers implementations should provide. The problem of having to add multiple constructors is one that all implementations that support C are going to run into, so I think it makes sense to add them to the header "spec", requiring there that the arguments to the constructors must be constant expressions.

This PR with the clang implementation should enforce that for the constructors taking integers, but it cannot enforce it for the constructors taking floats due to a limitation in clang. That's OK; I agree that this is not the place to fix that, so maybe this can just be documented as a "bug" here, and that's it?

@Maratyszcza (Contributor) commented Jun 22, 2019:

I agree with @gnzlbg. The intrinsic specification must provide a reliable way to generate the v128.const instruction. Thus, it should guarantee that wasm_*_const intrinsics with compile-time constant arguments produce a v128.const Wasm instruction. It is desirable, but not absolutely required, for toolchains to produce an error if arguments are not compile-time constants, and the specification would make no guarantees as to the efficiency or composition of the generated code in that case.

Perhaps we could work around the limitation on constness specification in Clang intrinsic arguments by using static_assert with __builtin_constant_p?

@tlively (Member Author) commented Jun 23, 2019:

I didn't know about __builtin_constant_p, but it's exactly what we needed here. Thanks!

@tlively (Member Author) commented Jun 24, 2019:

I don't think there are any large outstanding issues left, so how does everyone feel about merging what we have so far and starting to document it in the tool-conventions repo? Of course we can still make changes as new issues and additions come up. It would also be good to get these merged so people can start experimenting with them in real projects.

@Maratyszcza (Contributor):

I have a small concern about the filename: when this header is standardized and directly supported by compilers, it will live in the compiler-level include directory among other headers, and simd128.h is not specific enough. I suggest renaming it to wasm_simd.h or wasm_simd128.h (following the same pattern as arm_neon.h).

Other than filename, the header LGTM.

@gnzlbg commented Jun 25, 2019:

wasm_simd128.h sounds good, since simd128 is what the feature is called.

@tlively tlively merged commit 7da0dc8 into emscripten-core:incoming Jun 26, 2019
@tlively (Member Author) commented Jun 26, 2019:

Thanks, everyone! Looking forward to getting these tested out in the wild and developed further!

belraquib pushed a commit to belraquib/emscripten that referenced this pull request Dec 23, 2020
* Initial commit of intrinsics and test

* Get intrinsic tests compiling with -msimd128

* Fix tests and unimplemented-simd128 build

TODO: truncation, conversion, and shufflevector

* Finish implementing instructions and clean up

* Add explicit alignments and make load and store unaligned

* Add const to loaded pointer

* Rewrite intrinsics to expose only v128_t

* Address recent comments

 - fix copy-paste error where intrinsics weren't used in test fns
 - fix some inconsistent return types
 - change some names and comments

* Add v128.const intrinsics for all types

Note that they do not enforce that their arguments are constant
because there is no way to do that for floats. Builtin functions could
be used to enforce this requirement for ints, but without float
support that would be a half-baked solution.

* Fix some codegen inefficiencies

Along with https://reviews.llvm.org/D63615 fixes the issues with i64x2
shifts.

* Use __builtin_constant_p in wasm_*_const

* Add documentation

* Add stability disclaimer

* Rename to wasm_simd128.h

* Add wasm_*_make convenience functions

* Fix whitespace
@tlively tlively deleted the wasm-simd-intrinsics branch February 5, 2024 18:07
10 participants