Description
This is intended to be a tracking issue for implementing all vendor intrinsics in this repository.
This issue is also intended to be a guide documenting the process of adding new vendor intrinsics to this crate.
If you decide to implement a set of vendor intrinsics, please check the list below to make sure somebody else isn't already working on them. If it isn't checked off and doesn't have a name next to it, feel free to comment that you'd like to implement it!
At a high level, each vendor intrinsic should correspond to a single exported Rust function with an appropriate `target_feature` attribute. Here's an example for `_mm_adds_epi16`:
```rust
/// Add packed 16-bit integers in `a` and `b` using saturation.
#[inline]
#[target_feature(enable = "sse2")]
#[cfg_attr(test, assert_instr(paddsw))]
pub unsafe fn _mm_adds_epi16(a: __m128i, b: __m128i) -> __m128i {
    // Defer to the `paddsw` compiler intrinsic, brought into scope by an `extern` block.
    unsafe { paddsw(a, b) }
}
```
Let's break this down:
- The `#[inline]` attribute is added because vendor intrinsic functions should generally always be inlined: the intent of a vendor intrinsic is to correspond to a single particular CPU instruction, and a vendor intrinsic that is compiled into an actual function call could be disastrous for performance.
- The `#[target_feature(enable = "sse2")]` attribute instructs the compiler to generate code with the `sse2` target feature enabled, regardless of the target platform. That is, even if you're compiling for a target that doesn't enable `sse2` by default, the compiler will still generate code for `_mm_adds_epi16` as if `sse2` support existed. Without this attribute, the compiler might not generate the intended CPU instruction.
- The `#[cfg_attr(test, assert_instr(paddsw))]` attribute indicates that when we're testing the crate, we'll assert that the `paddsw` instruction is generated inside this function, ensuring that the SIMD intrinsic truly is an intrinsic for the instruction!
- The types of the vectors given to the intrinsic should exactly match the types provided in the vendor interface (with things like `int64_t` translated to `i64` in Rust).
- The implementation of the vendor intrinsic is generally very simple. Remember, the goal is to compile a call to `_mm_adds_epi16` down to a single particular CPU instruction. As such, the implementation typically defers to a compiler intrinsic (in this case, `paddsw`) when one is available. More on this below.
- The intrinsic itself is `unsafe` due to the usage of `#[target_feature]`: executing it on a CPU that lacks the enabled feature is undefined behavior, so callers must make sure the feature is available first.
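The last point has a practical consequence for callers: a `#[target_feature]` function should only be invoked after checking that the running CPU supports the feature. Below is a minimal sketch of that pattern, assuming an x86_64 target and today's standard library (`std::arch::x86_64` and the `is_x86_feature_detected!` macro). `adds_i16x8` and `adds_i16x8_sse2` are hypothetical helper names; the scalar fallback uses `i16::saturating_add`, which has the same per-lane semantics as `_mm_adds_epi16`.

```rust
// Sketch: runtime feature detection before calling a `#[target_feature]`
// function, with a portable scalar fallback. Helper names are hypothetical.

#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::{__m128i, _mm_adds_epi16, _mm_loadu_si128, _mm_storeu_si128};

/// Lane-wise saturating addition of eight `i16` values.
pub fn adds_i16x8(a: [i16; 8], b: [i16; 8]) -> [i16; 8] {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("sse2") {
            // SAFETY: `sse2` was detected on this CPU, so the
            // feature-gated function may be executed.
            return unsafe { adds_i16x8_sse2(a, b) };
        }
    }
    // Scalar fallback with the same saturating semantics.
    let mut out = [0i16; 8];
    for i in 0..8 {
        out[i] = a[i].saturating_add(b[i]);
    }
    out
}

/// Compiled with `sse2` enabled; only call after detecting the feature.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse2")]
unsafe fn adds_i16x8_sse2(a: [i16; 8], b: [i16; 8]) -> [i16; 8] {
    let mut out = [0i16; 8];
    // SAFETY: each array is exactly 16 bytes, and the loads/stores are unaligned.
    unsafe {
        let va = _mm_loadu_si128(a.as_ptr() as *const __m128i);
        let vb = _mm_loadu_si128(b.as_ptr() as *const __m128i);
        let r = _mm_adds_epi16(va, vb);
        _mm_storeu_si128(out.as_mut_ptr() as *mut __m128i, r);
    }
    out
}
```

Since `sse2` is part of the x86_64 baseline, the check always succeeds there; the same pattern matters more for extensions such as `avx2`.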
Once a function has been added, you should also add at least one test for basic functionality. Here's an example for `_mm_adds_epi16`:
```rust
#[simd_test = "sse2"]
unsafe fn test_mm_adds_epi16() {
    let a = _mm_set_epi16(0, 1, 2, 3, 4, 5, 6, 7);
    let b = _mm_set_epi16(8, 9, 10, 11, 12, 13, 14, 15);
    let r = _mm_adds_epi16(a, b);
    let e = _mm_set_epi16(8, 10, 12, 14, 16, 18, 20, 22);
    assert_eq_m128i(r, e);
}
```
Note that `#[simd_test]` is used just like `#[test]`; it's a custom macro that enables the target feature in the test and generates a wrapper ensuring the feature is available on the local CPU as well.
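To make that description concrete, here is a hand-written approximation of the kind of wrapper `#[simd_test]` generates. It is a sketch under assumptions (x86_64 target, `std::arch` available), not the macro's actual expansion; `test_mm_adds_epi16_body` and `simd_test_mm_adds_epi16` are hypothetical names, and the final comparison uses `_mm_cmpeq_epi8` plus `_mm_movemask_epi8` in place of the crate's `assert_eq_m128i` helper.

```rust
// Hand-written sketch of a feature-gated test plus a runtime-check wrapper.
// Not the real `#[simd_test]` expansion; names are hypothetical.

#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// The test body, compiled with `sse2` enabled.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse2")]
unsafe fn test_mm_adds_epi16_body() {
    let a = _mm_set_epi16(0, 1, 2, 3, 4, 5, 6, 7);
    let b = _mm_set_epi16(8, 9, 10, 11, 12, 13, 14, 15);
    let r = _mm_adds_epi16(a, b);
    let e = _mm_set_epi16(8, 10, 12, 14, 16, 18, 20, 22);
    // Byte-wise compare: every matching byte becomes 0xFF, so a full match
    // yields a 16-bit mask of all ones.
    assert_eq!(_mm_movemask_epi8(_mm_cmpeq_epi8(r, e)), 0xFFFF);
}

/// Plays the role of the generated `#[test]` function: runs the body only if
/// the CPU has the feature. Returns `true` if the body ran, `false` if skipped.
fn simd_test_mm_adds_epi16() -> bool {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("sse2") {
            // SAFETY: the required feature was detected at runtime.
            unsafe { test_mm_adds_epi16_body() };
            return true;
        }
    }
    false
}
```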
Finally, once that's done, send a PR!
Writing the implementation
An implementation of an intrinsic (so far) generally takes one of three shapes:

- The vendor intrinsic has no corresponding compiler intrinsic, so you must write the implementation in such a way that the compiler will recognize it and produce the desired codegen. For example, the `_mm_add_epi16` intrinsic (note the missing `s` in `add`) is implemented via `simd_add(a, b)`, which compiles down to LLVM's cross-platform SIMD vector API.
- The vendor intrinsic does have a corresponding compiler intrinsic, so you must write an `extern` block to bring that intrinsic into scope and then call it. The example above (`_mm_adds_epi16`) uses this approach.
- The vendor intrinsic has a parameter that must be a constant value when given to the CPU instruction, where that constant often controls how the instruction operates. This means the implementation of the vendor intrinsic must guarantee that the parameter is a constant. This is tricky because Rust doesn't (yet) have a stable way of doing this, so we have to do it ourselves. How you do it can vary, but one particularly gnarly example is `_mm_cmpestri` (make sure to look at the `constify_imm8!` macro).
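The third shape is the hardest to picture, so here is a minimal sketch of the idea behind a macro like `constify_imm8!`: when a callee needs its immediate as a compile-time constant, a runtime value can be turned into one by matching on every possible value and calling the callee with a literal in each arm. `rotate_lanes`, `constify_imm2!` and `rotate_lanes_dyn` are hypothetical names; `rotate_lanes` is a plain-Rust stand-in rather than a real vendor intrinsic, and a 2-bit immediate keeps the `match` short (an 8-bit one needs 256 arms). The sketch uses const generics, which were stabilized after this issue was written.

```rust
// Sketch of the "constify" trick. All names are hypothetical.

/// Stand-in for an instruction whose immediate must be a constant:
/// rotates the four lanes left by `IMM` positions.
fn rotate_lanes<const IMM: usize>(v: [u32; 4]) -> [u32; 4] {
    let mut out = [0u32; 4];
    for i in 0..4 {
        out[i] = v[(i + IMM) % 4];
    }
    out
}

/// Expands to a `match` with one arm per possible 2-bit immediate, so each
/// arm invokes `$expand!` with a literal constant.
macro_rules! constify_imm2 {
    ($imm:expr, $expand:ident) => {
        match ($imm) & 0b11 {
            0 => $expand!(0),
            1 => $expand!(1),
            2 => $expand!(2),
            _ => $expand!(3),
        }
    };
}

/// Accepts the immediate at runtime; only the low two bits of `n` are used.
fn rotate_lanes_dyn(v: [u32; 4], n: u8) -> [u32; 4] {
    // Local macro that plugs a literal into the const-generic call.
    macro_rules! call {
        ($imm:expr) => {
            rotate_lanes::<{ $imm }>(v)
        };
    }
    constify_imm2!(n, call)
}
```

In today's `std::arch`, immediates are generally expressed as const generic parameters instead (for example `_mm_shuffle_epi32::<IMM8>`), which moves the constant requirement into the type system.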
References
All Intel intrinsics can be found here: https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=5236
The compiler intrinsics available to us through LLVM can be found here: https://gist.github.com/anonymous/a25d3e3b4c14ee68d63bd1dcb0e1223c
The Intel vendor intrinsic API can be found here: https://gist.github.com/anonymous/25d752fda8521d29699a826b980218fc
The Clang header files for vendor intrinsics can also be incredibly useful. When in doubt, Do What Clang Does:
https://github.com/llvm-mirror/clang/tree/master/lib/Headers
TODO
["MMX"]
Activity
alexcrichton commented on Sep 25, 2017
cc @BurntSushi @gnzlbg, I've opened this up and moved `TODO.md` out here, I figure it may be easier to collaborate here to ensure we can attach names everywhere!
mattico commented on Sep 25, 2017
Could you edit the guide to suggest unsafe functions for the intrinsics? #21
alexcrichton commented on Sep 26, 2017
@mattico makes sense yeah! Although we may want to wait until #21 is closed out to avoid inconsistencies
AdamNiederer commented on Sep 26, 2017
For those wishing to implement intrinsics above SSE2, make sure you're running your tests with `RUSTFLAGS="-C target-cpu=native" cargo test` on something that supports that instruction set extension. It looks like it's only running the SSE2 tests otherwise.
gnzlbg commented on Sep 26, 2017
You can use `RUSTFLAGS="-C target-feature=+avx2"` to enable a particular extension. Note, however, that a CPU that supports the extension is needed to run the tests. To develop tests for a different architecture (e.g. develop for ARM from x86), you can use cross-compilation. To run the tests... Travis is an option. I don't know if there is a better option though.
AdamNiederer commented on Sep 26, 2017
It looks like Travis only runs SSE2 and below with our current config. I wonder if their machines support AVX...
alexcrichton commented on Sep 26, 2017
@AdamNiederer oh that's actually a bug! I think I see what's going on though, I'll submit a fix.
gnzlbg commented on Sep 26, 2017
@alexcrichton https://github.com/rust-lang-nursery/stdsimd/blob/master/ci/run.sh probably needs to set `RUSTFLAGS="-C target-cpu=native"` to run most tests. @AdamNiederer makes a point though: what instruction sets does Travis support? If it doesn't support AVX2, those will never be tested (I am pretty sure Travis does not support AVX512, so we'll need a different solution for that).
AdamNiederer commented on Sep 26, 2017
Added in #45. Let's see what Travis has to say about it.
EDIT: The build is failing, but those same 20 tests were failing for me on my Ivy Bridge box last night. I think LLVM might be emitting wider versions of 128- or 64-bit-wide instructions on CPUs that support them. It also looks like Travis supports AVX2 🎉
alexcrichton commented on Sep 26, 2017
@gnzlbg oh I'm going to add `cfg_feature_enabled!` to all tests and enable them all unconditionally all the time, that way whatever your CPU supports we'll be testing everything (without any required interaction).
@AdamNiederer thanks! I'll look into the failures and see if I can fix them.
dlrobertson commented on Sep 28, 2017
Interested in helping out with this. Figured I'd start super small with `cvtps2dq` #65
vbarrielle commented on Sep 29, 2017
Hello, I've given a try at `_mm256_div_ps` and its double counterpart, see #73.
dlrobertson commented on Sep 30, 2017
Post #81, SSE 4.2 should be covered.