Remove I64x2/F64x2 instructions #81
What kind of performance data would be needed? I64x2 SIMD operations would be very useful for hash functions and cryptography in general. |
Typically, any real-world workload (e.g. a crypto kernel) implemented using the 64x2 types and showing performance benefits compared to an implementation without these types will help. |
I don’t have an i32-only baseline, but here’s a Blake2b implementation for WASM SIMD that uses i64x2.add and i64x2.shr_u heavily: https://github.com/WAVM/WAVM/blob/master/Examples/blake2b.wast Blake2b is specifically designed for efficient implementation using i64x2 SIMD, so I expect that emulating the i64x2 operations with i32x4 operations would be much slower. |
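As a rough sketch of what those operations look like natively (not code from the linked blake2b.wast; the SSE2 mapping below is an assumed lowering on x86): one column step of the Blake2b G function over two parallel 64-bit lanes, written with the SSE2 intrinsics that i64x2.add, v128.xor, i64x2.shl and i64x2.shr_u would lower to. Emulating the same step with i32x4 operations would need explicit carry handling for every add and extra shuffles for every 64-bit rotate.

```c
#include <emmintrin.h>  // SSE2

// One column step of Blake2b's G function on two independent 64-bit lanes
// packed into 128-bit registers.
static inline void g_column_step(__m128i *a, __m128i *b, __m128i *c,
                                 __m128i *d, __m128i m) {
    *a = _mm_add_epi64(_mm_add_epi64(*a, *b), m);   // i64x2.add
    *d = _mm_xor_si128(*d, *a);                     // v128.xor
    *d = _mm_or_si128(_mm_srli_epi64(*d, 32),       // rotr64(d, 32):
                      _mm_slli_epi64(*d, 32));      // i64x2.shr_u / i64x2.shl / v128.or
    *c = _mm_add_epi64(*c, *d);
    *b = _mm_xor_si128(*b, *c);
    *b = _mm_or_si128(_mm_srli_epi64(*b, 24),       // rotr64(b, 24)
                      _mm_slli_epi64(*b, 40));
}
```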
Computing the Mandelbrot set is one of the benchmarks used in the Dart/JavaScript SIMD paper, and it is one that uses 64x2 instructions. |
Are you referring to: McCutchan, John, et al. "A SIMD programming model for Dart, JavaScript, and other dynamically typed scripting languages." Proceedings of the 2014 Workshop on Programming Models for SIMD/Vector Processing. ACM, 2014? I checked out the linked codebase (https://github.com/tc39/ecmascript_simd/tree/master/src/benchmarks) and it doesn't look like the benchmarks use 64x2 instructions. |
Yes, that's the paper. They did not add a 64x2 vector type, so Dart cannot use it. Here is the performance of Dart on a Mandelbrot implementation that uses 64-bit values, compared against that of many other programming languages: https://benchmarksgame-team.pages.debian.net/benchmarksgame/performance/mandelbrot.html . IIRC the machine where those benchmarks run is a Core 2, and most of the results that are < 2.0x slower than the fastest generate machine code that uses 64x2 vector instructions. |
I am very doubtful about this benchmark when I see Fortran 4.5 times slower than C. They probably messed their implementation up. Concerning i64x2, it's not just that some algorithms need to process 64-bit integers, and need to process them fast. When you deal with SIMD, you should try to keep all your data types the same size in order to avoid resizing your data over and over again. If you don't have i64x2, you will have no choice but to shuffle your data each and every time you need to process both integers and floats. This is the reason why we need to have i64x2, even if only very few algorithms need integers that wide. EDIT: Mea culpa. I forgot that the topic was about removing not only i64x2 but also f64x2... |
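As a hypothetical illustration of the width-mismatch cost described above (a sketch, not taken from this thread): widening an i32x4 vector into two f64x2 vectors costs a shuffle plus two converts before any useful arithmetic happens, and a kernel that mixes 32-bit integers with doubles pays this repacking on every iteration. SSE2 intrinsics are used here as a stand-in for the corresponding WASM operations.

```c
#include <emmintrin.h>  // SSE2

// Widen four 32-bit integers into two vectors of two doubles each.
// The shuffle and the two converts are pure repacking overhead that
// appears whenever integer and floating-point lanes have different widths.
static void widen_i32x4_to_f64x2_pair(__m128i v, __m128d *lo, __m128d *hi) {
    *lo = _mm_cvtepi32_pd(v);                                   // lanes 0,1 -> doubles
    *hi = _mm_cvtepi32_pd(                                      // lanes 2,3 -> doubles
              _mm_shuffle_epi32(v, _MM_SHUFFLE(3, 2, 3, 2)));
}
```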
Clicking on a benchmark reveals its source code, the compiler options used to compile it, etc. so could you be a bit more precise on which part of the Fortran implementation is "messed up"? The C programs are explicitly using the SSE instructions that the WASM opcodes being considered for removal would lower to, and most of them are also parallelized. The Fortran benchmarks do not use explicit SSE instructions because Fortran does not have intrinsics for them. |
I don't write Fortran myself, but I know that optimizing Fortran compilers are more effective than optimizing C compilers because of the rules of Fortran (mainly aliasing rules). So I would expect at least similar performance between fully optimized C and fully optimized Fortran. That's what I mean when I say there is something weird with this benchmark. |
Probably a combination of vectorization, parallelization, and memory layout, all of which differ between the Fortran benchmark and the C benchmarks.
I don't think this is correct. If you go from scalar code with a data dependency to SIMD code without data dependencies, you can increase throughput by N * M, where N is the number of vector lanes and M is the number of SIMD instructions that can be scheduled in parallel (i.e. the number of ports that support each instruction).
On a similar machine and using the same techniques as the C code we do reproduce their results, getting a ~4x speedup with respect to scalar code with both Rust std::simd and ISPC (e.g. see here). |
Fortran compilers vectorize in the same way as C compilers do, and they use the same multithreading library (OpenMP). The default memory layouts are different, sure (and Fortran usually has a better default than C concerning vectorization).
If your scalar code has a data dependency where your SIMD one doesn't, then why can't you write scalar code without data dependencies? SIMD alone cannot give a higher speedup than the number of SIMD lanes. If you gain more, then you have applied other transformations that could have been applied to the scalar code as well.
A x4 speedup can also be explained by using AVX. But if that's the case, then it is not really relevant to the current discussion. Another weird thing with those benchmarks: the C++ code is faster than the C code while no C++ feature has been used. Shouldn't C be as fast as C++ if the code is C in both cases? A last thing: the best C++ code has intrinsics for SSE, AVX and AVX512, while the command line is […]. Is it a fair comparison? I don't know, and I have no way to know. So it's at the very least questionable. |
Of course you have a way to know. You could actually click the links, read them, google what you don't understand or ask, copy & paste the assembly into godbolt and clarify your own questions (https://godbolt.org/z/mDEt68). Instead, you just claim that Fortran is faster than C and therefore those results are incorrect, and then go completely off-topic about which vector instructions C or Fortran might be using even though both are compiled with […]. I still claim that:
and I think this disproves the claim that 64x2 instructions are useless. |
"Benchmakring game" seems to be a set of unrelated implementations, by lots of different people. I don't think it is practical to compare those to each other (other than for fun); for example, there is this single-threaded scalar C implementation that does very poorly. The top ones use SIMD and threads, bottom ones are completely sequential; even if they produce the same result, they get to it in dramatically different ways from performance point of view. If you are interested in using SIMD semi-directly in Fortran, there are OpenMP SIMD directives, let me know if you want to submit an implementation, I could probably help :) Back to the topic, I personally see value in 64-bit ops: double precision floats are important, and it makes sense to include them for at least for compatibility reasons in SIMD operations, with a similar argument for int64. I also agree that they should be faster than scalar, we just don't know by how much. I know there was a group vote to exclude them from spec, so if we want them in, we might need to provide justification. Let me see if I can pull up any compute code, but if you have any (or can think of any), let's take a look at that as well. |
I don't have time to put that much time into the process, and I guess I'm far from being the only one in that case. My point is, this benchmark is not reliable. If you want to prove anything, either you create your own unbiased benchmark, or you find a well-crafted benchmark that exposes the results.
I've never said the results were incorrect, I just said that it is highly surprising to see Fortran x4.5 slower despite usually being at least as good as C (and that is not explained just by the absence of SIMD intrinsics). And that encourages suspicion, like the other points I made.
You misunderstood my last point: I just wanted to show the possible biases that this benchmark has, and why you probably don't want to rely on it. It was not about Fortran.
That is exactly my point: by using questionable benchmarks, you disprove nothing. This creates a hint, not a proof. Don't get me wrong, I also want to keep v64x2 operations available and I also think they have a good performance impact when an algorithm needs them. But if you want to convince people, I think you need better benchmarks. |
We can start with matrix multiplication, which should be easy enough to get an idea. I guess the Mersenne Twister is also a good use case for i64x2, but I don't know the algorithm well enough to be sure. I also want to add that ARM and PowerPC added 64-bit types to their SIMD instruction sets, and they wouldn't have done that if it was useless. |
The benchmark has a specification (https://benchmarksgame-team.pages.debian.net/benchmarksgame/description/mandelbrot.html#mandelbrot), and validation for the benchmark results is provided there as well. Dozens of implementations of this benchmark are provided in dozens of programming languages, many of which can be compiled down to WASM. The benchmark also uses […]. So if you are claiming that this benchmark is unreliable and cannot be used to measure the impact of f64x2 and i64x2 WASM SIMD intrinsics, can you explain why? |
We did provide two implementations of that benchmark, one in Rust that can compile to WASM, and one in ISPC, both validated against the results produced by others, and our measurements match those of the results provided for this benchmark by others as well. |
As a quick test, I added the LLVM Scalarizer to the WAVM optimization pipeline. This pass replaces vectors and vector operations with scalars and scalar operations. WAVM subsequently runs LLVM's instruction combining pass, which revectorizes the cross-lane shuffles, but WAVM does not use any of LLVM's auto-vectorization passes, so most of the generated code remains scalar. Using the scalarizer pass with the Blake2b example runs 2.26x slower than without the scalarizer pass. The assembly looks reasonable, though it spends some effort moving between scalar and vector registers for the cross-lane shuffles. However, without the vector cross-lane shuffles, the scalar code suffers a further ~4x slowdown. This test is an approximation of how you would scalarize the WASM i64x2 operations while keeping the cross-lane shuffles. However, if you did not have the vector shuffles either, the further ~4x slowdown above would apply on top of that. I think it's clear that Blake2 benefits substantially from i64x2 operations. |
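For reference, a minimal sketch of what scalarizing a single i64x2.add means (a reconstruction under that assumption, not WAVM's actual output): the vector add becomes two independent scalar 64-bit additions on the extracted lanes, and the extra scalar<->vector register traffic mentioned above comes from moving lanes in and out of vector registers around the remaining shuffles.

```c
#include <stdint.h>

// A v128 holding two 64-bit lanes, modeled as a plain struct.
typedef struct { uint64_t lane[2]; } v128_i64x2;

// Scalarized equivalent of a single i64x2.add: the SIMD instruction is
// replaced by two independent scalar 64-bit additions.
static inline v128_i64x2 i64x2_add_scalarized(v128_i64x2 a, v128_i64x2 b) {
    v128_i64x2 r;
    r.lane[0] = a.lane[0] + b.lane[0];
    r.lane[1] = a.lane[1] + b.lane[1];
    return r;
}
```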
You know what? Let me waste my time doing proper benchmarks and show you: https://gist.github.com/lemaitre/eb2c4e99849ff2f927c09041a7a45c46. Here is a table of the times I got on an i9-7900X, 10C/20T @ 4 GHz, for N=16000, compiled for SSE4:
As you can see, my reimplementation has the same performance as the one from benchmarksgame, yet the SIMD version is "only" ~x2 faster. There is no x4 like you want to believe/show from benchmarksgame or from your benchmark. So why is benchmarksgame not reliable for comparing "with v64x2" and "without v64x2"? Simply because there are no versions that are comparable in every respect except the use of SIMD instructions. You see a certain speedup between a SIMD and a scalar version and you directly believe that the speedup is only due to SIMD, but you forget the presence of other transformations that can also improve the computation time. So those benchmarks give you a wrong impression. Now that we have an unbiased benchmark to see the impact of the presence of v64x2 instructions, let's use this one instead of the one you proposed.
If you got a x4 speedup with your custom benchmark, either you targeted AVX (which would not be relevant for v64x2 instructions), or the versions are not comparable and don't have the same set of transformations applied (or the same algorithm). |
I honestly think you are completely confused. You decided that writing a Fortran program that's 4.5x slower than a C program is impossible, and also that obtaining a 2x speedup using explicit vectorization was very hard. I pointed out to you that some of the key differences between these programs are that the C program is both explicitly parallelized and explicitly vectorized, that the hardware used there is a 2-core CPU, and that our experiments comparing a scalar serial program with an explicitly vectorized and explicitly parallelized program on a similar two-core CPU showed a 4x speed-up (probably a 2x vectorization and a 2x parallelization speed-up, although we did not analyze that in any detail). You also made other claims, like that obtaining larger than 2x speed-ups using SIMD is impossible because the transformations that would enable this can be applied to scalar code, which is another claim that's incorrect. I have no idea what these issues have to do with me pointing out that all implementations less than 2x slower than the fastest one are using SIMD instructions on 64-bit elements. That's a fact. I have no idea how anything you are mentioning applies to that. From all I can tell, I think you agree with that?
That's what I would expect, and it matches our results without explicit parallelization.
One goal of WASM is performance parity with native code. All the fastest implementations, up to those that are <=2x slower than the fastest, are native code. They are implemented in different programming languages, using different compilation backends, and yet they all agree on using 64xN vector instructions. I do agree with you that comparing widely different implementations does not make much sense. Dismissing this result wouldn't, IMO, make much sense either. The same compilers generating machine code for those programming languages are very often used to generate WASM as well, and they all generate slightly different WASM. Not being able to generate 64x2 vector instructions would be a regression that applies to all of them. |
The Fortran slowness was only the entry point for my suspicion, nothing more. I don't care about Fortran, and my last point was not about it. I also never said that x2 was very hard, but that higher than x2 was seemingly impossible.
What does the multithreaded speedup have to do with the usefulness of SIMD instructions? Multithreading is a transformation that can also be applied to scalar code, and thus has to be applied if you want to know whether SIMD instructions are worth it. If you did not investigate where the speedup comes from, how can you tell what part of this speedup is due to SIMD, and thus whether the SIMD instructions are worth it?
Using SIMD only, it is seemingly impossible to have a speedup higher than the number of lanes. Exactly like with multithreading: expecting a speedup higher than the number of cores is unrealistic (but still achievable in some very rare corner cases, and nothing big). All those claims follow from how CPUs work and are confirmed by many people's experience. And that's exactly what I showed you with my reimplementation of Mandelbrot. I have never seen such counter-examples (even though I know they can exist in theory), yet optimizing software using SIMD is my everyday job.
Because that's a selection bias. Of course people trying to be the fastest will try SIMD intrinsics. Almost nobody will put that much effort into optimizing Mandelbrot without using SIMD. If people don't have SIMD instructions, they will optimize the scalar code as much as they currently do with SIMD. And this future code is what we want to compare against, not what people currently write in scalar, because that is less optimized than what they could do. If this future code does not exist, then we have to write it ourselves, and only afterwards can we see the benefit of SIMD instructions.
So why confuse everybody with multithreading? |
"very hard" weren't my words, you literally said "very hard" here (emphasis mine):
where by perfect speed up you mean 2x (unless you are now changing your mind about achieving a speed up higher than 2x using 64x2 instructions being possible).
@ngzhian asked for a benchmark that uses 64x2 vector instructions. I pointed out that this particular benchmark uses 64x2 SIMD instructions, and that all the fast implementations of this benchmark use those. Having implemented this benchmark many times in the past (scalar, vectorized, and vectorized+parallelized versions), I knew that the use of 64x2 SIMD instructions was important for an efficient implementation, which is why I suggested it. For some reason, you did not believe any of that, but instead of checking the facts, you decided to build up a conspiracy theory about Fortran, misrepresent my answers to build strawmen, and then attack and debunk those.
The claim I made is that explicitly vectorized programs do show speedups larger than the number of vector lanes available in a single vector register, because CPUs often have multiple ports that allow executing operations on multiple vector registers in parallel. For example, this kernel

```
double sum(double* x, int_t len) {
    double s = 0.0;
    for (int_t i = 0; i < len; i += 1) {
        s += x[i];
    }
    return s;
}
```

can be explicitly vectorized as:

```
f64x4 sum(f64* x, int_t len) {
    f64x4 s = f64x4.splat(0.0);
    for (int_t i = 0; i < len; i += 4) {
        s += f64x4.load(x + i);
    }
    return s.sum();
}
```

On a CPU without AVX, a naive compiler lowers f64x4 to 2x 128-bit wide vector registers, such that operations on those can be executed in parallel. This transformation produces almost a 4x speed-up over the scalar code on my machine when the vector is in cache. You claim that this cannot happen because scaling is limited by the number of lanes in a SIMD vector. This is not the case: scaling is limited by the number of lanes that can be processed in parallel, and that can be larger than the number of lanes in a single vector if multiple vectors can be processed in parallel. You also claim that, if these transformations exist, they can be applied to scalar code, but that's not the case: applying this transformation to scalar code is unsound, because it would change the program's semantics and results, since FP arithmetic is not associative. This is why properly explicitly vectorized code can often show a speed-up larger than the number of vector lanes when compared with scalar code.
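A minimal sketch of the lowering described above (assuming SSE2, and ignoring alignment and the remainder loop): the f64x4 accumulator is split into two independent 128-bit accumulators, so the two f64x2 adds per iteration carry no dependency on each other and can issue on different ports.

```c
#include <emmintrin.h>  // SSE2

double sum_lowered(const double *x, long len) {
    __m128d s0 = _mm_setzero_pd();
    __m128d s1 = _mm_setzero_pd();
    for (long i = 0; i + 4 <= len; i += 4) {
        s0 = _mm_add_pd(s0, _mm_loadu_pd(x + i));      // lanes 0,1
        s1 = _mm_add_pd(s1, _mm_loadu_pd(x + i + 2));  // lanes 2,3
    }
    __m128d s = _mm_add_pd(s0, s1);  // combine the two accumulators
    double out[2];
    _mm_storeu_pd(out, s);
    return out[0] + out[1];          // horizontal reduction (tail elements omitted)
}
```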
This isn't rare, it is super common for memory-bound kernels, and it has a large performance impact. It happens when, by increasing the number of threads, the sub-problems that each one solves start fitting in faster levels of the memory hierarchy. E.g. if you have a CPU with 4 cores, and you get a 3x speed-up from going from 1 core to three, you often get a > 4x speed-up if, when going to 4 cores, each core does not have to access RAM anymore, but only L3, which is much faster.
This claim is incorrect. Many of the top implementations are scalar code (no explicit vectorization) that the compiler auto-vectorizes, and I do know that the people involved in some of them invested effort optimizing them. Again, what I pointed out is that the generated machine code uses 64x2 instructions, which isn't the same as whether the source code is explicitly vectorized to use those.
I think that you just confused yourself. Why? I don't know. If I had to guess, it feels like you didn't want to understand what I was saying. I mean, long after the discussion started, you claim "I don't know, and I have no way to know.", yet all the information you needed was right there, in the first comment; had you actually wanted to understand anything, you would have read it. You later on claim that you had no time to do so, but all the comments you have made since prove otherwise. |
Thanks for all the input on this thread. I'd like to interject with a friendly reminder that even though discussions can sometimes get heated, let's keep the conversation civil and productive. There are a few leads to concrete benchmarks on this thread; if there are any more that aren't already mentioned here, happy to hear about them. |
Fair enough. For those interested, I will continue the discussion here: https://gist.github.com/lemaitre/eb2c4e99849ff2f927c09041a7a45c46#gistcomment-2992018 So I will just put the results of my benchmark on Mandelbrot here for reference:
In both cases, the usage of f64x2 instructions gives a ~x2 speedup compared to not using them. The code of this benchmark is here: https://gist.github.com/lemaitre/eb2c4e99849ff2f927c09041a7a45c46 |
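For readers who don't open the gist, here is a rough sketch of the f64x2 pattern such a benchmark relies on (a simplification, not the gist's actual code): two pixels are iterated per vector, using the SSE2 instructions that f64x2.mul, f64x2.add, f64x2.sub and the f64x2 comparisons would lower to on x86.

```c
#include <emmintrin.h>  // SSE2

// Iterate the Mandelbrot recurrence for two points (cr, ci) in parallel.
// Returns a 2-bit mask of the lanes that never escaped within max_iter.
static int mandelbrot2(__m128d cr, __m128d ci, int max_iter) {
    __m128d zr = _mm_setzero_pd(), zi = _mm_setzero_pd();
    const __m128d four = _mm_set1_pd(4.0);
    int mask = 0;
    for (int i = 0; i < max_iter; ++i) {
        __m128d zr2 = _mm_mul_pd(zr, zr);
        __m128d zi2 = _mm_mul_pd(zi, zi);
        mask = _mm_movemask_pd(_mm_cmple_pd(_mm_add_pd(zr2, zi2), four));
        if (mask == 0) break;                                     // both points escaped
        zi = _mm_add_pd(_mm_mul_pd(_mm_add_pd(zr, zr), zi), ci);  // 2*zr*zi + ci
        zr = _mm_add_pd(_mm_sub_pd(zr2, zi2), cr);                // zr^2 - zi^2 + cr
    }
    return mask;
}
```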
Results look promising. It would be interesting to see how they would change if the code were run via WebAssembly in a web engine. I assume that even with 64x2 ops there would be some overhead from the reduced expressiveness of WebAssembly vs native code, but the question is how much. |
These are good results, thanks @lemaitre. It is useful from a spec standpoint to understand any significant overheads caused by limitations in 64x2 expressiveness so that we can address those. |
The results do look promising. We've had cases in the past where we've been able to make a case for including operations without actual engine performance data, if swapping out the code with SIMD intrinsics that map to the current operations still gives the same performance. It's less clear in this example what would map exactly to which operation we are trying to spec, so having an engine run this, or taking a closer look at the LLVM IR to evaluate which set of operations gets used, would be interesting information. |
I have some preliminary results to share here: https://github.com/ngzhian/simd-benchmarks |
@ngzhian good work! Only comment is about double operations (from README):
Does this mean that SIMD instructions were not produced for the loop? If so, how would you know that they have no effect? |
The auto-vectorizer cannot use 64x2 because that optimization is unsound (it would alter the program results - nan propagation doesn't really have much to do with this). If you change your program to:

```cpp
double sum(double arr[])
// enable WASM SIMD for the function:
__attribute__((target("simd128,unimplemented-simd128")))
{
    // Assume that arr is properly aligned - clang still
    // emits vector instructions without this, but the loads
    // have hints about potentially unaligned accesses
    double* p = (double*)__builtin_assume_aligned(arr, 16);
    double sum[2] = {0.0, 0.0};
    for (int i = 0; i < SIZE; i += 2) {
        sum[0] += p[i];
        sum[1] += p[i+1];
    }
    return ((sum[0] + sum[1])/SIZE);
}
```

then clang (godbolt) does emit vector instructions for the manually vectorized code:

```wasm
_Z3sumPd:                        # @_Z3sumPd
        f64.const 0x0p0
        f64x2.splat
        local.set 1
        i32.const 0
        local.set 2
        loop                     # label0:
        local.get 1
        local.get 0
        v128.load 0
        f64x2.add
        local.set 1
        local.get 0
        i32.const 16
        i32.add
        local.set 0
        local.get 2
        i32.const 2
        i32.add
        local.tee 2
        i32.const 32768
        i32.lt_u
        br_if 0                  # 0: up to label0
        end_loop
        local.get 1
        f64x2.extract_lane 0
        local.get 1
        f64x2.extract_lane 1
        f64.add
        f64.const 0x1p-15
        f64.mul
        end_function
```
@penzn Thanks!
Yes, I dumped the wasm file and did not see any f64x2 add instructions. @gnzlbg Great point, I altered double_average.cpp and updated the benchmarks with the results, please see https://github.com/ngzhian/simd-benchmarks thank you! |
@ngzhian excellent work, thank you! To my eye, 64x2 operations give pretty good speedup. |
Closing this as a subset of 64x2 operations is included in the proposal in #101. |
We discussed this at the 2019 CG meeting and decided to remove the i64x2/f64x2 instructions until there is performance data to justify their inclusion.
See https://github.com/WebAssembly/meetings/blob/master/2019/CG-06.md#simd--1-hr.