i64x2.gt_s, i64x2.lt_s, i64x2.ge_s, and i64x2.le_s instructions #412
Conversation
When re-introducing some 64x2 instructions in #101, the majority voted for the option which did not include comparisons. I did not find many use cases requiring those instructions (happy to be corrected). That vote was slightly biased since it only gave 3 options, but no one voiced strong opinions against omitting the comparison instructions. What has changed since then? This PR only includes signed instructions, so it seems like we only want a subset of instructions that lower reasonably well. What about the asymmetry of Wasm SIMD?
We discussed the issue of missing 64-bit forms in several meetings, and I have an action item to create PRs for missing instruction forms and document how they could lower to native ISAs. The primary concern with missing 64-bit forms is the resulting non-orthogonality of the WebAssembly SIMD instruction set. I grouped similar instructions into one PR, but there will be more PRs coming for other instructions.
The discussion from the meeting was to not include the instructions solely based on the symmetry of the instruction set - as @ngzhian pointed out, this was previously discussed and the community voted against including a majority of these unless there was a compelling reason to do so. To avoid redoing work that we've already done before: are there code examples where these are particularly useful? IIRC, @ngzhian evaluated several benchmarks, and only included the ones that would lower efficiently and were being used in real-world use cases. For the more commonly used 32x4, 16x8, and 8x16 types I agree with the symmetry argument - but for 64x2 types I'm not sure that it holds. My takeaway from the meeting was that if we want to revisit this discussion, it should be based on use cases in code. If there are efficient lowerings across architectures, and we can prove that they will be useful in real-world code bases, then we can include them, but I would lean against opcode-space bloat for 64x2 operations.
Special case not explicitly covered above: If the JIT can detect that the two operands are identical, it should always call the cheapest method for returning zero regardless of whether or not the underlying instruction is used. (corrected to be accurate for cmpgt) |
I think some of the driving motivation here is that it really looks like the instruction set will be finalized within a meeting or two. If that's the case, do we want to push forward a standard that doesn't have ordering operators on 64 bit? I don't think unsigned operations were left out intentionally. |
Thanks for explaining your concerns. Added links to applications. As for opcode bloat, I don't think it would be a concern, as most 64-bit ops (although not the compare instructions) got reserved opcodes during the last renumbering.
Most of them have already been taken up by other prototype ops; with these instructions I think we will most certainly exceed 256 instructions.
Even so, why is it a concern?
I don't necessarily think instructions in this PR are expensive (though others in the same subfamily would be; also I agree that we have made a decision not to include those), but I want to make a point about application examples. Sorry, I've been a broken record on this lately, and I should probably stop :) On a semi-personal note, I am quite familiar with the "legacy" flang compiler (the first example), and it is very unlikely to be targeting wasm at any point, at the very least because it is going to be deprecated in favor of the similarly-named project already in the LLVM source tree. However, I am not sure how similar functionality would be implemented then. Going further down the list of examples, some of those are obviously AVX, so while it is technically possible to port them, for this proposal in its current state it would mean going from AVX to SSE, which might not have enough parallelism. On the other hand, AVX examples would be very important for any "bigger" SIMD proposals, like flexible vectors. It is possible to find examples of intrinsics' use by doing a GitHub search (example), but by themselves those are not wasm examples yet - there might be other issues getting in the way of getting good performance for those, and they might be impossible to port efficiently for reasons other than SIMD. Probably more importantly, why do we need to chase exact instruction sequences - I thought there was a consensus that adding something for symmetry or to match native is not a goal.
I would agree with @dtig's comment above and would extend it a bit: if we were to add these signed comparisons but not #414 (unsigned comparisons, which have worse lowerings on x86), wouldn't this be making the spec less orthogonal? Having a patchwork of instructions that directly map to the supported ISAs would actually make sense to me if we wanted to be consistent about it, but that seems to contradict the orthogonality intent of a bunch of these.
I remember that historically we have been somewhat skeptical about 64x2 instructions, as they handle just two elements, and the cost can add up quickly when lowering is not very good. I think it would be great to get perf data for non-trivial lowerings. I am at least somewhat comfortable with merging in "non-orthogonal" fashion, though :) This applies to #414 as well.
Today's meeting raised concerns about non-wrapper use-cases for these instructions, so I'd like to point out some examples:
I don't think we should do a blanket search for intrinsics and call them use cases. E.g. the pytorch example is in a To put it another way, most (if not all) other instructions proposed have strong use cases in that XNNPACK will use them once those instructions are standardized, and XNNPACK benchmarks indicate performance improvements. This list of examples does not meet the same bar.
x86 SIMD extensions are so fragmented that many projects optimize only for a few of them. E.g. PyTorch has explicit vectorization only for AVX2 (thus
PyTorch vector primitives have a port to POWER VSX, which is a 128-bit SIMD extension similar to WAsm SIMD. E.g. here's the signed 64-bit comparison for greater-than. I don't see why PyTorch couldn't use equivalent WAsm SIMD instructions if they were available.
That's not our only goal here. Portable performance is important too. It would be frustrating for developers to discover that their code is slow for some subset of users, and learn that signed comparisons are much slower on older hardware. It's harder in this case since they won't be able to easily tell the slow group from the fast group, besides some sort of timing detection.
I am not saying we shouldn't, I am saying SIMD v1 doesn't need to. This SIMD proposal is not the end, it's only the beginning.
They could. If PyTorch is targeting Wasm SIMD, and missing these instructions will hinder their work, then it's a stronger argument. Your work on instructions like load/store lane, benchmarking on XNNPACK, shows obvious wins for inclusion, and is a real-world, immediately relevant use case. The rest of the examples given here, less so.
I agree in principle, but it is important to quantify which subset of users might experience poor performance. The proposed instructions lower to 1-3 instructions on x86 with SSE4.2, ARM with NEON, and ARM64. The only problematic cases are on x86 CPUs which support SSE4.1, but not SSE4.2. These are Intel Core 2 Duo/Quad CPUs on the 45 nm process (earlier Intel Core 2 processors on 65 nm didn't support SSE4.1, and later Nehalem-generation processors support SSE4.2). From the Wikipedia list the latest among these processors is the Core 2 Quad Q9500, released in January 2010, and long discontinued. Is this processor a good fit for WebAssembly SIMD? I doubt it: per Agner Fog, on these processors unaligned loads are internally decomposed into 4 microoperations and unaligned stores into 9 microoperations, so any SIMD code would likely perform worse than scalar (unless the code never uses full 128-bit SIMD loads/stores). So, in general, I agree that portable performance is important and should be our goal. However, IMO performance portability to decade-old processors that are not suited for WebAssembly SIMD anyway does not weigh much toward this goal.
Parts of PyTorch were ported to Emscripten. SIMD is not there yet, though.
A small correction: Highway targets SSE4.1 and others, but not SSE2, indeed because there are way too many combinations. I agree with @ngzhian that performance portability is important, but also with @Maratyszcza that <SSE4.2 is getting quite old and less relevant. My opinion on orthogonality has shifted a bit recently: I wanted to detect a sign change and, if so, flip all bits - also for i64. Without i64.shr nor i64.gt_s nor signselect that would have been difficult :) Now 6 ops does look a bit scary from the point of performance cliffs, but the alternative of going scalar also includes a store plus either a store-to-load forwarding stall (from storing one i64 half then loading i64x2) or pinsrq (2 cycles, surprisingly enough). Thus adding at least gt_s seems reasonably efficient, and does allow us and apps to take advantage of newer instruction sets.
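For what it's worth, the "detect a sign change and flip all bits" pattern can be sketched per lane in scalar C. This is our own illustrative reading of the use case, not code from the Highway library; the function name and usage are assumptions.

```c
#include <stdint.h>
#include <assert.h>

/* Illustrative sketch (our assumption of the use case described above):
 * if cur's sign differs from prev's, flip all bits of cur. In SIMD form,
 * i64x2.gt_s or an arithmetic shift would produce the lanewise mask.
 * Assumes arithmetic right shift of negative values, as on mainstream
 * compilers. */
static int64_t flip_if_sign_changed(int64_t prev, int64_t cur) {
    int64_t mask = (prev ^ cur) >> 63; /* -1 if signs differ, else 0 */
    return cur ^ mask;                 /* XOR with all-ones flips every bit */
}
```

The mask is exactly what the proposed comparison instructions produce, which is why the scalar fallback (store halves, reload, compare) is so much clumsier than a native lanewise mask.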
Thanks for the detailed breakdown. I recall @lars-t-hansen mentioned some metrics he saw regarding SSE4.1, it was in the lower percentages. I looked at Chrome's numbers, ~10% of clients don't have at least SSE4.2. This number will surely go down with time, but it's not insignificant.
We have i64.shr. You need all 3 of the above?
If we add gt_s, lt_s comes for free, since we can swap the operands. So that will make this group look slightly more complete.
Adding a preliminary vote for the inclusion of i64x2 signed comparison operations to the SIMD proposal below. Please vote with:
- 👍 For including i64x2 signed comparison operations
IMO, PyTorch is one of those examples that has much larger issues than SIMD support, as Python is not really running in Wasm today. Libraries like PyTorch, NumPy, and the Flang RTL would at least require their "entry" language to be compiled to Wasm (NumPy stands out as it also requires Fortran). That's why I personally don't think those are valid examples of apps by themselves - code running on top of them would be, which makes using them for our purposes even more far-fetched.
@penzn I ported CPython to Emscripten even |
No doubt it is possible to compile it, but what about performance - what does it use for GC and how well does that work?
CPython objects are reference-counted |
These instructions were merged in WebAssembly#412. The binary opcodes are temporary, they will be fixed up when we finalize the opcodes.
Previously \ishape.\virelop was included, but that is incorrect, as we don't have i64x2 unsigned comparisons. i64x2 signed comparisons were added in #412.
This is a partial revert of https://crrev.com/c/2457669/. This change is slightly longer (in code-generator-x64.cc) because we also implement support when SSE4_2 is not supported (the reverted change seems to assume SSE4_2, which is not always the case). This code sequence is from WebAssembly/simd#412. Bug: v8:11415 Change-Id: I3eef415667b4142887cf1c449d27d19ba5bbd208 Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2683219 Commit-Queue: Zhi An Ng <[email protected]> Reviewed-by: Bill Budge <[email protected]> Cr-Commit-Position: refs/heads/master@{#72611}
Introduction
This is a proposal to add 64-bit variants of the existing `gt_s`, `lt_s`, `ge_s`, and `le_s` instructions. ARM64 and x86 (since SSE4.2) natively support the `i64x2.gt_s` instruction, and on ARMv7 it can be efficiently emulated with 3-4 NEON instructions. The `i64x2.lt_s` instruction is equivalent to `i64x2.gt_s` with the order of input operands reversed. `i64x2.le_s` and `i64x2.ge_s` are equivalent to a binary NOT applied to the results of `i64x2.gt_s` and `i64x2.lt_s`, respectively.
Applications
Mapping to Common Instruction Sets
This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. However, these patterns are provided only for convenience; compliant WebAssembly implementations do not have to follow the same code generation patterns.
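Before the per-ISA lowerings, the intended per-lane semantics can be stated as a short scalar reference. The helper names below are ours (not part of the proposal); true lanes produce the all-ones mask (-1), false lanes produce 0, and the equivalences from the introduction (swap for `lt_s`, bitwise NOT for `ge_s`/`le_s`) fall out directly.

```c
#include <stdint.h>
#include <assert.h>

/* Scalar reference for one lane of the proposed i64x2 comparisons.
 * SIMD comparisons return masks: all-ones (-1) for true, 0 for false. */
static int64_t i64_gt_s(int64_t a, int64_t b) { return a > b ? -1 : 0; }

/* lt_s is gt_s with operands swapped; ge_s/le_s are bitwise NOT of lt_s/gt_s. */
static int64_t i64_lt_s(int64_t a, int64_t b) { return i64_gt_s(b, a); }
static int64_t i64_ge_s(int64_t a, int64_t b) { return ~i64_lt_s(a, b); }
static int64_t i64_le_s(int64_t a, int64_t b) { return ~i64_gt_s(a, b); }
```

An engine therefore only needs a good lowering for `gt_s`; the other three ops cost at most an operand swap and a vector NOT on top of it.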
x86/x86-64 processors with AVX512F and AVX512VL instruction sets
- `y = i64x2.ge_s(a, b)` is lowered to:
  - `VPCMPGTQ xmm_y, xmm_b, xmm_a`
  - `VPTERNLOGQ xmm_y, xmm_y, xmm_y, 0x55`
- `y = i64x2.le_s(a, b)` is lowered to:
  - `VPCMPGTQ xmm_y, xmm_a, xmm_b`
  - `VPTERNLOGQ xmm_y, xmm_y, xmm_y, 0x55`
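The `VPTERNLOGQ` step above works because the instruction evaluates an arbitrary three-input boolean function chosen by its immediate, and with all three operands equal and immediate `0x55` it degenerates to a one-instruction bitwise NOT of the compare result. A scalar model of the per-bit truth-table lookup (the helper name is ours, not an intrinsic):

```c
#include <stdint.h>
#include <assert.h>

/* Scalar model of VPTERNLOGQ's per-bit truth-table lookup: for each bit
 * position, the immediate's bit at index (a<<2 | b<<1 | c) is the result. */
static uint64_t ternlog64(uint64_t a, uint64_t b, uint64_t c, uint8_t imm8) {
    uint64_t r = 0;
    for (int i = 0; i < 64; i++) {
        unsigned idx = (unsigned)((((a >> i) & 1) << 2) |
                                  (((b >> i) & 1) << 1) |
                                   ((c >> i) & 1));
        r |= (uint64_t)((imm8 >> idx) & 1) << i;
    }
    return r;
}
```

`0x55` (binary `01010101`) has a 1 exactly where the third selector bit is 0, i.e. it computes NOT of the third operand; since all three operands are `xmm_y` in the lowering above, the compare result is negated in place.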
x86/x86-64 processors with XOP instruction set
- `y = i64x2.ge_s(a, b)` is lowered to `VPCOMGEQ xmm_y, xmm_a, xmm_b`
- `y = i64x2.le_s(a, b)` is lowered to `VPCOMLEQ xmm_y, xmm_a, xmm_b`
x86/x86-64 processors with AVX instruction set
- `y = i64x2.gt_s(a, b)` is lowered to `VPCMPGTQ xmm_y, xmm_a, xmm_b`
- `y = i64x2.lt_s(a, b)` is lowered to `VPCMPGTQ xmm_y, xmm_b, xmm_a`
- `y = i64x2.ge_s(a, b)` is lowered to:
  - `VPCMPGTQ xmm_y, xmm_b, xmm_a`
  - `VPXOR xmm_y, xmm_y, [wasm_i64x2_splat(-1)]`
- `y = i64x2.le_s(a, b)` is lowered to:
  - `VPCMPGTQ xmm_y, xmm_a, xmm_b`
  - `VPXOR xmm_y, xmm_y, [wasm_i64x2_splat(-1)]`
x86/x86-64 processors with SSE4.2 instruction set
- `y = i64x2.gt_s(a, b)` (`y` is not `b`) is lowered to `MOVDQA xmm_y, xmm_a` + `PCMPGTQ xmm_y, xmm_b`
- `y = i64x2.lt_s(a, b)` (`y` is not `a`) is lowered to `MOVDQA xmm_y, xmm_b` + `PCMPGTQ xmm_y, xmm_a`
- `y = i64x2.ge_s(a, b)` (`y` is not `a`) is lowered to:
  - `MOVDQA xmm_y, xmm_b`
  - `PCMPGTQ xmm_y, xmm_a`
  - `PXOR xmm_y, [wasm_i64x2_splat(-1)]`
- `y = i64x2.le_s(a, b)` (`y` is not `b`) is lowered to:
  - `MOVDQA xmm_y, xmm_a`
  - `PCMPGTQ xmm_y, xmm_b`
  - `PXOR xmm_y, [wasm_i64x2_splat(-1)]`
x86/x86-64 processors with SSE2 instruction set
Based on this answer by user aqrit on Stack Overflow
- `y = i64x2.gt_s(a, b)` (`y` is not `a` and `y` is not `b`) is lowered to:
  - `MOVDQA xmm_y, xmm_b`
  - `MOVDQA xmm_tmp, xmm_a`
  - `PSUBQ xmm_y, xmm_a`
  - `PCMPEQD xmm_tmp, xmm_b`
  - `PAND xmm_y, xmm_tmp`
  - `MOVDQA xmm_tmp, xmm_a`
  - `PCMPGTD xmm_tmp, xmm_b`
  - `POR xmm_y, xmm_tmp`
  - `PSHUFD xmm_y, xmm_y, 0xF5`
- `y = i64x2.lt_s(a, b)` (`y` is not `a` and `y` is not `b`) is lowered to:
  - `MOVDQA xmm_y, xmm_a`
  - `MOVDQA xmm_tmp, xmm_b`
  - `PSUBQ xmm_y, xmm_b`
  - `PCMPEQD xmm_tmp, xmm_a`
  - `PAND xmm_y, xmm_tmp`
  - `MOVDQA xmm_tmp, xmm_b`
  - `PCMPGTD xmm_tmp, xmm_a`
  - `POR xmm_y, xmm_tmp`
  - `PSHUFD xmm_y, xmm_y, 0xF5`
- `y = i64x2.ge_s(a, b)` (`y` is not `a` and `y` is not `b`) is lowered to:
  - `MOVDQA xmm_y, xmm_a`
  - `MOVDQA xmm_tmp, xmm_b`
  - `PSUBQ xmm_y, xmm_b`
  - `PCMPEQD xmm_tmp, xmm_a`
  - `PAND xmm_y, xmm_tmp`
  - `MOVDQA xmm_tmp, xmm_b`
  - `PCMPGTD xmm_tmp, xmm_a`
  - `POR xmm_y, xmm_tmp`
  - `PSHUFD xmm_y, xmm_y, 0xF5`
  - `PXOR xmm_y, [wasm_i64x2_splat(-1)]`
- `y = i64x2.le_s(a, b)` (`y` is not `a` and `y` is not `b`) is lowered to:
  - `MOVDQA xmm_y, xmm_b`
  - `MOVDQA xmm_tmp, xmm_a`
  - `PSUBQ xmm_y, xmm_a`
  - `PCMPEQD xmm_tmp, xmm_b`
  - `PAND xmm_y, xmm_tmp`
  - `MOVDQA xmm_tmp, xmm_a`
  - `PCMPGTD xmm_tmp, xmm_b`
  - `POR xmm_y, xmm_tmp`
  - `PSHUFD xmm_y, xmm_y, 0xF5`
  - `PXOR xmm_y, [wasm_i64x2_splat(-1)]`
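As a sanity check on the SSE2 trick, here is a per-lane scalar model of it (the helper name is ours): a 64-bit subtract plus 32-bit equality and signed greater-than compares on the high dwords, with the final `PSHUFD` broadcast modeled by duplicating the combined dword.

```c
#include <stdint.h>
#include <assert.h>

/* Scalar model of the SSE2 i64x2.gt_s sequence above, one lane at a time. */
static int64_t gt_s_sse2_model(int64_t a, int64_t b) {
    uint32_t a_hi   = (uint32_t)((uint64_t)a >> 32);
    uint32_t b_hi   = (uint32_t)((uint64_t)b >> 32);
    uint32_t diff_hi = (uint32_t)(((uint64_t)b - (uint64_t)a) >> 32); /* PSUBQ   */
    uint32_t eq_hi   = (a_hi == b_hi) ? 0xFFFFFFFFu : 0;              /* PCMPEQD */
    uint32_t gt_hi   = ((int32_t)a_hi > (int32_t)b_hi) ? 0xFFFFFFFFu : 0; /* PCMPGTD */
    uint32_t r = (diff_hi & eq_hi) | gt_hi;                           /* PAND + POR */
    /* PSHUFD 0xF5 broadcasts the high dword into both halves of the lane. */
    return (int64_t)(((uint64_t)r << 32) | r);
}
```

The mask comes out canonical without any extra shift: when the high dwords are equal, `b - a` fits in (-2^32, 2^32), so its high dword is already all-ones (difference negative, i.e. `a > b`) or all-zeros; otherwise the signed 32-bit compare on the high dwords decides.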
ARM64 processors
- `y = i64x2.gt_s(a, b)` is lowered to `CMGT Vy.2D, Va.2D, Vb.2D`
- `y = i64x2.lt_s(a, b)` is lowered to `CMGT Vy.2D, Vb.2D, Va.2D`
- `y = i64x2.ge_s(a, b)` is lowered to `CMGE Vy.2D, Va.2D, Vb.2D`
- `y = i64x2.le_s(a, b)` is lowered to `CMGE Vy.2D, Vb.2D, Va.2D`
ARMv7 processors with NEON instruction set
Based on this answer by user aqrit on Stack Overflow
- `y = i64x2.gt_s(a, b)` is lowered to:
  - `VQSUB.S64 Qy, Qb, Qa`
  - `VSHR.S64 Qy, Qy, #63`
- `y = i64x2.lt_s(a, b)` is lowered to:
  - `VQSUB.S64 Qy, Qa, Qb`
  - `VSHR.S64 Qy, Qy, #63`
- `y = i64x2.ge_s(a, b)` is lowered to:
  - `VQSUB.S64 Qy, Qa, Qb`
  - `VSHR.S64 Qy, Qy, #63`
  - `VMVN Qy, Qy`
- `y = i64x2.le_s(a, b)` is lowered to:
  - `VQSUB.S64 Qy, Qb, Qa`
  - `VSHR.S64 Qy, Qy, #63`
  - `VMVN Qy, Qy`
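The NEON trick can likewise be modeled per lane: the saturating subtract (`VQSUB.S64`) keeps `b - a` from wrapping, so its sign bit alone decides `a > b`, and the arithmetic shift by 63 (`VSHR.S64`) smears that bit into a full mask. A scalar sketch (helper names ours; assumes arithmetic right shift of negative values, as on mainstream compilers):

```c
#include <stdint.h>
#include <assert.h>

/* Signed 64-bit saturating subtract, as VQSUB.S64 does per lane. */
static int64_t sat_sub_s64(int64_t b, int64_t a) {
    uint64_t r = (uint64_t)b - (uint64_t)a;
    /* Overflow iff operands differ in sign and the result's sign differs
     * from the minuend's; saturate toward the minuend's sign. */
    if (((b ^ a) & (b ^ (int64_t)r)) < 0)
        return b < 0 ? INT64_MIN : INT64_MAX;
    return (int64_t)r;
}

/* Scalar model of the NEON i64x2.gt_s sequence above. */
static int64_t gt_s_neon_model(int64_t a, int64_t b) {
    return sat_sub_s64(b, a) >> 63; /* all-ones iff b - a is negative, i.e. a > b */
}
```

Without saturation, `b - a` could wrap and report the wrong sign (e.g. `b = INT64_MIN`, `a = 1`); saturation pins the result at `INT64_MIN`/`INT64_MAX`, preserving the correct sign bit in every case.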