i64x2.gt_s, i64x2.lt_s, i64x2.ge_s, and i64x2.le_s instructions #412
Conversation
When re-introducing some 64x2 instructions in #101, the majority voted for the option which did not include comparisons. I did not find many use cases requiring those instructions (happy to be corrected). That vote was slightly biased since it only gave 3 options, but no one voiced strong opinions against omitting the comparison instructions. What has changed since then? This PR only includes signed instructions, so it seems like we only want a subset of instructions that lower reasonably well. What about the asymmetry of Wasm SIMD?
We discussed the issue of missing 64-bit forms in several meetings, and I have an action item to create PRs for missing instruction forms and document how they could lower to native ISAs. The primary concern with missing 64-bit forms is the resulting non-orthogonality of the WebAssembly SIMD instruction set. I grouped similar instructions into one PR, but there will be more PRs coming for other instructions.
The discussion from the meeting was to not include the instructions solely based on the symmetry of the instruction set - as @ngzhian pointed out, this was previously discussed and the community voted against including a majority of these unless there was a compelling reason to do so. To avoid redoing work that we've already done before: are there code examples where these are particularly useful? IIRC, @ngzhian evaluated several benchmarks, and only included the ones that would lower efficiently and were being used in real-world use cases. For the more commonly used 32x4, 16x8, and 8x16 types I agree with the symmetry argument - but for 64x2 types I'm not sure that it holds. My takeaway from the meeting was that if we want to revisit this discussion, it should be based on use cases in code. If there are efficient lowerings across architectures, and we can prove that they will be useful in real-world code bases, then we can include them, but I would lean against opcode-space bloat for 64x2 operations.
Special case not explicitly covered above: If the JIT can detect that the two operands are identical, it should always call the cheapest method for returning zero regardless of whether or not the underlying instruction is used. (corrected to be accurate for cmpgt) |
I think some of the driving motivation here is that it really looks like the instruction set will be finalized within a meeting or two. If that's the case, do we want to push forward a standard that doesn't have ordering operators on 64 bit? I don't think unsigned operations were left out intentionally. |
Thanks for explaining your concerns. Added links to applications. As for opcode bloat, I don't think it would be a concern, as most 64-bit ops (although not the compare instructions) got reserved opcodes during the last renumbering.
Most of them have already been taken up by other prototype ops; with these instructions I think we will most certainly exceed 256 instructions.
Even so, why is it a concern?
I don't necessarily think instructions in this PR are expensive (though others in the same subfamily would be; also I agree that we have made a decision not to include those), but I want to make a point about application examples. Sorry, I've been a broken record on this lately, and I should probably stop :) On a semi-personal note, I am quite familiar with the "legacy" flang compiler (the first example), and it is very unlikely to be targeting wasm at any point, at the very least because it is going to be deprecated in favor of the similarly-named project already in the LLVM source tree. However, I am not sure how similar functionality would be implemented then. Going further down the list of examples, some of those are obviously AVX, so while it is technically possible to port them, for this proposal in its current state it would mean going from AVX to SSE, which might not have enough parallelism. On the other hand, AVX examples would be very important for any "bigger" SIMD proposals, like flexible vectors. It is possible to find examples of intrinsics' use by doing a GitHub search (example), but by themselves those are not wasm examples yet - there might be other issues getting in the way of getting good performance for those, and they might be impossible to port efficiently for reasons other than SIMD. Probably more importantly, why do we need to chase exact instruction sequences - I thought there was a consensus that adding something for symmetry or to match native is not a goal.
I would agree with @dtig's comment above and would extend it a bit: if we were to add these signed comparisons but not #414 (unsigned comparisons, which have worse lowerings on x86), wouldn't this be making the spec less orthogonal? Having a patchwork of instructions that directly map to the supported ISAs would actually make sense to me if we wanted to be consistent about it, but that seems to contradict the orthogonality intent of a bunch of these.
I remember that historically we have been somewhat skeptical about 64x2 instructions, as they handle just two elements, and the cost can add up quickly when lowering is not very good. I think it would be great to get perf data for non-trivial lowerings. I am at least somewhat comfortable with merging in "non-orthogonal" fashion, though :) This applies to #414 as well.
Today's meeting raised concerns about non-wrapper use-cases for these instructions, so I'd like to point out some examples:
I don't think we should do a blanket search for intrinsics and call them use cases. E.g. the pytorch example is in a To put it another way, most (if not all) other instructions proposed have strong use cases in that XNNPACK will use them once those instructions are standardized, and XNNPACK benchmarks indicate performance improvements. This list of examples does not meet the same bar.
x86 SIMD extensions are so fragmented that many projects optimize only for a few of them. E.g. PyTorch has explicit vectorization only for AVX2 (thus
PyTorch vector primitives have a port to POWER VSX, which is a 128-bit SIMD extension similar to WAsm SIMD. E.g. here's the signed 64-bit comparison for greater-than. I don't see why PyTorch couldn't use equivalent WAsm SIMD instructions if they were available.
That's not our only goal here. Portable performance is important too. It would be frustrating for developers to discover that their code is slow for some subset of users, and learn that signed comparisons are much slower on older hardware. It's harder in this case since they won't be able to easily tell the slow group from the fast group, besides some sort of timing detection.
I am not saying we shouldn't, I am saying SIMD v1 doesn't need to. This SIMD proposal is not the end, it's only the beginning.
They could. If PyTorch is targeting Wasm SIMD, and missing these instructions will hinder their work, then it's a stronger argument. Your work on instructions like load/store lane, benchmarking on XNNPACK, shows obvious wins for inclusion, and is a real-world, immediately relevant use case. The rest of the examples given here, less so.
I agree in principle, but it is important to quantify which subset of users might experience poor performance. The proposed instructions lower to 1-3 instructions on x86 with SSE4.2, ARM with NEON, and ARM64. The only problematic cases are on x86 CPUs which support SSE4.1, but not SSE4.2. These are Intel Core 2 Duo/Quad CPUs on the 45 nm process (earlier Intel Core 2 processors on 65 nm didn't support SSE4.1, and later Nehalem-generation processors support SSE4.2). From the Wikipedia list the latest among these processors is the Core 2 Quad Q9500, released in January 2010, and long discontinued. Is this processor a good fit for WebAssembly SIMD? I doubt it: per Agner Fog, on these processors unaligned loads are internally decomposed into 4 microoperations and unaligned stores into 9 microoperations, so any SIMD code would likely perform worse than scalar (unless the code never uses full 128-bit SIMD loads/stores). So, in general, I agree that portable performance is important and should be our goal. However, IMO performance portability to decade-old processors that are not suited for WebAssembly SIMD anyway does not weigh much toward this goal.
Parts of PyTorch were ported to Emscripten. SIMD is not there yet, though.
A small correction: Highway targets SSE4.1 and others, but not SSE2, indeed because there are way too many combinations. I agree with @ngzhian that performance portability is important, but also with @Maratyszcza that <SSE4.2 is getting quite old and less relevant. My opinion on orthogonality has shifted a bit recently: I wanted to detect a sign change and, if so, flip all bits - also for i64. Without i64.shr nor i64.gt_s nor signselect that would have been difficult :) Now 6 ops does look a bit scary from the point of performance cliffs, but the alternative of going scalar also includes a store plus either a store-to-load forwarding stall (from storing one i64 half then loading i64x2) or pinsrq (2 cycles, surprisingly enough). Thus adding at least gt_s seems reasonably efficient, and does allow us and apps to take advantage of newer instruction sets.
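For what it's worth, the "detect a sign change and flip all bits" pattern can be sketched per lane in scalar C. This is our own illustrative reading of the use case, not code from the Highway library; the function name and usage are assumptions.

```c
#include <stdint.h>
#include <assert.h>

/* Illustrative sketch (our assumption of the use case described above):
 * if cur's sign differs from prev's, flip all bits of cur. In SIMD form,
 * i64x2.gt_s or an arithmetic shift would produce the lanewise mask.
 * Assumes arithmetic right shift of negative values, as on mainstream
 * compilers. */
static int64_t flip_if_sign_changed(int64_t prev, int64_t cur) {
    int64_t mask = (prev ^ cur) >> 63; /* -1 if signs differ, else 0 */
    return cur ^ mask;                 /* XOR with all-ones flips every bit */
}
```

The mask is exactly what the proposed comparison instructions produce, which is why the scalar fallback (store halves, reload, compare) is so much clumsier than a native lanewise mask.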
Thanks for the detailed breakdown. I recall @lars-t-hansen mentioned some metrics he saw regarding SSE4.1, it was in the lower percentages. I looked at Chrome's numbers, ~10% of clients don't have at least SSE4.2. This number will surely go down with time, but it's not insignificant.
We have i64.shr. You need all 3 of the above?
If we add gt_s, lt_s comes for free, since we can swap the operands. So that will make this group look slightly more complete.
Adding a preliminary vote for the inclusion of i64x2 signed comparison operations to the SIMD proposal below. Please vote with:
- 👍 For including i64x2 signed comparison operations
IMO, PyTorch is one of those examples that has much larger issues than SIMD support, as Python is not really running in Wasm today. Libraries like PyTorch, NumPy, and the Flang RTL would at least require their "entry" language to be compiled to Wasm (NumPy stands out as it also requires Fortran). That's why I personally don't think those are valid examples of apps by themselves - code running on top of them would be, which makes using them for our purposes even more far-fetched.
@penzn I ported CPython to Emscripten even |
No doubt it is possible to compile it, but what about performance - what does it use for GC and how well does that work?
CPython objects are reference-counted |
These instructions were merged in WebAssembly#412. The binary opcodes are temporary, they will be fixed up when we finalize the opcodes.
Previously \ishape.\virelop was included, but that is incorrect, as we don't have i64x2 unsigned comparisons. i64x2 signed comparisons were added in #412.
This is a partial revert of https://crrev.com/c/2457669/. This change is slightly longer (in code-generator-x64.cc) because we also implement support when SSE4_2 is not supported (the reverted change seems to assume SSE4_2, which is not always the case). This code sequence is from WebAssembly/simd#412. Bug: v8:11415 Change-Id: I3eef415667b4142887cf1c449d27d19ba5bbd208 Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2683219 Commit-Queue: Zhi An Ng <[email protected]> Reviewed-by: Bill Budge <[email protected]> Cr-Commit-Position: refs/heads/master@{#72611}
Introduction
This is a proposal to add 64-bit variants of the existing `gt_s`, `lt_s`, `ge_s`, and `le_s` instructions. ARM64 and x86 (since SSE4.2) natively support the `i64x2.gt_s` instruction, and on ARMv7 it can be efficiently emulated with 3-4 NEON instructions. The `i64x2.lt_s` instruction is equivalent to `i64x2.gt_s` with the order of input operands reversed. `i64x2.le_s` and `i64x2.ge_s` are equivalent to a binary NOT applied to the results of `i64x2.gt_s` and `i64x2.lt_s`, respectively.
Applications
Mapping to Common Instruction Sets
This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. However, these patterns are provided only for convenience; compliant WebAssembly implementations do not have to follow the same code generation patterns.
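Before the per-ISA lowerings, the intended per-lane semantics can be stated as a short scalar reference. The helper names below are ours (not part of the proposal); true lanes produce the all-ones mask (-1), false lanes produce 0, and the equivalences from the introduction (swap for `lt_s`, bitwise NOT for `ge_s`/`le_s`) fall out directly.

```c
#include <stdint.h>
#include <assert.h>

/* Scalar reference for one lane of the proposed i64x2 comparisons.
 * SIMD comparisons return masks: all-ones (-1) for true, 0 for false. */
static int64_t i64_gt_s(int64_t a, int64_t b) { return a > b ? -1 : 0; }

/* lt_s is gt_s with operands swapped; ge_s/le_s are bitwise NOT of lt_s/gt_s. */
static int64_t i64_lt_s(int64_t a, int64_t b) { return i64_gt_s(b, a); }
static int64_t i64_ge_s(int64_t a, int64_t b) { return ~i64_lt_s(a, b); }
static int64_t i64_le_s(int64_t a, int64_t b) { return ~i64_gt_s(a, b); }
```

An engine therefore only needs a good lowering for `gt_s`; the other three ops cost at most an operand swap and a vector NOT on top of it.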
x86/x86-64 processors with AVX512F and AVX512VL instruction sets
- `y = i64x2.ge_s(a, b)` is lowered to:
  - `VPCMPGTQ xmm_y, xmm_b, xmm_a`
  - `VPTERNLOGQ xmm_y, xmm_y, xmm_y, 0x55`
- `y = i64x2.le_s(a, b)` is lowered to:
  - `VPCMPGTQ xmm_y, xmm_a, xmm_b`
  - `VPTERNLOGQ xmm_y, xmm_y, xmm_y, 0x55`
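The `VPTERNLOGQ` step above works because the instruction evaluates an arbitrary three-input boolean function chosen by its immediate, and with all three operands equal and immediate `0x55` it degenerates to a one-instruction bitwise NOT of the compare result. A scalar model of the per-bit truth-table lookup (the helper name is ours, not an intrinsic):

```c
#include <stdint.h>
#include <assert.h>

/* Scalar model of VPTERNLOGQ's per-bit truth-table lookup: for each bit
 * position, the immediate's bit at index (a<<2 | b<<1 | c) is the result. */
static uint64_t ternlog64(uint64_t a, uint64_t b, uint64_t c, uint8_t imm8) {
    uint64_t r = 0;
    for (int i = 0; i < 64; i++) {
        unsigned idx = (unsigned)((((a >> i) & 1) << 2) |
                                  (((b >> i) & 1) << 1) |
                                   ((c >> i) & 1));
        r |= (uint64_t)((imm8 >> idx) & 1) << i;
    }
    return r;
}
```

`0x55` (binary `01010101`) has a 1 exactly where the third selector bit is 0, i.e. it computes NOT of the third operand; since all three operands are `xmm_y` in the lowering above, the compare result is negated in place.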
x86/x86-64 processors with XOP instruction set
- `y = i64x2.ge_s(a, b)` is lowered to `VPCOMGEQ xmm_y, xmm_a, xmm_b`
- `y = i64x2.le_s(a, b)` is lowered to `VPCOMLEQ xmm_y, xmm_a, xmm_b`
x86/x86-64 processors with AVX instruction set
- `y = i64x2.gt_s(a, b)` is lowered to `VPCMPGTQ xmm_y, xmm_a, xmm_b`
- `y = i64x2.lt_s(a, b)` is lowered to `VPCMPGTQ xmm_y, xmm_b, xmm_a`
- `y = i64x2.ge_s(a, b)` is lowered to:
  - `VPCMPGTQ xmm_y, xmm_b, xmm_a`
  - `VPXOR xmm_y, xmm_y, [wasm_i64x2_splat(-1)]`
- `y = i64x2.le_s(a, b)` is lowered to:
  - `VPCMPGTQ xmm_y, xmm_a, xmm_b`
  - `VPXOR xmm_y, xmm_y, [wasm_i64x2_splat(-1)]`
x86/x86-64 processors with SSE4.2 instruction set
- `y = i64x2.gt_s(a, b)` (`y` is not `b`) is lowered to `MOVDQA xmm_y, xmm_a` + `PCMPGTQ xmm_y, xmm_b`
- `y = i64x2.lt_s(a, b)` (`y` is not `a`) is lowered to `MOVDQA xmm_y, xmm_b` + `PCMPGTQ xmm_y, xmm_a`
- `y = i64x2.ge_s(a, b)` (`y` is not `a`) is lowered to:
  - `MOVDQA xmm_y, xmm_b`
  - `PCMPGTQ xmm_y, xmm_a`
  - `PXOR xmm_y, [wasm_i64x2_splat(-1)]`
- `y = i64x2.le_s(a, b)` (`y` is not `b`) is lowered to:
  - `MOVDQA xmm_y, xmm_a`
  - `PCMPGTQ xmm_y, xmm_b`
  - `PXOR xmm_y, [wasm_i64x2_splat(-1)]`
x86/x86-64 processors with SSE2 instruction set
Based on this answer by user aqrit on Stack Overflow
- `y = i64x2.gt_s(a, b)` (`y` is not `a` and `y` is not `b`) is lowered to:
  - `MOVDQA xmm_y, xmm_b`
  - `MOVDQA xmm_tmp, xmm_a`
  - `PSUBQ xmm_y, xmm_a`
  - `PCMPEQD xmm_tmp, xmm_b`
  - `PAND xmm_y, xmm_tmp`
  - `MOVDQA xmm_tmp, xmm_a`
  - `PCMPGTD xmm_tmp, xmm_b`
  - `POR xmm_y, xmm_tmp`
  - `PSHUFD xmm_y, xmm_y, 0xF5`
- `y = i64x2.lt_s(a, b)` (`y` is not `a` and `y` is not `b`) is lowered to:
  - `MOVDQA xmm_y, xmm_a`
  - `MOVDQA xmm_tmp, xmm_b`
  - `PSUBQ xmm_y, xmm_b`
  - `PCMPEQD xmm_tmp, xmm_a`
  - `PAND xmm_y, xmm_tmp`
  - `MOVDQA xmm_tmp, xmm_b`
  - `PCMPGTD xmm_tmp, xmm_a`
  - `POR xmm_y, xmm_tmp`
  - `PSHUFD xmm_y, xmm_y, 0xF5`
- `y = i64x2.ge_s(a, b)` (`y` is not `a` and `y` is not `b`) is lowered to:
  - `MOVDQA xmm_y, xmm_a`
  - `MOVDQA xmm_tmp, xmm_b`
  - `PSUBQ xmm_y, xmm_b`
  - `PCMPEQD xmm_tmp, xmm_a`
  - `PAND xmm_y, xmm_tmp`
  - `MOVDQA xmm_tmp, xmm_b`
  - `PCMPGTD xmm_tmp, xmm_a`
  - `POR xmm_y, xmm_tmp`
  - `PSHUFD xmm_y, xmm_y, 0xF5`
  - `PXOR xmm_y, [wasm_i64x2_splat(-1)]`
- `y = i64x2.le_s(a, b)` (`y` is not `a` and `y` is not `b`) is lowered to:
  - `MOVDQA xmm_y, xmm_b`
  - `MOVDQA xmm_tmp, xmm_a`
  - `PSUBQ xmm_y, xmm_a`
  - `PCMPEQD xmm_tmp, xmm_b`
  - `PAND xmm_y, xmm_tmp`
  - `MOVDQA xmm_tmp, xmm_a`
  - `PCMPGTD xmm_tmp, xmm_b`
  - `POR xmm_y, xmm_tmp`
  - `PSHUFD xmm_y, xmm_y, 0xF5`
  - `PXOR xmm_y, [wasm_i64x2_splat(-1)]`
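As a sanity check on the SSE2 trick, here is a per-lane scalar model of it (the helper name is ours): a 64-bit subtract plus 32-bit equality and signed greater-than compares on the high dwords, with the final `PSHUFD` broadcast modeled by duplicating the combined dword.

```c
#include <stdint.h>
#include <assert.h>

/* Scalar model of the SSE2 i64x2.gt_s sequence above, one lane at a time. */
static int64_t gt_s_sse2_model(int64_t a, int64_t b) {
    uint32_t a_hi   = (uint32_t)((uint64_t)a >> 32);
    uint32_t b_hi   = (uint32_t)((uint64_t)b >> 32);
    uint32_t diff_hi = (uint32_t)(((uint64_t)b - (uint64_t)a) >> 32); /* PSUBQ   */
    uint32_t eq_hi   = (a_hi == b_hi) ? 0xFFFFFFFFu : 0;              /* PCMPEQD */
    uint32_t gt_hi   = ((int32_t)a_hi > (int32_t)b_hi) ? 0xFFFFFFFFu : 0; /* PCMPGTD */
    uint32_t r = (diff_hi & eq_hi) | gt_hi;                           /* PAND + POR */
    /* PSHUFD 0xF5 broadcasts the high dword into both halves of the lane. */
    return (int64_t)(((uint64_t)r << 32) | r);
}
```

The mask comes out canonical without any extra shift: when the high dwords are equal, `b - a` fits in (-2^32, 2^32), so its high dword is already all-ones (difference negative, i.e. `a > b`) or all-zeros; otherwise the signed 32-bit compare on the high dwords decides.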
ARM64 processors
- `y = i64x2.gt_s(a, b)` is lowered to `CMGT Vy.2D, Va.2D, Vb.2D`
- `y = i64x2.lt_s(a, b)` is lowered to `CMGT Vy.2D, Vb.2D, Va.2D`
- `y = i64x2.ge_s(a, b)` is lowered to `CMGE Vy.2D, Va.2D, Vb.2D`
- `y = i64x2.le_s(a, b)` is lowered to `CMGE Vy.2D, Vb.2D, Va.2D`
ARMv7 processors with NEON instruction set
Based on this answer by user aqrit on Stack Overflow
- `y = i64x2.gt_s(a, b)` is lowered to:
  - `VQSUB.S64 Qy, Qb, Qa`
  - `VSHR.S64 Qy, Qy, #63`
- `y = i64x2.lt_s(a, b)` is lowered to:
  - `VQSUB.S64 Qy, Qa, Qb`
  - `VSHR.S64 Qy, Qy, #63`
- `y = i64x2.ge_s(a, b)` is lowered to:
  - `VQSUB.S64 Qy, Qa, Qb`
  - `VSHR.S64 Qy, Qy, #63`
  - `VMVN Qy, Qy`
- `y = i64x2.le_s(a, b)` is lowered to:
  - `VQSUB.S64 Qy, Qb, Qa`
  - `VSHR.S64 Qy, Qy, #63`
  - `VMVN Qy, Qy`
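The NEON trick can likewise be modeled per lane: the saturating subtract (`VQSUB.S64`) keeps `b - a` from wrapping, so its sign bit alone decides `a > b`, and the arithmetic shift by 63 (`VSHR.S64`) smears that bit into a full mask. A scalar sketch (helper names ours; assumes arithmetic right shift of negative values, as on mainstream compilers):

```c
#include <stdint.h>
#include <assert.h>

/* Signed 64-bit saturating subtract, as VQSUB.S64 does per lane. */
static int64_t sat_sub_s64(int64_t b, int64_t a) {
    uint64_t r = (uint64_t)b - (uint64_t)a;
    /* Overflow iff operands differ in sign and the result's sign differs
     * from the minuend's; saturate toward the minuend's sign. */
    if (((b ^ a) & (b ^ (int64_t)r)) < 0)
        return b < 0 ? INT64_MIN : INT64_MAX;
    return (int64_t)r;
}

/* Scalar model of the NEON i64x2.gt_s sequence above. */
static int64_t gt_s_neon_model(int64_t a, int64_t b) {
    return sat_sub_s64(b, a) >> 63; /* all-ones iff b - a is negative, i.e. a > b */
}
```

Without saturation, `b - a` could wrap and report the wrong sign (e.g. `b = INT64_MIN`, `a = 1`); saturation pins the result at `INT64_MIN`/`INT64_MAX`, preserving the correct sign bit in every case.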