
Adding new SIMD instructions to load sign and zero extend 8, 16 and 32 byte integers #28

Closed
rrwinterton opened this issue Mar 30, 2018 · 19 comments

Comments

@rrwinterton

proposal-webAssembly-SIMD-modification.pdf

@arunetm-zz

@PeterJensen @kenchris

@jfbastien
Member

Could you put the content of the PDF into this bug? PDFs aren't searchable in general.

@kenchris

kenchris commented Mar 30, 2018

Proposal WebAssembly SIMD Modification

Currently, as proposed, there is an instruction defined in the WASM SIMD ISA as follows.
Instruction – i8x16.mul, a register-to-register operation that takes two vectors of 16 8-bit integers, multiplies them lane-wise, and produces 8-bit results. If the distribution of the integers is flat, approximately 50 percent of the multiplies would overflow. This instruction is not known to be useful in real-life applications. It could be left in the ISA, but on IA it would require a significant number of instructions, and overflow handling would have to be specified to achieve cross-platform compatibility. That is messy for no known application, and the overflow handling is currently not defined in the WASM SIMD specification.
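
For reference, here is a rough sketch (not from the proposal; the helper name is hypothetical) of one way an SSE2-level engine might have to emulate i8x16.mul, since IA has no packed 8-bit multiply. It illustrates the "significant number of instructions" point: widen, multiply, mask, and repack in place of a single multiply.

#include <emmintrin.h>  /* SSE2 */

/* Illustrative only: lane-wise 8-bit multiply with wrapping (modulo 256)
   results, emulated because x86 has no packed 8-bit multiply. */
static __m128i emulate_i8x16_mul(__m128i a, __m128i b)
{
    const __m128i zero = _mm_setzero_si128();
    const __m128i mask = _mm_set1_epi16(0x00ff);
    /* Widen each operand's low and high 8 byte lanes to 16-bit lanes. */
    __m128i a_lo = _mm_unpacklo_epi8(a, zero);
    __m128i a_hi = _mm_unpackhi_epi8(a, zero);
    __m128i b_lo = _mm_unpacklo_epi8(b, zero);
    __m128i b_hi = _mm_unpackhi_epi8(b, zero);
    /* 16-bit multiplies; the low byte of each product is the wrapped result. */
    __m128i p_lo = _mm_and_si128(_mm_mullo_epi16(a_lo, b_lo), mask);
    __m128i p_hi = _mm_and_si128(_mm_mullo_epi16(a_hi, b_hi), mask);
    /* Pack the 16 byte-sized results back into one 128-bit vector. */
    return _mm_packus_epi16(p_lo, p_hi);
}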

The proposal is to remove this instruction, since no known applications use i8x16.mul functionality, and to add widening load integer instructions. This was previously proposed as a candidate (bullet two of the candidates below) back in November last year.

Candidates

  • Horizontal pairwise addition, integer and floating point
  • Widening integer conversion, signed and unsigned
  • Saturating narrowing integer conversion

Proposed new instructions:

In conjunction with the potential removal of i8x16.mul, it is proposed to add six new load instructions to make integer multiplies easier; the first of these is i8x8.zxload. This would make i8, i16 and i32 multiplies useful and more practical for applications such as machine learning, image compression, and video and rendering data processing. The new instructions would take consecutive integers of the corresponding size and either zero extend or sign extend the consecutive bytes, words or dwords to the next wider lane size, filling a 128-bit vector.
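
As a plain-C sketch of the intended element-wise semantics (type and function names here are illustrative, not part of the proposal), the byte forms would behave roughly as follows; the word and dword forms are analogous at 32- and 64-bit lane widths.

#include <stdint.h>

/* Illustrative semantics of the proposed i8x8.zxload / i8x8.sxload: read 8
   consecutive bytes from linear memory and widen each one to a 16-bit lane
   of a 128-bit vector. Names are ours, not normative. */
typedef struct { uint16_t lane[8]; } u16x8;
typedef struct { int16_t  lane[8]; } s16x8;

static u16x8 i8x8_zxload(const uint8_t *mem)
{
    u16x8 v;
    for (int i = 0; i < 8; i++)
        v.lane[i] = (uint16_t)mem[i];          /* zero extend */
    return v;
}

static s16x8 i8x8_sxload(const uint8_t *mem)
{
    s16x8 v;
    for (int i = 0; i < 8; i++)
        v.lane[i] = (int16_t)(int8_t)mem[i];   /* sign extend */
    return v;
}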

WASM SIMD Instruction Change

Current instruction:

i8x16.mul: roughly 50 percent of the results will overflow (remove?)


New instructions:

i8x8.zxload zero extend - takes 8 consecutive bytes and zero extends them


i16x4.zxload zero extend - takes 4 consecutive words and zero extends them


i32x2.zxload zero extend - takes 2 consecutive dwords and zero extends them


Intel and ARM both have this capability. On Intel (SSE4.1) the corresponding instructions are:

pmovzxbw,
pmovzxwd,
pmovzxdq,
pmovsxbw,
pmovsxwd,
pmovsxdq

ARM:

Example instruction              Description
LDR X0, [X1]                     Load from the address in X1
LDR X0, [X1, #8]                 Load from the address in X1 + 8
LDR X0, [X1, X2]                 Load from the address in X1 + X2
LDR X0, [X1, X2, LSL #3]         Load from the address in X1 + (X2 << 3)
LDR X0, [X1, W2, SXTW]           Load from the address in X1 + sign extend(W2)
LDR X0, [X1, W2, SXTW #3]        Load from the address in X1 + (sign extend(W2) << 3)

So the new instructions for WASM would be defined as follows:

i8x8.zxload, i16x4.zxload, i32x2.zxload, i8x8.sxload, i16x4.sxload, i32x2.sxload

With these new instructions, a multiply can now be done without worrying about the signed and unsigned overflow that would otherwise affect roughly 50 percent of the data it operates on.
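
On IA this maps naturally onto the SSE4.1 pmovzxbw/pmovsxbw family. A hedged sketch (intrinsic-based, with a hypothetical helper name) of the intended use:

#include <stdint.h>
#include <smmintrin.h>  /* SSE4.1 */

/* Illustrative lowering of: i8x8.zxload a; i8x8.zxload b; i16x8.mul.
   Every product of two 8-bit values fits in 16 bits, so no lane overflows. */
static __m128i widening_u8x8_mul(const uint8_t *a, const uint8_t *b)
{
    __m128i va = _mm_cvtepu8_epi16(_mm_loadl_epi64((const __m128i *)a)); /* pmovzxbw */
    __m128i vb = _mm_cvtepu8_epi16(_mm_loadl_epi64((const __m128i *)b)); /* pmovzxbw */
    return _mm_mullo_epi16(va, vb);  /* 8 exact 16-bit products */
}

A compiler typically folds each 64-bit load into the memory form of pmovzxbw, so the body is two loads-with-extend and one multiply.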

@rrwinterton
Author

Thanks, I was just doing it.

@rrwinterton
Author

(images attached: the proposal content, posted as screenshots)

@PeterJensen
Collaborator

How about separating the sign/zero extend operation from the 64-bit load operation? So instead of having the combined load/extend operations, just introduce the extend operations, since there already is a 64-bit load instruction. e.g.

i16x8.sext8
i16x8.zext8
i32x4.sext16
i32x4.zext16

Just FYI: The IA pmovsx/pmovzx instructions were introduced with SSE4.1
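
As a point of comparison with the load forms, here is a rough sketch of what a register-form extend such as i16x8.zext8 could mean, under one plausible reading (widening the low 8 byte lanes of an existing v128; the i64-source reading discussed below would differ only in the operand type). Types and names are illustrative, not from the proposal.

#include <stdint.h>

/* Illustrative semantics of a register-to-register i16x8.zext8, assuming the
   source is the low 8 byte lanes of a v128 (names/types are ours). */
typedef struct { uint8_t  lane[16]; } v128_u8x16;
typedef struct { uint16_t lane[8];  } v128_u16x8;

static v128_u16x8 i16x8_zext8(v128_u8x16 v)
{
    v128_u16x8 out;
    for (int i = 0; i < 8; i++)
        out.lane[i] = (uint16_t)v.lane[i];   /* zero extend the low 8 bytes */
    return out;
}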

@rrwinterton
Author

Yes, the register-to-register operations are useful, but typically we load and operate on the same operands when we multiply. If we do a register-to-register sign/zero extend we would still have to do a load, so the typical use case would require two instructions instead of one. I am fine with either form of the instructions. I thought about providing a memory-or-register operand for WASM, but I didn't see any other instructions that provide that option, so I didn't think it would be consistent, although I would have preferred it. As said, I would be fine with either, but I was trying to optimize for the most common use case.

@PeterJensen
Collaborator

PeterJensen commented Apr 13, 2018

A moderately clever engine should be able to fold the load/extend instruction pair into using the right addressing mode on the actual instructions.

Dan - @sunfishcode - might have an opinion

@rrwinterton
Author

That is true, and as I said I don't care as long as we get an extend instruction in. Peter or others, if you feel strongly about it, a register extend works for me. Just as an optimization FYI: using the load-extend form instead of the register extend instruction improves performance by about 1.06x on the gemmlowp matrix multiply kernel. That is a specific application that would get the 6 percent improvement. As Peter said, the engine would otherwise use two instructions instead of one; if there are applications that could use the register form, the 6% hit isn't ideal, but it is on the kernel, not the overall workload. (See code below.)

"pmovzxbw 0x00(%[lhs_ptr]), %%xmm0\n\t"
"pshufd $0x00,%%xmm1,%%xmm2 \n\t"
"pshufd $0x55,%%xmm1,%%xmm3 \n\t"
"pmaddwd %%xmm0, %%xmm2 \n\t"
"pmaddwd %%xmm0, %%xmm3 \n\t"
"paddd %%xmm2, %%xmm4 \n\t"
"paddd %%xmm3, %%xmm5 \n\t"

"pshufd $0xaa,%%xmm1,%%xmm2 \n\t"
"pmaddwd %%xmm0, %%xmm2 \n\t"
"pshufd $0xff,%%xmm1,%%xmm3 \n\t"
"pmaddwd %%xmm0, %%xmm3 \n\t"

@PeterJensen
Collaborator

My point above was that the engine should still be able to generate just one target instruction with the load operation folded into the mem->reg addressing mode of the instruction, so there shouldn't be any performance hit.

@sunfishcode
Member

I'm a little confused; the current proposal doesn't have a 64-bit SIMD load; it only has a 128-bit load.

@PeterJensen
Collaborator

The source operand for the extends should probably just be an i64

@rrwinterton
Author

A 128-bit load is what we are looking at. I am not sure where the i64 came into the picture. We would want to load 128 bits and extend as the appropriate type that is being loaded. I don't understand the 64-bit suggestion either.

@sunfishcode
Member

sunfishcode commented Apr 16, 2018

I'm nervous about using i64 to hold values that we'd never want stored in integer registers (on common architectures). On an implementation with slow scalar->SIMD moves, if pattern-matching fails for any reason, it could create a surprising slowdown.

What would you think about also introducing 64-bit load operations that produce v128 values, with zeros in the high elements? This would be like load2 etc. from earlier iterations of the SIMD proposal, which were removed in part because load3 is a non-power-of-two-size memory access, which can be complicated to implement on popular platforms. Perhaps just the 64-bit memory forms could be reintroduced. Then a v128-to-v128 sign/zero-extend op could work by discarding the high elements and sign/zero-extending the low ones.
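
A rough sketch of that decomposition (illustrative names, little-endian lane layout as in wasm memory): a 64-bit load that zeroes the upper half of the v128, followed by a v128-to-v128 extend that widens only the low lanes.

#include <stdint.h>
#include <string.h>

/* Illustrative v128 as 16 bytes of storage. */
typedef struct { uint8_t b[16]; } v128_t;

/* Step 1: a v128.load64-style op — read 8 bytes from memory into the low
   half of a v128 and zero the high half. */
static v128_t load64_zero(const uint8_t *mem)
{
    v128_t v = {{0}};
    memcpy(v.b, mem, 8);
    return v;
}

/* Step 2: v128-to-v128 zero extend — discard the high 8 byte lanes and
   widen the low 8 bytes to 16-bit lanes. */
static v128_t i16x8_zext_low_bytes(v128_t v)
{
    v128_t out;
    for (int i = 0; i < 8; i++) {
        out.b[2 * i]     = v.b[i];  /* low byte of each lane = original value */
        out.b[2 * i + 1] = 0;       /* high byte of each lane = zero extension */
    }
    return out;
}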

I'd also be ok just adding combined load+extend operations as is currently proposed here. wasm already does have analogous opcodes in scalar.

@jfbastien
Member

load+extend seems more natural IMO, given that we already have load+extend opcodes in scalar to support loading 8- and 16-bit values, which aren't normal sizes (similar to what we have here).

@PeterJensen
Collaborator

Introducing 64-bit loads that produce v128 values with zeros in the high elements would be fine with me. @rrwinterton I'm confused about your comment about 128-bit loads. The originally proposed combined load+extend operations load only 64 bits as well, as far as I understood them. That's what the actual pmovsx/pmovzx instructions do when using the memory addressing mode.

@rrwinterton
Author

I think the load+extend is more intuitive. Yes, it would take 64 bits from memory, with the signedness specified by the instruction, and expand to 128 bits. I think there were two items we needed to do with this: first, standardize the instruction convention and naming; second, decide if i8x16.mul is something we want. I did a quick check with clang's auto-vectorizer, and the code it generates for unsigned byte multiplies isn't nice. Obviously clang for IA would never generate an instruction that doesn't exist :) Dan, were you going to look at what the compiler would do for ARM? I still haven't seen a use in an application. Perhaps byte pattern initialization? If so, there are better ways to initialize data.

movq $1 ,%xmm1
punpcklbw %xmm0,%xmm1
movq $2,%xmm2
punpcklbw %xmm0,%xmm2
pmullw %xmm1,%xmm2
pand %xmm0,%xmm2
packuswb %xmm2,%xmm2
movq %xmm2, $3

I could accept the byte multiply as something to do, but I wanted to point out that it isn't cheap on IA, and I am not sure how it is used. Dan, let me know if you have had a chance to look.

@Maratyszcza
Contributor

Related: #23

rrwinterton added a commit to rrwinterton/simd that referenced this issue Jul 23, 2019
This proposed change is to add the 6 load extend instructions for signed and unsigned integers using the naming convention suggested by others in the pull request WebAssembly#28
penzn added a commit to penzn/simd that referenced this issue Sep 10, 2019
Consensus for the removal is documented in WebAssembly#28 and
WebAssembly#98.
tlively pushed a commit that referenced this issue Sep 13, 2019
And remove i8x16.mul, as documented in #28 and #98.
@dtig
Member

dtig commented Sep 13, 2019

Closing as #98 is merged.

@dtig dtig closed this as completed Sep 13, 2019
Honry pushed a commit to Honry/simd that referenced this issue Oct 19, 2019
@penzn penzn mentioned this issue Oct 7, 2021