
Adding new SIMD instructions to load sign and zero extend 8, 16 and 32 byte integers #28

Closed
rrwinterton opened this issue Mar 30, 2018 · 19 comments

Comments

@rrwinterton

proposal-webAssembly-SIMD-modification.pdf

@arunetm-zz

@PeterJensen @kenchris

@jfbastien
Member

Could you put the content of the PDF into this bug? PDFs aren't searchable in general.

@kenchris

kenchris commented Mar 30, 2018

Proposal WebAssembly SIMD Modification

Currently, as proposed, there is an instruction defined in the WASM SIMD ISA as follows.
Instruction – i8x16.mul, a register-to-register operation that takes two vectors of 16 8-bit integers, multiplies them lane-wise, and produces 8-bit results. If the distribution of the integers is flat, approximately 50 percent of the multiplies would overflow. This instruction is not known to be useful in real-life applications. It could be left in the ISA, but on IA it would require a significant number of instructions, and overflow handling would have to be specified to achieve cross-platform compatibility. That is messy for no known application, and the overflow handling is currently not defined in the WASM SIMD specification.
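
For reference, here is a rough sketch (not from the proposal; the helper name is hypothetical) of one way an SSE2-level engine might have to emulate i8x16.mul, since IA has no packed 8-bit multiply. It illustrates the "significant number of instructions" point: widen, multiply, mask, and repack in place of a single multiply.

#include <emmintrin.h>  /* SSE2 */

/* Illustrative only: lane-wise 8-bit multiply with wrapping (modulo 256)
   results, emulated because x86 has no packed 8-bit multiply. */
static __m128i emulate_i8x16_mul(__m128i a, __m128i b)
{
    const __m128i zero = _mm_setzero_si128();
    const __m128i mask = _mm_set1_epi16(0x00ff);
    /* Widen each operand's low and high 8 byte lanes to 16-bit lanes. */
    __m128i a_lo = _mm_unpacklo_epi8(a, zero);
    __m128i a_hi = _mm_unpackhi_epi8(a, zero);
    __m128i b_lo = _mm_unpacklo_epi8(b, zero);
    __m128i b_hi = _mm_unpackhi_epi8(b, zero);
    /* 16-bit multiplies; the low byte of each product is the wrapped result. */
    __m128i p_lo = _mm_and_si128(_mm_mullo_epi16(a_lo, b_lo), mask);
    __m128i p_hi = _mm_and_si128(_mm_mullo_epi16(a_hi, b_hi), mask);
    /* Pack the 16 byte-sized results back into one 128-bit vector. */
    return _mm_packus_epi16(p_lo, p_hi);
}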

The proposal is to remove this instruction, since no known applications use i8x16.mul functionality, and to add widening load integer instructions. This was previously proposed as a candidate (bullet two of the candidates below) back in November last year.

Candidates

  • Horizontal pairwise addition, integer and floating point
  • Widening integer conversion, signed and unsigned
  • Saturating narrowing integer conversion

Proposed new instructions:

In conjunction with the potential removal of i8x16.mul, it is proposed to add six new load instructions to make integer multiplies easier; the first of these is i8x8.zxload. This would make i8, i16 and i32 multiplies useful and more practical for applications such as machine learning, image compression, and video and rendering data processing. The new instructions would take consecutive integers of the corresponding size and either zero extend or sign extend the consecutive bytes, words or dwords to the next wider lane size, filling a 128-bit vector.
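
As a plain-C sketch of the intended element-wise semantics (type and function names here are illustrative, not part of the proposal), the byte forms would behave roughly as follows; the word and dword forms are analogous at 32- and 64-bit lane widths.

#include <stdint.h>

/* Illustrative semantics of the proposed i8x8.zxload / i8x8.sxload: read 8
   consecutive bytes from linear memory and widen each one to a 16-bit lane
   of a 128-bit vector. Names are ours, not normative. */
typedef struct { uint16_t lane[8]; } u16x8;
typedef struct { int16_t  lane[8]; } s16x8;

static u16x8 i8x8_zxload(const uint8_t *mem)
{
    u16x8 v;
    for (int i = 0; i < 8; i++)
        v.lane[i] = (uint16_t)mem[i];          /* zero extend */
    return v;
}

static s16x8 i8x8_sxload(const uint8_t *mem)
{
    s16x8 v;
    for (int i = 0; i < 8; i++)
        v.lane[i] = (int16_t)(int8_t)mem[i];   /* sign extend */
    return v;
}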

WASM SIMD Instruction Change

Current instruction:

i8x16.mul: roughly 50 percent of the results will overflow (remove?)


New instructions:

i8x8.zxload zero extend - takes 8 consecutive bytes and zero extends them


i16x4.zxload zero extend - takes 4 consecutive words and zero extends them


i32x2.zxload zero extend - takes 2 consecutive dwords and zero extends them


Intel and ARM both have this capability. On Intel (SSE4.1) the corresponding instructions are:

pmovzxbw,
pmovzxwd,
pmovzxdq,
pmovsxbw,
pmovsxwd,
pmovsxdq

ARM:

Example instruction              Description
LDR X0, [X1]                     Load from the address in X1
LDR X0, [X1, #8]                 Load from the address in X1 + 8
LDR X0, [X1, X2]                 Load from the address in X1 + X2
LDR X0, [X1, X2, LSL #3]         Load from the address in X1 + (X2 << 3)
LDR X0, [X1, W2, SXTW]           Load from the address in X1 + sign extend(W2)
LDR X0, [X1, W2, SXTW #3]        Load from the address in X1 + (sign extend(W2) << 3)

So the new instructions for WASM would be defined as follows:

i8x8.zxload, i16x4.zxload, i32x2.zxload, i8x8.sxload, i16x4.sxload, i32x2.sxload

With these new instructions, a multiply can now be done without worrying about the signed and unsigned overflow that would otherwise affect roughly 50 percent of the data it operates on.
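
On IA this maps naturally onto the SSE4.1 pmovzxbw/pmovsxbw family. A hedged sketch (intrinsic-based, with a hypothetical helper name) of the intended use:

#include <stdint.h>
#include <smmintrin.h>  /* SSE4.1 */

/* Illustrative lowering of: i8x8.zxload a; i8x8.zxload b; i16x8.mul.
   Every product of two 8-bit values fits in 16 bits, so no lane overflows. */
static __m128i widening_u8x8_mul(const uint8_t *a, const uint8_t *b)
{
    __m128i va = _mm_cvtepu8_epi16(_mm_loadl_epi64((const __m128i *)a)); /* pmovzxbw */
    __m128i vb = _mm_cvtepu8_epi16(_mm_loadl_epi64((const __m128i *)b)); /* pmovzxbw */
    return _mm_mullo_epi16(va, vb);  /* 8 exact 16-bit products */
}

A compiler typically folds each 64-bit load into the memory form of pmovzxbw, so the body is two loads-with-extend and one multiply.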

@rrwinterton
Author

Thanks, I was just doing it.

@rrwinterton
Author

(images attached: the proposal content, posted as screenshots)

@PeterJensen
Collaborator

How about separating the sign/zero extend operation from the 64-bit load operation? So instead of having the combined load/extend operations, just introduce the extend operations, since there already is a 64-bit load instruction. e.g.

i16x8.sext8
i16x8.zext8
i32x4.sext16
i32x4.zext16

Just FYI: The IA pmovsx/pmovzx instructions were introduced with SSE4.1
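
As a point of comparison with the load forms, here is a rough sketch of what a register-form extend such as i16x8.zext8 could mean, under one plausible reading (widening the low 8 byte lanes of an existing v128; the i64-source reading discussed below would differ only in the operand type). Types and names are illustrative, not from the proposal.

#include <stdint.h>

/* Illustrative semantics of a register-to-register i16x8.zext8, assuming the
   source is the low 8 byte lanes of a v128 (names/types are ours). */
typedef struct { uint8_t  lane[16]; } v128_u8x16;
typedef struct { uint16_t lane[8];  } v128_u16x8;

static v128_u16x8 i16x8_zext8(v128_u8x16 v)
{
    v128_u16x8 out;
    for (int i = 0; i < 8; i++)
        out.lane[i] = (uint16_t)v.lane[i];   /* zero extend the low 8 bytes */
    return out;
}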

@rrwinterton
Author

Yes, the register-to-register operations are useful, but typically we load and operate on the same operands when we multiply. If we do a register-to-register sign/zero extend we would still have to do a load, so the typical use case would require two instructions instead of one. I am fine with either form of the instructions. I thought about providing a memory-or-register operand for WASM, but I didn't see any other instructions that provide that option, so I didn't think it would be consistent, although I would have preferred it. As said, I would be fine with either, but I was trying to optimize for the most common use case.

@PeterJensen
Collaborator

PeterJensen commented Apr 13, 2018

A moderately clever engine should be able to fold the load/extend instruction pair into using the right addressing mode on the actual instructions.

Dan - @sunfishcode - might have an opinion

@rrwinterton
Author

That is true, and as I said I don't care as long as we get an extend instruction in. Peter or others, if you feel strongly about it, a register extend works for me. Just as an optimization FYI: using the load-extend form instead of the register extend instruction improves performance by about 1.06x on the gemmlowp matrix multiply kernel. That is a specific application that would get the 6 percent improvement. As Peter said, the engine would otherwise use two instructions instead of one; if there are applications that could use the register form, the 6% hit isn't ideal, but it is on the kernel, not the overall workload. (See code below.)

"pmovzxbw 0x00(%[lhs_ptr]), %%xmm0\n\t"
"pshufd $0x00,%%xmm1,%%xmm2 \n\t"
"pshufd $0x55,%%xmm1,%%xmm3 \n\t"
"pmaddwd %%xmm0, %%xmm2 \n\t"
"pmaddwd %%xmm0, %%xmm3 \n\t"
"paddd %%xmm2, %%xmm4 \n\t"
"paddd %%xmm3, %%xmm5 \n\t"

"pshufd $0xaa,%%xmm1,%%xmm2 \n\t"
"pmaddwd %%xmm0, %%xmm2 \n\t"
"pshufd $0xff,%%xmm1,%%xmm3 \n\t"
"pmaddwd %%xmm0, %%xmm3 \n\t"

@PeterJensen
Collaborator

My point above was that the engine should still be able to generate just one target instruction with the load operation folded into the mem->reg addressing mode of the instruction, so there shouldn't be any performance hit.

@sunfishcode
Member

I'm a little confused; the current proposal doesn't have a 64-bit SIMD load; it only has a 128-bit load.

@PeterJensen
Collaborator

The source operand for the extends should probably just be an i64

@rrwinterton
Author

A 128-bit load is what we are looking at. I am not sure where the i64 came into the picture. We would want to load 128 bits and extend as the appropriate type that is being loaded. I don't understand the 64-bit suggestion either.

@sunfishcode
Member

sunfishcode commented Apr 16, 2018

I'm nervous about using i64 to hold values that we'd never want stored in integer registers (on common architectures). On an implementation with slow scalar->SIMD moves, if pattern-matching fails for any reason, it could create a surprising slowdown.

What would you think about also introducing 64-bit load operations that produce v128 values, with zeros in the high elements? This would be like load2 etc. from earlier iterations of the SIMD proposal, which were removed in part because load3 is a non-power-of-two-size memory access, which can be complicated to implement on popular platforms. Perhaps just the 64-bit memory forms could be reintroduced. Then a v128-to-v128 sign/zero-extend op could work by discarding the high elements and sign/zero-extending the low ones.
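
A rough sketch of that decomposition (illustrative names, little-endian lane layout as in wasm memory): a 64-bit load that zeroes the upper half of the v128, followed by a v128-to-v128 extend that widens only the low lanes.

#include <stdint.h>
#include <string.h>

/* Illustrative v128 as 16 bytes of storage. */
typedef struct { uint8_t b[16]; } v128_t;

/* Step 1: a v128.load64-style op — read 8 bytes from memory into the low
   half of a v128 and zero the high half. */
static v128_t load64_zero(const uint8_t *mem)
{
    v128_t v = {{0}};
    memcpy(v.b, mem, 8);
    return v;
}

/* Step 2: v128-to-v128 zero extend — discard the high 8 byte lanes and
   widen the low 8 bytes to 16-bit lanes. */
static v128_t i16x8_zext_low_bytes(v128_t v)
{
    v128_t out;
    for (int i = 0; i < 8; i++) {
        out.b[2 * i]     = v.b[i];  /* low byte of each lane = original value */
        out.b[2 * i + 1] = 0;       /* high byte of each lane = zero extension */
    }
    return out;
}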

I'd also be ok just adding combined load+extend operations as is currently proposed here. wasm already does have analogous opcodes in scalar.

@jfbastien
Member

load+extend seems more natural IMO, given that we already have load+extend opcodes in scalar to support loading 8- and 16-bit values, which aren't normal sizes (similar to what we have here).

@PeterJensen
Collaborator

Introducing 64-bit loads that produce v128 values with zeros in the high elements would be fine with me. @rrwinterton I'm confused about your comment about 128-bit loads. The originally proposed combined load+extend operations load only 64 bits as well, as far as I understood them. That's what the actual pmovsx/pmovzx instructions do when using the memory addressing mode.

@rrwinterton
Author

I think the load+extend is more intuitive. Yes, it would take 64 bits from memory, with the signedness specified by the instruction, and expand to 128 bits. I think there were two items we needed to do with this: first, standardize the instruction convention and naming; second, decide if i8x16.mul is something we want. I did a quick check with clang's auto-vectorizer, and the code it generates for unsigned byte multiplies isn't nice. Obviously clang for IA would never generate an instruction that doesn't exist :) Dan, were you going to look at what the compiler would do for ARM? I still haven't seen a use in an application. Perhaps byte pattern initialization? If so, there are better ways to initialize data.

movq $1 ,%xmm1
punpcklbw %xmm0,%xmm1
movq $2,%xmm2
punpcklbw %xmm0,%xmm2
pmullw %xmm1,%xmm2
pand %xmm0,%xmm2
packuswb %xmm2,%xmm2
movq %xmm2, $3

I could accept the byte multiply as something to do, but I wanted to point out that it isn't cheap on IA, and I am not sure how it is used. Dan, let me know if you have had a chance to look.

@Maratyszcza
Contributor

Related: #23

rrwinterton added a commit to rrwinterton/simd that referenced this issue Jul 23, 2019
This proposed change is to add the 6 load extend instructions for signed and unsigned integers using the naming convention suggested by others in the pull request WebAssembly#28
penzn added a commit to penzn/simd that referenced this issue Sep 10, 2019
Consensus for the removal is documented in WebAssembly#28 and
WebAssembly#98.
tlively pushed a commit that referenced this issue Sep 13, 2019
And remove i8x16.mul, as documented in #28 and #98.
@dtig
Member

dtig commented Sep 13, 2019

Closing as #98 is merged.

@dtig dtig closed this as completed Sep 13, 2019
Honry pushed a commit to Honry/simd that referenced this issue Oct 19, 2019
@penzn penzn mentioned this issue Oct 7, 2021