Adding new SIMD instructions to load sign and zero extend 8, 16 and 32 byte integers #28
Comments
Could you put the content of the PDF into this bug? PDFs aren't searchable in general. |
Proposal WebAssembly SIMD Modification
Currently, as proposed, there is an instruction defined in the WASM SIMD ISA as follows. Proposal is to remove this instruction as known applications don't use it.
Candidates
Proposed new instructions:
In conjunction with the potential change of the WASM SIMD Instruction Change
Current instruction:
New instructions:
Intel and ARM both have this capability by doing the following:
ARM:
So the new instructions for WASM would be defined as follows:
|
Thanks, I was just doing it. |
How about separating the sign/zero extend operation from the 64-bit load operation? So instead of having the combined load/extend operations, just introduce the extend operations, since there already is a 64-bit load instruction. e.g.
Just FYI: The IA pmovsx/pmovzx instructions were introduced with SSE4.1 |
Yes, the register-to-register operations are useful, but typically we load and operate on the same operands when we multiply. If we do a register-to-register sign/zero extend, we would still have to do a load, so the typical use case of this instruction would take two instructions instead of one. I am good with either form. I thought about providing a memory-or-register instruction for WASM, but I didn't see any other instructions that offer that option, so I didn't think it would be consistent, though I would have preferred it. As I said, I would be good with either, but I was trying to optimize for the most common use case. |
A moderately clever engine should be able to fold the load/extend instruction pair into using the right addressing mode on the actual instructions. Dan - @sunfishcode - might have an opinion |
That is true, and as I said, I don't care as long as we get an extend instruction in. Peter or others, if you feel strongly about it, register extend works for me. Just as an optimization FYI: using the load extend instead of the register extend instruction improves performance by about 1.06X on the gemmlowp matrix multiply kernel. But this is one specific application that would get the 6 percent improvement. As Peter said, the engine would use two instructions instead of just the one, but if there are applications that could use the register form, the 6% hit isn't ideal, though it is on the kernel, not the overall workload. (See code below)
"pmovzxbw 0x00(%[lhs_ptr]), %%xmm0\n\t"
"pshufd $0xaa,%%xmm1,%%xmm2 \n\t" |
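To make the "load, widen, then multiply" use case concrete, here is a minimal scalar sketch in C. It is not gemmlowp code; the function name `dot_u8_widened` is made up for illustration. It only shows why each byte is zero-extended before the multiply, which is the job `pmovzxbw` does in the kernel quoted above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not taken from gemmlowp: dot product of two
 * uint8 vectors. Each byte is zero-extended to 16 bits before the
 * multiply, mirroring what pmovzxbw does for eight lanes at once. */
uint32_t dot_u8_widened(const uint8_t *lhs, const uint8_t *rhs, size_t n) {
    uint32_t acc = 0;
    for (size_t i = 0; i < n; ++i) {
        uint16_t a = lhs[i];               /* zero-extend 8 -> 16 bits */
        uint16_t b = rhs[i];               /* zero-extend 8 -> 16 bits */
        acc += (uint32_t)a * (uint32_t)b;  /* accumulate in 32 bits */
    }
    return acc;
}
```

An 8-bit multiply would lose the high bits of each product (255 * 255 = 65025 does not fit in a byte), so the widening step is what keeps the full result.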
My point above was that the engine should still be able to generate just one target instruction with the load operation folded into the mem->reg addressing mode of the instruction, so there shouldn't be any performance hit. |
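A rough scalar model in C of why the folding is possible: the two-step form (load, then extend) and the combined form give the same lanes, so an engine is free to pick one target instruction for the pair. All type and function names here (`u8x8`, `u16x8`, `load64`, `extend_u8x8`, `load_extend_u8x8`) are hypothetical, and the 64-bit load is an assumption of the sketch rather than something the proposal is known to define.

```c
#include <stdint.h>
#include <string.h>

typedef struct { uint8_t lane[8]; } u8x8;    /* hypothetical 64-bit value  */
typedef struct { uint16_t lane[8]; } u16x8;  /* hypothetical 128-bit value */

/* Two-step form, step 1: plain 64-bit load (assumed to exist). */
u8x8 load64(const uint8_t *mem) {
    u8x8 v;
    memcpy(v.lane, mem, sizeof v.lane);
    return v;
}

/* Two-step form, step 2: register-to-register zero extend. */
u16x8 extend_u8x8(u8x8 v) {
    u16x8 r;
    for (int i = 0; i < 8; ++i) r.lane[i] = v.lane[i];
    return r;
}

/* Combined form: load eight bytes and zero-extend them in one operation. */
u16x8 load_extend_u8x8(const uint8_t *mem) {
    u16x8 r;
    for (int i = 0; i < 8; ++i) r.lane[i] = mem[i];
    return r;
}
```

Since `extend_u8x8(load64(p))` and `load_extend_u8x8(p)` agree for every input, the choice between the two instruction shapes is about encoding and engine complexity, not about results.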
I'm a little confused; the current proposal doesn't have a 64-bit SIMD load; it only has a 128-bit load. |
The source operand for the extends should probably just be an |
128-bit load is what we are looking at. I am not sure where the i64 came into the picture. We would want to load 128 bits and extend according to the type that is being loaded. I don't understand the 64-bit part either. |
I'm nervous about using
What would you think about also introducing 64-bit load operations that produce
I'd also be ok just adding combined load+extend operations as is currently proposed here. wasm already does have analogous opcodes in scalar. |
load+extend seems more natural IMO, given that we have them to support loading scalar 8 / 16, which aren't normal sizes (similar to what we have here). |
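For reference, the scalar analogy can be sketched in C. The two functions below model the existing wasm opcodes `i32.load8_s` and `i32.load8_u`; the C function names are just labels for this sketch, and linear memory is simplified to a byte pointer.

```c
#include <stdint.h>

/* Model of i32.load8_s: read one byte and sign-extend it to 32 bits. */
int32_t load8_s(const uint8_t *mem) {
    int32_t v = *mem;
    return v >= 128 ? v - 256 : v;  /* bytes 0x80..0xFF map to -128..-1 */
}

/* Model of i32.load8_u: read one byte and zero-extend it to 32 bits. */
uint32_t load8_u(const uint8_t *mem) {
    return (uint32_t)*mem;
}
```

The same byte in memory gives two different results depending on the opcode, which is the pattern the SIMD load+extend instructions would repeat per lane.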
Introducing 64-bit loads that produce |
I think the load+extend is more intuitive. Yes, it would read 64 bits from memory, interpret them as the specified signed or unsigned type, and expand them to 128 bits. I think there were two items we needed to settle with this. The first was to standardize on the instruction convention and naming, and the second was to decide whether i8x16.mul is something we want. I did a quick check of what clang's auto-vectorizer does, and the code for unsigned byte multiplies isn't nice. Obviously clang for IA would never generate an instruction that doesn't exist :) Dan, were you going to look at what the compiler would do for ARM? I still haven't seen a use in an application. Perhaps byte pattern initialization? If so, there are better ways to initialize data.
movq $1 ,%xmm1
I could accept the byte multiply as something to do, but I wanted to point out that it isn't cheap on IA and I am not sure how it is used. Dan, let me know if you had a chance to look. |
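As a sketch of the semantics described here (64 bits read from memory, widened to 128 bits as signed or unsigned), the following C models the 16-bit to 32-bit case. The names `i32x4`, `u32x4`, `load_extend_s16x4` and `load_extend_u16x4` are placeholders for this example and are not the instruction names from the proposal.

```c
#include <stdint.h>
#include <string.h>

typedef struct { int32_t lane[4]; } i32x4;   /* hypothetical 128-bit result */
typedef struct { uint32_t lane[4]; } u32x4;  /* hypothetical 128-bit result */

/* Read 64 bits as four 16-bit lanes and sign-extend each to 32 bits. */
i32x4 load_extend_s16x4(const void *mem) {
    int16_t in[4];
    i32x4 out;
    memcpy(in, mem, sizeof in);              /* the 64-bit memory read */
    for (int i = 0; i < 4; ++i) out.lane[i] = in[i];
    return out;
}

/* Read 64 bits as four 16-bit lanes and zero-extend each to 32 bits. */
u32x4 load_extend_u16x4(const void *mem) {
    uint16_t in[4];
    u32x4 out;
    memcpy(in, mem, sizeof in);              /* the 64-bit memory read */
    for (int i = 0; i < 4; ++i) out.lane[i] = in[i];
    return out;
}
```

A lane holding 0xFFFF comes out as -1 in the signed form and 65535 in the unsigned form; lanes below 0x8000 are identical in both.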
Related: #23 |
This proposed change is to add the 6 load extend instructions for signed and unsigned integers using the naming convention suggested by others in the pull request WebAssembly#28
Consensus for the removal is documented in WebAssembly#28 and WebAssembly#98.
Closing as #98 is merged. |
And remove i8x16.mul, as documented in WebAssembly#28 and WebAssembly#98.
proposal-webAssembly-SIMD-modification.pdf