This repository was archived by the owner on Dec 22, 2021. It is now read-only.

i64x2.mul: useful or not? #88

Closed
ngzhian opened this issue Jul 25, 2019 · 9 comments

Comments

@ngzhian
Member

ngzhian commented Jul 25, 2019

I was prototyping some i64x2 instructions, and was just going down the list of existing i32x4 instructions, and encountered mul.

Multiplying two 64-bit numbers requires 128 bits to hold the result, so an i64x2.mul truncates the result to fit in the 64-bit lanes. I can't think of interesting use cases for this behavior, so I'm polling via this issue to see whether an i64x2.mul even makes sense.

@gnzlbg
Contributor

gnzlbg commented Jul 25, 2019

i8.mul, i16.mul, i32.mul, i64.mul, i32x4.mul, i16x8.mul and i8x16.mul all require a larger type to hold the full result, and all are therefore defined to return the lower half of the result.

I don't understand why these semantics are, in your opinion, problematic for i64x2.mul but not for any of the other mul opcodes.

@ngzhian
Member Author

ngzhian commented Jul 25, 2019

That's very true about i8x16.mul and the rest; I completely missed those because I was looking at the arm64 mul instructions, and they don't define one for i64x2 :)
In that case, a follow-up question: how are SIMD multiplies used? Do users really not care about the top half of the result? Or do they convert an i8x16 into 2 i16x8 and multiply those?

@dtig
Member

dtig commented Jul 25, 2019

I think this issue is more polling whether this instruction is practically useful in applications. We've previously had this conversation for i8x16Mul as well, whether the resulting overflow handling is worth it (in #28). Asking whether there are practical applications for this behavior seems like a reasonable question.

@gnzlbg
Contributor

gnzlbg commented Jul 25, 2019

> how are simd multiplies used?

If you multiply two numbers that are too large, the result won't fit in a register of the same size. That's an issue that all integer multiplies have. SIMD multiplies aren't special; they have the same issue that the scalar multiplies have.

> do users really not care about the top half of the result?

When users do care about the top half of the result, they just switch to a wider integer type that produces a wider result.

> Or do they convert i8x16 into 2 i16x8 and multiply them?

Yes, that's exactly how you do it - for scalars, you would just convert two i8s into two i16s and then do a single i16.mul.


Some architectures do have instructions for multiplying two integers and storing a result that's twice as wide, e.g., the x86 BMI2 ISA has the mulx and mulxq instructions for doing this, but these are rarely used in practice. Even if you use the C intrinsic for these, LLVM almost never emits them, because there are alternative instruction sequences (e.g., imulq) that have the same cost but smaller code size.


@dtig

> I think this issue is more polling whether this instruction is practically useful in applications.

I read this as whether integer multiplies that lose the upper bits are generally useful in practice.

In the discussion we had about i8 multiplies, the main argument was that with multiplies an i8 is ridiculously easy to overflow (but then again this argument applies to i8.mul as well). I don't think this argument holds for i64.

@dtig
Member

dtig commented Jul 25, 2019

> > how are simd multiplies used?
>
> If you multiply two numbers that are too large, the result won't fit in a register of the same size. That's an issue that all integer multiplies have. SIMD multiplies aren't special; they have the same issue that the scalar multiplies have.
>
> > do users really not care about the top half of the result?
>
> When users do care about the top half of the result, they just switch to a wider integer type that produces a wider result.
>
> > Or do they convert i8x16 into 2 i16x8 and multiply them?
>
> Yes, that's exactly how you do it - for scalars, you would just convert two i8s into two i16s and then do a single i16.mul.
>
> Some architectures do have instructions for multiplying two integers and storing a result that's twice as wide, e.g., the x86 BMI2 ISA has the mulx and mulxq instructions for doing this, but these are rarely used in practice.
>
> @dtig
>
> > I think this issue is more polling whether this instruction is practically useful in applications.
>
> I read this as whether integer multiplies that lose the upper bits are generally useful in practice.

Yes, that is the question. I've come across them in benchmarks, but don't have a good sense for how widely they are used in practice.

> In the discussion we had about i8 multiplies, the main argument was that with multiplies an i8 is ridiculously easy to overflow (but then again this argument applies to i8.mul as well). I don't think this argument holds for i64.

It's not the same argument, but I was citing this as a precedent of them not always being used in practice, not necessarily for the same reasons.

@gnzlbg
Contributor

gnzlbg commented Jul 26, 2019

Yes, that is the question. I've come across them in benchmarks, but don't have a good sense for how widely they are used in practice.

It is unclear whether you are talking about integer multiplies that discard the upper bits or about multiplies that return both the lower and upper bits of the result.

@arunetm
Collaborator

arunetm commented Jul 26, 2019

We have been prioritizing use cases and the possibility of efficient implementations as the key criteria for defining instructions and their semantics. In the case of multiplies (scalar or SIMD), there are obvious workarounds when the higher-order bits are necessary for the application, so a change will make sense only if we have convincing use cases that can gain realistic benefits.

I tried, unsuccessfully, to find some real use cases. I think this is the concern @dtig mentioned above, i.e. the justification for adding multiplies that return the lower and upper bits for > 8-bit lanes. It would be great if anyone can point to workloads that would help us here.

Wasm scalar offers only i32.mul and i64.mul. Going back to @ngzhian's question, the argument for having i64x2.mul is, IMO, identical to the one for i64.mul.

@mratsim

mratsim commented Jul 27, 2019

Yes, integer multiplication that keeps only the low half is useful.

Scientific computing

I use those to implement fast integer matrix multiplication.

Image processing

Images are stored as integers, and lots of image processing (blurring, edge detection, ...) involves convolution or cross-correlation, which requires multiplication.

Cryptography

To implement portable big-int multiplication, the easiest approach is the Karatsuba algorithm, and after all the multiplications only the low half is kept.

@dtig
Member

dtig commented Jul 31, 2019

Thanks for your reply, closing this issue as the question has been answered.

@dtig dtig closed this as completed Jul 31, 2019