This repository was archived by the owner on Dec 22, 2021. It is now read-only.

i64x2.mul: useful or not? #88

Closed
ngzhian opened this issue Jul 25, 2019 · 9 comments

Comments

@ngzhian
Member

ngzhian commented Jul 25, 2019

I was prototyping some i64x2 instructions, and was just going down the list of existing i32x4 instructions, and encountered mul.

Multiplying two 64-bit numbers requires 128 bits to hold the result, so an i64x2.mul truncates the result to fit in the 64-bit lanes. I can't think of interesting use cases for this behavior, so I'm polling via this issue to see whether an i64x2.mul even makes sense.

@gnzlbg
Contributor

gnzlbg commented Jul 25, 2019

i8.mul, i16.mul, i32.mul, i64.mul, i32x4.mul, i16x8.mul and i8x16.mul all require a larger type to hold the full result, and all are therefore defined to return the lower half of the result.

I don't understand why these semantics are, in your opinion, problematic for i64x2.mul but not for any of the other mul opcodes.

@ngzhian
Member Author

ngzhian commented Jul 25, 2019

That's very true about i8x16.mul and the rest; I completely missed those because I was looking at the arm64 mul instructions, and they don't define one for i64x2 :)
In that case, a follow-up question: how are SIMD multiplies used? Do users really not care about the top half of the result? Or do they convert an i8x16 into 2 i16x8 and multiply those?

@dtig
Member

dtig commented Jul 25, 2019

I think this issue is more polling whether this instruction is practically useful in applications. We've previously had this conversation for i8x16Mul as well, whether the resulting overflow handling is worth it (in #28). Asking whether there are practical applications for this behavior seems like a reasonable question.

@gnzlbg
Contributor

gnzlbg commented Jul 25, 2019

> how are simd multiplies used?

If you multiply two numbers that are too large, the result won't fit in a register of the same size. That's an issue that all integer multiplies have. SIMD multiplies aren't special; they have the same issue that the scalar multiplies have.

> do users really not care about the top half of the result?

When users do care about the top half of the result, they just switch to a wider integer type that produces a wider result.

> Or do they convert i8x16 into 2 i16x8 and multiply them?

Yes, that's exactly how you do it - for scalars, you would just convert two i8s into two i16s and then do a single i16.mul.


Some architectures do have instructions for multiplying two integers and storing a result that's twice as wide, e.g., the x86 BMI2 ISA has the mulx and mulxq instructions for doing this, but these are rarely used in practice. Even if you use the C intrinsic for these, LLVM almost never emits them, because there are alternative instruction sequences (e.g., imulq) that have the same cost but smaller code size.


@dtig

> I think this issue is more polling whether this instruction is practically useful in applications.

I read this as whether integer multiplies that lose the upper bits are generally useful in practice.

In the discussion we had about i8 multiplies, the main argument was that with multiplies an i8 is ridiculously easy to overflow (but then again this argument applies to i8.mul as well). I don't think this argument holds for i64.

@dtig
Member

dtig commented Jul 25, 2019

> > how are simd multiplies used?
>
> If you multiply two numbers that are too large, the result won't fit in a register of the same size. That's an issue that all integer multiplies have. SIMD multiplies aren't special; they have the same issue that the scalar multiplies have.
>
> > do users really not care about the top half of the result?
>
> When users do care about the top half of the result, they just switch to a wider integer type that produces a wider result.
>
> > Or do they convert i8x16 into 2 i16x8 and multiply them?
>
> Yes, that's exactly how you do it - for scalars, you would just convert two i8s into two i16s and then do a single i16.mul.
>
> Some architectures do have instructions for multiplying two integers and storing a result that's twice as wide, e.g., the x86 BMI2 ISA has the mulx and mulxq instructions for doing this, but these are rarely used in practice.
>
> @dtig
>
> > I think this issue is more polling whether this instruction is practically useful in applications.
>
> I read this as whether integer multiplies that lose the upper bits are generally useful in practice.

Yes, that is the question. I've come across them in benchmarks, but don't have a good sense for how widely they are used in practice.

> In the discussion we had about i8 multiplies, the main argument was that with multiplies an i8 is ridiculously easy to overflow (but then again this argument applies to i8.mul as well). I don't think this argument holds for i64.

It's not the same argument, but I was citing this as a precedent of them not always being used in practice, not necessarily for the same reasons.

@gnzlbg
Contributor

gnzlbg commented Jul 26, 2019

Yes, that is the question. I've come across them in benchmarks, but don't have a good sense for how widely they are used in practice.

It is unclear whether you are talking about integer multiplies that discard the upper bits or about multiplies that return both the lower and upper bits of the result.

@arunetm
Collaborator

arunetm commented Jul 26, 2019

We have been prioritizing use cases and the possibility of efficient implementations as the key criteria for defining instructions and their semantics. In the case of multiplies (scalar or SIMD), there are obvious workarounds when the higher-order bits are necessary for the application, so a change will make sense only if we have convincing use cases that can gain realistic benefits.

I tried, unsuccessfully, to find some real use cases. I think this is the concern @dtig mentioned above, i.e. the justification for adding multiplies that return the lower and upper bits for > 8-bit lanes. It would be great if anyone can point to workloads that would help us here.

Wasm scalar offers only i32.mul and i64.mul. Going back to @ngzhian's question, the argument for having i64x2.mul is, IMO, identical to the one for i64.mul.

@mratsim

mratsim commented Jul 27, 2019

Yes, integer multiplication that keeps only the low half is useful.

Scientific computing

I use those to implement fast integer matrix multiplication.

Image processing

Images are stored as integers, and lots of image processing (blurring, edge detection, ...) involves convolution or cross-correlation, which requires multiplication.

Cryptography

To implement portable big-int multiplication, the easiest approach is the Karatsuba algorithm, and after all the multiplications only the low half is kept.

@dtig
Member

dtig commented Jul 31, 2019

Thanks for your reply, closing this issue as the question has been answered.

@dtig dtig closed this as completed Jul 31, 2019