bswap and movebe equivalents? #1426
Comments
I guess another option is to change the data format to store as little-endian, since most machines in my use case already use that (and wasm always does). I just find it odd that "network" byte ordering is different from "web assembly" byte ordering. If the native format of the data structure is LE, then the generated code obviously gets a lot better:
popcnt_u32: # @popcnt_u32
local.get 0
i32.popcnt
i32.const 0
local.get 0
i32.select
end_function
popcnt_u64: # @popcnt_u64
local.get 0
i64.popcnt
i32.wrap_i64
end_function
is_little_endian: # @is_little_endian
i32.const 1
end_function
swap_u32: # @swap_u32
local.get 0
end_function
swap_u64: # @swap_u64
local.get 0
end_function
read_u32_swap: # @read_u32_swap
local.get 0
i32.load 0:p2align=0
end_function
read_u32: # @read_u32
local.get 0
i32.load 0:p2align=0
end_function
read_u64_swap: # @read_u64_swap
local.get 0
i64.load 0:p2align=0
end_function
read_u64: # @read_u64
local.get 0
i64.load 0:p2align=0
end_function
But changing the data format is often not an option, especially for network protocols that use "network" byte ordering. |
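To make the difference concrete, here is a minimal sketch of byte-at-a-time readers for both byte orders. The names `read_u32_be`, `read_u32_le`, `read_u64_be` and `read_u64_le` are hypothetical and the bodies are assumptions; only compiled output appears in this thread, not the source that produced it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Big-endian ("network order") read: the first byte is the most significant.
   Written with shifts so the result does not depend on host endianness. */
uint32_t read_u32_be(const uint8_t *p) {
    assert(p != NULL);
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Little-endian read: the first byte is the least significant.
   On a little-endian target this is equivalent to a plain unaligned load. */
uint32_t read_u32_le(const uint8_t *p) {
    assert(p != NULL);
    return  (uint32_t)p[0]        | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* 64-bit variants built from the 32-bit halves. */
uint64_t read_u64_be(const uint8_t *p) {
    return ((uint64_t)read_u32_be(p) << 32) | read_u32_be(p + 4);
}

uint64_t read_u64_le(const uint8_t *p) {
    return ((uint64_t)read_u32_le(p + 4) << 32) | read_u32_le(p);
}
```

The little-endian readers are the ones a compiler can typically fold into a single `i32.load` or `i64.load` on wasm; the big-endian ones need a byte swap on top of the load.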
There is no byteswap equivalent yet, but there is a note about it in Future Features. |
Btw, the Wasm MVP supports popcnt ops. Special-casing zero input is really redundant:
local.get 0
i32.popcnt
i32.const 0
local.get 0
i32.select
Perhaps LLVM should improve this. |
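For reference, `i32.popcnt` already returns 0 for a zero input, which is why the `select` guard is redundant. A minimal sketch of two ways to write the count in C follows. The loop form is only an assumption about the kind of source that gets pattern-matched into a popcount; `__builtin_popcount` is the GCC/Clang builtin, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Kernighan-style loop: each iteration clears the lowest set bit,
   so the loop runs once per set bit and zero times for x == 0. */
uint32_t popcnt_loop(uint32_t x) {
    uint32_t n = 0;
    while (x != 0) {
        x &= x - 1;
        n++;
    }
    return n;
}

/* Direct use of the compiler builtin (GCC and Clang). */
uint32_t popcnt_builtin(uint32_t x) {
    return (uint32_t)__builtin_popcount(x);
}

/* Usage: both forms agree, including on zero. */
void popcnt_example(void) {
    assert(popcnt_loop(0u) == popcnt_builtin(0u));
    assert(popcnt_loop(0xF0F0u) == popcnt_builtin(0xF0F0u));
}
```

One would expect the builtin form to avoid the zero guard, since it expresses the operation directly rather than through a loop the optimizer has to recognize.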
Good to know which builtins are available, but as you see in my code, it did correctly detect the pattern and generated the |
Fwiw, we have been anticipating the addition of |
That's great! Though I wonder if there is any hope it will land any time soon? Also from that document, I'd love to have |
In general, proposals advance when sufficiently motivated people push them forward. That person could be you!
This means that the cost of the proposal is relatively low.
It's of course some work to advance a proposal, and there is some fixed amount of overhead per-proposal across the board. This could maybe be amortized if there are other related instructions being proposed, but it's certainly not required. You'd also probably be able to get some help once you advance to a certain point (e.g. getting implementations, semantics, etc). Especially for uncontroversial proposals, often this mostly involves nagging the right people at the right time. |
+1 to what @dschuff says here. It looks like it'd be easy to make a case for byte-swap instructions. |
What code bases would need a PR to prototype this? I assume at a minimum, a compiler like maybe the wasm backend to llvm and then a wasm runtime like many of the interpreters or JIT/AOT compilers. Any tips on which ones would be good to help prove the feature technically? |
LLVM, maybe binaryen, and then at least one good VM that targets native code (to make the performance case) - v8, spidermonkey, and wasmtime are obvious choices. I don't know about the others, but adding single instructions to spidermonkey for this kind of prototyping is not particularly hard, though sometimes laborious. (At the last SIMD meeting we discussed very briefly whether we could have a lightweight process for moving single instruction proposals such as this along with minimal friction, and @dtig said she would bring it to the whole group. Maybe part of that process would be some guidelines on how to go about making the case for a new instruction.) |
I've opened WebAssembly/meetings#857 to gather ideas on how we can streamline the process. Based on responses there, I can add to the process documentation if we decide that there is enough overhead that we can get rid of for smaller proposals. Prototyping a single opcode in V8 should also be fairly straightforward; there are some layers to plumb through, and @ngzhian put together some helpful documentation on how to do this in V8. To prove this feature technically, +1 to what @lars-t-hansen replied above, and adding to that some spec tests or a side-by-side comparison of code reduction on adding |
We have a checklist of things to do to add new instructions to Binaryen as well: https://github.com/WebAssembly/binaryen/blob/main/Contributing.md#adding-support-for-new-instructions. For prototyping on the tools side, it would probably be sufficient to model the new instructions as function calls in C/C++/LLVM and only implement the instruction in Binaryen. Then you could add a trivial Binaryen pass to lower the import calls emitted by LLVM to the new instructions. Edit: Although it's not that hard to add a new instruction to Clang/LLVM either. |
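A minimal sketch of what modelling the instruction as a function call could look like on the C side. The import module `"proto"`, the import name `"bswap32"`, and the function names are placeholders invented for illustration; the Binaryen pass that would rewrite the import call into the new instruction is not shown. `import_module` and `import_name` are Clang attributes for wasm targets, and a native fallback is included so the same source also builds outside wasm.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#if defined(__wasm__)
/* Hypothetical import: calls to it are what a Binaryen pass would
   lower to the prototype byte-swap instruction. */
__attribute__((import_module("proto"), import_name("bswap32")))
uint32_t proto_bswap32(uint32_t x);
#else
/* Native fallback using the GCC/Clang builtin, for testing off-wasm. */
uint32_t proto_bswap32(uint32_t x) {
    return __builtin_bswap32(x);
}
#endif

/* Load a big-endian u32 from a buffer.
   Assumes a little-endian host, which is always true on wasm. */
uint32_t load_be32(const void *p) {
    uint32_t v;
    memcpy(&v, p, sizeof v);   /* unaligned-safe load in host byte order */
    return proto_bswap32(v);   /* swap to get the big-endian value */
}

/* Usage: swapping twice gives back the original value. */
void proto_bswap_example(void) {
    assert(proto_bswap32(proto_bswap32(0x11223344u)) == 0x11223344u);
}
```

The appeal of this approach is that LLVM needs no changes at all: it just sees an ordinary imported function.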
Thanks everyone. I'm not sure when I'll have availability to prototype this, but it sounds like a lot of fun!
I would think it would help the performance for both native and interpreted. Surely a single interpreted instruction is much faster than dozens of them. But yeah, proving both should be helpful. My hope is to prototype the following:
But I'm doing this in my sporadic free time, so if anyone else wants to speed things up, feel free to jump in. Then to test it, write a codec for some existing wire format that requires network byte ordering and compare:
Assuming I get all that done, I'll probably need to focus on other projects for a bit and will likely need help pushing through all the processes required. But my hope is this will give the idea a jump start. |
I could help with binaryen part |
Related to #422 and #725, I'm wondering what I should do to have fast reads of big-endian numbers from a buffer.
My use case is I have a large data structure that uses network byte ordering (aka big-endian) and wasm is always little-endian.
I'm trying to write C code for clang that emits efficient bytecode, but it gets really nasty whenever I need to load a BE uint64_t to a native number in memory.
Suppose the following test program:
If I compile this to native assembly, clang/llvm detects the patterns for the byte swapping and generates optimal instructions:
Notice how the functions reduce down to things like movbe, bswap and popcnt!
But if I try the same thing with webassembly, it doesn't optimize the swaps at all:
Is there anything I can do to help this not bloat so terribly? Are there proposals for adding bswap and/or loadbe to the instruction set? The linked issues above seem to be closed with no progress in years.
I understand that not all real machines have these instructions, but many do. I would say the vast majority of machines that support wasm have these fast instructions natively. And even if the underlying machine can't emit a single native instruction when compiling the wasm to native, it can emit whatever is optimal for the platform. But the real effect will be much smaller wasm code downloaded over the network!
Please reconsider adding these to the spec or if there is an existing primitive that gets me close, please point me in the right direction.
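For illustration, a minimal sketch of a byte swap spelled out with shifts and masks, which is roughly the shape of code a wasm producer has to emit without a dedicated instruction. The function names are hypothetical; `__builtin_bswap32` and `__builtin_bswap64` are the GCC/Clang builtins that map to `bswap` on x86.

```c
#include <assert.h>
#include <stdint.h>

/* 32-bit swap: move each byte to its mirrored position. */
uint32_t swap_u32_shifts(uint32_t x) {
    return  (x >> 24)                |
           ((x >> 8) & 0x0000FF00u)  |
           ((x << 8) & 0x00FF0000u)  |
            (x << 24);
}

/* 64-bit swap: swap each 32-bit half, then exchange the halves. */
uint64_t swap_u64_shifts(uint64_t x) {
    uint32_t lo = (uint32_t)x;
    uint32_t hi = (uint32_t)(x >> 32);
    return ((uint64_t)swap_u32_shifts(lo) << 32) | swap_u32_shifts(hi);
}

/* Usage: the spelled-out versions agree with the compiler builtins. */
void swap_example(void) {
    assert(swap_u32_shifts(0x11223344u) == __builtin_bswap32(0x11223344u));
    assert(swap_u64_shifts(0x0102030405060708ull) ==
           __builtin_bswap64(0x0102030405060708ull));
}
```

Each of these expands to several shift, mask, and or operations per value, which is where the code-size cost comes from.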