Make dropping of values explicit #694
Conversation
More specifically:

- Values can no longer be discarded implicitly.
- Instead, there is an explicit `drop` operator.
- To compensate, store operators no longer return a value.

Also fixing a bunch of other out-of-date text on the way.
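The rule summarized above — that a value produced mid-block can no longer vanish implicitly, but must be consumed by an explicit `drop` — can be illustrated with a toy checker. This is a hypothetical sketch, not the actual spec validation algorithm; the operator names and the `validate_block` helper are made up for illustration, and each "expression" is modeled only by how many values it yields.

```python
# Toy model: map each operator name to how many values it yields.
# "drop" is special-cased: it consumes the value produced just before it.
YIELDS = {
    "i32.const": 1,
    "call_returning_i32": 1,   # hypothetical side-effecting call with a result
    "i32.store": 0,            # stores no longer return a value under this PR
    "set_local": 0,            # set_local no longer returns a value either
    "drop": -1,                # discards one value
    "nop": 0,
}

def validate_block(exprs):
    """True iff every expression but the last leaves nothing on the stack."""
    for i, op in enumerate(exprs):
        effect = YIELDS[op]
        is_last = i == len(exprs) - 1
        if effect == -1:
            continue  # the drop itself yields nothing
        if effect == 1 and not is_last:
            # A mid-block value must be followed by an explicit drop.
            if i + 1 >= len(exprs) or exprs[i + 1] != "drop":
                return False
    return True

# Before this PR, the call's unused result was discarded implicitly:
assert not validate_block(["call_returning_i32", "i32.store", "i32.const"])
# After, the producer must insert an explicit drop:
assert validate_block(["call_returning_i32", "drop", "i32.store", "i32.const"])
```

The last expression is exempt because it provides the block's fall-through value.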
The types of a block's top-level expressions are still not known until the end of the block. This seems to fundamentally conflict with the proposal in #685, which would allow expression values on the values stack to be re-used. I believe this re-use of definitions is compelling and will need to be supported. I think what should be done is to push one value on the values stack for each value in a result expression: an expression returning zero values pushes zero values on the stack, and a multiple-values call would push multiple result values. Blocks might just need to define the number of values from the stack that they return for the fall-through.
@JSStats, I don't follow half of your questions, but some concrete answers:
The PRs lgtm. A concrete benefit that motivated us is that this change avoids pushing a lot of implicitly-discarded values that a single-pass baseline compiler (which we're building for first-tier codegen) would otherwise have to hold onto, thereby avoiding a lot of stack spilling. Also, with this change (and set_local/store returning void), pretty much the only use of `drop` should be for calls whose results are ignored. A separate question is whether we should add a `tee_local`.
This is by far the biggest code change brought about by this PR. Previously all implementations supported this. We discussed this previously here, and there are some numbers there that show around half of …

Numbers on BB: without …, there would also be some loss from not having … This suggests that removing return values from these operators has a real cost. On the other hand, there are arguments for taking some % of regression here.
@rossberg-chromium This lacks a plan for multiple-value calls. You could argue that multiple-value calls need to write their results to local variables, but the same could be done for all operators (the expressionless encoding I explored), and that experiment suggested that the values stack is an important resource and that avoiding local variables speeds SSA decoding. So I think there may be compelling reasons to have multiple-value calls write their results to the values stack too. What is your proposal for multiple values? This is a critical question for this PR. If multiple-value call results are to be written to separate entries on the values stack, then … If multiple-value tuples are single entries on the values stack, then … Being able to reuse values up the stack is also a compelling feature, and a far more general and cleaner solution.

@kripken If people are going to give up on the wasm encoding being space efficient, optimize for quick decode and code generation, and leave it up to a compression layer to be space efficient, then we can optimize for other criteria: fast decode and compilation. If so, then the compiler needs the live ranges of definitions, so let's solve this general problem. The compiler does not need to drop any values from the values stack to solve it; rather, it could leave all definitions on the values stack and flag their last use or unused definitions. Being able to drop the top stack value can then be seen as an optimization for a common case.

I think the single-pass compiler experiment is very valuable and should be pushed hard, and perhaps some experimental encodings explored for this. For example, I see no reason why it could not also do type derivation and optimize away some of the bounds checks, all in a single pass. The implications for register allocation are very interesting.

What would be the best encoding for such a compiler, an encoding that supports multiple-value calls efficiently too, and how much would this cost in encoding-space efficiency? I think that is where the effort should be, where we will get some insights.
I did some further measurements by writing a binaryen pass that can remove reliance on this. The first thing is that it looks like there is no usage of … For …
I had proposed that we have separate …

Thanks @kripken, that's very useful data. I agree that this suggests that we should probably have a `tee_local`.
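The split being proposed can be sketched in a few lines. This is a hypothetical illustration of the semantics, not spec text: under this PR, `set_local` stores a value and yields nothing, while a `tee_local` variant (the name discussed in this thread, after the Unix `tee` utility) would store the value and also pass it through as its result.

```python
def set_local(locals_, index, value):
    """Store into a local; yields nothing (void) under this PR."""
    locals_[index] = value
    return None

def tee_local(locals_, index, value):
    """Store into a local AND pass the value through as the result."""
    locals_[index] = value
    return value

# Toy local-variable frame.
frame = [0, 0]
assert set_local(frame, 0, 42) is None and frame[0] == 42
assert tee_local(frame, 1, 7) == 7 and frame[1] == 7
```

The point of `tee_local` is to recover, in the explicit-drop world, the one common pattern where `set_local`'s old return value was actually used.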
@kripken I don't think your results are good. Consider your test example:

this could be transformed into the more compact:

It appears to cost a …
Aww 😢 set_local and store returning values was one of my favorite features; I'm not surprised that its removal harms optimized code size significantly. Is implicit drop measured to be harmful to compiler performance, especially compared to the opportunity cost of disabling the compiler optimizations enabled by stores-returning-values? The name "tee" is completely foreign to me (I prefer titzer's name idea). Where did it come from? @kripken What's BB?

To be clear, stores-returning-values and set-local-returning-values are no problem for any optimizing compiler; they only make a difference to interpreters and baseline compilers.

@titzer Forgive me, what's the definition of "baseline compiler"? (And to be clear, the performance I was talking about was the back-end compiler, not the front-end optimizer.)

@kripken Also, I recall seeing code making good use of the results of store expressions, for example storing a constant to multiple addresses.

@qwertie "baseline compiler": a very quick compilation to machine code, perhaps generating code in a single pass while decoding. I wonder if it really makes a difference to the baseline compiler, as it could keep the value in a register and the following …

I expect wasm to be optimized for real life, so give me a thumbs up if I have this straight: you want to remove implicit drop in order to benefit the first phase of two-phase JITs, in which the first phase ("baseline compiler") needs to be ultra-fast and the second phase is the kind of optimizing JIT that uses on-stack replacement?

@kripken Has it been discussed to have multiple simultaneously valid variations of WASM that add or leave out these sorts of aspects? I'm wondering if there could be profiles that could be manually selected for performance vs. size vs. compile time, and whether they're different enough to be worthwhile.
@qwertie The 'baseline compiler' is a complete single-phase compiler, not a 'two-phase JIT'. It might be best to take a look at the actual code, which is quite interesting; see https://bugzilla.mozilla.org/show_bug.cgi?id=1232205 This PR helps this single-pass compiler by dropping expression results from the values stack earlier, rather than all at the end of the block. So it knows they are not live and can avoid flushing them to the memory stack. Personally I think there might be better solutions to this challenge.
@kripken Thanks for the measurements! What is the size increase for brotli and lzma? I think that if layer-0 size stays mostly unchanged (as it sounds like it would be if we add …)
So far I'm really worried about this one. Be careful about pulling the trigger!
So is the baseline compiler intended to be the (only) final production compiler for wasm in FF?

No, it's a first tier. But on weak mobile/IoT hardware, it may end up running code for a "long" time while the full optimizing compilation completes in the background. Moreover, I think we'll see a variety of compilation schemes for wasm over time, not just beefy browsers, so when low-hanging fruit like this is available at low cost (assuming …)

@JSStats: you are correct, the second measurements are not fully optimal. I think both sets of numbers are useful, for that reason. Each has limitations; the truth is probably in the middle. (The transformation you suggest should be added as a binaryen optimization pass, but it doesn't exist yet.) @qwertie: BB is BananaBread, the codebase I measure a lot on. It's a fairly large (~100K lines) game engine with lots of different types of code in it (C style, C++ style, float, int, parsing, physics, AI, rendering, etc.). I think it's big and varied enough to be representative, and small enough that it's easy to get measurements on quickly. @taisel: I don't think I remember that being suggested. Perhaps worth opening an issue for discussion.
I think @qwertie is right about the efficiency. We know a single …

Option 2 seems like it has to be more efficient, and option 2 is possible now. But I'm not opposed to this change; I just think the baseline benefits do not justify it. There might be other benefits though:
Other than the aesthetic benefit of being able to think of execution as both evaluation of an AST and interpretation of a stack machine (and having both mean the same thing), with this change we are at a point where, post-MVP, and only after significant measurement and experimentation, we have the option to loosen the validation predicate on the postorder bytecode and allow more general stack machine operations (…)

@lukewagner You have a compelling point about cheap IoT devices. In your most recent message, I think what you're saying is that the switch to explicit drop could allow the addition of certain features in the future, after we've had ample time to study what is most useful to add. Is that so? Can you describe any specific feature(s) that could be done with explicit drop that would be incompatible with an implicit-drop regime?

@JSStats: I wrote a pass to get rid of the extra blocks you noticed, WebAssembly/binaryen#540, and with that we go from around 9% more to around 6% more, which is in line with the previous numbers. So overall it looks like the first set of numbers is now confirmed (6% more nodes, 5% larger binary, 5% larger after gzip). @lukewagner: On the new data just mentioned, LZMA is 4.8% bigger, brotli is 7.3% bigger.
@qwertie A use case for an operator table, 'An operator table could allow a choice to discard the expression results of an operator', is noted in #678. @kripken Thank you for the numbers and the extra pass. @lukewagner I don't see how this PR moves to a more 'stack machine' style or unblocks a future path to this. Could you give a specific example of how a move to a stack-machine style is blocked currently? Having some operators push their results onto the values stack, as currently done, seems consistent with a stack machine. Getting repetitive, but I really think people need to consider the multiple-value use case in such decisions. Further, perhaps shoehorning the encoding into a stack ordering is part of the problem. How much better could the code generated by a single-pass compiler be without the stack-ordering restrictions? The stack ordering might not be the best order to minimise register pressure, and reordering might not be practical for a single-pass compiler, but perhaps the producer could take this into account if not constrained to encode in the stack order.

The sexpr-wasm single-pass compiler to a stack machine was very useful for me in exploring some ideas, and if we had something similar that targeted a simple virtual-register-based machine then perhaps it would be easier to explore some of these questions.
@kripken Not throughput data yet, just the coarse measurement from @lars-t-hansen that this removes 687,251 spills on AngryBots. Naively assuming 4 bytes for each (ARM32), that's >2MiB of code, which is significant. @JSStats That is validation-time; with this PR, the runtime semantics do not need to push anything for void-returning ops. If validation constraints were relaxed in the future, then even validation wouldn't need to push anything on the stack for void-returning ops.
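As a quick sanity check of the estimate above (using only the figures stated in the comment — 687,251 spills at an assumed 4 bytes each on ARM32):

```python
# Reproduce the back-of-the-envelope arithmetic from the comment.
spills = 687_251
bytes_per_spill = 4  # assumption stated above for ARM32

total_bytes = spills * bytes_per_spill
assert total_bytes == 2_749_004            # ≈ 2.62 MiB
assert total_bytes / (1024 * 1024) > 2     # consistent with ">2MiB of code"
```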
@lukewagner: 2MB out of how much? I also think it would be good to measure a 1.5-pass baseline. The 2MB figure is without that, I assume, i.e., it's doing the worst-case behavior on every set (assuming it can return)?

@lars-t-hansen Do you have the total code size for AngryBots? @kripken Based on rough compile-time data we have so far, the baseline is fast enough that raw decoding is a significant % of the total cost, meaning that taking a second pass of any kind would significantly increase overall compile time.
@lukewagner I do not have any total code size numbers, but I'll try to collect some. I expect that, though it may be 2MB of code, it's 2MB out of a lot of MB. (And it's not 2MB on x64, but probably rather less, since push-one-register averages less than 1.5 bytes.) @kripken Adding the drop analysis to the baseline compiler is probably not very interesting, because it would slow down the compiler and that defeats its main purpose. It's probably better to just live with the extra memory traffic in that case. (All the traffic will be writes to the stack top, probably not the most expensive operation in practice.) That said, I think it would not be a huge cost to perform the experiment. As a general remark, I can absolutely live without the change, certainly until better data are available about the impact on the generated code's size and performance. At the moment, our baseline compiler generates ho-hum code, meaning it does not optimize things it could, does not keep values in registers when it should, and pushes values when it might have avoided it (within the limit of none of the necessary analyses costing much): in short, the data we're using for the decision are not the best. It seems to me that the discussion here is fruitful, but from my point of view we do not have to rush a decision, so long as we do not close off the avenue for a drop opcode in the future.

It may be useful to look at the sexpr-wasm interpreter here. It runs two passes over the data, where the first pass performs type checking and determines which values are discarded, and the second pass emits the stack machine opcodes. I added some instrumentation to the release build and ran it over AngryBots:

It's not easy to tell how much would be saved with explicit drops, but I wouldn't be surprised if it was a significant amount of that first pass.

Thanks for the data; that looks like a ~5% cost increase for the initial pass (plus some book-keeping data). If the extra stack traffic looks bad, it may be worth paying 5% at compile time to get rid of it.

Very interesting data and analysis, thanks @binji, @lars-t-hansen, @lukewagner. That's more than enough to convince me, for this PR + tee. Without tee we do have a simpler design (which I like!), but the % code size increase (5% on total nodes, binary size, and gzip/lzma/brotli sizes, as mentioned above) is worrying. So overall I think I prefer this PR with tee.
(noting addition of …)

Some data: I am writing a pass to convert modules before this change to after this change. It looks like it causes a 1% increase in the number of AST nodes (measured on BB), which is more than I had expected. The culprit is mainly calls, things like libc …
I don't think … There are reports above that the results of store operators are not frequently used. If this has been investigated and understood and still holds, then a separate PR might change them to not return values too. Another small step.

The LLVM wasm back end has an optimization for memcpy et al. in which it replaces uses of the first operand with uses of the return value. It would be interesting to see how that affects the metrics here.

@JSStats, …

@rossberg-chromium Yes, I understood this. But adding a …

@JSStats, without this PR there's no point in adding a variant of set_local that doesn't return anything.

@rossberg-chromium I don't see it that way. One of the use cases for the drop PR is to eliminate the excess values on the values stack, and adding a variant of set_local that returns no value helps with that too. It will help a lot for an expressionless style of code where every operator with a result sets a local variable!

Ok, so several lgtms here; can we merge this and create a separate PR on …?
@lukewagner I don't see any harm in adding the … However, the following change is very significant and is making a distinct choice on the future path of wasm: 'In all constructs containing block-like sequences of expressions, all expressions but the last must not yield a value.' Blocks currently have the potentially useful property of discarding excess values, and this could well be a strength for wasm. Being able to push values on the values stack that are not dropped in last-in-first-out order would significantly extend the expressiveness of the wasm language: it would allow a constant to be pushed and reused many times, and it would allow multiple values from … I don't know which is the best path, but this change does appear to be making a distinct choice for wasm, and I don't think we have the data. With this proposed change, wasm appears to be moving towards depending heavily on local variables, towards the expressionless style of coding for code patterns that do not fit the restricted stack-expression pattern, and this might also be ok. Does a decision on this really need to be made now, and do we have the data? Or could this change be removed from the PR for now, adding the …?
This change is actually the conservative one in that the validation rules could be backwards-compatibly relaxed in the future (re-allowing implicit drop), if there were a strong reason. |
@lukewagner That assumes that further changes are not made taking advantage of this limitation. You still have not articulated a plan wrt multiple values, which makes it impossible to assess this matter. Do you still have no plan? Would two weeks be a reasonable time frame for you to articulate a concrete plan in this area, or to concede that you have no plan here?
@JSStats no plan wrt multiple values is needed as a prerequisite for this change. If someone later proposes "further changes" you don't like, it would make more sense to complain about those instead.

@qwertie Perhaps, but I can see it now: people will point back here and say 'the decision was made here, and if you didn't like it then you should have objected, bad luck'. Also, it is more difficult to explore the alternative development paths if one is closed here, and if there is no plan then more development is needed. I am not objecting to people adding the …

"The plan" is to have a lot more data and experience so we know the right route forward post-MVP re: multiple values. This PR preserves the flexibility of future options while providing immediate short-term benefits to single-pass compilers.
Merging with LGTMs above and general consent that this is the right direction. |
Luke, Ben, Dan, and I have been discussing this change for slightly simplifying Wasm and allowing a more natural interpretation of Wasm as a stack machine. The main changes are:

- Values can no longer be discarded implicitly; instead, there is an explicit `drop` operator.
- To compensate, store operators no longer return a value.

The constructs affected by this are mainly blocks and block-like sequences. Before, all expressions in a block could yield a value, and it would just be dropped on the floor implicitly (except the last one). Now we require explicit drop operations in those cases. With stores no longer returning a value, the only places where we expect drops to actually arise are calls to side-effecting functions that also return a value, but this value isn't used.

(Also fixing a bunch of other out-of-date text on the way.)
The corresponding spec changes can be found here for more details:
https://github.com/WebAssembly/spec/compare/void