Design goals: fast interpreter, polyfill ease, and preorder versus postorder encoding #239
I think a fast interpreter is a worthy goal. A lot of code on the web ends up executed exactly once, and just interpreting the code can be faster than JITing it first. So I would expect wasm interpreters to be used in production.
I think a fast interpreter is a worthy secondary goal. WebAssembly as we know it today isn't going to be a great choice for code on the web which is expected to be executed exactly once. JavaScript handles that use case quite well already.
@kripken Agreed. We will definitely be shipping an interpreter, and we have found interpreter perf is pretty good, and is even acceptable for startup on asm.js (we had a baseline JIT but it didn't really buy much over a baseline interpreter). If we can reduce startup time by not having to do a bytecode generation pass on the main thread, I think that is a noble cause.
At the minimum you have to do a pass to validate the types. Given that pass, I think fast interpretation of the direct wire format is going to be tough.
I agree that interpreting is secondary to baseline or full-optimizing JITing. However it still feels quite important. It would be nice if wasm had an option to annotate a module with "-O0" or "-O2" etc., which PNaCl has now, and then websites could mark code as "-O0" when they know it executes only once, in which case the browser could avoid a large amount of overhead. It would be far better than what JS engines can do now, in fact, so it would be a big improvement for the web.
@kripken I'd rather annotate on a per-function basis, instead of the whole module.
Agreed, that would be even better.
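As a purely hypothetical illustration of that per-function idea (nothing like this exists in the design; the names are invented), the hints could be modeled as:

```ts
// Hypothetical shape only; it just illustrates per-function granularity.
interface OptimizationHint {
  funcIndex: number;   // which function the hint applies to
  level: "O0" | "O2";  // "O0": likely run-once, skip heavy JIT work
}

const hints: OptimizationHint[] = [
  { funcIndex: 0, level: "O0" },  // e.g. one-shot initialization code
  { funcIndex: 1, level: "O2" },  // e.g. a hot inner loop
];
```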
The polyfill consideration is pretty important and I've seen a mixture of views expressed here, so it's worthwhile to figure out whether there's any agreement on that part of the premise for any pre/post-order decision we make. Do we believe that the polyfill will live around a year at most, as some people have expressed? I personally consider that implausible based on the history of internet technology adoption, but it is possible if every involved party pushes really hard. If we collectively think that sort of aggressive adoption schedule is possible and commit to push hard for it, that frees us to not spend too much time worrying about the polyfill when designing the file format. If we can't ensure a compressed time-scale, any format decisions made in favor of an interpreter could compromise polyfill performance for years.
From my perspective, it seems likely the polyfill will continue to exist for quite a while.
I view WebAssembly as something that, if designed well, is a standard (or a precursor to one) that will last over 100 years. With today's self-updating browsers, the life of the polyfill will be very brief in comparison.
@kg Given that "Safari is the new IE", I suspect that the Polyfill will be around for a long time.
For the last couple days at lunch @LouisLaf and I have been going back and forth on the polyfill. One thing we've discussed is having a wasm feature check and then the server can serve up either the polyfilled js (gzipped) or the wasm. Obviously, this will not be quite as small as the gzipped binary, but from the numbers in the FAQ it's still pretty good. If we do this, then we no longer need to consider the polyfill when discussing the binary encoding.
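A minimal sketch of that feature check on the client side. The `WebAssembly` global used here is an assumption (it is how detection eventually worked in shipping browsers, but no such API existed at the time of this thread), and the URLs are made up:

```ts
// Sketch: pick the wasm binary or the pre-translated JS based on support.
const supportsWasm =
  typeof WebAssembly === "object" &&
  typeof WebAssembly.instantiate === "function";

// Either pick the artifact directly...
const appUrl = supportsWasm ? "/app.wasm" : "/app.polyfilled.js";

// ...or relay the result so the server can decide what to serve up.
fetch(`/app?wasm=${supportsWasm ? 1 : 0}`);
```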
It reduces the need for quick translation of the binary encoding to JS, but we would still need a practical way to do that translation which emits efficient JS. Relying on separate downloads would make hosting such applications a little more complicated for people, but it's debatable if that's a serious problem or not.
@kripken Of course, but that is a matter of AST semantics, not the binary encoding. Right now, the binary encoding is being optimized for speed of the polyfill. If we decouple the polyfill, then the binary encoding can be optimized for other things.
Agreed, assuming we are ok with the hosting complexity, it would remove the need for focusing on quick translation to JS in the binary encoding.
Server side conversion does not seem like a viable approach. It requires smart, carefully configured servers instead of being able to host a wasm application anywhere. It also prevents an intelligent polyfill that generates code based on the target browser's feature support. It prevents any strategy other than AOT compilation of an entire module. It requires the server software/configuration to be updated in response to wasm revisions instead of putting a different polyfill url in the webpage. It complicates scenarios where multiple versions of the polyfill may need to be used (for example, running a very old wasm application on one page, and a very new one on another.) It complicates compatibility fallback in cases where you can't easily sniff support from the server - native module loads but browser's native implementation is too old so fallback to polyfill occurs, &c.
@kg It's true that there is additional overhead for hosting. But I don't really understand the rest of your argument.
How would this be used? What kind of feature testing would a client-side polyfill do that couldn't be relayed back to the server to have the appropriate code served up? Are we expecting the polyfill to be so polymorphic?
How? I certainly am not pushing for a full AOT compilation strategy and I don't see a conflict.
We can still keep the strategy where ops which are not yet implemented have js polyfill fallbacks. We should not have a case where someone has a wasm implementation, but it is so fundamentally wrong that a full js polyfill is needed. The only type of scenario I can think of is someone implements SharedArrayBuffer in js, but doesn't implement any threading in wasm. You have a wasm module that uses threading and you can't use it here, but it is fundamental to your app, so you must fall back completely to a polyfill. Is this even polyfillable in a single pass? And is that not information the client can figure out before deciding which content to download? And is this what we believe that polyfill will be used for? I hope we're not already conceding to non-conformant wasm implementations. Maybe I'm giving a strawman, but I really can't think of a good fallback case that this is blocking.
There are future language improvements on the table that a polyfill would ideally be able to take advantage of, like simd.js and int64 and weakrefs. We can't have the polyfill generate code for this offline at present. If we want to do offline code generation on the server for polyfill scenarios, this would mean wasm applications wouldn't be able to leverage these features until they're integrated directly into a native wasm VM. I think that's undesirable. There are other scenarios where serving the exact same code up to everyone could be problematic. If we try to do all of this in the native VM through bodges you will end up having to cross the FFI every time you interact with these experimental features, which is super not awesome.
I'm not sure how else to describe 'translate the entire wasm module on the server and feature detect to decide whether to send it to the browser' other than 'ahead-of-time compilation'.
This still implies crossing the FFI boundary repeatedly and has nasty implications for the shape of data on the stack and where values need to live. There are already scenarios where people have needed to generate 'almost asm' code instead of 'use asm' code for the short-term benefits. We're not talking 'fundamentally wrong' here, just 'things have changed a lot and we aren't doing explicit WebAssembly versioning'.
The feature detection proposals we've discussed are at least partially designed to enable you to early-detect whether or not a module relies on features your VM doesn't support. This would enable early fallback to a JS polyfill or interpreter. On the other hand, it's my impression that I'm the only person who cares about that scenario, so maybe it doesn't matter. See #90. The fallback philosophy that has been discussed so far has tiers (opcode can be implemented with a wasm fallback function; opcode can be implemented with a js fallback function; only way to implement is by running a polyfill that implements it.) The instruction table idea described near the bottom of that issue is important. I'm not sure what we believe the polyfill will be used for. But I think it will probably live at least 3 years, and we want to ensure that we can always fall back to it in any scenario where we want to experiment with sweeping changes/improvements to WebAssembly. We also want to empower people to prototype changes against real-world applications without requiring them to directly modify the code of a native VM or use the interpreter. I think having a high-performance, flexible polyfill that can load and run webassembly modules on the client is really important. Any solution that requires you to run something on the server and do sniffing to serve up files to clients positions us as akin to something like Dart or CoffeeScript instead of positioning us as a successor to technologies like asm.js and pnacl. A client-side polyfill (hopefully loaded off a CDN operated by a trustworthy browser vendor) is much easier to update and keep current than a sniffing mechanism built into HTTPDs. (Do we really think we can make sniffing work? I thought history had shown that it's always fragile.)
(Sorry for delay, on vacation; processing asynchronously.) After recent discussions with @MikeHolman and @LouisLaf, @sunfishcode and I were re-examining the pre- vs. post-order question. Their positive experience with using an interpreter as the warm-up baseline (instead of a dumb compiler) was pretty interesting and made interpreter perf more interesting than during previous considerations. So first, I think a key point is the one @titzer made above that an engine is probably going to want to do some streaming validation and transformation to an internal suitable-for-direct-interpretation bytecode anyway, regardless of how we design our binary format (unless we massively specialize for an assumed interpreter implementation, something I hope we can all agree is a bad idea). That combined with streaming compilation I think negates any startup advantage of interp over dumb-compiler; I think the main argument for an interp is just simplicity/portability. Next, I think the polyfill can get by well enough with postorder and thus we can drop polyfill from our consideration. In the worst case (assuming we want to keep things single-pass and avoid any intermediate analyses, which I think is critical for startup time), we can just pessimistically insert a fixed number of spaces that get back-patched to parens and coercions as necessary. (Preorder makes this unnecessary b/c you can usually judge whether parens/coercions are necessary before splatting out the children.) This may bloat the output size 2-3x though (hard to know w/o implementing). If this ended up hurting (maybe contributing to OOM or slow startup), I realized we could do better by integrating the polyfill with the specific layer and thus the specific layer format could be designed to be optimal for polyfilling. Since we're not standardizing the specific layer any time soon, these blemishes would be swept away in the long term. So what I think it comes down to is: which format is easiest to compile (to internal interp IR, dumb compiler, smart compiler) where "easiest" is both performance and simplicity. Initially, I thought preorder was the clear winner. That was mostly b/c I imagined a recursive descent parser (such elegance, much maths). However, it does seem like stack-based algos get grosser with preorder. (I'm thinking about pure expressions here; for control-flow ops, I assume we wouldn't require a pure postorder since not even interp wants that.) OTOH, postorder practically mandates a stack-based algo (which could be a good thing if it squeezes out browser variance!). There was also the argument that pre-order allows type-segregated opcode index spaces. However, as I was convincing myself in #58, type-segregation is probably not a size win at the raw layer. There is also the advantage that type-segregation requires less work to validate (ops are correctly-typed by construction, don't have to test), but it remains to be seen how much this wins in practice. So currently, I'm actually leaning slightly in favor of post-order, but I'd be most interested to hear from others implementing experimental wasm decoders/compilers (esp. @titzer, having already implemented preorder in v8-native, I think?).
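To make the back-patching idea concrete, here is a minimal sketch (one reading of the comment, not actual polyfill code): while streaming postorder input out as JS text, reserve fixed-width runs of spaces around each subexpression, then overwrite them once the parent opcode reveals which parens/coercions are needed.

```ts
// Sketch of pessimistic space reservation with back-patching.
const out: string[] = [];  // one character per slot, joined at the end

function reserve(width: number): number {
  const pos = out.length;
  for (let i = 0; i < width; i++) out.push(" ");  // pessimistic padding
  return pos;
}

function patch(pos: number, text: string): void {
  // text must fit in the reserved width; unused slots stay as spaces
  for (let i = 0; i < text.length; i++) out[pos + i] = text[i];
}

// A child expression is emitted before its parent op is known; once the
// parent turns out to be, say, an int add, wrap the child as `(` ... `)|0`:
const pre = reserve(1);
out.push(..."x + y");
const post = reserve(3);
patch(pre, "(");
patch(post, ")|0");
console.log(out.join(""));  // "(x + y)|0"
```

This is where the 2-3x output bloat estimate comes from: every subexpression pays for its reserved slots whether or not they end up used.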
The more I think about this, the only advantage that I see for a postorder encoding is: a.) if it were only within a "basic block", then one could interpret the basic block directly. The v8-native-prototype decodes a pre-order layout of the AST directly to a graph, which has some advantages: a.) The decoder knows when control flow is starting for ifs and loops.
For compilers, building the graph in a single pass in this way also has advantages. I could even conceive of a single-pass bytecode or even machine code generator. -B
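A rough sketch of single-pass, recursive preorder decoding in that style. The opcode names and numbering are invented, and strings stand in for the graph nodes a real decoder would build:

```ts
// Sketch: preorder means the opcode arrives before its children, so the
// decoder always knows what it is building before it recurses.
enum Op { I32Const = 0x80, GetLocal = 0x81, I32Add = 0x82, If = 0x83 }

function decodeExpr(bytes: Uint8Array, pos: { i: number }): string {
  const op = bytes[pos.i++];
  switch (op) {
    case Op.I32Const:
      return `${bytes[pos.i++]}`;            // immediate follows the opcode
    case Op.GetLocal:
      return `local[${bytes[pos.i++]}]`;
    case Op.I32Add: {                        // arity is known up front, so
      const a = decodeExpr(bytes, pos);      // children are plain recursive
      const b = decodeExpr(bytes, pos);      // calls
      return `(${a} + ${b})`;
    }
    case Op.If: {                            // control flow is visible before
      const cond = decodeExpr(bytes, pos);   // its body, so a graph builder
      const thenE = decodeExpr(bytes, pos);  // can open blocks at the right
      const elseE = decodeExpr(bytes, pos);  // moment
      return `(${cond} ? ${thenE} : ${elseE})`;
    }
    default:
      throw new Error(`bad opcode ${op}`);
  }
}
```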
@titzer Those are all great points. It seems like all but one concern control flow. I was thinking the same thing, which is why I added this caveat above: "for control-flow ops, I assume we wouldn't require a pure postorder since not even interp wants that."
The one non-control-flow advantage you mentioned was (b). This is mostly what I've been thinking about and what I was considering in my above comment: are there any advantages to preorder for pure expressions?
I would like to reiterate the importance of an interpreter for a runtime like wasm.
A post-order representation for expressions works much better for interpretation than pre-order, but a pure post-order AST representation really does not work great. You really need a better way to represent flow, since you don't want the interpreter to have to process code which isn't being executed. This is possible and includes all of the advantages @titzer pointed at in #223. Types can be inferred from leaves (adding more precise typing for locals would help here), so you can have types available right away. Flow can be done in a pre-order way, and the addition of target offsets for branches would make it efficient to interpret. So you could have something like: if (x > y) { x = y } else { y = x }. Something like this would be easy for an interpreter or a JIT. The CFG is well represented, and with precise types on x and y, all the other types are easy to infer. The polyfill would be quite easy to write but would need a simple stack.
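A sketch of what that mixed layout could look like for the if above. The opcode names and offset conventions are entirely hypothetical; the point is the shape (pre-order control flow with explicit branch offsets, small postorder expressions in between):

```ts
// if (x > y) { x = y } else { y = x }, flattened. Offsets are relative to
// the slot after the branch.
const code = [
  "get_local x", "get_local y", "i32.gt",  // postorder: x, y, then >
  "br_false +3",                           // skip the true arm when false
  "get_local y", "set_local x",            //   true arm:  x = y
  "br +2",                                 //   jump past the false arm
  "get_local x", "set_local y",            //   false arm: y = x
];
```

An interpreter can step through this linearly and never touch the arm it doesn't take, which is exactly the property being asked for.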
I agree with the goals of the interpreter for fast start-up and, e.g., run-once code.
@titzer What do you think about using pre-order for control flow, post-order for expressions?
I still think interpreting the bytecode directly is going to be tough. E.g., consider the layout of an if and its subexpressions.
Sorry, I mean, [if][cond-expression][true-expression][false-expression].
Yeah, inline representation of child nodes of variable length kinda wrecks any approach for efficiently interpreting the bytecode right out of storage. I think you end up having to decode the AST no matter what, and hope that the decoding pass is cheap.
@titzer The reason I'm favoring postorder for expressions isn't direct interpretation, but because it seems slightly more natural for iterative (i.e., non-recursive) decoding of expressions (not control flow). @MikeHolman Is that a proposal to include offsets in the raw binary format? I think this would have worse size and complicate/slow general validation for this one case. Assuming 'no', though, and you want offsets, then you'd need to translate to an internal bytecode with offsets (so it's not direct interpretation). I could also imagine an interpreter that JITed with a very very low hotness count (like 2), taking advantage of the fact that most code is dead or run-once and in this case, you might not even need offsets.
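For concreteness, this is the property that makes postorder pleasant for an iterative decoder: every operator finds its operands already decoded and sitting on a stack, so no recursion or pending-parent state is needed. A toy evaluator with illustrative opcodes:

```ts
// Sketch: iterative postorder evaluation over a value stack.
function evalPostorder(code: Array<number | "add" | "mul">): number {
  const stack: number[] = [];
  for (const tok of code) {
    if (typeof tok === "number") { stack.push(tok); continue; }
    const b = stack.pop()!, a = stack.pop()!;  // children already evaluated
    stack.push(tok === "add" ? a + b : a * b);
  }
  return stack.pop()!;                         // single result remains
}

evalPostorder([2, 3, "add", 4, "mul"]);        // (2 + 3) * 4 === 20
```

The same loop shape works for validation, or for emitting an internal bytecode instead of values.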
FWIW, I agree: for a mostly tree-like data structure, post-order serialisation is more natural, and likely also slightly more efficient to decode -- at least that was what we measured a long time ago in a project I was involved in. Though that was for general serialisation; if it's an AST and you also want to be able to create a CFG from that directly then the equation might change somewhat. @lukewagner, your suggestion seems dependent on that other question, whether statements should be expressions. If they are, you probably cannot (or don't want to) represent control differently. Would that be a big issue, though? It's easy to extend postorder deserialisation to handle DAGs and general graphs.
I don't think postorder expression encoding makes any difference (positive or negative) here.
I recently implemented an AST-based interpreter for microcontrollers. It's not based on wasm bytecodes, but I did it using a binary AST. I tried several styles, including luajit-inspired fixed-width bytecodes, forth-style stack operations, and pre-order AST. The pre-order was by far the fastest in my implementations, even without jump offsets in my if/elif/else nodes. I simply have a routine that scans past the branches that aren't taken.
Yes, embedding jump offsets after IF/ELIF and before ELIF/ELSE would probably make it faster, but it greatly complicates the bytecode. For my uses (kids learning on microcontrollers), scanning is plenty fast. Just a data point to consider.
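A sketch of that offset-free scanning, under the assumption (consistent with the description above) that every op has a known arity, so the untaken arm is skipped by a cheap structural scan rather than a stored jump:

```ts
// Sketch: skip one preorder subtree by counting outstanding children.
// Ops and arities are illustrative; literals/vars are single tokens here.
const ARITY: Record<string, number> = { if: 3, add: 2, gt: 2 };

function skipSubtree(code: string[], i: number): number {
  let pending = 1;                         // nodes left to consume
  while (pending > 0) {
    pending += (ARITY[code[i]] ?? 0) - 1;  // node consumed, children added
    i++;
  }
  return i;                                // index just past the subtree
}

// preorder for: if (x > y) then (x + 1) else (y + 1)
const code = ["if", "gt", "x", "y", "add", "x", "1", "add", "y", "1"];
skipSubtree(code, 4);   // === 7: skips the true arm when the test is false
skipSubtree(code, 7);   // === 10: skips the false arm when the test is true
```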
Also fwiw, in my language, everything including control flow is an expression.
@rossberg-chromium Even in the statements-are-expressions case, I think it'd be ok to nest pre-/post-order. It would be a problem for a generic traversal algo (that doesn't know anything about the specific node types), but for a custom wasm decoder (that switches on the decoded opcode before moving on to the next), I think I see how it'd work (though of course I'd want to implement to be sure).
@creationix: That's interesting data, thanks. Do you have insight as to why preorder was faster in your interpreter? Perhaps it was heavily tied to the fact that you have fixed-size bytecodes?
I like the idea of an interpreter for the reasons @LouisLaf brought up. In particular, it would be of interest for the Tor browser (which is based on Firefox). Using only a wasm interpreter in addition to a JS interpreter sounds like a possibly acceptable tradeoff for the Tor browser (rather than blocking all JS, as they do today).
@kripken I'm not entirely sure why it's so much faster. Perhaps I just programmed better that way. With both, I have 7-bit opcodes with 1 bit to signify it's an opcode: 1xxxxxxx is an opcode, a literal entry into an enum table starting at 128. I don't use fixed-size codes for the preorder or post-order versions; that was only for the luajit-style bytecodes (4 bytes each). The preorder version can be seen visually in the IDE for kids I'm working on at http://uscript.creationix.com/ (though I graphically show math ops as infix to make it easier for kids). In preorder, there is no length header or even an end marker. Every op has a fixed set of operands. For variable-length bodies, I have a special block op.
For postorder, it's more like forth, which I expected to be much faster. I still have 7-bit ops and variable-length integers, but each op simply consumes items on the stack and/or pushes new items onto the stack. I have to pass in a length parameter to eval to know how many tokens to consume. I hadn't decided if I wanted a length header or end markers. After benchmarking, the pre-order was an order of magnitude faster.
edit: added screenshot of IDE prototype since it's likely to change in the near future.
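A sketch of the byte-level split described in this comment. The opcode rule (high bit set, enum entries starting at 128) is from the description; the literal rule used here (a literal is a run of high-bit-clear bytes, 7 payload bits each, little-endian, ended by the next opcode byte) is an assumption, and uscript's exact format may differ:

```ts
// Sketch: split a byte stream into opcode tokens and varint literals.
function tokenize(bytes: Uint8Array): Array<{ op: number } | { lit: number }> {
  const toks: Array<{ op: number } | { lit: number }> = [];
  let i = 0;
  while (i < bytes.length) {
    if (bytes[i] & 0x80) {                // 1xxxxxxx: opcode, enum entry 128+
      toks.push({ op: bytes[i++] });
      continue;
    }
    let lit = 0, shift = 0;               // 0xxxxxxx run: one varint literal
    while (i < bytes.length && !(bytes[i] & 0x80)) {
      lit |= bytes[i++] << shift;
      shift += 7;
    }
    toks.push({ lit });
  }
  return toks;
}
```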
Conclusion: a fast interpreter and the polyfill aren't primary goals.
We were recently discussing preorder versus postorder encoding. Different folks (@titzer, @kg, @ncbray, @sunfishcode, @lukewagner) have explored this in the prototype repos and elsewhere. We haven't reached an agreement yet on which is better (considering compression, decoding / interpretation time, implementation complexity).
Postorder may be better for an interpreter, preorder for CFG building. We need more experiments and data to figure out what the tradeoffs are.
Once we have this data, we need to answer a bigger question: are a fast interpreter and an easy polyfill primary design goals?