Rewrite BinaryEncoding.md to accurately represent the current v8-native decoder #520
Conversation
Note that this does not exhaustively document the encoding of individual opcodes, but we already didn't do that. The burden of keeping all of that up to date alongside AstSemantics.md, v8-native, etc seems unnecessary so we should figure out a better solution for that. Maybe document the encoding directly in AstSemantics... Alternately the encoding of opcodes could be general such that we don't need to document it per-opcode :-)
Quick look: there is a lot in there that is not the v8 encoding, lots of ideas that seem WIP. Can we just land the v8 specification as-is, in its full detail, omitting the discussion of the SSA one-pass algorithm (which is very interesting and relevant, just more relevant to the rationale than to an encoding specification)? We can then add a W3C template and make it start looking like a specification too, but not necessarily here. Can we all accept this is WIP and subject to revision? If not, then let's put it in a separate file clearly labelled as an experimental implementation, just so we can get something working.
If you look at the diff, the intent here is to keep the existing design philosophy bits that were in the document before (instead of throwing them away), then replace the imaginary-format documentation we had before with a clear, concise specification of what v8-native implements right now. This follows off the previous PR that split most of the pure rationale out into Rationale.md. If you have some ideas for how to split it out further in a comprehensible way, suggestions appreciated!
Then I suggest putting the v8 implementation in a separate document, as it is not consistent with a lot of the design ideas. Just add it warts and all, the opcode numbers etc. The challenge will then be to reconcile it and 'iterate' towards the design goals (please, after we have something working and some more experience).
That's what this is. A separate document is what we have now (the gdoc). If we really want a completely separate document, we delete BinaryEncoding.md and name the separate document... something? We could just rename BinaryEncoding and update all the links to it, I guess.
A straightforward way to do this through our existing process is to have a BinaryEncoding.md that documents v8-native, and then file PRs that revise it to propose changes. Then the process of updating implementations can occur at some point before/after the PR is merged. I'm already in the process of writing some PRs like that which will sit on top of this.
I suggest keeping BinaryEncoding.md and filing a separate PR to update it with all your ideas. There is a lot in there that will take a good deal of work to comment on and gain experience with. I suggest creating a new document, ExperimentalEncoding.md, with the specification of the current v8 implementation. File PRs to correct errors, follow changes, clarify it, etc. It will represent the current state of implementations, and need not be consistent with BinaryEncoding.md at present.
It makes sense to refine the wording of the top part of BinaryEncoding.md to match how we've been talking about it informally. The middle section seems out of place, since not all of the ideas are realized or decided on. I agree the last section as currently in BinaryEncoding.md is out of date and could just be removed. But I agree with @JSStats that the next step is to get a simple, fully-explicit definition that we can start to iterate on and submit PRs against. Given that we now have an AST in ml-proto (ast.ml) that pretty faithfully represents AstSemantics.md, I had hoped this would be an ML function. Lastly, I wouldn't call it the "v8 binary format"; it's just "the current state of the wasm binary spec".
> …be standardized in the [MVP](MVP.md).
> * **Specific** compression to the binary encoding, that is unreasonable to
>   expect a generic compression algorithm like gzip to achieve.
>   * This is not meant to be standardized, at least not initially, as it can be…
this text seems to have been removed, was that intentional?
(I mean this line and the 2 below it, specifically, not the whole removed block in the diff)
Yes, I don't believe there was ever a consensus to not standardize Layer 1/Layer 2. The opposite is true: IIRC all the discussions have involved eventually standardizing them. User-space is a stopgap.
Well, the previous text says "this is not meant to be standardized, at least not initially" - which I thought was the consensus, and hence was written here? How about just adding a note that the Specific compression is not initially going to be standardized, but the option is open to do so later?
If that's the actual decision I can revise it to say that. But we never arrived at that decision and I would object to it, so maybe we should start a thread to discuss it (issue in the design repo, maybe?)
Do you want me to kick off that thread?
Luke's prototype (our only real-world size test other than mine) basically had a bunch of layer 1 elements baked into it in order to get a good size reduction when combined with gzip. Layer 0 + gzip won't be a compelling improvement unless we pull a bunch of layer 1 stuff into layer 0 (which, to be fair, people seem to want to do anyway).
We can get more real-world data now. I tested on zlib built with emscripten, and ran asm2wasm, then wasm-as, which emits the current binary format. I see a 33% size reduction before gzip, and a 16% size reduction after gzip.
And in addition to the 16% smaller download, we will have massively faster parsing. Overall that seems like a very compelling MVP to me.
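For reference, the kind of measurement being discussed can be sketched in a few lines. The inputs below are placeholder byte strings standing in for the asm.js text and the wasm binary, not actual build output:

```python
import gzip

def size_reduction(original: bytes, encoded: bytes) -> tuple[float, float]:
    """Return (pre-gzip, post-gzip) size reductions as percentages."""
    pre = 100.0 * (1 - len(encoded) / len(original))
    orig_gz = gzip.compress(original, compresslevel=9)
    enc_gz = gzip.compress(encoded, compresslevel=9)
    post = 100.0 * (1 - len(enc_gz) / len(orig_gz))
    return pre, post

# Placeholder stand-ins for the source text and its binary encoding:
text = b"function add(a,b){a=a|0;b=b|0;return (a+b)|0}" * 200
binary = b"\x40\x00\x02" * 1000
pre, post = size_reduction(text, binary)
```

A result like the one reported above would correspond to `pre` around 33.0 and `post` around 16.0 when run on the real files.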
Is 16% post-gzip reduction instead of the ~45% from luke's prototype compelling enough to justify the amount of work necessary to implement this across the web ecosystem? Consumers of wasm will need to ship the polyfill and deal with that for a period of time as well.
In comparison, compressing bananabread's raw asm.js with lzham delivers a ~25% size reduction on top of gzip's. Deploying a new compression codec is much easier compared to shipping wasm everywhere. People are going to ask questions like: "Brotli is going into every web browser already, so why not just compress my asm.js with that?"
For reference, layer 0 as defined previously is larger than v8-native. We can just redefine things so that v8-native is the only thing we spec, and ship that. That has obvious downsides and I'm honestly tired of having to explain them over and over.
I'm actually surprised by only 33% before compression. My fuzzy recollection of the packer work from Luke was that it was around 3-5x reduction, if not more. How big are the code sections compared to the signature, function declaration, etc? Are you making use of the I8Const shorter opcodes?
@titzer: I ran @lukewagner's prototype on the same code (zlib) now. Before gzip it is 64% smaller, and after it is 25% smaller. So it is significantly better before gzip, but less so after gzip.
Possibly the main factor here is that @lukewagner's prototype had more tricks in it than the v8 binary format. Makes sense that would matter more before gzip.
Yes, binaryen's binary support uses I8Const.
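As an illustration of what a short-constant opcode buys: small constants can be emitted as an opcode plus a single immediate byte instead of an opcode plus a varint. The opcode byte values below are invented for the sketch; the real assignments live in the v8 source.

```python
# Hypothetical opcode values, for illustration only; these do NOT
# match the actual v8-native assignments.
OP_I8_CONST = 0x09
OP_I32_CONST = 0x0A

def leb128_unsigned(value: int) -> bytes:
    """Encode an unsigned integer as LEB128."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def encode_i32_const(value: int) -> bytes:
    # Constants fitting in a signed byte take 2 bytes total; the
    # general form is opcode + varint (up to 6 bytes for an i32).
    if -128 <= value <= 127:
        return bytes([OP_I8_CONST, value & 0xFF])
    return bytes([OP_I32_CONST]) + leb128_unsigned(value & 0xFFFFFFFF)
```

So an expression tree dense with small constants shrinks noticeably, which is exactly the kind of win that matters before a generic compressor ever sees the bytes.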
@kg: sorry if I'm asking something you've already explained in detail elsewhere. Is there a link to the previous discussions or summaries of them?
Overall, I don't think we have a problem of viability here. With the apples-to-apples comparison in the first part of this comment, the current v8 binary format is 15% smaller vs @lukewagner's which is 25% smaller. That's a significant difference, but I think both are excellent numbers for an MVP. We certainly intend to do better later, but there's no need to rush, I think.
Furthermore, we may do better at the MVP stage, with layer 1 stuff in userspace. We might also modify layer 0 in the choice of opcodes or ordering or such that improves things, even without layer 1.
More importantly, smaller downloads are a crucial piece here, but not the only one. At the MVP stage, we will also have
- Lightning-fast parsing, that also consumes far less memory.
- Consistent performance across browsers.
- Fast performance of things like 64-bit ints and memory growth, which were never (or never consistently) fast.
Even with an initially more moderate download size improvement, I think the wasm MVP definitely justifies itself.
I think we all agree that layer 1 is important to work on, and will matter a lot. The only question I'm raising here is, previously we wrote that we would leave layer 1 standardization to post-MVP, and I don't see a reason to change that?
Yeah. Maybe I should split that piece out into a smaller PR so it can merge separate from the rest of this discussion.
Based on my understanding and memory of past discussions (and things we have previously documented), nothing in the middle section is inaccurate or undecided, though some of it probably didn't go through as much discussion (since it dates back to earlier in the process). There are some mismatches with v8-native, but that's why it's in the terminology/primitives section. Maybe there needs to be a separate document for that section, like there is for Rationale.md. I don't know what we would call it.
This is fully explicit and matches v8-native. In fact, I wrote this from a reading of the source currently in the v8 repository, to ensure that it was accurate.
ML for a binary format specification is something we'd need to prove is worthwhile. The benefits of an ML encoding for an instruction set are obvious(-ish) and compelling, but defining a binary format in ML is not something I can come up with a good justification for. The typical use cases for a binary format like the one we're building involve writing an efficient decoder in a production language (i.e. not ML), reverse-engineering/examining files (i.e. a hex editor), and building layers on top (like structural compression). I'm not sure how ML would clearly express the relevant information here. We're defining a new binary encoding for the web, so we need to pay attention to existing standards and understand why successful formats are successful. More importantly, the binary encoding needs to be decoupled from the fine details of the AST/IR, and I'm not sure how we would do that if we cram it into the spec repo with the rest of the ML. I'm sure there are some existing ML/Haskell/etc. projects out there that achieve that kind of split and generality; can you point me to some?
To be precise, there is currently no binary format spec. We had a sketch before (the thing this PR removes) and now we have the v8-native prototype format, but it's not a spec (in part because it is divergent from the ML spec and in part because it doesn't match the existing design text). We could define this new document as the spec, if that's the consensus. I don't think we should - it's a research prototype right now, just like polyfill-prototype-1 and js-astcompressor.
I don't think that makes it fully explicit. As in, I don't think I could write a binary decoder given this. As this becomes fully explicit, the spec repo seems to be the right place for it.
It would specify, for any given AST, the exact binary bits. And you could write tests against it. That's pretty clear. OTOH, if @rossberg-chromium isn't up for it, it's not like I'm volunteering either, so I guess md would do.
On the contrary, the binary encoding needs to be precisely paired with the AST so that the exact correspondence is clear.
I think this is the root of the misunderstanding here. I thought you were trying to initialize "the binary spec" on which we'd start to iterate (by submitting PRs, issues, etc.). There was some consensus around bootstrapping the binary spec in this way, I thought, and it sounds like people are eager to do this. I'd still say that a sibling dir to [...]
Well, I've been writing the draft polyfill's decoder, so this format is informed by the information I needed to write it - information I mostly acquired by reading the v8 source, since the gdoc wasn't completely accurate. Right now you could implement a decoder for the module format from this document without any additional reference information. Decoding the AST function bodies would require extra info, and I agree that we need really good documentation for that. But AST decoding and module decoding are separate concerns, arguably.
This appears to be a big philosophical disagreement that I thought we were in agreement on. How do you want to address this? I'm not sure a github issue is the right place, but feedback is welcome.
Yeah, I failed to communicate here. My immediate goal is just to update this document in design/ so it's not wildly inaccurate and doesn't mislead people who are looking for a summary of the design at present. I've actually received a couple of unsolicited emails from people who were looking for 'the current spec' or 'current format', so I think the two best options are to delete BinaryEncoding.md entirely or strip it down to a minimal representation of the format.
The encoding of the sections seems oddly out of place to me. It's s-exp in the wast, and could use the same data layer representation as the AST. I can live with it either way, but using a common data layer could have been cleaner and might have set an example for other extension sections.
I don't think pointing to code as a specification is tenable either. It needs to be written up. But having a reference implementation seems fine. Can we please land the v8 style AST definition. The table with argument types seems a good start.
We still need somewhere to integrate the planned direction, and BinaryEncoding.md seems fine. We don't want the newly landed 'experimental encoding' to drift way ahead of the actual implementations - we would have the same problem again. I don't think it belongs under ml-proto - that seems clearly a reference implementation to me. I suggest creating a new repo for the written specifications if it is not appropriate for the design repo. The current spec repo could then become dedicated to the reference ml-proto implementation and test suite.
@lukewagner, still planning to start on this soon, but was bogged down by other things after the holidays. @kg, I have difficulties understanding your position. Clearly, as Luke said, there should be as much rigor in defining the mapping as possible. What would be gained by not explicating that in the proto spec (in addition to the textual description, of course)?
> # Primitives and key terminology
>
> ### varuint32
Probably worthwhile to go ahead and define the other integer types and to be explicit that they are encoded in little-endian.
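As a sketch of what being explicit here might look like, assuming varuint32 is LEB128-encoded (as in the v8-native prototype) and fixed-width integers are little-endian:

```python
import struct

def decode_varuint32(buf: bytes, offset: int = 0) -> tuple[int, int]:
    """Decode a LEB128-encoded varuint32; return (value, next_offset)."""
    result = 0
    shift = 0
    while True:
        byte = buf[offset]
        offset += 1
        result |= (byte & 0x7F) << shift  # low 7 bits carry payload
        if not (byte & 0x80):             # high bit clear: last byte
            return result, offset
        shift += 7
        if shift >= 35:
            raise ValueError("varuint32 too long")

def decode_uint32_le(buf: bytes, offset: int = 0) -> tuple[int, int]:
    """Decode a fixed-width little-endian uint32."""
    (value,) = struct.unpack_from("<I", buf, offset)
    return value, offset + 4
```

For example, the classic LEB128 bytes `0xE5 0x8E 0x26` decode to 624485.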
I think this is the right place to be very explicit and detailed, down to [...]. This will work great if we can point people to [...]. We'll have to figure out a versioning story, and this document will be the [...]
I agree with @lukewagner and @rossberg-chromium in that adding a decoder to [...]
Revised the PR. My opinion is that for laying out exact enumerant values (i.e. the index assigned to the Signatures section), we should have a separate document. In practice we should not have very many (any?) assigned indices, because we previously agreed to have name->index tables, and not having them will prevent the format from being evolved iteratively. So I think we should introduce a BinaryEncodingConstants.md or something, containing all the hard-coded constants we have right now. Then, as we incrementally fix v8-native's encoding, we can remove parts of that document until it's empty and we can delete it. I would offer to write that document myself in a follow-up PR, but I don't think I'll have a chance to do it any time soon.
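A hypothetical sketch of what name-keyed sections could look like to a decoder, as opposed to hard-coded section indices. The byte layout here (single-byte lengths, inline UTF-8 name) is invented for illustration and is not the v8-native encoding:

```python
def decode_sections(buf: bytes) -> dict[str, bytes]:
    """Walk a module whose sections are identified by a name string
    plus a byte length, rather than a fixed enumerant.
    Assumed layout per section: [name_len][name][body_len][body]."""
    offset = 0
    sections = {}
    while offset < len(buf):
        name_len = buf[offset]; offset += 1
        name = buf[offset:offset + name_len].decode("utf-8")
        offset += name_len
        body_len = buf[offset]; offset += 1
        sections[name] = buf[offset:offset + body_len]
        offset += body_len
    return sections
```

Because a decoder skips unknown names by length, new sections can be added without renumbering anything, which is the iterability argument above.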
+1 to landing @kg's work here. Minor clarifications could be addressed in follow-up PRs. I've been making an attempt to put together an opcode table, noting conflicts with the spec and filling in some of the missing information, such as which operands are 'parametric', the natural alignments, etc.; see https://github.com/wllang/wll/blob/master/wasm.lisp#L132 Perhaps I could have a go at the opcodes section; most of it might be automatically generated from this table, with the exceptions discussed in more detail. I've written a binary encoder and decoder, and I've also been working with binaryen to try to resolve the operator names and opcode numbers.
Superseded by #537.