Rewrite BinaryEncoding.md to accurately represent the current v8-native decoder #520

Closed · wants to merge 3 commits
Conversation

@kg (Contributor) commented Jan 19, 2016

No description provided.

@kg (Contributor, Author) commented Jan 19, 2016

Note that this does not exhaustively document the encoding of individual opcodes, but we already didn't do that. The burden of keeping all of that up to date alongside AstSemantics.md, v8-native, etc. seems unnecessary, so we should figure out a better solution for that. Maybe document the encoding directly in AstSemantics...

Alternately the encoding of opcodes could be general such that we don't need to document it per-opcode :-)

@ghost commented Jan 19, 2016

At a quick look, there is a lot in there that is not the v8 encoding; lots of the ideas seem WIP. Can we just land the v8 specification as-is, in its full detail, omitting the discussion of the one-pass SSA algorithm? (That discussion is very interesting and relevant, but it seems more relevant to the rationale than to an encoding specification.) We can then add a W3C template and make it start looking like a specification too, but not necessarily here. Can we all accept this is WIP and subject to revision? If not, then let's put it in a separate file clearly labelled as an experimental implementation, just so we can get something working.

@kg (Contributor, Author) commented Jan 19, 2016

If you look at the diff, the intent here is to keep the existing design-philosophy bits that were in the document before (instead of throwing them away), and then replace the imaginary-format documentation we had before with a clear, concise specification of what v8-native implements right now.

This follows on from the previous PR that split most of the pure rationale out into Rationale.md.

If you have some ideas for how to split it out further in a comprehensible way, suggestions appreciated!

@ghost commented Jan 19, 2016

Then I suggest putting the v8 implementation in a separate document, as it is not consistent with a lot of the design ideas. Just add it warts and all, opcode numbers and so on. The challenge will then be to reconcile it and 'iterate' towards the design goals (preferably after we have something working and some more experience).

@kg (Contributor, Author) commented Jan 19, 2016

> Then I suggest putting the v8 implementation in a separate document as it is not consistent with a lot of the design ideas. Just add it warts and all, the opcode numbers etc.

That's what this is. A separate document is what we have now (the gdoc). If we really want a completely separate document, we delete BinaryEncoding.md and name the separate document... something? We could just rename BinaryEncoding and update all the links to it, I guess.

> The challenge will then be to reconcile it and 'iterate' towards the design goals

A straightforward way to do this through our existing process is to have a BinaryEncoding.md that documents v8-native, and then file PRs that revise it to propose changes. Then the process of updating implementations can occur at some point before/after the PR is merged. I'm already in the process of writing some PRs like that which will sit on top of this.

@ghost commented Jan 19, 2016

I suggest keeping BinaryEncoding.md, and a separate PR to update it with all your ideas. There is a lot in there that will take a good deal of work, and experience, to comment on.

I suggest creating a new document, ExperimentalEncoding.md, with the specification of the current v8 implementation. File PRs to correct errors, track changes, clarify it, etc. It will represent the current state of implementations, and it need not be consistent with BinaryEncoding.md at present.

@lukewagner (Member) commented
It makes sense to refine the top part of BinaryEncoding.md so the wording matches how we've been talking about it informally.

The middle section seems to be out of place since not all of the ideas are realized or decided on.

I agree the last section as currently in BinaryEncoding.md is out of date and could just be removed. But I agree with @JSStats that the next step is to get a simple fully-explicit definition that we can start to iterate and submit PRs to. Given that we now have an AST in ml-proto (ast.ml) that pretty faithfully represents AstSemantics.md, I had hoped this would be an ML function `to_binary : Ast -> byte string`, since this would basically read like an EBNF grammar, with the inverse `from_binary` provided for completeness. The benefit being these specs would be executable and useful for testing. Last I checked, @rossberg-chromium sounded interested in coding this up. The next best thing would be markdown, like you've started here, but I think that'd be in the spec repo.
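The executable-spec idea here can be sketched in miniature (hypothetically in Python rather than ML, and with an invented two-opcode AST and invented opcode bytes purely for illustration; nothing below reflects the actual v8-native format):

```python
# Sketch of an "executable spec": a to_binary/from_binary pair over a toy
# AST, so the encoding can be verified by round-tripping. All opcode values
# and the encoding layout are invented for illustration.

OP_I32_CONST = 0x01  # invented opcode byte
OP_ADD = 0x02        # invented opcode byte

def to_binary(node):
    """Encode a toy AST node (nested tuples) to bytes."""
    kind = node[0]
    if kind == "i32.const":
        # invented layout: opcode byte, then 4 little-endian signed value bytes
        return bytes([OP_I32_CONST]) + node[1].to_bytes(4, "little", signed=True)
    if kind == "add":
        return bytes([OP_ADD]) + to_binary(node[1]) + to_binary(node[2])
    raise ValueError(f"unknown node kind: {kind}")

def from_binary(buf, pos=0):
    """Decode one node starting at pos; return (node, next_pos)."""
    op = buf[pos]
    if op == OP_I32_CONST:
        value = int.from_bytes(buf[pos + 1:pos + 5], "little", signed=True)
        return ("i32.const", value), pos + 5
    if op == OP_ADD:
        lhs, pos = from_binary(buf, pos + 1)
        rhs, pos = from_binary(buf, pos)
        return ("add", lhs, rhs), pos
    raise ValueError(f"unknown opcode: {op:#x}")
```

The appeal of the pairing is exactly what is described above: `to_binary` reads like a grammar, and tests can assert `from_binary(to_binary(ast)) == ast` for any AST.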

Lastly, I wouldn't call it the "v8 binary format"; it's just "the current state of the wasm binary spec".

Inline review thread on these removed lines of BinaryEncoding.md:

> be standardized in the [MVP](MVP.md).
> * **Specific** compression to the binary encoding, that is unreasonable to
>   expect a generic compression algorithm like gzip to achieve.
> * This is not meant to be standardized, at least not initially, as it can be
Member:
This text seems to have been removed; was that intentional?

Member:

(I mean this line and the 2 below it, specifically, not the whole removed block in the diff)

Contributor (Author):

Yes, I don't believe there was ever a consensus to not standardize Layer 1/Layer 2. The opposite is true: IIRC all the discussions have involved eventually standardizing them. User-space is a stopgap.

Member:

Well, the previous text says "this is not meant to be standardized, at least not initially", which I thought was the consensus, and hence was written here. How about just adding a note that the specific compression is not initially going to be standardized, but the option is open to do so later?

Contributor (Author):

If that's the actual decision, I can revise it to say that. But we never arrived at that decision, and I would object to it, so maybe we should start a thread to discuss it (an issue in the design repo, maybe?)

Do you want me to kick off that thread?

Contributor (Author):

Luke's prototype (our only real-world size test other than mine) basically had a bunch of layer 1 elements baked into it in order to get a good size reduction when combined with gzip. Layer 0 + gzip won't be a compelling improvement unless we pull a bunch of layer 1 stuff into layer 0 (which, to be fair, people seem to want to do anyway).

Member:

We can get more real-world data now. I tested on zlib built with emscripten, and ran asm2wasm, then wasm-as which emits the current binary format. I see a 33% size reduction before gzip, and 16% size reduction after gzip.

And in addition to the 16% smaller download, we will have massively faster parsing. Overall that seems like a very compelling MVP to me.

Contributor (Author):

Is 16% post-gzip reduction instead of the ~45% from luke's prototype compelling enough to justify the amount of work necessary to implement this across the web ecosystem? Consumers of wasm will need to ship the polyfill and deal with that for a period of time as well.

In comparison, compressing bananabread's raw asm.js with lzham delivers a ~25% size reduction on top of gzip's. Deploying a new compression codec is much easier than shipping wasm everywhere. People are going to ask questions like: "Brotli is going into every web browser already, so why not just compress my asm.js with that?"

For reference, layer 0 as defined previously is larger than v8-native. We can just redefine things so that v8-native is the only thing we spec, and ship that. That has obvious downsides and I'm honestly tired of having to explain them over and over.

Reply (via email):

On Thu, Jan 21, 2016 at 8:25 PM, Alon Zakai wrote:

> We can get more real-world data now. I tested on zlib built with emscripten, and ran asm2wasm, then wasm-as which emits the current binary format. I see a 33% size reduction before gzip, and 16% size reduction after gzip.

I'm actually surprised by only 33% before compression. My fuzzy recollection of the packer work from Luke was that it was around a 3-5x reduction, if not more. How big are the code sections compared to the signature, function declaration, etc.? Are you making use of the I8Const shorter opcodes?

> And in addition to the 16% smaller download, we will have massively faster parsing. Overall that seems like a very compelling MVP to me.

Member:

@titzer: I ran @lukewagner's prototype on the same code (zlib) now. Before gzip it is 64% smaller, and after it is 25% smaller. So it is significantly better before gzip, but less so after gzip.

Possibly the main factor here is that @lukewagner's prototype had more tricks in it than the v8 binary format. Makes sense that would matter more before gzip.

Yes, binaryen's binary support uses I8Const.

@kg: sorry if I'm asking something you've already explained in detail elsewhere. Is there a link to the previous discussions or summaries of them?

Overall, I don't think we have a problem of viability here. With the apples-to-apples comparison in the first part of this comment, the current v8 binary format is 15% smaller vs @lukewagner's which is 25% smaller. That's a significant difference, but I think both are excellent numbers for an MVP. We certainly intend to do better later, but there's no need to rush, I think.

Furthermore, we may do better at the MVP stage, with layer 1 stuff in userspace. We might also modify layer 0 in the choice of opcodes or ordering or such that improves things, even without layer 1.

More importantly, smaller downloads are a crucial piece here, but not the only one. At the MVP stage, we will also have

  1. Lightning-fast parsing, that also consumes far less memory.
  2. Consistent performance across browsers.
  3. Fast performance of things like 64-bit ints and memory growth, which were never (or never consistently) fast.

Even with an initially more moderate download size improvement, I think the wasm MVP definitely justifies itself.

I think we all agree that layer 1 is important to work on, and will matter a lot. The only question I'm raising here is, previously we wrote that we would leave layer 1 standardization to post-MVP, and I don't see a reason to change that?

@kg (Contributor, Author) commented Jan 19, 2016

> It makes sense to refine the top part of BinaryEncoding.md so the wording matches how we've been talking about it informally.

Yeah. Maybe I should split that piece out into a smaller PR so it can merge separate from the rest of this discussion.

> The middle section seems to be out of place since not all of the ideas are realized or decided on.

Based on my understanding and memory of past discussions (and things we have previously documented), nothing in the middle section is inaccurate or undecided, though some of it probably didn't go through as much discussion (since it dates back to earlier in the process). There are some mismatches with v8-native, but that's why it's in the terminology/primitives section. Maybe there needs to be a separate document for that section, like there is for Rationale.md. I don't know what we would call it.

> I agree the last section as currently in BinaryEncoding.md is out of date and could just be removed. But I agree with @JSStats that the next step is to get a simple fully-explicit definition that we can start to iterate and submit PRs to.

This is fully explicit and matches v8-native. In fact, I wrote this from a reading of the source currently in the v8 repository, to ensure that it was accurate.

> Given that we now have an AST in ml-proto (ast.ml) that pretty faithfully represents AstSemantics.md, I had hoped this would be an ML function `to_binary : Ast -> byte string` since this would basically read like an EBNF grammar with the inverse `from_binary` provided for completeness. The benefit being these specs would be executable and useful for testing. Last I checked, @rossberg-chromium sounded interested in coding this up. The next best thing would be markdown, like you've started here, but I think that'd be in the spec repo.

ML for a binary format specification is something we'd need to prove is worthwhile. The benefits of an ML encoding for an instruction set are obvious(-ish) and compelling, but defining a binary format in ML is not something I can come up with a good justification for.

The typical use cases for a binary format like what we're building involves writing an efficient decoder in a production language (i.e. not ML), reverse-engineering/examining files (i.e. hex editor), and building layers on top (like structural compression). I'm not sure how ML would clearly express the relevant information here. We're defining a new binary encoding for the web so we need to pay attention to existing standards here and understand why successful formats are successful.

More importantly, the binary encoding needs to be decoupled from the fine details of the AST/IR, and I'm not sure how we would do that if we cram it into the spec repo with the rest of the ML. I'm sure there are some existing ML/Haskell/etc. projects out there that achieve that kind of split and generality; can you point me to some?

> Lastly, I wouldn't call it the "v8 binary format"; it's just "the current state of the wasm binary spec".

To be precise, there is currently no binary format spec. We had a sketch before (the thing this PR removes) and now we have the v8-native prototype format, but it's not a spec (in part because it is divergent from the ML spec and in part because it doesn't match the existing design text).

We could define this new document as the spec, if that's the consensus. I don't think we should - it's a research prototype right now, just like polyfill-prototype-1 and js-astcompressor.

@lukewagner (Member) commented
> This is fully explicit and matches v8-native. In fact, I wrote this from a reading of the source currently in the v8 repository, to ensure that it was accurate.

I don't think that makes it fully explicit. As in, I don't think I could write a binary decoder given this. As this becomes fully explicit, the spec repo seems to be the right place for it.

> I'm not sure how ML would clearly express the relevant information here.

It would specify, for any given AST, the exact binary bits. And you could write tests against it. That's pretty clear. OTOH, if @rossberg-chromium isn't up for it, it's not like I'm volunteering either, so I guess md would do.

> More importantly, the binary encoding needs to be decoupled from the fine details of the AST/IR

On the contrary, the binary encoding needs to be precisely paired with the AST so that the exact correspondence is clear.

> We could define this new document as the spec, if that's the consensus. I don't think we should - it's a research prototype right now, just like polyfill-prototype-1 and js-astcompressor.

I think this is the root of the misunderstanding here. I thought you were trying to initialize "the binary spec" on which we'd start to iterate (by submitting PRs, issues, etc.). There was some consensus around bootstrapping the binary spec in this way, I thought, and it sounds like people are eager to do this. I'd still say that a sibling dir to ml-proto inside spec is the right place, then.

@kg (Contributor, Author) commented Jan 19, 2016

> I don't think that makes it fully explicit. As in, I don't think I could write a binary decoder given this. As this becomes fully explicit, the spec repo seems to be the right place for it.

Well, I've been writing the draft polyfill's decoder, so this format is informed by the information I needed to write it - information I mostly acquired by reading the v8 source, since the gdoc wasn't completely accurate. Right now you could implement a decoder for the module format from this document without any additional reference information. Decoding the AST function bodies would require extra info, and I agree that we need really good documentation for that. But AST decoding and module decoding are separate concerns, arguably.
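The split being described, where module decoding stands apart from AST decoding, can be sketched generically; the layout assumed below (a one-byte section id followed by a varuint32 payload length) is an invented simplification for illustration, not necessarily the v8-native layout of the time:

```python
def read_varuint32(buf, pos):
    """Assumed primitive: LEB128-style variable-length unsigned integer.
    Returns (value, next_pos)."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not (byte & 0x80):  # high bit clear: last byte
            return result, pos
        shift += 7

def read_sections(buf):
    """Split a module body into (section_id, payload) pairs without
    interpreting the payloads; AST decoding would happen separately,
    only on the payload of the code section."""
    pos, sections = 0, []
    while pos < len(buf):
        section_id = buf[pos]          # invented: one-byte section id
        pos += 1
        size, pos = read_varuint32(buf, pos)
        sections.append((section_id, buf[pos:pos + size]))
        pos += size
    return sections
```

The point of the sketch is structural: a decoder like this can enumerate sections of a module with no knowledge of opcode encodings, which is the sense in which module decoding and AST decoding are separate concerns.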

> It would specify, for any given AST, the exact binary bits. And you could write tests against it. That's pretty clear. OTOH, if @rossberg-chromium isn't up for it, it's not like I'm volunteering either, so I guess md would do.
> [snip]
> On the contrary, the binary encoding needs to be precisely paired with the AST so that the exact correspondence is clear.

This appears to be a big philosophical disagreement that I thought we were in agreement on. How do you want to address this? I'm not sure a github issue is the right place, but feedback is welcome.

> I think this is the root of the misunderstanding here. I thought you were trying to initialize "the binary spec" on which we'd start to iterate (by submitting PRs, issues, etc.). There was some consensus around bootstrapping the binary spec in this way, I thought, and it sounds like people are eager to do this. I'd still say that a sibling dir to ml-proto inside spec is the right place, then.

Yeah, I failed to communicate here. My immediate goal is just to update this document in design/ so it's not wildly inaccurate, and not misleading people who are looking for a summary of the design at present. I've actually received a couple of unsolicited emails from people who were looking for 'the current spec' or 'the current format', so I think the two best options are to delete BinaryEncoding.md entirely or strip it down so it's just a minimal representation of the format.

@ghost commented Jan 19, 2016

> Right now you could implement a decoder for the module format from this document without any additional reference information. [...] But AST decoding and module decoding are separate concerns, arguably.

The encoding of the sections seems oddly out of place to me. It's s-expressions in the wast, and it could use the same data-layer representation as the AST. I can live with it either way, but using a common data layer could have been cleaner and might have set an example for other extension sections.

> On the contrary, the binary encoding needs to be precisely paired with the AST so that the exact correspondence is clear.
>
> This appears to be a big philosophical disagreement that I thought we were in agreement on. How do you want to address this? I'm not sure a github issue is the right place, but feedback is welcome.

I don't think pointing to code as a specification is tenable either. It needs to be written up. But having a reference implementation seems fine. Can we please land the v8-style AST definition? The table with argument types seems a good start.

> My immediate goal is just to update this document in design/ so it's not wildly inaccurate [...] I think the two best options are to delete BinaryEncoding.md entirely or strip it down so it's just a minimal representation of the format.

We still need somewhere to integrate the planned direction, and BinaryEncoding.md seems fine. We don't want the newly landed 'experimental encoding' to drift way ahead of the actual implementations - we would have the same problem again.

I don't think it belongs under ml-proto - that seems clearly a reference implementation to me. I suggest creating a new repo for the written specifications if it is not appropriate for the design repo. The current spec repo could then become dedicated to the reference ml-proto implementation and test suite.

@rossberg (Member) commented
@lukewagner, still planning to start on this soon, but was bogged down by other things after the holidays.

@kg, I have difficulties understanding your position. Clearly, as Luke said, there should be as much rigor in defining the mapping as possible. What would be gained by not explicating that in the proto spec (in addition to the textual description, of course)?


Inline review comment on the "Primitives and key terminology" / "varuint32" section:

Probably worthwhile to go ahead and define the other integer types and to be explicit that they are encoded in little-endian.
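Following that suggestion, a sketch of the encoding side (assuming varuint32 is a plain unsigned LEB128 variable-length integer and the fixed-width integer types are little-endian, which matches how the prototype format has been described):

```python
def encode_varuint32(value):
    """Unsigned LEB128: 7 payload bits per byte, least-significant group
    first; the high bit of each byte marks continuation."""
    assert 0 <= value < 2**32
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def encode_uint32(value):
    """Fixed-width 32-bit unsigned integer, stored little-endian."""
    return value.to_bytes(4, "little")
```

Small values thus take one byte (`encode_varuint32(0)` is `b"\x00"`), which is the usual motivation for using a variable-length encoding for counts and indices.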

@titzer commented Jan 20, 2016

On Tue, Jan 19, 2016 at 10:19 PM, Katelyn Gadd wrote:

> Note that this does not exhaustively document the encoding of individual opcodes, but we already didn't do that. [...] Maybe document the encoding directly in AstSemantics...

I think this is the right place to be very explicit and detailed, down to the encoding for every single section and opcode. It should represent the consensus and serve as a reference detailed enough for encoder and decoder implementation. PRs to this document will then be the go-to place for changes and tweaks. We can also add a section for more intrusive proposals or plans at the end so that things can be merged and recorded without becoming Gospel.

This will work great if we can point people to github.com/WebAssembly/design/BinaryEncoding.md and it will be fully self-contained.

We'll have to figure out a versioning story, and this document will be the place to do that.


@titzer commented Jan 20, 2016

I agree with @lukewagner and @rossberg-chromium in that adding a decoder to ml-proto makes sense. That is not in lieu of this English-language detailed document, but a very important addition, in the same vein that it serves as an executable specification for s-expressions.


@kg (Contributor, Author) commented Jan 26, 2016

Revised the PR.

My opinion is that for laying out exact enumerant values (i.e. the index assigned to the Signatures section), we should have a separate document. In practice we should not have very many (any?) assigned indices, because we previously agreed to have name->index tables, and not having them will prevent the format from being evolved iteratively.

So I think we should introduce a BinaryEncodingConstants.md or something, containing all the hard-coded constants we have right now. Then as we incrementally fix v8-native's encoding, we can remove parts of that document until it's empty and we can delete it. I would offer to write that document myself in a follow-up PR but I don't think I'll have a chance to do it any time soon.
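Purely as a hypothetical illustration of the kind of table such a constants document could centralize (every value and name below is invented, not the real v8-native assignment):

```python
# Hypothetical registry of hard-coded section indices, of the sort the
# proposed BinaryEncodingConstants.md would collect in one place.
# All values are invented for illustration.
SECTION_IDS = {
    "memory": 0x00,
    "signatures": 0x01,
    "functions": 0x02,
    "globals": 0x03,
    "data_segments": 0x04,
}

def section_name(section_id, table=None):
    """Resolve a section id to a name. With a name->index table (the
    previously agreed direction), a module supplies its own `table` and
    the fixed SECTION_IDS assignments become unnecessary."""
    mapping = table or {v: k for k, v in SECTION_IDS.items()}
    return mapping.get(section_id, "unknown")
```

The per-module `table` parameter illustrates the migration path described above: as name->index tables land, entries in the hard-coded registry can be removed one by one until the constants document is empty.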

@ghost commented Jan 26, 2016

+1 to landing @kg's work here. Minor clarifications could be addressed in follow up PRs.

I've been making an attempt to put together an opcode table, noting conflicts with the spec and filling in some of the missing information, such as which operands are 'parametric', the natural alignments, etc.; see https://github.com/wllang/wll/blob/master/wasm.lisp#L132. Perhaps I could have a go at the opcodes section; most of it might be automatically generated from this table, with the exceptions discussed in more detail. I've written a binary encoder and decoder, and I've also been working with binaryen to try to resolve the operator names and opcode numbers.

@titzer commented Feb 8, 2016

Superseded by #537.

@titzer titzer closed this Feb 8, 2016