Skip to content

Tweaks to binary section format? #623

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
rossberg opened this issue Mar 24, 2016 · 17 comments
Closed

Tweaks to binary section format? #623

rossberg opened this issue Mar 24, 2016 · 17 comments
Milestone

Comments

@rossberg
Copy link
Member

Having finished a first iteration of the encoder & decoder for the spec, I'd like to make a couple of small suggestions regarding the structure of sections in the binary. Lumping the together here.

Section Headers

Instead of
(payload_size; name_string; payload)
as the overall section structure, can we swap that to
(name_string; payload_size; payload)
? Two minor advantages:

  • It makes the offset computation slightly less confusing.
  • It allows simply viewing a section as a pair
    (name_string; payload_string),
    which is particularly natural wrt handling and skipping unknown sections.

With that change, skipping over sections requires skipping over their name first. But I have a hard time imagining a reason for skipping a section without even knowing what it is.

Section Names

Honestly, the current names are super verbose. Would anybody be opposed to shorten them a bit, to something nicer? I'd suggest:

"signatures"          -> "types"
"import_table"        -> "imports"
"function_signatures" -> "functions"
"export_table"        -> "exports"
"start_function"      -> "start"
"function_bodies"     -> "code"
"data_segments"       -> "data"

Note that the signatures section may be generalised to contain other kinds of type definitions in the future, so the current name is not a good fit.

Function Bodies

Function bodies are implicitly delimited by the byte size of their encoding. This is unfortunate for a couple of reasons:

  • The expression decoder cannot just operate on an abstract stream, it has to take this secondary end-of-stream condition into account. The spec currently handles this by a somewhat ad-hoc notion of "substream", but really, this pierces the stream abstraction in an ugly manner.
  • This is the only piece of the binary format that stands in the way of formulating the entire format unambiguously as a grammar. Because it depends on non-local (and lower-level) context information. I find it rather sad to get 99% there but lose it on the last meter.

I'd hence like to suggest adding an explicit end opcode to functions. Pros:

  • The binary format is fully structured and can be entirely parsed linearly from an abstract byte stream, without paying attention to any size information.
  • Sizes would only be needed to (a) seek through the stream if desired, (b) validating that they are consistent. That decouples concerns nicely.

I'm aware that there are concerns that the extra byte is "redundant", but is one byte per function a big deal? We recently saved much more by improving the representation of locals (or would by shortening section names :) ).

@wanderer
Copy link
Contributor

Section Names

If I understand correctly the whole string is going into the binary. Could each section type be mapped to an int and represented with a single byte?

I'm aware that there are concerns that the extra byte is "redundant", but is one byte per function a big deal?

If you have an end op could you drop the function body size? Could you reuse the end section instead of a new op?

@kripken
Copy link
Member

kripken commented Mar 24, 2016

+1 to all of these.

For the last one, we'd be adding an extra byte, but removing the varuint32 for "body size", shouldn't that be an overall win? 1 byte vs. 1+.

@binji
Copy link
Member

binji commented Mar 24, 2016

Instead of
(payload_size; name_string; payload)
as the overall section structure, can we swap that to
(name_string; payload_size; payload)

Agreed, I found it strange that the size of the section occurs before the section name as well.

Honestly, the current names are super verbose.

+1. These are the basic sections that all modules will have, let's make them succinct.

"signatures" -> "types"
"import_table" -> "imports"
"function_signatures" -> "functions"
"export_table" -> "exports"
"start_function" -> "start"
"function_bodies" -> "code"
"data_segments" -> "data"

OK with all of these except maybe functions and code. Don't have better ideas though.

I'd hence like to suggest adding an explicit end opcode to functions.

+1

@binji
Copy link
Member

binji commented Mar 24, 2016

If I understand correctly the whole string is going to the end binary. Could each section type be mapped to an int and represented with a with a single byte?

This is how it used to be, but was changed to allow extending the format later with new section types.

If you have an end op could you drop the function body size? Could you reuse the end section instead of a new op?

The postorder format already adds an end op, so we'd just reuse it there, I think.

@jfbastien
Copy link
Member

Does name_string include the size of the string? The purpose it to be able to skip sections with unknown names, and since they're not null-terminated we need a size. Reordering sounds fine to me.

@ghost
Copy link

ghost commented Mar 24, 2016

@rossberg-chromium I also find it a little odd that the function top level expressions are terminated by a stream byte count whereas other blocks are terminated differently (count or end-marker). I don't think one extra byte per function is a blocker. Making them consistent would appear to give a small reduction in complexity for the decoder.

@kripken The function body size is still useful for the use case of skipping them quickly, for building a directory of the position and size of each function body, but I guess this could be in an optional directory section.

Thus it seems possible to remove the function body size too. So a decoder reading an incorrectly balanced function body AST would just continue interpreting the following data, perhaps hitting a 'syntax' error, and at the end of the section checking that all functions had been decoded and that the section terminates on the correct section size?

@kripken
Copy link
Member

kripken commented Mar 24, 2016

For fast skipping of function bodies, we could allow multiple function sections. However, I'm not sure we need to focus on optimizing skipping of bodies? Skipping entire sections is useful, but skipping bodies seems unnecessary given we now have signatures beforehand anyhow. Normal VM operation won't want to skip function bodies.

@binji
Copy link
Member

binji commented Mar 24, 2016

@kripken we'll want to quickly skip function bodies for parallel decoding.

@kripken
Copy link
Member

kripken commented Mar 24, 2016

Oh, right, thanks.

@sunfishcode
Copy link
Member

Section names: it's a bikeshed of course, but I prefer the more descriptive names. In particular, "functions", "types", and "code" seem likely to confuse.

@ghost
Copy link

ghost commented Mar 27, 2016

Regarding the signatures section renaming, this section does not appear amenable to extension at present, rather it is specialized to the function signature types. Thus extending it in future would break readability which might not be practical when wasm is deployed, and perhaps it would be more likely that wasm would add a separate section to introduce other types and if so then renaming it to types might not be a great choice.

@lukewagner
Copy link
Member

Section Headers

+1 That is actually how I implemented it initially in SM and I also found the switch to make the code a bit grosser.

Section Names

Hah, this was also on my personal "tidy-up todo".

Function Bodies

+1. Symmetric to this, there's also an implicit dependency on the whole module's byte size in the current proposal in #597 that we'd probably want to replace with, perhaps, an up-front section count (after magic+version, before first section).

@titzer
Copy link

titzer commented Mar 29, 2016

On Thu, Mar 24, 2016 at 3:36 PM, rossberg-chromium <[email protected]

wrote:

Having finished a first iteration of the encoder & decoder for the spec,
I'd like to make a couple of small suggestions regarding the structure of
sections in the binary. Lumping the together here.
Section Headers

Instead of
(payload_size; name_string; payload)
as the overall section structure, can we swap that to
(name_string; payload_size; payload)
? Two minor advantages:

It makes the offset computation slightly less confusing.

It allows simply viewing a section as a pair
(name_string; payload_string),
which is particularly natural wrt handling and skipping unknown
sections.

With that change, skipping over sections requires skipping over their name
first. But I have a hard time imagining a reason for skipping a section
without even knowing what it is.

Ok by me.

Section Names

Honestly, the current names are super verbose. Would anybody be opposed to
shorten them a bit, to something nicer? I'd suggest:

"signatures" -> "types"
"import_table" -> "imports"
"function_signatures" -> "functions"
"export_table" -> "exports"
"start_function" -> "start"
"function_bodies" -> "code"
"data_segments" -> "data"

Note that the signatures section may be generalised to contain other
kinds of type definitions in the future, so the current name is not a good
fit.

+1

Function Bodies

Function bodies are implicitly delimited by the byte size of their
encoding. This is unfortunate for a couple of reasons:

The expression decoder cannot just operate on an abstract stream, it
has to take this secondary end-of-stream condition into account. The spec
currently handles this by a somewhat ad-hoc notion of "substream", but
really, this pierces the stream abstraction in an ugly manner.

This is the only piece of the binary format that stands in the way of
formulating the entire format unambiguously as a grammar. Because it
depends on non-local (and lower-level) context information. I find it
rather sad to get 99% there but lose it on the last meter.

I'd hence like to suggest adding an explicit end opcode to functions.
Pros:

The binary format is fully structured and can be entirely parsed
linearly from an abstract byte stream, without paying attention to any size
information.

Sizes would only be needed to (a) seek through the stream if desired,
(b) validating that they are consistent. That decouples concerns nicely.

I'm aware that there are concerns that the extra byte is "redundant", but
is one byte per function a big deal? We recently saved much more by
improving the representation of locals (or would by shortening section
names :) ).

I think we discussed this before. It is redundant since a decoder always
needs to do proper bounds checking against the end of the stream when
decoding a function body anyway. An end byte doesn't make that check
redundant, since there can always be malformed input. And yes, I think
avoiding another byte per function is worth it; there are 37k functions in
AngryBots :-)


You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub
#623

@ghost
Copy link

ghost commented Mar 29, 2016

@titzer Yes, I guess with 37k functions and one byte each which gives 0.3% of the wasm file size this is not insignificant for such a small matter and there seems no prospect of it being optimized in the operator table or with macros. Might question if the current AngryBots is representative as it seems to have a lot of small functions which might be trivial methods that should have been inlined anyway.

@jfbastien jfbastien added this to the MVP milestone Apr 5, 2016
@rossberg
Copy link
Member Author

rossberg commented Apr 5, 2016

[Back from vacation.]

It seems there is mostly consensus for the first two suggestions, and sympathy for the latter. I'll create separate PRs for each suggestion, and we can iterate on the details there.

@binji, I picked "code" and "data" to mirror the usual code/data dichotomy familiar from other binaries. Not attached to that, though.

@jfbastien, yes, "string" is supposed to denote a (length, bytes) pair.

@kripken, as others pointed out, we still have use for the function size.

@JSStats, regarding the types vs signatures section, we already decided at the Feb summit to make the function type constructor explicit to support extensibility. It's just that nobody got around turning that one into a PR yet. I'll do.

@titzer, the size bound introduces a secondary end-of-stream condition. With actual streaming, you cannot know upfront which end-of-stream condition will fire first (as the size value might be invalid), so you always have to check against both.

This was referenced Apr 5, 2016
@ghost
Copy link

ghost commented Apr 5, 2016

Looking at the sexp-wasm single pass compiler-to-stack-machine, there are some frustrating pieces of code that seem relevant to the function end declaration matter. For each function top level expression it needs to know the number of expected values to decide if they are to be discarded, but not knowing this it emits a discard operation and then if at the end of the function it needs the values it removes this discard operation. This also blocks top-down typing in a single pass because the expected type of a function top level expression is not known when starting to build the expression. This all suggests that for the pre-order encoding having an expression count, just like for the blocks, would help simplify the decoder. This comment seems specific to the pre-order decoder, so might be of limited relevance moving forward.

@jfbastien
Copy link
Member

I think this is long done. @rossberg-chromium please reopen if that's not the case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants