Tweaks to binary section format? #623

rossberg · 2016-03-24T14:35:51Z

Having finished a first iteration of the encoder & decoder for the spec, I'd like to make a couple of small suggestions regarding the structure of sections in the binary. Lumping the together here.

Section Headers

Instead of
(payload_size; name_string; payload)
as the overall section structure, can we swap that to
(name_string; payload_size; payload)
? Two minor advantages:

It makes the offset computation slightly less confusing.
It allows simply viewing a section as a pair
(name_string; payload_string),
which is particularly natural wrt handling and skipping unknown sections.

With that change, skipping over sections requires skipping over their name first. But I have a hard time imagining a reason for skipping a section without even knowing what it is.

Section Names

Honestly, the current names are super verbose. Would anybody be opposed to shorten them a bit, to something nicer? I'd suggest:

"signatures"          -> "types"
"import_table"        -> "imports"
"function_signatures" -> "functions"
"export_table"        -> "exports"
"start_function"      -> "start"
"function_bodies"     -> "code"
"data_segments"       -> "data"

Note that the signatures section may be generalised to contain other kinds of type definitions in the future, so the current name is not a good fit.

Function Bodies

Function bodies are implicitly delimited by the byte size of their encoding. This is unfortunate for a couple of reasons:

The expression decoder cannot just operate on an abstract stream, it has to take this secondary end-of-stream condition into account. The spec currently handles this by a somewhat ad-hoc notion of "substream", but really, this pierces the stream abstraction in an ugly manner.
This is the only piece of the binary format that stands in the way of formulating the entire format unambiguously as a grammar. Because it depends on non-local (and lower-level) context information. I find it rather sad to get 99% there but lose it on the last meter.

I'd hence like to suggest adding an explicit end opcode to functions. Pros:

The binary format is fully structured and can be entirely parsed linearly from an abstract byte stream, without paying attention to any size information.
Sizes would only be needed to (a) seek through the stream if desired, (b) validating that they are consistent. That decouples concerns nicely.

I'm aware that there are concerns that the extra byte is "redundant", but is one byte per function a big deal? We recently saved much more by improving the representation of locals (or would by shortening section names :) ).

The text was updated successfully, but these errors were encountered:

wanderer · 2016-03-24T17:35:46Z

Section Names

If I understand correctly the whole string is going into the binary. Could each section type be mapped to an int and represented with a single byte?

I'm aware that there are concerns that the extra byte is "redundant", but is one byte per function a big deal?

If you have an end op could you drop the function body size? Could you reuse the end section instead of a new op?

kripken · 2016-03-24T17:35:58Z

+1 to all of these.

For the last one, we'd be adding an extra byte, but removing the varuint32 for "body size", shouldn't that be an overall win? 1 byte vs. 1+.

binji · 2016-03-24T17:36:28Z

Instead of
(payload_size; name_string; payload)
as the overall section structure, can we swap that to
(name_string; payload_size; payload)

Agreed, I found it strange that the size of the section occurs before the section name as well.

Honestly, the current names are super verbose.

+1. These are the basic sections that all modules will have, let's make them succinct.

"signatures" -> "types"
"import_table" -> "imports"
"function_signatures" -> "functions"
"export_table" -> "exports"
"start_function" -> "start"
"function_bodies" -> "code"
"data_segments" -> "data"

OK with all of these except maybe functions and code. Don't have better ideas though.

I'd hence like to suggest adding an explicit end opcode to functions.

+1

binji · 2016-03-24T17:38:15Z

If I understand correctly the whole string is going to the end binary. Could each section type be mapped to an int and represented with a with a single byte?

This is how it used to be, but was changed to allow extending the format later with new section types.

If you have an end op could you drop the function body size? Could you reuse the end section instead of a new op?

The postorder format already adds an end op, so we'd just reuse it there, I think.

jfbastien · 2016-03-24T19:48:44Z

Does name_string include the size of the string? The purpose it to be able to skip sections with unknown names, and since they're not null-terminated we need a size. Reordering sounds fine to me.

ghost · 2016-03-24T21:45:15Z

@rossberg-chromium I also find it a little odd that the function top level expressions are terminated by a stream byte count whereas other blocks are terminated differently (count or end-marker). I don't think one extra byte per function is a blocker. Making them consistent would appear to give a small reduction in complexity for the decoder.

@kripken The function body size is still useful for the use case of skipping them quickly, for building a directory of the position and size of each function body, but I guess this could be in an optional directory section.

Thus it seems possible to remove the function body size too. So a decoder reading an incorrectly balanced function body AST would just continue interpreting the following data, perhaps hitting a 'syntax' error, and at the end of the section checking that all functions had been decoded and that the section terminates on the correct section size?

kripken · 2016-03-24T21:58:49Z

For fast skipping of function bodies, we could allow multiple function sections. However, I'm not sure we need to focus on optimizing skipping of bodies? Skipping entire sections is useful, but skipping bodies seems unnecessary given we now have signatures beforehand anyhow. Normal VM operation won't want to skip function bodies.

binji · 2016-03-24T22:11:58Z

@kripken we'll want to quickly skip function bodies for parallel decoding.

kripken · 2016-03-24T22:33:32Z

Oh, right, thanks.

sunfishcode · 2016-03-25T22:50:59Z

Section names: it's a bikeshed of course, but I prefer the more descriptive names. In particular, "functions", "types", and "code" seem likely to confuse.

ghost · 2016-03-27T22:11:41Z

Regarding the signatures section renaming, this section does not appear amenable to extension at present, rather it is specialized to the function signature types. Thus extending it in future would break readability which might not be practical when wasm is deployed, and perhaps it would be more likely that wasm would add a separate section to introduce other types and if so then renaming it to types might not be a great choice.

lukewagner · 2016-03-28T18:39:22Z

Section Headers

+1 That is actually how I implemented it initially in SM and I also found the switch to make the code a bit grosser.

Section Names

Hah, this was also on my personal "tidy-up todo".

Function Bodies

+1. Symmetric to this, there's also an implicit dependency on the whole module's byte size in the current proposal in #597 that we'd probably want to replace with, perhaps, an up-front section count (after magic+version, before first section).

titzer · 2016-03-29T09:22:39Z

On Thu, Mar 24, 2016 at 3:36 PM, rossberg-chromium <[email protected]

wrote:

Having finished a first iteration of the encoder & decoder for the spec,
I'd like to make a couple of small suggestions regarding the structure of
sections in the binary. Lumping the together here.
Section Headers

Instead of
(payload_size; name_string; payload)
as the overall section structure, can we swap that to
(name_string; payload_size; payload)
? Two minor advantages:

It makes the offset computation slightly less confusing.

It allows simply viewing a section as a pair
(name_string; payload_string),
which is particularly natural wrt handling and skipping unknown
sections.

With that change, skipping over sections requires skipping over their name
first. But I have a hard time imagining a reason for skipping a section
without even knowing what it is.

Ok by me.

Section Names

Honestly, the current names are super verbose. Would anybody be opposed to
shorten them a bit, to something nicer? I'd suggest:

"signatures" -> "types"
"import_table" -> "imports"
"function_signatures" -> "functions"
"export_table" -> "exports"
"start_function" -> "start"
"function_bodies" -> "code"
"data_segments" -> "data"

Note that the signatures section may be generalised to contain other
kinds of type definitions in the future, so the current name is not a good
fit.

+1

Function Bodies

Function bodies are implicitly delimited by the byte size of their
encoding. This is unfortunate for a couple of reasons:

The expression decoder cannot just operate on an abstract stream, it
has to take this secondary end-of-stream condition into account. The spec
currently handles this by a somewhat ad-hoc notion of "substream", but
really, this pierces the stream abstraction in an ugly manner.

This is the only piece of the binary format that stands in the way of
formulating the entire format unambiguously as a grammar. Because it
depends on non-local (and lower-level) context information. I find it
rather sad to get 99% there but lose it on the last meter.

I'd hence like to suggest adding an explicit end opcode to functions.
Pros:

The binary format is fully structured and can be entirely parsed
linearly from an abstract byte stream, without paying attention to any size
information.

Sizes would only be needed to (a) seek through the stream if desired,
(b) validating that they are consistent. That decouples concerns nicely.

I'm aware that there are concerns that the extra byte is "redundant", but
is one byte per function a big deal? We recently saved much more by
improving the representation of locals (or would by shortening section
names :) ).

I think we discussed this before. It is redundant since a decoder always
needs to do proper bounds checking against the end of the stream when
decoding a function body anyway. An end byte doesn't make that check
redundant, since there can always be malformed input. And yes, I think
avoiding another byte per function is worth it; there are 37k functions in
AngryBots :-)

—
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub
#623

ghost · 2016-03-29T15:45:35Z

@titzer Yes, I guess with 37k functions and one byte each which gives 0.3% of the wasm file size this is not insignificant for such a small matter and there seems no prospect of it being optimized in the operator table or with macros. Might question if the current AngryBots is representative as it seems to have a lot of small functions which might be trivial methods that should have been inlined anyway.

rossberg · 2016-04-05T11:38:05Z

[Back from vacation.]

It seems there is mostly consensus for the first two suggestions, and sympathy for the latter. I'll create separate PRs for each suggestion, and we can iterate on the details there.

@binji, I picked "code" and "data" to mirror the usual code/data dichotomy familiar from other binaries. Not attached to that, though.

@jfbastien, yes, "string" is supposed to denote a (length, bytes) pair.

@kripken, as others pointed out, we still have use for the function size.

@JSStats, regarding the types vs signatures section, we already decided at the Feb summit to make the function type constructor explicit to support extensibility. It's just that nobody got around turning that one into a PR yet. I'll do.

@titzer, the size bound introduces a secondary end-of-stream condition. With actual streaming, you cannot know upfront which end-of-stream condition will fire first (as the size value might be invalid), so you always have to check against both.

ghost · 2016-04-05T13:45:07Z

Looking at the sexp-wasm single pass compiler-to-stack-machine, there are some frustrating pieces of code that seem relevant to the function end declaration matter. For each function top level expression it needs to know the number of expected values to decide if they are to be discarded, but not knowing this it emits a discard operation and then if at the end of the function it needs the values it removes this discard operation. This also blocks top-down typing in a single pass because the expected type of a function top level expression is not known when starting to build the expression. This all suggests that for the pre-order encoding having an expression count, just like for the blocks, would help simplify the decoder. This comment seems specific to the pre-order decoder, so might be of limited relevance moving forward.

jfbastien · 2016-12-15T23:50:55Z

I think this is long done. @rossberg-chromium please reopen if that's not the case.

jfbastien added this to the MVP milestone Apr 5, 2016

jfbastien added the binary format label Apr 5, 2016

This was referenced Apr 5, 2016

Reorder section size field #634

Closed

Prettify section names #635

Closed

This was referenced Apr 5, 2016

Prettify section names #638

Merged

Reorder section size field #639

Merged

rossberg mentioned this issue Apr 21, 2016

Add end opcode to functions #666

Closed

jfbastien closed this as completed Dec 15, 2016

Tweaks to binary section format? #623

Tweaks to binary section format? #623

Comments

rossberg commented Mar 24, 2016

Section Headers

Section Names

Function Bodies

wanderer commented Mar 24, 2016

Uh oh!

kripken commented Mar 24, 2016

Uh oh!

binji commented Mar 24, 2016

Uh oh!

binji commented Mar 24, 2016

Uh oh!

jfbastien commented Mar 24, 2016

Uh oh!

ghost commented Mar 24, 2016

Uh oh!

kripken commented Mar 24, 2016

Uh oh!

binji commented Mar 24, 2016

Uh oh!

kripken commented Mar 24, 2016

Uh oh!

sunfishcode commented Mar 25, 2016

Uh oh!

ghost commented Mar 27, 2016

Uh oh!

lukewagner commented Mar 28, 2016

Uh oh!

titzer commented Mar 29, 2016

It makes the offset computation slightly less confusing.

The expression decoder cannot just operate on an abstract stream, it has to take this secondary end-of-stream condition into account. The spec currently handles this by a somewhat ad-hoc notion of "substream", but really, this pierces the stream abstraction in an ugly manner.

The binary format is fully structured and can be entirely parsed linearly from an abstract byte stream, without paying attention to any size information.

Uh oh!

ghost commented Mar 29, 2016

Uh oh!

rossberg commented Apr 5, 2016

Uh oh!

ghost commented Apr 5, 2016

Uh oh!

jfbastien commented Dec 15, 2016

Uh oh!

The expression decoder cannot just operate on an abstract stream, it
has to take this secondary end-of-stream condition into account. The spec
currently handles this by a somewhat ad-hoc notion of "substream", but
really, this pierces the stream abstraction in an ugly manner.

The binary format is fully structured and can be entirely parsed
linearly from an abstract byte stream, without paying attention to any size
information.