Extensible encoding of function signatures #640

rossberg · 2016-04-05T14:02:03Z

[Retargeted #636.]

Better support future type system extensions with

* new forms of type definitions,
* functions with multiple return values,
* potential mixture of base type constructors (i32..f64) and structured types in common positions, by avoiding overlap in their encoding.

Motivation: when moving to a richer type language, you'll need a proper AST for type expressions. Currently, our grammar is simple:

value_type ::= i32 | i64 | f32 | f64
structural_type ::= value_type* -> value_type?
type_id ::= uint32

where structural types are the ones defined in the type section, and type_ids are references to these. In particular, there is no overlap between the different syntactic classes; they are always used in distinct places.

Now, when we grow the language, it becomes highly conceivable that (some of) this separation no longer applies, and that some of the phrases grow additional alternatives. For example, in a potential extension with both struct and function pointers, we might have the following:

value_type ::= i32 | i64 | f32 | f64 | type_id
structural_type ::= value_type* -> value_type? | {value_type*}
type_id ::= int32

It's also conceivable that we will want to allow nested type expressions at some point, or not require naming every structural type, in which case we'd get something like

value_type ::= i32 | i64 | f32 | f64 | type_id | structural_type
structural_type ::= value_type* -> value_type? | {value_type*}
type_id ::= int32

This PR simply ensures that the encoding of the current grammar is future-compatible with such potential extensions, by not baking in assumptions about the size or disjointness of syntactic classes: the production "value_type ::= type_id" could be encoded with another opcode for type references (followed by an index immediate), struct types with another opcode for the struct type constructor; the embedding of structural types into value types would require no extra opcode if we keep their index spaces disjoint (which is a design choice with zero cost).
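As a rough illustration of the paragraph above, here is a minimal Python sketch of a value-type decoder that could grow a "type reference" alternative without disturbing the existing encodings. The numeric opcodes (`0x01`..`0x04` for the primitives, `0x20` for `TYPE_REF`) and the function names are assumptions made for this example only; the thread does not fix them.

```python
# Assumed opcode assignments, for illustration only.
PRIMITIVES = {0x01: "i32", 0x02: "i64", 0x03: "f32", 0x04: "f64"}
TYPE_REF = 0x20  # hypothetical future opcode: reference to a type section entry


def read_varuint(data, pos):
    """Decode an unsigned LEB128 integer; return (value, new_pos)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return result, pos
        shift += 7


def read_value_type(data, pos):
    """Decode one value type; return (type, new_pos)."""
    opcode = data[pos]
    pos += 1
    if opcode in PRIMITIVES:
        # Nullary constructor: no immediates follow.
        return PRIMITIVES[opcode], pos
    if opcode == TYPE_REF:
        # Type reference: the opcode is followed by an index immediate.
        index, pos = read_varuint(data, pos)
        return ("ref", index), pos
    raise ValueError(f"unknown value type opcode {opcode:#x}")
```

Because the primitive opcodes and the hypothetical reference opcode never collide, a decoder written for today's grammar only needs a new branch, not a new layout, when the extra production arrives.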

A variation would be to overlay "type constructor" and "type id" opcode/index space e.g. by signedness, similar to what Luke suggests. That would save introducing an extra opcode for type references in the future. Regardless, I'd still propose to keep primitive and structural type constructor spaces disjoint.

[Binary 11] Update the version number to 0xB.

Revert "[Binary 11] Update the version number to 0xB."

lukewagner · 2016-04-05T15:52:47Z

BinaryEncoding.md

 | ----- |  ----- | ----- |
+| constructor | `0x40` | the function type constructor |


[Re-asking my question in the closed PR]

Instead of making an arbitrary-feeling reservation, what if the nullary constructors were positive and the user-defined constructors were negative and we used (signed) varint32, starting at -1 and going down?

kripken · 2016-04-05T16:26:03Z

I must have missed something, what is a "function type constructor"?

naturaltransformation · 2016-04-05T17:39:22Z

BinaryEncoding.md

+| return_count | `uint8` | the number of results from the function (0 or 1) |
+| return_type | `value_type?` | the result type of the function (if return_count is 1) |
+
+(Note: In the future, this section may contain other forms of type entries as well.)


So will the set of acceptable type entry kinds be tied to specific versions of the binary format (and WASM)? Specifically, WASM consumer implementations cannot just implement a subset of the constructors and claim to support a specific version of WASM. I am assuming that is the case, so that it is impossible to reach a type entry with an unknown constructor value and "type entries" can never be safely skipped.

naturaltransformation · 2016-04-05T17:54:52Z

In this context, a constructor is essentially a tag (from a tagged union) so function type constructor is the tag and also a constructor for building function types (i.e., apply 0x40 to a record to build a function type). This is in contrast to type constructor and data constructor which construct types (and possibly type definitions I'm assuming) and values respectively. Would the term "tag" or "type entry kind" be slightly friendlier here?

kripken · 2016-04-05T18:08:33Z

Thanks @naturaltransformation, but I'm afraid I understood maybe 5% of that :) Am I stupid? Or does this PR suddenly and unexpectedly add a completely new concept to the design repo? Neither the term "constructor" nor "tag" seem to appear previously in this repo?

I don't see what tagged unions have to do with wasm function types, which are simple "this is the return type, these are the argument types" type things?

What is a "record" here?

Finally, what does "build a function type" mean? And why do we need to "build" ones, instead of just the simple way of defining them we had so far (like (param f64) (ret i32))?

Again, I might be stupid here, sorry if so.

naturaltransformation · 2016-04-05T18:13:24Z

Heh, no, sorry about that. It is my fault. I guess I tried to explain jargon with other jargon. The change is to leave room for other kinds of type_entries. "Tagged unions" is in the sense of an implementation detail of the binary decoder/encoder. The bottom line is that the "function type constructor" is just a tag to indicate that a given type_entry is a function type and not a typedef or some other unforeseen kind of type_entry. At least that is how I'm understanding it.

naturaltransformation · 2016-04-05T18:16:22Z

As I understand it, this change does not impact how WASM types are used in WASM modules at all, only that the binary format decoder now needs to accommodate potential additional kinds of type_entries.
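To make the "tag" reading concrete, here is a minimal Python sketch of a decoder for one type entry that dispatches on the leading tag byte and rejects anything it does not know. The `0x40` tag, `return_count` and `return_type` come from the diff under review; the `param_count` field, the primitive encodings `0x01`..`0x04`, and all function names are assumptions for the example.

```python
FUNC_FORM = 0x40  # tag for function types, as in the diff under review
VALUE_TYPES = {0x01: "i32", 0x02: "i64", 0x03: "f32", 0x04: "f64"}  # assumed


def read_varuint(data, pos):
    """Decode an unsigned LEB128 integer; return (value, new_pos)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return result, pos
        shift += 7


def read_type_entry(data, pos):
    """Decode one type entry; return (entry, new_pos)."""
    form = data[pos]
    pos += 1
    if form != FUNC_FORM:
        # An unknown tag cannot be skipped: its layout is not known.
        raise ValueError(f"unknown type entry form {form:#x}")
    param_count, pos = read_varuint(data, pos)  # assumed field
    params = []
    for _ in range(param_count):
        params.append(VALUE_TYPES[data[pos]])
        pos += 1
    return_count = data[pos]
    pos += 1
    if return_count > 1:
        raise ValueError("return_count must be 0 or 1")
    result = None
    if return_count == 1:
        result = VALUE_TYPES[data[pos]]
        pos += 1
    return {"form": "func", "params": params, "result": result}, pos
```

A future kind of type entry would add another branch on `form`; the function branch itself would stay as it is.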

kripken · 2016-04-05T18:23:52Z

Oh, I see, so this is saying "this type is a simple function type", and eventually this field will be used to indicate whether a type is something else, when we have such things? Cool, thanks.

Perhaps "constructor", "tag", etc. would be surprising for other people as well? I am obviously the farthest thing from a type theorist, but probably other non-type theorists will read this document too ;)

naturaltransformation · 2016-04-05T18:31:32Z

BinaryEncoding.md

-#### Signature entry
-| Field | Type | Description |
+#### Type entry
+| Field | Type/Value | Description |


Instead of a Type/Value column, maybe we can leave it as a Type column and move the 0x40 value to Description to be consistent with the other fixed value fields such as the module header.

naturaltransformation · 2016-04-05T18:38:46Z

Yeah, it would be nice to have more accessible terminology. I suppose it is a "type_entry type", but that has its own problems.

ghost · 2016-04-05T22:06:17Z

This still seems far from extensible, rather it requires binary format version changes when adding new type declaration formats. Just a thought but could the type definitions section be abstracted into their own AST, following a similar format to the function bodies, and use the planned operator table to define future type definition opcodes?

rossberg · 2016-04-06T12:27:02Z

I extended the PR description with more detail about what this PR is trying to achieve and how, hopefully answering most of the questions that came up.

Also addressed some comments:

@kripken, renamed constructor field to form (of type), for lack of a better name.

@lukewagner, changed the return_count to be a varuint1. ;)

@lukewagner, @titzer, keeping the index space of primitive and structural type constructors disjoint keeps the door open to perhaps allow inline use of (some?) structural types in the future (avoiding having to name every type). Not sure we ever want to do that, but given the size of the index space, I see no reason to preclude it prematurely (I can't imagine a universe where we will ever have more than 128 different type constructors, or even get close to that number).

@JSStats, laying the groundwork for being able to grow type encodings into proper AST encodings is exactly what this PR tries to achieve, see the updated description.

lukewagner · 2016-04-06T14:28:52Z

BinaryEncoding.md

 | Field | Type | Description |
 | ----- |  ----- | ----- |
+| form | `uint8` | `0x40`, indicating a function type |


We could keep the disjointness of primitive and structural type constructors by giving the former monotonically increasing integers and the latter monotonically decreasing integers (starting by giving functions a form of -1). So same aesthetic preference for avoiding arbitrary-feeling statements like "no one will ever need more than 0x3f primitive type constructors", but different meaning for the negative index than in the previous comment. If so then, the type would be a varint32.

I'm fine with using var(u)int. But the type space potentially indexes 3 sorts of things: primitive constructors, structural constructors, type ids. If we want to use the signedness overlay trick, then it seems much more beneficial to reserve that for distinguishing between ids and constructors in the future, so that type id references could be single byte (in which case we probably need to assign negative numbers to all constructors now). WDYT?

> But the type space potentially indexes 3 sorts of things: primitive constructors, structural constructors, type ids

I didn't understand this part of the OP: why would we want to add the complication of "inlining" certain structural constructors instead of just using a type id?

> If we want to use the signedness overlay trick, then it seems much more beneficial to reserve that for distinguishing between ids and constructors in the future

Yes, if we can rule out the third case as I'm asking above then the encoding could be pretty simple: if positive, it's a primitive, if negative, it's a (negated) type-id. But that'd be the encoding of a value type. form is the encoding of a different set: the set of compound constructors. So as observed earlier, it could completely overlap with both primitives and type-ids. I can see the argument for keeping the encoding of compound ctors disjoint from that of primitive ctors (so that you can represent the set of all constructors with an int), but that seems to work just fine with giving the compound ctors negative indices.
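The sign-based scheme described here could look roughly like the following Python sketch. It decodes a signed LEB128 integer and classifies it by sign. Two details are assumptions for the example, not something the thread settles: the primitive numbering `1`..`4`, and the mapping of type-id `k` to `-(k + 1)` so that type-id 0 has a negative encoding.

```python
PRIMITIVES = {1: "i32", 2: "i64", 3: "f32", 4: "f64"}  # assumed numbering


def read_varint(data, pos):
    """Decode a signed LEB128 integer; return (value, new_pos)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        shift += 7
        if byte & 0x80 == 0:
            # Bit 6 of the final byte is the sign bit.
            if byte & 0x40:
                result -= 1 << shift
            return result, pos


def classify_value_type(n):
    """Positive: primitive constructor. Negative: reference to a type-id."""
    if n > 0:
        return ("primitive", PRIMITIVES[n])
    if n < 0:
        # Assumed mapping: -1 -> type-id 0, -2 -> type-id 1, ...
        return ("type_id", -n - 1)
    raise ValueError("0 is not assigned in this sketch")
```

With this layout the first 64 type-ids and all primitives fit in a single byte, which is the saving the signedness overlay is after.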

> I didn't understand this part of the OP: why would we want to add the complication of "inlining" certain structural constructors instead of just using a type id?

E.g. to reduce the overhead of one-off uses of structural types. Or imagine
you want to emit a more complicated nested struct, then you wouldn't need
to separate out and name each level.

Perhaps the other way round is a more compelling scenario: you may want to
allow primitive constructors in type definitions. Or they'll start to mix
when we introduce type import/exports some day. Also, it's hard to predict
what other thing might come along and change the story (generics? who
knows).

I'm not suggesting that there currently is a concrete reason to join the
spaces. But it doesn't seem completely unlikely to arise later, and given
that the cost of keeping it an option is zero, why preclude it?

> Yes, if we can rule out the third case as I'm asking above then the encoding could be pretty simple: if positive, it's a primitive, if negative, it's a (negated) type-id.

I'd actually invert that scheme, to avoid the negation.

What I don't understand is why all those future new things can't just be new forms of entries in the types section such that a type-id is all you need?

> I'd actually invert that scheme, to avoid the negation.

Wouldn't that give all the primitive types today (i32, etc) negative indices?

@titzer, ha, I knew I shouldn't have mentioned generics. This is getting OT, but let me just say that I would like to avoid their complexity as much as the next guy, while I'm also aware that "our language doesn't need generics" have become famous last words of language/VM designers. Static compilation only works for fairly weak, second-class polymorphism; in more expressive cases (which all big languages but C++ support) you'd be forced to introduce unions and lots of expensive runtime checks. Maybe that's okay for Wasm.

In contrast, where do you see the downside of avoiding conflicts between the spaces until we
understand the future better?

I also don't understand what is being proposed in this PR to address this: superficially this is just a question of 0x40 vs. -1. What are you proposing happens for these new-kinds-of-types?

> in more expressive cases (which all big languages but C++ support) you'd be forced to introduce unions and lots of expensive runtime checks

Still OT, but: yes, for Java-style. For C#-style, though, I was assuming that a C#-on-wasm runtime would actually need to ship with its own runtime machinery to do runtime generation of wasm for instantiations that only show up at runtime since I'd be surprised if we could design a feature in wasm that wasn't overly specialized to C# but that could still handle the C# use case.

> In contrast, where do you see the downside of avoiding conflicts between the spaces until we understand the future better?
>
> I also don't understand what is being proposed in this PR to address this: superficially this is just a question of 0x40 vs. -1. What are you proposing happens for these new-kinds-of-types?

Hm, I thought we just agreed that signedness is best reserved for
distinguishing type ids. So this PR avoids clobbering the opposite sign
space, and instead just picks an arbitrary opcode for the function type
that doesn't collide with the primitive ones. 0x40 just because it
partitions the positive 1-byte signed LEB value range into two equal
halves, reserving one side for nullary, the other for non-nullary
constructors (which may or may not make sense, I'm open to better
suggestions; there are probably going to be far more nullary constructors
than others, but either way the space is comfortably large AFAICT).
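For reference when weighing the choice of 0x40, this small Python sketch computes which values a single LEB128 byte can carry under the standard definition (7 payload bits, with bit 6 acting as the sign bit in the signed variant). The function names are made up for the example, and the sketch does not settle which reading the comment above intends.

```python
def one_byte_unsigned(byte):
    """Value of a one-byte unsigned LEB128 encoding (continuation bit clear)."""
    assert byte < 0x80, "continuation bit set: not a one-byte encoding"
    return byte


def one_byte_signed(byte):
    """Value of a one-byte signed LEB128 encoding; bit 6 is the sign bit."""
    assert byte < 0x80, "continuation bit set: not a one-byte encoding"
    return byte - 0x80 if byte & 0x40 else byte


unsigned_values = [one_byte_unsigned(b) for b in range(0x80)]
signed_values = [one_byte_signed(b) for b in range(0x80)]

print(min(unsigned_values), max(unsigned_values))  # 0 127
print(min(signed_values), max(signed_values))      # -64 63
print(one_byte_unsigned(0x40), one_byte_signed(0x40))  # 64 -64
```

Read as unsigned, 0x40 is 64, the midpoint of 0..127. Read as a signed one-byte value, the same byte decodes to -64, so the two interpretations place the function form on opposite sides of zero.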

> Hm, I thought we just agreed that signedness is best reserved for distinguishing type ids.

If non-nullary constructors don't show up in value types (only their type-ids), then both could have negative indices. But I guess the counterargument is: maybe not the 3 non-nullary ctors we're thinking about now (func, struct, array), but perhaps some new thing in the future and if positive indices are "reserved" for nullary and negative is "reserved" for type-ids, then we're out of luck (or, at the very least, we'd have to break the pattern). I guess I buy that, so 0x40 is fine. More than just the aesthetic -1 vs. 0x40 has been the question of where this is going which I think I now have a better understanding of.

Incidentally, I came across a situation which supports the current design: in Wasm.Table we want to be able to declare the types of elements in the table definition/import. The abovementioned scheme for local types seems to fit (allowing table elements to have any type that you can put in a local, or some restriction thereof), but the question is how to say "any function". Well, since we already have this 0x40 "Function" constructor in the index space, that seems to be a good candidate (even if it's a slight abuse of logical category). Similarly, the "Struct" and "Array" constructors could mean "any struct type" / "any array type" which could make sense one day.

@jf

* When embedded in the web, clarify how export/import names convert to JS strings (#569) * Fixes suggested by @jf * Address more feedback Added a link to http://monsur.hossa.in/2012/07/20/utf-8-in-javascript.html. Simplified the decoding algorithm thanks to Luke's feedback.

rossberg · 2016-04-18T12:20:14Z

Ping on this one. Is there desire for more discussion, or should we land and leave potential opcode details for follow-ups?

ghost · 2016-04-18T12:34:46Z

From what I understand any change to the type section will require a wasm binary version bump anyway? Even adding functions with multiple return values would appear to not validate without a version bump due to the tight definition of the fields. So what would this change achieve?

rossberg · 2016-04-18T15:45:09Z

Forwards compatibility. Even with version bumps, old versions should still be subsets of new versions.

lukewagner · 2016-04-18T16:13:06Z

lgtm

titzer · 2016-04-18T16:30:39Z

BinaryEncoding.md


-The indirect function table section defines the module's 
+The indirection section defines the module's 


Worth spelling out "indirect function table" as per original text?

This one got renamed to table section on main branch.

titzer · 2016-04-18T16:31:19Z

BinaryEncoding.md

-The Function Bodies section assigns a body to every function in the module.
-The count of function signatures and function bodies must be the same and the `i`th
-signature corresponds to the `i`th function body.
+The code section assigns a body to every function in the module.


"contains a function body for every function in the module"

Assign is just so...imperative :-)

@jf

* Prettify section names * Restructure encoding of function signatures * Revert "[Binary 11] Update the version number to 0xB." * Leave index space for growing the number of base types * Comments addressed * clarify how export/import names convert to JS strings (#569) (#573) * When embedded in the web, clarify how export/import names convert to JS strings (#569) * Fixes suggested by @jf * Address more feedback Added a link to http://monsur.hossa.in/2012/07/20/utf-8-in-javascript.html. Simplified the decoding algorithm thanks to Luke's feedback. * Access to proprietary APIs apart from HTML5 (#656) * comments

@jf

* Merge pull request #648 from WebAssembly/current_memory Add current_memory operator * Reorder section size field (#639) * Prettify section names (#638) * Extensible encoding of function signatures (#640) * Prettify section names * Restructure encoding of function signatures * Revert "[Binary 11] Update the version number to 0xB." * Leave index space for growing the number of base types * Comments addressed * clarify how export/import names convert to JS strings (#569) (#573) * When embedded in the web, clarify how export/import names convert to JS strings (#569) * Fixes suggested by @jf * Address more feedback Added a link to http://monsur.hossa.in/2012/07/20/utf-8-in-javascript.html. Simplified the decoding algorithm thanks to Luke's feedback. * Access to proprietary APIs apart from HTML5 (#656) * comments * Merge pull request #641 from WebAssembly/postorder_opcodes Postorder opcodes * fix some text that seems to be in the wrong order (#670) * Clarify that br_table has a branch argument (#664) * Add explicit argument counts (#672) * Add explicit arities * Rename * Replace uint8 with varint7 in form field (#662) This needs to be variable-length.

rossberg added 2 commits April 5, 2016 14:07

Prettify section names

f78543d

Restructure encoding of function signatures

bb8a6e4

rossberg mentioned this pull request Apr 5, 2016

Extensible encoding of function signatures #636

Closed

titzer and others added 4 commits April 5, 2016 16:51

Merge pull request #637 from WebAssembly/binary_0xb_version

247d7b6

[Binary 11] Update the version number to 0xB.

Revert "[Binary 11] Update the version number to 0xB."

7c8be53

Merge pull request #642 from WebAssembly/revert-637-binary_0xb_version

0ac338d

Revert "[Binary 11] Update the version number to 0xB."

Leave index space for growing the number of base types

c63829e

lukewagner reviewed Apr 5, 2016

naturaltransformation reviewed Apr 5, 2016

rossberg mentioned this pull request Apr 6, 2016

Prettify section names #638

Merged

Comments addressed

ec21463

lukewagner reviewed Apr 6, 2016

pizlonator and others added 2 commits April 15, 2016 13:17

Access to proprietary APIs apart from HTML5 (#656)

2312afd

rossberg added the binary format label Apr 18, 2016

titzer reviewed Apr 18, 2016

rossberg added 4 commits April 19, 2016 14:54

Merge branch 'master' into types-sec

903d997

Merge branch 'master' into types-sec

3c53609

Merge branch 'binary_0xb' into types-sec

df569d7

comments

aa0407d

rossberg merged commit 54da3d5 into binary_0xb Apr 19, 2016

kripken mentioned this pull request Apr 19, 2016

Extensible type forms WebAssembly/binaryen#367

Merged

jfbastien deleted the types-sec branch April 19, 2016 21:05

lukewagner mentioned this pull request Apr 20, 2016

no need to reserve 0 for "void" type #663

Closed
