Extensible encoding of function signatures #640


Merged
merged 13 commits into from
Apr 19, 2016

rossberg
Member

@rossberg rossberg commented Apr 5, 2016

[Retargeted #636.]

Better support future type system extensions with

  • new forms of type definitions,
  • functions with multiple return values,
  • potential mixture of base type constructors (i32..f64) and structured types in common positions, by avoiding overlap in their encoding.

Motivation: when moving to a richer type language, you'll need a proper AST for type expressions. Currently, our grammar is simple:

value_type ::= i32 | i64 | f32 | f64
structural_type ::= value_type* -> value_type?
type_id ::= uint32

where structural types are the ones defined in the type section, and type_ids are references to these. In particular, there is no overlap between the different syntactic classes; they are always used in distinct places.

Now, when we grow the language, it becomes highly conceivable that (some of) this separation no longer applies, and that some of the phrases grow additional alternatives. For example, in a potential extension with both struct and function pointers, we might have the following:

value_type ::= i32 | i64 | f32 | f64 | type_id
structural_type ::= value_type* -> value_type? | {value_type*}
type_id ::= int32

It's also conceivable that we will want to allow nested type expressions at some point, or not require naming every structural type, in which case we'd get something like

value_type ::= i32 | i64 | f32 | f64 | type_id | structural_type
structural_type ::= value_type* -> value_type? | {value_type*}
type_id ::= int32
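As a rough illustration of what "a proper AST for type expressions" could look like for this last grammar, here is a minimal Python sketch. The class names (`Prim`, `TypeId`, `FuncType`, `StructType`) and the `show` helper are hypothetical and chosen only for this example; struct types and type ids in value position are the speculative extensions discussed above, not part of the current format.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple, Union


class Prim(Enum):
    """The four primitive value types of the current grammar."""
    I32 = "i32"
    I64 = "i64"
    F32 = "f32"
    F64 = "f64"


@dataclass(frozen=True)
class TypeId:
    """Reference to a type section entry (hypothetical use as a value type)."""
    index: int


@dataclass(frozen=True)
class FuncType:
    """value_type* -> value_type?"""
    params: Tuple["ValueType", ...]
    result: Optional["ValueType"]


@dataclass(frozen=True)
class StructType:
    """{value_type*} (hypothetical extension)."""
    fields: Tuple["ValueType", ...]


# value_type ::= i32 | i64 | f32 | f64 | type_id | structural_type
ValueType = Union[Prim, TypeId, FuncType, StructType]


def show(t: ValueType) -> str:
    """Render a type expression in a notation close to the grammar."""
    if isinstance(t, Prim):
        return t.value
    if isinstance(t, TypeId):
        return f"${t.index}"
    if isinstance(t, StructType):
        return "{" + " ".join(show(f) for f in t.fields) + "}"
    params = " ".join(show(p) for p in t.params)
    result = show(t.result) if t.result is not None else ""
    return f"({params}) -> ({result})"
```

With nesting allowed, a function type taking an `i32` and an anonymous struct can be written directly, e.g. `FuncType((Prim.I32, StructType((Prim.F64, TypeId(3)))), None)`, without naming the struct in the type section first.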

This PR simply ensures that the encoding of the current grammar is future-compatible with such potential extensions, by not baking in assumptions about the size or disjointness of syntactic classes: the production "value_type ::= type_id" could be encoded with another opcode for type references (followed by an index immediate), struct types with another opcode for the struct type constructor; the embedding of structural types into value types would require no extra opcode if we keep their index spaces disjoint (which is a design choice with zero cost).

A variation would be to overlay "type constructor" and "type id" opcode/index space e.g. by signedness, similar to what Luke suggests. That would save introducing an extra opcode for type references in the future. Regardless, I'd still propose to keep primitive and structural type constructor spaces disjoint.

| ----- | ----- | ----- |
| constructor | `0x40` | the function type constructor |
Member

[Re-asking my question in the closed PR]

Instead of making an arbitrary-feeling reservation, what if the nullary constructors were positive and the user-defined constructors were negative and we used (signed) varint32, starting at -1 and going down?

@kripken
Member

kripken commented Apr 5, 2016

I must have missed something, what is a "function type constructor"?

| return_count | `uint8` | the number of results from the function (0 or 1) |
| return_type | `value_type?` | the result type of the function (if return_count is 1) |

(Note: In the future, this section may contain other forms of type entries as well.)
Contributor

So will the set of acceptable type entry kinds be tied to specific versions of the binary format (and WASM)? Specifically, WASM consumer implementations cannot just implement a subset of the constructors and claim to support a specific version of WASM. I am assuming that is the case, so that it is impossible to reach a type entry with an unknown constructor value and "type entries" can never be safely skipped.

@naturaltransformation
Contributor

In this context, a constructor is essentially a tag (from a tagged union) so function type constructor is the tag and also a constructor for building function types (i.e., apply 0x40 to a record to build a function type). This is in contrast to type constructor and data constructor which construct types (and possibly type definitions I'm assuming) and values respectively. Would the term "tag" or "type entry kind" be slightly friendlier here?

@kripken
Member

kripken commented Apr 5, 2016

Thanks @naturaltransformation, but I'm afraid I understood maybe 5% of that :) Am I stupid? Or does this PR suddenly and unexpectedly add a completely new concept to the design repo? Neither the term "constructor" nor "tag" seem to appear previously in this repo?

I don't see what tagged unions have to do with wasm function types, which are simple "this is the return type, these are the argument types" type things?

What is a "record" here?

Finally, what does "build a function type" mean? And why do we need to "build" ones, instead of just the simple way of defining them we had so far (like (param f64) (ret i32))?

Again, I might be stupid here, sorry if so.

@naturaltransformation
Contributor

Heh, no, sorry about that. It is my fault. I guess I tried to explain jargon with other jargon. The change is to leave room for other kinds of type_entries. "Tagged unions" is in the sense of an implementation detail of the binary decoder/encoder. The bottom line is that the "function type constructor" is just a tag to indicate that a given type_entry is a function type and not a typedef or some other unforeseen kind of type_entry. At least that is how I'm understanding it.

@naturaltransformation
Contributor

As I understand it, this change does not impact how WASM types are used in WASM modules at all, only that the binary format decoder now needs to accommodate potential additional kinds of type_entries.
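To make the decoder-side effect concrete, here is a minimal Python sketch of reading one type entry by dispatching on the leading tag byte. It rests on several assumptions not spelled out in this thread: that the entry layout is tag (`0x40`), parameter count, parameter types, return count, optional return type; that counts are unsigned LEB128; and that the primitive types are numbered 1 to 4. The function names are hypothetical.

```python
from typing import List, Optional, Tuple

FUNC_FORM = 0x40  # tag value for function types, as proposed in this PR

# Assumed numbering of the primitive value types (not stated in this thread).
VALUE_TYPES = {1: "i32", 2: "i64", 3: "f32", 4: "f64"}


def read_varuint(data: bytes, pos: int) -> Tuple[int, int]:
    """Decode an unsigned LEB128 integer; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return result, pos
        shift += 7


def read_value_type(data: bytes, pos: int) -> Tuple[str, int]:
    """Decode a single-byte value type."""
    code = data[pos]
    if code not in VALUE_TYPES:
        raise ValueError(f"unknown value type {code:#x}")
    return VALUE_TYPES[code], pos + 1


def read_type_entry(data: bytes, pos: int = 0):
    """Decode one type entry; return ((params, result), next position)."""
    tag = data[pos]
    pos += 1
    if tag != FUNC_FORM:
        # An entry with an unknown tag has an unknown layout, so it
        # cannot be skipped safely; reject it instead.
        raise ValueError(f"unknown type entry tag {tag:#x}")
    param_count, pos = read_varuint(data, pos)
    params: List[str] = []
    for _ in range(param_count):
        ty, pos = read_value_type(data, pos)
        params.append(ty)
    return_count, pos = read_varuint(data, pos)
    if return_count > 1:
        raise ValueError("at most one result is allowed for now")
    result: Optional[str] = None
    if return_count == 1:
        result, pos = read_value_type(data, pos)
    return (params, result), pos
```

Under these assumptions, `read_type_entry(bytes([0x40, 2, 1, 4, 1, 2]))` describes a function taking `i32` and `f64` and returning `i64`. A future kind of type entry would add another branch on the tag rather than change how existing entries are read.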

@kripken
Member

kripken commented Apr 5, 2016

Oh, I see, so this is saying "this type is a simple function type", and eventually this field will be used to indicate whether a type is something else, when we have such things? Cool, thanks.

Perhaps "constructor", "tag", etc. would be surprising for other people as well? I am obviously the farthest thing from a type theorist, but probably other non-type theorists will read this document too ;)

#### Signature entry
| Field | Type | Description |
#### Type entry
| Field | Type/Value | Description |
Contributor

Instead of a Type/Value column, maybe we can leave it as a Type column and move the 0x40 value to Description to be consistent with the other fixed value fields such as the module header.

@naturaltransformation
Contributor

Yeah, it would be nice to have more accessible terminology. I suppose it is a "type_entry type", but that has its own problems.

@ghost

ghost commented Apr 5, 2016

This still seems far from extensible; rather, it requires binary format version changes when adding new type declaration formats. Just a thought, but could the type definitions section be abstracted into its own AST, following a similar format to the function bodies, and use the planned operator table to define future type definition opcodes?

@rossberg rossberg mentioned this pull request Apr 6, 2016
@rossberg
Member Author

rossberg commented Apr 6, 2016

I extended the PR description with more detail about what this PR is trying to achieve and how, hopefully answering most of the questions that came up.

Also addressed some comments:

@kripken, renamed constructor field to form (of type), for lack of a better name.

@lukewagner, changed the return_count to be a varuint1. ;)

@lukewagner, @titzer, keeping the index space of primitive and structural type constructors disjoint keeps the door open to perhaps allow inline use of (some?) structural types in the future (avoiding having to name every type). Not sure we ever want to do that, but given the size of the index space, I see no reason to preclude it prematurely (I can't imagine a universe where we will ever have more than 128 different type constructors, or even get close to that number).

@JSStats, laying the groundwork for being able to grow type encodings into proper AST encodings is exactly what this PR tries to achieve, see the updated description.

| Field | Type | Description |
| ----- | ----- | ----- |
| form | `uint8` | `0x40`, indicating a function type |
Member

We could keep the disjointness of primitive and structural type constructors by giving the former monotonically increasing integers and the latter monotonically decreasing integers (starting by giving functions a form of -1). So same aesthetic preference for avoiding arbitrary-feeling statements like "no one will ever need more than 0x3f primitive type constructors", but different meaning for the negative index than in the previous comment. If so, then the type would be a varint32.
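For readers unfamiliar with signed variable-length integers, the following Python sketch shows how a signed LEB128 encoder (the scheme assumed here for `varint32`) would lay out small positive and negative values. The function name `encode_varint` is hypothetical; the point is only that values from -64 to 63 each fit in a single byte, so both small positive and small negative constructor numbers would stay compact.

```python
def encode_varint(value: int) -> bytes:
    """Encode a signed integer as signed LEB128."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7  # arithmetic shift, so negative values converge to -1
        # Stop once the remaining value is pure sign extension of bit 6.
        done = (value == 0 and byte & 0x40 == 0) or (
            value == -1 and byte & 0x40 != 0
        )
        out.append(byte if done else byte | 0x80)
        if done:
            return bytes(out)
```

For example, `encode_varint(-1)` yields the single byte `0x7f` and `encode_varint(63)` yields `0x3f`, while `encode_varint(64)` needs two bytes (`0xc0 0x00`).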

Member Author

I'm fine with using var(u)int. But the type space potentially indexes 3 sorts of things: primitive constructors, structural constructors, type ids. If we want to use the signedness overlay trick, then it seems much more beneficial to reserve that for distinguishing between ids and constructors in the future, so that type id references could be single byte (in which case we probably need to assign negative numbers to all constructors now). WDYT?

Member

> But the type space potentially indexes 3 sorts of things: primitive constructors, structural constructors, type ids

I didn't understand this part of the OP: why would we want to add the complication of "inlining" certain structural constructors instead of just using a type id?

> If we want to use the signedness overlay trick, then it seems much more beneficial to reserve that for distinguishing between ids and constructors in the future

Yes, if we can rule out the third case as I'm asking above then the encoding could be pretty simple: if positive, it's a primitive, if negative, it's a (negated) type-id. But that'd be the encoding of a value type. form is the encoding of a different set: the set of compound constructors. So as observed earlier, it could completely overlap with both primitives and type-ids. I can see the argument for keeping the encoding of compound ctors disjoint from that of primitive ctors (so that you can represent the set of all constructors with an int), but that seems to work just fine with giving the compound ctors negative indices.
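A small Python sketch of the sign-based value type encoding described in this comment may help. It is illustrative only: the primitive numbering (1 to 4) is an assumption, and the thread does not say how type id 0 would be represented, so this sketch arbitrarily maps a negative value `n` to type id `-n - 1` and rejects `0`. The function name is hypothetical.

```python
from typing import Tuple, Union

# Assumed numbering of the primitive value types (not stated in this thread).
PRIMITIVES = {1: "i32", 2: "i64", 3: "f32", 4: "f64"}


def classify_value_type(n: int) -> Tuple[str, Union[str, int]]:
    """Interpret a decoded signed integer as a value type.

    Positive values name primitives; negative values refer to entries
    in the type section.
    """
    if n > 0:
        if n not in PRIMITIVES:
            raise ValueError(f"unknown primitive type {n}")
        return ("primitive", PRIMITIVES[n])
    if n < 0:
        # Assumption of this sketch: -1 is type id 0, -2 is type id 1, ...
        return ("type_id", -n - 1)
    raise ValueError("0 is not assigned in this sketch")
```

Under this reading, the compound constructors (`form`) live in a separate position and never appear as value types, which is why their numbering could overlap with either range.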

Member Author

> I didn't understand this part of the OP: why would we want to add the complication of "inlining" certain structural constructors instead of just using a type id?

E.g. to reduce the overhead of one-off uses of structural types. Or imagine
you want to emit a more complicated nested struct, then you wouldn't need
to separate out and name each level.

Perhaps the other way round is a more compelling scenario: you may want to
allow primitive constructors in type definitions. Or they'll start to mix
when we introduce type import/exports some day. Also, it's hard to predict
what other thing might come along and change the story (generics? who
knows).

I'm not suggesting that there currently is a concrete reason to join the
spaces. But it doesn't seem completely unlikely to arise later, and given
that the cost of keeping it an option is zero, why preclude it?

> Yes, if we can rule out the third case as I'm asking above then the encoding could be pretty simple: if positive, it's a primitive, if negative, it's a (negated) type-id.

I'd actually invert that scheme, to avoid the negation.

Member

What I don't understand is why all those future new things can't just be new forms of entries in the types section such that a type-id is all you need?

> I'd actually invert that scheme, to avoid the negation.

Wouldn't that give all the primitive types today (i32, etc) negative indices?

Member Author

@titzer, ha, I knew I shouldn't have mentioned generics. This is getting OT, but let me just say that I would like to avoid their complexity as much as the next guy, while I'm also aware that "our language doesn't need generics" have become famous last words of language/VM designers. Static compilation only works for fairly weak, second-class polymorphism; in more expressive cases (which all big languages but C++ support) you'd be forced to introduce unions and lots of expensive runtime checks. Maybe that's okay for Wasm.

Member

> In contrast, where do you see the downside of avoiding conflicts between the spaces until we understand the future better?

I also don't understand what is being proposed in this PR to address this: superficially this is just a question of 0x40 vs. -1. What are you proposing happens for these new-kinds-of-types?

> in more expressive cases (which all big languages but C++ support) you'd be forced to introduce unions and lots of expensive runtime checks

Still OT, but: yes, for Java-style. For C#-style, though, I was assuming that a C#-on-wasm runtime would actually need to ship with its own runtime machinery to do runtime generation of wasm for instantiations that only show up at runtime since I'd be surprised if we could design a feature in wasm that wasn't overly specialized to C# but that could still handle the C# use case.

Member Author

> In contrast, where do you see the downside of avoiding conflicts between the spaces until we understand the future better?
>
> I also don't understand what is being proposed in this PR to address this: superficially this is just a question of 0x40 vs. -1. What are you proposing happens for these new-kinds-of-types?

Hm, I thought we just agreed that signedness is best reserved for
distinguishing type ids. So this PR avoids clobbering the opposite sign
space, and instead just picks an arbitrary opcode for the function type
that doesn't collide with the primitive ones. 0x40 just because it
partitions the positive 1-byte signed LEB value range into two equal
halves, reserving one side for nullary, the other for non-nullary
constructors (which may or may not make sense, I'm open to better
suggestions; there are probably going to be far more nullary constructors
than others, but either way the space is comfortably large AFAICT).
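The split described here can be sketched in a few lines of Python. This is only an illustration of the proposed convention over the 7-bit range `0x00` to `0x7f`, not something the format itself defines; the function name is hypothetical.

```python
def constructor_class(code: int) -> str:
    """Classify a 7-bit constructor code under the proposed split."""
    if not 0 <= code <= 0x7F:
        raise ValueError(f"constructor code {code:#x} is outside 0x00..0x7f")
    # Lower half: nullary constructors (i32, i64, ...).
    # Upper half: non-nullary constructors (function types at 0x40, ...).
    return "nullary" if code < 0x40 else "non-nullary"
```

Each half holds 64 codes, and the function type constructor `0x40` is the first code of the non-nullary half.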

Member

> Hm, I thought we just agreed that signedness is best reserved for distinguishing type ids.

If non-nullary constructors don't show up in value types (only their type-ids), then both could have negative indices. But I guess the counterargument is: maybe not the 3 non-nullary ctors we're thinking about now (func, struct, array), but perhaps some new thing in the future and if positive indices are "reserved" for nullary and negative is "reserved" for type-ids, then we're out of luck (or, at the very least, we'd have to break the pattern). I guess I buy that, so 0x40 is fine. More than just the aesthetic -1 vs. 0x40, this has been a question of where this is going, which I think I now have a better understanding of.

Member

Incidentally, I came across a situation which supports the current design: in Wasm.Table we want to be able to declare the types of elements in the table definition/import. The abovementioned scheme for local types seems to fit (allowing table elements to have any type that you can put in a local, or some restriction thereof), but the question is how to say "any function". Well, since we already have this 0x40 "Function" constructor in the index space, that seems to be a good candidate (even if it's a slight abuse of logical category). Similarly, the "Struct" and "Array" constructors could mean "any struct type" / "any array type" which could make sense one day.

pizlonator and others added 2 commits April 15, 2016 13:17
* When embedded in the web, clarify how export/import names convert to JS strings (#569)

* Fixes suggested by @jf

* Address more feedback

Added a link to http://monsur.hossa.in/2012/07/20/utf-8-in-javascript.html.  Simplified the decoding algorithm thanks to Luke's feedback.
@rossberg
Member Author

Ping on this one. Is there desire for more discussion, or should we land and leave potential opcode details for follow-ups?

@ghost

ghost commented Apr 18, 2016

From what I understand any change to the type section will require a wasm binary version bump anyway? Even adding functions with multiple return values would appear to not validate without a version bump due to the tight definition of the fields. So what would this change achieve?

@rossberg
Member Author

Forwards compatibility. Even with version bumps, old versions should still be subsets of new versions.

@lukewagner
Member

lgtm


The indirect function table section defines the module's
The indirection section defines the module's

Worth spelling out "indirect function table" as per original text?

Member Author

This one got renamed to table section on main branch.

The Function Bodies section assigns a body to every function in the module.
The count of function signatures and function bodies must be the same and the `i`th
signature corresponds to the `i`th function body.
The code section assigns a body to every function in the module.

"contains a function body for every function in the module"

Assign is just so...imperative :-)

Member Author

Done

@rossberg rossberg merged commit 54da3d5 into binary_0xb Apr 19, 2016
@jfbastien jfbastien deleted the types-sec branch April 19, 2016 21:05
lukewagner pushed a commit that referenced this pull request Apr 28, 2016
* Prettify section names

* Restructure encoding of function signatures

* Revert "[Binary 11] Update the version number to 0xB."

* Leave index space for growing the number of base types

* Comments addressed

* clarify how export/import names convert to JS strings (#569) (#573)

* When embedded in the web, clarify how export/import names convert to JS strings (#569)

* Fixes suggested by @jf

* Address more feedback

Added a link to http://monsur.hossa.in/2012/07/20/utf-8-in-javascript.html.  Simplified the decoding algorithm thanks to Luke's feedback.

* Access to proprietary APIs apart from HTML5 (#656)

* comments
lukewagner added a commit that referenced this pull request Apr 29, 2016
* Merge pull request #648 from WebAssembly/current_memory

Add current_memory operator

* Reorder section size field (#639)

* Prettify section names (#638)

* Extensible encoding of function signatures (#640)

* Prettify section names

* Restructure encoding of function signatures

* Revert "[Binary 11] Update the version number to 0xB."

* Leave index space for growing the number of base types

* Comments addressed

* clarify how export/import names convert to JS strings (#569) (#573)

* When embedded in the web, clarify how export/import names convert to JS strings (#569)

* Fixes suggested by @jf

* Address more feedback

Added a link to http://monsur.hossa.in/2012/07/20/utf-8-in-javascript.html.  Simplified the decoding algorithm thanks to Luke's feedback.

* Access to proprietary APIs apart from HTML5 (#656)

* comments

* Merge pull request #641 from WebAssembly/postorder_opcodes

Postorder opcodes

* fix some text that seems to be in the wrong order (#670)

* Clarify that br_table has a branch argument (#664)

* Add explicit argument counts (#672)

* Add explicit arities

* Rename

* Replace uint8 with varint7 in form field (#662)

This needs to be variable-length.

7 participants