Extensible encoding of function signatures #640
[Binary 11] Update the version number to 0xB.
Revert "[Binary 11] Update the version number to 0xB."
| ----- | ----- | ----- |
| constructor | `0x40` | the function type constructor |
[Re-asking my question in the closed PR]
Instead of making an arbitrary-feeling reservation, what if the nullary constructors were positive and the user-defined constructors were negative and we used (signed) `varint32`, starting at -1 and going down?
I must have missed something, what is a "function type constructor"?
| return_count | `uint8` | the number of results from the function (0 or 1) |
| return_type | `value_type?` | the result type of the function (if return_count is 1) |

(Note: In the future, this section may contain other forms of type entries as well.)
So will the set of acceptable type entry kinds be tied to specific versions of the binary format (and WASM)? Specifically, WASM consumer implementations cannot just implement a subset of the constructors and claim to support a specific version of WASM. I am assuming that is the case, so that it is impossible to reach a type entry with an unknown constructor value and "type entries" can never be safely skipped.
In this context, a constructor is essentially a tag (from a tagged union) so function type constructor is the tag and also a constructor for building function types (i.e., apply 0x40 to a record to build a function type). This is in contrast to type constructor and data constructor which construct types (and possibly type definitions I'm assuming) and values respectively. Would the term "tag" or "type entry kind" be slightly friendlier here?
Thanks @naturaltransformation, but I'm afraid I understood maybe 5% of that :) Am I stupid? Or does this PR suddenly and unexpectedly add a completely new concept to the design repo? Neither the term "constructor" nor "tag" seem to appear previously in this repo? I don't see what tagged unions have to do with wasm function types, which are simple "this is the return type, these are the argument types" type things? What is a "record" here? Finally, what does "build a function type" mean? And why do we need to "build" ones, instead of just the simple way of defining them we had so far (like Again, I might be stupid here, sorry if so.
Heh, no, sorry about that. It is my fault. I guess I tried to explain jargon with other jargon. The change is to leave room for other kinds of type_entries. "Tagged unions" is in the sense of an implementation detail of the binary decoder/encoder. The bottom line is that the "function type constructor" is just a tag to indicate that a given type_entry is a function type and not a typedef or some other unforeseen kind of type_entry. At least that is how I'm understanding it.
As I understand it, this change does not impact how WASM types are used in WASM modules at all, only that the binary format decoder now needs to accommodate potential additional kinds of type_entries.
Oh, I see, so this is saying "this type is a simple function type", and eventually this field will be used to indicate whether a type is something else, when we have such things? Cool, thanks. Perhaps "constructor", "tag", etc. would be surprising for other people as well? I am obviously the farthest thing from a type theorist, but probably other non-type theorists will read this document too ;)
#### Signature entry
| Field | Type | Description |
#### Type entry
| Field | Type/Value | Description |
Instead of a Type/Value column, maybe we can leave it as a Type column and move the 0x40 value to Description to be consistent with the other fixed value fields such as the module header.
Yeah, it would be nice to have more accessible terminology. I suppose it is a "type_entry type", but that has its own problems.
This still seems far from extensible, rather it requires binary format version changes when adding new type declaration formats. Just a thought but could the type definitions section be abstracted into their own AST, following a similar format to the function bodies, and use the planned operator table to define future type definition opcodes?
I extended the PR description with more detail about what this PR is trying to achieve and how, hopefully answering most of the questions that came up. Also addressed some comments: @kripken, renamed @lukewagner, changed the return_count to be a varuint1. ;) @lukewagner, @titzer, keeping the index space of primitive and structural type constructors disjoint keeps the door open to perhaps allow inline use of (some?) structural types in the future (avoiding having to name every type). Not sure we ever want to do that, but given the size of the index space, I see no reason to preclude it prematurely (I can't imagine a universe where we will ever have more than 128 different type constructors, or even get close to that number). @JSStats, laying the groundwork for being able to grow type encodings into proper AST encodings is exactly what this PR tries to achieve, see the updated description.
| Field | Type | Description |
| ----- | ----- | ----- |
| form | `uint8` | `0x40`, indicating a function type |
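
To make the role of the `form` field concrete, here is a minimal decoder sketch for a single type entry. Only `form`, `return_count` and `return_type` are taken from the table rows quoted in this conversation; the `param_count`/parameter fields, the helper names, and the byte values chosen for `i32`..`f64` are assumptions made for illustration, not the spec's definitions.

```python
# Sketch of decoding one type entry. Only `form`, `return_count` and
# `return_type` come from the table in this PR; the parameter fields and the
# value_type byte values below are assumptions made for the example.

FUNC_FORM = 0x40  # form value indicating a function type (from the PR)

# Hypothetical value_type encodings, chosen only for illustration.
VALUE_TYPES = {0x01: "i32", 0x02: "i64", 0x03: "f32", 0x04: "f64"}


def read_varuint(data, pos):
    """Decode an unsigned LEB128 integer; return (value, new_pos)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if (byte & 0x80) == 0:
            return result, pos
        shift += 7


def read_value_type(data, pos):
    """Read one value_type byte and map it to a name."""
    code = data[pos]
    if code not in VALUE_TYPES:
        raise ValueError(f"unknown value type 0x{code:02x}")
    return VALUE_TYPES[code], pos + 1


def read_type_entry(data, pos=0):
    """Decode a type entry; unknown forms are rejected rather than skipped."""
    form = data[pos]
    pos += 1
    if form != FUNC_FORM:
        raise ValueError(f"unknown type entry form 0x{form:02x}")
    param_count, pos = read_varuint(data, pos)  # assumed field
    params = []
    for _ in range(param_count):
        value_type, pos = read_value_type(data, pos)
        params.append(value_type)
    return_count, pos = read_varuint(data, pos)
    if return_count > 1:
        raise ValueError("return_count must be 0 or 1")
    result = None
    if return_count == 1:
        result, pos = read_value_type(data, pos)
    return {"form": form, "params": params, "result": result}, pos
```

A future kind of type entry would add another branch on `form`; a decoder written this way fails loudly on a form it does not know.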
We could keep the disjointness of primitive and structural type constructors by giving the former monotonically increasing integers and the latter monotonically decreasing integers (starting by giving functions a `form` of `-1`). So same aesthetic preference for avoiding arbitrary-feeling statements like "no one will ever need more than `0x3f` primitive type constructors", but different meaning for the negative index than in the previous comment. If so, then the type would be a `varint32`.
I'm fine with using var(u)int. But the type space potentially indexes 3 sorts of things: primitive constructors, structural constructors, type ids. If we want to use the signedness overlay trick, then it seems much more beneficial to reserve that for distinguishing between ids and constructors in the future, so that type id references could be single byte (in which case we probably need to assign negative numbers to all constructors now). WDYT?
But the type space potentially indexes 3 sorts of things: primitive constructors, structural constructors, type ids
I didn't understand this part of the OP: why would we want to add the complication of "inlining" certain structural constructors instead of just using a type id?
If we want to use the signedness overlay trick, then it seems much more beneficial to reserve that for distinguishing between ids and constructors in the future
Yes, if we can rule out the third case as I'm asking above then the encoding could be pretty simple: if positive, it's a primitive, if negative, it's a (negated) type-id. But that'd be the encoding of a value type. `form` is the encoding of a different set: the set of compound constructors. So as observed earlier, it could completely overlap with both primitives and type-ids. I can see the argument for keeping the encoding of compound ctors disjoint from that of primitive ctors (so that you can represent the set of all constructors with an int), but that seems to work just fine with giving the compound ctors negative indices.
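
The signedness overlay discussed here can be sketched as follows. This is only an illustration of the idea, not a proposed encoding: the function names are hypothetical, and the mapping of `-1` to type id 0 is an assumption (the comment only says "negated").

```python
def read_varint(data, pos=0):
    """Decode a signed LEB128 integer; return (value, new_pos)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        shift += 7
        if (byte & 0x80) == 0:
            # Bit 6 of the final byte is the sign bit: sign-extend if set.
            if byte & 0x40:
                result -= 1 << shift
            return result, pos


def classify_value_type(data, pos=0):
    """Non-negative values name a primitive, negative values a type id."""
    value, pos = read_varint(data, pos)
    if value >= 0:
        return ("primitive", value), pos
    # Assumed mapping: -1 -> type id 0, -2 -> type id 1, and so on.
    return ("type_id", -value - 1), pos
```

With this scheme the first 64 type ids and the first 64 primitives each fit in a single byte.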
I didn't understand this part of the OP: why would we want to add the
complication of "inlining" certain structural constructors instead of just
using a type id?
E.g. to reduce the overhead of one-off uses of structural types. Or imagine
you want to emit a more complicated nested struct, then you wouldn't need
to separate out and name each level.
Perhaps the other way round is a more compelling scenario: you may want to
allow primitive constructors in type definitions. Or they'll start to mix
when we introduce type import/exports some day. Also, it's hard to predict
what other thing might come along and change the story (generics? who
knows).
I'm not suggesting that there currently is a concrete reason to join the
spaces. But it doesn't seem completely unlikely to arise later, and given
that the cost of keeping it an option is zero, why preclude it?
Yes, if we can rule out the third case as I'm asking above then the
encoding could be pretty simple: if positive, it's a primitive, if
negative, it's a (negated) type-id.
I'd actually invert that scheme, to avoid the negation.
What I don't understand is why all those future new things can't just be new `form`s of entries in the types section such that a type-id is all you need?
I'd actually invert that scheme, to avoid the negation.
Wouldn't that give all the primitive types today (`i32`, etc.) negative indices?
@titzer, ha, I knew I shouldn't have mentioned generics. This is getting OT, but let me just say that I would like to avoid their complexity as much as the next guy, while I'm also aware that "our language doesn't need generics" have become famous last words of language/VM designers. Static compilation only works for fairly weak, second-class polymorphism; in more expressive cases (which all big languages but C++ support) you'd be forced to introduce unions and lots of expensive runtime checks. Maybe that's okay for Wasm.
In contrast, where do you see the downside of avoiding conflicts between the spaces until we understand the future better?
I also don't understand what is being proposed in this PR to address this: superficially this is just a question of `0x40` vs. `-1`. What are you proposing happens for these new-kinds-of-types?
in more expressive cases (which all big languages but C++ support) you'd be forced to introduce unions
and lots of expensive runtime checks
Still OT, but: yes, for Java-style. For C#-style, though, I was assuming that a C#-on-wasm runtime would actually need to ship with its own runtime machinery to do runtime generation of wasm for instantiations that only show up at runtime since I'd be surprised if we could design a feature in wasm that wasn't overly specialized to C# but that could still handle the C# use case.
In contrast, where do you see the downside of avoiding conflicts between
the spaces until we understand the future better?
I also don't understand what is being proposed in this PR to address this:
superficially this is just a question of 0x40 vs. -1. What are you
proposing happens for these new-kinds-of-types?
Hm, I thought we just agreed that signedness is best reserved for
distinguishing type ids. So this PR avoids clobbering the opposite sign
space, and instead just picks an arbitrary opcode for the function type
that doesn't collide with the primitive ones. 0x40 just because it
partitions the positive 1-byte signed LEB value range into two equal
halves, reserving one side for nullary, the other for non-nullary
constructors (which may or may not make sense, I'm open to better
suggestions; there are probably going to be far more nullary constructors
than others, but either way the space is comfortably large AFAICT).
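
As a rough illustration of the partition described here, the sketch below treats a constructor code as an unsigned 7-bit payload (0x00 to 0x7f) and splits that range at `0x40`. The function name and the unsigned reading are assumptions made for the example; how the split interacts with a signed reading of the byte is left aside.

```python
FUNC_FORM = 0x40  # first code in the upper half, used for the function type


def constructor_kind(code):
    """Classify a one-byte constructor code by which half it falls in."""
    if not 0 <= code <= 0x7F:
        raise ValueError("code does not fit in a single 7-bit payload")
    # Lower half: nullary (primitive) constructors; upper half: non-nullary.
    return "nullary" if code < FUNC_FORM else "non-nullary"


# Both halves hold the same number of codes.
nullary_codes = [c for c in range(0x80) if constructor_kind(c) == "nullary"]
non_nullary_codes = [c for c in range(0x80) if constructor_kind(c) == "non-nullary"]
```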
Hm, I thought we just agreed that signedness is best reserved for distinguishing type ids.
If non-nullary constructors don't show up in value types (only their type-ids), then both could have negative indices. But I guess the counterargument is: maybe not the 3 non-nullary ctors we're thinking about now (func, struct, array), but perhaps some new thing in the future, and if positive indices are "reserved" for nullary and negative is "reserved" for type-ids, then we're out of luck (or, at the very least, we'd have to break the pattern). I guess I buy that, so `0x40` is fine. More than just the aesthetic `-1` vs. `0x40`, this has been a question of where this is going, which I think I now have a better understanding of.
Incidentally, I came across a situation which supports the current design: in Wasm.Table we want to be able to declare the types of elements in the table definition/import. The abovementioned scheme for local types seems to fit (allowing table elements to have any type that you can put in a local, or some restriction thereof), but the question is how to say "any function". Well, since we already have this 0x40 "Function" constructor in the index space, that seems to be a good candidate (even if it's a slight abuse of logical category). Similarly, the "Struct" and "Array" constructors could mean "any struct type" / "any array type" which could make sense one day.
* When embedded in the web, clarify how export/import names convert to JS strings (#569) * Fixes suggested by @jf * Address more feedback Added a link to http://monsur.hossa.in/2012/07/20/utf-8-in-javascript.html. Simplified the decoding algorithm thanks to Luke's feedback.
Ping on this one. Is there desire for more discussion, or should we land and leave potential opcode details for follow-ups?
From what I understand any change to the type section will require a wasm binary version bump anyway? Even adding functions with multiple return values would appear to not validate without a version bump due to the tight definition of the fields. So what would this change achieve?
Forwards compatibility. Even with version bumps, old versions should still be subsets of new versions.
lgtm
The indirect function table section defines the module's
The indirection section defines the module's
Worth spelling out "indirect function table" as per original text?
This one got renamed to table section on main branch.
The Function Bodies section assigns a body to every function in the module.
The count of function signatures and function bodies must be the same and the `i`th
signature corresponds to the `i`th function body.
The code section assigns a body to every function in the module.
"contains a function body for every function in the module"
Assign is just so...imperative :-)
Done
* Prettify section names * Restructure encoding of function signatures * Revert "[Binary 11] Update the version number to 0xB." * Leave index space for growing the number of base types * Comments addressed * clarify how export/import names convert to JS strings (#569) (#573) * When embedded in the web, clarify how export/import names convert to JS strings (#569) * Fixes suggested by @jf * Address more feedback Added a link to http://monsur.hossa.in/2012/07/20/utf-8-in-javascript.html. Simplified the decoding algorithm thanks to Luke's feedback. * Access to proprietary APIs apart from HTML5 (#656) * comments
* Merge pull request #648 from WebAssembly/current_memory Add current_memory operator * Reorder section size field (#639) * Prettify section names (#638) * Extensible encoding of function signatures (#640) * Prettify section names * Restructure encoding of function signatures * Revert "[Binary 11] Update the version number to 0xB." * Leave index space for growing the number of base types * Comments addressed * clarify how export/import names convert to JS strings (#569) (#573) * When embedded in the web, clarify how export/import names convert to JS strings (#569) * Fixes suggested by @jf * Address more feedback Added a link to http://monsur.hossa.in/2012/07/20/utf-8-in-javascript.html. Simplified the decoding algorithm thanks to Luke's feedback. * Access to proprietary APIs apart from HTML5 (#656) * comments * Merge pull request #641 from WebAssembly/postorder_opcodes Postorder opcodes * fix some text that seems to be in the wrong order (#670) * Clarify that br_table has a branch argument (#664) * Add explicit argument counts (#672) * Add explicit arities * Rename * Replace uint8 with varint7 in form field (#662) This needs to be variable-length.
[Retargeted #636.]
Better support future type system extensions with
Motivation: when moving to a richer type language, you'll need a proper AST for type expressions. Currently, our grammar is simple:
value_type ::= i32 | i64 | f32 | f64
structural_type ::= value_type* -> value_type?
type_id ::= uint32
where structural types are the ones defined in the type section, and type_ids are references to these. In particular, there is no overlap between the different syntactic classes, they are always used in distinct places.
Now, when we grow the language, it becomes highly conceivable that (some of) this separation no longer applies, and that some of the phrases grow additional alternatives. For example, in a potential extension with both struct and function pointers, we might have the following:
value_type ::= i32 | i64 | f32 | f64 | type_id
structural_type ::= value_type* -> value_type? | {value_type*}
type_id ::= int32
It's also conceivable that we will want to allow nested type expressions at some point, or not require naming every structural type, in which case we'd get something like
value_type ::= i32 | i64 | f32 | f64 | type_id | structural_type
structural_type ::= value_type* -> value_type? | {value_type*}
type_id ::= int32
This PR simply ensures that the encoding of the current grammar is future-compatible with such potential extensions, by not baking in assumptions about the size or disjointness of syntactic classes: the production "value_type ::= type_id" could be encoded with another opcode for type references (followed by an index immediate), struct types with another opcode for the struct type constructor; the embedding of structural types into value types would require no extra opcode if we keep their index spaces disjoint (which is a design choice with zero cost).
A variation would be to overlay "type constructor" and "type id" opcode/index space e.g. by signedness, similar to what Luke suggests. That would save introducing an extra opcode for type references in the future. Regardless, I'd still propose to keep primitive and structural type constructor spaces disjoint.
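
To illustrate how the extended grammar could map onto an opcode-based encoding, here is a small sketch of a type-expression AST with a prefix encoder. Only `0x40` for function types comes from this PR; the `STRUCT` and `TYPE_REF` opcodes, the primitive byte values, the `param_count` layout, and all class names are hypothetical and exist only to show that nested type expressions fit the same scheme.

```python
from dataclasses import dataclass
from typing import List, Optional, Union

# All opcode values except FUNC (0x40, from this PR) are hypothetical.
PRIMITIVES = {"i32": 0x01, "i64": 0x02, "f32": 0x03, "f64": 0x04}
FUNC = 0x40
STRUCT = 0x41    # hypothetical opcode for {value_type*}
TYPE_REF = 0x42  # hypothetical opcode for a type_id reference


@dataclass
class Prim:
    name: str


@dataclass
class TypeRef:
    index: int


@dataclass
class Func:
    params: List["ValueType"]
    result: Optional["ValueType"]


@dataclass
class Struct:
    fields: List["ValueType"]


ValueType = Union[Prim, TypeRef, Func, Struct]


def varuint(n):
    """Encode a non-negative integer as unsigned LEB128."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def encode(t):
    """Encode a type expression as an opcode followed by its operands."""
    if isinstance(t, Prim):
        return bytes([PRIMITIVES[t.name]])
    if isinstance(t, TypeRef):
        return bytes([TYPE_REF]) + varuint(t.index)
    if isinstance(t, Func):
        out = bytes([FUNC]) + varuint(len(t.params))
        for p in t.params:
            out += encode(p)
        if t.result is None:
            return out + varuint(0)
        return out + varuint(1) + encode(t.result)
    if isinstance(t, Struct):
        out = bytes([STRUCT]) + varuint(len(t.fields))
        for f in t.fields:
            out += encode(f)
        return out
    raise TypeError(f"not a type expression: {t!r}")
```

Because primitive and structural opcodes are disjoint here, a structural type can appear wherever a value type is expected without an extra marker, which is the zero-cost property the description argues for.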