Define iso-recursive type semantics #243
Conversation
I figured that the formulation was a bit hacky and obscure after all, so I replaced it with something more "standard" and simultaneously more confined to the definition of type equivalence. It should be semantically equivalent to the earlier formulation, but cleaner.
Thanks for posting this, @rossberg, and sorry for the delay in responding to it. I still don't completely understand all the details here, so I could use help in answering a few questions.
Recently we have had a number of discussions establishing consensus on what we want specifically for the MVP. In light of the outcomes of those discussions, could you articulate why this variant of isorecursive types should be an MVP feature rather than a Post-MVP feature?
@RossTate given that we agreed to prototype both a nominal and a structural system as MVP candidates, I'm not sure what the motivation would be for requiring such a justification right now, unless you're suggesting that an alternative structural proposal would be a more viable MVP candidate? |
According to the meeting notes, the following is what we agreed to:
The changes in this PR are substantial and will require significant work to implement. As with anything requiring substantial resources, a motivation should be provided. |
Then the poll was poorly phrased given the spirit of what was decided, and if you feel it's necessary we can secure a revised wording in a future meeting. Since we have a group understanding that equi-recursive canonicalisation is not viable, making a decision based on an evaluation of the current MVP against a nominal proposal is essentially the same as selecting the nominal proposal. Given the conversations that took place in the meeting, I don't think many people interpreted the poll in this way. |
It was my understanding that we would pursue both structural and nominal types, and the only viable option (due to the performance issues with equi-recursive types) was an iso-recursive formulation. So I have been working under the assumption that we'd be doing this change. So I generally support this PR, though I have not dug into the details. |
Speaking for Google folks, we've been implementing the system described in the milestone 4 prototype spec to fulfill the intent of that poll. We're also interested in the system in this PR as a potential next step because the milestone 4 spec still uses equirecursive types for its structural component and we will want a better structural component in the final MVP system if there will be a structural component at all. The first step to investigate this system further is discussing it to see if any issues come up that would make implementing it impractical or unnecessary, so I am happy to have this PR open to enable that discussion. |
Attempting to answer my own questions
I think that example works in this new system because subtyping follows precisely the declared supertypes, and the only constraint on the declared supertypes is that their indices must be less than the subtype's index. So in that example B and C would be in the same rec group and would each declare A as a supertype, and each would be valid, because in checking each individual subtype relation you look only at what other subtype relations are declared, not whether the others are also valid. This is exactly how subtyping works in our nominal implementation. One potential problem I can think of is when the ordering constraint on declared supertypes comes into conflict with the adjacency constraint on mutually recursive types within a rec group. If we do find (or believe there exist) concrete instances of this problem, the simplest fix would be to disallow cyclic supertype chains without requiring an ordering constraint. That's the approach we've implemented in Binaryen's nominal implementation and the approach described in the milestone 4 prototype.
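To make the ordering constraint concrete, here is a small hypothetical sketch (the type names and fields are mine, not the example discussed above), written in a text-format rendering of the syntax proposed in this PR. Declared supertypes must sit at lower indices than their subtypes, while plain structural references between members of the group may point forward:

```wat
(rec
  ;; $A comes first, so $B and $C may declare it as their supertype.
  (type $A (sub (struct (field i32))))
  ;; $B and $C each extend $A by one field and reference each other
  ;; structurally, so they at least must share a recursion group;
  ;; $A is included in the group only for illustration.
  (type $B (sub $A (struct (field i32) (field (ref null $C)))))
  (type $C (sub $A (struct (field i32) (field (ref null $B)))))
)
```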
Yes, I believe that two structurally equivalent definitions in the same rec group define distinct types. The reason is that recursive types are equivalent only when their enclosing recursion groups are equivalent and they occupy the same index within those groups.
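A minimal sketch of my own illustrating this point: identity follows from position within a recursion group, not from structure alone.

```wat
;; Distinct types: same structure, but different indices in one group.
(rec
  (type $P (struct (field i32)))
  (type $Q (struct (field i32)))
)

;; Equivalent types: two singleton groups with identical bodies
;; canonicalise to the same type.
(rec (type $R (struct (field i32))))
(rec (type $S (struct (field i32))))
```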
If they do, they will get the current nominal behavior but won't be able to structurally share types. This is fine because these compilers see the whole program, so they should feel free to do this.
A compiler that does want to share source-level structural types structurally would have to place the shared structural types in their own, separate recursion groups.
Each module would then have to duplicate the definitions of the shared types it uses, correct?

A problem

Consider module 1 defining structural tuple types that participate in subtyping, and module 2 wanting to reuse them. So it seems that this system cannot directly express shared structural types when those structural types have subtyping, which I think is the case for at least function types in most relevant languages. I'm struggling to think of workarounds that would not also work in a simpler, purely nominal system. Given this limitation, are the benefits of this system over a purely nominal system worth the extra complexity? What do you think @rossberg? Is there a simple workaround that I'm not seeing?
[Sorry for not replying earlier, was out sick.]
Right.
Right, there are alternative ways to impose the non-cyclicity constraint, but this is the simplest, and I can't think of any scenario where it would be a serious problem. Note that in your example, B could not be pulled out of the recursion group no matter what: the definition of C depends on B, which depends on A, so they are all mutually recursive.
Yes, the degenerate case of just using one whole-program recursive group essentially emulates fully nominal behaviour. That said, I'm confused why an optimisation would depend on distinguishing equivalent representations. Can you elaborate? This is the first time I'm hearing of this problem, I believe.
I don't necessarily expect them to do that, but of course they could, if it avoids some hassle for them.
Yes. This duplication is generally unavoidable anyway, because at the import boundaries each module has to repeat the definition of any imported type it uses, even under a hypothetical fully nominal semantics. Otherwise, validation, compilation, and link-time checking wouldn't work under Wasm's module system. However, a nominal or equi-recursive semantics might get away with keeping some types in a recursion group fully abstract in some import cases, while iso-recursive does not. Not clear how relevant this case would be in practice, though. But it certainly is an incentive to keep recursion groups small when doing separate compilation.
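To illustrate what this duplication looks like in practice (a hypothetical sketch of mine, not taken from the discussion): a module importing a function over a recursive type has to spell out the corresponding recursion group itself, and link-time checking then matches the canonicalised groups structurally.

```wat
;; Module 2 (fragment): repeats the group that module 1 also defines,
;; so that the import below can refer to the type by a local index.
(rec
  (type $node (struct (field i32) (field (ref null $node))))
)
(import "m1" "head" (func $head (param (ref $node)) (result i32)))
```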
Yes, producers have to canonicalise type definitions suitably. That is already the case in some places, e.g., with the order of fields in a struct, or the order of cases in a source-level variant type (when lowering it to int tags). FWIW, the main use case here is linking multiple modules produced by the same compiler, so the type canonicalisation ought to be implicitly consistent. Multiple different compilers for the same language interacting is not a use case we have seriously discussed thus far. If they would not want to go through interface types, they would of course have to agree on and codify their low-level data layout strategy. But that is a fairly hypothetical scenario, I think – I'm not aware of such degree of compatibility between high-level language compilers for native code elsewhere.
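As a tiny illustration of the kind of canonicalisation producers have to agree on (hypothetical types of my own): the emitted definitions have to be structurally identical, down to field order, to be considered the same type.

```wat
;; Emitted identically by both producers:
(rec (type $pair (struct (field i64) (field f64))))

;; A definition with the fields swapped, e.g.
;;   (rec (type $pair2 (struct (field f64) (field i64))))
;; would be an unrelated type, even though a source compiler might
;; regard the two layouts as interchangeable.
```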
Exactly. That is the main property of the iso-recursive semantics.
Yes, with pre-declared subtyping, you have the same limitations in this regard as any such system, including a nominal one.
The benefit of structural typing is independent of that. When compiling nominal source languages, this obviously is a non-issue. When compiling structural source languages that do not have source-level subtyping, you also have no issue. Such languages might still make internal use of subtyping hierarchies, but these tend to be very shallow and well-layered (such as lowering closure types). That is the main use case this is trying to address. The only scenario where the limitation would become relevant would be lowering a source language with structural subtyping to Wasm in a direct, non-coercive manner. Such non-coercive compilation obviously won't fly with pre-declared subtyping, no surprise there. That is the price we are willing to pay, since such languages are rare, and non-coercive compilation for them would likely run into other blockers anyway (e.g., with respect to generics). If this ever becomes a practical concern, the explicit, zero-cost coercion operator I mentioned in my presentation could be a way out. This possibility was pointed out to me by Sam Lindley.
Co-authored-by: Thomas Lively <[email protected]>
Given that we have decided that separate compilation (of any language) is a Post-MVP concern, and that we are willing to add more features in the Post-MVP to address issues that arise for separate compilation, could you either articulate an MVP scenario with which we could evaluate the claimed benefits granted by this extra complexity or articulate a forwards-compatibility issue in the MVP that would prevent us from adding this in a Post-MVP? |
I think it's safe to assume that if a producer wants Binaryen to perform type minimization, then it does not depend on distinguishing the identities of structurally identical types. Otherwise, as you say, there isn't really much that Binaryen can do. So the flag controlling this behavior in Binaryen will be a signal from the producer that it does not need this distinction, which then enables Binaryen to perform the minimization / canonicalization. If the producer needs to distinguish between certain types, it will be its responsibility to create recursion groups that uphold that distinction. |
To make sure I understand, is it correct that we are expecting two workflows here? Like this:
And optimizers like Binaryen will need a flag to distinguish between these two cases. An open question is what the default should be for such a flag. Does all that make sense? Or is there a third use case, or is something wrong in one of the use cases? |
There is a third use case, the first in #243 (comment): the producer expects to link its Wasm modules, but it will always optimize its modules with Binaryen (or a similar tool, but crucially the same one for all modules). Here, the optimizer is free to change the recursion groups as long as it does so in a deterministic manner (e.g. equi-recursive canonicalization) to make the groups consistent across the modules.
@askeksa-google I see, yes, I guess as long as the optimizations are "deterministic" (do not change depending on the code in each module) then that could work. But how much benefit do you expect there? My intuition is that the large improvements come from optimizations that do depend on the code, like removing unused types and fields, making fields immutable, etc. |
I measured the type validation and canonicalization time for a real-world j2wasm module across three type systems, both before and after applying Binaryen optimizations at -O2. For each type system, I measured the total validation and canonicalization time as well as times for two separate components of the work. The first component involves validating the new heap type definitions and deduplicating their definitions and the second component involves moving the new heap types and copying the new value types to the global type stores. The implementation of the second part is the same for all three systems, but there's probably some room to optimize it for the nominal and isorecursive systems. The modules have single large recursion groups in the isorecursive case. I am also interested in measuring the case of minimal recursion groups, but don't have the modules in that format yet. The unoptimized module has 88442 type definitions and is about 93 MB, including a name section. Validating and building these types takes:
The optimized module has 24843 type definitions and is about 10 MB without a name section. Validating and building these types takes:
My takeaway is that the extra overhead of canonicalizing isorecursive types over nominal types is probably not an issue, especially considering that the canonicalization algorithm is linear. |
I also measured an upper bound on the code size cost of the additional constraints on the order of types in the type section. In nominal and equirecursive modes, Binaryen orders types by decreasing use count so that frequently used types have smaller LEB-encoded indices. I measured the same optimized module from the previous comment with the optimal default ordering, an ordering that respects the isorecursive constraints on the ordering of subtypes but otherwise tries to be optimal, and the reverse of the optimal ordering. Again, the isorecursive case uses a single large recursion group, so this is the least constrained case for the isorecursive system, but even in the worst case the isorecursive system can be no worse than the pessimal ordering. Optimal nominal ordering (baseline):
Isorecursive ordering:
Pessimal nominal ordering:
So I don't think binary size will be a problem for the isorecursive type system either. |
@kripken I don't see optimization as the main goal in this scenario. Rather, it allows a producer to not perform any type canonicalization itself, leaving it all to e.g. Binaryen, while still being able to link its modules at runtime. |
@askeksa-google Ok, that's where we were seeing things differently then. I was only thinking about optimizations, where I think a single flag is enough. Yes, it makes sense that if there is other non-optimization work we should do in Binaryen, then that could be handled by additional flags as needed.
@tlively, thanks for these numbers. That's encouraging! FWIW, if necessary, I wouldn't mind imposing a different constraint to ensure acyclicity in the subtype relation. The ordering restriction merely seemed the simplest and cheapest to check. |
proposals/gc/MVP.md
Outdated
* `deftype` is a new category of types that generalises the existing type definitions in the type section
- `deftype ::= <functype> | <structtype> | <arraytype>`
* `deftype` is the syntax for an entry in the type section, generalising the existing syntax
- `deftype ::= <subtype> | rec <subtype>*` |
Do I understand correctly that this would be simplified if we made singleton type definitions equivalent to rec groups of size 1? @rossberg, what do you think about making that change in this PR to simplify the semantics and hopefully make them more readable?
Done: preexisting "regular" type definitions are now reinterpreted as syntactic shorthands for singleton recursion groups.
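In other words (a sketch of the shorthand as I understand it), the following two definitions are interchangeable and canonicalise to the same type:

```wat
;; Plain form, now read as sugar for a recursion group of size 1:
(type $f1 (func (param i32) (result i32)))

;; Explicit singleton recursion group:
(rec (type $f2 (func (param i32) (result i32))))
```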
proposals/gc/MVP.md
Outdated
* `subtype` is a new category of type defining a single type, as a subtype of possibly other types
- `subtype ::= sub <typeidx>* <strtype>`
- the preexisting syntax with no `sub` clause is redefined to be a shorthand for a `sub` clause with empty `typeidx` list: `<strtype> == sub () <strtype>`
- Note: This allows multiple supertypes. For the MVP, it could be restricted to at most one supertype. |
Let's change the language from "could be restricted" to "is restricted."
Done.
proposals/gc/MVP.md
Outdated
Note 2: This semantics implies that type equivalence checks can be implemented in constant time by representing all types as trees in tied form and canonicalising them bottom-up in linear time upfront.
Note 3: It's worth noting that the only observable difference to a nominal type system is the equivalence rule on (non-recursive) type indices: instead of looking at their definitions, a nominal system would require `$t = $t'` syntactically (at least as long as we ignore things like checking imports, where type indices become meaningless). |
Other differences are the presence of recursion groups and the restrictions on which type indices can be used in the definition of a type. Maybe it would be more useful to note here that having a single large recursion group essentially gives you a nominal system?
The comment was referring to the difference in the rules. I reworded it to make that more explicit, and also added a sentence describing the practical consequence you mention.
proposals/gc/MVP.md
Outdated
- if `$t = <structtype>` or `$t = <arraytype>`
- or `$t = type ht` and `rt <: data` (imports)
- `(type $t) <: func`
- or `$t = type ht` and `ht <: data` (imports) |
Are these lines referring to type imports? I know they were already in here, but let's remove them since we don't plan to have type imports in the MVP.
Right, removed dependence on type imports.
LGTM!
proposals/gc/MVP.md
Outdated
Note: RTT values correspond to type descriptors or "shape" objects as they exist in various engines. RTT equality can be implemented as a single pointer test by memoising RTT values. More interestingly, runtime casts along the hierachy encoded in these values can be implemented in an engine efficiently by using well-known techniques such as including a vector of its (direct and indirect) super-RTTs in each RTT value (with itself as the last entry). The value `<n>` then denotes the length of this vector. A subtype check between two RTT values can be implemented as follows using such a representation. Assume RTT value v1 has static type `(rtt n1? $t1)` and v2 has type `(rtt n2? $t2)`. To check whether v1 denotes a sub-RTT of v2, first verify that `n1 >= n2` -- if both `n1` and `n2` are known statically, this can be performed at compile time; if either is not statically known, it has to be read from the respective RTT value dynamically, and `n1 >= n2` becomes a dynamic check. Then compare v2 to the n2-th entry in v1's supertype vector. If they are equal, v1 is a sub-RTT.
Note: RTT values correspond to type descriptors or "shape" objects as they exist in various engines. RTT equality can be implemented as a single pointer test by memoising RTT values. More interestingly, runtime casts along the hierachy encoded in these values can be implemented in an engine efficiently by using well-known techniques such as including a vector of its (direct and indirect) super-RTTs in each RTT value (with itself as the last entry). A subtype check between two RTT values can be implemented as follows using such a representation. Assume RTT value v1 has static type `(rtt $t1)` and v2 has type `(rtt $t2)`. Let `n1` and `n2` be the lenghts of the respective supertype vectors. To check whether v1 denotes a subtype RTT of v2, first verify that `n1 >= n2` -- if both `n1` and `n2` are known statically, this can be performed at compile time; if either is not statically known, it has to be read from the respective RTT value dynamically, and `n1 >= n2` becomes a dynamic check. Then compare v2 to the n2-th entry in v1's supertype vector. If they are equal, v1 is a subtype RTT. |
A couple spelling fixes and remove the mention of n1 and n2 not being statically known.
Note: RTT values correspond to type descriptors or "shape" objects as they exist in various engines. RTT equality can be implemented as a single pointer test by memoising RTT values. More interestingly, runtime casts along the hierachy encoded in these values can be implemented in an engine efficiently by using well-known techniques such as including a vector of its (direct and indirect) super-RTTs in each RTT value (with itself as the last entry). A subtype check between two RTT values can be implemented as follows using such a representation. Assume RTT value v1 has static type `(rtt $t1)` and v2 has type `(rtt $t2)`. Let `n1` and `n2` be the lenghts of the respective supertype vectors. To check whether v1 denotes a subtype RTT of v2, first verify that `n1 >= n2` -- if both `n1` and `n2` are known statically, this can be performed at compile time; if either is not statically known, it has to be read from the respective RTT value dynamically, and `n1 >= n2` becomes a dynamic check. Then compare v2 to the n2-th entry in v1's supertype vector. If they are equal, v1 is a subtype RTT.
Note: RTT values correspond to type descriptors or "shape" objects as they exist in various engines. RTT equality can be implemented as a single pointer test by memoising RTT values. More interestingly, runtime casts along the hierarchy encoded in these values can be implemented in an engine efficiently by using well-known techniques such as including a vector of its (direct and indirect) super-RTTs in each RTT value (with itself as the last entry). A subtype check between two RTT values can be implemented as follows using such a representation. Assume RTT value v1 has static type `(rtt $t1)` and v2 has type `(rtt $t2)`. Let `n1` and `n2` be the lengths of the respective supertype vectors. To check whether v1 denotes a subtype RTT of v2, first verify that `n1 >= n2` -- both `n1` and `n2` are known statically, so this can be performed at compile time. Then compare v2 to the n2-th entry in v1's supertype vector. If they are equal, v1 is a subtype RTT. |
Ah, but for casts the type `$t1` itself usually is not statically known (only a supertype of it), so neither is `n1`. I pushed a tweak that points that out.
This is a detailed definition of the iso-recursive type semantics with explicit subtyping that I proposed in the recent meeting. I tried to keep the type representation as "natural" as possible, e.g. avoiding explicit type variables or substitutions except as auxiliary notions in the definition of type equivalence. Please let me know if something is unclear (or wrong).
I intend to implement this in the interpreter next.