-
Notifications
You must be signed in to change notification settings - Fork 74
Mini-MVP Design #73
Comments
Now I'll focus on what I'm proposing to remove from the current proposal.
Thoughts? There are many other differences, but these seem like the big conceptual ones. |
There are various problems with your proposal, but one reoccurring thing you are missing is that Wasm code is not limited to a single module, and that all the mechanisms need to work under an open world assumption where modules can freely interact and additional modules can be instantiated at any time.
At Wasm module boundaries, i.e., for linking, you cannot avoid a structural definition of type equivalence. Wasm implements strong modularity, where modules are described solely by their imports and exports. A module, by design, cannot reference (a type from) another module. Hence nominal types are simply insufficient there: to share a type across modules, a module has to import it, but since it wants to access it structurally, this import has to be transparent, i.e., described by its structure, which then has to be matched by the linking semantics. And since you need structural semantics anyway, adding nominal semantics as well is just an extra complication (and not a trivial one, once you get into the details). It would also break modular de/compositionality if type equivalence was defined differently inside vs outside modules. Then there is the whole producer-side problem of compiling structural types (including basic ones like tuples) into a language that requires making everything nominal. I honestly have no idea how you would practically make that work in a modular fashion. Nominal types very much force you into a monolithic or centralised architecture.
The Wasm type system can never be powerful enough to be able to derive all static knowledge that a compiler has about representations everywhere. Casts seem absolutely unavoidable as an escape hatch. E.g., all the examples in the Overview use some amount of casting. You can create an ever-more complicated type system that's avoiding one or the other case, but it is principally impossible to ever be able to support everything. And with downcasts comes the necessity to somehow represent type information at runtime. And in a low-level VM you want explicit control over any extra space or time cost this involves. A producer is totally free to use that in a way that resembles size-polymorphism, but then it will just push certain casts elsewhere. (Often it has to check casts that a native VM doesn't have to check, because the latter can rely on global knowledge that the Wasm engine does not have because it's producer-specific. That seems to be an important point that is often missed in these discussions.)
The engine has all the information required to do the same under the current proposal.
No, it can't. Indices are purely static and internal to a single module. But casts may involve comparing types from different modules. Any RTT mechanism thus requires a global type store.
We had this discussion before, but that is still incorrect. Uniform representation via pointer tagging is used by many language impls (the vast majority of dynamic or polymorphic languages), and there is no efficient substitute. Most prominently, JS engines themselves use this technique heavily. Consequently, all Web engines can trivially support it.
See previous discussions. It is not ideal on 64 bit, but also not worse. That's the price for portability. Either way, it's giving producers a choice. If they want 32 bit integers instead then they can simply define a single element immutable struct and let the engine unbox that on the architectures where it can.
AFAICS, it doesn't prevent any such technique.
That is still true in the presence of int31ref.
That would significantly increase the cost of uniform representations on 32 bit platforms, possibly prohibitively so, without giving producers any way to predict or control that cost. Predictable performance is an important goal for Wasm.
No, an engine cannot usually optimise under a closed world assumption involving only a single module, see above. |
It seems to me that this proposal could be seen as a group of "patches" to the current MVP proposal, and these patches could also be discussed individually, which might help make the discussion more tractable: (1) Require static types to list their supertype explicitly (a.k.a. switch from structural to nominal typing). (2) Restrict allowed supertypes to those that are defined earlier in the types section. (3) Allow types to have both struct fields and array fields at the same time. (4) Drop i31ref. This proposal also includes some things that I don't think are actually different from the current MVP: (5) IIUC, the "gc table" description can be seen as another way of looking at existing modules' "type sections". (6) IIUC, "ngcref" is the same as the current proposal's "optref". (7) IIUC, getting rid of RTTs as explicit objects would fall out quite naturally from giving static types nominal semantics, as the casts in the current MVP could then simply be expressed in terms of static types (and that's essentially #119 then). That said, I have trouble wrapping my head around what this might mean for various future extensions (when you type system experts use complicated terms like "<adjective> polymorphism" ;-) ), so I can't tell whether this would be a nice simplification, or a limitation we'll come to regret. Btw, is there supposed to be a semantic difference between My personal thoughts on the differences listed above: re (1): As Andreas points out, we'll continue to need structural checks on the module boundaries for multi-module apps; so simple nominal typing would only be applicable to intra-module type usages. We would have to be careful to specify what happens at the module boundary in such a way that it's consistent with what happens inside a module. For instance, I could imagine a model where (similar to exported/imported functions) types that are used on the boundary are also given a name, and the engine would perform structural checking on such types with identical names, just like it structurally checks the signatures of functions with matching names. re (2): I'm not sure why this is needed? Sounds like it's supposed to make validity checking more efficient/convenient? But if an engine did two passes: one to read the types, one to validate them, then I think it wouldn't care about order? Am I missing any considerations? re (3): In the current proposal, this is postponed to be a post-MVP feature; I think there's generally wide agreement that we'll want some form of this eventually, and the biggest question is how exactly, e.g.: re (4): I find the argument convincing that many languages would benefit from having a way to store unboxed integers in a pointer-typed slot. I also find the argument convincing that module producers should have control over this behavior, which has the unfortunate consequence that we have to spec a number of bits that all engines can support for unboxing; but on the bright side this doesn't prevent engines from additionally unboxing other values when they can. I also find the argument convincing that some producers would prefer not to pay the cost of supporting i31ref in any eqref when they have no use for it. In summary, I think #130 might be an interesting compromise here.
I wouldn't overestimate engines' appetite for such things. Sure, you could do such optimizations if you hand-crafted the optimal scheme for a small toy example. But my impression from real-world engines is that they are so complex beasts anyway that they strongly prefer simplicity/regularity wherever they can get away with it, so I would assume that any engine picks one internal object model, and then maps all features onto that. It certainly makes sense to grant engines some freedom for their internal design choices, but I wouldn't expect a whole lot of custom per-module specialization going on. (Maaaybe if we get bored in ten years we'll get interested in squeezing out another 0.5% of performance by doubling implementation complexity, but I wouldn't bet on it.) Generally speaking, I would suggest to treat (1) - (4) above as independent suggestions, and discuss them on separate threads (if there is anything else to discuss for any of them). Untangling concerns probably makes it easier to find agreement. |
Thanks @jakobkummerow for the thoughts! This was a (year-old) stab at suggesting a very minimal MVP for the sake of prompting discussion. (That discussion didn't happen, and we were advised to write up something more fleshed out to better facilitate one, which is part of the origin of the SOIL Initiative's proposal.) As you say, there are a number of orthogonal ideas being explored here, and per more recent advice I'm now trying to take the approach you suggest of considering each of these one by one (e.g. #119 per your item (1)). You brought up a bunch of valuable points and questions here though, so I'll follow up on those here so that there's suitable context.
Hah, yes, but it's subtle and not at all apparent in the notation.
Structural types and nominal types are each better suited for various purposes. Compatibility of module signatures is better suited to structural types. But those signatures declare heap types, which are better suited to nominal types. The fields of those heap types are then better described structurally. One way to think about it is that nominal types name/abstract the fixed points in recursive structures and explicitly declare the intended subtyping relationships between these fixed points. Structural checking can then proceed using these declared subtyping relationships, and there is a structural check that these subtyping relationships are sound. This is the "staging" we refer to here, and it is known to let you express more complex structures.
It's an easy way to ensure there are no cycles in the nominal subtyping hierarchy. In the current MVP, two distinct type indices can denote the same type. Restriction (2) ensures that is not possible, simplifying type-checking.
Agreed on this summary. For now, changing (3) to be just (Remaining points are heard. My lack of response to them is just me not having anything particularly useful to add 😄) |
Thanks, @RossTate , for the comments.
Yes, I'm aware that this issue is old; it was filed before I got involved with WasmGC :-)
Out of curiosity, what's the reason for this restriction? (I'm not opposed, just surprised; I would have guessed that adding additional repeated/array fields would be just as legal as adding singular/struct fields.) If I had to guess: you'd expect each array field's associated length field to occur somewhere in the struct-fields part of the object (as opposed to behind the previous array field, where a bounds check would have a harder=slower time finding it), so adding more of them would shift the offsets of the elements? That would make sense; and poses an interesting limitation to possible future "arbitrary object nesting" designs... But on second thought, that would mean that subtypes also can't add additional struct fields once an array field exists on the supertype. Would that be too much of a restriction?
Absolutely; and one way or the other we'll want/need some of both.
I think we might mean the same thing here. Let me try again to explain what I had in mind, along with the line of reasoning: Example, similar to the slides linked above: with the current MVP, we could have two modules: ;; module A This works because on import, $f's and $g's signatures are checked (structurally) for compatibility, because they both have the name "hash". The check recurses into structural comparisons of $ta and $tb. I could phrase this as: the name "hash" "identifies $f and $g as the same function". ;; module A The engine could use this hint in order to do exactly what it did before with structural typing: it recognizes that we want $ta and $tb to be compatible (nominally equivalent, even), so it structurally compares them to verify that. From then on, in its internal global type registry, it can treat $ta and $tb as the same nominal type. Does that make sense? Or is there a flaw in this reasoning?
Ah, yes, that makes sense.
Yes, agreed. What I called "(3)(a)" could later be generalized to be either "(3)(b)" or "(3)(c)", so it sounds appropriate for an MVP (or early post-MVP version, if we decide to limit ourselves to struct |
To clarify, under this idea, the "hash" exported from A would not be able to be imported as the "hash" in B if $ta and $tb had different publicnames because $ta and $tb would be considered different types? So the only difference between this and a pure-nominal approach is that under this idea there is no "single definition" requirement?
I'm having trouble reconciling this with the design of the EH proposal, which does require single definitions for exception tags. In principle it seems that tags and types should use the same cross-module usage mechanism. That the single-definition requirement hasn't been raised as a big issue on the EH proposal seems to weakly suggest that it is not so problematic after all. Where does this design goal come from? |
This premise is the key contention. It comes from high-level languages, but it forgets that modules of high-level languages all link with one common low-level module: the runtime (and often the standard library). As such, it's easy enough for this common module to be the "type authority" for at least the primitive types shared across the high-level modules comprising this program. They need to link with some common module anyways for other host-generated "magic number" identifiers like exception events, as @tlively points out. And the only way to avoid coordinating on RTTs is to use
Back when the GC proposal was first created, the expectation was that everyone would use structural types (statically and dynamically, i.e. casts were structural as well) and as such the GC proposal would be used for interlanguage data sharing. So modules could exchange pairs just by each using the structural pair type. But I pointed out (offline) both that structural casts would be very slow and that different languages represent the same high-level structural concepts differently and as such you could not expect Java pairs to be shared as OCaml pairs or vice versa. Eventually the casting mechanism was changed to a nominal approach to address the former problem and the Interface Types proposal was introduced to address the latter problem, but the core design philosophy was never revisited despite the significant change in goals.
As an example, when slide 7 was presented I pointed out that it was completely unrealistic. Most language runtimes do not represent surface-level arrays as just an
Nominal types, contrary to their name, have nothing to do with string names, and as such they have no effect on which exports are matched with which imports. Hope I managed to make some sense. There's an abstract mathematical argument for all this as well, but somehow I don't think that would help clarify anything 😅 |
The data count section has a count that must match the number of data segments. If the data count section isn't present, then `memory.init` and `data.drop` cannot be used. Fixes issue #73.
Since we've settled the MVP type section and moved to phase 2 and since a lot of these individual proposed changes have already been merged or are under discussion elsewhere, I'll close this issue. If there are additional components that could benefit from further discussion, please create new issues to discuss them individually. |
Here's the Stage 1 design I had in mind for the strategy in #72. I debated making a pull request describing it in more detail, but decided an issue would be better so that we're focusing on high-level aspects rather than syntactic details. This design expects typed function references to already be in place.
First, one specifies a gc-``table'' (in the general sense of the word) whose rows are of the form:
<int>
<gcidx>?
<fieldtype>*
<fieldtype>+?
This table is valid if, whenever a row has a specified parent, the parent's
gcidx
is smaller than the row's own index (which is its owngcidx
) and the appropriate subtyping/prefixing of the slots and array-slots hold. (A row can have array-slots even if its parent does not.) Thegcidx
of a row is not explicitly specified but instead corresponds to the row's position in the table.Second, the grammar of types is extended to include
ngcref <gcidx>
, which represents a possibly null reference to a gc-structure allocated by this module instance (wheregcidx
is an ancestor of the value's actual index), and which is only valid if there is a row in the table corresponding togcidx
. Subtyping is extended so thatngcref <gcidx1>
is a subtype ofngcref <gcidx2>
ifgcidx1
's parent isgcidx2
. The grammar of types is also extended to includenullgcref
, which is a subtype ofngcref <gcidx>
for allgcidx
. (To clarify, these new types are usable within the fieldtype's in the gc-table.)Third, instructions are extended to include the obvious getters and setters (and indexed getters and setters as well as array length in the case of array-slots), explicitly indexed allocation of
ngcref
s, the singletonnullref
value, null-checking, and equality comparison. There are also instructions for casting/testing whether a givenngcref <gcidx>
value in fact has a more precise runtime index (which must be a descendent of<gcidx>
).I think that's about it. Is it perfect? No. Is it easy to translate to something better when something better comes along? Yes. Does it impose any long-term backwards compatibility constraints besides syntactic support? Not really. Are most gc languages going to be able to compile to this and get reasonable/acceptable/tolerable performance? Running through the many candidates I know, it seems so.
The text was updated successfully, but these errors were encountered: