
Add method dispatch as a post-MVP feature #297

Merged 2 commits into main from post-mvp-method-dispatch on May 27, 2022
Conversation

tlively
Member

@tlively tlively commented May 12, 2022

And document the potential performance benefits we've measured.

## Method Dispatch

Right now OO-style method dispatch requires downcasting the receiver parameter from the top receiver type in the method's override group. As of May 2022, unsafely removing this receiver downcast improved performance by 3-4% across a suite of real-world j2wasm workloads; note that this figure is a lower bound on the potential benefit. Introducing a method dispatch mechanism into WebAssembly and its type system would allow these receiver casts to be removed.
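
For illustration, here is a minimal sketch of the pattern (using the pseudo-code style that appears later in this thread; the class names are hypothetical). Every override is compiled against the top receiver type of its override group and must cast the receiver down before touching its own fields:

class Super {
  method: function(Super) -> ();
};
class Sub1 : Super {
  method = function(Super s) {
    ref.cast_static<Sub1>(s);   // the receiver downcast this section proposes to eliminate
    // ... body accesses Sub1's fields through s ...
  }
  field1: i32;
}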
Contributor


Receiver is misspelled a few times.

Member Author


Oops, thanks for the catch!

@jakobkummerow
Contributor

I've thought about this some more, and now believe that a built-in mechanism for method dispatch could have a much bigger impact than the quoted "3-4%" gained by avoiding receiver downcasts, and that's because of inlining. If there were a way to statically infer virtual call targets based on the receiver's RTT/static type, that would allow much better engine-side optimizations. Unfortunately, the impact is very hard to estimate without actually building such a system.

Consider the following example, which is a simplified pseudo-code version of a pattern that occurs in the well-known "DeltaBlue" benchmark, and presumably in many other OO-style applications:

class Super {
  // Will actually be in the vtable, but for simplicity we may assume
  // that it's stored on the struct itself.
  method: function(Super) -> ();
};
class Sub1 : Super {
  method = function(Super s) {
    ref.cast_static<Sub1>(s);
    struct.get<Sub1, field1>(s);  // Obviously the value will be used for something.
  }
  field1: i32;
}
class Sub2 : Super {
  method = function(Super s) {
    ref.cast_static<Sub2>(s);
    // <Sub2::method body...>
  }
  field2: ...;
}

function Dispatcher(Super s) {
  // This function is called with both Sub1 and Sub2 arguments,
  // so this is a polymorphic call.
  m = struct.get<Super, method>(s);
  call_ref(m, s);
}

function Entry1() {
  s = struct.new<Sub1>;
  struct.set<Sub1, field1>(s, 42);
  Dispatcher(s);
}
function Entry2() {
  s = struct.new<Sub2>;
  // maybe more setup...
  Dispatcher(s);
}

Now, assume Entry1 gets optimized. We can throw heavyweight "engine magic" at this (and, in fact, that's what we're doing right now), but it only gets us so far.
The call to Dispatcher is direct, so it's easy to inline it. Dispatcher can collect feedback for its call_ref, and we can speculatively inline (even polymorphically) the functions observed there. We can also propagate the statically known types. So the generated code for Entry1 will be something like:

  s = struct.new<Sub1>;
  struct.set<Sub1, field1>(s, 42);
  m = struct.get<Super, method>(s);  // Actual vtable takes more than one load.
  if (m == Sub1::method) {
    // No cast thanks to propagated type information.
    42;  // No load, we remember the value from storing it before.
  } else if (m == Sub2::method) {
    ref.cast<Sub2>(s);  // Non-sensical, but never executed.
    // <Sub2::method body...>
  } else {
    // The compiler can't prove that this won't be needed.
    call_ref m;
  }

Doing all the work to inline Sub2::method is a waste of compilation time, but the compiler only knows that m is a function reference of matching type; it can't know that Sub2::method can't possibly occur in s's vtable.
Needing the else { call_ref m; } part is unfortunate because the existence of a call potentially disables many other optimizations, especially if this whole affair happens inside a loop. We have ideas for how we might be able to avoid this (think "deoptimization support"), but it'll be a lot of engineering effort and may well come at a performance cost in other places.
If there were a way to statically infer that Sub1::method is what will get called given that s has type Sub1 (exactly, i.e. it's not any random subtype), we could eliminate most of the generated code here based on the type information we can propagate into inlined functions.

For comparison: dynamic optimizations for JavaScript can, for now, generate faster code than Wasm for such patterns. Without Wasm spec additions, the only way for Wasm to catch up would be to build very similar dynamic engine tricks, which is widely seen as an anti-goal of Wasm.

There is a chance that Binaryen could do more AOT inlining for such cases (as long as it gets to assume whole-world knowledge); it remains to be seen how far we can drive that effort, and how much binary module size we're willing to pay for AOT inlining.

In summary, big +1 to mentioning method dispatch as a planned future feature, and maybe it's worth pointing out in the text that there's potentially quite a lot of performance impact from it (3% is a lower bound and not necessarily a very tight one).

@tlively
Member Author

tlively commented May 15, 2022

I added text pointing out that the 3-4% is a lower bound.

@kripken
Member

kripken commented May 16, 2022

> ref.cast<Sub2>(s);  // Non-sensical, but never executed.

> Needing the else { call_ref m; } part is unfortunate because the existence of a call potentially disables many other optimizations

IIUC, the first line I quoted will trap (since we know we have a Sub1, so casting to Sub2 will fail), and the trap lets us avoid anything else in that case, but then the remaining big problem is the call in the final else?

It seems that to really avoid an else with a call to the reference, we'd need to know all possible call targets, globally. If we had immutable tables, then the toolchain could emit a specific vtable containing all possible implementations of this method, and use call_indirect instead of call_ref. In the example above, the vtable would only contain Sub1::method and Sub2::method, so nothing else can be called. Would that be enough to optimize well in the VM?
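
A rough sketch of that idea, reusing the pseudo-code from the example above (the immutable-table syntax and the method_index field are hypothetical, not existing Wasm features):

// One immutable, closed table per override group, emitted by the toolchain:
//   table Super::method = [ Sub1::method, Sub2::method ]   // immutable
function Dispatcher(Super s) {
  idx = struct.get<Super, method_index>(s);   // small index instead of a funcref
  call_indirect<Super::method>(idx, s);       // closed target set: only two candidates
}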

@jakobkummerow
Contributor

> IIUC, the first line I quoted will trap (since we know we have a Sub1, so casting to Sub2 will fail), and the trap lets us avoid anything else in that case, but then the remaining big problem is the call in the final else?

It would trap if it were executed and s != null, yes. As long as s's static type is nullable, we can't just optimize it out.

> Would [that special vtable] be enough to optimize well in the VM?

Interesting idea! If we teach V8 to recognize that pattern, it might indeed help. Or Binaryen could just emit if/else-guarded direct calls directly...
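
For concreteness, the guarded-direct-call lowering could look roughly like this (same pseudo-code as above, assuming whole-world knowledge that Sub1 and Sub2 are the only overriders of Super::method):

m = struct.get<Super, method>(s);
if (m == Sub1::method) {
  call Sub1::method(s);   // direct call, trivially inlinable by the engine
} else {
  call Sub2::method(s);   // no residual call_ref: the target set is closed
}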

@tlively
Member Author

tlively commented May 27, 2022

I'm going to go ahead and merge this. Clearly this should be a much larger discussion, but mentioning the idea in the post-MVP doc does not commit us to anything.

@tlively tlively merged commit af0acce into main May 27, 2022
@rossberg rossberg deleted the post-mvp-method-dispatch branch May 27, 2022 10:05
@rossberg
Member

rossberg commented May 27, 2022

Yeah, I don't know. Personally, I would regard the addition of methods as a primitive concept to be a massive failure, since they are conceptually redundant and there should be nothing magic about methods at the machine level. Rather, the goal should be to enrich the type system such that the downcast isn't necessary in most cases.

@tlively
Member Author

tlively commented May 27, 2022

This proposal is precisely to enrich the type system so the downcast is no longer necessary, as you say.

@jakobkummerow
Contributor

> such that the downcast isn't necessary in most cases

Note that avoiding casts in non-inlined methods is just one benefit; in the inlining example I gave, the downcast is already optimized out. Adding a built-in method dispatch feature could (maybe!) unlock benefits that are much bigger than that.

> there should be nothing magic about methods at the machine level

I don't think anyone claimed that there was? What's "magic" is having a static connection between types and methods in optimizing compilers. The machine level indeed doesn't care.
We can try to hack around the lack of such a static connection by attempting to rely on whole-program knowledge and feedback-directed speculative optimizations and immutable globals and multi-step inference and specific unofficial conventions that module producers and engines adhere to and whatnot, and maybe that'll turn out to be good enough. Or we could explore built-in mechanisms that provide this connection with more clarity, simplicity, and reliability.

> I would regard the addition of methods as a primitive concept to be a massive failure

I have to say that I'm finding this tone disrespectful and, frankly, quite infuriating. I have described a problem, and a potential solution, and explained in detail how/why that could help. I was careful to describe this in humble terms ("I now believe", "could have impact", "very hard to estimate", "there is a chance that [alternatives work well]; it remains to be seen how far we can drive that") to emphasize that this is just an idea with potential that I think is worth exploring. Having all of this effort brushed away as "doing any of that would be a massive failure" is making me not want to engage any more.

More generally, I also believe that the general notion of dismissing ideas before their merits have been explored is misguided and unhelpful. You are well aware that creating a multi-language high-performance VM has been tried before, but hasn't satisfactorily been accomplished so far, so it's safe to say that it's a hard problem. In a perfect world, "there should be" a simple solution to it, but in reality, we face hard constraints and tradeoffs. Finding solutions for them requires creativity, experimentation, discussion, and often the acceptance of compromises. Nobody knows what we might have to do in order to achieve our goal, so nothing that's technically feasible should be categorically off the table: it may turn out to be the best compromise we can collectively come up with.

@conrad-watt
Contributor

conrad-watt commented May 27, 2022

FWIW, unless we get really deep into something like F-bounded polymorphism (likely to the point of novel research), I don't think we'll be able to eliminate casts on receivers purely through making general-purpose extensions to the type system (edit: although I'd love to be proven wrong, as I do agree that it's the cleaner design).

If we can find a general enough design, and once we've already given appropriate thought to any lower-hanging "performance fruit", I'm personally sympathetic to the idea of blessing method dispatch in the type system through some special "object with methods" type, as that seems to be what most real OO semantics (and even several research projects) do anyway.
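
To illustrate the kind of receiver typing under discussion, here is a rough sketch in the thread's pseudo-code of a self-type-style method signature (hypothetical syntax, not a worked-out proposal): each override's receiver arrives at its defining class's type, so the cast at method entry disappears.

class Super {
  method: function(self) -> ();   // 'self' denotes the receiver's own class
};
class Sub1 : Super {
  method = function(Sub1 s) {     // receiver is already Sub1: no ref.cast_static needed
    struct.get<Sub1, field1>(s);
  }
  field1: i32;
}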

@titzer
Contributor

titzer commented May 28, 2022

I think we should stay focused on posing problems, evaluating the priority of problems, proposing solutions to problems, trying to integrate those solutions into wasm's design philosophy, implementing prototype solutions, empirically evaluating those implementations, and then making additions to Wasm that bring benefits. Our discussions should have the backdrop that programming language implementations are our "customers" that come with requirements, and we negotiate the right abstractions to solve those requirements.

It's clear to me that a large, important class of languages has some method dispatch sequence on which Wasm, given current designs, imposes overhead, and it should be a high priority to reduce, maybe even eliminate, such overheads. A healthy relationship with our customers demands that we take these problems seriously, and a sign of such seriousness is actually considering features that might help but are not morally pure, rather than precluding them from the outset.

@rossberg
Member

rossberg commented Jun 1, 2022

@jakobkummerow, I apologise if my comment came across as disrespectful. I agree that a multi-language VM is a hard problem and has never succeeded before. And that's exactly why we should be highly skeptical of language-oriented or -specific solutions!

Wasm's unique selling point – and the only reason why it has better chances of achieving this goal than any previous contender – is that it is low-level, i.e., abstracting hardware concepts instead of abstracting specific language constructs.

So, above all, we must be careful not to risk losing this unique characteristic along the way, e.g., by running afoul of the fallacy of climbing the nearest local maximum and turning Wasm into an object-oriented VM. Long-term, that would be a recipe for failure with respect to Wasm's goals, with the end game being an additive design approach with more and more language-specific features and/or optimisations being bolted on. Additive approaches inevitably create either a monster, or a dinosaur that privileges a limited set of legacy uses from before the complexity budget ran out.

The risk of us falling into that trap has increased considerably lately, as it is tempting to see the addition of GC support as a free ticket for adding more high-level and language-specific mechanisms. I know it is hard to resist that temptation when something seems "simple" and would be an obvious win near-term.

My preferred guideline is that anything we're adding ought to be either sufficiently close to actual hardware, or where that isn't possible (like with questions of typing), at least "canonical" in some sufficiently broad sense.

@conrad-watt, I don't think it's too bad. The key thing we need is a self type, and solutions of varying complexity and expressiveness have been around for a long time; see e.g. the links I added to #303.

@jakobkummerow
Contributor

@rossberg, thanks for the thoughtful follow-up. I agree that we want to build neither a language specific VM, nor a monster, nor a dinosaur. That said, we are (for many use cases) competing with language-specific VMs, and I would be disappointed if the end result was that folks conclude: "sure, I could translate my $Language to Wasm, but I'm not interested in working on that because it would just be so much slower than on my existing $Language-VM".

It seems quite clear to me that some of Wasm's assumptions are changing over time (which we must of course balance with not losing sight of objectives that continue to be valuable). In the "good old days of i32.add", it was obvious how to map that to a hardware concept, and engine implementers could still allow themselves to think that it's OK to simply throw all functions into their optimizing compiler right away (and also that their "optimizing" compiler can be quite simple because all interesting optimizations can be done AOT by the producer). Reality has moved on from that in several ways; one of them being the insight that it's hard to compete with common OO-style virtual method dispatch when you can't use any of the tricks that existing OO-targeted VMs are employing.

In yesterday's meeting, @titzer's presentation made some good points, and I got the impression that it was generally received favorably. I can imagine a "customizable static fields on the RTT" approach becoming a pretty general (i.e. useful for many languages and situations) feature that among other things would support Java-style virtual method dispatch quite well. We should certainly take care to design it with broad usefulness in mind. (For example: while the virtual-dispatch use case only cares about storing methods there, a more general design would allow arbitrary values.)
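
As a very rough sketch of what such a feature might look like (hypothetical syntax; the actual design was still under discussion), per-class immutable values would hang off the RTT, so knowing a receiver's exact type also pins down the method:

// Hypothetical: immutable per-class data attached to the RTT.
rtt Sub1 : Super {
  static method = Sub1::method;   // fixed at type-definition time, never mutated
}
function Dispatcher(Super s) {
  m = rtt.get<method>(s);         // load from the receiver's RTT
  call_ref(m, s);                 // if s's exact type is known, m is statically known too
}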

@kripken
Member

kripken commented Jun 1, 2022

@rossberg

> Wasm's unique selling point – and the only reason why it has better chances of achieving this goal than any previous contender – is that it is low-level, i.e., abstracting hardware concepts instead of abstracting specific language constructs.

Maybe I am not understanding your point, but I think we have already been making significant compromises there, haven't we? In particular, wasm GC itself does not abstract any hardware concept. We could have focused on supporting GC languages in a lower-level manner - and had specific proposals for that, even - but instead we've made the choice to focus on GC types, which are higher-level and correspond more to language constructs.

That has tradeoffs, and I'm not saying we've made the wrong choice! (In particular, implementers preferred GC types, and it also allows for smaller binaries.) But we are far from abstracting hardware concepts at this point, I think, and wasm GC post-MVP ideas like generics move even further away from the hardware level, arguably. Such features are applicable to narrower and narrower sets of languages. Hardware concepts are not driving such features, but language needs + performance.

I agree that we want to implement low-level concepts as much as possible, and we need to be aware of our overall complexity budget, as you said. That still leaves a very large space of tradeoffs to explore, I think.

@rossberg
Member

rossberg commented Jun 2, 2022

@jakobkummerow, @kripken, yes, we have to adjust. But a lot of care is being put into staying as close as possible to the original goals. As I said on several occasions, the goal still ought to remain "as low-level as possible, but no lower". I don't think we will succeed otherwise.

In some cases, we have to raise the abstraction level somewhat to meet some vital constraints (like safety or portability). But carefully backing away some additional distance from the raw metal for some features should not be mistaken for marching forward in the opposite direction!

In other words, there is a spectrum, but in that spectrum, we have to approach design solutions from below (machine), not from above (language).

For example, GC types deliberately are not language-level types. They merely describe low-level memory layout, in a way that a GC can handle. They neither try to mirror every form of type present in high-level type systems, nor do they attempt to provide any form of guarantee beyond the VM's memory safety. They are the smallest step away from linear memory that we could take while (1) enabling safe GC and (2) avoiding the need for runtime type checks on every access.

Similar shifts had to occur for exceptions or stack switching. Some non-machine abstractions are already present in Wasm 1.0, e.g., functions themselves. But in all these cases, we kept the abstractions as low and machine-like as is possible without losing relevant properties.

That is an important goal that I don't think is sufficiently acknowledged in some of the design discussions. I sometimes observe a tendency to instead just want to adopt whatever mechanism the source language du jour has.

> it's hard to compete with common OO-style virtual method dispatch when you can't use any of the tricks that existing OO-targeted VMs are employing.

Using none of the tricks would certainly be bad. But likewise, it would be an odd expectation that Wasm-based implementations will eventually be able to use all native tricks. For many reasons, that will never be possible without destroying Wasm, and we have to manage expectations accordingly.

@kripken
Member

kripken commented Jun 2, 2022

@rossberg

Thanks for the extra detail, that helps me understand your point of view!

I think we agree on all the principles here. +1 to focusing on low-level solutions as much as possible, to designing "from the machine", and for not expecting wasm VMs to use every single native trick.

However, our views differ greatly on this:

> They [GC types] are the smallest step away from linear memory that we could take while (1) enabling safe GC and (2) avoiding the need for runtime type checks on every access.

That sentence presupposes we had to move away from linear memory. But that was not forced on us. The two goals you mention, enabling GC languages to run on wasm in a safe and fast manner, could have been met at a lower level, using new features on top of linear memory (if what I mean here isn't clear, please see the last paragraph down below). Instead, we chose a relatively higher-level approach, GC types.

Again, I'm not saying that's the wrong thing to do. But it goes very much against the goal of "as low-level as possible." That shows, I think, that we have other factors that are equally strong that influence us, including speed, code size, practical tradeoffs in implementations, etc.

All I am saying in all this is that we cannot rule out an idea like method dispatch just because it is somewhat less low-level than other ideas, or because it only helps one type of language. If we have another lower-level idea that is as safe/compact/fast/etc. as dispatch then of course we should prefer it! And I really like @titzer's ideas on that in the presentation earlier this week. But @rossberg you appeared in this thread to strongly oppose dispatch before we even explore the space. That is all I am arguing against here.

And I am doing that because the big picture is that Java-on-WasmGC is still not faster than Java-on-JS 😢 (and still many times slower than the JVM). To justify wasm GC we need to get a lot faster! I'm not sure if method dispatch can help or not, but we should keep an open mind.

--

More details on what I meant earlier by "enabling GC languages on top of linear memory": Today we have various languages that compile their GCs to wasm, like C#/Mono and Go. It is very possible that they will never be able to use wasm GC because of the specific behaviors and optimizations their VMs have (e.g. I have heard from Go people that interior pointers are, in their view, something that really requires the VM to be built around for full speed, as Go does; and C# has finalizers that we may never want in wasm).

But we should still support those languages as best we can, and we are, by adding features to wasm that (among other things) help such compiled VMs, like e.g. stack switching. We will also need stack scanning, JIT support in wasm itself, etc. Specific to my point earlier, we could also add a way to make cycles collectible even if they include both compiled-VM objects in linear memory as well as the host, and we've had several ideas for that.

Such ideas have tradeoffs, of course: while GC inside the compiled VM might be faster than using wasm GC (because of using language-specific tricks in the compiled VM), GC across the boundary might be slower, etc. Maybe we'll get back to these ideas eventually (depending on how many languages end up using wasm GC), but my point here is that we've made the choice to try the higher-level approach first, of GC types. And sometimes that makes sense to do.

@titzer
Contributor

titzer commented Jun 3, 2022

@kripken I think supporting both interior pointers and weak callbacks is worth addressing with tailored post-MVP features. Such features are preferable to accepting the downsides of putting support for linear-memory GC into Wasm, IMO.

I have given a lot of thought to the Wasm engine side of how to implement stack scanning for GC-in-linear-memory in an efficient way. I have come to the conclusion that non-local updates to locals (don't forget the operand stack) break a number of assumptions that Wasm engines already make and thus would significantly complicate optimizing tiers, necessitating complex retrofitting. Specifically, allowing a user program to scan all i32s in every frame and potentially update them undoes compiler reasoning about arithmetic, and impacts instruction selection, dead code elimination, and many other optimizations. It would require additional deoptimization metadata. Implemented naively, every GC would kick code back to the interpreter or baseline tier. It's also a kind of information leak to scan another module's stack frames; we'd need to police the use of stack scanning with some kind of capability mechanism that doesn't exist yet.

IMHO it's better to just use a shadow stack and do it all in user code--no stack scanning API required, no stack walking, even--and basically the same performance characteristics. I've implemented this. Today, Virgil compiles to linear memory with its GC and does its own management of a shadow stack. That's just a stopgap solution while waiting for this proposal to advance. It was also the most unpleasant programming experience I've had in recent memory :) Hooray for GC, we should make it work for all GC'd languages!
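
For readers unfamiliar with the technique, here is a minimal shadow-stack sketch in the same pseudo-code style as earlier in the thread (pointers are i32 offsets into linear memory; single-threaded; shadow_base/shadow_top are hypothetical names): the producer spills GC-visible locals to an explicit stack that the user-space collector scans, so no engine-side stack scanning or walking is needed.

global shadow_top: i32;              // top of the explicit root stack in linear memory

function f(obj: i32) {
  i32.store(shadow_top, obj);        // spill the GC-visible local as a root
  shadow_top = shadow_top + 4;
  // ... calls here may trigger GC; the collector scans [shadow_base, shadow_top) ...
  obj = i32.load(shadow_top - 4);    // reload: a moving GC may have updated the slot
  shadow_top = shadow_top - 4;       // pop on function exit
}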

@titzer
Contributor

titzer commented Jun 3, 2022

It's true that Wasm has succeeded to the extent it has because it is low-level, and generally, the lower-level a computational paradigm is, the more universal it is. We did a good job, but NANDscript would crush us in simplicity and generality :-)

I agree that we need to keep Wasm simple and low-level, but I also think that it needs to be a priority that we take the pragmatics of language demands seriously. I agree with Jakob that in the limit we are competing with languages' other implementation choices, and if Wasm has significant performance penalties (or bad ergonomics!) then language implementors will be discouraged.

I see Wasm as obsoleting parts of language implementations piece by piece. Step one (Wasm MVP), we eliminated all but one of the compiler backends for your language. No register allocator needed, no instruction selector or scheduler. You get a really good backend for pretty much all hardware. You also get a code format, loader, a rudimentary linking system, and a way to talk to the outside world (imports). Step two (Wasm GC), we eliminate the need to boil your objects and functions down to bytes, and the debugging nightmare that represents, and your language no longer needs to bring its own garbage collector, which is a complex beast. Step three (Wasm GC post-MVP), we expand the set of features for more languages to target Wasm GC, as well as strategically adding just the right amount of programmability so that slightly more sophisticated runtimes can be even more space- and time-efficient. This includes stuff like the static fields (or metaobjects), as well as a JIT capability.

I think step 3 is going to be exciting and fun, TBH!

@kripken
Member

kripken commented Jun 3, 2022

@titzer Good points! I agree that would be the best path, if we can make it fast enough, which I hope we can.

Local updates are a hard problem for gc-in-linear-memory, yeah. We would need some simplifying assumptions to make that work well, probably. I do think we can find such assumptions, though, if we need to - the key thing is that it's ok if cross-VM GCs are slower (in return for super-fast inner GCs). But again, for all the reasons you mention, hopefully we don't need to do that...

@rossberg
Member

rossberg commented Jun 7, 2022

@kripken, by "enabling safe GC" I was referring to GC built into the language and implemented by the engine, while you are considering the alternative of user-space GC. So I believe we don't necessarily disagree on that point either?

User-space GC can be done today, and I agree with Ben's comment that adding new features wouldn't really help it much and would likely just complicate matters overall. The inherent problem of multiple competing GCs and cross-heap collection with the host will always remain, and a solution to it would be a nightmarish technical challenge, as we know from browsers. In that regard, it's not an alternative to built-in GC, at least not a simpler one.

@titzer, I agree with pragmatics wrt user demands, but they have to be weighed very carefully. We must likewise ensure that we don't over-specialise and that the complexity and cost of implementing (or processing) Wasm does not get out of hand. We'd risk making building a Wasm engine so complicated that we end up with a browser-like monoculture again, which would be a failure mode.

As they say, the art in (language) design is not putting enough features in, but leaving enough features out. :)

(I was at ECOOP yesterday, which had an all-day workshop on Program Analysis for Wasm. Gives a different perspective on the advantages and potentials that Wasm currently has and that are easy to lose when we focus too narrowly.)

@kripken
Member

kripken commented Jun 8, 2022

@rossberg

Ok, if I understand you correctly, you think we can't do linear memory GC well enough, and you think linear memory GC is necessarily more complex. And so linear memory GC is not really viable as an alternative to wasm GC, so in your view we didn't pick wasm GC, but it was the only reasonable path?

Fair enough, though my intuition is the opposite of yours on both of those two points and the conclusion from them. However, I'm not sure it's worth debating them. The larger issue is that either approach, linear memory GC or wasm GC, needs to show it is fast enough (among the other requirements). I'd argue that linear memory GC is the safer route there, because it is low-level and the cost model is much more explicit. Specifically, for Java we could have ported GraalVM AOT (similar to how Mono ported its AOT) and I think there's a good chance it would have been fast enough. Simply because in linear memory it would have been able to use most of the tricks it is used to, and GraalVM AOT is comparable to the JVM in speed.

Whereas in wasm GC we work at a higher level and compilers to it have far less low-level control. That may be why wasm GC is still too slow - as I said, it's not even faster than Java-on-JS, which is not a high bar, and it is far slower than the JVM. Maybe just more work will get us fast enough, but maybe we'll need to add more performance features to wasm GC. Something like method dispatch may be one such feature, and so I think we should keep an open mind about it.

@rossberg
Member

rossberg commented Jun 8, 2022

@kripken, my primary point was that – under the premise that we want engine-level GC – the current design is as low-level as we can go.

Questioning the premise is a different discussion, and tbh, one that I don’t want to spend considerable time on at this point. It was already discussed extensively, and there never was consensus for taking a different route. Personally, I don’t think linear memory GC is comparable, both approaches are mostly complementary. To be honest, I only have a very vague idea what LM GC support would even mean beyond what’s already possible in 1.0. There have been suggestions about adding e.g. stack scanning support, but AFAICT, nobody has worked out anything close to detailed enough to properly assess it.

I find it rather counter-intuitive to assume that user-space GC could be more efficient than GC encapsulated in the engine, where the engine can go wild with optimising complex algorithms and representation choices for each hardware. Ultimately, the only inherent overhead implied by the current design direction is the need for occasional casts, which we can reduce over time. To be safe and portable, I’d estimate that support for e.g. stack scanning would induce at least as much overhead and runtime checking, while being more leaky and thus less aggressively optimisable by an engine (cf what @titzer said). And that’s even without considering the multiple-heaps problems.

@titzer
Contributor

titzer commented Jun 9, 2022

One point not often mentioned when discussing linear-memory GC is that the host GC actually imposes requirements on the user (linear memory) GC. For example, if the host GC is incremental (i.e. does small batches of marking work and uses a write barrier to maintain a system invariant), then it will still incur large pauses if the linear memory GC is stop-the-world. Similarly, if the host GC is concurrent and/or parallel, it will not enjoy the concomitant performance gains unless the userspace linear memory GC is also concurrent or parallel. Further, even if the GC is not concurrent, but instead collects multiple wasm threads at once, userspace code needs safepoints (and likely polling). Basically, advanced host GCs would be bottlenecked on (un-upgradeable) userspace GCs. The concurrency of the host GC is then also observable, unless we specify a very conservative concurrency model that bakes in a stop-the-world pause.
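
As a reminder of the kind of invariant an incremental collector maintains, here is a generic tri-color insertion-barrier sketch in the same pseudo-code style (not any particular engine's implementation; is_black/is_white/shade_grey are hypothetical helpers):

function write_field(obj: i32, offset: i32, value: i32) {
  i32.store(obj + offset, value);
  // Tri-color invariant: a black (fully scanned) object must never point at a
  // white (not yet visited) object, or incremental marking would miss it.
  if (is_black(obj) && is_white(value)) {
    shade_grey(value);               // re-queue the target for marking
  }
}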

@kripken
Member

kripken commented Jun 9, 2022

@rossberg

> my primary point was that – under the premise that we want engine-level GC – the current design is as low-level as we can go.

Understood, and I agree given that premise.

> I find it rather counter-intuitive to assume that user-space GC could be more efficient than GC encapsulated in the engine, where the engine can go wild with optimising complex algorithms and representation choices for each hardware.

Yes, the engine can do really cool things, as you said. But OTOH it must do so in a generic way, not one that is tailored to each language (e.g. Go does not use generational GC, and optimizes interior pointers very well) or to the application (e.g. the JVM lets developers adjust GC parameters, and some VMs let you control when GC even happens, both of which can make huge differences). A lower-level approach would let each language define its own GC performance tradeoffs, but wasm GC can't do that.

We can't be certain which approach will end up faster. But there are strong arguments both ways.

> Ultimately, the only inherent overhead implied by the current design direction is the need for occasional casts, which we can reduce over time.

I think the one-GC-fits-all perf downsides I just mentioned are also inherent in wasm GC. And aside from GC, different language VMs do all sorts of other low-level tricks. Casts are not our only challenge in wasm GC.

Adding more optional performance features to wasm GC, like method dispatch and @titzer 's ideas on object layout, may help address the shortcomings of wasm GC. I'm just suggesting we keep an open mind here.

@kripken
Member

kripken commented Jun 9, 2022

@titzer

> Basically, advanced host GCs would be bottlenecked on (un-upgradeable) userspace GCs.

Good points! I agree that the use case of deep integration between the compiled GC and the host GC is very difficult to handle, including for the reasons you just mentioned @titzer. I share your and @rossberg's skepticism about how well we can do such deep integration, and it may justify work on wasm GC. However, that is actually not the most common use case or request I have seen myself, fwiw. Here are the cases I think are more urgent:

  • In some cases linear memory GC has no connection to the host GC. That is the situation on the Web today, but it hasn't stopped various VMs from shipping like Mono/Blazor. It will also be the situation in the component model where a component happens to be implemented using GC. Another example is an embedded GC encapsulated in a ported application, like a game engine such as Unity. We can help these use cases with stack scanning, stack switching, etc.
  • In some cases the connection between linear memory GC and the host GC concerns few objects and does not need to be fast. For example, at least in some cases the compiled Python and Mono VMs are really all the user wants - they have no JS code to integrate with - but they do need to interact with Web APIs, so there is some minor amount of host GC interaction. In particular, things like long-lived event handlers - we don't want those cycles to leak forever, but also we don't need them to be cleaned up quickly or even efficiently. One idea that can help these use cases is some form of "snapshotting" mechanism that would allow such cycles to be collected eventually by the host (to some extent this can be done in userspace, and actually I was working on such a prototype before wasm GC started to pick up, after which I switched to optimizing wasm GC in Binaryen).

I do agree with @rossberg that it's not worth debating linear memory GC in depth atm. So apologies for that text. I'm just trying to get us to keep an open mind on all these topics. I really worry about us having too fixed an idea of how GC should work in wasm, as that could limit us, both about the big picture as in linear memory GC, and the small picture as in method dispatch.

@fgmccabe

fgmccabe commented Jun 9, 2022

There are some languages that will not be able to use any host provided GC. One example of this is Prolog: it cannot use 'normal' GCs and still be reasonably performant. I suspect that go-lang (already mentioned) also fits into this category.

@rossberg
Member

@kripken, @fgmccabe, no disagreement there. I believe it's always been acknowledged by the CG that a built-in GC mechanism can never hope to fit 100% of use cases, and will at best be a 90% thing. Another borderline example is lazy languages, which, like Prolog, typically perform certain extra optimizations like path compression at GC time.
