Add method dispatch as a post-MVP feature #297
Conversation
And document the potential performance benefits we've measured.
proposals/gc/Post-MVP.md
Outdated
## Method Dispatch

Right now OO-style method dispatch requires downcasting the reciever parameter from the top receiver type in the method's override group. As of May 2022, unsafely removing this reciever downcast improved performance by 3-4% across a suite of real-world j2wasm workloads. Introducing a method dispatch mechanism into WebAssembly and its type system would allow these reciever casts to be removed.
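To illustrate the pattern being described (this sketch is not part of the quoted Post-MVP.md text, and all class and method names are hypothetical): in the current lowering, every override in a group is stored in a vtable slot whose receiver parameter has the top type of the group, so the override body has to cast the receiver back down before it can touch subclass fields.

```java
// Hypothetical Java-flavoured sketch of the compiled form: the vtable slot is
// typed against the top receiver type (Base), so Sub's implementation must
// downcast the receiver before it can read subclass fields.
class Base {
  int baseField;
}

class Sub extends Base {
  int subField;
}

class SubMethods {
  // Shared slot signature: the receiver is typed as Base, not Sub.
  static int getSubField(Base receiver) {
    Sub self = (Sub) receiver;   // the receiver downcast the proposal wants to eliminate
    return self.subField;
  }
}
```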
Receiver is misspelled a few times.
Oops, thanks for the catch!
I've thought about this some more, and now believe that a built-in mechanism for method dispatch could have much bigger impact than the quoted "3-4%" gained by avoiding receiver downcasts, and that's because of inlining. If there was a way to statically infer virtual call targets based on the receiver's RTT/static-type, that would allow much better engine-side optimizations. Unfortunately, the impact is very hard to estimate without actually building such a system. Consider the following example, which is a simplified pseudo-code version of a pattern that occurs in the well-known "DeltaBlue" benchmark, and presumably in many other OO-style applications:
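The quoted pseudo-code is not reproduced here; as a stand-in, the following is a minimal Java sketch (hypothetical names, not the original DeltaBlue excerpt) of the kind of pattern meant: the receiver's concrete type is evident at the allocation site, but the call goes through a virtual slot that the engine cannot resolve statically.

```java
// Minimal stand-in sketch (not the original example): the concrete type of the
// receiver is known where it is allocated, but the virtual call in run() is
// compiled to a call through a funcref loaded from a vtable, so the engine
// cannot statically resolve (and hence inline) Sub1.method().
abstract class Base {
  abstract int method();
}

class Sub1 extends Base {
  int method() { return 1; }   // small, hot body: an ideal inlining candidate
}

class Sub2 extends Base {
  int method() { return 2; }
}

class Driver {
  static int run(Base b) {
    return b.method();         // virtual dispatch: target unknown to the engine
  }

  public static void main(String[] args) {
    Base b = new Sub1();       // concrete type is statically evident here
    System.out.println(run(b));
  }
}
```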
Now, assume
Doing all the work to inline For comparison: dynamic optimizations for JavaScript can generate faster code than Wasm for such patterns for now. Without Wasm spec additions, the only way for Wasm to catch up would be to build very similar dynamic engine tricks, which is widely seen as an anti-goal of Wasm. There is a chance that Binaryen could do more AOT inlining for such cases (as long as it gets to assume whole-world knowledge); it remains to be seen how far we can drive that effort, and how much binary module size we're willing to pay for AOT inlining. In summary, big +1 to mentioning method dispatch as a planned future feature, and maybe it's worth pointing out in the text that there's potentially quite a lot of performance impact from it (3% is a lower bound and not necessarily a very tight one). |
I added text pointing out that the 3-4% is a lower bound. |
IIUC, the first line I quoted will trap (since we know we have a Sub1, so casting to Sub2 will fail), and the trap lets us avoid anything else in that case, but then the remaining big problem is that call in the final else? It seems that to really avoid an else of a call to the reference we'd need to know all possible call targets, globally. If we had immutable tables then the toolchain could emit a specific vtable for all possible implementations of this method, and use |
It would trap if it were executed and
Interesting idea! If we teach V8 to recognize that pattern, it might indeed help. Or Binaryen could just emit if/else-guarded direct calls directly... |
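A sketch (assumed shape, not actual Binaryen output) of what such if/else-guarded direct calls could look like, shown at the source level and reusing the hypothetical Base/Sub1/Sub2 classes from the earlier sketch: under whole-program knowledge of the possible targets, the virtual call is replaced by type tests guarding inlined bodies, with the plain virtual call as a fallback.

```java
// Hypothetical devirtualized form of Driver.run under closed-world knowledge
// of the two overrides; a real toolchain would emit the equivalent guarded
// casts and direct calls in Wasm rather than Java.
static int runDevirtualized(Base b) {
  if (b instanceof Sub1) {
    return 1;                  // inlined body of Sub1.method()
  } else if (b instanceof Sub2) {
    return 2;                  // inlined body of Sub2.method()
  }
  return b.method();           // fallback virtual call for unknown subclasses
}
```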
I'm going to go ahead and merge this. Clearly this should be a much larger discussion, but mentioning the idea in the post-MVP doc does not commit us to anything. |
Yeah, I don't know. Personally, I would regard the addition of methods as a primitive concept to be a massive failure, since they are conceptually redundant and there should be nothing magic about methods at the machine level. Rather, the goal should be to enrich the type system such that the downcast isn't necessary in most cases.
This proposal is precisely to enrich the type system so the downcast is no longer necessary, as you say. |
Note that avoiding casts in non-inlined methods is just one benefit; in the inlining example I gave, the downcast is already optimized out. Adding a built-in method dispatch feature could (maybe!) unlock benefits that are much bigger than that.
> there should be nothing magic about methods at the machine level

I don't think anyone claimed that there was? What's "magic" is having a static connection between types and methods in optimizing compilers. The machine level indeed doesn't care.
I have to say that I'm finding this tone disrespectful and, frankly, quite infuriating. I have described a problem, and a potential solution, and explained in detail how/why that could help. I was careful to describe this in humble terms ("I now believe", "could have impact", "very hard to estimate", "there is a chance that [alternatives work well]; it remains to be seen how far we can drive that") to emphasize that this is just an idea with potential that I think is worth exploring. Having all of this effort brushed away as "doing any of that would be a massive failure" is making me not want to engage any more. More generally, I also believe that the general notion of dismissing ideas before their merits have been explored is misguided and unhelpful. You are well aware that creating a multi-language high-performance VM has been tried before, but hasn't satisfactorily been accomplished so far, so it's safe to say that it's a hard problem. In a perfect world, "there should be" a simple solution to it, but in reality, we face hard constraints and tradeoffs. Finding solutions for them requires creativity, experimentation, discussion, and often the acceptance of compromises. Nobody knows what we might have to do in order to achieve our goal, so nothing that's technically feasible should be categorically off the table: it may turn out to be the best compromise we can collectively come up with. |
FWIW, unless we get really deep into something like F-bounded polymorphism (likely to the point of novel research), I don't think we'll be able to eliminate casts on receivers purely through making general-purpose extensions to the type system (edit: although I'd love to be proven wrong, as I do agree that it's the cleaner design). If we can find a general enough design, and once we've already given appropriate thought to any lower-hanging "performance fruit", I'm personally sympathetic to the idea of blessing method dispatch in the type system through some special "object with methods" type, as that seems to be what most real OO semantics (and even several research projects) do anyway. |
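For readers unfamiliar with the term, a tiny Java illustration (hypothetical names) of F-bounded polymorphism, the usual source-level approximation of a "self type": the type parameter is bounded by an expression that mentions the parameter itself, which lets a method be typed against the receiver's own class without a downcast.

```java
// Hypothetical illustration of F-bounded polymorphism: T is bounded by
// Node<T>, so copy() can be typed as returning the receiver's own class.
abstract class Node<T extends Node<T>> {
  abstract T copy();           // no downcast needed at use sites
}

class Leaf extends Node<Leaf> {
  Leaf copy() { return new Leaf(); }
}
```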
I think we should stay focused on posing problems, evaluating the priority of problems, proposing solutions to problems, trying to integrate those solutions into wasm's design philosophy, implementing prototype solutions, empirically evaluating those implementations, and then making additions to Wasm that bring benefits. Our discussions should have the backdrop that programming language implementations are our "customers" that come with requirements, and we negotiate the right abstractions to solve those requirements. It's clear to me that a large, important class of languages have some method dispatch sequence that Wasm imposes overhead on given current designs, and it should be a high priority to reduce, maybe even eliminate, such overheads. A healthy relationship with our customers demands we take these problems seriously, and a sign of such seriousness is actually considering features that might help but are not morally pure, rather than precluding them from the outset. |
@jakobkummerow, I apologise if my comment came across as disrespectful. I agree that a multi-language VM is a hard problem and has never succeeded before. And that's exactly why we should be highly skeptical of language-oriented or -specific solutions! Wasm's unique selling point – and the only reason why it has better chances of achieving this goal than any previous contender – is that it is low-level, i.e., abstracting hardware concepts instead of abstracting specific language constructs. So, above all, we must be careful not to risk losing this unique characteristic along the way, e.g., by running afoul of the fallacy of climbing the nearest local maximum and turning Wasm into an object-oriented VM. Long-term that would be a recipe for failure wrt Wasm's goals, with the end game being an additive design approach with more and more language-specific features and/or optimisations being bolted on. Additive approaches inevitably create either a monster, or a dinosaur that privileges a limited set of legacy uses from before when the complexity budget ran out. The risk of us falling into that trap has increased considerably lately, as it is tempting to see the addition of GC support as a free ticket for adding more high-level and language-specific mechanisms. I know it is hard to resist that temptation when something seems "simple" and would be an obvious win near-term. My preferred guideline is that anything we're adding ought to be either sufficiently close to actual hardware, or where that isn't possible (like with questions of typing), at least "canonical" in some sufficiently broad sense.

@conrad-watt, I don't think it's too bad. The key thing we need is a self type, and solutions of varying complexity and expressiveness have been around for a long time, see e.g. the links I added to #303.
@rossberg, thanks for the thoughtful follow-up. I agree that we want to build neither a language-specific VM, nor a monster, nor a dinosaur. That said, we are (for many use cases) competing with language-specific VMs, and I would be disappointed if the end result was that folks conclude: "sure, I could translate my $Language to Wasm, but I'm not interested in working on that because it would just be so much slower than on my existing $Language-VM". It seems quite clear to me that some of Wasm's assumptions are changing over time (which we must of course balance with not losing sight of objectives that continue to be valuable). In the "good old days of

In yesterday's meeting, @titzer's presentation made some good points, and I got the impression that it was generally received favorably. I can imagine a "customizable static fields on the RTT" approach becoming a pretty general (i.e. useful for many languages and situations) feature that among other things would support Java-style virtual method dispatch quite well. We should certainly take care to design it with broad usefulness in mind. (For example: while the virtual-dispatch use case only cares about storing methods there, a more general design would allow arbitrary values.)
Maybe I am not understanding your point, but I think we have already been making significant compromises there, haven't we? In particular, wasm GC itself does not abstract any hardware concept. We could have focused on supporting GC languages in a lower-level manner - and had specific proposals for that, even - but instead we've made the choice to focus on GC types, which are higher-level and correspond more to language constructs. That has tradeoffs, and I'm not saying we've made the wrong choice! (In particular, implementers preferred GC types, and it also allows for smaller binaries.) But we are far from abstracting hardware concepts at this point, I think, and wasm GC post-MVP ideas like generics move even further away from the hardware level, arguably. Such features are applicable to narrower and narrower sets of languages. Hardware concepts are not driving such features, but language needs + performance. I agree that we want to implement low-level concepts as much as possible, and we need to be aware of our overall complexity budget, as you said. That still leaves a very large space of tradeoffs to explore, I think. |
@jakobkummerow, @kripken, yes, we have to adjust. But a lot of care is being put into staying as close as possible to the original goals. As I said on several occasions, the goal still ought to remain "as low-level as possible, but no lower". I don't think we will succeed otherwise. In some cases, we have to raise the abstraction level somewhat to meet some vital constraints (like safety or portability). But carefully backing away some additional distance from the raw metal for some features should not be mistaken for marching forward in the opposite direction! In other words, there is a spectrum, but in that spectrum, we have to approach design solutions from below (machine), not from above (language). For example, GC types deliberately are not language-level types. They merely describe low-level memory layout, in a way that a GC can handle. They neither try to mirror every form of type present in high-level type systems, nor do they attempt to provide any form of guarantee beyond the VM's memory safety. They are the smallest step away from linear memory that we could take while (1) enabling safe GC and (2) avoiding the need for runtime type checks on every access. Similar shifts had to occur for exceptions or stack switching. Some non-machine abstractions are already present in Wasm 1.0, e.g., functions themselves. But in all these cases, we kept the abstractions as low and machine-like as is possible without losing relevant properties. That is an important goal that I don't think is sufficiently acknowledged in some of the design discussions. I sometimes observe a tendency to instead just want to adopt whatever mechanism the source language du jour has.
Using none of the tricks would certainly be bad. But likewise it would be an odd expectation that Wasm-based implementations will eventually be able to use all native tricks. For many reasons, that will never be possible without destroying Wasm, and we have to manage expectations accordingly.
Thanks for the extra detail, that helps me understand your point of view! I think we agree on all the principles here. +1 to focusing on low-level solutions as much as possible, to designing "from the machine", and to not expecting wasm VMs to use every single native trick. However, our views differ greatly on this:

> They are the smallest step away from linear memory that we could take while (1) enabling safe GC and (2) avoiding the need for runtime type checks on every access.
That sentence presupposes we had to move away from linear memory. But that was not forced on us. The 2 goals you mention, of enabling GC languages to run on wasm in a safe and fast manner, could have been done at a lower level, using new features on top of linear memory (if what I mean here isn't clear, please see the last paragraph down below). Instead, we chose a relatively higher-level approach, GC types. Again, I'm not saying that's the wrong thing to do. But it goes very much against the goal of "as low-level as possible." That shows, I think, that we have other factors that are equally strong that influence us, including speed, code size, practical tradeoffs in implementations, etc. All I am saying in all this is that we cannot rule out an idea like method dispatch just because it is somewhat less low-level than other ideas, or because it only helps one type of language. If we have another lower-level idea that is as safe/compact/fast/etc. as dispatch then of course we should prefer it! And I really like @titzer's ideas on that in the presentation earlier this week. But @rossberg you appeared in this thread to strongly oppose dispatch before we even explore the space. That is all I am arguing against here. And I am doing that because the big picture is that Java-on-WasmGC is still not faster than Java-on-JS 😢 (and still many times slower than the JVM). To justify wasm GC we need to get a lot faster! I'm not sure if method dispatch can help or not, but we should keep an open mind. -- More details on what I meant earlier by "enabling GC languages on top of linear memory": Today we have various languages that compile their GCs to wasm, like C#/Mono and Go. It is very possible that they will never be able to use wasm GC because of the specific behaviors and optimizations their VMs have (e.g. I have heard from Go people that interior pointers are, in their view, something that really requires the VM to be built around for full speed, as Go does; and C# has finalizers that we may never want in wasm). But we should still support those languages as best we can, and we are, by adding features to wasm that (among other things) help such compiled VMs like e.g. stack switching. We will also need stack scanning, JIT support in wasm itself, etc. Specific to my point earlier, we could also add a way to make cycles collectible even if they include both compiled VM objects in linear memory as well as the host, and we've had several ideas for that. Such ideas have tradeoffs, of course: while GC inside the compiled VM might be faster than using wasm GC (because of using language-specific tricks in the compiled VM), GC across the boundary might be slower, etc. Maybe we'll get back to these ideas eventually (depending on how many languages end up using wasm GC), but my point here is that we've made the choice to try the higher-level approach first, of GC types. And sometimes that makes sense to do. |
@kripken I think supporting both interior pointers and weak callbacks is worth addressing with tailored post-MVP features. They are preferable to the downsides of putting support for linear-memory GC into Wasm, IMO.

I have given a lot of thought to the Wasm engine side of how to implement stack scanning for GC-in-linear-memory in an efficient way. I have come to the conclusion that non-local updates to locals (don't forget the operand stack) break a number of assumptions that Wasm engines already make and thus would significantly complicate optimizing tiers, necessitating complex retrofitting. Specifically, allowing a user program to scan all i32s in every frame and potentially update them undoes compiler reasoning about arithmetic, impacts instruction selection, dead code elimination, and many other optimizations. It would require additional deoptimization metadata. Implemented naively, every GC would kick code back to the interpreter or baseline tier. It's also a kind of information leak to scan another module's stack frames; we'd need to police the use of stack scanning with some kind of capability mechanism that doesn't exist yet.

IMHO it's better to just use a shadow stack and do it all in user code--no stack scanning API required, no stack walking, even--and basically the same performance characteristics. I've implemented this. Today, Virgil compiles to linear memory with its GC and does its own management of a shadow stack. That's just a stopgap solution while waiting for this proposal to advance. It was also the most unpleasant programming experience I've had in recent memory :) Hooray for GC, we should make it work for all GC'd languages!
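A minimal sketch (hypothetical, written in Java-like form rather than Virgil, with heap references modelled as plain int offsets into linear memory) of the shadow-stack technique mentioned above: each function spills its live heap references into a stack the program manages itself, so a user-space collector can enumerate roots without any engine-provided stack scanning or stack walking.

```java
// Hypothetical shadow-stack sketch: heap references are int offsets into
// linear memory, and live ones are spilled into an explicit roots array so
// the user-space GC can find them without scanning Wasm locals or the
// operand stack.
final class ShadowStack {
  static final int[] roots = new int[4096];  // spilled references (the root set)
  static int top = 0;

  static int push(int ref) { roots[top++] = ref; return ref; }
  static void popTo(int mark) { top = mark; }
}

class Mutator {
  // Stand-in allocator; a real one would bump-allocate in linear memory and
  // run a collection over ShadowStack.roots[0..top) when space runs out.
  static int allocate(int size) { return 0; }

  static int example(int objRef) {
    int mark = ShadowStack.top;
    ShadowStack.push(objRef);                    // keep the argument reachable
    int other = ShadowStack.push(allocate(16));  // allocation may trigger a GC
    int result = objRef + other;                 // some work using both refs
    ShadowStack.popTo(mark);                     // pop this frame's roots
    return result;
  }
}
```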
It's true that Wasm has succeeded to the extent it has because it is low-level, and generally the lower-level a computational paradigm is, the more universal. We did a good job, but NANDscript would crush us in simplicity and generality :-) I agree that we need to keep Wasm simple and low level, but I also think that it needs to be a priority that we take the pragmatics of language demands seriously. I agree with Jakob that in the limit we are competing with languages' other implementation choices, and if Wasm has significant performance penalties (or bad ergonomics!) then language implementors will be discouraged.

I see Wasm as obsoleting parts of language implementations piece by piece.

Step one (Wasm MVP), we eliminated all but one of the compiler backends for your language. No register allocator needed, no instruction selector or scheduler. You get a really good backend for pretty much all hardware. You also get a code format, loader, and a rudimentary linking system and a way to talk to the outside world (imports).

Step two (Wasm GC), we eliminate the need to boil your objects and functions down to bytes, and the debugging nightmare that represents, and your language no longer needs to bring its own garbage collector, which is a complex beast.

Step three (Wasm GC post-MVP), we expand the set of features for more languages to target Wasm GC, as well as strategically adding just the right amount of programmability so that slightly more sophisticated runtimes can be even more space- and time-efficient. This includes stuff like the static fields (or metaobjects), as well as a JIT capability. I think step 3 is going to be exciting and fun, TBH!
@titzer Good points! I agree that would be the best path, if we can make it fast enough, which I hope we can. Local updates are a hard problem for gc-in-linear-memory, yeah. We would need some simplifying assumptions to make that work well, probably. I do think we can find such assumptions, though, if we need to - the key thing is that it's ok if cross-VM GCs are slower (in return for super-fast inner GCs). But again, for all the reasons you mention, hopefully we don't need to do that... |
@kripken, by "enabling safe GC" I was referring to GC built into the language and implemented by the engine, while you are considering the alternative of user-space GC. So I believe we don't necessarily disagree on that point either? User-space GC can be done today, and I agree with Ben's comment that adding new features wouldn't really help it much and likely just complicate matters overall. The inherent problem of multiple competing GCs and cross-heap collection with the host will always remain, and a solution to it would be a nightmarish technical challenge, as we know from browsers. In that regard, it's not an alternative to built-in GC, at least not a simpler one.

@titzer, I agree with pragmatics wrt user demands, but they have to be weighed very carefully. We must likewise ensure that we don't over-specialise and that the complexity and cost of implementing (or processing) Wasm does not get out of bounds. We'd risk making building a Wasm engine so complicated that we end up with a browser-like monoculture again, which would be a failure mode. As they say, the art in (language) design is not putting enough features in, but leaving enough features out. :) (I was at ECOOP yesterday, which had an all-day workshop on Program Analysis for Wasm. Gives a different perspective on the advantages and potentials that Wasm currently has and that are easy to lose when we focus too narrowly.)
Ok, if I understand you correctly, you think we can't do linear memory GC well enough, and you think linear memory GC is necessarily more complex. And so linear memory GC is not really viable as an alternative to wasm GC, so in your view we didn't pick wasm GC, but it was the only reasonable path? Fair enough, though my intuition is the opposite of yours on both of those two points and the conclusion from them. However, I'm not sure it's worth debating them. The larger issue is that either approach, linear memory GC or wasm GC, needs to show it is fast enough (among the other requirements). I'd argue that linear memory GC is the safer route there, because it is low-level and the cost model is much more explicit. Specifically, for Java we could have ported GraalVM AOT (similar to how Mono ported its AOT) and I think there's a good chance it would have been fast enough. Simply because in linear memory it would have been able to use most of the tricks it is used to, and GraalVM AOT is comparable to the JVM in speed. Whereas in wasm GC we work at a higher level and compilers to it have far less low-level control. That may be why wasm GC is still too slow - as I said, it's not even faster than Java-on-JS, which is not a high bar, and it is far slower than the JVM. Maybe just more work will get us fast enough, but maybe we'll need to add more performance features to wasm GC. Something like method dispatch may be one such feature, and so I think we should keep an open mind about it. |
@kripken, my primary point was that – under the premise that we want engine-level GC – the current design is as low-level as we can go. Questioning the premise is a different discussion, and tbh, one that I don’t want to spend considerable time on at this point. It was already discussed extensively, and there never was consensus for taking a different route. Personally, I don’t think linear memory GC is comparable, both approaches are mostly complementary. To be honest, I only have a very vague idea what LM GC support would even mean beyond what’s already possible in 1.0. There have been suggestions about adding e.g. stack scanning support, but AFAICT, nobody has worked out anything close to detailed enough to properly assess it. I find it rather counter-intuitive to assume that user-space GC could be more efficient than GC encapsulated in the engine, where the engine can go wild with optimising complex algorithms and representation choices for each hardware. Ultimately, the only inherent overhead implied by the current design direction is the need for occasional casts, which we can reduce over time. To be safe and portable, I’d estimate that support for e.g. stack scanning would induce at least as much overhead and runtime checking, while being more leaky and thus less aggressively optimisable by an engine (cf what @titzer said). And that’s even without considering the multiple-heaps problems. |
One point not often mentioned when discussing linear-memory GC is that the host GC actually imposes requirements on the user (linear memory) GC. For example, if the host GC is incremental (i.e. does small batches of marking work and uses a write barrier to maintain a system invariant), then it will still incur large pauses if the linear memory GC is stop-the-world. Similarly, if the host GC is concurrent and/or parallel, it will not enjoy the concomitant performance gains unless the userspace linear memory GC is also concurrent or parallel. Further, even if the GC is not concurrent, but instead collects multiple wasm threads at once, userspace code needs safepoints (and likely polling). Basically, advanced host GCs would be bottlenecked on (un-upgradeable) userspace GCs. The concurrency of the host GC is then also observable, unless we specify a very conservative concurrency model that bakes in a stop-the-world pause. |
> my primary point was that – under the premise that we want engine-level GC – the current design is as low-level as we can go

Understood, and I agree given that premise.
> the engine can go wild with optimising complex algorithms and representation choices for each hardware

Yes, the engine can do really cool things, as you said. But OTOH it must do so in a generic way, not one that is tailored to each language (e.g. Go does not use generational GC, and optimizes interior pointers very well) or to the application (e.g. the JVM lets developers adjust GC parameters, and some VMs let you control when GC even happens, both of which can make huge differences). A lower-level approach would let each language define its own GC performance tradeoffs, but wasm GC can't do that. We can't be certain which approach will end up faster. But there are strong arguments both ways.
> Ultimately, the only inherent overhead implied by the current design direction is the need for occasional casts, which we can reduce over time.

I think the one-GC-fits-all perf downsides I just mentioned are also inherent in wasm GC. And aside from GC, different language VMs do all sorts of other low-level tricks. Casts are not our only challenge in wasm GC. Adding more optional performance features to wasm GC, like method dispatch and @titzer's ideas on object layout, may help address the shortcomings of wasm GC. I'm just suggesting we keep an open mind here.
Good points! I agree that the use case of deep integration between the compiled GC and the host GC is very difficult to handle, including for the reasons you just mentioned @titzer. I share your and @rossberg's skepticism about how well we can do such deep integration, and it may justify work on wasm GC. However, that is actually not the most common use case or request I have seen myself, fwiw. Here are the cases I think are more urgent:
I do agree with @rossberg that it's not worth debating linear memory GC in depth atm. So apologies for that text. I'm just trying to get us to keep an open mind on all these topics. I really worry about us having too fixed an idea of how GC should work in wasm, as that could limit us, both about the big picture as in linear memory GC, and the small picture as in method dispatch. |
There are some languages that will not be able to use any host provided GC. One example of this is Prolog: it cannot use 'normal' GCs and still be reasonably performant. I suspect that go-lang (already mentioned) also fits into this category. |
@kripken, @fgmccabe, no disagreement there. I believe it's always been acknowledged by the CG that a built-in GC mechanism can never hope to fit 100% of use cases, and at best be like a 90% thing. Another borderline example is lazy languages, which, like Prolog, typically perform certain extra optimizations like path compression at GC time.