Understanding Overhead #249
I think our best chance of making progress on understanding overheads is to have replicable experiments on actual wasm engines. In the above, there isn't enough context (no source code, program, or object code), so it's not possible to draw any conclusion about what those numbers mean. If you can generate a
@titzer This sort of experiment is not measurable on wasm engines because it is measuring overhead incurred simply by using wasm and wasm engines. The baseline here is what our native ahead-of-time incremental compiler generates for our language, which ends up being in line with other major industry compilers/runtimes. The 5 sources of overhead I measured are things that a WebAssembly generator and a WebAssembly engine (currently) have to do that these other systems do not. I am not sure how to measure these overheads in the manner you are suggesting without having to develop alternate designs for WebAssembly and alternate implementations of engines. Do you have suggestions? (Also, as I said, I'm happy to answer any questions you have about the experiment, but your dismissal seems to be due to the overall approach rather than the specifics.)
There are just too many confounding variables to compare across systems in a meaningful way. We're at the level where a few machine instructions this way or that make a huge difference, and the vast accidental complexity of those other systems just muddies the waters. The best way to measure Wasm overhead is in Wasm engines, on Wasm programs. E.g. if you are interested in measuring the overhead of a (necessary) cast, then modify an engine to add an unsafe option that turns casts into nops. That way you can change just one variable and get the most meaningful comparison. To move forward we have to push for implementations (engines+tools) to get complete enough that experiments like this are done in Wasm, on Wasm. Everything else is a distraction. TBH, a distraction at best; at worst, wrong data is misleading and deceptive, sending us on wild goose chases and leading to poor choices.
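The single-variable experiment suggested here can be sketched in C. Everything below is a hypothetical stand-in (the `rtt` struct, the `unsafe_no_casts` flag, the supertype walk) for whatever a real engine does internally; the point is only that toggling one flag isolates exactly the cast's cost.

```c
#include <stddef.h>

/* Hypothetical stand-in for an engine's runtime type descriptor. */
struct rtt { const struct rtt *super; };

/* A checked downcast as an engine might emit it. With the unsafe
   option enabled, the cast compiles to a nop, so benchmarking with
   and without the flag changes just this one variable. */
static const struct rtt *checked_cast(const struct rtt *actual,
                                      const struct rtt *target,
                                      int unsafe_no_casts) {
    if (unsafe_no_casts)
        return actual;                      /* cast is a nop */
    for (const struct rtt *r = actual; r != NULL; r = r->super)
        if (r == target)
            return actual;                  /* cast succeeds */
    return NULL;                            /* a real engine would trap */
}

/* Tiny demo: type b is a subtype of a; the cast from b to a
   succeeds whether or not checking is disabled. */
static int demo_cast_succeeds(int unsafe_no_casts) {
    struct rtt a = { NULL };
    struct rtt b = { &a };
    return checked_cast(&b, &a, unsafe_no_casts) != NULL;
}
```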
The problem with your demand is that, while it's easy to toggle a switch to enable or disable casts in an engine implementation, it's not so easy to implement a variation of an engine that does not use the instance-object compilation model, or that supports additional instructions for building v-tables into object descriptors. Thus one needs preliminary evidence that such variations might provide performance benefits before actually implementing them. So, if we can't compare WebAssembly implementations to gather this preliminary evidence, we have to look to other systems. If a potential optimization or a potential overhead has a notable impact on performance of another system, that gives reason to suspect—even if it's not guaranteed—it will also have a similar sort of impact on WebAssembly performance, providing a form of preliminary evidence. In other words, I am conducting experiments to help us determine (and prioritize) which variations are and are not worth exploring in engine implementations or WebAssembly's design.
One experiment we did was to compile v-tables to switches over direct calls. I believe you mentioned this on a recent slide as well. It has the advantage that an optimizer can inline the lookups at the call site to better take advantage of branch prediction and, when only constants are involved (after optimizations), eliminate the switch altogether. Of course, the lookup can be omitted when sufficient static type information is already present indicating a particular call. In general, I think it should be obvious that any amount of unnecessary casts, indirection, or the like in basic language features will add up.
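The switch-over-direct-calls idea can be illustrated with a small C sketch; the two method implementations and the class ids are invented for illustration, not taken from any real toolchain:

```c
/* Two concrete implementations of a one-method "interface". */
static int twice(int x)  { return 2 * x; }
static int square(int x) { return x * x; }

/* v-table style dispatch: an indirect call through a function
   pointer loaded from the object's v-table. */
typedef int (*method_t)(int);
static int call_via_vtable(method_t m, int x) { return m(x); }

/* switch style dispatch: a class id selects a direct call. An
   optimizer can inline the targets at the call site, and when the
   id is a compile-time constant it can drop the switch entirely. */
static int call_via_switch(int class_id, int x) {
    switch (class_id) {
    case 0:  return twice(x);
    case 1:  return square(x);
    default: return -1;
    }
}
```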
Ah, yes, and the results of this experiment suggest that another advantage of that approach is that it avoids some of the overheads seemingly incurred by the instance-object compilation model. Since it sounds like you've already implemented the alternate generation strategy, do you have some numbers available to share? I'd be interested to hear how the two compared for y'all.
FWIW, in V8 we recently added That said, I agree with both Ross' and Ben's points: there's value to generating/evaluating promising ideas for implementation tricks/techniques to try, and to really prove that such an idea is worthwhile it has to be prototyped in a real Wasm engine. Both of those steps typically take significant effort.
Thanks, @jakobkummerow. I had intended this to be a preliminary experiment that might prompt more (increasingly expensive) experiments, with the ultimate one being an actual engine implementation. To that end, the introduction of speculation yesterday prompted me to run two more experiments.

For the first experiment, I took the benchmark and completely devirtualized it, and then I let LLVM perform its suite of optimizations. This resulted in a 20% speedup. Note that this benchmark makes very heavy use of interface methods in its hot code, all of which have very simple method implementations (e.g.

For the second experiment, I also fully devirtualized and then LLVM-optimized the version of the benchmark where I encoded the 5 sources of overhead listed in the OP. This eliminates 3 of those 5 because v-tables are no longer dereferenced (so it doesn't matter whether they're in the heap or not), the closures in those v-tables are no longer called, and the casts are all eliminated. This resulted in a 43% speedup. This is a lot more than before because there are a lot more sources of overhead to eliminate. At the same time, it only brings the performance of the program up to match that of the original benchmark. In other words, the remaining 2 sources of overhead still incur 20% overhead in the fully devirtualized program.

There's an analogous experiment that can be run using the existing GC-WebAssembly tools and engines. This is the benchmark in Java. I would be interested to learn how the performance of this benchmark compares when compiled without whole-program optimizations (so, in particular, using interface-method dispatch) versus when compiled with whole-program optimizations (which should be fully devirtualized). It would provide a sense of the scale of the overhead that we're trying to navigate. Is anyone well positioned to perform such a comparison?
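The devirtualization experiment above can be made concrete with a hedged C sketch (the iterator shape and names are invented): the only difference between the two loops is whether the per-element call is indirect or direct, and the direct form is what opens the door to inlining and follow-on optimizations.

```c
/* A tiny iterator "interface": hot loops invoke next() per element. */
typedef struct { int value; } iter_t;
static int next_impl(iter_t *it) { return it->value++; }

/* Virtual version: every iteration pays an indirect call. */
static long sum_virtual(iter_t *it, int n, int (*next)(iter_t *)) {
    long s = 0;
    for (int i = 0; i < n; i++)
        s += next(it);
    return s;
}

/* Devirtualized version: the direct call can be inlined, after
   which the loop body is open to the usual optimizations. */
static long sum_direct(iter_t *it, int n) {
    long s = 0;
    for (int i = 0; i < n; i++)
        s += next_impl(it);
    return s;
}

/* Both forms compute the same result; only the call shape differs. */
static int demo_devirt_equal(int n) {
    iter_t a = { 0 }, b = { 0 };
    return sum_virtual(&a, n, next_impl) == sum_direct(&b, n);
}
```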
Btw what about providing unsafe versions of
@MaxGraey, it's not enough that the producers prove such properties statically. They need to convey those proofs to the engine, e.g. via the type system, such that the engine can verify them. Otherwise, the engine would no longer be safe. FWIW,
That's great!
It seems to me that VMs should still be able to optimize this. Most JS engines already have bounds-check elimination and/or range analysis for other purposes.
@MaxGraey In V8, we plan to optimize bounds checks away in some cases. For example in a loop of the form
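The kind of loop being described might look like the following C sketch (hypothetical; it only shows the shape of the transformation): when the induction variable is provably in `[0, len)`, the per-access check is redundant with the loop condition and can be dropped.

```c
/* Per-access checked read, as a straightforward compiler emits it. */
static int sum_checked(const int *a, int len) {
    int s = 0;
    for (int i = 0; i < len; i++) {
        if (i < 0 || i >= len)      /* bounds check on every access */
            return -1;              /* a real engine would trap */
        s += a[i];
    }
    return s;
}

/* With i provably in [0, len), the check folds into the loop
   condition and each access is a plain load. */
static int sum_unchecked(const int *a, int len) {
    int s = 0;
    for (int i = 0; i < len; i++)
        s += a[i];
    return s;
}

/* Both versions agree on in-bounds data. */
static int demo_sums_agree(void) {
    int a[5] = { 1, 2, 3, 4, 5 };
    return sum_checked(a, 5) == 15 && sum_unchecked(a, 5) == 15;
}
```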
This was very revealing, as we measured that, for the J2CL benchmarks, null checks constitute 4% of total running time, and

We will repeat this experiment for array bounds checks at some point.

Edit: The bounds check optimization is still in the works.
@manoskouk Thanks for the info about the state of V8. I thought its bounds-check elimination was already pretty advanced; turns out not. Is that because of analysis-time constraints (to stay within reasonable JIT compilation time limits), or just because there was no opportunity to devote attention and effort to this problem?
To be clear, if bounds check elimination happens at all, it is always up to the engine. (When we previously experimented with it in V8, we found that it's surprisingly costly to do and doesn't help all that much, probably because branch predictors make conditional jumps that are never taken quite cheap; and there is a huge risk that even tiny bugs in it will cause security catastrophes.) There will definitely never be any Wasm instructions that allow modules to (accidentally or intentionally) read or write out-of-bounds of allocated objects. All accesses must either be statically provable to be safe (e.g.: skipping null-checks when the object's type is statically non-nullable), or must be checked at runtime. |
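The "statically provable vs. checked at runtime" distinction can be sketched in C (the struct and function names are made up; a real engine works on typed references, not raw pointers): a nullable reference needs a check before every load, while a reference whose type is statically non-nullable needs none.

```c
#include <stddef.h>

struct obj { int field; };

/* Nullable reference: the engine must emit a runtime null check
   before the load. */
static int load_nullable(const struct obj *p, int *out) {
    if (p == NULL)
        return 0;                   /* a real engine would trap */
    *out = p->field;
    return 1;
}

/* Non-nullable reference: validation already proved p != NULL,
   so the load carries no check at all. */
static int load_nonnull(const struct obj *p) {
    return p->field;
}

static int demo_null_checks(void) {
    struct obj o = { 42 };
    int v = 0;
    return load_nullable(&o, &v) && v == 42
        && load_nonnull(&o) == 42
        && !load_nullable(NULL, &v);
}
```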
Closing this as non-actionable for the MVP, although PRs adding ideas for post-MVP features or investigations to the post-MVP doc would be very welcome. |
I've been wondering where the performance overheads of the current proposal are coming from, especially since different language teams seem to be having very different experiences. There's been some discussion of opportunities to improve V8's implementation of some instructions, but here I'm digging into the expressiveness and high-level compilation model of the current proposal.
For the following, I'm using a benchmark that Java, C#, Node.js, and our ahead-of-time compiler all perform roughly the same on. I do so because this suggests to me that this performance is a good baseline for evaluating what can be done with ahead-of-time compilation, regardless of whether it is incremental or whole-program compilation. The benchmark is quick-sorting a large pseudorandom doubly-linked list of `i64`s using bidirectional iterators (i.e. no random access). (The original benchmark used generics, but I monomorphized everything to `i64`s to eliminate a potentially complicating factor.)

I then took this benchmark and modified it to incorporate some, though not nearly all, of the possible sources of overhead in the current system. In particular, a `funcref` is a closure over its module instance rather than a code pointer.

In all cases, I made sure the encoding captures less overhead than what the current system incurs. For example, in the current system casts have to load the rtt to cast to from the instance, but the encoding does not capture that aspect of cast overhead. As another example, the original benchmark uses interface methods rather than class methods, but I did not encode the overhead incurred by searching through an interface table.
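The funcref-as-closure overhead can be sketched in C (the struct layout and names are illustrative, not what any engine actually uses): each call site must load both a code pointer and the callee's module instance, instead of a bare code pointer.

```c
/* A bare code pointer would be a single word; a funcref that closes
   over its module instance is a two-word pair, costing an extra
   load (and worse locality) at every indirect call site. */
typedef int (*code_t)(void *instance, int arg);

struct funcref {
    code_t code;        /* the compiled function */
    void  *instance;    /* the callee's module instance */
};

static int call_funcref(const struct funcref *f, int arg) {
    return f->code(f->instance, arg);   /* two loads, then the call */
}

/* Sample callee that reads state out of its instance. */
static int add_base(void *instance, int arg) {
    return *(int *)instance + arg;
}

static int demo_funcref_call(void) {
    int base = 100;
    struct funcref f = { add_base, &base };
    return call_funcref(&f, 5) == 105;
}
```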
Furthermore, our compiler goes through a suite of LLVM optimizations that we've already checked takes care of eliminating redundant reads of `readonly` fields (which include the fields pointing to v-tables and the fields within v-tables pointing to method implementations) and eliminating redundant loads of function pointers behind the scenes. And, since this is a very small program, many of the extra loads caused by the above sources of potential overhead are much more likely to hit the cache than they would in larger programs.

All this is to say that I tried to make this a best-case estimation of overhead.
What I found was that altogether these sources of overhead caused a 43% slowdown in run-time performance. Unfortunately, I can't at present break down the individual contributions of each source of overhead except for one: if I remove casts of the "this" pointer, the overhead drops to 29%. So superfluous casts do have a notable impact on performance, but so do the other measured sources of overhead. Note that these were "exact" casts, i.e. the instance being cast always belonged to exactly the class it was being cast to, which hits the fast path in our casting algorithm. To give a sense of scale, making the doubly-linked list implementation generic, which in particular means having to pack/box ints into a uniform representation when put in the list, incurs 20% overhead.
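The exact-cast fast path described above might look roughly like this in C (descriptor layout invented for illustration): when the object's runtime type equals the target exactly, the check is one load plus one pointer compare, and only a mismatch falls back to walking the supertype chain.

```c
#include <stddef.h>

struct rtt2    { const struct rtt2 *super; };
struct object2 { const struct rtt2 *rtt; };

static int cast_check(const struct object2 *o, const struct rtt2 *target) {
    if (o->rtt == target)                      /* exact match: fast path */
        return 1;
    for (const struct rtt2 *r = o->rtt->super; r != NULL; r = r->super)
        if (r == target)                       /* proper subtype: slow path */
            return 1;
    return 0;
}

static int demo_exact_cast(void) {
    struct rtt2 a = { NULL };          /* class A */
    struct rtt2 b = { &a };            /* class B <: A */
    struct rtt2 c = { NULL };          /* unrelated class C */
    struct object2 o = { &b };         /* o is exactly a B */
    return cast_check(&o, &b)          /* exact, fast path */
        && cast_check(&o, &a)          /* supertype, slow path */
        && !cast_check(&o, &c);        /* unrelated type fails */
}
```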
Notably, @askeksa-google avoids v-tables entirely, instead using `call_indirect` through a `funcref` table. And an advantage of `funcref` tables is that they can unbox the closure over the instance, and a single dispatch table has better locality than a bunch of disjointed v-tables. So this can reduce the number of chained loads involved in a (megamorphic) method dispatch, as well as increase the number of cache hits, which might explain the discrepancy with J2CL's poor performance of (megamorphic) method dispatch. (But this is conjecture.) Also, @askeksa-google (and J2CL) use whole-program compilation to eliminate casts of the "this" pointer that would otherwise be necessary even in incremental compilation.

If people are interested in eliminating these overheads, here are some suggestions (some short-term, some long-term):
`funcref <: anyref` subtyping so that `funcref` can be implemented as a wide value, eliminating a chained load. Or...

(Meta comment: if Discussions were enabled, this would be a Discussion rather than an Issue.)
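The flat-table dispatch strategy attributed to @askeksa-google above might be sketched in C as follows (table shape, class ids, and method slots are all invented): one contiguous table for the whole program, indexed by class id and method slot, replaces the per-object v-table pointer chase and improves locality.

```c
typedef int (*slot_t)(int);

static int list_next(int x) { return x + 1; }
static int list_prev(int x) { return x - 1; }

/* One contiguous dispatch table for the whole program, instead of
   disjoint per-class v-tables scattered around the heap. */
#define NUM_SLOTS 2
static slot_t dispatch_table[][NUM_SLOTS] = {
    { list_next, list_prev },   /* class 0: the list iterator */
};

static int call_method(int class_id, int slot, int arg) {
    return dispatch_table[class_id][slot](arg);
}
```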
I'm happy to clarify what I did/observed, and I'd be very interested in hearing about other experiments investigating potential sources of overhead.