Understanding Overhead #249

Closed

RossTate opened this issue Oct 8, 2021 · 15 comments

Comments

@RossTate
Contributor

RossTate commented Oct 8, 2021

I've been wondering where the performance overheads of the current proposal are coming from, especially since different language teams seem to be having very different experiences. There's been some discussion of opportunities to improve V8's implementation of some instructions, but here I'm digging into the expressiveness and high-level compilation model of the current proposal.

For the following, I'm using a benchmark that Java, C#, Node.js, and our ahead-of-time compiler all perform roughly the same on. I do so because this suggests to me that this performance is a good baseline for evaluating what can be done with ahead-of-time compilation, regardless of whether it is incremental or whole-program compilation. The benchmark is quick-sorting a large pseudorandom doubly-linked list of i64s using bidirectional iterators (i.e. no random access). (The original benchmark used generics, but I monomorphized everything to i64s to eliminate a potentially complicating factor.)
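
For concreteness, here is a minimal C sketch of the shape of the data involved (all names invented; the original is written in our own source language, with the iterator operations dispatched through interface methods):

```c
#include <stdint.h>

/* A doubly-linked list of i64s, traversed via bidirectional iterators
   (no random access). */
typedef struct Node {
    struct Node *prev;
    struct Node *next;
    int64_t      value;
} Node;

typedef struct Iter {
    Node *pos;
} Iter;

/* In the original benchmark these are interface methods, not direct calls. */
static int64_t iter_current(const Iter *it) { return it->pos->value; }
static void    iter_move_next(Iter *it)     { it->pos = it->pos->next; }
static void    iter_move_prev(Iter *it)     { it->pos = it->pos->prev; }
```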

I then modified this benchmark to incorporate some, though not nearly all, of the possible sources of overhead in the current system. In particular (a code sketch follows the list):

  1. I made every object have a v-table field, rather than the v-table being part of the object descriptor. These v-tables have typed fields that provide the (typed) implementations of the class's methods.
  2. V-tables are allocated in the heap rather than preallocated in the binary (which affects locality and cache hits).
  3. Rather than those method implementations being represented as code pointers, I made them into closures over the module instance, modeling the fact that each funcref is a closure over its module instance rather than a code pointer.
  4. I changed the benchmark to use the instance-object model currently being used to instantiate WebAssembly modules. In particular, an additional "instance" object is passed throughout the program. This instance is only used for storing the v-tables of the classes (which every allocation fetches from the instance).
  5. I made the method implementations first cast the "this" pointer to the class's type; in all cases this cast is necessary in order to access the fields of the object. (Our class casts are implemented just like rtt casts.)

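To make the encoding concrete, here is a rough C model of how sources 1-5 above were represented (all names invented; the plain C cast stands in for a checked rtt-style cast):

```c
#include <stdint.h>
#include <stdlib.h>

typedef struct Instance Instance;

/* 3. Method implementations are closures over the module instance,
      not bare code pointers. */
typedef struct {
    int64_t (*code)(Instance *inst, void *self);
    Instance *inst;
} Closure;

/* 1. Every object carries an explicit v-table field with typed slots. */
typedef struct {
    Closure current;
    Closure move_next;
} NodeVTable;

typedef struct {
    NodeVTable *vtable;  /* 2. the v-table lives on the heap, not in the binary */
    int64_t     value;
} Node;

/* 4. An instance object is threaded through the whole program; every
      allocation fetches the class's v-table from it. */
struct Instance {
    NodeVTable *node_vtable;
};

static int64_t node_current(Instance *inst, void *self) {
    /* 5. Every method body first casts "this" to the class's type
          (the plain C cast stands in for a checked rtt-style cast). */
    Node *this_ = (Node *)self;
    return this_->value;
}

static Node *node_alloc(Instance *inst, int64_t v) {
    Node *n = malloc(sizeof *n);
    n->vtable = inst->node_vtable;  /* fetched from the instance */
    n->value  = v;
    return n;
}
```
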
In all cases, I made sure the encoding captures less overhead than what the current system incurs. For example, in the current system casts have to load the rtt to cast to from the instance, but the encoding does not capture that aspect of cast overhead. As another example, the original benchmark uses interface methods rather than class methods, but I did not encode the overhead incurred by searching through an interface-table.

Furthermore, our compiler runs a suite of LLVM optimizations that we've already checked takes care of eliminating redundant reads of readonly fields (which include the fields pointing to v-tables and the fields within v-tables pointing to method implementations) and eliminating redundant loads of function pointers behind the scenes. And, since this is a very small program, many of the extra loads caused by the above sources of potential overhead are much more likely to hit the cache than they would be in larger programs.

All this is to say that I tried to make this a best case estimation of overhead.

What I found was that altogether these sources of overhead caused a 43% slowdown in run-time performance. Unfortunately, I can't at present break down the individual contributions of each source of overhead except for one: if I remove casts of the "this" pointer, the overhead drops to 29%. So superfluous casts do have a notable impact on performance, but so do the other measured sources of overhead. Note that these were "exact" casts, i.e. the instance being cast always belonged to exactly the class it was being cast to, which hits the fast path in our casting algorithm. To give a sense of scale, making the doubly-linked list implementation generic, which in particular means having to pack/box ints into a uniform representation when put in the list, incurs 20% overhead.

Notably, @askeksa-google avoids v-tables entirely, instead using call_indirect through a funcref table. An advantage of funcref tables is that they can unbox the closure over the instance, and a single dispatch table has better locality than a bunch of disjoint v-tables. So this approach can reduce the number of chained loads involved in a (megamorphic) method dispatch, as well as increase the number of cache hits, which might explain the discrepancy with J2CL's poor (megamorphic) method-dispatch performance. (But this is conjecture.) Also, @askeksa-google (and J2CL) use whole-program compilation to eliminate casts of the "this" pointer that would otherwise be necessary even in incremental compilation.

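As a hedged illustration of the difference (invented names; a real call_indirect would also perform a signature check, omitted here), a single flat dispatch table indexed by class id and method index replaces the chain object → v-table → closure → code pointer:

```c
#include <stdint.h>

typedef struct Instance Instance;

typedef struct {
    uint32_t class_id;  /* determines the object's slice of the table */
    int64_t  value;
} Obj;

typedef int64_t (*Method)(Instance *inst, Obj *self);

#define NUM_METHODS 4

/* One flat, contiguous table for the whole module (better locality than
   many disjoint heap-allocated v-tables), with the instance passed
   explicitly rather than boxed into per-slot closures. */
extern Method dispatch_table[];

static int64_t call_method(Instance *inst, Obj *self, uint32_t method_index) {
    /* One load and one indirect call per dispatch. */
    return dispatch_table[self->class_id * NUM_METHODS + method_index](inst, self);
}
```
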
If people are interested in eliminating these overheads, here are some suggestions (some short-term, some long-term):

  1. Make it possible to have v-table data be stored in the object descriptor rather than a separate field.
  2. Make it possible for (immutable) v-tables to be stored in the binary (or in some packed manner) rather than on the heap.
  3. Remove the funcref <: anyref subtyping so that funcref can be implemented as a wide value, eliminating a chained load. Or...
  4. Eliminate the instance-object model and move towards the instantiation/compilation model other VMs use. (No change to the spec, but huge change to the engine.)
  5. Enrich the type system in order to eliminate superfluous casts (likely not in the MVP, though). In the meantime, find ways to make casts cheaper, such as the suggestion in Hierarchies #245 (comment); one such technique is sketched below.

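As one example of the kind of cheap cast alluded to in suggestion 5 (a sketch of a standard technique, not necessarily the exact scheme proposed in #245): give each type descriptor an array of its supertypes, so that a cast to a type at statically known depth is one bounds comparison plus one pointer comparison.

```c
#include <stdbool.h>
#include <stdint.h>

/* Each type descriptor records its depth in the (single-inheritance)
   hierarchy plus all of its ancestors, including itself. */
typedef struct TypeDesc {
    uint32_t               depth;
    const struct TypeDesc *supers[];  /* supers[d] = ancestor at depth d */
} TypeDesc;

/* A cast to a target type whose depth is statically known compiles to
   one comparison against the depth and one pointer equality check. */
static bool is_subtype(const TypeDesc *actual, const TypeDesc *target) {
    return actual->depth >= target->depth
        && actual->supers[target->depth] == target;
}
```
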
(Meta comment: if Discussions were enabled, this would be a Discussion rather than an Issue.)

I'm happy to clarify what I did/observed, and I'd be very interested in hearing about other experiments investigating potential sources of overhead.

@titzer
Contributor

titzer commented Oct 9, 2021

I think our best chance of making progress on understanding overheads is to have replicable experiments on actual wasm engines. In the above, there isn't enough context (no source code, program, or object code), so it's not possible to draw any conclusions about what those numbers mean.

If you can generate a .wasm file and run it on an engine like V8 or wasmtime, the experiment will be much more valuable, since it will be replicable and relevant. As it is, I don't see how any of the above data is actionable.

@RossTate
Contributor Author

@titzer This sort of experiment cannot be run on wasm engines because it measures overhead incurred simply by using wasm and wasm engines. The baseline here is what our native ahead-of-time incremental compiler generates for our language, which ends up being in line with other major industry compilers/runtimes. The 5 sources of overhead I measured are things that a WebAssembly generator and a WebAssembly engine (currently) have to do that these other systems do not. I am not sure how to measure these overheads in the manner you are suggesting without having to develop alternate designs for WebAssembly and alternate implementations of engines. Do you have suggestions? (Also, as I said, I'm happy to answer any questions you have about the experiment, but your dismissal seems to be aimed at the overall approach rather than the specifics.)

@titzer
Contributor

titzer commented Oct 10, 2021

There are just too many confounding variables to compare across systems in a meaningful way. We're at the level where a few machine instructions this way or that make a huge difference, and the vast accidental complexity of those other systems just muddies the waters.

The best way to measure Wasm overhead is in Wasm engines, on Wasm programs. E.g. if you are interested in measuring the overhead of a (necessary) cast, then modify an engine to add an unsafe option that turns casts into nops. That way you can change just one variable and get the most meaningful comparison.

To move forward we have to push for implementations (engines+tools) to get complete enough that experiments like this are done in Wasm, on Wasm. Everything else is, TBH, a distraction at best. At worst, wrong data is misleading and deceptive, sending us on wild goose chases and into poor choices.

@RossTate
Contributor Author

The problem with your demand is that, while it's easy to toggle a switch to enable or disable casts in an engine implementation, it's not so easy to implement a variation of an engine that does not use the instance-object compilation model, or one that supports additional instructions for building v-tables into object descriptors. Thus one needs preliminary evidence that such variations might provide performance benefits before actually implementing them. So, if we can't compare WebAssembly implementations to gather this preliminary evidence, we have to look to other systems. If a potential optimization or a potential overhead has a notable impact on the performance of another system, that gives reason to suspect (even if it's not guaranteed) that it will have a similar impact on WebAssembly performance, providing a form of preliminary evidence. In other words, I am conducting experiments to help us determine (and prioritize) which variations are and are not worth exploring in engine implementations or WebAssembly's design.

@dcodeIO

dcodeIO commented Oct 11, 2021

One experiment we did was to compile vtables to switches over direct calls. I believe you mentioned this on a recent slide as well. This has the advantage that an optimizer can inline the lookups at the call site to better take advantage of branch prediction and, when only constants are involved (after optimizations), eliminate the switch altogether. Of course, the lookup can be omitted when sufficient static type information is already present indicating a particular call. In general, I think it should be obvious that any amount of unnecessary casts, indirection, or the like in basic language features will add up.

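A minimal sketch of this strategy in C (invented names; the switch models what a wasm producer might emit as a br_table or cascaded branches):

```c
#include <stdint.h>
#include <stdlib.h>

typedef struct { uint32_t class_id; int64_t value; } Obj;

/* Direct, inlinable implementations for each class. */
static int64_t list_node_current(Obj *self)  { return self->value; }
static int64_t array_cell_current(Obj *self) { return self->value + 1; /* placeholder body */ }

/* The virtual call becomes a switch over direct calls. If the class id
   is a constant after optimization, the switch folds to a direct call
   (which can then be inlined); otherwise branch prediction does the work. */
static int64_t dispatch_current(Obj *self) {
    switch (self->class_id) {
    case 0:  return list_node_current(self);
    case 1:  return array_cell_current(self);
    default: abort();  /* a real module would trap */
    }
}
```
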
@RossTate
Contributor Author

Ah, yes, and the results of this experiment suggest that another advantage of that approach is that it avoids some of the overheads seemingly incurred by the instance-object compilation model. Since it sounds like you've already implemented the alternate generation strategy, do you have some numbers available to share? I'd be interested to hear how the two compared for y'all.

@jakobkummerow
Contributor

> modify an engine to add an unsafe option that turns casts into nops

FWIW, in V8 we recently added --experimental-wasm-assume-ref-cast-succeeds (makes ref.cast a nop except for static typing) and --experimental-wasm-skip-null-checks (skips null-checks in all instructions that do them implicitly, e.g. struct.get) for such experiments.

That said, I agree with both Ross' and Ben's points: there's value to generating/evaluating promising ideas for implementation tricks/techniques to try, and to really prove that such an idea is worthwhile it has to be prototyped in a real Wasm engine. Both of those steps typically take significant effort.

@RossTate
Contributor Author

Thanks, @jakobkummerow. I had intended this to be a preliminary experiment that might prompt further (increasingly expensive) experiments, the ultimate one being an actual engine implementation.

To that end, the introduction of speculation yesterday prompted me to run two more experiments.

For the first experiment, I took the benchmark and completely devirtualized it, and then I let LLVM perform its suite of optimizations. This resulted in a 20% speedup. Note that this benchmark makes very heavy use of interface methods in its hot code, all of which have very simple method implementations (e.g. Current() just returns the value of a field of the receiver). So avoiding all those interface-method lookups, inlining all their contents, and then performing optimizations on the results saves 20%, at the cost of making the program no longer extensible (or, from another perspective, at the risk of the speedup being lost if there are multiple implementations of lists that get sorted by the program).

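To illustrate what devirtualization buys here, a before/after C sketch (invented names):

```c
#include <stdint.h>

typedef struct Node { struct Node *next; int64_t value; } Node;

/* Before: iteration goes through interface-method lookups. */
typedef struct {
    int64_t (*current)(void *self);
    void    (*move_next)(void *self);
} IterVTable;

typedef struct { const IterVTable *vt; Node *pos; } Iter;

static int64_t sum_virtual(Iter *it) {
    int64_t s = 0;
    while (it->pos) {
        s += it->vt->current(it);  /* interface-method lookup + indirect call */
        it->vt->move_next(it);     /* ditto */
    }
    return s;
}

/* After devirtualization + inlining + LLVM optimizations: the lookups,
   the indirect calls, and the tiny method bodies all disappear. */
static int64_t sum_devirtualized(Iter *it) {
    int64_t s = 0;
    while (it->pos) {
        s += it->pos->value;       /* Current() inlined */
        it->pos = it->pos->next;   /* MoveNext() inlined */
    }
    return s;
}
```
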
For the second experiment, I also fully devirtualized and then LLVM-optimized the version of the benchmark in which I had encoded the 5 sources of overhead listed in the OP. This eliminates 3 of those 5 sources, because v-tables are no longer dereferenced (so it doesn't matter whether they're in the heap or not), the closures in those v-tables are no longer called, and the casts are all eliminated. This resulted in a 43% speedup. This is a lot more than before because there are a lot more sources of overhead to eliminate. At the same time, it only brings the performance of the program up to match that of the original benchmark. In other words, the remaining 2 sources of overhead still incur 20% overhead in the fully devirtualized program.

There's an analogous experiment that can be run using the existing GC-WebAssembly tools and engines: this benchmark in Java. I would be interested to learn how the performance of this benchmark compares when compiled without whole-program optimizations (so, in particular, using interface-method dispatch) versus when compiled with whole-program optimizations (which should fully devirtualize it). It would provide a sense of the scale of the overhead that we're trying to navigate. Is anyone well positioned to perform such a comparison?

@MaxGraey

MaxGraey commented Oct 14, 2021

Btw, what about providing unsafe versions of array.get/set without bounds checking and struct.get/set without null checking? The point is that the compiler could rely on range analysis and nullability analysis in most cases (certainly not always) to prove statically that an array access is never out of bounds or that a structure is never null. Although perhaps these optimizations are better left to VMs.

@rossberg
Member

@MaxGraey, it's not enough that the producers prove such properties statically. They need to convey those proofs to the engine, e.g. via the type system, such that the engine can verify them. Otherwise, the engine would no longer be safe.

FWIW, struct.get/set without null check can already be expressed by simply using operands with non-null type. That is trivial to spot for the engine, and validation proves that it's sound. Bounds checks on arrays are of course a whole different story...
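
A C sketch of the lowering this describes (invented names): the implicit null check disappears once validation guarantees the operand is non-null.

```c
#include <stdint.h>
#include <stdlib.h>

typedef struct { int64_t value; } S;

/* struct.get on a nullable reference: the engine must emit an implicit
   null check before the load. */
static int64_t get_nullable(S *ref /* (ref null $s) */) {
    if (ref == NULL)
        abort();  /* stands in for a wasm trap */
    return ref->value;
}

/* struct.get on a non-null reference: validation has already ruled out
   null, so the engine can omit the check entirely. */
static int64_t get_non_null(S *ref /* (ref $s) */) {
    return ref->value;
}
```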

@MaxGraey

MaxGraey commented Oct 14, 2021

> struct.get/set without null check can already be expressed by simply using operands with non-null type

That's great!

> Bounds checks on arrays are of course a whole different story...

It seems to me that VMs should still be able to optimize this. Most JS engines already have bounds-check elimination and/or range analysis for other purposes.

@manoskouk
Contributor

manoskouk commented Oct 14, 2021

@MaxGraey In V8, we plan to optimize bounds checks away in some cases. For example, in a loop of the form
`for (int i = 0; i < array.length; i++) { ... array[i] ... }`
the bounds check should be eliminated. At the beginning, this will all be based on load elimination and common subexpression elimination, though, and there might be more opportunities based on the structure of wasm arrays specifically.
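
A C sketch of what that elimination amounts to (invented names): once load elimination recognizes the re-loaded length as the same value as the loop bound, the per-element check is dominated by the loop condition and folds away.

```c
#include <stdint.h>
#include <stdlib.h>

typedef struct { uint32_t length; int64_t data[]; } Array;

/* Naive lowering: every array.get re-checks the bounds. */
static int64_t sum_checked(const Array *a) {
    int64_t s = 0;
    for (uint32_t i = 0; i < a->length; i++) {
        if (i >= a->length) abort();  /* implicit per-element bounds check */
        s += a->data[i];
    }
    return s;
}

/* After load elimination + CSE: the length is loaded once, the loop
   condition already guarantees i < length, and the check disappears. */
static int64_t sum_unchecked(const Array *a) {
    int64_t s = 0;
    uint32_t len = a->length;  /* hoisted, loaded once */
    for (uint32_t i = 0; i < len; i++)
        s += a->data[i];       /* no per-element check */
    return s;
}
```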

> FWIW, in V8 we recently added --experimental-wasm-assume-ref-cast-succeeds (makes ref.cast a nop except for static typing) and --experimental-wasm-skip-null-checks (skips null-checks in all instructions that do them implicitly, e.g. struct.get) for such experiments.

This was very revealing: we measured that, for the J2CL benchmarks, null checks constitute 4% of total running time, and ref.casts 17% (!!!). Note that these checks are no-ops in a non-trapping program. The issue is that many ref.casts are required due to type erasure, and it is not easy to get rid of them. There are definitely optimizations that we can try.

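To illustrate why erasure forces these casts, a small C sketch (invented names): an erased generic container stores its elements at their common supertype, so every typed read must cast back down before the element's fields can be used.

```c
#include <stdint.h>

/* Elements are stored as the common supertype (morally anyref,
   modeled here as void*). */
typedef struct { void *items[16]; } ErasedList;

typedef struct { int64_t value; } Box;

static int64_t read_value(ErasedList *list, int i) {
    Box *b = (Box *)list->items[i];  /* stands in for a ref.cast */
    return b->value;
}
```
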
We will repeat this experiment for arrays bounds checks at some point.

Edit: The bounds check optimization is still in the works.

@MaxGraey

MaxGraey commented Oct 14, 2021

@manoskouk Thanks for the info about the state of V8. I thought it already had fairly advanced bounds-check elimination; turns out not. Is that because of analysis-time constraints (to stay within reasonable JIT compilation time limits), or just because there has been no opportunity to devote attention and effort to this problem?

@jakobkummerow
Contributor

To be clear, if bounds check elimination happens at all, it is always up to the engine. (When we previously experimented with it in V8, we found that it's surprisingly costly to do and doesn't help all that much, probably because branch predictors make conditional jumps that are never taken quite cheap; and there is a huge risk that even tiny bugs in it will cause security catastrophes.)

There will definitely never be any Wasm instructions that allow modules to (accidentally or intentionally) read or write out-of-bounds of allocated objects. All accesses must either be statically provable to be safe (e.g.: skipping null-checks when the object's type is statically non-nullable), or must be checked at runtime.

@tlively
Member

tlively commented Nov 1, 2022

Closing this as non-actionable for the MVP, although PRs adding ideas for post-MVP features or investigations to the post-MVP doc would be very welcome.

@tlively tlively closed this as completed Nov 1, 2022