This repository was archived by the owner on Apr 25, 2025. It is now read-only.

Understanding Overhead #249

Closed
@RossTate

Description


I've been wondering where the performance overheads of the current proposal are coming from, especially since different language teams seem to be having very different experiences. There's been some discussion of opportunities to improve V8's implementation of some instructions, but here I'm digging into the expressiveness and high-level compilation model of the current proposal.

For the following, I'm using a benchmark that Java, C#, Node.js, and our ahead-of-time compiler all perform roughly the same on. I do so because this suggests to me that this performance is a good baseline for evaluating what can be done with ahead-of-time compilation, regardless of whether it is incremental or whole-program compilation. The benchmark is quick-sorting a large pseudorandom doubly-linked list of i64s using bidirectional iterators (i.e. no random access). (The original benchmark used generics, but I monomorphized everything to i64s to eliminate a potentially complicating factor.)

I then took this benchmark and modified it to incorporate some, though not nearly all, of the possible sources of overhead in the current system. In particular,

  1. I made every object have a v-table field, rather than the v-table being part of the object descriptor. These v-tables have typed fields that provide the (typed) implementations of the class's methods.
  2. V-tables are allocated in the heap rather than preallocated in the binary (which affects locality and cache hits).
  3. Rather than those method implementations being represented as code pointers, I made them into closures over the module instance, modeling the fact that each funcref is a closure over its module instance rather than a code pointer.
  4. I changed the benchmark to use the instance-object model currently being used to instantiate WebAssembly modules. In particular, an additional "instance" object is passed throughout the program. This instance is only used for storing the v-tables of the classes (which every allocation fetches from the instance).
  5. I made the method implementations first cast the "this" pointer to the class's type—in all cases this cast is necessary to access the fields of the pointer. (Our class casts are implemented just like rtt casts.)

In all cases, I made sure the encoding captures less overhead than what the current system incurs. For example, in the current system casts have to load the rtt to cast to from the instance, but the encoding does not capture that aspect of cast overhead. As another example, the original benchmark uses interface methods rather than class methods, but I did not encode the overhead incurred by searching through an interface-table.

Furthermore, our compiler runs a suite of LLVM optimizations that we have already verified eliminates redundant reads of read-only fields (which include the fields pointing to v-tables and the fields within v-tables that point to method implementations) as well as redundant loads of function pointers. And, since this is a very small program, many of the extra loads caused by the above sources of potential overhead are much more likely to hit the cache than they would be in larger programs.

All this is to say that I tried to make this a best-case estimate of overhead.

What I found was that altogether these sources of overhead caused a 43% slowdown in run-time performance. Unfortunately, I can't at present break down the individual contributions of each source of overhead except for one: if I remove casts of the "this" pointer, the overhead drops to 29%. So superfluous casts do have a notable impact on performance, but so do the other measured sources of overhead. Note that these were "exact" casts, i.e. the instance being cast always belonged to exactly the class it was being cast to, which hits the fast path in our casting algorithm. To give a sense of scale, making the doubly-linked list implementation generic, which in particular means having to pack/box ints into a uniform representation when put in the list, incurs 20% overhead.

Notably, @askeksa-google avoids v-tables entirely, instead using call_indirect through a funcref table. One advantage of funcref tables is that they can unbox the closure over the instance, and a single dispatch table has better locality than a collection of disjoint v-tables. This can reduce the number of chained loads involved in a (megamorphic) method dispatch and increase the number of cache hits, which might explain the discrepancy with J2CL's poor (megamorphic) method-dispatch performance. (But this is conjecture.) Also, @askeksa-google (and J2CL) uses whole-program compilation to eliminate casts of the "this" pointer that would otherwise be necessary even in incremental compilation.

If people are interested in eliminating these overheads, here are some suggestions (some short-term, some long-term):

  1. Make it possible to have v-table data be stored in the object descriptor rather than a separate field.
  2. Make it possible for (immutable) v-tables to be stored in the binary (or in some packed manner) rather than on the heap.
  3. Remove the funcref <: anyref subtyping so that funcref can be implemented as a wide value, eliminating a chained load. Or...
  4. Eliminate the instance-object model and move towards the instantiation/compilation model other VMs use. (No change to the spec, but huge change to the engine.)
  5. Enrich the type system in order to eliminate superfluous casts (likely not in the MVP though). In the meanwhile, find ways to make casts cheaper, such as the suggestion in Hierarchies #245 (comment).

(Meta comment: if Discussions were enabled, this would be a Discussion rather than an Issue.)

I'm happy to clarify what I did/observed, and I'd be very interested in hearing about other experiments investigating potential sources of overhead.
