Skip to content

wasm: improve malloc/free heap tracking #3162

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 2 commits into
base: dev
Choose a base branch
from
Draft

Conversation

aykevl
Copy link
Member

@aykevl aykevl commented Sep 15, 2022

Using a slice requires a lot less in code size than a map. This is visible when compiling a very small "hello world" style program.

Before tracking memory in malloc/free: 2873 bytes
With tracking using a map: 6551 bytes
With a slice instead of a map: 3532 bytes

Of course, most of this code size increase won't be visible with #3142, but it's still a saving of around 3kB in this minimal example.

@dgryski
Copy link
Member

dgryski commented Sep 15, 2022

Should we be concerned about the change from O(1) to O(n) time for tracking allocations?

@aykevl
Copy link
Member Author

aykevl commented Sep 15, 2022

Perhaps. But the GC itself is also rather inefficient. We could of course switch this algorithm to an algorithm with the same complexity as the one in the GC (which is probably only marginally better).

@dgryski
Copy link
Member

dgryski commented Sep 15, 2022

Fair enough. Let's leave the simple implementation for now and if it becomes an issue we can maybe make a build-tag to drop in a map-based implementation for people who care about it.

Copy link
Member

@dgryski dgryski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two questions, but otherwise LGTM.

allocs[i] = nil
return
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to complain if we can't find the pointer?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could, but it's undefined behavior to call free on an invalid pointer so we're not required to.
But of course we could do it when gcAsserts is true for example.

panic("free: invalid pointer")
}
removeAlloc(ptr)
free(ptr)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just want to confirm that with the //export free line, calling free() here still does the right thing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it calls the function free in gc_conservative.go. The //export line does not affect the namespace of the Go package.

@deadprogram
Copy link
Member

@aykevl
Copy link
Member Author

aykevl commented Sep 15, 2022

Yes, that looks like a real bug. I'm glad #3148 added some tests!

Copy link
Contributor

@anuraaga anuraaga left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to reference the -opt flag and pick between slice and map? We'll be executing a lot of logic compiled from C++ or Rust and its unclear what pattern the malloc / free would take. I understand the code size savings is important, but having the chance of pathological behavior makes it scary to run any code compiled with TinyGo doing polyglot. If the slice could be used with -opt=s or z, and map used with 2, that could allow both niches to be satisfied with opts that already reflect those semantics.

var allocs = make(map[uintptr][]byte)
func removeAlloc(ptr unsafe.Pointer) {
// Remove the pointer so it can be garbage collected.
for i, slot := range allocs {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider iterating in reverse order, complete heuristic but I think newer pointers tend to be freed sooner than older pointers in real code.

@codefromthecrypt
Copy link
Contributor

I think the tension here, correct me if I'm wrong, is basically you can't override this (a part of how GC works) with build tags until tinygo adds them. chicken egg. Then, the next tension is proving the performance is worse vs having the next impl prove it isn't worse.

Strings are routinely used in plugins. ex regular expression matches etc. It would help in the future balance size/perf in a less conjecturey way. Ex place a representative main.go here that exports regex functions, and benchmark using them. I think that would help balance some discussions in general. A bias towards size can exist with guards against severe performance regressions.

my 2p

@aykevl
Copy link
Member Author

aykevl commented Sep 29, 2022

Is it possible to reference the -opt flag and pick between slice and map?

It is possible, but wouldn't help much when it comes to algorithmic complexity. The heap allocator itself is also O(n) in the worst case, where n is the number of blocks (heap size divided by 16, usually). This is why I expect this PR won't have a lot of effect in practice.

Generally when it comes to performance, I want to act on data. In this case, I do have data on code size (because it is easy to measure) but I do not have data on performance. We can always change the implementation if benchmarks suggest this is a bottleneck.

@anuraaga
Copy link
Contributor

Generally when it comes to performance, I want to act on data. In this case, I do have data on code size (because it is easy to measure) but I do not have data on performance.

I do agree with the sentiment but feel that if there are clear pathological cases then those are also similarly clear.

But I didn't know allocation is similarly pathological - in that case indeed this does not need a change.

Is it fair to say TinyGo's wasi target shouldn't be used in cases where performance is a concern given the allocator's limitations?

@dgryski
Copy link
Member

dgryski commented Sep 29, 2022

If performance is a concern, maybe you shouldn't be using WASM...

To be fair, we have the gc disabled for some of our services: -gc=leaking is fine for short-lived processes if their initial heap is set correctly.

@anuraaga
Copy link
Contributor

If performance is a concern, maybe you shouldn't be using WASM...

WASM is the only way to extend Envoy for now so we're focusing on it for better or worse 😅 Unfortunately a leaking mode also isn't supported there, but maybe it should be.

For a bit of context, we're trying to bring Coraza WAF to Envoy via WASM - for performance reasons, we've swapped out several libraries from Go to C(++) or Rust

https://github.com/anuraaga/coraza-wasm-filter/tree/main/lib

The biggest was regexp itself with ~5x overall performance improvement from swapping in libre2. But other's like aho-corasick also provide 20% - the numbers are e2e not microbenchmark.

I think this means that while native will of course perform the best, wasm does have potential for reasonable performance. While I am curious what may be causing such drastic performance difference, if TinyGo itself is designed in a way that we expect lower performance than other compilers, it wouldn't be productive to do a deep dive. So I'm just trying to understand the vision there.

If small code size is the end all (as tiny would represent which makes a lot of sense!) then I would suggest that browser rather than wasi would be the target that fits better with the vision. But that's just a straw-man's proposal. Sorry if too much noise

@aykevl
Copy link
Member Author

aykevl commented Sep 29, 2022

Performance of TinyGo compiled code varies a lot. I've seen cases where TinyGo outperformed standard Go by a wide margin (for integer heavy code that doesn't allocate memory). For other cases, TinyGo will be a lot slower. In general, you can expect C like code to perform well but code that uses the heap, goroutines, etc to perform much worse.

In particular, for WebAssembly, TinyGo has to jump through hoops to get it supported. WebAssembly is just a really weird instruction set that is incredibly difficult for a Go-like language to target whereas other architectures are much simpler to support. This is getting improved slowly (with exception handling, stack switching, and a GC) but this will likely take years. Once these features are supported in WebAssembly and used by TinyGo, I expect TinyGo binaries will perform not very different from C and Rust (perhaps a bit slower but not by much). Until then, we have to keep our workarounds in place which inevitably means code will run slower.

That said, the GC could certainly be improved. It works, but it wasn't optimized for speed (for example, the heap is fully conservative). I believe @dgryski is investigating how to improve this.

@dgryski
Copy link
Member

dgryski commented Sep 29, 2022

The two performance improvements I see for the current garbage collector (without a large scale rewrite) are:

  1. increase the size of the mark work queue; an overflow requires an additional scan of the entire heap
  2. make note of which allocations do not contain any pointers and this don't require scanning

The first one is easy (although a bit tricky to get right so as not to affect performance on low-end embedded machines) and the second one requires a bit more work to to track the newly required information.

@anuraaga
Copy link
Contributor

Thanks for the context everyone - definitely would be nice if the GC spec goes somewhere especially for polyglot! Those GC optimizations, especially the second one seems quite compelling too.

@dgryski
Copy link
Member

dgryski commented Sep 30, 2022

TinyGo is likely to benefit from the stack switching proposal (for goroutines) but not the garbage collection one (because Go supports interior pointers which the proposal doesn't allow.)

@aykevl
Copy link
Member Author

aykevl commented Sep 30, 2022

but not the garbage collection one (because Go supports interior pointers which the proposal doesn't allow.)

It can probably still use the WebAssembly GC if all struct methods (etc) are made heap objects. For example, a struct like this:

type Point struct {
    X, Y int
}

could be compiled like this:

type Point struct {
    X, Y *int
}

This way, it is possible to have interior pointers at the cost of a large heap increase.
This will be even worse for things like byte slices (where it is possible to construct a pointer to each individual byte) but there may be workarounds for that too, like fat pointers. Still, it'll be rather difficult to get this working.

@aykevl
Copy link
Member Author

aykevl commented Sep 30, 2022

@anuraaga one way you could help here is by making a good (realistic) GC benchmark. Having a good benchmark would help a lot to improve the GC.

uintptr is not tracked by the GC, while any pointer type (including
unsafe.Pointer) is tracked. Make sure to only cast pointers to uintptr
when absolutely necessary.

This fixes a bug found in #3162.
Using a slice requires a lot less in code size than a map. This is
visible when compiling a very small "hello world" style program.

Before tracking memory in malloc/free:  2873 bytes
With tracking using a map:              6551 bytes
With a slice instead of a map:          3532 bytes

Of course, most of this code size increase won't be visible with
#3142, but it's still a saving
of around 3kB in this minimal example.
@aykevl
Copy link
Member Author

aykevl commented Sep 30, 2022

Updated the PR. The PR was failing because the tests actually contained a bug: they passed pointers around as uintptr. The uintptr type is an integer, not a pointer, so the GC wasn't tracking it and the memory was freed before all references were gone.
The first commit fixes the bug, the second commit is the original commit of this PR.

@dgryski
Copy link
Member

dgryski commented Sep 30, 2022

One of the gc-heavy benchmarks I've been playing with is the binarytrees program from the Benchmarks Game: https://benchmarksgame-team.pages.debian.net/benchmarksgame/program/binarytrees-go-2.html

@anuraaga
Copy link
Contributor

The uintptr type is an integer, not a pointer, so the GC wasn't tracking it and the memory was freed before all references were gone.

This was intentional - the tests are calling malloc and the pointer should be valid until free is called even treated as an integer. If this code was c++ it would not be tracked either, hence why the allocator needs to track.

I'm not sure why but I believe I had issues using unsafe.Pointer in trackAlloc, it had to be []byte to work.

@aykevl
Copy link
Member Author

aykevl commented Oct 1, 2022

This was intentional - the tests are calling malloc and the pointer should be valid until free is called even treated as an integer.

...you are entirely correct. Yes, the pointer should remain valid even if it is treated as uintptr.

I'm not sure why but I believe I had issues using unsafe.Pointer in trackAlloc, it had to be []byte to work.

That's interesting, sounds like a bug actually. I'll need to investigate this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants