Skip to content

Porting Boehm to Emscripten #18251

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
juj opened this issue Nov 23, 2022 · 15 comments
Open

Porting Boehm to Emscripten #18251

juj opened this issue Nov 23, 2022 · 15 comments

Comments

@juj
Copy link
Collaborator

juj commented Nov 23, 2022

I have a fork of Boehm cooking at ivmai/bdwgc@master...juj:bdwgc:emscripten , where I am looking to make the latest Boehm upstream work with Emscripten. (it looks like its Emscripten support has regressed by quite a bit since Unity has last synced with upstream Boehm many many years ago)

My idea is to experiment with a couple of different strategies for Boehm + Emscripten support:

  1. "the way it was intended to work": using Binaryen's --spill-pointers pass that @kripken restored to work this summer, to automatically spill all pointers from wasm locals to the stack, so that Boehm will be able to scan the stack,
  2. "the way that Unity currently uses Boehm": i.e. enable only "asynchronous GC collection", where GC collection only occurs when no wasm code (with managed local functions) is on the stack, for example via a delayed "setTimeout(GC_gcollect, 0);" type of call.
  3. "cooperative garbage collection": this is the idea from Smarter way for preventing LLVM from placing a function local variable on the hidden WebAssembly stack? #17131, experimenting with scenarios where user code is able to manually annotate all GC pointers via some kind of a DECLARE_GC_POINTER(myPtr);. While this would be brittle, it is something that might work well in automatic codegen scenarios where correctness can be verified.

Looking at current Boehm Emscripten support, someone landed a change there that switches Boehm to require -sASYNCIFY to be enabled at the same time, thinking that it would enable stack scanning. ( ivmai/bdwgc@1431bda#diff-f82cc1b932242998b29875a64e8490b12c461ba04ab72e59ef0ac5b2efa8af4a ).

While I am not intimate with the 100% details of ASYNCIFY, I was pondering that maybe that would not be correct at all? @kripken true or false: ASYNCIFY only spills locals to the stack only in those functions that can lead to calling asyncified functions, but not in all of the functions? In other words, enabling ASYNCIFY is not enough to substitute the current need for a --spill-pointers pass? I.e. simply enabling ASYNCIFY would not properly spill pointers on the stack?

Also doesn't ASYNCIFY use its own stack that is separate from the regular stack? (would emscripten_scan_stack() scan this Asyncify stack?)

How would I appropriately enable the --spill-pointers pass in Binaryen? I've so far tried just to do

diff --git a/emcc.py b/emcc.py
index f2dc5821a..28ee19f4d 100755
--- a/emcc.py
+++ b/emcc.py
@@ -594,6 +594,8 @@ def should_run_binaryen_optimizer():

 def get_binaryen_passes():
   passes = []
+  if not settings.BOOTSTRAPPING_STRUCT_INFO:
+    passes += ['--spill-pointers']
   optimizing = should_run_binaryen_optimizer()
   # safe heap must run before post-emscripten, so post-emscripten can apply the sbrk ptr
   if settings.SAFE_HEAP:

though that gets me correctness problems. Looks like printf() starts to fail with that, and most of the core0 tests in the suite as well. I am not sure if there might be some ordering dependencies with other passes that this spilling pointers pass should be run at? Or maybe something might be off with the --spill-pointers pass?

@juj
Copy link
Collaborator Author

juj commented Nov 23, 2022

I am not sure if there might be some ordering dependencies with other passes that this spilling pointers pass should be run at?

(if I try to add the --spill-pointers pass as the last pass, looks like it does give the same result, so maybe not an ordering issue here)

@kripken
Copy link
Member

kripken commented Nov 28, 2022

First thing, have you seen emscripten_scan_registers? That uses Asyncify to scan state in "registers" (spilled or not), so it could be an option here.

ASYNCIFY only spills locals to the stack only in those functions that can lead to calling asyncified functions, but not in all of the functions? In other words, enabling ASYNCIFY is not enough to substitute the current need for a --spill-pointers pass? I.e. simply enabling ASYNCIFY would not properly spill pointers on the stack?

Asyncify should be enough if "asyncified functions" == "GCing functions". That is, Asyncify will save stuff on locals all the way up the current stack when you pause/resume, and so if GC functions are async functions then each GC will be able to see locals up the stack. And those are the only relevant locals so it should be enough (aside from the heap of course).

Spill-pointers can still be useful, if you don't have a clear set of GCing functions perhaps, and also spill-pointers has different overhead (Asyncify instruments control flow too, and adds code to read spilled registers, not just write them). I've never compared them but I'd hope spill-pointers could be faster.

Also doesn't ASYNCIFY use its own stack that is separate from the regular stack?

Yes, it uses data in the heap somewhere (which is internally a stack). So you need to scan that region in addition to the normal user stack. I believe emscripten_scan_registers does that.

How would I appropriately enable the --spill-pointers pass in Binaryen?

I think you can use `-sBINARYEN_EXTRA_PASSES=--spill-pointers".

But the correctness problems you saw seem worrying - that sounds like a possible bug in spill-pointers, I can't think of another explanation. It might be worth reducing a testcase, or alternatively adding --spill-pointers to the list of passes in the binaryen fuzzer and running the fuzzer for a while (that will find more easily-reducible things). (For that, you can add to this list, or even just replace the list with just the pass you want.)

@juj
Copy link
Collaborator Author

juj commented Nov 30, 2022

Asyncify should be enough if "asyncified functions" == "GCing functions

Thanks, that is what I was expecting. So basically this means that in order for ASYNCIFY to work as a solution, that means that the function GC_gcollect(); which performs the mark+sweep garbage collection should be marked asynchronous(?).

Or actually, looking at Boehm's internals, the code structure it is currently used with looks like this:

void (*gc_scan_roots_func)(void);

void GC_gcollect()
{
   // register roots:
   // ...
   gc_scan_roots_func();

   // marking:
   ...

   // sweeping:
   ...
}
...
static void scan_regs_cb(void *begin, void *end)
{
  GC_push_all_stack((ptr_t)begin, (ptr_t)end);
}

void GC_default_push_other_roots()
{
  emscripten_scan_registers(scan_regs_cb);
}
gc_scan_roots_func = &GC_default_push_other_roots;

That is, Boehm allows the user to set an extra callback to inject custom roots, and an Emscripten specific callback is registered there to call emscripten_scan_registers().

But that call is done via an indirect function pointer dispatch - does that affect ASYNCIFY's ability to infer that GC_gcollect() should be async? Or should it be manually declared as such on the command line? (or the call from GC_gcollect() -> GC_default_push_other_roots() be made a direct function call?)

Spill-pointers can still be useful, if you don't have a clear set of GCing functions perhaps, and also spill-pointers has different overhead (Asyncify instruments control flow too, and adds code to read spilled registers, not just write them). I've never compared them but I'd hope spill-pointers could be faster.

That makes sense. I am not actually interested at all in using ASYNCIFY in conjunction with GCing at Unity, but the reason I am looking into this is that someone changed upstream Boehm to hard-depend on ASYNCIFY (regressing our use case), so I want to restore our use case there, but in order to do it right, I wanted to double check that the ASYNCIFY implementation there even works like would be expected.

But the correctness problems you saw seem worrying - that sounds like a possible bug in spill-pointers, I can't think of another explanation. It might be worth reducing a testcase

I do have a reduced test case cooking, I'll try to get that down to a minimum.

In one test case I see that --spill-pointers is working ok in -O0, but fails with -O1. Do you anticipate currently that there would be any reason for -O1 to be incompatible with --spill-pointers?

@kripken
Copy link
Member

kripken commented Nov 30, 2022

So basically this means that in order for ASYNCIFY to work as a solution, that means that the function GC_gcollect(); which performs the mark+sweep garbage collection should be marked asynchronous(?).

Yes. That is automatic if the function is an import. Otherwise, add it to ASYNCIFY_ADD.

But that call is done via an indirect function pointer dispatch - does that affect ASYNCIFY's ability to infer that GC_gcollect() should be async?

If you don't flip the default behavior, all indirect calls are assumed to be relevant (we could track by the function type, but don't atm.) So that should just work, but in a big program indirect calls will mean lots of code gets instrumented, sadly.

Interesting btw that upstream Boehm GC was using emscripten_scan_registers - I didn't know that!

Do you anticipate currently that there would be any reason for -O1 to be incompatible with --spill-pointers?

No. It should be a simple and safe transform. So there must be a weird bug somewhere...

@juj
Copy link
Collaborator Author

juj commented Dec 1, 2022

The smallest test case I currently have is this:

#include <stdio.h>
#include <stdarg.h>

void xprintf(const char *restrict fmt, ...)
{
	va_list ap;
	va_start(ap, fmt);
	vfprintf(stdout, fmt, ap);
	va_end(ap);
}

int main()
{
	xprintf("%d\n", 5);
}

When built with e.g. emcc a.c -o a.js -sBINARYEN_EXTRA_PASSES=--spill-pointers -O1, it incorrectly prints out 0, but with -O0, it prints out 5.

To be able to easier compare what is different with -O0 vs -O1, I applied the following local changes to Emscripten:

diff --git a/emcc.py b/emcc.py
index 55f308f6f..4cc85da8d 100755
--- a/emcc.py
+++ b/emcc.py
@@ -1890,6 +1894,7 @@ def phase_linker_setup(options, state, newargs):
     # However, for debugability is better to have the stack come first
     # (becuase stack overflows will trap rather than corrupting data).
     settings.STACK_FIRST = True
+  settings.STACK_FIRST = False

   # Default to TEXTDECODER=2 (always use TextDecoder to decode UTF-8 strings)
   # in -Oz builds, since custom decoder for UTF-8 takes up space.
@@ -3637,7 +3642,7 @@ def phase_binaryen(target, options, wasm_target):

     wasm2js = building.wasm2js(wasm2js_template,
                                wasm_target,
-                               opt_level=settings.OPT_LEVEL,
+                               opt_level=0,
                                minify_whitespace=minify_whitespace(),
                                use_closure_compiler=options.use_closure_compiler,
                                debug_info=debug_info,

and built the test case with

emcc a.c -o a1.js -sBINARYEN_EXTRA_PASSES=--spill-pointers -O0 -g3 -sWASM=0 -sASSERTIONS
emcc a.c -o a2.js -sBINARYEN_EXTRA_PASSES=--spill-pointers -O1 -g3 -sWASM=0 -sASSERTIONS

and the relevant diff I get is

image

The different parts are marked in purple. It looks like the relevant issue is the difference on lines

__stack_pointer = $3 - 48 | 0;

vs

__stack_pointer = $3 - 16 | 0;

The other changes don't seem relevant.

I don't understand why there is such a large 48 byte stack frame bump, it seems that is too large compared to what is actually needed. But when -O1 replaces that code with only a 16 byte bump, that breaks the code.

@juj
Copy link
Collaborator Author

juj commented Dec 1, 2022

The behavior with the second stack frame bumping looks odd overall. In the -O0 spill-pointers case on the left, there is

$3 = __stack_pointer;
__stack_pointer = $3 - 48 | 0;
...
HEAP32[($3 + 16 | 0) >> 2] = $2;
...
__stack_pointer = $3;

but that construct does not look like a correct stack frame bump operation? The line where $2 is stored, doesn't that go on the calling function's stack space, since $3 represents the pointer before the stack bump?

(Also the value of $2 is the value of the stack pointer, not sure why it would be spilling the stack pointer itself?)

I would expect that the second stack frame bump should instead read like the first stack frame bump, i.e.:

$3 = __stack_pointer - 48 | 0;
__stack_pointer = $3 | 0;
...
HEAP32[($3 + 16 | 0) >> 2] = $2;
...
__stack_pointer = $3 + 48 | 0;

then the HEAP write would not go out of bounds?

@kripken
Copy link
Member

kripken commented Dec 1, 2022

Hmm, maybe the error is we run out of stack? Though it's strange it happens in -O1 which uses less stack than -O0. Perhaps it overwrites other data on the stack, then. We should check how the LLVM backend manages the stack pointer, I'm not sure if it needs to be bumped before or after use, in those code fragments.

Btw, we could actually do this in a better way now: the pass could create a new Memory just for spilling, and operate on that (and trap if we run out of space). That would be simple and avoid any risks of bad interactions with the normal stack. We have a multimemory lowering pass now which would let that run in all VMs, even ones without multimemory support. We've recently added that support in Asyncify and wasm-split actually so this would be pretty simple to do.

@sbc100
Copy link
Collaborator

sbc100 commented Dec 1, 2022

For details of how llvm treats the stack pointer see: https://github.com/WebAssembly/tool-conventions/blob/main/BasicCABI.md#the-linear-stack.

@juj
Copy link
Collaborator Author

juj commented Dec 1, 2022

Hmm, maybe the error is we run out of stack?

I don't think this is an out of stack error, but to me it reads that the second stack is being spilled incorrectly? The stack frame generation code pattern for spilling does not seem to be proper.

We should check how the LLVM backend manages the stack pointer, I'm not sure if it needs to be bumped before or after use, in those code fragments.

The regular stack bump without --spill-pointers looks like this:

$0 = __stack_pointer - <stack_frame_size> | 0;
__stack_pointer = $0 | 0;
...
HEAP32[($0 + x | 0) >> 2] = some_data_on_stack;
...
__stack_pointer = $0 + <stack_frame_size> | 0;

I'd expect that the --spill-pointers stack bump logic should look the same as this? (ideally these two would be merged together to create just one bump, but that's probably a separate optimization)

Also the stack bump amount that --spill-pointers does is excessively big, though there shouldn't be any harm in that.

There is also a "to-optimize-later" performance issue with spilling that it seems that the spilled data is later read, but if I understand correctly, the pointers that are spilled on the stack could be treated in a write-only "fire and forget" fashion, there is no reason to read those pointers back later? (they will be the same)

Btw, we could actually do this in a better way now: the pass could create a new Memory just for spilling, and operate on that (and trap if we run out of space)

Hmm, that does not sound that exciting. I think it would probably be best to keep this simple, since there is no real need for multimemories. (how would that extend to multiple threads?)

@kripken
Copy link
Member

kripken commented Dec 1, 2022

Yes, could be that the simplest path here is to modify the pass to emit the same type of code as the LLVM backend does. That should work. My idea with multimemories was that it would give a guarantee of not interacting badly with anything else since it would be separate (which is a nice property that we now have as an option in Asyncify and wasm-split) but it would be more work, assuming the only needed fix otherwise is how the stack pointer is bumped.

@juj
Copy link
Collaborator Author

juj commented Dec 2, 2022

Ok, found the issue - I'll submit a PR soon.

It looks like Asyncify does not work with spill pointers, not quite sure why though. Anything you think would be incompatible there? (it's not particularly important I suppose, these features are likely never used together, given that asyncify has its own mechanism it can use)

@kripken
Copy link
Member

kripken commented Dec 2, 2022

Hmm, maybe it's possible that spill-pointers uses the stack pointer local before asyncify restored it during rewinding, or something like that? I can imagine that happening if the order is asyncify and then spill-pointers. But I'd expect the order of spill-pointers and then asyncify to always work.

@waywardmonkeys
Copy link
Contributor

@juj Did you end up submitting a PR? Will --spill-pointers work if I'm not using -s ASYNCIFY?

@hoodmane
Copy link
Collaborator

@waywardmonkeys it looks like @juj submitted WebAssembly/binaryen#5315

@juj
Copy link
Collaborator Author

juj commented Mar 20, 2023

Ops, sorry, missed your original ping @waywardmonkeys .

Yeah, I have the above PR cooking, though I found that that PR alone did not fix Boehm to work without ASYNCIFY, but something else is amiss in either the --spill-pointers pass or elsewhere. Though I was pulled to solve another issue so haven't yet had a moment to come back to debug further what the remaining issues there are.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants