Skip to content

Firefox v98 regression in v2.0.28 #16538

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
hoodmane opened this issue Mar 20, 2022 · 15 comments
Open

Firefox v98 regression in v2.0.28 #16538

hoodmane opened this issue Mar 20, 2022 · 15 comments

Comments

@hoodmane
Copy link
Collaborator

hoodmane commented Mar 20, 2022

I've been working on trying to get Pyodide working on a more recent Emscripten since we're about a year out of date. So far I've made it up to v2.0.27 but on v2.0.28 there is a failure in Firefox v98. Many Pyodide tests fail, it seems that some function pointers are getting corrupted and cause either indirect call signature mismatch or just calling some random function. I haven't bisected the firefox version, but the bug is present in v98.0.1 and not in v93 or v99b5.

I bisected the emscripten-releases repo and found the regression first appears here:

commit 5a9a0c3d01c75410e237670633fd28280e8b2a47 (HEAD)
Author: chromium-autoroll <[email protected]>
Date:   Fri Aug 20 22:06:58 2021 +0000

    Roll llvm-project from b311a040ef9c to 9e9d70591e72 

Anyways I was wondering if you had any advice about what to do with this. Maybe the most reasonable thing is that Pyodide should just stick to Emscripten v2.0.27 until Firefox makes its next release, since the problem is gone on Firefox beta.

@sbc100
Copy link
Collaborator

sbc100 commented Mar 20, 2022

Interesting ... it sounds like this is either regression in emscripten or in firefox or both. Does the regression occur on chrome or safari? If not, that would strongly suggest a firefox regression.

IIRC your build uses dyanmic linking, correct? (MAIN_MODULE/SIDE_MODULE). How about threading? Are you able to find a single test that fails that doesn't rely on threading? Is it possible to make a build with no dynamic linking.. would such a build be able to run any of the failing tests? If so, does the static (non-dynamic) build also fail any of the tests? In short, can we figure out if dyanamic linking and/or threading is triggering this issue?

@sbc100
Copy link
Collaborator

sbc100 commented Mar 20, 2022

Is it just a single firefox release that is effected? How about v94 through v97?

@hoodmane
Copy link
Collaborator Author

hoodmane commented Mar 20, 2022

Does the regression occur on chrome or safari?

Does not occur in chrome or node. I am not testing Safari.

your build uses dynamic linking, correct? (MAIN_MODULE/SIDE_MODULE). How about threading?

Yes dynamic linking, no threading

Is it possible to make a build with no dynamic linking.. would such a build be able to run any of the failing tests? If so, does the static (non-dynamic) build also fail any of the tests? In short, can we figure out if dynamic linking and/or threading is triggering this issue?

Every failing test involves dynamic linking.

Is it just a single firefox release that is effected? How about v94 through v97?

Yeah I'll go ahead and check those too. I don't have a particularly efficient way of testing different browser versions though.

@hoodmane
Copy link
Collaborator Author

Bug is present in versions 96, 97, and 98, absent from other versions.

@sbc100
Copy link
Collaborator

sbc100 commented Mar 21, 2022

Well this rather odd. I guess the next step would be to try and figure out exactly where the crash is coming from. To do this you could compare the output of emscripten before and after its output break firefox, and also compare the output of firefox in both working and non-working versions. You can build with -sDYLINK_DEBUG to get a lot of information about the loading of code and the assignment of function pointers.

It seems somewhat unlikely (although not impossible) that it would be an llvm-side change. Are you totally sure about the bisection result?

@hoodmane
Copy link
Collaborator Author

hoodmane commented Mar 21, 2022

I will double check bisection. I think I can reduce one of the test failures into a pretty small reproduction, so that should be helpful.

@hoodmane
Copy link
Collaborator Author

hoodmane commented Mar 21, 2022

Okay maybe the issue is more complicated than it seems. I have a test failure that involves only C/C++ code, no Python APIs so that's encouraging because it can compile without building all of Python and its dependencies. But if I only compile the code that is needed for the test, the problem goes away... Maybe I am missing some compiler flag that contributes to the problem,

@hoodmane
Copy link
Collaborator Author

hoodmane commented Mar 24, 2022

Confirmed that the bug was introduced on 5a9a0c3d01c75410e237670633fd28280e8b2a47.

CI failure on 5a9a0c3d01c75410e237670633fd28280e8b2a47 :
https://app.circleci.com/pipelines/github/hoodmane/pyodide/2563/workflows/1946f0a2-b2ba-4d1c-aecd-c9aea3868ad7/jobs/27691/tests#failed-test-0

CI success on the commit before 5a9a0c3d01c75410e237670633fd28280e8b2a47 which was dde381c7972be2e62ff8ac9b8ded83d09c869e6e:
https://app.circleci.com/pipelines/github/hoodmane/pyodide/2565/workflows/875ef1f6-8bfd-4866-bf02-5955a1a597a3/jobs/27708

@sbc100
Copy link
Collaborator

sbc100 commented Mar 24, 2022

Can you try with the very latest emscripten? It could be a regression in 2.0.28 that was already fixed.

@hoodmane
Copy link
Collaborator Author

I think it's still present on tot.

@hoodmane
Copy link
Collaborator Author

Will check.

@sbc100
Copy link
Collaborator

sbc100 commented Mar 24, 2022

I guess we need to try to narrow down what is actually going wrong, and how it could be only effecting firefox? Could it be a regression in firefox? Unlikely but possible I suppose.

@hoodmane
Copy link
Collaborator Author

hoodmane commented Mar 24, 2022

It wouldn't be the first browser bug like this we've hit
https://bugs.chromium.org/p/chromium/issues/detail?id=1200031

Also statically linking the code into the main module while changing nothing else does make the bug go away.

@hoodmane
Copy link
Collaborator Author

It's confusing that chromium-autoroll commits from UTC but you commit from PST because the commits don't look like they are in chronological order.

@hoodmane
Copy link
Collaborator Author

Can you try with the very latest emscripten?

I tried it one commit back from tip of tree and the earliest test failure went away but there are still many other test failures from the same bug:
https://app.circleci.com/pipelines/github/hoodmane/pyodide/2582/workflows/64a45116-3547-42ad-9d78-b6de27cae9a7/jobs/27932

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants