openblas crashes firefox #2030

Closed
beew opened this issue Feb 27, 2019 · 25 comments

@beew

beew commented Feb 27, 2019

OS Ubuntu 16.04 64 bit
openblas 0.3.4 and 0.3.5 compiled from source with

make  NO_AFFINITY=1 DYNAMIC_ARCH=1 USE_OPENMP=1

I use update-alternatives to switch between openblas and intel-mkl. update-alternatives creates a libblas symlink in /usr/lib, and I use LD_PRELOAD to preload libblas (whichever implementation is currently the default in update-alternatives).

If openblas is used, Firefox crashes. Trying to start Firefox from a terminal after the crash produces the output below, and it immediately crashes again (Firefox is up to date, 65.0.1).

ExceptionHandler::GenerateDump cloned child 13834
ExceptionHandler::SendContinueSignalToChild sent continue signal to child
ExceptionHandler::WaitForContinueSignal waiting for continue signal...

This doesn't happen with openblas 0.3.3 and before when compiled and preloaded the same way.
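
When a symlink managed by update-alternatives is preloaded, it can help to confirm which real file is being loaded before comparing versions. The following is a minimal sketch, not part of the original report; the helper name `resolve_blas` is hypothetical, and the symlink path must be whatever update-alternatives manages on the system in question.

```shell
# Hypothetical helper: print the real file behind a library symlink.
# The path passed in is an assumption; use the symlink that
# update-alternatives actually manages on your system.
resolve_blas() {
    link=$1
    if [ ! -e "$link" ]; then
        echo "resolve_blas: no such file: $link" >&2
        return 1
    fi
    # readlink -f follows every level of symlink to the final target.
    readlink -f -- "$link"
}
```

Calling `resolve_blas` on the libblas symlink prints the library that LD_PRELOAD would pull in, which makes it easier to tell an openblas 0.3.3 run from a 0.3.5 one.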

@martin-frbg
Collaborator

What is your hardware? And is there anything in the firefox output that suggests which component of firefox has a dependency on OpenBLAS? Also, does it work if you do not use LD_PRELOAD? (There was a recent issue with sage where it appeared that preloading openblas led to a problem with a signal handler in the python/cython code.)

@beew
Author

beew commented Feb 27, 2019

hardware is

Intel® Core™ i7-7820HK CPU @ 2.90GHz × 8
GeForce GTX 1070/PCIe/SSE2

I don't know how to get additional Firefox crash info other than the exceptions posted. No, it doesn't crash without LD_PRELOAD, or when preloading older versions of openblas (or intel-mkl).

@brada4
Contributor

brada4 commented Feb 27, 2019

How do you use BLAS in Firefox? Nowadays Firefox does not even load native libraries such as Java etc.
It is quite apparent that OpenBLAS and Firefox use platform-native threading, while MKL uses its own homebrew equivalents.
Debugging with LD_PRELOAD is very hard; it is typically used to override malloc with debug versions etc.
Could you run firefox under gdb so that we can see whether firefox actually signalled a foreign (openblas) thread?
EDIT: it boils down to starting firefox in gdb, typing r(un), and once it crashes typing t(hread) a(pply) a(ll) bt
In principle neither firefox nor openblas should lose track of its own threads.

@martin-frbg
Collaborator

Must be some third-party plugin for firefox, either for some computationally expensive operation (why would one need BLAS or LAPACK to view websites?) or something that creates an indirect dependency.
(I think there is even a ff plugin that uses gimp for in-browser image editing.)

@martin-frbg
Collaborator

Problem starting with 0.3.4 at least matches #1936

@beew
Author

beew commented Feb 27, 2019

@brada4

what is gdb?

@brada4
Contributor

brada4 commented Feb 27, 2019

The GNU debugger, which shows backtraces; it is available as a package on any Linux distribution.
It will show the chain of function calls leading to the crash.

@martin-frbg
Collaborator

Do you preload OpenBLAS specifically for firefox, or do you do this globally for some other software and just happened to notice that firefox now crashes?

@beew
Author

beew commented Feb 28, 2019

@martin-frbg

No, not specifically for Firefox, I put it in my .profile for other things and just happened to notice Firefox crashes.

@martin-frbg
Collaborator

I see. This is probably risky in general, not just with libopenblas. In this particular case I suspect (similar to #1936) that a relatively small change in thread-local memory usage with 0.3.4 changes the minimum stack size requirements of threads subsequently started by firefox. (Possibly due to a bug in the C runtime library, glibc, and not actually in OpenBLAS itself). So if the situation is similar to that other issue, thread creation now fails if a (now too) small stack size is requested, but unlike #1936 there is another bug in firefox (or a plugin) where it does not check if thread creation succeeded, so you get a crash.

@brada4
Contributor

brada4 commented Mar 2, 2019

@beew does it crash the same way if you load an openblas built without openmp?

@beew
Author

beew commented Mar 2, 2019

openblas was compiled against openmp, I don't know how to load one without the other.

@brada4
Contributor

brada4 commented Mar 2, 2019

Just compile without any options....

@beew
Author

beew commented Mar 3, 2019

@brada4

I have tested by compiling openblas 0.3.5 with no options and using a fresh Firefox profile with no add-ons; it still crashes immediately.

@brada4
Contributor

brada4 commented Mar 3, 2019

I cannot get firefox to crash by LD_PRELOAD-ing openblas or even various malloc replacements.
A backtrace from you is really needed:

LD_PRELOAD=./libopenblas.so gdb firefox
(gdb) run
blah blah SEGV
(gdb) t a a bt
Here is interesting stuff for us

If the backtrace does not have function names in it, just long hex addresses, then while the process is still in the crashed state, look at the PIDs mentioned and make a copy of /proc/<pid>/maps to clarify the process memory layout.
The latter are rather long, so it is best to zip them before uploading.
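
The interactive session above can also be run non-interactively, which avoids the pager prompts and makes the output easy to save. This is a minimal sketch under a few assumptions: the function name `bt_all` and the `GDB` override variable are hypothetical, and on Ubuntu the argument should be the real binary (the gdb output later in this thread shows /usr/lib/firefox/firefox) rather than the wrapper script. It uses gdb's `-batch`, `-ex` and `--args` options and the `set environment` command, so the library is preloaded into the debugged program rather than into gdb itself.

```shell
# Sketch: run a program under gdb in batch mode and print a backtrace of
# every thread once it stops (for example on SIGSEGV).
# GDB is a hypothetical override for the debugger binary; default "gdb".
bt_all() {
    preload=$1   # library to preload into the debugged program only
    shift
    "${GDB:-gdb}" -batch \
        -ex "set environment LD_PRELOAD=$preload" \
        -ex "run" \
        -ex "thread apply all bt" \
        --args "$@"
}
```

A possible invocation is `bt_all ./libopenblas.so /usr/lib/firefox/firefox > backtrace.txt 2>&1`; the resulting file can then be attached to the issue.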

@brada4
Contributor

brada4 commented Mar 3, 2019

Can you check whether #2039 fixes anything?

@martin-frbg
Collaborator

@brada4 that seems highly unlikely, as the only platform that still defines WHEREAMI is 32bit x86

@beew
Author

beew commented Mar 7, 2019

Hi
@brada4

openblas compiled without options, Firefox clean profile with no add-ons.

$ export LD_PRELOAD=~/Downloads/openblas_test/lib/libopenblas.so.0
$ firefox -g
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.5) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/lib/firefox/firefox...Reading symbols from /usr/lib/debug/.build-id/e2/567bf2b7c48fd113e9c3677beb7b5af79f6816.debug...done.
done.
(gdb) 
(gdb) run
Starting program: /usr/lib/firefox/firefox 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff55ff700 (LWP 19419)]
[New Thread 0x7ffff4dfe700 (LWP 19420)]
[New Thread 0x7ffff05fd700 (LWP 19421)]
[New Thread 0x7fffefdfc700 (LWP 19422)]
[New Thread 0x7fffeb5fb700 (LWP 19423)]
[New Thread 0x7fffe8dfa700 (LWP 19424)]
[New Thread 0x7fffe85f9700 (LWP 19425)]
[New Thread 0x7fffd0d57700 (LWP 19434)]
[Thread 0x7fffd0d57700 (LWP 19434) exited]
[Thread 0x7fffe85f9700 (LWP 19425) exited]
[Thread 0x7fffe8dfa700 (LWP 19424) exited]
[Thread 0x7fffeb5fb700 (LWP 19423) exited]
[Thread 0x7fffefdfc700 (LWP 19422) exited]
[Thread 0x7ffff05fd700 (LWP 19421) exited]
[Thread 0x7ffff4dfe700 (LWP 19420) exited]
[Thread 0x7ffff55ff700 (LWP 19419) exited]
[New Thread 0x7fffe85f9700 (LWP 19436)]
[New Thread 0x7fffe8dfa700 (LWP 19437)]
[New Thread 0x7fffeb5fb700 (LWP 19438)]
[New Thread 0x7ffff7ef0700 (LWP 19439)]

Thread 1 "firefox" received signal SIGSEGV, Segmentation fault.
Watchdog::Init (this=<optimized out>)
    at /build/firefox-5TPFUC/firefox-65.0.1+build2/js/xpconnect/src/XPCJSContext.cpp:155
155	/build/firefox-5TPFUC/firefox-65.0.1+build2/js/xpconnect/src/XPCJSContext.cpp: No such file or directory.
(gdb) t a a bt

Thread 13 (Thread 0x7ffff7ef0700 (LWP 19439)):
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x00007fffd2b67b0a in epoll_wait (epfd=<optimized out>, 
    events=events@entry=0x7ffff5660b80, maxevents=<optimized out>, 
    timeout=timeout@entry=-1)
    at /build/firefox-5TPFUC/firefox-65.0.1+build2/ipc/chromium/src/third_party/libevent/epoll_sub.c:64
#2  0x00007fffd2b69aa0 in epoll_dispatch (base=0x7ffff4cd8400, 
    tv=<optimized out>)
    at /build/firefox-5TPFUC/firefox-65.0.1+build2/ipc/chromium/src/third_party/libevent/epoll.c:462
#3  0x00007fffd2b6c38e in event_base_loop (base=0x7ffff4cd8400, 
    flags=flags@entry=1)
    at /build/firefox-5TPFUC/firefox-65.0.1+build2/ipc/chromium/src/third_party/libevent/event.c:1947
#4  0x00007fffd2b5484b in base::MessagePumpLibevent::Run (this=0x7ffff4c6b700, 
    delegate=0x7ffff7ee0d30)
    at /build/firefox-5TPFUC/firefox-65.0.1+build2/ipc/chromium/src/base/message_pump_libevent.cc:345
#5  0x00007fffd2b56a0d in MessageLoop::RunInternal (this=0x7ffff7ee0d30)
    at /build/firefox-5TPFUC/firefox-65.0.1+build2/ipc/chromium/src/base/message_loop.cc:314
#6  MessageLoop::RunHandler (this=0x7ffff7ee0d30)
    at /build/firefox-5TPFUC/firefox-65.0.1+build2/ipc/chromium/src/base/message_loop.cc:307
#7  MessageLoop::Run (this=this@entry=0x7ffff7ee0d30)
    at /build/firefox-5TPFUC/firefox-65.0.1+build2/ipc/chromium/src/base/message_loop.cc:289
#8  0x00007fffd2b62b32 in base::Thread::ThreadMain (this=0x7ffff4c6b600)
    at /build/firefox-5TPFUC/firefox-65.0.1+build2/ipc/chromium/src/base/thread.cc:192
#9  0x00007fffd2b5448a in ThreadFunc (closure=<optimized out>)
    at /build/firefox-5TPFUC/firefox-65.0.1+build2/ipc/chromium/src/base/platform_thread_posix.cc:40
#10 0x00007ffff6c806ba in start_thread (arg=0x7ffff7ef0700)
    at pthread_create.c:333
#11 0x00007ffff5f1141d in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 12 (Thread 0x7fffeb5fb700 (LWP 19438)):
#0  0x00007ffff5f0574d in poll () at ../sysdeps/unix/syscall-template.S:84
#1  0x00007fffdee2438c in ?? () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#2  0x00007fffdee24712 in g_main_loop_run ()
   from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#3  0x00007fffdf4229d6 in ?? () from /usr/lib/x86_64-linux-gnu/libgio-2.0.so.0
#4  0x00007fffdee4ac55 in ?? () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#5  0x00007ffff6c806ba in start_thread (arg=0x7fffeb5fb700)
    at pthread_create.c:333
#6  0x00007ffff5f1141d in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 11 (Thread 0x7fffe8dfa700 (LWP 19437)):
#0  0x00007ffff5f0574d in poll () at ../sysdeps/unix/syscall-template.S:84
#1  0x00007fffdee2438c in ?? () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#2  0x00007fffdee2449c in g_main_context_iteration ()
   from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#3  0x00007fffdee244d9 in ?? () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#4  0x00007fffdee4ac55 in ?? () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#5  0x00007ffff6c806ba in start_thread (arg=0x7fffe8dfa700)
    at pthread_create.c:333
#6  0x00007ffff5f1141d in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 10 (Thread 0x7fffe85f9700 (LWP 19436)):
#0  0x00007ffff5f0574d in poll () at ../sysdeps/unix/syscall-template.S:84
#1  0x00007fffdee2438c in ?? () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#2  0x00007fffdee2449c in g_main_context_iteration ()
   from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#3  0x00007ffff4d0a28d in ?? ()
   from /usr/lib/x86_64-linux-gnu/gio/modules/libdconfsettings.so
#4  0x00007fffdee4ac55 in ?? () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#5  0x00007ffff6c806ba in start_thread (arg=0x7fffe85f9700)
    at pthread_create.c:333
#6  0x00007ffff5f1141d in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 1 (Thread 0x7ffff7f92740 (LWP 19408)):
#0  Watchdog::Init (this=<optimized out>)
    at /build/firefox-5TPFUC/firefox-65.0.1+build2/js/xpconnect/src/XPCJSContext.cpp:155
#1  WatchdogManager::StartWatchdog (this=0x7ffff4c64be0)
    at /build/firefox-5TPFUC/firefox-65.0.1+build2/js/xpconnect/src/XPCJSContext.cpp:395
#2  WatchdogManager::RefreshWatchdog (this=0x7ffff4c64be0)
    at /build/firefox-5TPFUC/firefox-65.0.1+build2/js/xpconnect/src/XPCJSContext.cpp:365
#3  WatchdogManager::RegisterContext (aContext=0x7ffff0416000, 
    this=0x7ffff4c64be0)
    at /build/firefox-5TPFUC/firefox-65.0.1+build2/js/xpconnect/src/XPCJSContext.cpp:284
#4  XPCJSContext::XPCJSContext (this=0x7ffff0416000)
    at /build/firefox-5TPFUC/firefox-65.0.1+build2/js/xpconnect/src/XPCJSContext.cpp:979
#5  0x00007fffd2ea211e in XPCJSContext::NewXPCJSContext (
    aPrimaryContext=aPrimaryContext@entry=0x0)
    at /build/firefox-5TPFUC/firefox-65.0.1+build2/js/xpconnect/src/XPCJSContext.cpp:1214
#6  0x00007fffd2eb25f9 in nsXPConnect::nsXPConnect (this=0x7ffff4cff740)
    at /build/firefox-5TPFUC/firefox-65.0.1+build2/js/xpconnect/src/nsXPConnect.cpp:75
#7  0x00007fffd2eb2656 in nsXPConnect::InitStatics ()
    at /build/firefox-5TPFUC/firefox-65.0.1+build2/js/xpconnect/src/nsXPConnect.cpp:127
#8  0x00007fffd2e94739 in xpcModuleCtor ()
    at /build/firefox-5TPFUC/firefox-65.0.1+build2/js/xpconnect/src/XPCModule.cpp:11
#9  0x00007fffd4d2790b in nsLayoutModuleInitialize ()
    at /build/firefox-5TPFUC/firefox-65.0.1+build2/layout/build/nsLayoutModule.cpp:255
#10 0x00007fffd26fadd5 in nsComponentManagerImpl::Init (this=0x7ffff5673230)
    at /build/firefox-5TPFUC/firefox-65.0.1+build2/xpcom/components/nsComponentManager.cpp:339
#11 0x00007fffd27327ae in NS_InitXPCOM2 (aResult=0x7ffff4c1f500, 
    aBinDirectory=<optimized out>, aAppFileLocationProvider=<optimized out>)
    at /build/firefox-5TPFUC/firefox-65.0.1+build2/xpcom/build/XPCOMInit.cpp:668
#12 0x00007fffd2732a95 in NS_InitXPCOM2 (aResult=aResult@entry=0x7ffff4c1f500, 
    aBinDirectory=<optimized out>, aAppFileLocationProvider=<optimized out>)
    at /build/firefox-5TPFUC/firefox-65.0.1+build2/xpcom/build/XPCOMInit.cpp:714
#13 0x00007fffd5a85c4c in ScopedXPCOMStartup::Initialize (this=0x7ffff4c1f500)
    at /build/firefox-5TPFUC/firefox-65.0.1+build2/toolkit/xre/nsAppRunner.cpp:1377
#14 0x00007fffd5a8cfcf in XREMain::XRE_main (this=this@entry=0x7fffffffc780, 
    argc=argc@entry=1, argv=argv@entry=0x7fffffffdac8, aConfig=...)
    at /build/firefox-5TPFUC/firefox-65.0.1+build2/toolkit/xre/nsAppRunner.cpp:4756
#15 0x00007fffd5a8d366 in XRE_main (argc=1, argv=0x7fffffffdac8, aConfig=...)
    at /build/firefox-5TPFUC/firefox-65.0.1+build2/toolkit/xre/nsAppRunner.cpp:4845
#16 0x000055555555acf6 in do_main (argc=1, argv=0x7fffffffdac8, 
    envp=<optimized out>)
    at /build/firefox-5TPFUC/firefox-65.0.1+build2/browser/app/nsBrowserApp.cpp:214
#17 0x000055555555a3f9 in main (argc=1, argv=0x7fffffffdac8, 
    envp=0x7fffffffdad8)
    at /build/firefox-5TPFUC/firefox-65.0.1+build2/browser/app/nsBrowserApp.cpp:293

After the last line typing return produces no more output.

@brada4
Contributor

brada4 commented Mar 7, 2019

Thank you for the backtrace.
XPCJSContext.cpp:155 is an error-handling path.
It is reached from an attempt to create the watchdog, i.e. not finding an existing one and starting it up, aka pthread_create.

Martin was right about the thread creation failure from the very beginning.
I am wondering whether the 8 threads that exited are the openblas ones.

My speculation on what happens behind the scenes:
the firefox startup script configures some ulimit for its own purposes (but this is not reproducible on other linux systems)
the init procedure of the preloaded openblas fails at thread creation but changes some other limit
in the end firefox is not happy about the thread creation failure, and its error handler crashes (maybe assuming the thread lwp parameters are unmodified since invocation)

Martin knows better how to build an openblas that verbosely logs thread creation.

I would like to ask for a more detailed backtrace of the same crash:
1/ install the firefox-dbg package to get code lines in the backtrace; for example, line 155 in the current gecko source is a blank line inside the error handler.
2/ while firefox is stopped in the debugger, from outside the debugger copy /proc/(pid_of_thread_that_crashed)/maps, which should show the openblas object loaded
3/ after the first crash you could set a breakpoint on pthread_create (b pthread_create) and see whether the thread creations follow some pattern (start attaching the outputs, as I am asking for the longer ones)
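
For step 2, a saved copy of the maps file can be condensed to the list of files mapped into the process, which is enough to see whether libopenblas was loaded. The following is a minimal sketch; the function name `mapped_files` is hypothetical, and it assumes the usual Linux maps layout in which the sixth whitespace-separated field is the path.

```shell
# Sketch: list the distinct files mapped into a process, given a saved
# copy of /proc/<pid>/maps. Anonymous mappings and pseudo-entries such as
# [stack] are skipped. Assumes mapped paths contain no spaces.
mapped_files() {
    awk '$6 ~ /^\// { print $6 }' "$1" | sort -u
}
```

With firefox still stopped in gdb, running `cp /proc/<pid>/maps firefox.maps` from a second terminal and then `mapped_files firefox.maps | grep -i openblas` should print the preloaded library if it is present in the address space.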

@martin-frbg
Collaborator

From what we saw in #1936, I think it goes more like this - libopenblas gets preloaded into firefox' address space (although firefox probably does not need it), sets up its default number of threads which since 0.3.4 reserve an additional 8k of thread-local memory within their stack frame. Then firefox tries to create its own worker threads, requesting a stack size that is big enough for their needs but possibly smaller than what OpenBLAS asked for earlier. Now either this gets rejected immediately, or glibc silently deducts the 8k it saw earlier from the requested size - I am still a bit hazy about how thread memory allocation is supposed to work internally, and where glibc goes wrong, but the glibc bug ticket suggests that the bug is on their end. In either case the thread is unable to run, and firefox bombs out as it does not have an error handler set up for this failure case.

@beew
Author

beew commented Mar 7, 2019

@brada4

I would like to ask for more detailed backtrace of same:
1/ install firefox-dbg package to get code lines in backtrace, say line 155 in current gecko source is blank inside error handler.
2/ while in debugger from outside it
/proc/(pid_of_ thread_that_crashed)/maps that should show openblas object loaded
3/ after first crash you could set breakpoint on pthread create (b pthread_create) and see if thread creations follow some pattern (start attaching outputs, I ask for longer ones)

Hi, sorry, this is a bit over my head. Can you give me more detailed step-by-step instructions? firefox-dbg is already installed.

@martin-frbg
Collaborator

Seems a rather pointless exercise to me anyway, unless a firefox developer was involved.

@beew
Author

beew commented Mar 7, 2019

I don't think it is a firefox problem in particular; I have just reproduced #1936 with sage as well. Who knows how many other programs are affected. It seems more realistic for openblas to fix a problem that originated with it (it seems to have started with 0.3.4).

@martin-frbg
Collaborator

The problem with that is simply that the relevant change in 0.3.4 was made to fix another serious bug, and so far it is not clear (to me at least) that there is anything fundamentally wrong with the changed code. On the other hand, preloading any library is risky as it circumvents regular initialization sequences, and doing an unconditional session-wide LD_PRELOAD of a specialized library is unusual at best. I have kept issue #1936 open on purpose, although that particular case has been "solved" by adjusting the cysignals stack frame accordingly.
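
One way to avoid a session-wide preload is to set LD_PRELOAD only for the commands that actually need the library. The following is a minimal sketch, not a recommendation from the maintainers; the function name `with_preload` is hypothetical and could live in a shell startup file in place of a global `export LD_PRELOAD=...`.

```shell
# Sketch: preload a library for a single command only, so the rest of the
# login session (including a browser) starts without it.
with_preload() {
    lib=$1
    shift
    if [ ! -r "$lib" ]; then
        echo "with_preload: cannot read $lib" >&2
        return 1
    fi
    # env sets LD_PRELOAD only in the environment of the command it starts;
    # the calling shell's environment is left unchanged.
    env LD_PRELOAD="$lib" "$@"
}
```

For example, `with_preload ~/Downloads/openblas_test/lib/libopenblas.so.0 ./my_blas_program` would run a (hypothetical) numerical program against the test build, while firefox launched from the same session is unaffected.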

@martin-frbg
Collaborator

Believed to be fixed by #2879 (reducing stack requirements for OpenBLAS).
