Segfault when building haskell-opencv with openblas >= 0.3.3 #1923

Closed
basvandijk opened this issue Dec 19, 2018 · 16 comments

Comments

@basvandijk

Building haskell-opencv on nixpkgs with openblas >= 0.3.3 results in a segfault when compiling one of the included executables. The build succeeds with openblas-0.3.2.

This is reported in more detail in: NixOS/nixpkgs#52439.

This has to be caused by something in v0.3.2..v0.3.3. Any idea what it could be?

@martin-frbg
Collaborator

martin-frbg commented Dec 19, 2018

No spontaneous idea - 0.3.2 to 0.3.3 included a (partial) revert of the experimental TLS memory allocator that may have made OpenBLAS a bit more likely to run out of thread pointers again, but (a) you would have experienced that problem in 0.3.0 and earlier versions already and (b) from the nixpkgs ticket it appears you already tried 0.3.4 which uses a safer default for the number of threads.
Any chance you could produce a backtrace from a build with DEBUG=1, so that we get a clearer indication of where the segfault happens? (I assume it happens when running some opencv test, not actually during compilation.) Also, which CPU was OpenBLAS compiled for, and what is it running on in the failure case?

@brada4
Contributor

brada4 commented Dec 19, 2018

Could you try building with 0.3.4 before reverting to an old release?
Also, a gdb backtrace is usually handy in case of a crash, though it is not worth producing one against the known problem in 0.3.3.

@martin-frbg
Collaborator

Luckily the time period between 0.3.2 and 0.3.3 was just one month, but apart from reverting to the pre-0.3.1 thread memory allocation logic not much happened besides addition of cpu-specific code for IBM Z and for AVX512-capable Intel processors (Skylake X and recent Xeons).

@basvandijk
Author

basvandijk commented Dec 19, 2018

@martin-frbg I will try to get a backtrace using DEBUG=1. Note that the segfault occurs during compilation of Haskell code, not when running code during a test.

@nh2 you tried building with 0.3.4 right? You said the tests hang but did the segfault disappear?

@nh2

nh2 commented Dec 19, 2018

you tried building with 0.3.4 right? You said the tests hang but did the segfault disappear?

Yes, it did disappear. I just posted it on NixOS/nixpkgs#52439 (comment)

By the way, it may not be an OpenBLAS-only problem, because we have one report for another package where the segfault happens and OpenBLAS is not involved.

But it may be that the TLS stuff (or something else) that OpenBLAS did for these releases is the problem in general, and that the other package uses something similar.

@basvandijk
Author

basvandijk commented Dec 19, 2018

@nh2 I see that in the latest nixpkgs master openblas is at 0.3.4 and with that we indeed don't get the segfault anymore. Now it's just the libgomp error:

$ nix-build -A haskellPackages.opencv
...
    Feature Detection
      houghLinesP:
libgomp: Out of memory allocating 927712937064 bytes
Test suite test-opencv: FAIL

So this issue can be closed. @martin-frbg and @brada4 thanks for your help and sorry for the noise!

@nh2

nh2 commented Dec 19, 2018

So this issue can be closed. @martin-frbg and @brada4 thanks for your help and sorry for the noise!

@basvandijk I'm not super convinced though: We should probably still figure it out.

If it is TLS and OpenBLAS wants to re-enable it in the future, we probably hit this immediately again, and the other issue I mentioned won't go away by itself either.

Though I agree it is not an OpenBLAS bug in the latest release as per current knowledge, we may be able to find the underlying cause, given that we could narrow the problem down to being triggered by a small amount of code in OpenBLAS.

@nh2

nh2 commented Dec 19, 2018

Update: Here's a gdb stack trace: NixOS/nixpkgs#52439 (comment)

Quoting the important bits (this is for openblas-0.3.3):

(gdb) bt
#0  0x00007f7a31534880 in blas_memory_free () from /nix/store/ww2lj137v7mpkxcsmpnv67xa2ybrff77-openblas-0.3.3/lib/libopenblas.so.0
#1  0x00007f7a31535325 in blas_thread_shutdown_ () from /nix/store/ww2lj137v7mpkxcsmpnv67xa2ybrff77-openblas-0.3.3/lib/libopenblas.so.0
#2  0x00007f7a315348bb in blas_shutdown () from /nix/store/ww2lj137v7mpkxcsmpnv67xa2ybrff77-openblas-0.3.3/lib/libopenblas.so.0
#3  0x00007f7a313bb014 in gotoblas_quit () from /nix/store/ww2lj137v7mpkxcsmpnv67xa2ybrff77-openblas-0.3.3/lib/libopenblas.so.0
#4  0x00007f7a950cb3b7 in _dl_fini () from /nix/store/1mnsmslnx5anjfksac6417xfzzglrwhr-glibc-2.27/lib/ld-linux-x86-64.so.2
#5  0x00007f7a8a3f9351 in __run_exit_handlers () from /nix/store/1mnsmslnx5anjfksac6417xfzzglrwhr-glibc-2.27/lib/libc.so.6
#6  0x00007f7a8a3f943a in exit () from /nix/store/1mnsmslnx5anjfksac6417xfzzglrwhr-glibc-2.27/lib/libc.so.6
#7  0x00007f7a8b4c3bfb in stg_exit () from /nix/store/wc4chfs9wgi67g1r7im93358h1j6cdkz-ghc-8.4.4/lib/ghc-8.4.4/bin/../rts/libHSrts_thr-ghc8.4.4.so
#8  0x00007f7a8b4c404e in shutdownHaskellAndExit () from /nix/store/wc4chfs9wgi67g1r7im93358h1j6cdkz-ghc-8.4.4/lib/ghc-8.4.4/bin/../rts/libHSrts_thr-ghc8.4.4.so
#9  0x00007f7a8b4b46ff in hs_main () from /nix/store/wc4chfs9wgi67g1r7im93358h1j6cdkz-ghc-8.4.4/lib/ghc-8.4.4/bin/../rts/libHSrts_thr-ghc8.4.4.so
#10 0x000000000042a8b9 in main ()

@martin-frbg
Collaborator

This could be related to the remaining issue from #1720 (a missing pthread_key_delete on thread shutdown; testing the proposed solution is already on the to-do list for the 0.3.5 milestone). Actually, I would be more interested in backtraces from 0.3.4: given the surprise bugs encountered with the new TLS code, it was supposed to be switched off already in 0.3.3, and I expect it to remain a non-default option for at least the next version despite its demonstrated advantages.

@nh2

nh2 commented Dec 19, 2018

I would be more interested in backtraces from 0.3.4

You mean with the TLS feature explicitly enabled?

@nh2

nh2 commented Dec 19, 2018

Here's an improved backtrace after rebuilding the same 0.3.3 as before with -g:

(gdb) bt
#0  blas_memory_free (buffer=0x7fd09edc9040) at memory.c:1249
#1  0x00007fd0a388cb75 in blas_thread_shutdown_ () at blas_server_omp.c:132
#2  0x00007fd0a388c13b in blas_shutdown () at memory.c:1280
#3  0x00007fd0a3712014 in gotoblas_quit () at memory.c:1472
#4  0x00007fd1071a73b7 in _dl_fini () from /nix/store/1mnsmslnx5anjfksac6417xfzzglrwhr-glibc-2.27/lib/ld-linux-x86-64.so.2
#5  0x00007fd0fc4d5351 in __run_exit_handlers () from /nix/store/1mnsmslnx5anjfksac6417xfzzglrwhr-glibc-2.27/lib/libc.so.6
#6  0x00007fd0fc4d543a in exit () from /nix/store/1mnsmslnx5anjfksac6417xfzzglrwhr-glibc-2.27/lib/libc.so.6
#7  0x00007fd0fd59fbfb in stg_exit () from /nix/store/wc4chfs9wgi67g1r7im93358h1j6cdkz-ghc-8.4.4/lib/ghc-8.4.4/bin/../rts/libHSrts_thr-ghc8.4.4.so
#8  0x00007fd0fd5a004e in shutdownHaskellAndExit () from /nix/store/wc4chfs9wgi67g1r7im93358h1j6cdkz-ghc-8.4.4/lib/ghc-8.4.4/bin/../rts/libHSrts_thr-ghc8.4.4.so
#9  0x00007fd0fd5906ff in hs_main () from /nix/store/wc4chfs9wgi67g1r7im93358h1j6cdkz-ghc-8.4.4/lib/ghc-8.4.4/bin/../rts/libHSrts_thr-ghc8.4.4.so
#10 0x000000000042a8b9 in main ()

This one shows memory.c:1249 and blas_server_omp.c:132 are involved.

Can anybody tell me how I can add source info in this situation with gdb?

I get

(gdb) list
1244	memory.c: No such file or directory.

and set substitute-path doesn't seem to work if there's no prefix, e.g. set substitute-path / /home/niklas/src/OpenBLAS/driver/others seems to have no effect.

@martin-frbg
Collaborator

Thanks. This will probably help to get the TLS code in shape eventually. I am still convinced that it would be a great improvement over the old memory management code from GotoBLAS, but there were just too many unexpected interactions and corner cases in the mostly uncommented OpenBLAS code to make it viable in the few months since its appearance. (Including it early has also taught me how many programs have come to depend either directly or indirectly on OpenBLAS nowadays, and how quickly a new release is picked up by distributors)

@nh2

nh2 commented Dec 19, 2018

@martin-frbg How do I enable the DEBUG C-preprocessor macro, like the one used at https://github.com/xianyi/OpenBLAS/blob/fd8d1868a126bb9f12bbc43b36ee30d1ba943fbb/driver/others/memory.c#L1245-L1247? make DEBUG=1 seems to only turn on -g.

@martin-frbg
Collaborator

You are right, the DEBUG=1 does not get passed through the individual Makefiles. I suggest you #define DEBUG in either driver/others/Makefile or at the head of memory.c (both files, as well as init.c, have commented-out #undef DEBUG entries that you could repurpose for this).

@nh2

nh2 commented Dec 19, 2018

I've enabled it now with CFLAGS=-DDEBUG make ..., and get the following result:

niklas@ares:/tmp/nix-build-opencv-0.0.2.1.drv-0/opencv-src$ LD_LIBRARY_PATH=/nix/store/ypaf3wqvs8n4g2cqqix0wrx0a88m2hff-openblas-0.3.3/lib:$LD_LIBRARY_PATH /usr/bin/time /nix/store/wc4chfs9wgi67g1r7im93358h1j6cdkz-ghc-8.4.4/lib/ghc-8.4.4/bin/ghc -B/nix/store/wc4chfs9wgi67g1r7im93358h1j6cdkz-ghc-8.4.4/lib/ghc-8.4.4 --make -no-link -fbuilding-cabal-package -O -static -outputdir dist/build/doc-images-opencv/doc-images-opencv-tmp -odir dist/build/doc-images-opencv/doc-images-opencv-tmp -hidir dist/build/doc-images-opencv/doc-images-opencv-tmp -stubdir dist/build/doc-images-opencv/doc-images-opencv-tmp -i -idist/build/doc-images-opencv/doc-images-opencv-tmp -idoc -idist/build/doc-images-opencv/autogen -idist/build/global-autogen -Idist/build/doc-images-opencv/autogen -Idist/build/global-autogen -Idist/build/doc-images-opencv/doc-images-opencv-tmp -I/nix/store/j573pqybmmy6n2wjb9mw9h657hmqf7wb-opencv-3.4.4/include -optP-include -optPdist/build/doc-images-opencv/autogen/cabal_macros.h -hide-all-packages -Wmissing-home-modules -no-user-package-db -package-db /tmp/nix-build-opencv-0.0.2.1.drv-0/package.conf.d -package-db dist/package.conf.inplace -package-id base-4.11.1.0 -package-id bytestring-0.10.8.2 -package-id containers-0.5.11.0 -package-id data-default-0.7.1.1-JU0tK7MsWQxFeBDXsMopAU -package-id directory-1.3.1.5 -package-id Glob-0.9.3-7ezF27bLnby7o45COKo9bc -package-id haskell-src-exts-1.20.3-KjyAxm84ddk16DoDOnTGLG -package-id JuicyPixels-3.2.9.5-228KAmffah2EHAhe61mJze -package-id linear-1.20.8-4ZVKNh6PPLgAMGxipcwitY -package-id opencv-0.0.2.1-HWLEHw8y7ZcCelxlYQnnuM -package-id primitive-0.6.3.0-DaZpcxwJp2TGn8ITSgfI4C -package-id template-haskell-2.13.0.0 -package-id text-1.2.3.1 -package-id transformers-0.5.5.0 -package-id vector-0.12.0.2-4IpdnxtqTfNJ9xEZNSAM2c -XHaskell2010 -XBangPatterns -XDataKinds -XLambdaCase -XOverloadedStrings -XPackageImports -XPolyKinds -XScopedTypeVariables -XTupleSections -XTypeFamilies -XTypeOperators ExampleExtractor 
Language.Haskell.Meta.Syntax.Translate doc/images.hs -Wall -fwarn-incomplete-patterns -threaded -funbox-strict-fields -rtsopts -j4 -split-sections
[3 of 3] Compiling Main             ( doc/images.hs, dist/build/doc-images-opencv/doc-images-opencv-tmp/Main.o )
Adjusted number of threads :   4
Alloc Start ...
  Position -> 0
Allocation Start : 0
  Success -> 7fa7be5ef000
  Mapping Succeeded. 0x7fa7be5ef000(0)
Mapped   : 0x7fa7be5ef000    0

Alloc Start ...
  Position -> 1
Allocation Start : 0
  Success -> 7fa7bc5ee000
  Mapping Succeeded. 0x7fa7bc5ee000(1)
Mapped   : 0x7fa7bc5ee000    1

Alloc Start ...
  Position -> 2
Allocation Start : 0
  Success -> 7fa7ba5ed000
  Mapping Succeeded. 0x7fa7ba5ed000(2)
Mapped   : 0x7fa7ba5ed000    2

Alloc Start ...
  Position -> 3
Allocation Start : 0
  Success -> 7fa7b85ec000
  Mapping Succeeded. 0x7fa7b85ec000(3)
Mapped   : 0x7fa7b85ec000    3

Unmapped Start : 0x7fa7be5ef000 ...
Command terminated by signal 11
18.09user 6.52system 0:16.87elapsed 145%CPU (0avgtext+0avgdata 784940maxresident)k

So it seems that even though the log reports "Mapping Succeeded" for 0x7fa7be5ef000, that alloc_info pointer is somehow not valid by the time it is released.
