norm(zeros(129,129)) causes Abort trap: 6 #14507

Closed
dlfivefifty opened this issue Dec 29, 2015 · 36 comments
Labels
system:mac Affects only macOS upstream The issue is with an upstream dependency, e.g. LLVM

Comments

@dlfivefifty
Contributor

               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.4.2 (2015-12-06 21:47 UTC)
 _/ |\__'_|_|_|\__'_|  |  Official http://julialang.org/ release
|__/                   |  x86_64-apple-darwin13.4.0

julia> versioninfo()
Julia Version 0.4.2
Commit bb73f34 (2015-12-06 21:47 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin13.4.0)
  CPU: Intel(R) Core(TM) i7-3820QM CPU @ 2.70GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

julia> norm(zeros(129,129))
Abort trap: 6
@dlfivefifty
Contributor Author

This is on El Capitan

@KristofferC
Member

Can't reproduce on Ubuntu or Windows.

@eschnett
Contributor

Cannot reproduce on El Capitan, with a slightly newer version of Julia and a newer CPU:

julia> versioninfo()
Julia Version 0.4.3-pre+6
Commit adffe19* (2015-12-11 00:38 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin15.2.0)
  CPU: Intel(R) Core(TM) i7-4980HQ CPU @ 2.80GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

julia> norm(zeros(129,129))
0.0

Are you sure the original report uses El Capitan? I thought El Capitan is Darwin 15, not Darwin 13.

@dlfivefifty
Contributor Author

Yes, I’m on OS X 10.11.1.

Let me try updating my OS to 10.11.2, see if that fixes it.

@eschnett
Contributor

See https://en.wikipedia.org/wiki/OS_X_El_Capitan. Some part of your system is still on Mavericks. As a wild guess I'd point to Xcode or the Command line tools, or left-over parts from a previous (Mavericks) Julia install.

@dlfivefifty
Contributor Author

I’ve updated to 10.11.2 and still have the issue:

julia> versioninfo()
Julia Version 0.4.2
Commit bb73f34 (2015-12-06 21:47 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin13.4.0)
  CPU: Intel(R) Core(TM) i7-3820QM CPU @ 2.70GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

julia> norm(zeros(129,129))
Abort trap: 6

Let me try reinstalling Julia.

@dlfivefifty
Contributor Author

I’m using the downloadable binary; maybe the reported System information comes from the compilation machine?

@dlfivefifty
Contributor Author

Redownloaded the binary and the same bug is there. I’ll try building 0.4.2 from source now and see if that resolves it.

@dlfivefifty
Contributor Author

OK, I ran the test on a version of Julia built from source. It no longer crashes, but I get the message:
BLAS : Bad memory unallocation! : 32 0x7fff5238db70

julia> versioninfo()
Julia Version 0.5.0-dev+1920
Commit 6bf6f35 (2015-12-29 21:09 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin15.2.0)
  CPU: Intel(R) Core(TM) i7-3820QM CPU @ 2.70GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

julia> norm(zeros(129,129))
BLAS : Bad memory unallocation! :   32  0x7fff5238db70
0.0

@KristofferC
Member

Ref JuliaLang/LinearAlgebra.jl#288

@Sacha0
Member

Sacha0 commented Dec 30, 2015

I can reproduce this on 10.10.5:

julia> versioninfo()
Julia Version 0.5.0-dev+1922
Commit af0668e* (2015-12-30 00:54 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

julia> norm(zeros(129,129))
BLAS : Bad memory unallocation! :   32  0x7fff5ad7a020
0.0

@eschnett
Contributor

It seems unlikely that the OS version is to blame. An alternative explanation is a difference in the CPU architecture (Sandybridge vs. Haswell), which leads to different code paths in OpenBLAS. (In particular, the SIMD vector sizes are different.)

Can you post here the output of ;sysctl machdep.cpu.brand_string? For me (Haswell), it's

shell> sysctl machdep.cpu.brand_string
machdep.cpu.brand_string: Intel(R) Core(TM) i7-4980HQ CPU @ 2.80GHz

@Sacha0
Member

Sacha0 commented Dec 30, 2015

Can you post here the output of ;sysctl machdep.cpu.brand_string?

Don't the returns from versioninfo() above contain that information?

@dlfivefifty
Contributor Author

Yep, it's the same as in versioninfo:

shell> sysctl machdep.cpu.brand_string
machdep.cpu.brand_string: Intel(R) Core(TM) i7-3820QM CPU @ 2.70GHz

@tkelman tkelman added the system:mac Affects only macOS label Dec 30, 2015
@andreasnoack
Member

I can reproduce this on a Haswell machine if I launch Julia with

OPENBLAS_CORETYPE=Sandybridge julia

so I guess there is an issue with OpenBLAS' Sandybridge kernels.

@ViralBShah
Member

Cc @xianyi

@ViralBShah
Member

Yes, I can confirm @andreasnoack's report about Sandybridge.

@gomiero

gomiero commented Dec 30, 2015

Hi all,

My first time here!

I can reproduce the error here with El Capitan version 10.11.2.

julia> versioninfo()
Julia Version 0.4.2
Commit bb73f34 (2015-12-06 21:47 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin13.4.0)
  CPU: Intel(R) Core(TM) i5-2435M CPU @ 2.40GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.3
shell> sysctl machdep.cpu.brand_string
machdep.cpu.brand_string: Intel(R) Core(TM) i5-2435M CPU @ 2.40GHz

What I have found so far is that the error occurs in the file:

/JuliaLang/julia/blob/master/base/linalg/svd.jl

on the line:

svdvals!{T<:BlasFloat}(A::StridedMatrix{T}) = any([size(A)...].==0) ? zeros(T, 0) : LAPACK.gesdd!('N', A)[2]

It calls the function LAPACK.gesdd!('N', A)

The weirdest thing is that, of the sizes I tried, the error occurs only with the size 129x129:

julia> a=zeros(128,128);

julia> LAPACK.gesdd!('N',a)
(128x0 Array{Float64,2},[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0  …
 0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],128x0 Array{Float64,2})

julia> a=zeros(130,130);

julia> LAPACK.gesdd!('N',a)
(130x0 Array{Float64,2},[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0  …  
0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],130x0 Array{Float64,2})

julia> a=zeros(300,300);

julia> LAPACK.gesdd!('N',a)
(300x0 Array{Float64,2},[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0  …  
0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],300x0 Array{Float64,2})

julia> a=zeros(129,129);

julia> LAPACK.gesdd!('N',a)
Abort trap: 6
Whitestar:bin gomiero$ 
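
The size dependence seen in the session above could be probed more systematically by running each size in its own Julia process, so that an abort at one size does not end the whole sweep. A minimal sketch, assuming a Julia 0.4-era `julia` binary on the PATH (where `LAPACK.gesdd!` is reachable without imports, as in the session above); the function name `sweep_gesdd_sizes` is hypothetical:

```shell
# Hypothetical helper: call LAPACK.gesdd!('N', zeros(n,n)) for every size
# given as an argument, one Julia process per size, and report which sizes
# make the process exit abnormally.
# Assumption: a Julia 0.4-era `julia` is on the PATH.
sweep_gesdd_sizes() {
    for n in "$@"; do
        if julia -e "LAPACK.gesdd!('N', zeros($n,$n))" >/dev/null 2>&1; then
            echo "$n ok"
        else
            # $? still holds the exit status of the failed julia process here
            echo "$n failed (exit status $?)"
        fi
    done
}
```

Called as `sweep_gesdd_sizes 127 128 129 130 131`, this prints one line per size. A process killed by SIGABRT (the signal behind "Abort trap: 6") is normally reported by the shell with status 134, that is 128 + 6.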

The exception is raised in the file:
/base/linalg/lapack.jl

In the function "gesdd!", there is a loop:

for i = 1:2

And the error is raised on the second iteration of the loop (i = 2).

The error is raised in the OpenBLAS library, and it looks like it occurs at a stack-pointer subtraction:

Process 6999 stopped
* thread #1: tid = 0x4be60, 0x000000030b15b357 libopenblas64_.dylib`dgesdd_64_ + 839, queue = 'com.apple.main-thread', stop reason = instruction step over
    frame #0: 0x000000030b15b357 libopenblas64_.dylib`dgesdd_64_ + 839
libopenblas64_.dylib`dgesdd_64_:
->  0x30b15b357 <+839>: jge    0x30b15c580               ; <+5488>
    0x30b15b35d <+845>: movq   %r10, 0x58(%rsp)
    0x30b15b362 <+850>: subq   $0x8, %rsp
    0x30b15b366 <+854>: movq   %rbp, %r8
(lldb) 
Process 6999 stopped
* thread #1: tid = 0x4be60, 0x000000030b15b35d libopenblas64_.dylib`dgesdd_64_ + 845, queue = 'com.apple.main-thread', stop reason = instruction step over
    frame #0: 0x000000030b15b35d libopenblas64_.dylib`dgesdd_64_ + 845
libopenblas64_.dylib`dgesdd_64_:
->  0x30b15b35d <+845>: movq   %r10, 0x58(%rsp)
    0x30b15b362 <+850>: subq   $0x8, %rsp
    0x30b15b366 <+854>: movq   %rbp, %r8
    0x30b15b369 <+857>: movq   %rbx, %rcx
(lldb) 
Process 6999 stopped
* thread #1: tid = 0x4be60, 0x000000030b15b362 libopenblas64_.dylib`dgesdd_64_ + 850, queue = 'com.apple.main-thread', stop reason = instruction step over
    frame #0: 0x000000030b15b362 libopenblas64_.dylib`dgesdd_64_ + 850
libopenblas64_.dylib`dgesdd_64_:
->  0x30b15b362 <+850>: subq   $0x8, %rsp <===== **** HERE ****
    0x30b15b366 <+854>: movq   %rbp, %r8
    0x30b15b369 <+857>: movq   %rbx, %rcx
    0x30b15b36c <+860>: leaq   0x532f5d(%rip), %rax      ; locu12.3427 + 10128
(lldb) 
Process 6999 stopped
* thread #1: tid = 0x4be60, 0x00007fff92e93002 libsystem_kernel.dylib`__pthread_kill + 10, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
    frame #0: 0x00007fff92e93002 libsystem_kernel.dylib`__pthread_kill + 10
libsystem_kernel.dylib`__pthread_kill:
->  0x7fff92e93002 <+10>: jae    0x7fff92e9300c            ; <+20>
    0x7fff92e93004 <+12>: movq   %rax, %rdi
    0x7fff92e93007 <+15>: jmp    0x7fff92e8dbdd            ; cerror_nocancel
    0x7fff92e9300c <+20>: retq   
(lldb) 

Hope this helps!!

Sorry if I made mistakes with the English language :-)

@tkelman tkelman added the upstream The issue is with an upstream dependency, e.g. LLVM label Dec 30, 2015
@tkelman
Contributor

tkelman commented Dec 30, 2015

Nice debugging work @gomiero, thanks and welcome! Looks like we should report this as an openblas bug, maybe with a C or Fortran reproduction case. Worth testing against the develop branch of openblas.

@andreasnoack
Member

This seems to be fixed on the develop branch (OpenMathLib/OpenBLAS@3857581)

@gomiero

gomiero commented Jan 4, 2016

Hello @tkelman and All,

Thanks for the welcome and a Happy New Year to All!

Sorry for the late answer, but until a few days ago, I had never programmed in FORTRAN, so it took a bit to learn it and get more information about this issue.

I looked for a message in the OS X system's log and I found the following report:

Process:               julia [5765]
Path:                  /Applications/Julia-0.4.2.app/Contents/Resources/julia/bin/julia
Identifier:            julia
Version:               ???
Code Type:             X86-64 (Native)
Parent Process:        bash [1546]
Responsible:           Terminal [1543]
User ID:               501

Date/Time:             2016-01-03 20:13:42.095 -0200
OS Version:            Mac OS X 10.11.2 (15C50)
Report Version:        11
Anonymous UUID:        C631FC11-76EE-F8D3-DDFA-789156F01DAC


Time Awake Since Boot: 9300 seconds

System Integrity Protection: enabled

Crashed Thread:        0  Dispatch queue: com.apple.main-thread

Exception Type:        EXC_CRASH (SIGABRT)
Exception Codes:       0x0000000000000000, 0x0000000000000000

Application Specific Information:
[5765] stack overflow

Thread 0 Crashed:: Dispatch queue: com.apple.main-thread
0   libsystem_kernel.dylib          0x00007fff8bc54002 __pthread_kill + 10
1   libsystem_pthread.dylib         0x00007fff94e3e5c5 pthread_kill + 90
2   libsystem_c.dylib               0x00007fff8e5ea787 __abort + 145
3   libsystem_c.dylib               0x00007fff8e5eb066 __stack_chk_fail + 200
4   libopenblas64_.dylib            0x000000030fa7ec2a dgemv_64_ + 298
5   libopenblas64_.dylib            0x0000000313442064 dlabrd_64_ + 1636
6   libopenblas64_.dylib            0x00000003134084da dgebrd_64_ + 1562
7   libopenblas64_.dylib            0x000000031341ee17 dgesdd_64_ + 7687
8   ???                             0x000000030f235fa8 0 + 13138878376
9   ???                             0x000000030f235959 0 + 13138876761
10  libjulia.dylib                  0x000000010aac2976 jl_apply_generic + 422 (gf.c:1691)
11  libjulia.dylib                  0x000000010ab27524 do_call + 244 (interpreter.c:55)
12  libjulia.dylib                  0x000000010ab25aaf eval + 1823 (interpreter.c:213)
13  libjulia.dylib                  0x000000010ab25931 eval + 1441 (interpreter.c:219)
14  libjulia.dylib                  0x000000010ab2702d eval_body + 349 (interpreter.c:592)
15  libjulia.dylib                  0x000000010ab273d1 jl_interpret_toplevel_thunk_with + 417 (interpreter.c:612)
16  libjulia.dylib                  0x000000010ab39caf jl_toplevel_eval_flex + 1343 (toplevel.c:542)
17  libjulia.dylib                  0x000000010aacaf85 jl_toplevel_eval_in + 789 (builtins.c:579)
18  ???                             0x000000030f22f73b 0 + 13138851643
19  ???                             0x000000030f22f317 0 + 13138850583
20  libjulia.dylib                  0x000000010aac2903 jl_apply_generic + 307 (gf.c:1684)
21  ???                             0x000000030f224332 0 + 13138805554
22  libjulia.dylib                  0x000000010ab2db08 start_task + 392 (task.c:247)

I could confirm the call stack with the valgrind tool when the exception is raised:

julia> LAPACK.gesdd!('N',a)
==16730== valgrind: Unrecognised instruction at address 0x101a2e7a7.
==16730==    at 0x101A2E7A7: __abort (in /usr/lib/system/libsystem_c.dylib)
==16730==    by 0x101A2F065: __stack_chk_fail (in /usr/lib/system/libsystem_c.dylib)
==16730==    by 0x10C3A7C29: dgemv_64_ (in /Applications/Julia-0.4.2.app/Contents/Resources/julia/lib/julia/libopenblas64_.dylib)
==16730==    by 0x10FD6B063: dlabrd_64_ (in /Applications/Julia-0.4.2.app/Contents/Resources/julia/lib/julia/libopenblas64_.dylib)
==16730==    by 0x11439EF7F: ???
==16730==    by 0x7FFF5FBFE867: ???
==16730==    by 0x110279977: locu12.3427 (in /Applications/Julia-0.4.2.app/Contents/Resources/julia/lib/julia/libopenblas64_.dylib)
==16730==    by 0x11439F387: ???
==16730==    by 0x7FFF5FBFE867: ???
==16730==    by 0xB: ???

I have found that the error 'Abort trap: 6' is raised by a call to the function __stack_chk_fail, located in the OS X library libsystem_c.

You can reproduce the behavior of this function with a simple C program:

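#include <stdio.h>
#include <stdlib.h>

/* Deliberately writes out of bounds when idx is outside 0..9. */
int main(int argc, char** argv) {
    int idx;
    int buffer[10];
    scanf("%d", &idx);
    buffer[idx] = 1;  /* no bounds check: idx = 10 overruns the buffer */
    printf("%d", buffer[idx]);
    return (EXIT_SUCCESS);
}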

If you enter the index 10 (one past the end of the allocated buffer), the program ends with the error 'Abort trap: 6'.

The same C program, compiled with gcc 5.1.0 on my Windows 8.1 system (64-bit, Intel(R) Core(TM) i7-5820K, Haswell-E/EP), only generates an ACCESS_VIOLATION error if you go beyond index 119 (that is, 120 or above).

On the Windows system, I used the following compiler:

Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=C:/TDM-GCC-64/bin/../libexec/gcc/x86_64-w64-mingw32/5.1.0/lto-wrapper.exe
Target: x86_64-w64-mingw32
Configured with: ../../../src/gcc-5.1.0/configure --build=x86_64-w64-mingw32 --enable-targets=all --enable-languages=ada,c,c++,fortran,lto,objc,obj-c++ --enable-libgomp --enable-lto --enable-graphite --enable-cxx-flags=-DWINPTHREAD_STATIC --disable-build-with-cxx --disable-build-poststage1-with-cxx --enable-libstdcxx-debug --enable-threads=posix --enable-version-specific-runtime-libs --enable-fully-dynamic-string --enable-libstdcxx-threads --enable-libstdcxx-time --with-gnu-ld --disable-werror --disable-nls --disable-win32-registry --prefix=/mingw64tdm --with-local-prefix=/mingw64tdm --with-pkgversion=tdm64-1 --with-bugurl=http://tdm-gcc.tdragon.net/bugs
Thread model: posix
gcc version 5.1.0 (tdm64-1)

Regarding the stack check failure, I found that the OpenBLAS library contains an if statement that decides whether the single-threaded (gemv) or the multithreaded (gemv_thread) version of the gemv function is called.

Since a fixed calculation based on the size of the matrix (m * n) decides which version is called, this may explain why the error occurs with the size 129x129.

In the first iteration of the loop in the Julia code (i = 1), the single-threaded function (gemv) is called; in the second iteration (i = 2), however, the multithreaded function (gemv_thread) is called.

I noticed that the __stack_chk_fail error occurs near the blas_memory_free function, so it may be that some of the threads are reaching outside the bounds of the allocated buffer at the end of the code flow.

I have no experience in debugging multithreaded code on OS X; however, I'll try to learn how to do this and to compile a debug version of OpenBLAS (-g flag), because I believe it will be easier to follow the code flow with LLDB and valgrind with the debug information enabled.

It would help if someone with more experience than I have could confirm whether this analysis is correct and accurate.

Please point out any mistakes in this analysis and, again, sorry if I made mistakes with the English language.

Best Regards

@tkelman
Contributor

tkelman commented Jan 4, 2016

Your analysis makes sense so far (and your English is absolutely fine, I wouldn't worry about that), though given @andreasnoack's report it seems this may already have been fixed in OpenBLAS, just not yet included in a release version. We could try identifying which upstream commit fixed the problem, or ask if they consider the current develop state to be stable enough to tag a release that we could try upgrading to.

@srwhite59

I discovered this issue also, but in using eigfact() and svd() instead of norm(): for example,
eigfact(zeros(129,129)) aborts. I also noticed that this only seems to affect odd sizes: I don't have a problem on even sizes, at least up to 800 or so. I originally discovered the problem at size 705.
See: https://groups.google.com/d/topic/julia-users/-kqCb6PEenw/discussion (I hope this link works!)

Is there anything one can do now to get around this problem?

@andreasnoack
Member

The issue is fixed upstream, so it will go away when we update the OpenBLAS version. Depending on your platform, it might be easy to upgrade your OpenBLAS. Do you compile your own Julia?

@srwhite59

No, I have just been downloading the precompiled versions. I was on 0.4.2; it looks like it is time to upgrade to 0.4.3, but I imagine that doesn't fix OpenBLAS or you would have mentioned it. The built-in "framework Accelerate" on Macs has been providing me with a good BLAS for a number of years, so I haven't installed OpenBLAS.

@ViralBShah
Member

Note that Accelerate does not have the fast LAPACK functions that OpenBLAS has.

@dlfivefifty
Contributor Author

Here's a workaround: recompile with

override USE_SYSTEM_BLAS = 1
override USE_SYSTEM_LAPACK = 0
override USE_BLAS64 = 0
override USE_QUIET = 0
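
One way to apply these overrides is sketched below. It assumes they belong in a `Make.user` file at the top of a Julia source checkout (the usual place for local build overrides, though the comment above does not name the file); the function name `write_blas_overrides` is hypothetical:

```shell
# Hypothetical helper: write the four overrides listed above into Make.user
# in the Julia source directory passed as $1.
# Assumption: Make.user at the top of the checkout is where overrides go.
# Note: this replaces any existing Make.user in that directory.
write_blas_overrides() {
    srcdir="$1"
    cat > "$srcdir/Make.user" <<'EOF'
override USE_SYSTEM_BLAS = 1
override USE_SYSTEM_LAPACK = 0
override USE_BLAS64 = 0
override USE_QUIET = 0
EOF
}
```

For example, `write_blas_overrides ~/Projects/julia` followed by `make` in that directory would rebuild with the system BLAS; dependencies that were already built may need to be rebuilt as well.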

@ViralBShah
Member

@xianyi Do you have a planned date (roughly) for the next release of openblas?

@tkelman
Contributor

tkelman commented Feb 15, 2016

You could also set the OPENBLAS_CORETYPE environment variable to something slightly older. What came right before Sandy Bridge, Nehalem maybe?
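
A minimal sketch of that workaround, assuming `julia` is on the PATH and that the bundled OpenBLAS is a DYNAMIC_ARCH build (as the versioninfo() output earlier in this thread shows) that accepts the given core type name. Whether Nehalem is the right choice is a guess, as noted above, and the wrapper name `julia_with_coretype` is hypothetical:

```shell
# Hypothetical wrapper: start Julia with OpenBLAS forced onto the kernels of
# the core type named in $1 instead of the auto-detected one.
# Remaining arguments are passed through to julia unchanged.
# Assumption: `julia` is on the PATH and its OpenBLAS honours OPENBLAS_CORETYPE.
julia_with_coretype() {
    coretype="$1"
    shift
    OPENBLAS_CORETYPE="$coretype" julia "$@"
}
```

For example, `julia_with_coretype Nehalem -e 'println(norm(zeros(129,129)))'` runs the failing call on the older kernels, while `julia_with_coretype Sandybridge` corresponds to the reproduction on a Haswell machine reported earlier in the thread.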

@dlfivefifty
Contributor Author

An issue with my fix is that it breaks special functions:

julia> airyai(5)
ERROR: error compiling airyai: error compiling airy: error compiling _airy: could not load library "libopenspecfun"
dlopen(libopenspecfun.dylib, 1): Library not loaded: /usr/local/Cellar/gfortran/4.8.2/gfortran/lib/libgfortran.3.dylib
  Referenced from: /Users/solver/Projects/julia/usr/lib//libopenspecfun.dylib
  Reason: image not found

@dlfivefifty
Contributor Author

Never mind my last comment, I just hadn't properly reset the dependencies.

Is there a reason that the OS X bundle can't be compiled with

override USE_SYSTEM_BLAS = 1

Not only does it fix the bug, eigs is roughly 20x faster for small matrices.

@andreasnoack
Member

There seems to be an issue with GEMV in OpenBLAS for small matrices on OS X only. See JuliaLang/LinearAlgebra.jl#72. Hopefully, it can be fixed soon. We have discussed using VecLib by default on OS X before, but I think it is easier to use the same BLAS on all platforms and I don't think VecLib is uniformly faster than OpenBLAS. For GEMM, OpenBLAS is usually as fast as any other BLAS.

@dlfivefifty
Contributor Author

Given that the issue JuliaLang/LinearAlgebra.jl#72 is over 2 years old, it doesn't look hopeful. But the bigger issue is that the current Julia bundle is essentially unusable for anything that requires svd.

@andreasnoack
Member

@tkelman
Contributor

tkelman commented Jun 4, 2016

Is this closed now that we're using openblas 0.2.18?

@srwhite59

I think this is closed.

Originally I would have julia bomb on norm(zeros(129,129)). This now has
no problems. Also, svd(rand(129,129)) gives no problems. Everything went
back to working an update or two ago.

9 participants