numexpr is not thread-safe #80

Closed
FrancescAlted opened this issue Jan 22, 2014 · 14 comments · Fixed by #200

Comments

@FrancescAlted
Contributor

From [email protected] on April 30, 2012 23:46:07

Calling numexpr.evaluate from two different threads at the same time causes a segfault, unless numexpr.set_num_threads(1) was called prior.

Looking at the code in vm_engine_iter_parallel, it looks to me like it uses a global variable to store thread information, and so chokes when two different threads call that function at the same time.

See attached file for a small code sample that reproduces this problem. And here's what I see when I actually run that script from the shell:

{{{

python -c 'import numexpr; numexpr.test()'
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Numexpr version: 2.0.1
NumPy version: 1.6.1
Python version: 2.7.2 (default, Dec 13 2011, 14:11:54)
[GCC 4.1.2 20080704 (Red Hat 4.1.2-50)]
Platform: linux2-x86_64
AMD/Intel CPU? True
VML available? False
Detected cores: 24
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

[... snip ...]

Ran 5103 tests in 3.805s

OK

python numexprCore.py
Starting threads
Waiting
Segmentation fault (core dumped)
}}}

Attachment: numexprCore.py

Original issue: http://code.google.com/p/numexpr/issues/detail?id=80
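For reference, here is a minimal sketch of the kind of reproduction described above (this is not the attached numexprCore.py): several threads calling numexpr.evaluate() concurrently, which can segfault on affected versions unless numexpr.set_num_threads(1) is called first.

```python
# Hypothetical reproduction sketch, not the attached numexprCore.py.
import threading
import numpy as np
import numexpr as ne

# ne.set_num_threads(1)  # reported workaround: with this, no crash

a = np.random.rand(1000000)
b = np.random.rand(1000000)

def worker():
    # Repeatedly evaluate an expression; concurrent calls from several
    # threads exercise the shared state in vm_engine_iter_parallel.
    for _ in range(100):
        ne.evaluate('a * b + 2.5', local_dict={'a': a, 'b': b})

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print('done without crashing')
```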

@jennolsen84
Contributor

What if we put a global mutex around numexpr.evaluate? This would make it thread-safe, though not thread-efficient. You are likely using all available cores to evaluate an expression anyway, so it makes sense to limit the parallelism of ne.evaluate() to one call at a time.

Of course, this is non-optimal, especially if the thread holding the lock gets context-switched out when its timeslice expires, but a single global mutex would be far better than the current state, which simply crashes if two threads call numexpr.evaluate.

Thoughts?

I can submit a PR, if people are OK with this solution

@FrancescAlted
Contributor Author

I am +1 on what you are proposing. The best solution would be to create a context structure holding all the variables a thread needs, create one of these per thread, and have each thread use its own structure. But I agree that your proposal is better than the current situation.

@FrancescAlted
Contributor Author

Maybe there is no need for you to provide the PR, as @pitrou already provided a fix in #199.
Can you give it a try?

@jennolsen84
Copy link
Contributor

Yes, I was just about to work on it! I came to GH to clone the repo, and bam, saw the commit. Will check it out.

@jennolsen84
Contributor

I think I am hitting this now. It happens in one of my test cases, and only if I run that test case after a few others. I am not 100% sure it is related to numexpr yet. Any advice on tracking this down would be helpful; I am thinking about building a Docker container and compiling Python from source with debug flags so I can run it under gdb.

From the command line:

Abort trap: 6

In PyCharm:

python(55033,0x1100d5000) malloc: *** error for object 0x101961200: pointer being freed was not allocated *** set a breakpoint in malloc_error_break to debug

@pitrou
Contributor

pitrou commented Jan 21, 2016

On 21/01/2016 10:21, jennolsen84 wrote:

> Also, any advice on tracking this down would be helpful. I am thinking about building a docker container and compiling python from scratch using the debug flags, so I can run it under gdb perhaps.

You can already run Python under gdb and get a full C backtrace.

You can also use the faulthandler module on Python 3.
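A quick sketch of both suggestions, assuming a hypothetical crashing script named crashy.py:

```python
# C-level backtrace: run the script under gdb (no special Python build
# required, although debug symbols make the trace more readable):
#
#   gdb --args python crashy.py
#   (gdb) run
#   (gdb) bt            # C backtrace of the crashing thread
#   (gdb) info threads  # see what the other threads are doing
#
# Python-level tracebacks on a fatal signal (Python 3):
import faulthandler
faulthandler.enable()   # or run as: python -X faulthandler crashy.py

# ... rest of the script that eventually crashes ...
```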

@jennolsen84
Contributor

File "/myenv/lib/python3.5/site-packages/numexpr/necompiler.py", line 767, in evaluate
    return compiled_ex(*arguments, **kwargs)
  File "utils.py", line 435, in sanitize
    ne.evaluate('where((abs(vals)<1E-7), 0, vals)', out=vals)

vals is a 2D ndarray.

Here is what the threads looked like:

  Id   Target Id         Frame 
  25   Thread 0x7fffc67fc700 (LWP 579) "python" sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85
  24   Thread 0x7fffc6ffd700 (LWP 578) "python" pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
* 23   Thread 0x7fffc77fe700 (LWP 577) "python" 0x00007ffff6c1f107 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
  22   Thread 0x7fffc7fff700 (LWP 576) "python" pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
  21   Thread 0x7fffe874a700 (LWP 575) "python" sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85
  20   Thread 0x7fffe8f4b700 (LWP 574) "python" sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85
  19   Thread 0x7fffe974c700 (LWP 573) "python" sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85
  18   Thread 0x7fffe9f4d700 (LWP 572) "python" sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85
  9    Thread 0x7fffde2a9700 (LWP 563) "python" sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85
  8    Thread 0x7fffdeaaa700 (LWP 562) "python" sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85
  7    Thread 0x7fffdf2eb700 (LWP 561) "python" sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85
  6    Thread 0x7fffdfaec700 (LWP 560) "python" sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85
  1    Thread 0x7ffff7fed700 (LWP 550) "python" 0x00007ffff6cd0623 in epoll_wait () at ../sysdeps/unix/syscall-template.S:81
(gdb) 

Top few frames of thread 23:

#0  0x00007ffff6c1f107 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x00007ffff6c204e8 in __GI_abort () at abort.c:89
#2  0x00007ffff6c5d204 in __libc_message (do_abort=do_abort@entry=1, fmt=fmt@entry=0x7ffff6d4ffe0 "*** Error in `%s': %s: 0x%s ***\n")
    at ../sysdeps/posix/libc_fatal.c:175
#3  0x00007ffff6c629de in malloc_printerr (action=1, str=0x7ffff6d50130 "double free or corruption (out)", ptr=<optimized out>) at malloc.c:4996
#4  0x00007ffff6c636e6 in _int_free (av=<optimized out>, p=<optimized out>, have_lock=0) at malloc.c:3840
#5  0x00007fffe9f948c5 in free_temps_space (params=..., mem=0x7fffa0115040) at numexpr/interpreter.cpp:591
#6  0x00007fffe9fb4e9d in run_interpreter (self=self@entry=0x7fffdd9d2468, iter=iter@entry=0x7fffdd6c7ba0, reduce_iter=reduce_iter@entry=0x0, 
    reduction_outer_loop=reduction_outer_loop@entry=false, need_output_buffering=need_output_buffering@entry=true, pc_error=pc_error@entry=0x7fffc77fb8f4)
    at numexpr/interpreter.cpp:839
#7  0x00007fffe9fc256b in NumExpr_run (self=0x7fffdd9d2468, args=<optimized out>, kwds=<optimized out>) at numexpr/interpreter.cpp:1405
#8  0x00007ffff79261ea in PyObject_Call (func=<numexpr.NumExpr at remote 0x7fffdd9d2468>, arg=<optimized out>, kw=<optimized out>) at Objects/abstract.c:2165
#9  0x00007ffff7a0c708 in ext_do_call (nk=<optimized out>, na=0, flags=<optimized out>, pp_stack=0x7fffc77fcfc8, 
    func=<numexpr.NumExpr at remote 0x7fffdd9d2468>) at Python/ceval.c:4983

I am still looking, but I wanted to share what I have found so far.

@jennolsen84
Contributor

Perhaps we should put the mutex on the Python side... right here:

if not isinstance(ex, (str, unicode)):

This will also prevent races on _numexpr_cache and _names_cache.
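A minimal sketch of that idea (written for Python 3, and not the exact code of the eventual PR): a single module-level lock in necompiler.py held for the whole evaluate() call, so the cache lookups and the VM are never entered concurrently.

```python
import threading

# Hypothetical module-level lock; the real evaluate() has more
# parameters and logic than shown here.
_evaluate_lock = threading.Lock()

def evaluate(ex, local_dict=None, global_dict=None, out=None, **kwargs):
    if not isinstance(ex, str):  # the line referenced above (Python 3 spelling)
        raise ValueError("must specify expression as a string")
    with _evaluate_lock:
        # Lookups in _names_cache / _numexpr_cache and the call into the
        # compiled NumExpr object would happen here, serialized across
        # Python threads.
        pass
```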

@jennolsen84
Contributor

That is what I did in the PR, and it seems to have fixed the crash.

@jennolsen84
Contributor

FYI, I am still getting some crashes; I am not 100% sure they are related. The call stack is in numexpr, and they did not happen before. Strangely, they happen somewhat randomly (not on every run), even though the same data is used every time.

Here is a stack trace:

Program received signal SIGFPE, Arithmetic exception.
[Switching to Thread 0x7fffebda9700 (LWP 310)]
npyiter_goto_iterindex (iter=iter@entry=0x7fffdd0f3878, iterindex=45056) at numpy/core/src/multiarray/nditer_api.c:1811
1811    numpy/core/src/multiarray/nditer_api.c: No such file or directory.
(gdb) bt
#0  npyiter_goto_iterindex (iter=iter@entry=0x7fffdd0f3878, iterindex=45056) at numpy/core/src/multiarray/nditer_api.c:1811
#1  0x00007ffff2365ad0 in NpyIter_Reset (errmsg=<optimized out>, iter=0x7fffdd0f3878) at numpy/core/src/multiarray/nditer_api.c:279
#2  NpyIter_ResetToIterIndexRange (iter=0x7fffdd0f3878, istart=<optimized out>, iend=<optimized out>, errmsg=<optimized out>)
    at numpy/core/src/multiarray/nditer_api.c:402
#3  0x00007fffec931c7c in th_worker (tidptr=<optimized out>) at numexpr/module.cpp:139
#4  0x00007ffff76a30a4 in start_thread (arg=0x7fffebda9700) at pthread_create.c:309
#5  0x00007ffff6cd004d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
(gdb) info thr
  Id   Target Id         Frame 
  13   Thread 0x7fffbeffd700 (LWP 322) "python" sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85
  12   Thread 0x7fffbf7fe700 (LWP 321) "python" pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
  11   Thread 0x7fffbffff700 (LWP 320) "python" 0x00007ffff7956587 in func_descr_get (func=0x7ffff3b7fd08, obj=0x7fffddbac470, type=0xebb3a8)
    at Objects/funcobject.c:649
  10   Thread 0x7fffddafa700 (LWP 319) "python" pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
  9    Thread 0x7fffde43b700 (LWP 318) "python" sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85
  8    Thread 0x7fffdec3c700 (LWP 317) "python" sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85
  7    Thread 0x7fffdf43d700 (LWP 316) "python" sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85
  6    Thread 0x7fffdfc3e700 (LWP 315) "python" sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85
  5    Thread 0x7fffeada7700 (LWP 312) "python" pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
  4    Thread 0x7fffeb5a8700 (LWP 311) "python" pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
* 3    Thread 0x7fffebda9700 (LWP 310) "python" npyiter_goto_iterindex (iter=iter@entry=0x7fffdd0f3878, iterindex=45056)
    at numpy/core/src/multiarray/nditer_api.c:1811
  2    Thread 0x7fffec5aa700 (LWP 309) "python" 0x00007fffec8fe44f in vm_engine_iter_task (iter=iter@entry=0x7fffddb1b218, 
    memsteps=memsteps@entry=0x7fffcc001f20, params=..., pc_error=pc_error@entry=0x7fffbf7fb8ec, errmsg=errmsg@entry=0x7fffbf7fb8f8)
    at numexpr/interp_body.cpp:237
  1    Thread 0x7ffff7fed700 (LWP 302) "python" 0x00007ffff6cd0623 in epoll_wait () at ../sysdeps/unix/syscall-template.S:81
(gdb)

@pitrou
Contributor

pitrou commented Jan 21, 2016

It's an FPU exception, and it seems to happen at iterindex /= shape in npyiter_goto_iterindex, so it is likely a division by zero... Perhaps you can try to print out the arguments' shapes from th_worker?

(The division by zero can occur because a race condition overwrites the shape's memory.)

@pitrou
Contributor

pitrou commented Jan 21, 2016

(actually, that would be the iterator's internal copy of the shape)

@jennolsen84
Contributor

Unfortunately, I am unable to reproduce the crash I saw earlier today; I tried a few times. I will try to write a torture test to see if I can reproduce it: it would call numexpr thousands of times from a thread pool, sometimes using the out= parameter and sometimes allocating memory.

If someone can think of a better test, please let me know. Either way, I think the PR should be accepted; it does help things.
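Something along the lines of the following sketch (hypothetical, not part of the test suite): many concurrent evaluate() calls from a thread pool, alternating between writing into a preallocated out= array and letting numexpr allocate the result.

```python
import numpy as np
import numexpr as ne
from concurrent.futures import ThreadPoolExecutor

vals = np.random.rand(10000)

def hammer(i):
    if i % 2:
        # Write into a preallocated array via out=.
        out = np.empty_like(vals)
        ne.evaluate('where(abs(vals) < 1E-7, 0, vals)', out=out)
    else:
        # Let numexpr allocate the result.
        out = ne.evaluate('vals * 2 + 1')
    return out.sum()

with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(hammer, range(5000)))
print('completed', len(results), 'evaluations')
```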

@jennolsen84
Contributor

I think the crash a few comments above happened after I botched the install. I am now unable to reproduce the FPU exception. So, after adding the fat lock, things are good.
