pd.Series.reindex is not thread safe. #25870

Closed
allComputableThings opened this issue Mar 25, 2019 · 6 comments
Labels
Duplicate Report (duplicate issue or pull request), Performance (memory or execution speed performance)

Comments

allComputableThings commented Mar 25, 2019

Code Sample, a copy-pastable example if possible

import traceback
import pandas as pd
import numpy as np
from multiprocessing.pool import ThreadPool

def f(arg):
    s,idx = arg
    try:
        # s.loc[idx]   # No problem
        s.reindex(idx) # Fails
    except Exception:
        traceback.print_exc()
    return None


def gen_args(n=10000):
    a = np.arange(0, 3000000)
    for i in range(n):  # range works on both Python 2 and Python 3
        if i%1000 == 0:
            # print "?",i
            s = pd.Series(data=a, index=a)
            f((s,a)) # <<< LOOK. IT WORKS HERE!!!
        yield s, np.arange(0,1000)

# for arg in gen_args():
#     f(arg)   # Works just fine

t = ThreadPool(4)
for result in t.imap(f, gen_args(), chunksize=1):
    pass

Problem description

pd.Series.reindex fails in a multi-threaded application.

This is a little surprising since I'm not asking for any writes.

The error also seems bogus: 'cannot reindex from a duplicate axis' ... the series does not have any duplicate axis and I was able to call s.reindex(idx) in the main thread before the same failed in the pool's thread.

  File "<ipython-input-8-4121235a46fa>", line 6, in f
    s.reindex(idx).values # Fails
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/series.py", line 2681, in reindex
    return super(Series, self).reindex(index=index, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/generic.py", line 3023, in reindex
    fill_value, copy).__finalize__(self)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/generic.py", line 3041, in _reindex_axes
    copy=copy, allow_dups=False)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/generic.py", line 3145, in _reindex_with_indexers
    copy=copy)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/internals.py", line 4139, in reindex_indexer
    self.axes[axis]._can_reindex(indexer)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/indexes/base.py", line 2944, in _can_reindex
    raise ValueError("cannot reindex from a duplicate axis")
ValueError: cannot reindex from a duplicate axis
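
The claim that the series has no duplicate labels can be checked directly in the main thread. The following is a minimal sketch, not part of the original report; it assumes a much smaller array than the 3,000,000 elements used above so that it runs quickly, and the variable names `index_is_unique` and `result` are illustrative.

```python
import numpy as np
import pandas as pd

# Assumption: a smaller array than in the report, to keep the check quick.
a = np.arange(0, 3000)
s = pd.Series(data=a, index=a)

# The index is built from np.arange, so every label is distinct.
index_is_unique = s.index.is_unique

# Single-threaded reindex onto a subset of the existing labels.
result = s.reindex(np.arange(0, 1000))

print(index_is_unique, len(result))
```

If this check passes while the threaded run still raises, the "duplicate axis" message is more likely a symptom of concurrent access than a property of the data.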

Expected Output

Program should output nothing.

Output of pd.show_versions()

```
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.15.candidate.1
python-bits: 64
OS: Linux
OS-release: 4.15.0-46-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.22.0
pytest: None
pip: 18.1
setuptools: 40.6.2
Cython: 0.29.1
numpy: 1.16.1
scipy: 1.2.0
pyarrow: None
xarray: None
IPython: 5.0.0
sphinx: None
patsy: 0.5.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: None
tables: None
numexpr: 2.6.8
feather: None
matplotlib: 2.1.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 0.9999999
sqlalchemy: 1.2.17
pymysql: None
psycopg2: 2.7.7 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
```
allComputableThings changed the title from "ps.Series.reindex is not thread safe." to "pd.Series.reindex is not thread safe." Mar 25, 2019

jreback (Contributor) commented Mar 25, 2019

Virtually no pandas functions are thread-safe, because .copy() is not; see #2728.
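
One common way to cope with a library call that is not thread-safe is to serialize it behind a lock. The sketch below is an illustration only and was not proposed in this thread; `_reindex_lock` and `locked_reindex` are hypothetical names, not pandas APIs, and the array is smaller than the one in the report.

```python
import threading
from multiprocessing.pool import ThreadPool

import numpy as np
import pandas as pd

# Hypothetical module-level lock shared by all worker threads (not a pandas API).
_reindex_lock = threading.Lock()


def locked_reindex(series, idx):
    """Call Series.reindex while holding the shared lock."""
    with _reindex_lock:
        return series.reindex(idx)


# Assumption: smaller data than in the report, to keep the example quick.
a = np.arange(0, 3000)
s = pd.Series(data=a, index=a)
idx = np.arange(0, 1000)

pool = ThreadPool(4)
try:
    # Each task reindexes the same shared series, one at a time.
    results = pool.map(lambda _: locked_reindex(s, idx), range(20))
finally:
    pool.close()
    pool.join()
```

The lock removes any parallelism from the reindex step, and it only helps if every piece of code that touches the shared series takes the same lock.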

jreback closed this as completed Mar 25, 2019
jreback added the Performance label (memory or execution speed performance) Mar 25, 2019
jreback added this to the No action milestone Mar 25, 2019

allComputableThings (Author) commented

Not very satisfactory - especially for non-mutating operations.

Since the bug you referenced is still open, could we keep this one open?

jreback (Contributor) commented Mar 25, 2019

> Since the bug you referenced is still open, could we keep this one open?

So we will have one more issue; what's the purpose? This is a duplicate issue.

jreback added the Duplicate Report label (duplicate issue or pull request) Mar 25, 2019

allComputableThings (Author) commented Mar 25, 2019 via email

jreback (Contributor) commented Mar 25, 2019

You are welcome to submit a PR if you want to provide a test.

This is a duplicate of an unfixed issue.

We have 2900 issues and would welcome help here. Sure, reporting bugs is great, but pandas is all-volunteer for anything else.
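
A test along the lines suggested above might start from a small harness that calls `reindex` from several threads and collects any exceptions. This is a rough sketch under stated assumptions, not code from the pandas test suite: `collect_reindex_errors` and `test_reindex_single_thread_baseline` are hypothetical names, the data is far smaller than in the report, and a race like this may not reproduce on every run or every pandas version.

```python
import threading

import numpy as np
import pandas as pd


def collect_reindex_errors(series, idx, n_threads=4, n_calls=50):
    """Call series.reindex(idx) from several threads; return raised exceptions."""
    errors = []  # list.append is thread-safe in CPython

    def worker():
        for _ in range(n_calls):
            try:
                series.reindex(idx)
            except Exception as exc:
                errors.append(exc)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return errors


def test_reindex_single_thread_baseline():
    # With one thread there is no concurrency, so no error is expected.
    a = np.arange(0, 3000)
    s = pd.Series(data=a, index=a)
    assert collect_reindex_errors(s, np.arange(0, 1000), n_threads=1) == []
```

A regression test for the reported behaviour would then assert that the multi-threaded call also returns an empty list, keeping in mind that a passing run does not prove the absence of a race.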

allComputableThings (Author) commented Mar 25, 2019 via email
