Skip to content

Fix for issue #21150, using a simple lock to prevent an issue with multiple threads accessing an Index #22006

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions ci/azure-windows-27.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,7 @@ dependencies:
- beautifulsoup4
- bottleneck
- dateutil
- futures
- gcsfs
- html5lib
- jinja2=2.8
Expand Down
1 change: 1 addition & 0 deletions ci/circle-27-compat.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,7 @@ channels:
dependencies:
- bottleneck=1.0.0
- cython=0.28.2
- futures
- jinja2=2.8
- numexpr=2.4.4 # we test that we correctly don't use an unsupported numexpr
- numpy=1.9.3
Expand Down
1 change: 1 addition & 0 deletions ci/travis-27-locale.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,7 @@ channels:
dependencies:
- bottleneck=1.0.0
- cython=0.28.2
- futures
- lxml
- matplotlib=1.4.3
- numpy=1.9.3
Expand Down
1 change: 1 addition & 0 deletions ci/travis-27.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,7 @@ dependencies:
- cython=0.28.2
- fastparquet
- feather-format
- futures
- gcsfs
- html5lib
- ipython
Expand Down
31 changes: 23 additions & 8 deletions pandas/_libs/index.pyx
Original file line number Diff line number Diff line change
Expand Up @@ -23,6 +23,18 @@ from pandas._libs import algos, hashtable as _hash
from pandas._libs.tslibs import Timestamp, Timedelta, period as periodlib
from pandas._libs.missing import checknull

# Python 2 vs Python 3
try:
from thread import allocate_lock as _thread_allocate_lock
except ImportError:
try:
from _thread import allocate_lock as _thread_allocate_lock
except ImportError:
try:
from dummy_thread import allocate_lock as _thread_allocate_lock
except ImportError:
from _dummy_thread import allocate_lock as _thread_allocate_lock

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to move this into pandas/compat/__init__.py and import from there?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not just import it from tslibs.strptime where it is already?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case, we should move that import code to pandas/compat so it can be utilized by both modules (in more semantically logical arrangement IMO).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you address this? We also recently fixed all our bare excepts.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TomAugspurger fixed the bare excepts (had a look at recent commits and saw one where many bare excepts were fixed, kudos!!!).

As for the part about moving to pandas/compat; if I recall correctly, I implemented that, and then reverted as that was asked in other comments (you can see in the comments in the issue).

Pushing a change with the fix for bare excepts. But let me know if I should update where it is imported.

Thanks

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds fine, I didn't read through all the comments :)

cdef int64_t iNaT = util.get_nat()


Expand Down Expand Up @@ -53,6 +65,9 @@ def get_value_box(arr: ndarray, loc: object) -> object:
# Don't populate hash tables in monotonic indexes larger than this
_SIZE_CUTOFF = 1000000

# Used in _ensure_mapping_populated to ensure is_unique behaves correctly
# in multi-threaded code, see gh-21150
_mapping_populated_lock = _thread_allocate_lock()

cdef class IndexEngine:

Expand Down Expand Up @@ -236,17 +251,17 @@ cdef class IndexEngine:

cdef inline _ensure_mapping_populated(self):
# this populates the mapping
# if its not already populated
# if it is not already populated
# also satisfies the need_unique_check

if not self.is_mapping_populated:

values = self._get_index_values()
self.mapping = self._make_hash_table(len(values))
self._call_map_locations(values)
with _mapping_populated_lock:
if not self.is_mapping_populated:
values = self._get_index_values()
self.mapping = self._make_hash_table(len(values))
self._call_map_locations(values)

if len(self.mapping) == len(values):
self.unique = 1
if len(self.mapping) == len(values):
self.unique = 1

self.need_unique_check = 0

Expand Down
17 changes: 17 additions & 0 deletions pandas/tests/indexes/test_base.py
Original file line number Diff line number Diff line change
Expand Up @@ -33,6 +33,8 @@
import pandas as pd
from pandas._libs.tslib import Timestamp

from concurrent.futures import ThreadPoolExecutor


class TestIndex(Base):
_holder = Index
Expand Down Expand Up @@ -2509,6 +2511,21 @@ def test_ensure_index_from_sequences(self, data, names, expected):
tm.assert_index_equal(result, expected)


class TestThreadSafety(object):

@pytest.mark.slow
@pytest.mark.parametrize('execution_number', range(7))
def test_isunique(self, execution_number):
"""This test is executed seven times, each time it uses a pool of
two threads to run a test that is very likely to fail without the
fix for gh-21150. It is not a deterministic test, as there is
still a chance it will pass even though the bug exists. But
with the fix, it must always work with not issues."""
x = pd.date_range('2001', '2020')
with ThreadPoolExecutor(2) as p:
assert all(p.map(lambda x: x.is_unique, [x] * 2))


@pytest.mark.parametrize('opname', ['eq', 'ne', 'le', 'lt', 'ge', 'gt',
'add', 'radd', 'sub', 'rsub',
'mul', 'rmul', 'truediv', 'rtruediv',
Expand Down