Skip to content

Conversation

MarcoGorelli
Copy link
Member

@MarcoGorelli MarcoGorelli commented Feb 18, 2020

test adapted (at this point kind of loosely) from #28296

@MarcoGorelli MarcoGorelli changed the title BUG: fix in categorical merges Wip BUG: fix in categorical merges Feb 18, 2020
@MarcoGorelli MarcoGorelli changed the title Wip BUG: fix in categorical merges BUG: fix in categorical merges Feb 19, 2020
@MarcoGorelli MarcoGorelli requested a review from WillAyd February 19, 2020 16:17
Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think OK generally but @TomAugspurger for any more thoughts.

Does this affect other index types like DTI as well?

@@ -781,6 +782,10 @@ def _delegate_method(self, name: str, *args, **kwargs):
return res
return CategoricalIndex(res, name=self.name)

def _wrap_joined_index(self, joined, other):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you annotate this? Should do that for any new development

@WillAyd WillAyd added Categorical Categorical Data Type Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Feb 20, 2020
@MarcoGorelli
Copy link
Member Author

MarcoGorelli commented Feb 20, 2020

Does this affect other index types like DTI as well?

It doesn't affect DTI, as DTI inherits from DateTimedeltaMixin:

class DatetimeIndex(DatetimeTimedeltaMixin):

and DateTimedeltaMixin already has its own _wrap_joined_index method defined

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm ex @jbrockmendel question

@MarcoGorelli
Copy link
Member Author

@jbrockmendel have added a test which covers the int16_t case:

In [1]: from pandas import DataFrame, CategoricalIndex                                                                                                                                                                                                        

In [2]: n_categories = 5                                                                                                                                                                                                                                      

In [3]: import numpy as np                                                                                                                                                                                                                                    

In [4]:     left_index = CategoricalIndex([0] + list(range(n_categories))) 
   ...:     df1 = DataFrame(range(n_categories + 1), columns=["value"], index=left_index) 
   ...:     df2 = DataFrame( 
   ...:         [[6]], 
   ...:         columns=["value"], 
   ...:         index=CategoricalIndex([0], categories=np.arange(n_categories)), 
   ...:     )                                                                                                                                                                                                                                                 

In [5]: df1.index._ndarray_values                                                                                                                                                                                                                             
Out[5]: array([0, 0, 1, 2, 3, 4], dtype=int8)

In [6]: n_categories = 128                                                                                                                                                                                                                                    

In [7]:     left_index = CategoricalIndex([0] + list(range(n_categories))) 
   ...:     df1 = DataFrame(range(n_categories + 1), columns=["value"], index=left_index) 
   ...:     df2 = DataFrame( 
   ...:         [[6]], 
   ...:         columns=["value"], 
   ...:         index=CategoricalIndex([0], categories=np.arange(n_categories)), 
   ...:     )                                                                                                                                                                                                                                                 

In [8]: df1.index._ndarray_values                                                                                                                                                                                                                             
Out[8]: 
array([  0,   0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,
        12,  13,  14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,
        25,  26,  27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,
        38,  39,  40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,
        51,  52,  53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,
        64,  65,  66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,
        77,  78,  79,  80,  81,  82,  83,  84,  85,  86,  87,  88,  89,
        90,  91,  92,  93,  94,  95,  96,  97,  98,  99, 100, 101, 102,
       103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115,
       116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127],
      dtype=int16)

The current test fails on master (for both cases, we get No matching signature found) but passes here:

(pandas-dev) m.gorelli@ws-1808:~/pandas$ git fetch upstream master
From https://github.com/pandas-dev/pandas
 * branch                master     -> FETCH_HEAD
(pandas-dev) m.gorelli@ws-1808:~/pandas$ git diff upstream/master --name-only
doc/source/whatsnew/v1.1.0.rst
pandas/_libs/join.pyx
pandas/core/indexes/category.py
pandas/tests/reshape/merge/test_merge.py
(pandas-dev) m.gorelli@ws-1808:~/pandas$ git checkout upstream/master -- pandas/core/indexes/category.py pandas/_libs/join.pyx
(pandas-dev) m.gorelli@ws-1808:~/pandas$ python setup.py build_ext --inplace -j 8
Compiling pandas/_libs/join.pyx because it changed.
[1/1] Cythonizing pandas/_libs/join.pyx
running build_ext
building 'pandas._libs.join' extension
gcc -pthread -B /home/SERILOCAL/m.gorelli/miniconda3/envs/pandas-dev/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -DNPY_NO_DEPRECATED_API=0 -Ipandas/_libs/src/klib -I/home/SERILOCAL/m.gorelli/miniconda3/envs/pandas-dev/lib/python3.7/site-packages/numpy/core/include -I/home/SERILOCAL/m.gorelli/miniconda3/envs/pandas-dev/include/python3.7m -c pandas/_libs/join.c -o build/temp.linux-x86_64-3.7/pandas/_libs/join.o
gcc -pthread -shared -B /home/SERILOCAL/m.gorelli/miniconda3/envs/pandas-dev/compiler_compat -L/home/SERILOCAL/m.gorelli/miniconda3/envs/pandas-dev/lib -Wl,-rpath=/home/SERILOCAL/m.gorelli/miniconda3/envs/pandas-dev/lib -Wl,--no-as-needed -Wl,--sysroot=/ build/temp.linux-x86_64-3.7/pandas/_libs/join.o -o /home/SERILOCAL/m.gorelli/pandas/pandas/_libs/join.cpython-37m-x86_64-linux-gnu.so
(pandas-dev) m.gorelli@ws-1808:~/pandas$ pytest pandas/tests/reshape/merge/test_merge.py::test_categorical_non_unique_monotonic
==================================================================================================================== test session starts =====================================================================================================================
platform linux -- Python 3.7.6, pytest-5.3.4, py-1.8.0, pluggy-0.13.0
rootdir: /home/SERILOCAL/m.gorelli/pandas, inifile: setup.cfg
plugins: forked-1.1.2, hypothesis-5.3.0, asyncio-0.10.0, cov-2.8.1, xdist-1.31.0
collected 2 items                                                                                                                                                                                                                                            

pandas/tests/reshape/merge/test_merge.py FF                                                                                                                                                                                                            [100%]

========================================================================================================================== FAILURES ==========================================================================================================================
__________________________________________________________________________________________________________ test_categorical_non_unique_monotonic[5] __________________________________________________________________________________________________________

n_categories = 5

    @pytest.mark.parametrize("n_categories", [5, 128])
    def test_categorical_non_unique_monotonic(n_categories):
        # GH 28189
        left_index = CategoricalIndex([0] + list(range(n_categories)))
        df1 = DataFrame(range(n_categories + 1), columns=["value"], index=left_index)
        df2 = DataFrame(
            [[6]],
            columns=["value"],
            index=CategoricalIndex([0], categories=np.arange(n_categories)),
        )
    
>       result = merge(df1, df2, how="left", left_index=True, right_index=True)

pandas/tests/reshape/merge/test_merge.py:2179: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
pandas/core/reshape/merge.py:88: in merge
    return op.get_result()
pandas/core/reshape/merge.py:641: in get_result
    join_index, left_indexer, right_indexer = self._get_join_info()
pandas/core/reshape/merge.py:848: in _get_join_info
    right_ax, how=self.how, return_indexers=True, sort=self.sort
pandas/core/indexes/base.py:3507: in join
    other, how=how, return_indexers=return_indexers
pandas/core/indexes/base.py:3816: in _join_monotonic
    join_index, lidx, ridx = self._left_indexer(sv, ov)
pandas/core/indexes/base.py:244: in _left_indexer
    return libjoin.left_join_indexer(left, right)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>   def left_join_indexer(ndarray[join_t] left, ndarray[join_t] right):
E   TypeError: No matching signature found

pandas/_libs/join.pyx:301: TypeError
_________________________________________________________________________________________________________ test_categorical_non_unique_monotonic[128] _________________________________________________________________________________________________________

n_categories = 128

    @pytest.mark.parametrize("n_categories", [5, 128])
    def test_categorical_non_unique_monotonic(n_categories):
        # GH 28189
        left_index = CategoricalIndex([0] + list(range(n_categories)))
        df1 = DataFrame(range(n_categories + 1), columns=["value"], index=left_index)
        df2 = DataFrame(
            [[6]],
            columns=["value"],
            index=CategoricalIndex([0], categories=np.arange(n_categories)),
        )
    
>       result = merge(df1, df2, how="left", left_index=True, right_index=True)

pandas/tests/reshape/merge/test_merge.py:2179: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
pandas/core/reshape/merge.py:88: in merge
    return op.get_result()
pandas/core/reshape/merge.py:641: in get_result
    join_index, left_indexer, right_indexer = self._get_join_info()
pandas/core/reshape/merge.py:848: in _get_join_info
    right_ax, how=self.how, return_indexers=True, sort=self.sort
pandas/core/indexes/base.py:3507: in join
    other, how=how, return_indexers=return_indexers
pandas/core/indexes/base.py:3816: in _join_monotonic
    join_index, lidx, ridx = self._left_indexer(sv, ov)
pandas/core/indexes/base.py:244: in _left_indexer
    return libjoin.left_join_indexer(left, right)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>   def left_join_indexer(ndarray[join_t] left, ndarray[join_t] right):
E   TypeError: No matching signature found

pandas/_libs/join.pyx:301: TypeError
===================================================================================================================== 2 failed in 1.68s ======================================================================================================================
(pandas-dev) m.gorelli@ws-1808:~/pandas$ git checkout HEAD -- pandas/core/indexes/category.py pandas/_libs/join.pyx
(pandas-dev) m.gorelli@ws-1808:~/pandas$ python setup.py build_ext --inplace -j 8
Compiling pandas/_libs/join.pyx because it changed.
[1/1] Cythonizing pandas/_libs/join.pyx
running build_ext
building 'pandas._libs.join' extension
gcc -pthread -B /home/SERILOCAL/m.gorelli/miniconda3/envs/pandas-dev/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -DNPY_NO_DEPRECATED_API=0 -Ipandas/_libs/src/klib -I/home/SERILOCAL/m.gorelli/miniconda3/envs/pandas-dev/lib/python3.7/site-packages/numpy/core/include -I/home/SERILOCAL/m.gorelli/miniconda3/envs/pandas-dev/include/python3.7m -c pandas/_libs/join.c -o build/temp.linux-x86_64-3.7/pandas/_libs/join.o
gcc -pthread -shared -B /home/SERILOCAL/m.gorelli/miniconda3/envs/pandas-dev/compiler_compat -L/home/SERILOCAL/m.gorelli/miniconda3/envs/pandas-dev/lib -Wl,-rpath=/home/SERILOCAL/m.gorelli/miniconda3/envs/pandas-dev/lib -Wl,--no-as-needed -Wl,--sysroot=/ build/temp.linux-x86_64-3.7/pandas/_libs/join.o -o /home/SERILOCAL/m.gorelli/pandas/pandas/_libs/join.cpython-37m-x86_64-linux-gnu.so
(pandas-dev) m.gorelli@ws-1808:~/pandas$ pytest pandas/tests/reshape/merge/test_merge.py::test_categorical_non_unique_monotonic
==================================================================================================================== test session starts =====================================================================================================================
platform linux -- Python 3.7.6, pytest-5.3.4, py-1.8.0, pluggy-0.13.0
rootdir: /home/SERILOCAL/m.gorelli/pandas, inifile: setup.cfg
plugins: forked-1.1.2, hypothesis-5.3.0, asyncio-0.10.0, cov-2.8.1, xdist-1.31.0
collected 2 items                                                                                                                                                                                                                                            

pandas/tests/reshape/merge/test_merge.py ..                                                                                                                                                                                                            [100%]

===================================================================================================================== 2 passed in 0.31s ======================================================================================================================
(pandas-dev) m.gorelli@ws-1808:~/pandas$ 

@jbrockmendel
Copy link
Member

comment requested, otherwise LGTM

@WillAyd WillAyd added this to the 1.1 milestone Feb 27, 2020
@WillAyd WillAyd merged commit ea1d8fa into pandas-dev:master Feb 27, 2020
@WillAyd
Copy link
Member

WillAyd commented Feb 27, 2020

Thanks @MarcoGorelli !

roberthdevries pushed a commit to roberthdevries/pandas that referenced this pull request Mar 2, 2020
@MarcoGorelli MarcoGorelli deleted the 28189 branch October 10, 2020 14:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Categorical Categorical Data Type Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Merge on CategoricalIndex fails if left_index=True & right_index=True, but not if on={index}
3 participants