Skip to content

_libs.hashtable.ismember incorrect with tuples containing nan #41836

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
jbrockmendel opened this issue Jun 6, 2021 · 4 comments · Fixed by #41952
Closed

_libs.hashtable.ismember incorrect with tuples containing nan #41836

jbrockmendel opened this issue Jun 6, 2021 · 4 comments · Fixed by #41952
Labels
Milestone

Comments

@jbrockmendel
Copy link
Member

It looks like the equality check being done on tuples is checking NA values for identity, so separately instantiated float("nan") objects aren't considered matching.

from pandas.core.algorithms import isin

values = [("a", float("nan")), ("b", 1)]
comps = [("a", float("nan"))]
alt = ("a", values[0][1])

>>> isin(values, comps)
array([False, False])

>>> isin(values, [alt])
array([ True, False])

cc @realead

@jbrockmendel jbrockmendel added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 6, 2021
@realead
Copy link
Contributor

realead commented Jun 6, 2021

That is true. The reason is that

float("nan") == float("nan")

is False in Python.

We overwrite default __eq__ for floats:

if (result == 0) { // still could be two NaNs
return PyFloat_CheckExact(a) &&
PyFloat_CheckExact(b) &&
Py_IS_NAN(PyFloat_AS_DOUBLE(a)) &&
Py_IS_NAN(PyFloat_AS_DOUBLE(b));
}
, but not other types.

To be consistent, we should do the same for other built-in types (complex, tuple, frozenset.... are there more?). I would not expect too much impact on performace.

I don't see an easy way to ensure the correct behavior for non-built-in classes like

class A:
    def __init__(self, a):
        self.a=a
    def __eq__(self, other):
        return self.a == other.a

and would say it is responsibility of the class' authors to ensure the desired behavior.

None and pd.NaT seems to have no issue, because they are singletones and Python short-cuts comparison of an object with itself.

@realead
Copy link
Contributor

realead commented Jun 9, 2021

@jbrockmendel I'have looked into this problem. The fix is relatively simple for complex and tuple (here is an prototype: https://github.com/realead/pandas/tree/fix_deep_nans).

However, the situation is quite different for frozenset, we need to replace the following comparison in the implementation of the setobject:

static setentry *
set_lookkey(PySetObject *so, PyObject *key, Py_hash_t hash)
{
       ...
       cmp = PyObject_RichCompareBool(startkey, key, Py_EQ);
       ....
}

but there are so many implemenation details in this function, that it would be really unwise to duplicate the code.

A sound fix would be to introduce a Py_EQ_FOR_HASH_TABLE to CPython's richcompare (which would be different from PY_EQ for floats as it would say that float("nan") is equal float("nan")). But this obviuosly is out of reach.

So my proposal would be to add special handling of complex and tuple objects and to leave the situation with frozenset as it is.

@jbrockmendel
Copy link
Member Author

Thanks for looking into this. The only case that is a showstopper for me ATM is the tuple, so i think punting on frozenset is totally reasonable.

@realead
Copy link
Contributor

realead commented Jun 10, 2021

Because of python/cpython@a07da09 we need to change the hash-function as well, otherwise different nans will have different hashes for Py3.10 and later.

realead added a commit to realead/pandas that referenced this issue Jun 11, 2021
realead added a commit to realead/pandas that referenced this issue Jun 13, 2021
@jreback jreback added this to the 1.3 milestone Jun 15, 2021
realead added a commit to realead/pandas that referenced this issue Jun 16, 2021
@simonjayhawkins simonjayhawkins removed the Needs Triage Issue that has not been reviewed by a pandas team member label Jun 17, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants