ENH: add a keyword to the eq method to consider NAs equal #38063

Open
Karl-Wiese opened this issue Nov 25, 2020 · 4 comments
Labels
Enhancement; Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate); NA - MaskedArrays (Related to pd.NA and nullable extension arrays)

Comments

Karl-Wiese commented Nov 25, 2020

Code Sample

This works as expected. Comparing nan with nan evaluates to False.

>>> import pandas as pd
>>> import numpy as np
>>> s1 = pd.Series(np.nan)
>>> s1.eq(s1)
0    False
dtype: bool

However, this does not work as expected. Comparing nan with nan evaluates to <NA>.

>>> import pandas as pd
>>> import numpy as np
>>> s1 = pd.Series(np.nan).astype(pd.StringDtype())
>>> s1.eq(s1)
0    <NA>
dtype: boolean

Problem description

As in other programming languages, I would expect an equality comparison between nan values to evaluate to False. In the first case, this works as expected; in the second, it does not. I read the docs for the nullable Boolean data type, which implements Kleene logic, so the second case is expected behavior, too. The difference is even stated explicitly in the docs. The behavior is just not consistent between the two cases.
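For readers who want to see the two semantics side by side at the scalar level, here is a minimal sketch. The variable names are only illustrative; the behavior shown is that of `np.nan` and `pd.NA` as described in the pandas missing-data docs.

```python
import numpy as np
import pandas as pd

# NaN follows IEEE 754: it never compares equal, not even to itself.
nan_eq = np.nan == np.nan          # False

# pd.NA propagates through comparisons instead of returning a bool.
na_eq = pd.NA == pd.NA             # <NA>

# Kleene logic: the result is NA only when it actually depends on the unknown value.
na_or_true = pd.NA | True          # True
na_and_false = pd.NA & False       # False
na_and_true = pd.NA & True         # <NA>

print(nan_eq, na_eq, na_or_true, na_and_false, na_and_true)
```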

I searched for similar bug reports and was not sure if they are directly related. For reference please look here:

Expected behavior

I would expect the two cases to behave consistently. Which way they behave is not so important to me.

Output of pd.show_versions()

Details
INSTALLED VERSIONS  
------------------  
commit           : 67a3d4241ab84419856b84fc3ebc9abcbe66c6b3
python           : 3.8.5.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 20.1.0
Version          : Darwin Kernel Version 20.1.0: Sat Oct 31 00:07:11 PDT 2020; root:xnu-7195.50.7~2/RELEASE_X86_64
machine          : x86_64
processor        : i386
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : None.UTF-8
pandas           : 1.1.4
numpy            : 1.19.4
pytz             : 2020.4
dateutil         : 2.8.1
pip              : 20.2.4
setuptools       : 50.3.1.post20201107
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : None
IPython          : None
pandas_datareader: None
bs4              : None
bottleneck       : None
fsspec           : None
fastparquet      : None
gcsfs            : None
matplotlib       : None
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pytables         : None
pyxlsb           : None
s3fs             : None
scipy            : None
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
numba            : None

jorisvandenbossche (Member) commented

@Karl-Wiese Given that you read the docs on nullable dtypes, and say yourself that "the second case is expected behaviour, too", I am not sure what you actually expect / are looking for.

When introducing the nullable dtypes, we considered which behaviour we wanted for equality of pd.NA, and decided that we wanted to deviate from the existing behaviour with np.nan and return pd.NA instead of False (so that pd.NA propagates in comparison operations, as it does in arithmetic operations).
Now, we only did this for the new dtypes. Your first example is still using the current default dtype for string data (object dtype) with np.nan for missing values. This is long-standing behaviour, which we can't just change. So therefore the new data types (with different behaviour) are opt-in for now (and the plan is they will become the default at some point).

See also the docs at https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html#experimental-na-scalar-to-denote-missing-values (the docs about this are quite probably a bit scarce and scattered, that's something we need to improve)
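A short sketch of the propagation described above, assuming a recent pandas with the nullable `Int64` dtype. The Series names are made up for illustration and are not from the thread.

```python
import numpy as np
import pandas as pd

# Nullable integer dtype: NA propagates through arithmetic and comparison alike.
ints = pd.Series([1, None], dtype="Int64")
plus_one = ints + 1        # second element stays <NA>
is_one = ints.eq(1)        # second element is <NA>, result dtype is "boolean"

# Object dtype with np.nan keeps the long-standing behaviour: NaN compares as False.
objs = pd.Series(["a", np.nan], dtype=object)
objs_eq = objs.eq(objs)    # plain NumPy bool dtype, no missing values

print(plus_one, is_one, objs_eq, sep="\n")
```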

Karl-Wiese (Author) commented

@jorisvandenbossche I expect that the behavior is consistent. Regardless if I use np.nan or pd.NA. I understand that pd.NA is in an experimental state. I would say this issue is an enhancement request rather than a bug report. So I look forward to the day when "they will become the default at some point" and we can enjoy consistent behavior.

If others come across this "problem", you can call .fillna() on the result and pass True or False depending on your needs.

The original problem is that I want to compare two DataFrames cell by cell. And I want to accept that nan values are equal. This now needs the following:
(left.eq(right) | (left.isna() & right.isna())).fillna(False)
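The one-liner above can be wrapped in a small helper. This is only a sketch: the function name `eq_with_na_equal` and the sample frames are hypothetical, and pandas itself does not provide such a function or keyword.

```python
import numpy as np
import pandas as pd


def eq_with_na_equal(left, right):
    """Element-wise equality that treats two missing values as equal (hypothetical helper)."""
    both_missing = left.isna() & right.isna()
    # fillna(False) turns any remaining <NA> (one side missing) into False.
    return (left.eq(right) | both_missing).fillna(False)


left = pd.DataFrame({
    "x": pd.array(["a", None, "c"], dtype=pd.StringDtype()),
    "y": [1.0, np.nan, 3.0],
})
right = pd.DataFrame({
    "x": pd.array(["a", None, "z"], dtype=pd.StringDtype()),
    "y": [1.0, np.nan, np.nan],
})

cell_equal = eq_with_na_equal(left, right)
# No NAs remain after fillna, so casting to plain bool is safe before reducing per row.
rows_equal = cell_equal.astype(bool).all(axis=1)
print(cell_equal)
print(rows_equal)
```

The same expression works for both the np.nan-based and the pd.NA-based columns, which is why it is convenient when a frame mixes the two.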

jorisvandenbossche (Member) commented

> I expect that the behavior is consistent. Regardless if I use np.nan or pd.NA.

Well, that's inherently impossible, because we have chosen different behaviour for pd.NA compared to np.nan.

> The original problem is that I want to compare two DataFrames cell by cell. And I want to accept that nan values are equal. This now needs the following:
> (left.eq(right) | (left.isna() & right.isna())).fillna(False)

But to be clear, you already needed (left.eq(right) | (left.isna() & right.isna())) before as well, right? Only the fillna(False) is new to get the exact same result as before?

Note that the fillna(False) might not be needed depending on what operation you do afterwards with this result. For example if it is used to filter, the NAs will be interpreted as False.
But in general, yes, with the nullable dtypes you might need an explicit extra fillna(True/False) call.
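As a small illustration of the filtering case, here is a sketch with made-up data. It relies on boolean indexing treating NA in a nullable boolean mask as False, which the pandas docs describe for version 1.0.2 and later.

```python
import pandas as pd

values = pd.Series([10, 20, 30])
# A nullable boolean mask with one unknown entry.
mask = pd.Series([True, pd.NA, False], dtype="boolean")

# The NA entry is treated as False, so only the first row is kept.
filtered = values[mask]
# An explicit fillna(False) gives the same selection.
filtered_explicit = values[mask.fillna(False)]
print(filtered)
```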

BTW, you might also be interested in the equals method, which considers NaN / NAs in the same location as equal (but returns a single bool, it's not an element-wise method). Maybe we could add a keyword to the eq method to consider NAs equal.
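To make the difference between the two methods concrete, a brief sketch (the Series here is invented for illustration):

```python
import pandas as pd

strings = pd.Series(["a", pd.NA], dtype=pd.StringDtype())

# equals: one bool for the whole object; missing values in the same position count as equal.
same_overall = strings.equals(strings.copy())

# eq: element-wise; missing values yield <NA> for nullable dtypes.
per_element = strings.eq(strings.copy())

print(same_overall)
print(per_element)
```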

Karl-Wiese (Author) commented

> Well, that's inherently impossible, because we have chosen different behaviour for pd.NA compared to np.nan.

It was worth a try ;)

> But to be clear, you already needed (left.eq(right) | (left.isna() & right.isna())) before as well, right? Only the fillna(False) is new to get the exact same result as before?

You are absolutely right!

Thanks for your hints. For example, I apply a .all(axis=1) later on, so I think I need the explicit fillna(). And equals unfortunately doesn't do the job, because I need an element-wise comparison. At the moment, I just have a little function in a utils.py that does the job. Passing a keyword to eq would indeed be a nice solution. Would that handle np.nan and pd.NA the same way?
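One way to see why an explicit fillna() can matter before a reduction is the following sketch. It uses a single nullable boolean Series rather than the row-wise DataFrame case from the thread, which is a simplification made here for illustration.

```python
import pandas as pd

flags = pd.Series([True, pd.NA], dtype="boolean")

# By default the reduction skips NA, so the unknown entry is silently ignored.
skipping = flags.all()

# With skipna=False, Kleene logic applies and the result is itself unknown.
strict = flags.all(skipna=False)

# Filling first makes the intended interpretation explicit.
filled = flags.fillna(False).all()

print(skipping, strict, filled)
```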

jbrockmendel added the Missing-data and NA - MaskedArrays labels Mar 24, 2021
mroeschke changed the title from "<NA> and NaN evaluate differently in equality comparison" to "ENH: add a keyword to the eq method to consider NAs equal" Aug 14, 2021