ENH: add a keyword to the eq method to consider NAs equal #38063

Karl-Wiese · 2020-11-25T12:37:42Z

Code Sample

This works as expected. Comparing nan with nan evaluates to False.

>>> import pandas as pd
>>> import numpy as np
>>> s1 = pd.Series(np.nan)
>>> s1.eq(s1)
0    False
dtype: bool

However, this does not work as expected. Comparing nan with nan evaluates to <NA>.

>>> import pandas as pd
>>> import numpy as np
>>> s1 = pd.Series(np.nan).astype(pd.StringDtype())
>>> s1.eq(s1)
0    <NA>
dtype: boolean

Problem description

As in other programming languages, I would expect an equality comparison between two nan values to evaluate to False. In the first case, this works as expected; in the second, it does not. I read the docs for the nullable Boolean data type, which implements Kleene logic, so the second case is expected behavior, too. The docs even state this explicitly. The two cases are just not consistent with each other.

I searched for similar bug reports and was not sure if they are directly related. For reference please look here:

Expected behavior

I would expect the two cases to behave consistently. Which way they behave is not so important to me.

Output of pd.show_versions()

INSTALLED VERSIONS  
------------------  
commit           : 67a3d4241ab84419856b84fc3ebc9abcbe66c6b3
python           : 3.8.5.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 20.1.0
Version          : Darwin Kernel Version 20.1.0: Sat Oct 31 00:07:11 PDT 2020; root:xnu-7195.50.7~2/RELEASE_X86_64
machine          : x86_64
processor        : i386
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : None.UTF-8
pandas           : 1.1.4
numpy            : 1.19.4
pytz             : 2020.4
dateutil         : 2.8.1
pip              : 20.2.4
setuptools       : 50.3.1.post20201107
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : None
IPython          : None
pandas_datareader: None
bs4              : None
bottleneck       : None
fsspec           : None
fastparquet      : None
gcsfs            : None
matplotlib       : None
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pytables         : None
pyxlsb           : None
s3fs             : None
scipy            : None
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
numba            : None

jorisvandenbossche · 2020-11-25T14:09:01Z

@Karl-Wiese Given that you read the docs on nullable dtypes, and say yourself that "the second case is expected behaviour, too", I am not sure what you actually expect / are looking for.

When introducing the nullable dtypes, we considered which behaviour we wanted for equality of pd.NA, and decided to deviate from the existing behaviour with np.nan and return pd.NA instead of False (so that pd.NA propagates in comparison operations, as it does in arithmetic operations).
Now, we only did this for the new dtypes. Your first example is still using the current default float dtype, with np.nan for missing values. This is long-standing behaviour, which we can't just change. Therefore the new data types (with different behaviour) are opt-in for now (and the plan is that they will become the default at some point).

See also the docs at https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html#experimental-na-scalar-to-denote-missing-values (the docs about this are quite probably a bit scarce and scattered, that's something we need to improve)
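
The difference described in this comment can be reproduced with a few lines. This is only a minimal sketch of the behaviour discussed in the thread, assuming a pandas version with nullable dtypes (1.0 or later); the variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

# np.nan follows IEEE 754 semantics: it never compares equal to itself.
nan_equal = np.nan == np.nan            # False

# pd.NA propagates through comparisons, just as it does through arithmetic.
na_equal = pd.NA == pd.NA               # <NA>
na_sum = pd.NA + 1                      # <NA>

# The same difference shows up element-wise on Series.
float_s = pd.Series([np.nan, 1.0])
string_s = pd.Series([pd.NA, "a"], dtype=pd.StringDtype())

float_result = float_s.eq(float_s)      # missing position compares as False
string_result = string_s.eq(string_s)   # missing position compares as <NA>
```

In the float case the result has no missing values at all, while in the nullable case the missing value is carried through to the comparison result.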

Karl-Wiese · 2020-11-25T15:11:23Z

@jorisvandenbossche I expect that the behavior is consistent. Regardless if I use np.nan or pd.NA. I understand that pd.NA is in an experimental state. I would say this issue is an enhancement request rather than a bug report. So I am looking forward to the point when "they will become the default" and we can enjoy consistent behavior.

If others come across this "problem" you can use .fillna() and hand over True or False depending on your needs.

The original problem is that I want to compare two DataFrames cell by cell. And I want to accept that nan values are equal. This now needs the following:
(left.eq(right) | (left.isna() & right.isna())).fillna(False)
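
For anyone who wants to reuse the expression above, here is a minimal sketch that wraps it in a small helper. The function name `eq_na_equal` and the sample frames are hypothetical and not part of pandas; the body is just the one-liner from this comment.

```python
import numpy as np
import pandas as pd


def eq_na_equal(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    """Element-wise equality that treats two missing values in the same cell as equal.

    Hypothetical helper, not a pandas API.
    """
    both_missing = left.isna() & right.isna()
    # For nullable dtypes, eq() yields <NA> where either side is missing;
    # fillna(False) turns the remaining <NA> (missing on one side only) into False.
    return (left.eq(right) | both_missing).fillna(False)


left = pd.DataFrame(
    {
        "x": [1.0, np.nan, 3.0],
        "s": pd.array(["a", pd.NA, pd.NA], dtype=pd.StringDtype()),
    }
)
right = pd.DataFrame(
    {
        "x": [1.0, np.nan, 4.0],
        "s": pd.array(["a", pd.NA, "c"], dtype=pd.StringDtype()),
    }
)

result = eq_na_equal(left, right)
```

In this sample the second row is missing on both sides in both columns and compares as equal, while the third row differs in both columns and compares as unequal.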

jorisvandenbossche · 2020-11-25T15:25:40Z

> I expect that the behavior is consistent. Regardless if I use np.nan or pd.NA.

Well, that's inherently impossible, because we have chosen different behaviour for pd.NA compared to np.nan.

> The original problem is that I want to compare two DataFrames cell by cell. And I want to accept that nan values are equal. This now needs the following:
> (left.eq(right) | (left.isna() & right.isna())).fillna(False)

But to be clear, you already needed (left.eq(right) | (left.isna() & right.isna())) before as well, right? Only the fillna(False) is new to get the exact same result as before?

Note that the fillna(False) might not be needed depending on what operation you do afterwards with this result. For example if it is used to filter, the NAs will be interpreted as False.
But in general, yes, with the nullable dtypes you might need an explicit extra fillna(True/False) call.
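
As a small illustration of the filtering case, the sketch below uses a nullable boolean mask directly as an indexer. The data is made up, and it assumes a pandas version in which boolean indexing accepts masks containing NA.

```python
import pandas as pd

values = pd.Series([10, 20, 30])
# Nullable boolean mask with a missing entry in the middle.
mask = pd.Series([True, pd.NA, False], dtype="boolean")

# When the mask is used to filter, the <NA> entry is treated as False.
filtered = values[mask]

# The explicit form gives the same rows.
explicit = values[mask.fillna(False)]
```

Both `filtered` and `explicit` keep only the first element here, so for pure filtering the extra `fillna(False)` is optional.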

BTW, you might also be interested in the equals method, which considers NaN / NAs in the same location as equal (but returns a single bool, it's not an element-wise method). Maybe we could add a keyword to the eq method to consider NAs equal.
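
To make the difference between `equals` and `eq` concrete, here is a minimal sketch with made-up data. It only uses the existing `Series.equals` and `Series.eq` methods and does not assume any new keyword on `eq`.

```python
import numpy as np
import pandas as pd

a = pd.Series([1.0, np.nan])
b = pd.Series([1.0, np.nan])

# equals() returns one bool for the whole object and treats NaN in the
# same location as equal.
whole = a.equals(b)

# eq() is element-wise; the NaN position compares as False.
elementwise = a.eq(b)

# The same holds for a nullable dtype, where eq() yields <NA> instead.
na_a = pd.Series(["a", pd.NA], dtype=pd.StringDtype())
na_b = pd.Series(["a", pd.NA], dtype=pd.StringDtype())
na_whole = na_a.equals(na_b)
na_elementwise = na_a.eq(na_b)
```

So `equals` answers "are these two objects the same?", while an element-wise result that counts missing values as equal still needs the combined `isna()` expression.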

Karl-Wiese · 2020-11-25T16:38:50Z

> Well, that's inherently impossible, because we have chosen different behaviour for pd.NA compared to np.nan.

It was worth a try ;)

> But to be clear, you already needed (left.eq(right) | (left.isna() & right.isna())) before as well, right? Only the fillna(False) is new to get the exact same result as before?

You are absolutely right!

Thanks for your hints. For example, I apply a .all(axis=1) later on, so I think I need the explicit fillna(). And equals unfortunately doesn't do the job, because I need an element-wise comparison. At the moment, I just have a little function in a utils.py that does the job. Passing a keyword to eq would indeed be a nice solution. Would that handle np.nan and pd.NA the same way?

jbrockmendel added the Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate) and NA - MaskedArrays (Related to pd.NA and nullable extension arrays) labels Mar 24, 2021

chukarsten mentioned this issue Aug 13, 2021

Patch pre-Release v0.30.1 alteryx/evalml#2626

Merged

mroeschke changed the title ~~<NA> and NaN evaluate differently in equality comparison~~ ENH: add a keyword to the eq method to consider NAs equal Aug 14, 2021

mroeschke added the Enhancement label Aug 14, 2021
