Skip to content

DEPR: DataFrame.combine propagates nan values #10734

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
aktiur opened this issue Aug 3, 2015 · 18 comments
Open

DEPR: DataFrame.combine propagates nan values #10734

aktiur opened this issue Aug 3, 2015 · 18 comments
Labels
Bug Docs Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate

Comments

@aktiur
Copy link

aktiur commented Aug 3, 2015

The documentation for DataFrame.combine claims that the method "do[es] not propagate NaN values, so if for a (column, time) one frame is missing a value, it will default to the other frame’s value". However, this does not seem to correspond to the actual behaviour of DataFrame.combine.

Sample code:

>>> import pandas as pd
>>> from operator import add
>>> a = pd.DataFrame({
...        'a': pd.Series([1, 3], index=[0, 1]),
...        'b': pd.Series([2, 3], index=[1, 2]),
...    })
>>> b = pd.DataFrame({
...         'a': pd.Series([3, 5], index=[1, 2]),
...         'c': pd.Series([1, 2, 3], index=[2, 3, 4])
...     })
>>> a.combine(b, add)
    a   b   c
0 NaN NaN NaN
1   6 NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN

In many cases (as in this one), it may be remedied by using the fill_value parameter of DataFrame.combine. However, it might be a problem when there is no acceptable neutral element for the given combining function.

>>> pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.4.3.final.0
python-bits: 64
OS: Windows
OS-release: 8
machine: AMD64
processor: Intel64 Family 6 Model 69 Stepping 1, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.16.2
nose: 1.3.7
Cython: 0.22
numpy: 1.9.2
scipy: 0.16.0
statsmodels: 0.6.1
IPython: 3.1.0
sphinx: 1.2.3
patsy: 0.3.0
dateutil: 2.4.2
pytz: 2015.4
bottleneck: None
tables: 3.1.1
numexpr: 2.3.1
matplotlib: 1.4.3
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: None
xlsxwriter: 0.6.7
lxml: 3.4.2
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.6
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext lo64)
@aktiur aktiur changed the title DOC: DataFrame.combine propagates nan values DataFrame.combine propagates nan values Aug 3, 2015
@jreback
Copy link
Contributor

jreback commented Aug 3, 2015

you can simply do this:

In [13]: a.add(b)
Out[13]: 
    a   b   c
0 NaN NaN NaN
1   6 NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN

or

In [31]: a.add(b,fill_value=0)
Out[31]: 
    a   b   c
0   1 NaN NaN
1   6   2 NaN
2   5   3   1
3 NaN NaN   2
4 NaN NaN   3

maybe explain what the issue you are seeing. Essentially you are doing this:

In [32]: a1, b1 = a.align(b)

In [33]: a1
Out[33]: 
    a   b   c
0   1 NaN NaN
1   3   2 NaN
2 NaN   3 NaN
3 NaN NaN NaN
4 NaN NaN NaN

In [34]: b1
Out[34]: 
    a   b   c
0 NaN NaN NaN
1   3 NaN NaN
2   5 NaN   1
3 NaN NaN   2
4 NaN NaN   3

so in this example is IS working as adverisied. The NaN's are coming from the addition.

@jreback jreback added the Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate label Aug 3, 2015
@jorisvandenbossche
Copy link
Member

@jreback you are fully correct that those (a.add(b) or a.add(b, fill_value=0)) are the logical functions to use instead of combine.

But, still, it is not really clear what the docs of combine even mean with:

Add two DataFrame objects and do not propagate NaN values, so if for a (column, time) one frame is missing a value, it will default to the other frame’s value (which might be NaN as well)

What does 'do not propagate NaN values' mean? What if the result of the function at a certain place in the frame is NaN, what should it do? Not propagate? (but this is what it does) Default to the value in other? So this would mean that in a11 + b11 it gives you b11 if a11 is NaN, but NaN if b11 is NaN and a11 not .. (seems like a very strange operation).

In any case, docstring can use an update, as it does not seem correct.

@jreback
Copy link
Contributor

jreback commented Aug 4, 2015

yeh doc-string prob not clear. .combine is not really used much (except by .combine_first). Its the logical equivalent of .apply but for binary functions.

Maybe we should just deprecate this.

@jsevo
Copy link

jsevo commented Mar 9, 2016

+1 to deprecate. This is pretty confusing. It also has no explanation what func ought to be. Is combine with overwrite=True equivalent to update? Not a very clear situation for what must be pretty common operations.

@jreback jreback added this to the 0.19.0 milestone Mar 9, 2016
@jreback jreback added Difficulty Novice Deprecate Functionality to remove in pandas labels Mar 9, 2016
@jreback jreback changed the title DataFrame.combine propagates nan values DataFrame.combine propagates nan values / deprecate Mar 9, 2016
@jreback jreback changed the title DataFrame.combine propagates nan values / deprecate DEPR: DataFrame.combine propagates nan values Mar 9, 2016
@auvipy
Copy link

auvipy commented Mar 18, 2016

I want to depricate this. how to do so? never contributed to pandas before!

@jreback
Copy link
Contributor

jreback commented Mar 20, 2016

look at how .combineAdd is done. note that we can't do this until 0.19.0.

@jreback jreback modified the milestones: 0.19.0, 0.20.0 Aug 11, 2016
@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Aug 12, 2016

Repeating from #13970

Not really sure we should actually deprecate DataFrame.combine. Although I have never used it, it does serve a purpose that is not possible to achieve with "the flexible arithmetic methods like add" when you pass it a custom function (unlike the deprecated combineAdd, which definitely can be done with add)

@sinhrks
Copy link
Member

sinhrks commented Aug 12, 2016

combine 's func spec isn't clear / predictable for users. I think using .align -> arbitrary func is clearer such cases.

@jreback
Copy link
Contributor

jreback commented Aug 12, 2016

I agree with @sinhrks here. To be honest we have seen almost no reports of .combine usage over the years (bugs or otherwise), except those we have done. Its not documented, nor does it have a clear well defined purpose.

Let's deprecate. I suppose if we have a big uprising in the community my opinion could change.

@jorisvandenbossche
Copy link
Member

Example usage: combining two dataframes to take the maximum of both frames at each location:

In [36]: df1 = pd.DataFrame({'a':[1,2,3], 'b':[4,5,6]})

In [37]: df2 = pd.DataFrame({'b':[6,5,4], 'c':[3,2,1]}, index=[1,2,3])

In [38]: df1
Out[38]:
   a  b
0  1  4
1  2  5
2  3  6

In [39]: df2
Out[39]:
   b  c
1  6  3
2  5  2
3  4  1

In [40]: df1.combine(df2, func=np.maximum)
Out[40]:
    a    b   c
0 NaN  NaN NaN
1 NaN  6.0 NaN
2 NaN  6.0 NaN
3 NaN  NaN NaN

In [41]: df1.combine(df2, func=np.maximum, fill_value=-np.inf)
Out[41]:
     a    b    c
0  1.0  4.0  NaN
1  2.0  6.0  3.0
2  3.0  6.0  2.0
3  NaN  4.0  1.0

Is there another easy way to achieve this?

I am the first to say that we actually have too many methods (that's why we deprecated eg the really useless and unpythonic combineAdd and combineMult), but it's not because something is poorly/wrongly documented, we should deprecate it. The behaviour (as implemented) is actually rather straightforward IMO, and it should be easy to update the docstring to reflect this.

I fully agree that it is probably not used much, and I am also not certain it's worth keeping. But just want to point out that the bad docstring is not a reason to deprecate it.

@jreback
Copy link
Contributor

jreback commented Aug 12, 2016

@jorisvandenbossche not unreasonable, though your example have never been requested :<

In this case

In [40]: df1, df2 = df1.align(df2)

In [41]: df1
Out[41]: 
     a    b   c
0  1.0  4.0 NaN
1  2.0  5.0 NaN
2  3.0  6.0 NaN
3  NaN  NaN NaN

In [42]: df2
Out[42]: 
    a    b    c
0 NaN  NaN  NaN
1 NaN  6.0  3.0
2 NaN  5.0  2.0
3 NaN  4.0  1.0

In [46]: df1.where(df1>df2).combine_first(df1).combine_first(df2)
Out[46]: 
     a    b    c
0  1.0  4.0  NaN
1  2.0  5.0  3.0
2  3.0  6.0  2.0
3  NaN  4.0  1.0

again a dedicated method for a feature like this may not be entirely useful. Maybe can find some other uses / integrations which make sense. ok so let's discuss some more, @sinhrks can take it out from the PR (and let other bug fixes in).

In particular from your example, having to 'manually' specify the fill_value=-np.inf is really awkward (not that my method is much better :)

@sinhrks
Copy link
Member

sinhrks commented Aug 12, 2016

After df1.align(df2):

np.fmax(df1, df2)
#      a    b    c
# 0  1.0  4.0  NaN
# 1  2.0  6.0  3.0
# 2  3.0  6.0  2.0
# 3  NaN  4.0  1.0

What I also care is its impl. I think following points cannot be explained for users in clear way:

  • the result's dtype is decided by its input dtype, not by what func returns. (thus user func must meet the spec)
  • func signature changes if data contains datetime-like columns.

@sinhrks
Copy link
Member

sinhrks commented Feb 9, 2017

can we discuss it for 0.20, @jorisvandenbossche ?

@jreback jreback modified the milestones: 0.20.0, 0.21.0 Mar 23, 2017
@jreback jreback removed this from the 0.20.0 milestone Mar 23, 2017
@jreback
Copy link
Contributor

jreback commented Sep 23, 2017

thoughts on what to do about this?

@jreback jreback modified the milestones: 0.21.0, 1.0 Oct 2, 2017
@stuarteberg
Copy link
Contributor

FWIW, this confused me for a while today. As far as I can tell, the docs are misleading at best (maybe even just plain wrong?).

Just like @jorisvandenbossche above, I was attempting to compute the element-wise max values between two dataframes (with some missing values). I was surprised to see this:

In [327]: df1
Out[327]:
     A
0  1.0
1  2.0
2  NaN
3  4.0

In [328]: df2
Out[328]:
     A
0  1.0
1  2.0
2  3.0
3  NaN

In [329]: df1.combine(df2, np.maximum)
Out[329]:
     A
0  1.0
1  2.0
2  NaN
3  NaN

Am I missing something?

@SaturnFromTitan
Copy link
Contributor

@jreback @sinhrks @jorisvandenbossche Is this still something one could/should work on? I bumped into it because it's still part of the 1.0 milestone

Reading the arguments here I'd be strongly in favour of deprecating it, but since the conversation is some years old, I'm not sure if all of it is still relevant

@TomAugspurger
Copy link
Contributor

I don't think a decision was made about what to do. I don't have a strong opinion, but I'm removing it from the 1.0 milestone.

@TomAugspurger TomAugspurger modified the milestones: 1.0, Contributions Welcome Dec 30, 2019
@mroeschke
Copy link
Member

Since there hasn't been much consensus around deprecation and we do have decent test coverage for this function (and I do see the utility of this function), I think it might just be worth fixing this bug and documenting better.

@mroeschke mroeschke added Bug Docs and removed Deprecate Functionality to remove in pandas labels Jun 7, 2022
@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Docs Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Projects
None yet
Development

No branches or pull requests