DEPR: DataFrame.combine propagates nan values #10734

aktiur · 2015-08-03T13:33:28Z

The documentation for DataFrame.combine claims that the method "do[es] not propagate NaN values, so if for a (column, time) one frame is missing a value, it will default to the other frame’s value". However, this does not seem to correspond to the actual behaviour of DataFrame.combine.

Sample code:

>>> import pandas as pd
>>> from operator import add
>>> a = pd.DataFrame({
...        'a': pd.Series([1, 3], index=[0, 1]),
...        'b': pd.Series([2, 3], index=[1, 2]),
...    })
>>> b = pd.DataFrame({
...         'a': pd.Series([3, 5], index=[1, 2]),
...         'c': pd.Series([1, 2, 3], index=[2, 3, 4])
...     })
>>> a.combine(b, add)
    a   b   c
0 NaN NaN NaN
1   6 NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN

In many cases (as in this one), it may be remedied by using the fill_value parameter of DataFrame.combine. However, it might be a problem when there is no acceptable neutral element for the given combining function.

>>> pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.4.3.final.0
python-bits: 64
OS: Windows
OS-release: 8
machine: AMD64
processor: Intel64 Family 6 Model 69 Stepping 1, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.16.2
nose: 1.3.7
Cython: 0.22
numpy: 1.9.2
scipy: 0.16.0
statsmodels: 0.6.1
IPython: 3.1.0
sphinx: 1.2.3
patsy: 0.3.0
dateutil: 2.4.2
pytz: 2015.4
bottleneck: None
tables: 3.1.1
numexpr: 2.3.1
matplotlib: 1.4.3
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: None
xlsxwriter: 0.6.7
lxml: 3.4.2
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.6
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext lo64)

jreback · 2015-08-03T21:39:37Z

you can simply do this:

In [13]: a.add(b)
Out[13]: 
    a   b   c
0 NaN NaN NaN
1   6 NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN

or

In [31]: a.add(b,fill_value=0)
Out[31]: 
    a   b   c
0   1 NaN NaN
1   6   2 NaN
2   5   3   1
3 NaN NaN   2
4 NaN NaN   3

Maybe explain what issue you are seeing. Essentially you are doing this:

In [32]: a1, b1 = a.align(b)

In [33]: a1
Out[33]: 
    a   b   c
0   1 NaN NaN
1   3   2 NaN
2 NaN   3 NaN
3 NaN NaN NaN
4 NaN NaN NaN

In [34]: b1
Out[34]: 
    a   b   c
0 NaN NaN NaN
1   3 NaN NaN
2   5 NaN   1
3 NaN NaN   2
4 NaN NaN   3

So in this example it IS working as advertised. The NaNs are coming from the addition.
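The explanation above can be checked with a small sketch: aligning the two frames by hand and then adding them gives the same NaN pattern as the combine call in the original report. The frames are the ones from the issue; the names a1, b1, and summed are only illustrative.

```python
import pandas as pd

a = pd.DataFrame({
    "a": pd.Series([1, 3], index=[0, 1]),
    "b": pd.Series([2, 3], index=[1, 2]),
})
b = pd.DataFrame({
    "a": pd.Series([3, 5], index=[1, 2]),
    "c": pd.Series([1, 2, 3], index=[2, 3, 4]),
})

# Outer-align both frames onto the union of their indexes and columns.
a1, b1 = a.align(b)

# Plain addition: any position where either operand is NaN yields NaN.
summed = a1 + b1

# Only (row 1, column "a") holds a value in both frames.
print(summed)
```

Because only one cell is populated in both aligned frames, every other cell of the sum is NaN, regardless of what combine does afterwards.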

jorisvandenbossche · 2015-08-03T23:47:59Z

@jreback you are fully correct that those (a.add(b) or a.add(b, fill_value=0)) are the logical functions to use instead of combine.

But, still, it is not really clear what the docs of combine even mean with:

Add two DataFrame objects and do not propagate NaN values, so if for a (column, time) one frame is missing a value, it will default to the other frame’s value (which might be NaN as well)

What does 'do not propagate NaN values' mean? What if the result of the function at a certain place in the frame is NaN, what should it do? Not propagate? (but this is what it does) Default to the value in other? So this would mean that in a11 + b11 it gives you b11 if a11 is NaN, but NaN if b11 is NaN and a11 not .. (seems like a very strange operation).

In any case, docstring can use an update, as it does not seem correct.

jreback · 2015-08-04T19:43:06Z

Yeah, the docstring is probably not clear. .combine is not really used much (except by .combine_first). It's the logical equivalent of .apply but for binary functions.

Maybe we should just deprecate this.

jsevo · 2016-03-09T22:05:50Z

+1 to deprecate. This is pretty confusing. It also has no explanation of what func ought to be. Is combine with overwrite=True equivalent to update? Not a very clear situation for what must be pretty common operations.

auvipy · 2016-03-18T17:13:12Z

I want to deprecate this. How do I do so? I have never contributed to pandas before!

jreback · 2016-03-20T15:23:18Z

Look at how .combineAdd is done. Note that we can't do this until 0.19.0.

jorisvandenbossche · 2016-08-12T00:00:49Z

Repeating from #13970

Not really sure we should actually deprecate DataFrame.combine. Although I have never used it, it does serve a purpose that is not possible to achieve with "the flexible arithmetic methods like add" when you pass it a custom function (unlike the deprecated combineAdd, which definitely can be done with add).

sinhrks · 2016-08-12T00:32:37Z

combine's func spec isn't clear / predictable for users. I think using .align -> arbitrary func is clearer in such cases.

jreback · 2016-08-12T12:03:54Z

I agree with @sinhrks here. To be honest we have seen almost no reports of .combine usage over the years (bugs or otherwise), except those we have done. It's not documented, nor does it have a clear, well-defined purpose.

Let's deprecate. I suppose if we have a big uprising in the community my opinion could change.

jorisvandenbossche · 2016-08-12T12:18:27Z

Example usage: combining two dataframes to take the maximum of both frames at each location:

In [36]: df1 = pd.DataFrame({'a':[1,2,3], 'b':[4,5,6]})

In [37]: df2 = pd.DataFrame({'b':[6,5,4], 'c':[3,2,1]}, index=[1,2,3])

In [38]: df1
Out[38]:
   a  b
0  1  4
1  2  5
2  3  6

In [39]: df2
Out[39]:
   b  c
1  6  3
2  5  2
3  4  1

In [40]: df1.combine(df2, func=np.maximum)
Out[40]:
    a    b   c
0 NaN  NaN NaN
1 NaN  6.0 NaN
2 NaN  6.0 NaN
3 NaN  NaN NaN

In [41]: df1.combine(df2, func=np.maximum, fill_value=-np.inf)
Out[41]:
     a    b    c
0  1.0  4.0  NaN
1  2.0  6.0  3.0
2  3.0  6.0  2.0
3  NaN  4.0  1.0

Is there another easy way to achieve this?

I am the first to say that we actually have too many methods (that's why we deprecated e.g. the really useless and unpythonic combineAdd and combineMult), but the fact that something is poorly/wrongly documented does not mean we should deprecate it. The behaviour (as implemented) is actually rather straightforward IMO, and it should be easy to update the docstring to reflect this.

I fully agree that it is probably not used much, and I am also not certain it's worth keeping. But just want to point out that the bad docstring is not a reason to deprecate it.

jreback · 2016-08-12T12:57:34Z

@jorisvandenbossche not unreasonable, though your example has never been requested :<

In this case

In [40]: df1, df2 = df1.align(df2)

In [41]: df1
Out[41]: 
     a    b   c
0  1.0  4.0 NaN
1  2.0  5.0 NaN
2  3.0  6.0 NaN
3  NaN  NaN NaN

In [42]: df2
Out[42]: 
    a    b    c
0 NaN  NaN  NaN
1 NaN  6.0  3.0
2 NaN  5.0  2.0
3 NaN  4.0  1.0

In [46]: df1.where(df1>df2).combine_first(df1).combine_first(df2)
Out[46]: 
     a    b    c
0  1.0  4.0  NaN
1  2.0  5.0  3.0
2  3.0  6.0  2.0
3  NaN  4.0  1.0

Again, a dedicated method for a feature like this may not be entirely useful. Maybe we can find some other uses / integrations which make sense. OK, so let's discuss some more; @sinhrks can take it out of the PR (and let the other bug fixes in).

In particular from your example, having to 'manually' specify the fill_value=-np.inf is really awkward (not that my method is much better :)

sinhrks · 2016-08-12T14:56:58Z

After df1.align(df2):

np.fmax(df1, df2)
#      a    b    c
# 0  1.0  4.0  NaN
# 1  2.0  6.0  3.0
# 2  3.0  6.0  2.0
# 3  NaN  4.0  1.0
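A self-contained version of this approach might look like the sketch below. It assumes the df1 and df2 from the earlier example; the names left, right, and elementwise_max are illustrative. np.fmax is applied to the underlying arrays so that the sketch does not depend on how a given pandas version dispatches ufuncs on DataFrames.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
df2 = pd.DataFrame({"b": [6, 5, 4], "c": [3, 2, 1]}, index=[1, 2, 3])

# Outer-align so both frames share the same index and columns.
left, right = df1.align(df2)

# np.fmax returns the non-NaN operand when only one side is NaN,
# and NaN only when both sides are NaN.
elementwise_max = pd.DataFrame(
    np.fmax(left.to_numpy(), right.to_numpy()),
    index=left.index,
    columns=left.columns,
)
print(elementwise_max)
```

No fill_value is needed here, since the NaN handling comes from np.fmax itself rather than from a sentinel such as -np.inf.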

What I also care about is its implementation. I think the following points cannot be explained to users in a clear way:

- the result's dtype is decided by its input dtype, not by what func returns (thus the user's func must meet the spec).
- func's signature changes if the data contains datetime-like columns.

sinhrks · 2017-02-09T08:30:37Z

can we discuss it for 0.20, @jorisvandenbossche ?

jreback · 2017-09-23T21:07:12Z

thoughts on what to do about this?

stuarteberg · 2018-02-01T06:15:26Z

FWIW, this confused me for a while today. As far as I can tell, the docs are misleading at best (maybe even just plain wrong?).

Just like @jorisvandenbossche above, I was attempting to compute the element-wise max values between two dataframes (with some missing values). I was surprised to see this:

In [327]: df1
Out[327]:
     A
0  1.0
1  2.0
2  NaN
3  4.0

In [328]: df2
Out[328]:
     A
0  1.0
1  2.0
2  3.0
3  NaN

In [329]: df1.combine(df2, np.maximum)
Out[329]:
     A
0  1.0
1  2.0
2  NaN
3  NaN

Am I missing something?
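For what it's worth, the NaN rows in that output are consistent with how np.maximum itself behaves, independent of combine: it propagates NaN, whereas np.fmax ignores a NaN when the other operand is a number. A minimal sketch on plain arrays holding the same values as column A above (the names x, y, propagating, and ignoring are illustrative):

```python
import numpy as np

x = np.array([1.0, 2.0, np.nan, 4.0])
y = np.array([1.0, 2.0, 3.0, np.nan])

# np.maximum yields NaN wherever either operand is NaN.
propagating = np.maximum(x, y)

# np.fmax yields the other operand when exactly one side is NaN.
ignoring = np.fmax(x, y)

print(propagating)
print(ignoring)
```

So the choice of func matters as much as the alignment step: a NaN-propagating function will produce NaN wherever one frame is missing a value.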

SaturnFromTitan · 2019-11-01T14:46:46Z

@jreback @sinhrks @jorisvandenbossche Is this still something one could/should work on? I bumped into it because it's still part of the 1.0 milestone

Reading the arguments here I'd be strongly in favour of deprecating it, but since the conversation is some years old, I'm not sure if all of it is still relevant

TomAugspurger · 2019-12-30T14:12:44Z

I don't think a decision was made about what to do. I don't have a strong opinion, but I'm removing it from the 1.0 milestone.

mroeschke · 2022-06-07T21:43:30Z

Since there hasn't been much consensus around deprecation and we do have decent test coverage for this function (and I do see the utility of this function), I think it might just be worth fixing this bug and documenting better.

aktiur changed the title ~~DOC: DataFrame.combine propagates nan values~~ DataFrame.combine propagates nan values Aug 3, 2015

jreback added the Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate label Aug 3, 2015

jreback added this to the 0.19.0 milestone Mar 9, 2016

jreback added Difficulty Novice Deprecate Functionality to remove in pandas labels Mar 9, 2016

jreback changed the title ~~DataFrame.combine propagates nan values~~ DataFrame.combine propagates nan values / deprecate Mar 9, 2016

jreback changed the title ~~DataFrame.combine propagates nan values / deprecate~~ DEPR: DataFrame.combine propagates nan values Mar 9, 2016

sinhrks mentioned this issue Aug 11, 2016

BUG/DEPR: combine dtype fixes #13970

Closed

5 tasks

jreback modified the milestones: 0.19.0, 0.20.0 Aug 11, 2016

jorisvandenbossche modified the milestones: 0.20.0, 0.19.0 Sep 1, 2016

jreback mentioned this issue Sep 20, 2016

DEPR: 0.21 deprecations master issue #14220

Closed

8 tasks

jreback modified the milestones: 0.20.0, 0.21.0 Mar 23, 2017

jreback removed this from the 0.20.0 milestone Mar 23, 2017

jreback modified the milestones: 0.21.0, 1.0 Oct 2, 2017

TomAugspurger added the good first issue label Oct 11, 2017

jreback removed the Difficulty Novice label Dec 15, 2017

jbrockmendel removed the Effort Low label Oct 21, 2019

TomAugspurger modified the milestones: 1.0, Contributions Welcome Dec 30, 2019

mroeschke removed the good first issue label Apr 18, 2021

mroeschke added Bug Docs and removed Deprecate Functionality to remove in pandas labels Jun 7, 2022

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
