Skip to content

BUG: Series.corr/cov raising with masked dtype #51422

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Feb 24, 2023

Conversation

lukemanley
Copy link
Member

@lukemanley lukemanley added Bug ExtensionArray Extending pandas with custom dtypes or arrays. labels Feb 16, 2023
from scipy import stats

datetime_series = datetime_series.astype(dtype)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what happens here if there are NAs?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both left and right are filtered for notna here:

valid = notna(a) & notna(b)

so this works:

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: ser1 = pd.Series(np.random.randn(100), dtype="Float64")

In [4]: ser2 = pd.Series(np.random.randn(100), dtype="Float64")

In [5]: ser1[1] = pd.NA

In [6]: ser2[5:7] = pd.NA

In [7]: ser1.corr(ser2)
Out[7]: 0.09774253881093414

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cool. what about if there is an nan?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean an nan within a masked array? e.g.

In [8]: ser1[1] = 0.0

In [9]: ser1 /= (ser1 != 0)

In [10]: ser1
Out[10]: 
0     0.148073
1          NaN
2     0.556972
3    -0.554886
4     1.216938
        ...   
95     0.33919
96    0.528683
97    1.590215
98     0.84015
99    0.333666
Length: 100, dtype: Float64

In [11]: ser1.corr(ser2)
Out[11]: 0.09774253881093414

It works either way since the notna is applied to the ndarray which will capture both np.nan and pd.NA

@phofl phofl added this to the 2.1 milestone Feb 22, 2023
@@ -1230,6 +1230,7 @@ Numeric
- Bug in arithmetic operations on :class:`Series` not propagating mask when combining masked dtypes and numpy dtypes (:issue:`45810`, :issue:`42630`)
- Bug in :meth:`DataFrame.sem` and :meth:`Series.sem` where an erroneous ``TypeError`` would always raise when using data backed by an :class:`ArrowDtype` (:issue:`49759`)
- Bug in :meth:`Series.__add__` casting to object for list and masked :class:`Series` (:issue:`22962`)
- Bug in :meth:`Series.corr` and :meth:`Series.cov` raising ``AttributeError`` for masked dtypes (:issue:`51422`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you move to 2.1 as well?

@@ -2702,8 +2705,10 @@ def cov(
this, other = self.align(other, join="inner", copy=False)
if len(this) == 0:
return np.nan
this_values = np.asarray(this.values)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use ._values here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I think so. For my knowledge, if we need an ndarray for nanops.nancov, is there a benefit to using np.asarray(this._values) vs np.asarray(this.values)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

its probably harmless in this case but in general .values can be either a) lossy (dt64tz) or b) convert to object (period) while ._values is safe from those failure modes

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it, thanks! I'll update just in case

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

values also creates references and is read_only (in the future) for CoW. Better to use ._values

@phofl phofl merged commit 6d2b46d into pandas-dev:main Feb 24, 2023
@phofl
Copy link
Member

phofl commented Feb 24, 2023

thx @lukemanley

@lukemanley lukemanley deleted the series-cov-corr-with-masked-dtype branch March 17, 2023 22:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug ExtensionArray Extending pandas with custom dtypes or arrays.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants