"wrong" covariance matrix returned in the presence of nans #3513

Closed
jseabold opened this issue May 2, 2013 · 8 comments
Labels
Bug Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate

Comments

@jseabold
Contributor

jseabold commented May 2, 2013

See the mailing list "[pydata] Covariance matrix not positive semi-definite."

Currently, the covariance matrix is computed from pairwise-available observations, i.e., if a row has missing data in some column but not in the two columns being compared, that row is still used for that pair's covariance. The result of this computation is not a true covariance matrix and can fail to be positive semi-definite.
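
A minimal sketch of the problem, using made-up data (the column names and values are illustrative and do not come from the mailing-list thread). Each pair of columns is jointly observed on a different block of rows, so each entry of the matrix is estimated from a different subsample:

```python
import numpy as np
import pandas as pd

nan = np.nan
# Illustrative data: no row has all three columns present.
df = pd.DataFrame({
    "x": [1, 2, 3, nan, nan, nan, 1, 2, 3],
    "y": [1, 2, 3, 1, 2, 3, nan, nan, nan],
    "z": [nan, nan, nan, 1, 2, 3, 3, 2, 1],
})

# DataFrame.cov() uses, for each pair of columns, the rows where both are present.
pairwise_cov = df.cov()

# A valid covariance matrix has no negative eigenvalues.
eigenvalues = np.linalg.eigvalsh(pairwise_cov.to_numpy())
print(pairwise_cov)
print("smallest eigenvalue:", eigenvalues.min())
```

Here x and y move together, y and z move together, but x and z move in opposite directions on the rows where both are observed, and the smallest eigenvalue comes out negative.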

What to do in this case? 1) Warn? 2) Raise an error? 3) Only use observations for which all variables are available?

Option 3 is tempting: the resulting matrix will be a true covariance matrix, but it's an inconsistent estimator of the covariance.

My vote is for 2, so that the user is forced to think about what they actually want to compute. Ideally, the error message would point to estimators that are appropriate for this situation, but these are not available yet (they would come from statsmodels).

statsmodels/statsmodels#631
statsmodels/statsmodels#303

@jreback
Contributor

jreback commented May 2, 2013

Easy enough to add a raise_on_nan argument, but the default should probably be False.

That said, maybe we could have a global option, say stats.strict=True, which you could set in order to have things like this default to True?

@jseabold
Contributor Author

jseabold commented May 2, 2013

Ultimately, I don't really care if people are doing the wrong thing as long as there's a big honking warning in the documentation, but why do you think the default should be False? IMO this isn't just an academic concern. The returned result is not what it's assumed to be, and people who need to use a covariance matrix as input to more code down the line should know that they don't have one.

@jreback
Contributor

jreback commented May 2, 2013

False mainly because of backward compatibility (which could always be broken). I use the min_periods argument to sort of 'avoid' this issue (as I prefer to have a minimum number of observations per pair).
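
For reference, a small sketch of what min_periods does in DataFrame.cov, on made-up data: pairs of columns with fewer jointly observed rows than the threshold come back as NaN instead of an estimate.

```python
import numpy as np
import pandas as pd

# Illustrative data: column "b" has only two observations.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 1.0, np.nan, np.nan, np.nan],
    "c": [1.0, 3.0, 2.0, 5.0, 4.0],
})

# Require at least 3 jointly observed rows per pair; pairs with fewer become NaN.
cov_min3 = df.cov(min_periods=3)
print(cov_min3)
```

This filters out estimates based on very few observations, but it does not by itself make the remaining pairwise matrix positive semi-definite.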

@jseabold
Contributor Author

jseabold commented May 2, 2013

Yeah, I guess I'll vote for improved documentation/warning then. "If you have missing values, then the covariance matrix is not guaranteed to be positive semi-definite."
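
Until the docs say so, a user-side check is cheap. A rough sketch, assuming a hypothetical helper named warn_if_not_psd (not part of pandas) and an arbitrary tolerance of 1e-10:

```python
import warnings

import numpy as np
import pandas as pd


def warn_if_not_psd(cov: pd.DataFrame, tol: float = 1e-10) -> bool:
    """Hypothetical helper: return True if `cov` is PSD up to `tol`, else warn and return False."""
    if cov.isna().any().any():
        raise ValueError("matrix contains NaN entries; eigenvalues are undefined")
    # eigvalsh assumes a symmetric matrix, which a covariance matrix is.
    smallest = np.linalg.eigvalsh(cov.to_numpy(dtype=float)).min()
    if smallest < -tol:
        warnings.warn(
            f"matrix is not positive semi-definite (smallest eigenvalue {smallest:.3g})",
            RuntimeWarning,
            stacklevel=2,
        )
        return False
    return True


# Hand-built symmetric matrix with one negative eigenvalue, for illustration only.
labels = ["x", "y", "z"]
bad = pd.DataFrame(
    [[0.8, 1.0, -1.0], [1.0, 0.8, 1.0], [-1.0, 1.0, 0.8]],
    index=labels,
    columns=labels,
)
is_valid = warn_if_not_psd(bad)  # emits a RuntimeWarning
```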

@jseabold
Contributor Author

jseabold commented May 2, 2013

Though I think there's still room for improvement here too. We could do something like Matlab's nancov, maybe via another keyword: missing="complete" or "pairwise". That way you can at least get a covariance matrix that's PSD and invertible when used in other applications. Even a note in the docs pointing to df.dropna().cov() would be helpful, as I didn't find the documentation about what's done with NaNs to be crystal clear. cc @josef-pkt
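
A sketch of what such a keyword could look like as a user-side wrapper. The function name cov_with_missing and the missing argument are hypothetical; DataFrame.cov has no such parameter. The data is random and purely illustrative.

```python
import numpy as np
import pandas as pd


def cov_with_missing(df: pd.DataFrame, missing: str = "pairwise") -> pd.DataFrame:
    """Hypothetical helper; `missing` is not an actual DataFrame.cov argument."""
    if missing == "pairwise":
        return df.cov()           # current behavior: per-pair available rows
    if missing == "complete":
        return df.dropna().cov()  # only rows with no missing values
    raise ValueError("missing must be 'pairwise' or 'complete'")


rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(50, 3)), columns=["x", "y", "z"])
df.iloc[::5, 0] = np.nan   # knock out some values in x
df.iloc[1::7, 2] = np.nan  # and some in z

complete_cov = cov_with_missing(df, missing="complete")
pairwise_cov = cov_with_missing(df, missing="pairwise")
```

The "complete" result is an ordinary sample covariance of the surviving rows, so it is PSD up to rounding; the price is that every row with any NaN is discarded.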

@jreback
Contributor

jreback commented May 2, 2013

@jseabold this is a good idea, maybe coupled with a warning (if complete, but you have nans for instance)

@josef-pkt

just an update from the statsmodels side
finding a close psd matrix is in master http://statsmodels.sourceforge.net/devel/generated/statsmodels.stats.correlation_tools.corr_nearest.html plus 2 more functions.

But I just realized that I only tested with numpy arrays, and it might not work yet for pandas DataFrames.
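
To illustrate the general idea of finding a nearby PSD matrix while keeping DataFrame labels, here is a NumPy-only sketch based on simple eigenvalue clipping. It is a stand-in for illustration and is not claimed to be the algorithm used by corr_nearest; the name clip_to_psd and the example matrix are assumptions.

```python
import numpy as np
import pandas as pd


def clip_to_psd(cov: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: set negative eigenvalues to zero and rebuild the matrix."""
    values = cov.to_numpy(dtype=float)
    values = (values + values.T) / 2.0          # enforce exact symmetry
    eigvals, eigvecs = np.linalg.eigh(values)
    clipped = np.clip(eigvals, 0.0, None)       # drop the negative part of the spectrum
    repaired = (eigvecs * clipped) @ eigvecs.T  # V @ diag(clipped) @ V.T
    repaired = (repaired + repaired.T) / 2.0    # remove rounding asymmetry
    # Re-attach the original row and column labels.
    return pd.DataFrame(repaired, index=cov.index, columns=cov.columns)


# Hand-built symmetric matrix with one negative eigenvalue, for illustration only.
labels = ["x", "y", "z"]
bad = pd.DataFrame(
    [[0.8, 1.0, -1.0], [1.0, 0.8, 1.0], [-1.0, 1.0, 0.8]],
    index=labels,
    columns=labels,
)
repaired = clip_to_psd(bad)
print(np.linalg.eigvalsh(repaired.to_numpy()).min())
```

Note that clipping changes the diagonal, so the variances in the repaired matrix no longer match the original estimates.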

@jreback
Contributor

jreback commented Jul 6, 2017

closing in favor of #16837

@jreback jreback closed this as completed Jul 6, 2017