"wrong" covariance matrix returned in the presence of nans #3513
Comments
easy enough to add a That said, maybe could have a global option, say
Ultimately, I don't really care if people are doing the wrong thing as long as there's a big honking warning in the documentation, but why do you think the default should be False? IMO this isn't just an academic concern. The returned result is not what it's assumed to be, and people who need to use a covariance matrix as input to more code down the line should know that they don't have one.
Yeah, I guess I'll vote for improved documentation/warning then. "If you have missing values, then the covariance matrix is not guaranteed to be positive semi-definite."
Though I think there's still room for improvement here too. We could do something like Matlab's nancov, maybe with another keyword: missing = "complete" or "pairwise". That way you can at least get a covariance that's PSD when used in other applications, and invertible as long as enough complete rows remain. Even a note in the docs pointing to df.dropna().cov() would be helpful, as I didn't find the documentation about what's done with NaNs to be crystal clear. cc @josef-pkt
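For illustration only, here is a minimal sketch of what such a keyword could look like as a wrapper around existing pandas calls. The function name `nan_cov` and its `missing` parameter are hypothetical and are not part of pandas; only `DataFrame.dropna`, `DataFrame.isna` and `DataFrame.cov` are real API.

```python
import warnings

import numpy as np
import pandas as pd


def nan_cov(df, missing="pairwise"):
    """Hypothetical helper (not pandas API): covariance with explicit NaN handling."""
    if missing == "complete":
        # Keep only rows where every variable is observed, then compute the
        # ordinary sample covariance. The result is PSD up to rounding.
        return df.dropna().cov()
    if missing == "pairwise":
        # Same as the current DataFrame.cov() behaviour: each entry uses the
        # rows available for that pair of columns.
        if df.isna().to_numpy().any():
            warnings.warn(
                "NaNs present: the pairwise result is not guaranteed to be "
                "positive semi-definite.",
                UserWarning,
                stacklevel=2,
            )
        return df.cov()
    raise ValueError("missing must be 'complete' or 'pairwise'")


# Made-up data: rows 0-2 are complete, rows 3 and 4 each have one NaN.
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, np.nan],
    "y": [2.0, 1.0, 4.0, np.nan, 5.0],
    "z": [1.0, 3.0, 2.0, 5.0, 4.0],
})

complete = nan_cov(df, missing="complete")
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    pairwise = nan_cov(df, missing="pairwise")
```

With `missing="complete"` the result is computed from rows 0-2 only, so it is a genuine sample covariance matrix; whether it is also invertible depends on having enough complete rows relative to the number of columns.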
@jseabold this is a good idea, maybe coupled with a warning (for instance, if "complete" is requested but the data contain NaNs)
Just an update from the statsmodels side But I just realized that I only tested with numpy arrays, and it might not work yet for pandas DataFrames.
Closing in favor of #16837
See the mailing list "[pydata] Covariance matrix not positive semi-definite."
Currently, a covariance matrix is computed using pairwise available observations, i.e., if a row is missing data for one variable but has data for two others, that row is still used when computing the covariance of those two. The result of this computation is not a true covariance matrix and may fail to be positive semi-definite.
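A small made-up example of the problem, assuming the pairwise behaviour of `DataFrame.cov()` described above. The data are constructed so that each pair of columns is jointly observed on a different pair of rows, with x and y moving together, y and z moving together, and x and z moving in opposite directions:

```python
import numpy as np
import pandas as pd

# Illustrative data only: no row has all three variables observed.
df = pd.DataFrame({
    "x": [1.0, 2.0, np.nan, np.nan, 1.0, 2.0],
    "y": [1.0, 2.0, 1.0, 2.0, np.nan, np.nan],
    "z": [np.nan, np.nan, 1.0, 2.0, 2.0, 1.0],
})

# Each entry is computed from the rows where both columns are non-NaN.
cov = df.cov()

# A valid covariance matrix has no negative eigenvalues.
eigenvalues = np.linalg.eigvalsh(cov.to_numpy())
min_eig = eigenvalues.min()

print(cov)
print("smallest eigenvalue:", min_eig)  # negative, about -0.667
```

Here every off-diagonal entry (0.5 in absolute value) is larger than the variances on the diagonal (1/3), which is impossible for a real covariance matrix.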
What to do in this case? 1) Warn? 2) Raise an error? 3) Only use observations for which all variables are available?
Option 3 is tempting: the resulting covariance matrix will be a true covariance matrix, but it's an inconsistent estimator of the covariance.
My vote is for 2, so that the user is forced to think about what they actually want to compute. Ideally, the error message would point to estimators that are appropriate for this situation, but these are not online yet (from statsmodels).
statsmodels/statsmodels#631
statsmodels/statsmodels#303