"wrong" covariance matrix returned in the presence of nans #3513

Closed
@jseabold

See the mailing list "[pydata] Covariance matrix not positive semi-definite."

Currently, the covariance matrix is computed from pairwise-available observations, i.e., a row with missing data in some column is still used for every pair of columns in which both values are present. Each entry can therefore be based on a different subset of rows. The result of this computation is not a true covariance matrix and can fail to be positive semi-definite.
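A minimal sketch of the problem, using a made-up data set (the column names and values are illustrative and do not come from the mailing list thread). Each pair of columns overlaps on a different block of rows, so the pairwise covariances are mutually inconsistent:

```python
import numpy as np
import pandas as pd

nan = np.nan

# Illustrative data: every pair of columns is jointly observed
# on a different block of three rows.
df = pd.DataFrame({
    "x": [1, 2, 3, nan, nan, nan, 1, 2, 3],
    "y": [1, 2, 3, 1, 2, 3, nan, nan, nan],
    "z": [nan, nan, nan, 1, 2, 3, -1, -2, -3],
})

# DataFrame.cov() uses pairwise-available observations when NaNs are present.
pairwise = df.cov()

# A valid covariance matrix has no negative eigenvalues.
eigenvalues = np.linalg.eigvalsh(pairwise.to_numpy())

print(pairwise)
print("smallest eigenvalue:", eigenvalues.min())
```

Here the variance of `x` and of `y` is 0.8 (from six observations each), while their covariance is 1.0 (from the three rows where both are present). The implied correlation is 1.25, so the matrix cannot be positive semi-definite and the smallest eigenvalue is negative.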

What to do in this case? 1) Warn? 2) Raise an error? 3) Only use observations for which all variables are available?

Option 3 is tempting: the resulting matrix will be a true covariance matrix, but it's an inconsistent estimator of the covariance.

My vote is for 2, so that the user is forced to think about what they actually want to compute. Ideally, the error message will point to estimators that are appropriate for this situation, but these are not available yet (from statsmodels).
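The three options listed above could be sketched roughly as follows. The helper `cov_with_missing` and its `on_missing` parameter are hypothetical names invented for illustration; they are not part of pandas, and the messages are placeholders:

```python
import warnings

import numpy as np
import pandas as pd


def cov_with_missing(df, on_missing="raise"):
    """Hypothetical helper (not a pandas API) illustrating options 1-3."""
    # No missing data: the ordinary covariance matrix is well defined.
    if not df.isna().to_numpy().any():
        return df.cov()

    if on_missing == "raise":  # option 2
        raise ValueError(
            "data contain NaNs; a pairwise covariance matrix may not be "
            "positive semi-definite"
        )
    if on_missing == "warn":  # option 1
        warnings.warn(
            "NaNs present; returning pairwise covariances, which may not "
            "form a positive semi-definite matrix",
            RuntimeWarning,
            stacklevel=2,
        )
        return df.cov()
    if on_missing == "complete":  # option 3
        # Keep only rows in which every variable is observed.
        return df.dropna().cov()
    raise ValueError(f"unknown on_missing option: {on_missing!r}")


# Illustrative data with one incomplete row.
df = pd.DataFrame({"a": [1.0, 2.0, 4.0, np.nan], "b": [2.0, 1.0, 5.0, 3.0]})
complete = cov_with_missing(df, on_missing="complete")
print(complete)
```

With the default `on_missing="raise"`, the same call fails with a `ValueError`, which is the behaviour voted for here. The `"complete"` branch always returns a positive semi-definite matrix because every entry is computed from the same rows.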

statsmodels/statsmodels#631
statsmodels/statsmodels#303

Labels: Bug, Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate)
