Description
Is your feature request related to a problem? Please describe.
I calculated correlation coefficients on datasets of 90-180 GB using xarray and Dask distributed and saw very poor performance from the `xarray.corr()` function. Watching the Dask dashboard, it looked like the full datasets were loaded from disk several times during the calculation, which, given their size, became a major performance bottleneck for some of the calculations.
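For reference, the usage pattern is roughly the one below (a minimal stand-in, not my actual data; the variable names, shapes and chunk sizes are made up for illustration):

```python
import dask.array as da
import xarray as xr

# Illustrative stand-in for the real on-disk, dask-backed datasets.
x = xr.DataArray(
    da.random.random((1000, 500, 500), chunks=(100, 250, 250)),
    dims=("time", "lat", "lon"),
    name="x",
)
y = xr.DataArray(
    da.random.random((1000, 500, 500), chunks=(100, 250, 250)),
    dims=("time", "lat", "lon"),
    name="y",
)

# One would expect this to stay lazy, but with dask-backed inputs it
# already triggers computation and repeatedly streams the full data.
r = xr.corr(x, y, dim="time")
```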
Describe the solution you'd like
The problem became annoying enough that I implemented my own function to calculate the correlation coefficient (thanks @willirath!), which is considerably more performant (especially for the big datasets!) because it only touches the full data once. I have uploaded a Jupyter notebook that shows the equivalence of `xarray.corr()` and my implementation (using an "unaligned data with NaN values" example, which is the case `xarray.corr()` covers), plus an example based on Dask arrays that demonstrates the performance problems stated above and also shows that `xarray.corr()` is not fully lazy (which I assume is not very desirable?). The core idea is sketched after this paragraph.
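The notebook has the full details; what follows is only a minimal sketch of the single-pass idea, not the exact implementation from the notebook (the function name and the E[xy] - E[x]E[y] formulation here are my own simplification):

```python
import xarray as xr

def single_pass_corr(x, y, dim):
    """Fully lazy Pearson correlation along `dim`, built as one dask graph."""
    # Mask points that are missing in either array (stays lazy).
    valid = x.notnull() & y.notnull()
    x = x.where(valid)
    y = y.where(valid)

    # E[xy] - E[x]E[y] formulation: all reductions end up in a single
    # dask graph, so each chunk only needs to be read from disk once.
    mean_x = x.mean(dim)
    mean_y = y.mean(dim)
    cov = (x * y).mean(dim) - mean_x * mean_y

    return cov / (x.std(dim) * y.std(dim))

# Nothing is computed until .compute() / .load() is called:
# r = single_pass_corr(x, y, dim="time").compute()
```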
At the moment, I think a considerable improvement in big-data performance could be achieved by removing the `if not valid_values.all()` clause here, because it seems to be what makes a call to `xarray.corr()` not fully lazy and causes the first (of several?) full touches of the datasets. I haven't checked what happens afterwards, but maybe that is already a useful starting point? 🤔
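To illustrate the mechanism I suspect is at play (a standalone sketch, not the actual xarray source): using a dask-backed boolean reduction in an `if` statement forces it to be computed.

```python
import dask.array as da
import xarray as xr

arr = xr.DataArray(da.random.random((1000, 1000), chunks=(100, 100)))
valid_values = arr.notnull()

# `valid_values.all()` is still lazy, but the `if` calls bool() on the
# dask-backed result, which forces the whole mask to be computed (and
# therefore a full read of the underlying data) at graph-build time.
if not valid_values.all():
    ...
```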