Sum and prod with min_count forces evaluation #4898

Closed · bcbnz opened this issue Feb 12, 2021 · 5 comments · Fixed by #4911
Comments

@bcbnz (Contributor) commented Feb 12, 2021

Calling the sum method on a lazy (dask-backed) array with min_count != None forces evaluation. If some limitation of the implementation means this reduction cannot be added to the computation graph for lazy evaluation, that should be mentioned in the docs.

Minimal Complete Verifiable Example:

import numpy as np
import xarray as xr


def worker(da):
    # map_blocks first runs the function on a 0-sized mock-up of the
    # data to infer the output metadata; let that probe pass through.
    if da.shape == (0, 0):
        return da

    # Any real block reaching this point means evaluation was forced.
    raise RuntimeError("I was evaluated")


da = xr.DataArray(
    np.random.normal(size=(20, 500)),
    dims=("x", "y"),
    coords=(np.arange(20), np.arange(500)),
)

da = da.chunk(dict(x=5))
lazy = da.map_blocks(worker)
result1 = lazy.sum("x", skipna=True)
result2 = lazy.sum("x", skipna=True, min_count=5)

What happened: RuntimeError: I was evaluated

What you expected to happen: No output or exceptions, as the result1 and result2 arrays are not printed or saved.

Environment:

Output of xr.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.9.1 (default, Feb 6 2021, 06:49:13)
[GCC 10.2.0]
python-bits: 64
OS: Linux
OS-release: 5.10.15-arch1-1
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_NZ.UTF-8
LOCALE: en_NZ.UTF-8
libhdf5: 1.12.0
libnetcdf: 4.7.4

xarray: 0.16.2
pandas: 1.2.1
numpy: 1.20.0
scipy: 1.6.0
netCDF4: 1.5.5.1
pydap: None
h5netcdf: 0.9.0
h5py: 3.1.0
Nio: None
zarr: None
cftime: 1.4.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.2.0
cfgrib: None
iris: None
bottleneck: 1.3.2
dask: 2020.12.0
distributed: 2020.12.0
matplotlib: 3.3.4
cartopy: 0.18.0
seaborn: None
numbagg: None
pint: None
setuptools: 53.0.0
pip: 20.3.1
conda: None
pytest: 6.2.1
IPython: 7.19.0
sphinx: 3.4.3

@mathause (Collaborator)

Thanks for the report; the culprit is likely:

if null_mask.any():

We fixed a similar problem in weighted.
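
For reference, a minimal sketch with a plain dask array (assuming dask is installed) of why that line forces evaluation: the if statement has to coerce the lazy reduction to a concrete Python bool, which computes the whole graph.

import dask.array as darray

mask = darray.zeros((4, 4), chunks=2, dtype=bool)

# mask.any() is itself a lazy 0-d array, but the if statement calls
# bool() on it, which triggers a compute of the underlying graph.
if mask.any():
    pass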

@bcbnz changed the title from "Sum with min_count forces evaluation" to "Sum and prod with min_count forces evaluation" on Feb 12, 2021
@bcbnz (Contributor, Author) commented Feb 12, 2021

Grepping the code, the only other function that calls _maybe_null_out is prod, and I can confirm the problem exists there too. I've updated the title. MCVE for prod:

import numpy as np
import xarray as xr


def worker(da):
    if da.shape == (0, 0):
        return da

    raise RuntimeError("I was evaluated")


da = xr.DataArray(
    np.random.normal(size=(20, 500)),
    dims=("x", "y"),
    coords=(np.arange(20), np.arange(500)),
)

da = da.chunk(dict(x=5))
lazy = da.map_blocks(worker)
result1 = lazy.prod("x", skipna=True)
result2 = lazy.prod("x", skipna=True, min_count=5)

@dcherian (Contributor)

Can we use np.where instead of this if condition?
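
A minimal sketch of the suggested pattern with plain dask arrays (hypothetical mask and values, not the actual _maybe_null_out internals): np.where dispatches to dask and builds a lazy elementwise graph, so no data-dependent branch is needed.

import dask.array as darray
import numpy as np

mask = darray.zeros((4,), chunks=2, dtype=bool)
values = darray.ones((4,), chunks=2)

# Instead of branching on bool(mask.any()), select fill values lazily.
result = np.where(mask, np.nan, values)
print(type(result))  # dask.array.core.Array -- nothing computed yet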

@bcbnz (Contributor, Author) commented Feb 12, 2021

A quick check with the debugger confirms that the null_mask.any() call is what forces the compute.

I think I've found another problem with _maybe_null_out when it reduces over all dimensions. With this altered MCVE:

import numpy as np
import xarray as xr

def worker(da):
    # Let the 0-sized metadata-inference probe pass through untouched.
    if da.shape == (0, 0):
        return da

    # Return a block that is NaN everywhere except a single element.
    res = xr.full_like(da, np.nan)
    res[0, 0] = 1
    return res


da = xr.DataArray(
    np.random.normal(size=(20, 500)),
    dims=("x", "y"),
    coords=(np.arange(20), np.arange(500)),
)

da = da.chunk(dict(x=5))
lazy = da.map_blocks(worker)
result_allaxes = lazy.sum(skipna=True, min_count=5)
result_allaxes.load()

I would expect result_allaxes to be NaN: there are four blocks and therefore only four non-NaN values, fewer than min_count=5. Instead it is 4.

The problem seems to be the dtype check:

elif getattr(result, "dtype", None) not in dtypes.NAT_TYPES:

The test returns True for float64, so the block isn't run. Another MCVE:

import numpy as np
from xarray.core import dtypes

print(dtypes.NAT_TYPES)
print(np.dtype("float64") in dtypes.NAT_TYPES)

Output:

(numpy.datetime64('NaT'), numpy.timedelta64('NaT'))
True

where I think False would be expected. Should I open a separate issue for this or can we track it here too?
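
In the meantime, a sketch of a dtype-kind check (hypothetical _is_nat_dtype helper, not existing xarray API) that would not match float64:

import numpy as np

def _is_nat_dtype(dtype):
    # Match datetime64/timedelta64 of any unit by dtype kind, rather
    # than comparing against NaT scalar values via `in`.
    return np.issubdtype(dtype, np.datetime64) or np.issubdtype(
        dtype, np.timedelta64
    )

print(_is_nat_dtype(np.dtype("float64")))         # False
print(_is_nat_dtype(np.dtype("datetime64[ns]")))  # True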

@bcbnz (Contributor, Author) commented Feb 12, 2021

@dcherian it looks like that works. A better test script:

import numpy as np
import xarray as xr
from xarray.tests import raise_if_dask_computes


def worker(da):
    # Let the 0-sized metadata-inference probe pass through untouched.
    if da.shape == (0, 0):
        return da

    # Keep values > 1 and turn the rest into NaN.
    return da.where(da > 1)


np.random.seed(1023)
da = xr.DataArray(
    np.random.normal(size=(20, 500)),
    dims=("x", "y"),
    coords=(np.arange(20), np.arange(500)),
)

da = da.chunk(dict(x=5))
lazy = da.map_blocks(worker)

with raise_if_dask_computes():
    result = lazy.sum("x", skipna=True, min_count=5)

result.load()

assert np.isnan(result[0])
assert not np.isnan(result[6])

If I then remove the if null_mask.any() check and the block under it, and replace them with:

dtype, fill_value = dtypes.maybe_promote(result.dtype)
result = result.astype(dtype)
result = np.where(null_mask, fill_value, result)

it passes. I can start working on a pull request with these tests and changes if that looks acceptable to you.

How would you suggest handling the possible type promotion from the current dtype, fill_value = dtypes.maybe_promote(result.dtype) line? Currently it only tries promoting if the mask is True anywhere. Always promote, or just use the fill value and hope it works out?
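
For context, a quick illustration of what maybe_promote returns, if I read the internal API right (a promoted dtype plus a fill value able to represent missing data):

import numpy as np
from xarray.core import dtypes

# int64 cannot hold NaN, so it is promoted to float64 with fill NaN;
# float32 can already hold NaN and is left unchanged.
print(dtypes.maybe_promote(np.dtype("int64")))    # (dtype('float64'), nan)
print(dtypes.maybe_promote(np.dtype("float32")))  # (dtype('float32'), nan)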
