
.reduce() on a DataArray with Dask distributed immediately executes the preceding portions of the computational graph #3161


Closed
zbarry opened this issue Jul 25, 2019 · 4 comments · Fixed by #3435


zbarry commented Jul 25, 2019

MCVE Code Sample

.mean() on a DataArray backed by a Dask array returns a dask-backed DataArray, as expected:

import numpy as np
import xarray as xr

da = xr.DataArray(np.zeros((500, 500, 500))).chunk((100, 100, 100)).mean('dim_0')
da
<xarray.DataArray (dim_1: 500, dim_2: 500)>
dask.array<shape=(500, 500), dtype=float64, chunksize=(100, 100)>
Dimensions without coordinates: dim_1, dim_2

Calling .compute() on this result produces the expected result:

da = xr.DataArray(np.zeros((500, 500, 500))).chunk((100, 100, 100)).mean('dim_0').compute()
da
<xarray.DataArray (dim_1: 500, dim_2: 500)>
array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
Dimensions without coordinates: dim_1, dim_2

Calling .reduce(), however, immediately executes all of the previously queued computations before the supplied function is even called:

def func(x, axis=None):
    return x

da = xr.DataArray(np.zeros((500, 500, 500))).chunk((100, 100, 100)).mean('dim_0').reduce(func)
da
<xarray.DataArray (dim_1: 500, dim_2: 500)>
array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
Dimensions without coordinates: dim_1, dim_2

Expected Output

A dask-backed DataArray when .reduce(func) isn't followed by .compute().

Problem Description

When using Dask distributed, calling .reduce() immediately executes the computational graph built so far, instead of adding the reduction as another node in the DAG. This execution happens before the function passed to .reduce() is even called.
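The distinction can be seen with dask alone (a minimal sketch, assuming dask is installed; the array sizes are arbitrary). Building a reduction only adds a node to the task graph, while coercing to a concrete array, which is roughly what accessing .values does internally, executes the whole graph:

```python
import dask.array as darr
import numpy as np

# Building a reduction only adds a node to the task graph; nothing runs yet.
x = darr.zeros((6, 6, 6), chunks=(3, 3, 3))
m = x.mean(axis=0)
print(type(m))  # still a lazy dask array

# Coercing to a concrete array (what accessing .values amounts to)
# executes the whole graph up to this point.
v = np.asarray(m)
print(type(v))  # a computed numpy array
```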

Output of xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.7.3 | packaged by conda-forge | (default, Jul 1 2019, 21:52:21) [GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 3.10.0-693.21.1.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.4
libnetcdf: 4.6.2

xarray: 0.12.3
pandas: 0.25.0
numpy: 1.16.4
scipy: 1.3.0
netCDF4: 1.5.1.2
pydap: None
h5netcdf: None
h5py: 2.9.0
Nio: None
zarr: None
cftime: 1.0.3.4
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2.1.0
distributed: 2.1.0
matplotlib: 3.1.1
cartopy: None
seaborn: None
numbagg: None
setuptools: 41.0.1
pip: 19.2.1
conda: None
pytest: 5.0.1
IPython: 7.6.1
sphinx: 2.1.2

shoyer added the bug label Jul 26, 2019

shoyer commented Jul 26, 2019

Yikes, this is pretty bad!

Thanks for the clear code to reproduce it.

dcherian commented

This ends up happening because allow_lazy defaults to False:

xarray/xarray/core/variable.py

Lines 1412 to 1460 in 3f9069b

def reduce(
    self,
    func,
    dim=None,
    axis=None,
    keep_attrs=None,
    keepdims=False,
    allow_lazy=False,
    **kwargs
):
    """Reduce this array by applying `func` along some dimension(s).

    Parameters
    ----------
    func : function
        Function which can be called in the form
        `func(x, axis=axis, **kwargs)` to return the result of reducing an
        np.ndarray over an integer valued axis.
    dim : str or sequence of str, optional
        Dimension(s) over which to apply `func`.
    axis : int or sequence of int, optional
        Axis(es) over which to apply `func`. Only one of the 'dim'
        and 'axis' arguments can be supplied. If neither are supplied, then
        the reduction is calculated over the flattened array (by calling
        `func(x)` without an axis argument).
    keep_attrs : bool, optional
        If True, the variable's attributes (`attrs`) will be copied from
        the original object to the new one. If False (default), the new
        object will be returned without attributes.
    keepdims : bool, default False
        If True, the dimensions which are reduced are left in the result
        as dimensions of size one
    **kwargs : dict
        Additional keyword arguments passed on to `func`.

    Returns
    -------
    reduced : Array
        Array with summarized data and the indicated dimension(s)
        removed.
    """
    if dim is common.ALL_DIMS:
        dim = None
    if dim is not None and axis is not None:
        raise ValueError("cannot supply both 'axis' and 'dim' arguments")
    if dim is not None:
        axis = self.get_axis_num(dim)
    input_data = self.data if allow_lazy else self.values
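The last line above is the crux: self.values coerces a dask-backed variable to NumPy, executing the graph, while self.data returns the dask array untouched. A standalone sketch of that branch (pick_input is a hypothetical helper written for illustration, assuming dask is installed):

```python
import dask.array as darr
import numpy as np

def pick_input(data, allow_lazy):
    # Mirrors `input_data = self.data if allow_lazy else self.values`:
    # np.asarray on a dask array triggers computation, like .values does.
    return data if allow_lazy else np.asarray(data)

x = darr.zeros((4, 4), chunks=(2, 2))
print(type(pick_input(x, allow_lazy=True)))   # stays a lazy dask array
print(type(pick_input(x, allow_lazy=False)))  # graph executed -> numpy array
```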

I don't see why we need this kwarg or why it shouldn't be True by default. This kwarg is undocumented in dataset.py

xarray/xarray/core/dataset.py

Lines 3997 to 4029 in 3f9069b

def reduce(
    self,
    func: Callable,
    dim: Union[Hashable, Iterable[Hashable]] = None,
    keep_attrs: bool = None,
    keepdims: bool = False,
    numeric_only: bool = False,
    allow_lazy: bool = False,
    **kwargs: Any,
) -> "Dataset":
    """Reduce this dataset by applying `func` along some dimension(s).

    Parameters
    ----------
    func : callable
        Function which can be called in the form
        `f(x, axis=axis, **kwargs)` to return the result of reducing an
        np.ndarray over an integer valued axis.
    dim : str or sequence of str, optional
        Dimension(s) over which to apply `func`. By default `func` is
        applied over all dimensions.
    keep_attrs : bool, optional
        If True, the dataset's attributes (`attrs`) will be copied from
        the original object to the new one. If False (default), the new
        object will be returned without attributes.
    keepdims : bool, default False
        If True, the dimensions which are reduced are left in the result
        as dimensions of size one. Coordinates that use these dimensions
        are removed.
    numeric_only : bool, optional
        If True, only apply ``func`` to variables with a numeric dtype.
    **kwargs : Any
        Additional keyword arguments passed on to ``func``.

and the kwarg is not exposed at all in dataarray.py

def reduce(
    self,
    func: Callable[..., Any],
    dim: Union[None, Hashable, Sequence[Hashable]] = None,
    axis: Union[None, int, Sequence[int]] = None,
    keep_attrs: bool = None,
    keepdims: bool = False,
    **kwargs: Any
) -> "DataArray":
    """Reduce this array by applying `func` along some dimension(s).

    Parameters
    ----------
    func : function
        Function which can be called in the form
        `f(x, axis=axis, **kwargs)` to return the result of reducing an
        np.ndarray over an integer valued axis.
    dim : hashable or sequence of hashables, optional
        Dimension(s) over which to apply `func`.
    axis : int or sequence of int, optional
        Axis(es) over which to repeatedly apply `func`. Only one of the
        'dim' and 'axis' arguments can be supplied. If neither are
        supplied, then the reduction is calculated over the flattened array
        (by calling `f(x)` without an axis argument).
    keep_attrs : bool, optional
        If True, the variable's attributes (`attrs`) will be copied from
        the original object to the new one. If False (default), the new
        object will be returned without attributes.
    keepdims : bool, default False
        If True, the dimensions which are reduced are left in the result
        as dimensions of size one. Coordinates that use these dimensions
        are removed.
    **kwargs : dict
        Additional keyword arguments passed on to `func`.

shoyer commented Oct 15, 2019

I don't remember exactly why I added the allow_lazy argument. I think my original concern was backwards compatibility (when we were first adding dask!) with uses of reduce that expected to be applied to NumPy arrays, not dask arrays.

We do something similar in apply_ufunc with dask='forbidden', but rather than automatically coercing to NumPy arrays we raise an error if a dask array is encountered. This seems much more sensible.

At this point, I think we would probably just remove the argument and always default to allow_lazy=True. Or possibly allow_lazy=False should result in an error instead of automatic coercion, and we should expose/document it as a public argument.
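Until something like that lands, the apply_ufunc route mentioned above can keep a custom reduction lazy; a minimal sketch with dask='allowed' (small array for illustration, assuming dask and xarray are installed):

```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.zeros((6, 6, 6))).chunk((3, 3, 3))

# dask='allowed' passes the dask array straight through to the function,
# so the reduction is added to the graph instead of being computed eagerly.
# (The default, dask='forbidden', raises on dask input rather than coercing.)
result = xr.apply_ufunc(
    lambda x: x.mean(axis=-1),   # reduces the moved core dimension
    da,
    input_core_dims=[['dim_0']],
    dask='allowed',
)
print(result.chunks)  # non-None: still dask-backed
```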

dcherian commented

> we would probably just remove the argument and always default to allow_lazy=True.

This seems best.
