
.reduce() on a DataArray with Dask distributed immediately executes the preceding portions of the computational graph #3161


Closed
zbarry opened this issue Jul 25, 2019 · 4 comments · Fixed by #3435


zbarry commented Jul 25, 2019

MCVE Code Sample

.mean() on a DataArray backed by a Dask array returns a dask-backed DataArray, as expected:

import numpy as np
import xarray as xr

da = xr.DataArray(np.zeros((500, 500, 500))).chunk((100, 100, 100)).mean('dim_0')
da
<xarray.DataArray (dim_1: 500, dim_2: 500)>
dask.array<shape=(500, 500), dtype=float64, chunksize=(100, 100)>
Dimensions without coordinates: dim_1, dim_2

Calling .compute() on this result produces the expected result:

da = xr.DataArray(np.zeros((500, 500, 500))).chunk((100, 100, 100)).mean('dim_0').compute()
da
<xarray.DataArray (dim_1: 500, dim_2: 500)>
array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
Dimensions without coordinates: dim_1, dim_2

Calling .reduce(), however, immediately executes all of the previously queued computations before the supplied function is even called:

def func(x, axis=None):
    return x

da = xr.DataArray(np.zeros((500, 500, 500))).chunk((100, 100, 100)).mean('dim_0').reduce(func)
da
<xarray.DataArray (dim_1: 500, dim_2: 500)>
array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
Dimensions without coordinates: dim_1, dim_2

Expected Output

A dask-backed DataArray when .reduce(func) isn't followed by .compute().

Problem Description

When using Dask distributed, calling .reduce() immediately executes the computational graph built so far, instead of adding the reduction as another node in the DAG. This execution happens before the function passed to .reduce() is even called.
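The distinction can be seen with dask alone (a minimal sketch, assuming dask is installed; the array sizes are arbitrary). Building a reduction only adds a node to the task graph, while coercing to a concrete array, which is roughly what accessing .values does internally, executes the whole graph:

```python
import dask.array as darr
import numpy as np

# Building a reduction only adds a node to the task graph; nothing runs yet.
x = darr.zeros((6, 6, 6), chunks=(3, 3, 3))
m = x.mean(axis=0)
print(type(m))  # still a lazy dask array

# Coercing to a concrete array (what accessing .values amounts to)
# executes the whole graph up to this point.
v = np.asarray(m)
print(type(v))  # a computed numpy array
```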

Output of xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.7.3 | packaged by conda-forge | (default, Jul 1 2019, 21:52:21) [GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 3.10.0-693.21.1.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.4
libnetcdf: 4.6.2

xarray: 0.12.3
pandas: 0.25.0
numpy: 1.16.4
scipy: 1.3.0
netCDF4: 1.5.1.2
pydap: None
h5netcdf: None
h5py: 2.9.0
Nio: None
zarr: None
cftime: 1.0.3.4
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2.1.0
distributed: 2.1.0
matplotlib: 3.1.1
cartopy: None
seaborn: None
numbagg: None
setuptools: 41.0.1
pip: 19.2.1
conda: None
pytest: 5.0.1
IPython: 7.6.1
sphinx: 2.1.2

shoyer added the bug label Jul 26, 2019

shoyer commented Jul 26, 2019

Yikes, this is pretty bad!

Thanks for the clear code to reproduce it.

dcherian commented

This ends up happening because allow_lazy defaults to False:

xarray/xarray/core/variable.py

Lines 1412 to 1460 in 3f9069b

def reduce(
    self,
    func,
    dim=None,
    axis=None,
    keep_attrs=None,
    keepdims=False,
    allow_lazy=False,
    **kwargs
):
    """Reduce this array by applying `func` along some dimension(s).

    Parameters
    ----------
    func : function
        Function which can be called in the form
        `func(x, axis=axis, **kwargs)` to return the result of reducing an
        np.ndarray over an integer valued axis.
    dim : str or sequence of str, optional
        Dimension(s) over which to apply `func`.
    axis : int or sequence of int, optional
        Axis(es) over which to apply `func`. Only one of the 'dim'
        and 'axis' arguments can be supplied. If neither are supplied, then
        the reduction is calculated over the flattened array (by calling
        `func(x)` without an axis argument).
    keep_attrs : bool, optional
        If True, the variable's attributes (`attrs`) will be copied from
        the original object to the new one. If False (default), the new
        object will be returned without attributes.
    keepdims : bool, default False
        If True, the dimensions which are reduced are left in the result
        as dimensions of size one
    **kwargs : dict
        Additional keyword arguments passed on to `func`.

    Returns
    -------
    reduced : Array
        Array with summarized data and the indicated dimension(s)
        removed.
    """
    if dim is common.ALL_DIMS:
        dim = None
    if dim is not None and axis is not None:
        raise ValueError("cannot supply both 'axis' and 'dim' arguments")
    if dim is not None:
        axis = self.get_axis_num(dim)
    input_data = self.data if allow_lazy else self.values
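The last line above is the crux: self.values coerces a dask-backed variable to NumPy, executing the graph, while self.data returns the dask array untouched. A standalone sketch of that branch (pick_input is a hypothetical helper written for illustration, assuming dask is installed):

```python
import dask.array as darr
import numpy as np

def pick_input(data, allow_lazy):
    # Mirrors `input_data = self.data if allow_lazy else self.values`:
    # np.asarray on a dask array triggers computation, like .values does.
    return data if allow_lazy else np.asarray(data)

x = darr.zeros((4, 4), chunks=(2, 2))
print(type(pick_input(x, allow_lazy=True)))   # stays a lazy dask array
print(type(pick_input(x, allow_lazy=False)))  # graph executed -> numpy array
```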

I don't see why we need this kwarg or why it shouldn't be True by default. This kwarg is undocumented in dataset.py

xarray/xarray/core/dataset.py

Lines 3997 to 4029 in 3f9069b

def reduce(
    self,
    func: Callable,
    dim: Union[Hashable, Iterable[Hashable]] = None,
    keep_attrs: bool = None,
    keepdims: bool = False,
    numeric_only: bool = False,
    allow_lazy: bool = False,
    **kwargs: Any,
) -> "Dataset":
    """Reduce this dataset by applying `func` along some dimension(s).

    Parameters
    ----------
    func : callable
        Function which can be called in the form
        `f(x, axis=axis, **kwargs)` to return the result of reducing an
        np.ndarray over an integer valued axis.
    dim : str or sequence of str, optional
        Dimension(s) over which to apply `func`. By default `func` is
        applied over all dimensions.
    keep_attrs : bool, optional
        If True, the dataset's attributes (`attrs`) will be copied from
        the original object to the new one. If False (default), the new
        object will be returned without attributes.
    keepdims : bool, default False
        If True, the dimensions which are reduced are left in the result
        as dimensions of size one. Coordinates that use these dimensions
        are removed.
    numeric_only : bool, optional
        If True, only apply ``func`` to variables with a numeric dtype.
    **kwargs : Any
        Additional keyword arguments passed on to ``func``.

and the kwarg is not exposed at all in dataarray.py

def reduce(
    self,
    func: Callable[..., Any],
    dim: Union[None, Hashable, Sequence[Hashable]] = None,
    axis: Union[None, int, Sequence[int]] = None,
    keep_attrs: bool = None,
    keepdims: bool = False,
    **kwargs: Any
) -> "DataArray":
    """Reduce this array by applying `func` along some dimension(s).

    Parameters
    ----------
    func : function
        Function which can be called in the form
        `f(x, axis=axis, **kwargs)` to return the result of reducing an
        np.ndarray over an integer valued axis.
    dim : hashable or sequence of hashables, optional
        Dimension(s) over which to apply `func`.
    axis : int or sequence of int, optional
        Axis(es) over which to repeatedly apply `func`. Only one of the
        'dim' and 'axis' arguments can be supplied. If neither are
        supplied, then the reduction is calculated over the flattened array
        (by calling `f(x)` without an axis argument).
    keep_attrs : bool, optional
        If True, the variable's attributes (`attrs`) will be copied from
        the original object to the new one. If False (default), the new
        object will be returned without attributes.
    keepdims : bool, default False
        If True, the dimensions which are reduced are left in the result
        as dimensions of size one. Coordinates that use these dimensions
        are removed.
    **kwargs : dict
        Additional keyword arguments passed on to `func`.

shoyer commented Oct 15, 2019

I don't remember exactly why I added the allow_lazy argument. I think my original concern was backwards compatibility (when we were first adding dask!) with uses of reduce that expected to be applied to NumPy arrays, not dask arrays.

We do something similar in apply_ufunc with dask='forbidden', but rather than automatically coercing to NumPy arrays we raise an error if a dask array is encountered. This seems much more sensible.

At this point, I think we would probably just remove the argument and always default to allow_lazy=True. Or possibly allow_lazy=False should result in an error instead of automatic coercion, and we should expose/document it as a public argument.
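Until something like that lands, the apply_ufunc route mentioned above can keep a custom reduction lazy; a minimal sketch with dask='allowed' (small array for illustration, assuming dask and xarray are installed):

```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.zeros((6, 6, 6))).chunk((3, 3, 3))

# dask='allowed' passes the dask array straight through to the function,
# so the reduction is added to the graph instead of being computed eagerly.
# (The default, dask='forbidden', raises on dask input rather than coercing.)
result = xr.apply_ufunc(
    lambda x: x.mean(axis=-1),   # reduces the moved core dimension
    da,
    input_core_dims=[['dim_0']],
    dask='allowed',
)
print(result.chunks)  # non-None: still dask-backed
```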

dcherian commented

> we would probably just remove the argument and always default to allow_lazy=True.

This seems best.
