broadcast() broken on dask backend #978

Closed
crusaderky opened this issue Aug 20, 2016 · 4 comments

Comments

@crusaderky
Contributor

>>> a = xarray.DataArray([1,2]).chunk(1)
>>> a
<xarray.DataArray (dim_0: 2)>
dask.array<xarray-..., shape=(2,), dtype=int64, chunksize=(1,)>
Coordinates:
  * dim_0    (dim_0) int64 0 1
>>> xarray.broadcast(a)
(<xarray.DataArray (dim_0: 2)>
 array([1, 2])
 Coordinates:
   * dim_0    (dim_0) int64 0 1,)

The problem is actually in the DataArray constructor.
At alignment.py:362 we have return DataArray(data, ...), where data is a Variable with a dask backend, yet the returned DataArray object has a numpy backend.
As a workaround, changing that line to return DataArray(data.data, ...) (thus passing the dask array directly) fixes the problem.

After that, however, there's a new issue: whenever broadcast adds a dimension to an array, it creates it as a single chunk instead of copying the chunking of the other arrays. This can easily cause a host to run out of memory, and it makes the arrays harder to work with afterwards because their chunks won't match.

@shoyer
Member

shoyer commented Aug 21, 2016

Oops -- let's add a fix for this and a regression test in test_dask.py.

We should fix broadcast as you mention, but also fix the as_compatible_data function to try coercing data via the .data attribute before using .values; the relevant line is currently:

data = getattr(data, 'values', data)
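The proposed coercion-order change can be sketched with stand-in classes (FakeDaskArray and FakeVariable below are hypothetical illustrations of the duck typing involved, not real xarray or dask names):

```python
import numpy as np

class FakeDaskArray:
    """Stand-in for a dask array: lazy, and has no .values attribute."""
    def __init__(self, values):
        self._values = values

class FakeVariable:
    """Stand-in for an xarray Variable wrapping a dask backend."""
    def __init__(self, lazy):
        self._lazy = lazy

    @property
    def data(self):
        return self._lazy                      # keeps the lazy backend

    @property
    def values(self):
        return np.asarray(self._lazy._values)  # materializes into numpy

def as_compatible_data_current(data):
    # current behaviour: .values wins, so a lazy backend is eagerly computed
    return getattr(data, 'values', data)

def as_compatible_data_proposed(data):
    # proposed behaviour: prefer .data, which preserves the lazy backend,
    # and only fall back to .values
    data = getattr(data, 'data', data)
    return getattr(data, 'values', data)

var = FakeVariable(FakeDaskArray([1, 2]))
print(type(as_compatible_data_current(var)).__name__)   # ndarray
print(type(as_compatible_data_proposed(var)).__name__)  # FakeDaskArray
```

The key point is that a dask array has no .values attribute, so once .data has been preferred, the getattr fall-through leaves the lazy array untouched.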

> After that, however, there's a new issue: whenever broadcast adds a dimension to an array, it creates it as a single chunk instead of copying the chunking of the other arrays. This can easily cause a host to run out of memory, and it makes the arrays harder to work with afterwards because their chunks won't match.

This is sort of but not completely right. We use dask.array.broadcast_to to expand dimensions for dask arrays, which under the hood uses numpy.broadcast_to for each chunk. Broadcasting uses a view to insert a new dimension with stride 0, so it doesn't incur any additional storage cost for the original array. But any arrays resulting from arithmetic will indeed require more space.
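The stride-0 trick is easy to see with plain numpy (a small sketch; dask.array.broadcast_to applies the same mechanism chunk by chunk):

```python
import numpy as np

a = np.array([1, 2])                # shape (2,)
b = np.broadcast_to(a, (1000, 2))   # shape (1000, 2), but no copy is made

# The inserted dimension has stride 0: every "row" of b points at the
# same two elements of a, so broadcasting allocates no extra storage.
print(b.strides[0])             # 0
print(np.shares_memory(a, b))   # True
print(b.flags.writeable)        # False -- it's a read-only view

# Arithmetic on the broadcast view, however, produces a real array of
# the full shape, which is where the extra space gets used.
c = b + 0
print(c.shape, c.base is None)  # (1000, 2) True
```

This is why the broadcast itself is cheap even with a single chunk, while any downstream computation pays for the full expanded shape.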

@crusaderky
Contributor Author

looking into this now

@crusaderky
Contributor Author

Two-liner for the win #1022

@crusaderky
Contributor Author

Rebaselined #1023
