broadcast() broken on dask backend #978

Closed
crusaderky opened this issue Aug 20, 2016 · 4 comments

Comments

@crusaderky
Contributor

>>> a = xarray.DataArray([1,2]).chunk(1)
>>> a
<xarray.DataArray (dim_0: 2)>
dask.array<xarray-..., shape=(2,), dtype=int64, chunksize=(1,)>
Coordinates:
  * dim_0    (dim_0) int64 0 1
>>> xarray.broadcast(a)
(<xarray.DataArray (dim_0: 2)>
 array([1, 2])
 Coordinates:
   * dim_0    (dim_0) int64 0 1,)

The problem is actually in the DataArray constructor.
At alignment.py:362 we have return DataArray(data, ...), where data is a Variable with a dask backend, yet the returned DataArray object has a numpy backend.
As a workaround, changing that line to return DataArray(data.data, ...) (thus passing the dask array directly) fixes the problem.

After that, however, there's a new issue: whenever broadcast adds a dimension to an array, it creates it as a single chunk instead of copying the chunking of the other arrays. This can easily cause a host to run out of memory, and it makes the arrays harder to work with afterwards because their chunks won't match.

@shoyer
Member

shoyer commented Aug 21, 2016

Oops -- let's add a fix for this and a regression test in test_dask.py.

We should fix broadcast as you mention, but also fix the as_compatible_data function to try coercing data via the .data attribute before using .values; the relevant line is currently:

data = getattr(data, 'values', data)
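The proposed coercion-order change can be sketched with stand-in classes (FakeDaskArray and FakeVariable below are hypothetical illustrations of the duck typing involved, not real xarray or dask names):

```python
import numpy as np

class FakeDaskArray:
    """Stand-in for a dask array: lazy, and has no .values attribute."""
    def __init__(self, values):
        self._values = values

class FakeVariable:
    """Stand-in for an xarray Variable wrapping a dask backend."""
    def __init__(self, lazy):
        self._lazy = lazy

    @property
    def data(self):
        return self._lazy                      # keeps the lazy backend

    @property
    def values(self):
        return np.asarray(self._lazy._values)  # materializes into numpy

def as_compatible_data_current(data):
    # current behaviour: .values wins, so a lazy backend is eagerly computed
    return getattr(data, 'values', data)

def as_compatible_data_proposed(data):
    # proposed behaviour: prefer .data, which preserves the lazy backend,
    # and only fall back to .values
    data = getattr(data, 'data', data)
    return getattr(data, 'values', data)

var = FakeVariable(FakeDaskArray([1, 2]))
print(type(as_compatible_data_current(var)).__name__)   # ndarray
print(type(as_compatible_data_proposed(var)).__name__)  # FakeDaskArray
```

The key point is that a dask array has no .values attribute, so once .data has been preferred, the getattr fall-through leaves the lazy array untouched.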

> After that, however, there's a new issue: whenever broadcast adds a dimension to an array, it creates it as a single chunk instead of copying the chunking of the other arrays. This can easily cause a host to run out of memory, and it makes the arrays harder to work with afterwards because their chunks won't match.

This is sort of but not completely right. We use dask.array.broadcast_to to expand dimensions for dask arrays, which under the hood uses numpy.broadcast_to for each chunk. Broadcasting uses a view to insert a new dimension with stride 0, so it doesn't incur any additional storage cost for the original array. But any arrays resulting from arithmetic will indeed require more space.
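The stride-0 trick is easy to see with plain numpy (a small sketch; dask.array.broadcast_to applies the same mechanism chunk by chunk):

```python
import numpy as np

a = np.array([1, 2])                # shape (2,)
b = np.broadcast_to(a, (1000, 2))   # shape (1000, 2), but no copy is made

# The inserted dimension has stride 0: every "row" of b points at the
# same two elements of a, so broadcasting allocates no extra storage.
print(b.strides[0])             # 0
print(np.shares_memory(a, b))   # True
print(b.flags.writeable)        # False -- it's a read-only view

# Arithmetic on the broadcast view, however, produces a real array of
# the full shape, which is where the extra space gets used.
c = b + 0
print(c.shape, c.base is None)  # (1000, 2) True
```

This is why the broadcast itself is cheap even with a single chunk, while any downstream computation pays for the full expanded shape.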

@crusaderky
Contributor Author

looking into this now

@crusaderky
Contributor Author

Two-liner for the win #1022

@crusaderky
Contributor Author

Rebaselined #1023
