[WIP] Add map_blocks. #3258
Conversation
This looks great! I left a bunch of tiny suggestions from a Dask Array perspective.
Hi, a few design opinions. For example, chaining with other methods should remain possible: `myarray.map(func1).chunk().map(func2).sum().compute()`
I agree, though I still think this particular set of functionality should be called `map_blocks`.
@shoyer let me rephrase: `apply_ufunc` is extremely powerful, and when you need to cope with all possible shape transformations, I suspect its verbosity is quite necessary. The thing I have against the name `map_blocks` is that backends other than dask have no notion of blocks...
Yes, 100% agreed! There is a real need for a simpler version of `apply_ufunc`.
I think the functionality in this PR is fundamentally dask specific. We shouldn't make a habit of adding backend specific features, but it makes sense in limited cases.
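For readers following along, here is a minimal sketch of the kind of call being discussed. The method name and signature follow the proposal in this PR (`func` receives each block as a DataArray and should return one of the same shape); treat it as illustrative rather than final:

```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.ones((10, 20)), dims=["x", "y"]).chunk({"x": 5})

def add_one(block):
    # `block` is a DataArray wrapping a single dask chunk
    return block + 1

result = da.map_blocks(add_one)  # lazy: one task per chunk
result.compute()
```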
I started prototyping a Dataset version. Here's what I have:

```python
import dask
import numpy as np
import xarray as xr

darray = xr.DataArray(np.ones((10, 20)),
                      dims=['x', 'y'],
                      coords={'x': np.arange(10), 'y': np.arange(100, 120)})
dset = darray.to_dataset(name='a')
dset['b'] = dset.a + 50
dset['c'] = (dset.x + 20)
dset = dset.chunk({'x': 4, 'y': 5})
```

The function I'm applying takes a dataset and returns a DataArray, because that's easy to test without figuring out how to assemble everything back into a dataset.

```python
import itertools

# function takes a dataset and returns a dataarray so that I can check
# that things work without reconstructing a dataset
def function(ds):
    return ds.a + 10

dataset_dims = list(dset.dims)
graph = {}
gname = 'dsnew'

# map dims to list of chunk indexes.
# If different variables have different chunking along the same dim,
# the call to .chunks will raise an error.
ichunk = {dim: range(len(dset.chunks[dim])) for dim in dataset_dims}

# iterate over all possible chunk combinations
for v in itertools.product(*ichunk.values()):
    chunk_index_dict = dict(zip(dataset_dims, v))
    data_vars = {}
    for name, variable in dset.data_vars.items():
        # why does __dask_keys__ have an extra level of nesting?
        # the [0] is not required for dataarrays
        var_dask_keys = variable.__dask_keys__()[0]

        # recursively index into the nested dask_keys list
        chunk = var_dask_keys
        for dim in variable.dims:
            chunk = chunk[chunk_index_dict[dim]]

        # I now have the key corresponding to this chunk;
        # the tuple below goes into a dictionary passed to xr.Dataset().
        # dask doesn't seem to replace it with a numpy array at execution time.
        data_vars[name] = (variable.dims, chunk)

    graph[(gname,) + v] = (function, (xr.Dataset, data_vars))

final_graph = dask.highlevelgraph.HighLevelGraph.from_collections(gname, graph, dependencies=[dset])
```

Elements of the graph look like the entry quoted in the reply below.
This doesn't work because dask doesn't replace the keys with numpy arrays when the graph is executed. I'm not sure what I'm doing wrong here. An equivalent version for DataArrays works perfectly.
Dask doesn't traverse through tuples to find possible keys, so the keys here are hidden from view:

```python
{'a': (('x', 'y'), ('xarray-a-f178df193efafa67203f3862b3f9f0f4', 0, 0)),
```

I recommend changing the wrapping tuples to lists:

```diff
- {'a': (('x', 'y'), ('xarray-a-f178df193efafa67203f3862b3f9f0f4', 0, 0)),
+ {'a': [('x', 'y'), ('xarray-a-f178df193efafa67203f3862b3f9f0f4', 0, 0)],
```
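For what it's worth, this traversal rule can be seen on a toy graph. A minimal sketch using a hand-built graph and the synchronous scheduler; the keys and values are invented for the example:

```python
import dask

dsk = {
    ("a", 0): 10,
    # Keys nested inside *lists* are found and substituted before the
    # callable runs: this computes sum([10, 5]) == 15.
    "found": (sum, [("a", 0), 5]),
    # A tuple whose first element is not callable is treated as a
    # literal, so the key ('a', 0) buried inside it is never substituted.
    "hidden": (str, (("a", 0), 5)),
}

print(dask.get(dsk, "found"))   # 15
print(dask.get(dsk, "hidden"))  # "(('a', 0), 5)"  <- key left as-is
```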
Thanks @mrocklin. Unfortunately that doesn't work with the Dataset constructor: with a list, it treats the entry as array-like.

Unless @shoyer has another idea, I guess I can insert creating a DataArray into the graph and then refer to those keys in the Dataset constructor.
Then you can construct a tuple as a task.
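In other words, because a tuple whose first element is callable is itself a dask task, the wrapping tuple can be assembled at execution time, after the keys have been substituted. A minimal sketch of the trick on a toy graph with invented names:

```python
import dask

dsk = {
    ("a", 0): 10,
    # (tuple, [...]) is a task: dask traverses the list argument,
    # replaces the key ('a', 0) with its value, then calls tuple()
    # on the substituted list.
    "pair": (tuple, [("x", "y"), ("a", 0)]),
}

print(dask.get(dsk, "pair"))  # (('x', 'y'), 10)
```

Presumably the prototype's `data_vars[name] = (variable.dims, chunk)` would become `data_vars[name] = (tuple, [variable.dims, chunk])` under this approach.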
Thanks. That worked. I have a new version up in #3276 that works with both DataArrays and Datasets.
I'm glad to see progress here. FWIW, I think that many people would be quite happy with a version that just worked for DataArrays, in case that's faster to get in than the full solution with Datasets.
Closing in favour of #3276
- Passes `black . && mypy . && flake8`
- `whats-new.rst` for all changes and `api.rst` for new API

ping @mrocklin @sofroniewn @shanaxel42