
[WIP] Add map_blocks. #3258


Closed
wants to merge 2 commits into from

Conversation

dcherian
Contributor

  • Related: Implementing map_overlap #3147
  • Tests added
  • Passes black . && mypy . && flake8
  • Fully documented, including whats-new.rst for all changes and api.rst for new API

ping @mrocklin @sofroniewn @shanaxel42

Contributor

@mrocklin mrocklin left a comment


This looks great! I left a bunch of tiny suggestions from a Dask Array perspective.

@crusaderky
Contributor

Hi,

A few design opinions:

  1. could we call it just "map"? It makes sense, as this thing would be very useful for non-dask based arrays too. Working routinely with scipy (chiefly with scipy.stats transforms), I tire of writing very verbose xarray.apply_ufunc calls.

  2. could we have it as a method of DataArray and Dataset, to allow for method chaining?

e.g.

myarray.map(func1).chunk().map(func2).sum().compute()

@shoyer
Member

shoyer commented Aug 27, 2019

  • could we call it just "map"? It makes sense as this thing would be very useful for non-dask based arrays too. Working routinely with scipy (chiefly with scipy.stats transforms), I tire a lot of writing very verbose xarray.apply_ufunc calls.

I agree that apply_ufunc is overly verbose. See #1074 and #1618 (and issues linked therein) for discussion about alternative APIs.

I still think this particular set of functionality should be called map_blocks, because it works by applying functions over each block, very similar to dask's map_blocks.
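For readers unfamiliar with the dask API being referenced: dask's map_blocks applies a function to each chunk of an array independently and reassembles the results. A minimal sketch (assuming dask is installed):

```python
import dask.array as da

# map_blocks applies the function to every chunk independently;
# the results are reassembled into a new dask array with the same chunking
x = da.ones((4, 6), chunks=(2, 3))
y = x.map_blocks(lambda block: block + 1)
print(y.compute().sum())  # 48.0 -- every element is 2.0
```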

@crusaderky
Contributor

@shoyer let me rephrase it - apply_ufunc is extremely powerful, and when you need to cope with all possible shape transformations, I suspect its verbosity is quite necessary.
It's just that, when all you need to do is apply an elementwise, embarrassingly parallel function (80% of the time in my real-life experience), apply_ufunc is overkill.
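(As an illustration of that verbosity — a hedged sketch, using np.tanh as a stand-in for a scipy.stats transform, of the boilerplate apply_ufunc needs even for a purely elementwise function on a chunked array:)

```python
import numpy as np
import xarray as xr

arr = xr.DataArray(np.zeros((2, 3)), dims=['x', 'y']).chunk({'x': 1})

# even though np.tanh is elementwise and embarrassingly parallel,
# the dask-aware call still needs the extra keywords
result = xr.apply_ufunc(np.tanh, arr, dask='parallelized', output_dtypes=[float])
print(float(result.compute().sum()))  # 0.0, since tanh(0) == 0
```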

The thing I have against the name map_blocks is that backends other than dask have no notion of blocks...

@shoyer
Member

shoyer commented Aug 27, 2019

apply_ufunc is extremely powerful, and when you need to cope with all possible shape transformations, I suspect its verbosity is quite necessary.
It's just that, when all you need to do is apply an elementwise, embarrassingly parallel function (80% of the time in my real-life experience), apply_ufunc is overkill.

Yes, 100% agreed! There is a real need for a simpler version of apply_ufunc.

The thing I have against the name map_blocks is that backends other than dask have no notion of blocks...

I think the functionality in this PR is fundamentally dask specific. We shouldn't make a habit of adding backend specific features, but it makes sense in limited cases.

@dcherian
Contributor Author

dcherian commented Aug 28, 2019

I started prototyping a Dataset version. Here's what I have:

import dask
import numpy as np
import xarray as xr

darray = xr.DataArray(np.ones((10, 20)),
                      dims=['x', 'y'],
                      coords={'x': np.arange(10), 'y': np.arange(100, 120)})
dset = darray.to_dataset(name='a')
dset['b'] = dset.a + 50
dset['c'] = (dset.x + 20)
dset = dset.chunk({'x': 4, 'y': 5})

The function I'm applying takes a dataset and returns a DataArray because that's easy to test without figuring out how to assemble everything back into a dataset.

import itertools

# function takes dataset and returns dataarray so that I can check that things work without reconstructing a dataset
def function(ds):
    return ds.a + 10

dataset_dims = list(dset.dims)

graph = {}
gname = 'dsnew'

# map dims to list of chunk indexes
# If different variables have different chunking along the same dim
# the call to .chunks will raise an error.
ichunk = {dim: range(len(dset.chunks[dim])) for dim in dataset_dims}

# iterate over all possible chunk combinations
for v in itertools.product(*ichunk.values()):
    chunk_index_dict = dict(zip(dataset_dims, v))
    data_vars = {}
    for name, variable in dset.data_vars.items():
        # why does __dask_keys__ have an extra level?
        # the [0] is not required for dataarrays
        var_dask_keys = variable.__dask_keys__()[0]
        
        # recursively index into dask_keys nested list
        chunk = var_dask_keys
        for dim in variable.dims:
            chunk = chunk[chunk_index_dict[dim]]
            
        # I have key corresponding to chunk
        # this tuple is in a dictionary passed to xr.Dataset()
        # dask doesn't seem to replace this with a numpy array at execution time.
        data_vars[name] = (variable.dims, chunk)
        
    graph[(gname, ) + v] = (function, (xr.Dataset, data_vars))

final_graph = dask.highlevelgraph.HighLevelGraph.from_collections(gname, graph, dependencies=[dset])

Elements of the graph look like

('dsnew', 0, 0): (<function __main__.function(ds)>,
  (xarray.core.dataset.Dataset,
   {'a': (('x', 'y'), ('xarray-a-f178df193efafa67203f3862b3f9f0f4', 0, 0)),
    'b': (('x', 'y'), ('xarray-b-e2d8d06bb9e5c1f351671a94816bd331', 0, 0)),
    'c': (('x',), ('xarray-c-d90f8b2af715b53f4c170be391239655', 0))}))

This doesn't work because dask doesn't replace the keys by numpy arrays when the xr.Dataset call is executed.

result = dask.array.Array(final_graph, name=gname, chunks=dset.a.data.chunks, meta=dset.a.data._meta)
dask.compute(result)
ValueError: Could not convert tuple of form (dims, data[, attrs, encoding]): (('x', 'y'), ('xarray-a-f178df193efafa67203f3862b3f9f0f4', 0, 0)) to Variable.

The graph is "disconnected":
[task graph visualization omitted]

I'm not sure what I'm doing wrong here. An equivalent version for DataArrays works perfectly.

@mrocklin
Contributor

Dask doesn't traverse through tuples to find possible keys, so the keys here are hidden from view:

   {'a': (('x', 'y'), ('xarray-a-f178df193efafa67203f3862b3f9f0f4', 0, 0)),

I recommend replacing the wrapping tuples with lists:

-   {'a': (('x', 'y'), ('xarray-a-f178df193efafa67203f3862b3f9f0f4', 0, 0)),
+   {'a': [('x', 'y'), ('xarray-a-f178df193efafa67203f3862b3f9f0f4', 0, 0)],
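A minimal demonstration of that rule (a sketch using the synchronous scheduler via dask.get): the scheduler traverses lists looking for keys, but a key nested inside a tuple is left untouched.

```python
import dask

dsk = {
    'x': 1,
    # the key 'x' hidden inside a tuple is NOT substituted
    'hidden': (lambda v: v, ('x',)),
    # the same key inside a list IS found and replaced by its value
    'found': (lambda v: v, ['x']),
}
print(dask.get(dsk, 'hidden'))  # ('x',)
print(dask.get(dsk, 'found'))   # [1]
```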

@dcherian
Contributor Author

Thanks @mrocklin. Unfortunately that doesn't work with the Dataset constructor: with a list, it treats the entry as array-like.

    The following notations are accepted:

    - mapping {var name: DataArray}
    - mapping {var name: Variable}
    - mapping {var name: (dimension name, array-like)}
    - mapping {var name: (tuple of dimension names, array-like)}
    - mapping {dimension name: array-like}
      (it will be automatically moved to coords, see below)

Unless @shoyer has another idea, I guess I can insert creating a DataArray into the graph and then refer to those keys in the Dataset constructor.

@mrocklin
Contributor

mrocklin commented Aug 30, 2019

Then you can construct the tuple as a task: (1, 2, 3) -> (tuple, [1, 2, 3])
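A small sketch of that trick: tuple becomes the task's callable, the list argument is traversed (so any keys inside it are resolved), and the materialized list is converted back into a tuple at execution time.

```python
import dask

dsk = {
    'x': 2,
    # the list is traversed, so the key 'x' resolves to 2;
    # tuple(...) then rebuilds the desired tuple at execution time
    'result': (tuple, [1, 'x', 3]),
}
print(dask.get(dsk, 'result'))  # (1, 2, 3)
```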

@dcherian
Contributor Author

dcherian commented Sep 2, 2019

Thanks. That worked. I have a new version up in #3276 that works with both DataArrays and Datasets.

@mrocklin
Contributor

mrocklin commented Sep 2, 2019

I'm glad to see progress here. FWIW, I think that many people would be quite happy with a version that just worked for DataArrays, in case that's faster to get in than the full solution with Datasets.

@dcherian
Contributor Author

dcherian commented Sep 8, 2019

Closing in favour of #3276

@dcherian dcherian closed this Sep 8, 2019
@dcherian dcherian deleted the map_blocks branch September 8, 2019 04:20