[WIP] Add map_blocks. #3258
Conversation
This looks great! I left a bunch of tiny suggestions from a Dask Array perspective.
Hi, a few design opinions. For example, chaining with other methods should remain possible: `myarray.map(func1).chunk().map(func2).sum().compute()`
I agree, though I still think this particular set of functionality should be called `map_blocks`.
@shoyer let me rephrase: `apply_ufunc` is extremely powerful, and when you need to cope with all possible shape transformations, I suspect its verbosity is quite necessary. The thing I have against the name `map_blocks` is that backends other than dask have no notion of blocks...
Yes, 100% agreed! There is a real need for a simpler version of `apply_ufunc`.
I think the functionality in this PR is fundamentally dask specific. We shouldn't make a habit of adding backend specific features, but it makes sense in limited cases.
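For readers following along, here is a minimal sketch of the kind of call being discussed. The method name and signature follow the proposal in this PR (`func` receives each block as a DataArray and should return one of the same shape); treat it as illustrative rather than final:

```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.ones((10, 20)), dims=["x", "y"]).chunk({"x": 5})

def add_one(block):
    # `block` is a DataArray wrapping a single dask chunk
    return block + 1

result = da.map_blocks(add_one)  # lazy: one task per chunk
result.compute()
```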
I started prototyping a Dataset version. Here's what I have:

```python
import dask
import numpy as np
import xarray as xr

darray = xr.DataArray(np.ones((10, 20)),
                      dims=['x', 'y'],
                      coords={'x': np.arange(10), 'y': np.arange(100, 120)})
dset = darray.to_dataset(name='a')
dset['b'] = dset.a + 50
dset['c'] = (dset.x + 20)
dset = dset.chunk({'x': 4, 'y': 5})
```

The function I'm applying takes a dataset and returns a DataArray, because that's easy to test without figuring out how to assemble everything back into a dataset.

```python
import itertools

# function takes a dataset and returns a dataarray so that I can check
# that things work without reconstructing a dataset
def function(ds):
    return ds.a + 10

dataset_dims = list(dset.dims)
graph = {}
gname = 'dsnew'

# map dims to list of chunk indexes.
# If different variables have different chunking along the same dim,
# the call to .chunks will raise an error.
ichunk = {dim: range(len(dset.chunks[dim])) for dim in dataset_dims}

# iterate over all possible chunk combinations
for v in itertools.product(*ichunk.values()):
    chunk_index_dict = dict(zip(dataset_dims, v))
    data_vars = {}
    for name, variable in dset.data_vars.items():
        # why does __dask_keys__ have an extra level of nesting?
        # the [0] is not required for dataarrays
        var_dask_keys = variable.__dask_keys__()[0]

        # recursively index into the nested dask_keys list
        chunk = var_dask_keys
        for dim in variable.dims:
            chunk = chunk[chunk_index_dict[dim]]

        # I now have the key corresponding to this chunk;
        # the tuple below goes into a dictionary passed to xr.Dataset().
        # dask doesn't seem to replace it with a numpy array at execution time.
        data_vars[name] = (variable.dims, chunk)

    graph[(gname,) + v] = (function, (xr.Dataset, data_vars))

final_graph = dask.highlevelgraph.HighLevelGraph.from_collections(gname, graph, dependencies=[dset])
```

Elements of the graph look like the entry quoted in the reply below.
This doesn't work because dask doesn't replace the keys with numpy arrays when the graph is executed. I'm not sure what I'm doing wrong here. An equivalent version for DataArrays works perfectly.
Dask doesn't traverse through tuples to find possible keys, so the keys here are hidden from view:

```python
{'a': (('x', 'y'), ('xarray-a-f178df193efafa67203f3862b3f9f0f4', 0, 0)),
```

I recommend changing the wrapping tuples to lists:

```diff
- {'a': (('x', 'y'), ('xarray-a-f178df193efafa67203f3862b3f9f0f4', 0, 0)),
+ {'a': [('x', 'y'), ('xarray-a-f178df193efafa67203f3862b3f9f0f4', 0, 0)],
```
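For what it's worth, this traversal rule can be seen on a toy graph. A minimal sketch using a hand-built graph and the synchronous scheduler; the keys and values are invented for the example:

```python
import dask

dsk = {
    ("a", 0): 10,
    # Keys nested inside *lists* are found and substituted before the
    # callable runs: this computes sum([10, 5]) == 15.
    "found": (sum, [("a", 0), 5]),
    # A tuple whose first element is not callable is treated as a
    # literal, so the key ('a', 0) buried inside it is never substituted.
    "hidden": (str, (("a", 0), 5)),
}

print(dask.get(dsk, "found"))   # 15
print(dask.get(dsk, "hidden"))  # "(('a', 0), 5)"  <- key left as-is
```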
Thanks @mrocklin. Unfortunately that doesn't work with the Dataset constructor: with a list, it treats the entry as array-like.

Unless @shoyer has another idea, I guess I can insert creating a DataArray into the graph and then refer to those keys in the Dataset constructor.
Then you can construct a tuple as a task.
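In other words, because a tuple whose first element is callable is itself a dask task, the wrapping tuple can be assembled at execution time, after the keys have been substituted. A minimal sketch of the trick on a toy graph with invented names:

```python
import dask

dsk = {
    ("a", 0): 10,
    # (tuple, [...]) is a task: dask traverses the list argument,
    # replaces the key ('a', 0) with its value, then calls tuple()
    # on the substituted list.
    "pair": (tuple, [("x", "y"), ("a", 0)]),
}

print(dask.get(dsk, "pair"))  # (('x', 'y'), 10)
```

Presumably the prototype's `data_vars[name] = (variable.dims, chunk)` would become `data_vars[name] = (tuple, [variable.dims, chunk])` under this approach.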
Thanks. That worked. I have a new version up in #3276 that works with both DataArrays and Datasets.
I'm glad to see progress here. FWIW, I think that many people would be quite happy with a version that just worked for DataArrays, in case that's faster to get in than the full solution with Datasets.
Closing in favour of #3276
- Passes `black . && mypy . && flake8`
- `whats-new.rst` for all changes and `api.rst` for new API

ping @mrocklin @sofroniewn @shanaxel42