-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
Allow passing coordinates in to_zarr(region=...)
rather than passing indexes
#7702
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
to_zarr(region=...)
rather than passing indexesto_zarr(region=...)
rather than passing indexes
Noting that this feature was recently requested on Stack Overflow. https://stackoverflow.com/questions/76738086/how-to-store-a-subset-of-xaray-data-into-zarr |
Could pass DataArrays in the
|
This feature would be very useful to us - in our use-case, we're populating large geospatial zarr datasets by writing slice of values coming from smaller tiles of those datasets. Here is a snippet of what we've been using to translate between coordinate regions and index regions. This allows us to define a function def find_region(data: xr.Dataset | xr.DataArray, data_subset: xr.Dataset | xr.DataArray) -> Mapping[Hashable, slice]: Hopefully, sharing this will be helpful, happy to receive any feedback as well! SnippetThe solution is based on a "coordinate map". def _make_coord_map(coord: xr.DataArray) -> xr.DataArray:
"""Create a coordinate map from a coordinate.
Coordinate map has
- Coordinates: The original coordinate
- Data: The index of the coordinate
This is useful in translating coordinates to indices.
Example:
>>> coord = xr.DataArray([10, 20, 30], dims='x')
>>> coord_map = make_coord_map(coord)
>>> coord_map
<xarray.DataArray (x: 3)>
array([0, 1, 2])
Coordinates:
* x (x) int64 10 20 30
>>> coord_map.sel(x=20).item()
1
"""
if len(coord.dims) != 1:
raise ValueError("Only 1D coordinates are supported")
return xr.DataArray(
np.arange(len(coord)), dims=coord.dims[0], coords={coord.name: coord}
) This seems like a potentially more robust and extensible solution than relying on searching the array directly. Here is the rest of the scaffolding we've used to obtain the Note that the solution
def _make_coord_maps(
coords: Coordinates,
) -> Mapping[Hashable, xr.DataArray]:
"""Create a coordinate map for each coordinate in coords."""
return {name: _make_coord_map(coord) for name, coord in coords.items()}
def _coord_region_to_region(
coord_region: Mapping[Hashable, Sequence[Any]],
coord_maps: Mapping[Hashable, xr.DataArray],
) -> Mapping[Hashable, np.ndarray]:
"""Translate a coordinate region to a region of indices.
Example:
>>> coord_region = {'x': [10, 20, 30]}
>>> coord_maps = {'x': _make_coord_map(xr.DataArray([10, 20, 30], dims='x'))}
>>> _translate_coord_region(coord_region, coord_maps)
{'x': array([0, 1, 2])}
"""
return {
name: coord_maps[name].sel({name: coords}).values
for name, coords in coord_region.items()
}
def find_region(
data: xr.Dataset | xr.DataArray, data_subset: xr.Dataset | xr.DataArray
) -> Mapping[Hashable, slice]:
"""Given a subset of data, find the region of data that corresponds to the subset.
Directly corresponds to the `region` parameter in xr.Dataset.to_zarr
Note: coordinates of subset must be a contiguous subset (a slice) of
the coordinates of data
- This is required, since `region` must be a slice
Example:
>>> ds = xr.Dataset(coords={'x': [10, 20, 30], 'y': [100, 200, 300]})
>>> subset_ds = ds.sel(x=[10, 20], y=[100, 200])
>>> find_region(ds, subset_ds)
{'x': slice(0, 2, None), 'y': slice(0, 2, None)}
Args:
data: The Dataset or DataArray
data_subset: The subset of data to find the region of
Returns:
Dictionary of slices for each coordinate to delineate the region
"""
def _is_dimensional(coord_name: Hashable, xarr: xr.Dataset | xr.DataArray) -> bool:
return coord_name in xarr.dims
coord_maps = _make_coord_maps(data.coords)
coord_region = {
name: coord.values
for name, coord in data_subset.coords.items()
if _is_dimensional(name, data)
}
region = _coord_region_to_region(coord_region, coord_maps)
# Convert to slices, since `region` in `to_zarr` MUST be a dict of slices
def _to_slice(ordered_sequence: NDArray[np.int_]) -> slice:
min_, max_ = (ordered_sequence[0], ordered_sequence[-1])
return slice(min_, max_ + 1)
region_slices: Mapping[Hashable, slice] = {
name: _to_slice(dim_region) for name, dim_region in region.items()
}
# TODO: Check subset + contiguity
return region_slices |
Is your feature request related to a problem?
If I want to write to a region of data in a zarr, I usually have some boilerplate code like this:
Describe the solution you'd like
It would be nice to automate this within
to_zarr
, because having to drop into index-space always feels un-xarray-like to me.There may be pitfalls I'm not thinking of, and I don't know exactly what the API would look like.
Describe alternatives you've considered
No response
Additional context
No response
The text was updated successfully, but these errors were encountered: