Skip to content

Allow passing coordinates in to_zarr(region=...) rather than passing indexes #7702

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
slevang opened this issue Mar 30, 2023 · 3 comments · Fixed by #8434
Closed

Allow passing coordinates in to_zarr(region=...) rather than passing indexes #7702

slevang opened this issue Mar 30, 2023 · 3 comments · Fixed by #8434
Labels
enhancement topic-zarr Related to zarr storage library

Comments

@slevang
Copy link
Contributor

slevang commented Mar 30, 2023

Is your feature request related to a problem?

If I want to write to a region of data in a zarr, I usually have some boilerplate code like this:

ds_existing = xr.open_zarr(path)
ds_new = xr.Dataset(...) # come up with some new data for a subset region of the existing zarr
start_idx = (ds_existing.time == ds_new.time[0]).argmax()
end_idx = (ds_existing.time == ds_new.time[-1]).argmax()
ds_new.to_zarr(path, region={"time": slice(start_idx, end_idx})

Describe the solution you'd like

It would be nice to automate this within to_zarr, because having to drop into index-space always feels un-xarray-like to me.

There may be pitfalls I'm not thinking of, and I don't know exactly what the API would look like.

ds_new.to_zarr(path, region={"time": "auto"}) # ???

Describe alternatives you've considered

No response

Additional context

No response

@dcherian dcherian changed the title Allow coordinate detection in to_zarr(region=...) rather than passing indexes Allow passing coordinates in to_zarr(region=...) rather than passing indexes Mar 30, 2023
@TomNicholas TomNicholas added the topic-zarr Related to zarr storage library label Mar 31, 2023
@rabernat
Copy link
Contributor

Noting that this feature was recently requested on Stack Overflow. https://stackoverflow.com/questions/76738086/how-to-store-a-subset-of-xaray-data-into-zarr

@dcherian
Copy link
Contributor

Could pass DataArrays in the region kwarg

ds_new.to_zarr(path, region={"time": ds_new.time})

@DahnJ
Copy link

DahnJ commented Aug 8, 2023

This feature would be very useful to us - in our use-case, we're populating large geospatial zarr datasets by writing slice of values coming from smaller tiles of those datasets.

Here is a snippet of what we've been using to translate between coordinate regions and index regions. This allows us to define a function

def find_region(data: xr.Dataset | xr.DataArray, data_subset: xr.Dataset | xr.DataArray) -> Mapping[Hashable, slice]:

Hopefully, sharing this will be helpful, happy to receive any feedback as well!

Snippet

The solution is based on a "coordinate map".

def _make_coord_map(coord: xr.DataArray) -> xr.DataArray:
    """Create a coordinate map from a coordinate.

    Coordinate map has
    - Coordinates: The original coordinate
    - Data: The index of the coordinate

    This is useful in translating coordinates to indices.

    Example:
        >>> coord = xr.DataArray([10, 20, 30], dims='x')
        >>> coord_map = make_coord_map(coord)
        >>> coord_map
        <xarray.DataArray (x: 3)>
        array([0, 1, 2])
        Coordinates:
          * x        (x) int64 10 20 30
        >>> coord_map.sel(x=20).item()
        1
    """
    if len(coord.dims) != 1:
        raise ValueError("Only 1D coordinates are supported")

    return xr.DataArray(
        np.arange(len(coord)), dims=coord.dims[0], coords={coord.name: coord}
    )

This seems like a potentially more robust and extensible solution than relying on searching the array directly.

Here is the rest of the scaffolding we've used to obtain the region argument from a subset DataArray.

Note that the solution

  • does not check whether the subset coordinates are a contiguous subset of the coordinates and would fail in that case
  • assumes coordinates are sorted in an increasing order.
def _make_coord_maps(
    coords: Coordinates,  
) -> Mapping[Hashable, xr.DataArray]:
    """Create a coordinate map for each coordinate in coords."""
    return {name: _make_coord_map(coord) for name, coord in coords.items()}


def _coord_region_to_region(
    coord_region: Mapping[Hashable, Sequence[Any]],
    coord_maps: Mapping[Hashable, xr.DataArray],
) -> Mapping[Hashable, np.ndarray]: 
    """Translate a coordinate region to a region of indices.

    Example:
        >>> coord_region = {'x': [10, 20, 30]}
        >>> coord_maps = {'x': _make_coord_map(xr.DataArray([10, 20, 30], dims='x'))}
        >>> _translate_coord_region(coord_region, coord_maps)
        {'x': array([0, 1, 2])}
    """
    return {
        name: coord_maps[name].sel({name: coords}).values
        for name, coords in coord_region.items()
    }


def find_region(
    data: xr.Dataset | xr.DataArray, data_subset: xr.Dataset | xr.DataArray
) -> Mapping[Hashable, slice]:
    """Given a subset of data, find the region of data that corresponds to the subset.

    Directly corresponds to the `region` parameter in xr.Dataset.to_zarr
    Note: coordinates of subset must be a contiguous subset (a slice) of
    the coordinates of data
    - This is required, since `region` must be a slice

    Example:
        >>> ds = xr.Dataset(coords={'x': [10, 20, 30], 'y': [100, 200, 300]})
        >>> subset_ds = ds.sel(x=[10, 20], y=[100, 200])
        >>> find_region(ds, subset_ds)
        {'x': slice(0, 2, None), 'y': slice(0, 2, None)}

    Args:
        data: The Dataset or DataArray
        data_subset: The subset of data to find the region of

    Returns:
        Dictionary of slices for each coordinate to delineate the region
    """

    def _is_dimensional(coord_name: Hashable, xarr: xr.Dataset | xr.DataArray) -> bool:
        return coord_name in xarr.dims

    coord_maps = _make_coord_maps(data.coords)
    coord_region = {
        name: coord.values
        for name, coord in data_subset.coords.items()
        if _is_dimensional(name, data)
    }

    region = _coord_region_to_region(coord_region, coord_maps)

    # Convert to slices, since `region` in `to_zarr` MUST be a dict of slices
    def _to_slice(ordered_sequence: NDArray[np.int_]) -> slice:
        min_, max_ = (ordered_sequence[0], ordered_sequence[-1])
        return slice(min_, max_ + 1)

    region_slices: Mapping[Hashable, slice] = {
        name: _to_slice(dim_region) for name, dim_region in region.items()
    }

    # TODO: Check subset + contiguity
    return region_slices

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement topic-zarr Related to zarr storage library
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants