sel by slice not working for multi-index containing float-values #6838


Open
4 tasks done
nunupeke opened this issue Jul 27, 2022 · 3 comments · May be fixed by #7004
Comments

nunupeke commented Jul 27, 2022

What happened?

da = xr.DataArray(np.random.rand(4), {'x': np.arange(4)})
da = da.assign_coords(y=('x', np.linspace(0, 1, 4)))
da = da.assign_coords(z=('x', np.arange(4) + 4))
da.set_index(x=["y", "z"]).sel(y=slice(None, 0.5))

fails with

TypeError: float() argument must be a string or a real number, not 'slice'

What did you expect to happen?

In v2022.3, this yields the correct sliced selection. Also, in v2022.6 this works for multi-indexes without float values:

da = xr.DataArray(np.random.rand(4), {'x': np.arange(4)})
da = da.assign_coords(y=('x', np.arange(4)))
da = da.assign_coords(z=('x', np.arange(4) + 4))
da.set_index(x=["y", "z"]).sel(y=slice(None, 2), z=slice(5, None))

(only that the resulting coordinates look a bit weird, containing slices). Also, sliced selection on a regular float-based index works in v2022.6:

da = xr.DataArray(np.random.rand(4), {'x': np.linspace(0, 1, 4)})
da.sel(x=slice(None, 0.5))

Minimal Complete Verifiable Example

import numpy as np
import xarray as xr

da = xr.DataArray(np.random.rand(4), {'x': np.arange(4)})
da = da.assign_coords(y=('x', np.linspace(0, 1, 4)))
da = da.assign_coords(z=('x', np.arange(4) + 4))
da.set_index(x=["y", "z"]).sel(y=slice(None, 0.5))

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

File c:\mambaforge\envs\dev\lib\site-packages\xarray\core\dataarray.py:1420, in DataArray.sel(self, indexers, method, tolerance, drop, **indexers_kwargs)
   1310 def sel(
   1311     self: T_DataArray,
   1312     indexers: Mapping[Any, Any] = None,
   (...)
   1316     **indexers_kwargs: Any,
   1317 ) -> T_DataArray:
   1318     """Return a new DataArray whose data is given by selecting index
   1319     labels along the specified dimension(s).
   1320 
   (...)
   1418     Dimensions without coordinates: points
   1419     """
-> 1420     ds = self._to_temp_dataset().sel(
   1421         indexers=indexers,
   1422         drop=drop,
   1423         method=method,
   1424         tolerance=tolerance,
...
    197     # see https://github.com/pydata/xarray/issues/5727
--> 198     value = np.asarray(value, dtype=dtype)
    199 return value

TypeError: float() argument must be a string or a real number, not 'slice'
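
The last frame of the traceback can be reproduced outside xarray. The following is a reduced sketch, not xarray's actual code path: it assumes only that a `slice` label ends up being passed to `np.asarray` with a float dtype, as the frame at line 198 suggests. The variable names `label` and `message` are illustrative.

```python
import numpy as np

# A slice label like the one passed to .sel(y=slice(None, 0.5)).
label = slice(None, 0.5)

message = None
try:
    # NumPy tries to convert the slice object itself to a float,
    # which is not possible, so a TypeError is raised.
    np.asarray(label, dtype=np.float64)
except TypeError as err:
    message = str(err)

print(message)
```

This suggests the failure comes from casting the label to the level's dtype before checking whether the label is a slice, which would explain why integer-valued levels are not affected in the same way.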

Anything else we need to know?

Maybe related to #6836

Environment

INSTALLED VERSIONS

commit: None
python: 3.10.5 | packaged by conda-forge | (main, Jun 14 2022, 06:57:19) [MSC v.1929 64 bit (AMD64)]
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 142 Stepping 12, GenuineIntel
byteorder: little
LC_ALL: None
LANG: de_DE.UTF-8
LOCALE: ('de_DE', 'cp1252')
libhdf5: None
libnetcdf: None

xarray: 2022.6.0
pandas: 1.4.3
numpy: 1.23.1
scipy: 1.8.1
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: 3.5.2
cartopy: None
seaborn: 0.11.2
numbagg: None
fsspec: None
cupy: None
pint: 0.19.2
sparse: None
flox: None
numpy_groupies: None
setuptools: 63.2.0
pip: 22.2
conda: None
pytest: 7.1.2
IPython: 8.4.0
sphinx: 4.5.0

@nunupeke added the bug and needs triage labels Jul 27, 2022
@dcherian added the topic-indexing and regression labels and removed the needs triage label Jul 27, 2022
benbovy (Member) commented Jul 29, 2022

Thanks for the report @nunupeke. That's definitely a regression. I'm not that surprised actually as the logic behind .sel() is already quite convoluted for the case of (pandas) multi-indexes :-).

benbovy (Member) commented Sep 7, 2022

only that the resulting coordinates look a bit weird, containing slices

I'm working on this issue right now and I see this too.

I don't think that passing slice objects for multi-index level coordinates in .sel() is something that we've really expected to work so far. We expect scalar values instead, and we explicitly raise a ValueError when multiple values are passed:

da
# <xarray.DataArray (x: 4)>
# array([0.30120807, 0.43951659, 0.19163508, 0.57251755])
# Coordinates:
#   * x        (x) object MultiIndex
#   * y        (x) float64 0.0 0.3333 0.6667 1.0
#   * z        (x) int64 4 5 6 7

da.sel(z=[4, 5])
# ValueError: Vectorized selection is not available along coordinate 'z' (multi-index level)

The fact that it used to work with slices looks like a side effect. For example, providing a slice for only one of the level coordinates drops that coordinate in the resulting dataset (v2022.3.0):

da.sel(y=slice(None, 0.5))
# <xarray.DataArray (z: 2)>
# array([0.4004091 , 0.11179854])
# Coordinates:
#  * z        (z) int64 4 5
# 
# The 'y' coord is missing! It should still be there.

I think that we could support slices in a clean way (maybe any sequence of values too?) by reusing pandas.MultiIndex.get_locs() internally. However, in that case it would be hard to automatically collapse one or more multi-index levels, e.g.,

# note the difference between

da.sel(y=0.0, z=slice(None))
# <xarray.DataArray (x: 1)>
# array([0.08696024])
# Coordinates:
#   * x        (x) object MultiIndex
#   * y        (x) float64 0.0
#   * z        (x) int64 4

# and

da.sel(y=0.0)
# <xarray.DataArray (z: 1)>
# array([0.08696024])
# Coordinates:
#  * z        (z) int64 4
#    y        float64 0.0

# the 1st one calls pandas.MultiIndex.get_locs, while
# the 2nd one calls pandas.MultiIndex.get_loc_level
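
The difference between the two pandas methods can be seen on a plain `pandas.MultiIndex`, without xarray. This is a minimal sketch that assumes a multi-index built from the same `y` and `z` values as in the report. The names `midx`, `locs`, `indexer` and `remaining` are illustrative and do not come from xarray's internals.

```python
import numpy as np
import pandas as pd

# Same level values as in the report: float level "y", integer level "z".
y = np.linspace(0, 1, 4)
z = np.arange(4) + 4
midx = pd.MultiIndex.from_arrays([y, z], names=["y", "z"])

# get_locs takes one entry per level (scalars, slices, ...) and returns
# integer positions. No level is dropped from the index.
locs = midx.get_locs([slice(None, 0.5), slice(None)])
print(locs)

# get_loc_level takes a label for a given level and returns an indexer
# together with the remaining index, with the selected level dropped.
indexer, remaining = midx.get_loc_level(0.0, level="y")
print(indexer, remaining)
```

Since `get_locs` only returns positions, the caller would have to decide separately whether a level should be collapsed, which is the difficulty mentioned above.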

@benbovy linked a pull request that will close this issue Sep 7, 2022
benbovy (Member) commented Sep 7, 2022

@nunupeke The TypeError in your example should be fixed in #7004, which also improves how slice objects are handled in general for a multi-index.

@benbovy self-assigned this Sep 27, 2022