
Rework PandasMultiIndex.sel internals #7004

Open
wants to merge 1 commit into
base: main
from

Conversation

benbovy
Member

@benbovy benbovy commented Sep 7, 2022

This PR hopefully improves how labels provided for multi-index level coordinates are handled in .sel().

More specifically, slices are handled in a cleaner way and it is now allowed to provide array-like labels.

PandasMultiIndex.sel() relies on the underlying pandas.MultiIndex methods like this:

  • use get_loc when every level is given a scalar label (no slice, no array)
    • always drops the index and returns scalar coordinates for each multi-index level
  • use get_loc_level when only a subset of levels is given labels, all of them scalar
    • may collapse one or more levels of the multi-index (dropped levels result in scalar coordinates)
    • if only one level remains: renames the dimension and the corresponding dimension coordinate
  • use get_locs for all other cases.
    • always keeps the multi-index and its coordinates (even if only one item or one level is selected)
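
As a rough illustration of why the three cases behave differently, the sketch below calls the three pandas.MultiIndex methods directly on the same index used in the examples further down. It only shows what pandas returns. How PandasMultiIndex.sel() turns these results into coordinates is not reproduced here, and the variable names are chosen for illustration only.

```python
import pandas as pd

midx = pd.MultiIndex.from_product([list("abc"), range(4)], names=("one", "two"))

# Every level gets a scalar label: pandas returns one integer position,
# so there is nothing left to index along.
loc = midx.get_loc(("a", 0))
print(loc)  # 0

# Only a subset of levels gets scalar labels: pandas returns an indexer
# plus a new index made of the levels that were not selected.
indexer, remaining = midx.get_loc_level("a", level="one")
print(indexer, list(remaining))

# Any slice or array-like label: pandas returns integer positions into the
# original multi-index, so every level can be kept.
locs = midx.get_locs([slice("a", "b")])
one_item = midx.get_locs(["a", slice(1, 1)])
print(locs, one_item)
```

Only the first two calls discard level information, which is consistent with the multi-index being dropped or collapsed in those cases and kept in the third.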

This yields predictable behavior: as soon as one of the provided labels is a slice or array-like, the multi-index and all its level coordinates are kept in the result.
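
The rule can be sketched as a small dispatch function. This is a minimal, hypothetical helper (pick_lookup is not part of xarray or pandas) that only names the pandas.MultiIndex method the rule would choose. It assumes that strings count as scalars and that any other iterable counts as array-like.

```python
def pick_lookup(labels, level_names):
    """Name the pandas.MultiIndex method the rule would select.

    `labels` maps level name -> label. Hypothetical helper for illustration,
    not the actual xarray implementation.
    """

    def is_scalar(label):
        # Assumption: strings/bytes are scalars, slices and other
        # iterables (lists, tuples, arrays) are not.
        if isinstance(label, (str, bytes)):
            return True
        if isinstance(label, slice):
            return False
        return not hasattr(label, "__iter__")

    unknown = set(labels) - set(level_names)
    if unknown:
        raise KeyError(f"not multi-index levels: {sorted(unknown)}")

    if not all(is_scalar(label) for label in labels.values()):
        return "get_locs"  # a slice or array-like label: keep the multi-index
    if len(labels) == len(level_names):
        return "get_loc"  # every level has a scalar label
    return "get_loc_level"  # scalar labels for a subset of levels


levels = ("one", "two")
print(pick_lookup({"one": "a", "two": 0}, levels))            # get_loc
print(pick_lookup({"one": "a"}, levels))                      # get_loc_level
print(pick_lookup({"one": "a", "two": slice(1, 1)}, levels))  # get_locs
print(pick_lookup({"one": ["b", "c"], "two": [0, 2]}, levels))  # get_locs
```

The four calls correspond to the selections in the examples that follow.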

Some cases are illustrated below (I compare this PR with an older release due to the errors reported in #6838):

import xarray as xr
import pandas as pd

midx = pd.MultiIndex.from_product([list("abc"), range(4)], names=("one", "two"))
ds = xr.Dataset(coords={"x": midx})    
# <xarray.Dataset>
# Dimensions:  (x: 12)
# Coordinates:
#   * x        (x) object MultiIndex
#   * one      (x) object 'a' 'a' 'a' 'a' 'b' 'b' 'b' 'b' 'c' 'c' 'c' 'c'
#   * two      (x) int64 0 1 2 3 0 1 2 3 0 1 2 3
# Data variables:
#     *empty*

ds.sel(one="a", two=0)

# this PR
#
# <xarray.Dataset>
# Dimensions:  ()
# Coordinates:
#     x        object ('a', 0)
#     one      <U1 'a'
#     two      int64 0
# Data variables:
#     *empty*
# 

# v2022.3.0
# 
# <xarray.Dataset>
# Dimensions:  ()
# Coordinates:
#     x        object ('a', 0)
# Data variables:
#     *empty*
# 

ds.sel(one="a")

# this PR:
#
# <xarray.Dataset>
# Dimensions:  (two: 4)
# Coordinates:
#  * two      (two) int64 0 1 2 3
#    one      <U1 'a'
# Data variables:
#    *empty*
#

# v2022.3.0
# 
# <xarray.Dataset>
# Dimensions:  (two: 4)
# Coordinates:
#   * two      (two) int64 0 1 2 3
# Data variables:
#     *empty*
# 

ds.sel(one=slice("a", "b"))

# this PR
# 
# <xarray.Dataset>
# Dimensions:  (x: 8)
# Coordinates:
#   * x        (x) object MultiIndex
#   * one      (x) object 'a' 'a' 'a' 'a' 'b' 'b' 'b' 'b'
#   * two      (x) int64 0 1 2 3 0 1 2 3
# Data variables:
#     *empty*
# 

# v2022.3.0
# 
# <xarray.Dataset>
# Dimensions:  (two: 8)
# Coordinates:
#   * two      (two) int64 0 1 2 3 0 1 2 3
# Data variables:
#     *empty*
# 

ds.sel(one="a", two=slice(1, 1))

# this PR
# 
# <xarray.Dataset>
# Dimensions:  (x: 1)
# Coordinates:
#   * x        (x) object MultiIndex
#   * one      (x) object 'a'
#   * two      (x) int64 1
# Data variables:
#     *empty*
# 

# v2022.3.0
# 
# <xarray.Dataset>
# Dimensions:  (x: 1)
# Coordinates:
#   * x        (x) MultiIndex
#   - one      (x) object 'a'
#   - two      (x) int64 1
# Data variables:
#     *empty*
# 

ds.sel(one=["b", "c"], two=[0, 2])

# this PR
# 
# <xarray.Dataset>
# Dimensions:  (x: 4)
# Coordinates:
#   * x        (x) object MultiIndex
#   * one      (x) object 'b' 'b' 'c' 'c'
#   * two      (x) int64 0 2 0 2
# Data variables:
#     *empty*
# 

# v2022.3.0
# 
# ValueError: Vectorized selection is not available along coordinate 'one' (multi-index level)
# 

Review only the case where labels are provided for index levels.

Allow providing array-like objects as labels.

Handle slices in a cleaner way

pandas MultiIndex methods are used like this:

- use ``pandas.MultiIndex.get_loc`` when every level is given a scalar
  label (no slice, no array)

- use ``pandas.MultiIndex.get_loc_level`` when only a subset of levels
  is given scalar labels

- use ``pandas.MultiIndex.get_locs`` for all other cases.

@benbovy
Member Author

benbovy commented Sep 8, 2022

> it is now allowed to provide array-like labels.

Hmm, not sure if it's a good idea... I find get_locs() a bit confusing, as in the example below, where a 4-label array for level "one" returns a 3-item integer location array:

import numpy as np

# is the 3rd label ("b") ignored?
midx.get_locs((np.array(["b", "a", "b", "c"]), 0))
# array([4, 0, 8])

That differs too much from the vectorized selection based on single pandas indexes...
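
For comparison, here is a small sketch of a label lookup on a single (flat) pandas index, assuming pandas.Index.get_indexer is representative of what vectorized selection relies on in that case. It returns one position per label, repeats included, unlike the get_locs() result above.

```python
import pandas as pd

idx = pd.Index(list("abc"), name="one")
labels = ["b", "a", "b", "c"]

# One integer position per requested label, in the requested order.
positions = idx.get_indexer(labels)
print(positions)  # [1 0 1 2]
```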

Fancy indexing with n-d label arrays doesn't work either:

midx.get_locs((np.array([["a", "a"], ["a", "a"]]), 0))
# InvalidIndexError: [['a' 'a']
#  ['a' 'a']]

And providing Variable or DataArray objects as labels would make things even harder, unless we ignore their dimension names and coordinates (but then it wouldn't be consistent with vectorized selection based on single pandas indexes).

Probably not worth it then?

@mathause
Collaborator

It would be nice to be able to preserve the MultiIndex with sel (e.g. ds.sel(one=["a"])), but if it makes the behavior inconsistent, that is no good either...

Successfully merging this pull request may close these issues.

sel by slice not working for multi-index containing float-values