
Commit 7a9e84b

Benoit Bovy authored and shoyer committed
Multi-index indexing (#802)
* Use dict-like for indexing on dims with multiindex
* renamed drop_levels to drop_level
* fixed existing test failures
* removed drop_level option
* multi-index level drop for DataArray.loc
* mindex level drop for DataArray.loc with Ellipsis
* fix unnamed indexes
* added tests for indexing and selection
* added documentation (indexing)
* allow multi-index indexing with nested tuples
* updated what's new
* updated doc
* set better default names for multi-index levels
* refactored and fixed dim name / coord replacement
* fix remap_label_indexers tests
* more detailed doc
* re-written _replace_indexes
* avoid creating temp dataset for dataarray sel/loc
* clean up internal function is_nested_tuple
* better handling of multi-index level drop
* more global handling of unnamed multi-index levels
* updated doc
* typos and missing details (docstrings, doc)
* handle multi-index level drop for scalar labels
1 parent a0a3860 commit 7a9e84b

11 files changed: 306 additions and 57 deletions


doc/data-structures.rst

Lines changed: 3 additions & 5 deletions
@@ -115,11 +115,9 @@ If you create a ``DataArray`` by supplying a pandas
     df
     xr.DataArray(df)

-xarray does not (yet!) support labeling coordinate values with a
-:py:class:`pandas.MultiIndex` (see :issue:`164`).
-However, the alternate ``from_series`` constructor will automatically unpack
-any hierarchical indexes it encounters by expanding the series into a
-multi-dimensional array, as described in :doc:`pandas`.
+Xarray supports labeling coordinate values with a :py:class:`pandas.MultiIndex`.
+While it handles multi-indexes with unnamed levels, it is recommended that you
+explicitly set the names of the levels.

 DataArray properties
 ~~~~~~~~~~~~~~~~~~~~

doc/indexing.rst

Lines changed: 45 additions & 0 deletions
@@ -294,6 +294,51 @@ elements that are fully masked:

     arr2.where(arr2.y < 2, drop=True)

+.. _multi-level indexing:
+
+Multi-level indexing
+--------------------
+
+Just like pandas, advanced indexing on multi-level indexes is possible with
+``loc`` and ``sel``. You can slice a multi-index by providing multiple indexers,
+i.e., a tuple of slices, labels, list of labels, or any selector allowed by
+pandas:
+
+.. ipython:: python
+
+    midx = pd.MultiIndex.from_product([list('abc'), [0, 1]],
+                                      names=('one', 'two'))
+    mda = xr.DataArray(np.random.rand(6, 3),
+                       [('x', midx), ('y', range(3))])
+    mda
+    mda.sel(x=(list('ab'), [0]))
+
+You can also select multiple elements by providing a list of labels or tuples or
+a slice of tuples:
+
+.. ipython:: python
+
+    mda.sel(x=[('a', 0), ('b', 1)])
+
+Additionally, xarray supports dictionaries:
+
+.. ipython:: python
+
+    mda.sel(x={'one': 'a', 'two': 0})
+    mda.loc[{'one': 'a'}, ...]
+
+Like pandas, xarray handles partial selection on multi-index (level drop).
+As shown in the last example above, it also renames the dimension / coordinate
+when the multi-index is reduced to a single index.
+
+Unlike pandas, xarray does not guess whether you provide index levels or
+dimensions when using ``loc`` in some ambiguous cases. For example, for
+``mda.loc[{'one': 'a', 'two': 0}]`` and ``mda.loc['a', 0]`` xarray
+always interprets ('one', 'two') and ('a', 0) as the names and
+labels of the 1st and 2nd dimension, respectively. You must specify all
+dimensions or use the ellipsis in the ``loc`` specifier, e.g. in the example
+above, ``mda.loc[{'one': 'a', 'two': 0}, :]`` or ``mda.loc[('a', 0), ...]``.
+
 Multi-dimensional indexing
 --------------------------
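The pandas behaviour that the new "Multi-level indexing" section refers to can be reproduced without xarray. The sketch below is illustrative only and is not part of the commit: the `pandas.Series` named `s` is an assumption made for the example, built on the same `midx` as in the documentation. It shows that a partial selection drops the selected level and leaves an index named after the remaining level, which is the name the documentation says xarray uses when it renames the dimension / coordinate.

```python
import pandas as pd

midx = pd.MultiIndex.from_product([list('abc'), [0, 1]],
                                  names=('one', 'two'))
s = pd.Series(range(6), index=midx)  # hypothetical data, for illustration

# Partial selection on the first level: level 'one' is dropped and the
# result is indexed by the remaining level only.
partial = s.loc['a']
print(partial.index.name)

# Full selection with one label per level returns a single element.
single = s.loc[('a', 0)]
print(single)
```

In xarray terms, the first case corresponds to a selection such as `mda.sel(x={'one': 'a'})` from the documentation above.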

doc/whats-new.rst

Lines changed: 10 additions & 1 deletion
@@ -26,6 +26,9 @@ Breaking changes
 ~~~~~~~~~~~~~~~~

 - Dropped support for Python 2.6 (:issue:`855`).
+- Indexing on multi-index now drops levels, which is consistent with pandas.
+  It also changes the name of the dimension / coordinate when the multi-index is
+  reduced to a single index.

 Enhancements
 ~~~~~~~~~~~~
@@ -45,10 +48,16 @@ Enhancements
   attributes are retained in the resampled object. By
   `Jeremy McGibbon <https://github.com/mcgibbon>`_.

+- Better multi-index support in DataArray and Dataset :py:meth:`sel` and
+  :py:meth:`loc` methods, which now behave more closely to pandas and which
+  also accept dictionaries for indexing based on given level names and labels
+  (see :ref:`multi-level indexing`). By
+  `Benoit Bovy <https://github.com/benbovy>`_.
+
 - New (experimental) decorators :py:func:`~xarray.register_dataset_accessor` and
   :py:func:`~xarray.register_dataarray_accessor` for registering custom xarray
   extensions without subclassing. They are described in the new documentation
-  page on :ref:`internals`. By `Stephan Hoyer <https://github.com/shoyer>`
+  page on :ref:`internals`. By `Stephan Hoyer <https://github.com/shoyer>`_.

 - Round trip boolean datatypes. Previously, writing boolean datatypes to netCDF
   formats would raise an error since netCDF does not have a `bool` datatype.

xarray/core/dataarray.py

Lines changed: 29 additions & 15 deletions
@@ -86,24 +86,19 @@ def __init__(self, data_array):
         self.data_array = data_array

     def _remap_key(self, key):
-        def lookup_positions(dim, labels):
-            index = self.data_array.indexes[dim]
-            return indexing.convert_label_indexer(index, labels)
-
-        if utils.is_dict_like(key):
-            return dict((dim, lookup_positions(dim, labels))
-                        for dim, labels in iteritems(key))
-        else:
+        if not utils.is_dict_like(key):
             # expand the indexer so we can handle Ellipsis
-            key = indexing.expanded_indexer(key, self.data_array.ndim)
-            return tuple(lookup_positions(dim, labels) for dim, labels
-                         in zip(self.data_array.dims, key))
+            labels = indexing.expanded_indexer(key, self.data_array.ndim)
+            key = dict(zip(self.data_array.dims, labels))
+        return indexing.remap_label_indexers(self.data_array, key)

     def __getitem__(self, key):
-        return self.data_array[self._remap_key(key)]
+        pos_indexers, new_indexes = self._remap_key(key)
+        return self.data_array[pos_indexers]._replace_indexes(new_indexes)

     def __setitem__(self, key, value):
-        self.data_array[self._remap_key(key)] = value
+        pos_indexers, _ = self._remap_key(key)
+        self.data_array[pos_indexers] = value


 class _ThisArray(object):
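The `_remap_key` change in the hunk above turns every positional `loc` key into a `{dimension: label}` dict before the labels are resolved. The sketch below mimics that normalisation in plain Python. `key_to_dict` is a hypothetical helper written for this illustration (it is not xarray's `expanded_indexer`), and it assumes the key contains at most one `Ellipsis`.

```python
def key_to_dict(key, dims):
    """Expand a positional key (possibly containing Ellipsis) into a
    {dimension name: label} dict. Hypothetical helper, for illustration."""
    if not isinstance(key, tuple):
        key = (key,)
    expanded = []
    for k in key:
        if k is Ellipsis:
            # fill the gap left by the ellipsis with full slices
            n_missing = len(dims) - (len(key) - 1)
            expanded.extend([slice(None)] * n_missing)
        else:
            expanded.append(k)
    if len(expanded) > len(dims):
        raise IndexError('too many indices')
    # trailing dimensions not mentioned in the key are taken whole
    expanded.extend([slice(None)] * (len(dims) - len(expanded)))
    return dict(zip(dims, expanded))


dims = ('x', 'y')
# ('a', 0) is read as one label per dimension ...
per_dim = key_to_dict(('a', 0), dims)
# ... whereas (('a', 0), ...) gives the whole tuple to the first dimension.
multi_level = key_to_dict((('a', 0), Ellipsis), dims)
```

This is why the documentation added in this commit asks for `mda.loc[('a', 0), ...]` when both labels are meant for the multi-index on `x`: without the ellipsis, each element of the tuple is assigned to a different dimension.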
@@ -244,6 +239,23 @@ def _replace_maybe_drop_dims(self, variable, name=__default):
                                 if set(v.dims) <= allowed_dims)
         return self._replace(variable, coords, name)

+    def _replace_indexes(self, indexes):
+        if not len(indexes):
+            return self
+        coords = self._coords.copy()
+        for name, idx in indexes.items():
+            coords[name] = Coordinate(name, idx)
+        obj = self._replace(coords=coords)
+
+        # switch from dimension to level names, if necessary
+        dim_names = {}
+        for dim, idx in indexes.items():
+            if not isinstance(idx, pd.MultiIndex) and idx.name != dim:
+                dim_names[dim] = idx.name
+        if dim_names:
+            obj = obj.rename(dim_names)
+        return obj
+
     __this_array = _ThisArray()

     def _to_temp_dataset(self):
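The renaming rule inside `_replace_indexes` (used by both the `DataArray` and the `Dataset` version in this commit) can be tried in isolation. The sketch below is an assumption-labelled illustration: `dims_to_rename` is a hypothetical function that reproduces only the `dim_names` loop, and the two indexes are made-up inputs.

```python
import pandas as pd


def dims_to_rename(indexes):
    """Map each dimension to the name of its replacement index, but only
    when that index is a single-level index with a different name."""
    return {dim: idx.name for dim, idx in indexes.items()
            if not isinstance(idx, pd.MultiIndex) and idx.name != dim}


# A multi-index reduced to one level: the dimension takes the level name.
single = pd.Index([0, 1], name='two')
renames_single = dims_to_rename({'x': single})

# Still a multi-index after selection: the dimension keeps its name.
multi = pd.MultiIndex.from_product([[0, 1], ['u', 'v']],
                                   names=('two', 'three'))
renames_multi = dims_to_rename({'x': multi})
```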
@@ -599,8 +611,10 @@ def sel(self, method=None, tolerance=None, **indexers):
         Dataset.sel
         DataArray.isel
         """
-        return self.isel(**indexing.remap_label_indexers(
-            self, indexers, method=method, tolerance=tolerance))
+        pos_indexers, new_indexes = indexing.remap_label_indexers(
+            self, indexers, method=method, tolerance=tolerance
+        )
+        return self.isel(**pos_indexers)._replace_indexes(new_indexes)

     def isel_points(self, dim='points', **indexers):
         """Return a new DataArray whose dataset is given by pointwise integer

xarray/core/dataset.py

Lines changed: 27 additions & 8 deletions
@@ -419,6 +419,23 @@ def _replace_vars_and_dims(self, variables, coord_names=None,
         obj = self._construct_direct(variables, coord_names, dims, attrs)
         return obj

+    def _replace_indexes(self, indexes):
+        if not len(indexes):
+            return self
+        variables = self._variables.copy()
+        for name, idx in indexes.items():
+            variables[name] = Coordinate(name, idx)
+        obj = self._replace_vars_and_dims(variables)
+
+        # switch from dimension to level names, if necessary
+        dim_names = {}
+        for dim, idx in indexes.items():
+            if not isinstance(idx, pd.MultiIndex) and idx.name != dim:
+                dim_names[dim] = idx.name
+        if dim_names:
+            obj = obj.rename(dim_names)
+        return obj
+
     def copy(self, deep=False):
         """Returns a copy of this dataset.
@@ -954,7 +971,9 @@ def sel(self, method=None, tolerance=None, **indexers):
             Requires pandas>=0.17.
         **indexers : {dim: indexer, ...}
             Keyword arguments with names matching dimensions and values given
-            by scalars, slices or arrays of tick labels.
+            by scalars, slices or arrays of tick labels. For dimensions with
+            multi-index, the indexer may also be a dict-like object with keys
+            matching index level names.

         Returns
         -------
@@ -972,8 +991,10 @@ def sel(self, method=None, tolerance=None, **indexers):
         Dataset.isel_points
         DataArray.sel
         """
-        return self.isel(**indexing.remap_label_indexers(
-            self, indexers, method=method, tolerance=tolerance))
+        pos_indexers, new_indexes = indexing.remap_label_indexers(
+            self, indexers, method=method, tolerance=tolerance
+        )
+        return self.isel(**pos_indexers)._replace_indexes(new_indexes)

     def isel_points(self, dim='points', **indexers):
         """Returns a new dataset with each array indexed pointwise along the
@@ -1114,8 +1135,9 @@ def sel_points(self, dim='points', method=None, tolerance=None,
         Dataset.isel_points
         DataArray.sel_points
         """
-        pos_indexers = indexing.remap_label_indexers(
-            self, indexers, method=method, tolerance=tolerance)
+        pos_indexers, _ = indexing.remap_label_indexers(
+            self, indexers, method=method, tolerance=tolerance
+        )
         return self.isel_points(dim=dim, **pos_indexers)

     def reindex_like(self, other, method=None, tolerance=None, copy=True):
@@ -1396,9 +1418,6 @@ def unstack(self, dim):
         obj = self.reindex(copy=False, **{dim: full_idx})

         new_dim_names = index.names
-        if any(name is None for name in new_dim_names):
-            raise ValueError('cannot unstack dimension with unnamed levels')
-
         new_dim_sizes = [lev.size for lev in index.levels]

         variables = OrderedDict()

xarray/core/indexing.py

Lines changed: 51 additions & 8 deletions
@@ -4,7 +4,7 @@

 from . import utils
 from .pycompat import iteritems, range, dask_array_type, suppress
-from .utils import is_full_slice
+from .utils import is_full_slice, is_dict_like


 def expanded_indexer(key, ndim):
def expanded_indexer(key, ndim):
@@ -135,11 +135,18 @@ def _asarray_tuplesafe(values):
135135
return result
136136

137137

138+
def _is_nested_tuple(possible_tuple):
139+
return (isinstance(possible_tuple, tuple)
140+
and any(isinstance(value, (tuple, list, slice))
141+
for value in possible_tuple))
142+
143+
138144
def convert_label_indexer(index, label, index_name='', method=None,
139145
tolerance=None):
140146
"""Given a pandas.Index and labels (e.g., from __getitem__) for one
141147
dimension, return an indexer suitable for indexing an ndarray along that
142-
dimension
148+
dimension. If `index` is a pandas.MultiIndex and depending on `label`,
149+
return a new pandas.Index or pandas.MultiIndex (otherwise return None).
143150
"""
144151
# backwards compatibility for pandas<0.16 (method) or pandas<0.17
145152
# (tolerance)
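The `_is_nested_tuple` helper added in the hunk above decides whether a tuple label carries one plain label per level or a per-level selector (list, tuple or slice). The sketch below copies the same test into a standalone function, renamed `is_nested_tuple` here so that it can run outside xarray, and applies it to the kinds of labels used in the documentation examples.

```python
def is_nested_tuple(possible_tuple):
    # same test as the `_is_nested_tuple` helper added in this commit
    return (isinstance(possible_tuple, tuple)
            and any(isinstance(value, (tuple, list, slice))
                    for value in possible_tuple))


flat = is_nested_tuple(('a', 0))                      # one label per level
nested = is_nested_tuple((['a', 'b'], [0]))           # lists of labels
with_slice = is_nested_tuple((slice('a', 'b'), 0))    # slice on one level
not_a_tuple = is_nested_tuple([('a', 0), ('b', 1)])   # a list of tuples
```

Only `nested` and `with_slice` are true. A list of tuples is not itself a tuple, so it is not treated as nested.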
@@ -152,6 +159,8 @@ def convert_label_indexer(index, label, index_name='', method=None,
                 'the tolerance argument requires pandas v0.17 or newer')
         kwargs['tolerance'] = tolerance

+    new_index = None
+
     if isinstance(label, slice):
         if method is not None or tolerance is not None:
             raise NotImplementedError(
@@ -166,29 +175,63 @@ def convert_label_indexer(index, label, index_name='', method=None,
             raise KeyError('cannot represent labeled-based slice indexer for '
                            'dimension %r with a slice over integer positions; '
                            'the index is unsorted or non-unique')
+
+    elif is_dict_like(label):
+        is_nested_vals = _is_nested_tuple(tuple(label.values()))
+        if not isinstance(index, pd.MultiIndex):
+            raise ValueError('cannot use a dict-like object for selection on a '
+                             'dimension that does not have a MultiIndex')
+        elif len(label) == index.nlevels and not is_nested_vals:
+            indexer = index.get_loc(tuple((label[k] for k in index.names)))
+        else:
+            indexer, new_index = index.get_loc_level(tuple(label.values()),
+                                                     level=tuple(label.keys()))
+
+    elif isinstance(label, tuple) and isinstance(index, pd.MultiIndex):
+        if _is_nested_tuple(label):
+            indexer = index.get_locs(label)
+        elif len(label) == index.nlevels:
+            indexer = index.get_loc(label)
+        else:
+            indexer, new_index = index.get_loc_level(
+                label, level=list(range(len(label)))
+            )
+
     else:
         label = _asarray_tuplesafe(label)
         if label.ndim == 0:
-            indexer = index.get_loc(label.item(), **kwargs)
+            if isinstance(index, pd.MultiIndex):
+                indexer, new_index = index.get_loc_level(label.item(), level=0)
+            else:
+                indexer = index.get_loc(label.item(), **kwargs)
         elif label.dtype.kind == 'b':
             indexer, = np.nonzero(label)
         else:
             indexer = index.get_indexer(label, **kwargs)
             if np.any(indexer < 0):
                 raise KeyError('not all values found in index %r'
                                % index_name)
-    return indexer
+    return indexer, new_index


 def remap_label_indexers(data_obj, indexers, method=None, tolerance=None):
     """Given an xarray data object and label based indexers, return a mapping
-    of equivalent location based indexers.
+    of equivalent location based indexers. Also return a mapping of updated
+    pandas index objects (in case of multi-index level drop).
     """
     if method is not None and not isinstance(method, str):
         raise TypeError('``method`` must be a string')
-    return dict((dim, convert_label_indexer(data_obj[dim].to_index(), label,
-                                            dim, method, tolerance))
-                for dim, label in iteritems(indexers))
+
+    pos_indexers, new_indexes = {}, {}
+    for dim, label in iteritems(indexers):
+        index = data_obj[dim].to_index()
+        idxr, new_idx = convert_label_indexer(index, label,
+                                              dim, method, tolerance)
+        pos_indexers[dim] = idxr
+        if new_idx is not None:
+            new_indexes[dim] = new_idx
+
+    return pos_indexers, new_indexes


 def slice_slice(old_slice, applied_slice, size):
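The new branches of `convert_label_indexer` rely on three `pandas.MultiIndex` lookups: `get_loc`, `get_loc_level` and `get_locs`. The sketch below calls each of them directly on the `midx` from the documentation, as a rough guide to what each branch receives back. It is a standalone illustration, not code from the commit, and the exact type of `indexer` (slice or boolean mask) may depend on the pandas version.

```python
import pandas as pd

midx = pd.MultiIndex.from_product([list('abc'), [0, 1]],
                                  names=('one', 'two'))

# One label per level: a single integer position, no level is dropped.
full = midx.get_loc(('b', 1))

# A label for a subset of the levels: positions plus the reduced index,
# which is what ends up in `new_index` in the code above.
indexer, new_index = midx.get_loc_level('b', level='one')

# Nested tuple (lists or slices per level): an array of positions.
locs = midx.get_locs((['a', 'b'], [0]))
```

Only `get_loc_level` returns a second value, which is why `new_index` stays `None` on the other paths and `remap_label_indexers` records a replacement index only when one was produced.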

xarray/core/variable.py

Lines changed: 7 additions & 1 deletion
@@ -1143,7 +1143,13 @@ def to_index(self):
         # basically free as pandas.Index objects are immutable
         assert self.ndim == 1
         index = self._data_cached().array
-        if not isinstance(index, pd.MultiIndex):
+        if isinstance(index, pd.MultiIndex):
+            # set default names for multi-index unnamed levels so that
+            # we can safely rename dimension / coordinate later
+            valid_level_names = [name or '{}_level_{}'.format(self.name, i)
+                                 for i, name in enumerate(index.names)]
+            index = index.set_names(valid_level_names)
+        else:
             index = index.set_names(self.name)
         return index
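The default naming applied to unnamed levels in `to_index` can be checked in plain Python. In the sketch below, `default_level_names` is a hypothetical wrapper around the same list comprehension, with the coordinate name passed in as `dim_name` instead of being read from `self.name`.

```python
def default_level_names(dim_name, level_names):
    # same expression as in the patched `to_index`
    return [name or '{}_level_{}'.format(dim_name, i)
            for i, name in enumerate(level_names)]


names = default_level_names('x', [None, 'two'])
print(names)
```

Because the expression uses `or`, any falsy level name (for example an empty string) is replaced as well, not only `None`.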
