Multidimensional groupby #818
@@ -7,3 +7,4 @@ Examples
    examples/quick-overview
    examples/weather-data
    examples/monthly-means
+   examples/multidimensional-coords
@@ -0,0 +1,201 @@
.. _examples.multidim:

Working with Multidimensional Coordinates
=========================================

Author: `Ryan Abernathey <http://github.org/rabernat>`__

Many datasets have *physical coordinates* which differ from their
*logical coordinates*. Xarray provides several ways to plot and analyze
such datasets.

.. code:: python

    %matplotlib inline
    import numpy as np
    import pandas as pd
    import xarray as xr
    import cartopy.crs as ccrs
    from matplotlib import pyplot as plt

    print("numpy version : ", np.__version__)
    print("pandas version : ", pd.__version__)
    print("xarray version : ", xr.version.version)

.. parsed-literal::

    ('numpy version : ', '1.11.0')
    ('pandas version : ', u'0.18.0')
    ('xarray version : ', '0.7.2-32-gf957eb8')
As an example, consider this dataset from the
`xarray-data <https://github.com/pydata/xarray-data>`__ repository.

.. code:: python

    ! curl -L -O https://github.com/pydata/xarray-data/raw/master/RASM_example_data.nc

.. code:: python

    ds = xr.open_dataset('RASM_example_data.nc')
    ds

.. parsed-literal::

    <xarray.Dataset>
    Dimensions:  (time: 36, x: 275, y: 205)
    Coordinates:
      * time     (time) datetime64[ns] 1980-09-16T12:00:00 1980-10-17 ...
        yc       (y, x) float64 16.53 16.78 17.02 17.27 17.51 17.76 18.0 18.25 ...
        xc       (y, x) float64 189.2 189.4 189.6 189.7 189.9 190.1 190.2 190.4 ...
      * x        (x) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ...
      * y        (y) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ...
    Data variables:
        Tair     (time, y, x) float64 nan nan nan nan nan nan nan nan nan nan ...
    Attributes:
        title: /workspace/jhamman/processed/R1002RBRxaaa01a/lnd/temp/R1002RBRxaaa01a.vic.ha.1979-09-01.nc
        institution: U.W.
        source: RACM R1002RBRxaaa01a
        output_frequency: daily
        output_mode: averaged
        convention: CF-1.4
        references: Based on the initial model of Liang et al., 1994, JGR, 99, 14,415-14,429.
        comment: Output from the Variable Infiltration Capacity (VIC) model.
        nco_openmp_thread_number: 1
        NCO: 4.3.7
        history: history deleted for brevity

In this example, the *logical coordinates* are ``x`` and ``y``, while
the *physical coordinates* are ``xc`` and ``yc``, which represent the
longitude and latitude of the data.

.. code:: python

    print(ds.xc.attrs)
    print(ds.yc.attrs)

.. parsed-literal::

    OrderedDict([(u'long_name', u'longitude of grid cell center'), (u'units', u'degrees_east'), (u'bounds', u'xv')])
    OrderedDict([(u'long_name', u'latitude of grid cell center'), (u'units', u'degrees_north'), (u'bounds', u'yv')])
Plotting
--------

Let's examine these coordinate variables by plotting them.

.. code:: python

    fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(14, 4))
    ds.xc.plot(ax=ax1)
    ds.yc.plot(ax=ax2)

.. parsed-literal::

    <matplotlib.collections.QuadMesh at 0x118688fd0>

.. parsed-literal::

    /Users/rpa/anaconda/lib/python2.7/site-packages/matplotlib/collections.py:590: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
      if self._edgecolors == str('face'):

.. image:: multidimensional_coords_files/xarray_multidimensional_coords_8_2.png

Note that the variables ``xc`` (longitude) and ``yc`` (latitude) are
two-dimensional scalar fields.

If we try to plot the data variable ``Tair``, by default we get the
logical coordinates.

.. code:: python

    ds.Tair[0].plot()

.. parsed-literal::

    <matplotlib.collections.QuadMesh at 0x11b6da890>

.. image:: multidimensional_coords_files/xarray_multidimensional_coords_10_1.png

In order to visualize the data on a conventional latitude-longitude
grid, we can take advantage of xarray's ability to apply
`cartopy <http://scitools.org.uk/cartopy/index.html>`__ map projections.

.. code:: python

    plt.figure(figsize=(14, 6))
    ax = plt.axes(projection=ccrs.PlateCarree())
    ax.set_global()
    ds.Tair[0].plot.pcolormesh(ax=ax, transform=ccrs.PlateCarree(),
                               x='xc', y='yc', add_colorbar=False)
    ax.coastlines()
    ax.set_ylim([0, 90]);

.. image:: multidimensional_coords_files/xarray_multidimensional_coords_12_0.png
Multidimensional Groupby
------------------------

The above example allowed us to visualize the data on a regular
latitude-longitude grid. But what if we want to do a calculation that
involves grouping over one of these physical coordinates (rather than
the logical coordinates), for example, calculating the mean temperature
at each latitude? This can be achieved using xarray's ``groupby``
function, which accepts multidimensional variables. By default,
``groupby`` will use every unique value in the variable, which is
probably not what we want. Instead, we can use the ``groupby_bins``
function to specify the output coordinates of the group.

.. code:: python

    # define two-degree wide latitude bins
    lat_bins = np.arange(0, 91, 2)
    # define a label for each bin corresponding to the central latitude
    lat_center = np.arange(1, 90, 2)
    # group according to those bins and take the mean
    Tair_lat_mean = ds.Tair.groupby_bins('yc', lat_bins, labels=lat_center).mean()
    # plot the result
    Tair_lat_mean.plot()

.. parsed-literal::

    [<matplotlib.lines.Line2D at 0x11cb92e90>]

.. image:: multidimensional_coords_files/xarray_multidimensional_coords_14_1.png

Note that the resulting coordinate for the ``groupby_bins`` operation
got the ``_bins`` suffix appended: ``yc_bins``. This helps us distinguish
it from the original multidimensional variable ``yc``.
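The documented workflow can be reproduced on a tiny synthetic array. This is a hedged sketch, not part of the PR: the ``lat`` coordinate values and bin edges below are made up for illustration, mirroring how a two-dimensional physical coordinate is binned.

```python
import numpy as np
import xarray as xr

# a 2 x 2 grid whose physical latitude coordinate is two-dimensional
da = xr.DataArray(
    [[0.0, 1.0], [2.0, 3.0]],
    coords={"lat": (("ny", "nx"), [[10.0, 10.0], [20.0, 20.0]])},
    dims=("ny", "nx"),
)

# bin by the 2-D coordinate; the result dimension gains a "_bins" suffix
binned = da.groupby_bins("lat", bins=[0, 15, 20]).mean()
print(binned.dims)    # ('lat_bins',)
print(binned.values)  # [0.5 2.5]
```

The two cells with ``lat == 10`` fall in the bin ``(0, 15]`` and the two with ``lat == 20`` in ``(15, 20]``, so the means are 0.5 and 2.5.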
Large diffs are not rendered by default.
@@ -130,7 +130,8 @@ class GroupBy(object):
     Dataset.groupby
     DataArray.groupby
     """
-    def __init__(self, obj, group, squeeze=False, grouper=None):
+    def __init__(self, obj, group, squeeze=False, grouper=None, bins=None,
+                 cut_kwargs={}):
         """Create a GroupBy object

         Parameters

@@ -145,14 +146,30 @@ def __init__(self, obj, group, squeeze=False, grouper=None):
             if the dimension is squeezed out.
         grouper : pd.Grouper, optional
             Used for grouping values along the `group` array.
+        bins : array-like, optional
+            If `bins` is specified, the groups will be discretized into the
+            specified bins by `pandas.cut`.
+        cut_kwargs : dict, optional
+            Extra keyword arguments to pass to `pandas.cut`
         """
         from .dataset import as_dataset
+        from .dataarray import DataArray

-        if group.ndim != 1:
-            # TODO: remove this limitation?
-            raise ValueError('`group` must be 1 dimensional')
         if getattr(group, 'name', None) is None:
             raise ValueError('`group` must have a name')
+        self._stacked_dim = None
+        if group.ndim != 1:
+            # try to stack the dims of the group into a single dim
+            # TODO: figure out how to exclude dimensions from the stacking
+            # (e.g. group over space dims but leave time dim intact)

[review comment] what would this look like?

+            orig_dims = group.dims
+            stacked_dim_name = 'stacked_' + '_'.join(orig_dims)
+            # the copy is necessary here, otherwise read only array raises
+            # error in pandas: https://github.com/pydata/pandas/issues/12813
+            group = group.stack(**{stacked_dim_name: orig_dims}).copy()
+            obj = obj.stack(**{stacked_dim_name: orig_dims})

[review comment] this is too easy!

+            self._stacked_dim = stacked_dim_name
+            self._unstacked_dims = orig_dims
+        if not hasattr(group, 'dims'):
+            raise ValueError("`group` must have a 'dims' attribute")
         group_dim, = group.dims
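The stacking trick in the hunk above can be seen in isolation with the public ``stack``/``unstack`` API. This is a sketch independent of the PR's internals; the dimension names are arbitrary, chosen to mirror the ``'stacked_' + '_'.join(orig_dims)`` naming scheme.

```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.arange(6).reshape(2, 3), dims=("y", "x"))

# flatten the two dims into one, as the GroupBy constructor does internally
stacked = da.stack(stacked_y_x=("y", "x"))
print(stacked.dims)                  # ('stacked_y_x',)
print(stacked.sizes["stacked_y_x"])  # 6

# unstacking recovers the original two-dimensional layout
roundtrip = stacked.unstack("stacked_y_x")
print(roundtrip.dims)                # ('y', 'x')
```

Once the group is one-dimensional, the ordinary 1-D groupby machinery applies; the unstack step is what `_maybe_unstack_array` later undoes on the result.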
@@ -167,23 +184,31 @@ def __init__(self, obj, group, squeeze=False, grouper=None):
                              'dimension')
             full_index = None

+        if grouper is not None and bins is not None:
+            raise TypeError("Can't specify both `grouper` and `bins`.")
+        if bins is not None:
+            binned = pd.cut(group.values, bins, **cut_kwargs)
+            new_dim_name = group.name + '_bins'
+            group = DataArray(binned, group.coords, name=new_dim_name)
         if grouper is not None:
             # time-series resampling
             index = safe_cast_to_index(group)
             if not index.is_monotonic:
                 # TODO: sort instead of raising an error
                 raise ValueError('index must be monotonic for resampling')
             s = pd.Series(np.arange(index.size), index)
             first_items = s.groupby(grouper).first()
             if first_items.isnull().any():
                 full_index = first_items.index
                 first_items = first_items.dropna()
-            bins = first_items.values.astype(np.int64)
-            group_indices = ([slice(i, j) for i, j in zip(bins[:-1], bins[1:])] +
-                             [slice(bins[-1], None)])
+            sbins = first_items.values.astype(np.int64)
+            group_indices = ([slice(i, j) for i, j in zip(sbins[:-1], sbins[1:])] +
+                             [slice(sbins[-1], None)])

[review comment] The nature of … But I don't think you need to use this code path here. Instead try inserting something like the following just above where we set `if bins is None`:

    group = xr.DataArray(pd.cut(group.values, bins), group.coords, name=group.name)

Then I think this can go through the normal code path, which involves calling …

             unique_coord = Coordinate(group.name, first_items.index)
-        elif group.name in obj.dims:
+        elif group.name in obj.dims and bins is None:
             # assume that group already has sorted, unique values
+            # (if using bins, the group will have the same name as a dimension
+            # but different values)
             if group.dims != (group.name,):
                 raise ValueError('`group` is required to be a coordinate if '
                                  '`group.name` is a dimension in `obj`')
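The binning path above delegates to ``pandas.cut``, whose default right-closed intervals explain why the left-most bin edge is excluded from every group. A small illustration (not PR code; the values and edges are arbitrary):

```python
import numpy as np
import pandas as pd

values = np.array([0.0, 1.0, 2.0, 4.0])

# right-closed intervals by default: 0.0 lies outside (0, 1.5] and (1.5, 5]
binned = pd.cut(values, bins=[0, 1.5, 5])
print(binned[0])  # nan -- dropped from every group
print(binned[1])  # (0.0, 1.5]
print(binned[3])  # (1.5, 5.0]
```

Values that fall in no bin come back as missing and are simply absent from the resulting groups, which is the behavior the tests below rely on.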
@@ -276,6 +301,13 @@ def _maybe_restore_empty_groups(self, combined):
             combined = combined.reindex(**indexers)
         return combined

+    def _maybe_unstack_array(self, arr):
+        """This gets called if we are applying on an array with a
+        multidimensional group."""
+        if self._stacked_dim is not None and self._stacked_dim in arr.dims:
+            arr = arr.unstack(self._stacked_dim)
+        return arr
+
     def fillna(self, value):
         """Fill missing values in this object by group.

@@ -394,6 +426,12 @@ def lookup_order(dimension):
         new_order = sorted(stacked.dims, key=lookup_order)
         return stacked.transpose(*new_order)

+    def _restore_multiindex(self, combined):
+        if self._stacked_dim is not None and self._stacked_dim in combined.dims:
+            stacked_dim = self.group[self._stacked_dim]
+            combined[self._stacked_dim] = stacked_dim
+        return combined
+
     def apply(self, func, shortcut=False, **kwargs):
         """Apply a function over each array in the group and concatenate them
         together into a new array.

@@ -437,22 +475,22 @@ def apply(self, func, shortcut=False, **kwargs):
         grouped = self._iter_grouped()
         applied = (maybe_wrap_array(arr, func(arr, **kwargs)) for arr in grouped)
         combined = self._concat(applied, shortcut=shortcut)
-        result = self._maybe_restore_empty_groups(combined)
+        result = self._maybe_restore_empty_groups(
+            self._maybe_unstack_array(combined))
         return result

     def _concat(self, applied, shortcut=False):
         # peek at applied to determine which coordinate to stack over
         applied_example, applied = peek_at(applied)
         concat_dim, positions = self._infer_concat_args(applied_example)

         if shortcut:
             combined = self._concat_shortcut(applied, concat_dim, positions)
         else:
             combined = concat(applied, concat_dim)
             combined = _maybe_reorder(combined, concat_dim, positions)

         if isinstance(combined, type(self.obj)):
             combined = self._restore_dim_order(combined)
+        combined = self._restore_multiindex(combined)
         return combined

     def reduce(self, func, dim=None, axis=None, keep_attrs=False,
@@ -1304,6 +1304,65 @@ def test_groupby_first_and_last(self):
         expected = array  # should be a no-op
         self.assertDataArrayIdentical(expected, actual)

+    def make_groupby_multidim_example_array(self):
+        return DataArray([[[0, 1], [2, 3]], [[5, 10], [15, 20]]],
+                         coords={'lon': (['ny', 'nx'], [[30., 40.], [40., 50.]]),
+                                 'lat': (['ny', 'nx'], [[10., 10.], [20., 20.]])},
+                         dims=['time', 'ny', 'nx'])
+
+    def test_groupby_multidim(self):
+        array = self.make_groupby_multidim_example_array()
+        for dim, expected_sum in [
+                ('lon', DataArray([5, 28, 23], coords={'lon': [30., 40., 50.]})),
+                ('lat', DataArray([16, 40], coords={'lat': [10., 20.]}))]:
+            actual_sum = array.groupby(dim).sum()
+            self.assertDataArrayIdentical(expected_sum, actual_sum)
+
+    def test_groupby_multidim_apply(self):
+        array = self.make_groupby_multidim_example_array()
+        actual = array.groupby('lon').apply(
+            lambda x: x - x.mean(), shortcut=False)
+        expected = DataArray([[[-2.5, -6.], [-5., -8.5]],
+                              [[2.5, 3.], [8., 8.5]]],
+                             coords=array.coords, dims=array.dims)
+        self.assertDataArrayIdentical(expected, actual)
+
+    def test_groupby_bins(self):
+        array = DataArray(np.arange(4), dims='dim_0')
+        # the first value should not be part of any group ("right" binning)
+        array[0] = 99
+        # bins follow conventions for pandas.cut
+        # http://pandas.pydata.org/pandas-docs/stable/generated/pandas.cut.html
+        bins = [0, 1.5, 5]
+        bin_coords = ['(0, 1.5]', '(1.5, 5]']
+        expected = DataArray([1, 5], dims='dim_0_bins',
+                             coords={'dim_0_bins': bin_coords})
+        # the problem with this is that it overwrites the dimensions of array!
+        # actual = array.groupby('dim_0', bins=bins).sum()

[review comment] what do you mean by these two comments? Do we need this commented out line?

[review comment] I think this is related to the …

+        actual = array.groupby_bins('dim_0', bins).apply(
+            lambda x: x.sum(), shortcut=False)
+        self.assertDataArrayIdentical(expected, actual)
+        # make sure original array dims are unchanged
+        # (would fail with shortcut=True above)
+        self.assertEqual(len(array.dim_0), 4)
+
+    def test_groupby_bins_multidim(self):
+        array = self.make_groupby_multidim_example_array()
+        bins = [0, 15, 20]
+        bin_coords = ['(0, 15]', '(15, 20]']
+        expected = DataArray([16, 40], dims='lat_bins',
+                             coords={'lat_bins': bin_coords})
+        actual = array.groupby_bins('lat', bins).apply(
+            lambda x: x.sum(), shortcut=False)
+        self.assertDataArrayIdentical(expected, actual)
+        # modify the array coordinates to be non-monotonic after unstacking
+        array['lat'].data = np.array([[10., 20.], [20., 10.]])
+        expected = DataArray([28, 28], dims='lat_bins',
+                             coords={'lat_bins': bin_coords})
+        actual = array.groupby_bins('lat', bins).apply(
+            lambda x: x.sum(), shortcut=False)
+        self.assertDataArrayIdentical(expected, actual)
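The behavior these tests pin down can be reproduced standalone. This is a sketch assuming a recent xarray: an explicit ``dim_0`` coordinate is added here (the test relies on the implicit dimension index), and bin labels may be pandas ``Interval`` objects rather than the strings used in the tests above.

```python
import numpy as np
import xarray as xr

# explicit coordinate [0, 1, 2, 3]; the data value at coordinate 0 is set
# to 99 so a wrongly-included left bin edge would be obvious in the sums
array = xr.DataArray([99, 1, 2, 3], dims="dim_0",
                     coords={"dim_0": [0, 1, 2, 3]})

result = array.groupby_bins("dim_0", bins=[0, 1.5, 5]).sum()
print(result.values)         # [1 5]
print(array.sizes["dim_0"])  # 4 -- the original array is unchanged
```

The coordinate value 0 falls outside both right-closed bins, so the 99 never contributes: bin ``(0, 1.5]`` sums only the value at coordinate 1, and ``(1.5, 5]`` sums the values at coordinates 2 and 3.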
[review comment] we'll need some tests that catch common errors such as …

     def make_rolling_example_array(self):
         times = pd.date_range('2000-01-01', freq='1D', periods=21)
         values = np.random.random((21, 4))