Commit ebe0dd0

spencerkclark authored and Joe Hamman committed
CFTimeIndex (#1252)
* Start on implementing and testing NetCDFTimeIndex
* TST Move to using pytest fixtures to structure tests
* Address initial review comments
* Address second round of review comments
* Fix failing python3 tests
* Match test method name to method name
* First attempts at integrating NetCDFTimeIndex into xarray

  This is a first pass at the following:
  - Resetting the logic for decoding datetimes such that `np.datetime64` objects are never used for non-standard calendars
  - Adding logic to use a `NetCDFTimeIndex` whenever `netcdftime.datetime` objects are used in an array being cast as an index (so if one reads in a Dataset from a netCDF file or creates one in Python, which is indexed by a time coordinate that uses `netcdftime.datetime` objects a NetCDFTimeIndex will be used rather than a generic object-based index)
  - Adding logic to encode `netcdftime.datetime` objects when saving out to netCDF files
* Cleanup
* Fix DataFrame and Series test failures for NetCDFTimeIndex

  These were related to a recent minor upstream change in pandas: https://github.com/pandas-dev/pandas/blame/master/pandas/core/indexing.py#L1433
* First pass at making NetCDFTimeIndex compatible with #1356
* Address initial review comments
* Restore test_conventions.py
* Fix failing test in test_utils.py
* flake8
* Update for standalone netcdftime
* Address stickler-ci comments
* Skip test_format_netcdftime_datetime if netcdftime not installed
* A start on documentation
* Fix failing zarr tests related to netcdftime encoding
* Simplify test_decode_standard_calendar_single_element_non_ns_range
* Address a couple review comments
* Use else clause in _maybe_cast_to_netcdftimeindex
* Start on adding enable_netcdftimeindex option
* Continue parametrizing tests in test_coding_times.py
* Update time-series.rst for enable_netcdftimeindex option
* Use :py:func: in rst for xarray.set_options
* Add a what's new entry and test that resample raises a TypeError
* Move what's new entry to the version 0.10.3 section
* Add version-dependent pathway for importing netcdftime.datetime
* Make NetCDFTimeIndex and date decoding/encoding compatible with datetime.datetime
* Remove logic to make NetCDFTimeIndex compatible with datetime.datetime
* Documentation edits
* Ensure proper enable_netcdftimeindex option is used under lazy decoding

  Prior to this, opening a dataset with enable_netcdftimeindex set to True and then accessing one of its variables outside the context manager would lead to it being decoded with the default enable_netcdftimeindex (which is False). This makes sure that lazy decoding takes into account the context under which it was called.
* Add fix and test for concatenating variables with a NetCDFTimeIndex

  Previously when concatenating variables indexed by a NetCDFTimeIndex the index would be wrongly converted to a generic pd.Index
* Further namespace changes due to netcdftime/cftime renaming
* NetCDFTimeIndex -> CFTimeIndex
* Documentation updates
* Only allow use of CFTimeIndex when using the standalone cftime

  Also only allow for serialization of cftime.datetime objects when using the standalone cftime package.
* Fix errant what's new changes
* flake8
* Fix skip logic in test_cftimeindex.py
* Use only_use_cftime_datetimes option in num2date
* Require standalone cftime library for all new functionality

  Add tests/fixes for dt accessor with cftime datetimes
* Improve skipping logic in test_cftimeindex.py
* Fix skipping logic in test_cftimeindex.py for when cftime or netcdftime are not available. Use existing requires_cftime decorator where possible (i.e. only on tests that are not parametrized via pytest.mark.parametrize)
* Fix skip logic in Python 3.4 build for test_cftimeindex.py
* Improve error messages for when the standalone cftime is not installed
* Tweak skip logic in test_accessors.py
* flake8
* Address review comments
* Temporarily remove cftime from py27 build environment on windows
* flake8
* Install cftime via pip for Python 2.7 on Windows
* flake8
* Remove unnecessary new lines; simplify _maybe_cast_to_cftimeindex
* Restore test case for #2002 in test_coding_times.py

  I must have inadvertently removed it during a merge.
* Tweak dates out of range warning logic slightly to preserve current default
* Address review comments
1 parent 2c6bd2d commit ebe0dd0

17 files changed: 2095 additions and 367 deletions

doc/time-series.rst

Lines changed: 95 additions & 1 deletion
@@ -70,7 +70,11 @@ You can manual decode arrays in this form by passing a dataset to
 One unfortunate limitation of using ``datetime64[ns]`` is that it limits the
 native representation of dates to those that fall between the years 1678 and
 2262. When a netCDF file contains dates outside of these bounds, dates will be
-returned as arrays of ``netcdftime.datetime`` objects.
+returned as arrays of ``cftime.datetime`` objects and a ``CFTimeIndex``
+can be used for indexing. The ``CFTimeIndex`` enables only a subset of
+the indexing functionality of a ``pandas.DatetimeIndex`` and is only enabled
+when using the standalone version of ``cftime`` (not the version packaged with
+earlier versions of ``netCDF4``). See :ref:`CFTimeIndex` for more information.
 
 Datetime indexing
 -----------------
@@ -207,3 +211,93 @@ Dataset and DataArray objects with an arbitrary number of dimensions.
 
 For more examples of using grouped operations on a time dimension, see
 :ref:`toy weather data`.
+
+
+.. _CFTimeIndex:
+
+Non-standard calendars and dates outside the Timestamp-valid range
+------------------------------------------------------------------
+
+Through the standalone ``cftime`` library and a custom subclass of
+``pandas.Index``, xarray supports a subset of the indexing functionality enabled
+through the standard ``pandas.DatetimeIndex`` for dates from non-standard
+calendars or dates using a standard calendar, but outside the
+`Timestamp-valid range`_ (approximately between years 1678 and 2262). This
+behavior has not yet been turned on by default; to take advantage of this
+functionality, you must have the ``enable_cftimeindex`` option set to
+``True`` within your context (see :py:func:`~xarray.set_options` for more
+information). It is expected that this will become the default behavior in
+xarray version 0.11.
+
+For instance, you can create a DataArray indexed by a time
+coordinate with a no-leap calendar within a context manager setting the
+``enable_cftimeindex`` option, and the time index will be cast to a
+``CFTimeIndex``:
+
+.. ipython:: python
+
+    from itertools import product
+    from cftime import DatetimeNoLeap
+
+    dates = [DatetimeNoLeap(year, month, 1) for year, month in
+             product(range(1, 3), range(1, 13))]
+    with xr.set_options(enable_cftimeindex=True):
+        da = xr.DataArray(np.arange(24), coords=[dates], dims=['time'],
+                          name='foo')
+
+.. note::
+
+   With the ``enable_cftimeindex`` option activated, a ``CFTimeIndex``
+   will be used for time indexing if any of the following are true:
+
+   - The dates are from a non-standard calendar
+   - Any dates are outside the Timestamp-valid range
+
+   Otherwise a ``pandas.DatetimeIndex`` will be used. In addition, if any
+   variable (not just an index variable) is encoded using a non-standard
+   calendar, its times will be decoded into ``cftime.datetime`` objects,
+   regardless of whether or not they can be represented using
+   ``np.datetime64[ns]`` objects.
+
+For data indexed by a ``CFTimeIndex`` xarray currently supports:
+
+- `Partial datetime string indexing`_ using strictly `ISO 8601-format`_ partial
+  datetime strings:
+
+.. ipython:: python
+
+    da.sel(time='0001')
+    da.sel(time=slice('0001-05', '0002-02'))
+
+- Access of basic datetime components via the ``dt`` accessor (in this case
+  just "year", "month", "day", "hour", "minute", "second", "microsecond", and
+  "season"):
+
+.. ipython:: python
+
+    da.time.dt.year
+    da.time.dt.month
+    da.time.dt.season
+
+- Group-by operations based on datetime accessor attributes (e.g. by month of
+  the year):
+
+.. ipython:: python
+
+    da.groupby('time.month').sum()
+
+- And serialization:
+
+.. ipython:: python
+
+    da.to_netcdf('example.nc')
+    xr.open_dataset('example.nc')
+
+.. note::
+
+   Currently resampling along the time dimension for data indexed by a
+   ``CFTimeIndex`` is not supported.
+
+.. _Timestamp-valid range: https://pandas.pydata.org/pandas-docs/stable/timeseries.html#timestamp-limitations
+.. _ISO 8601-format: https://en.wikipedia.org/wiki/ISO_8601
+.. _partial datetime string indexing: https://pandas.pydata.org/pandas-docs/stable/timeseries.html#partial-string-indexing
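
The "approximately between years 1678 and 2262" figure quoted in the documentation above follows from how ``datetime64[ns]`` stores time: a signed 64-bit count of nanoseconds relative to 1970-01-01. A rough back-of-the-envelope check, using only the standard library, is sketched below. The variable names are illustrative and the average Gregorian year length is an approximation, so the result is only meant to show where the bounds come from, not to reproduce the exact limits pandas reports.

```python
# Rough arithmetic for the datetime64[ns] representable range.
# Assumption: an average Gregorian year of 365.2425 days.
NS_PER_SECOND = 10**9
SECONDS_PER_YEAR = 365.2425 * 24 * 60 * 60

INT64_MAX = 2**63 - 1  # largest nanosecond offset a signed 64-bit int can hold

# Number of years representable on either side of the 1970 epoch.
span_years = INT64_MAX / NS_PER_SECOND / SECONDS_PER_YEAR

earliest = 1970 - span_years
latest = 1970 + span_years

print('span: %.2f years' % span_years)
print('earliest: %.1f, latest: %.1f' % (earliest, latest))
```

The span works out to a little over 292 years on each side of the epoch, which is why dates from long paleoclimate or idealized model runs cannot be held in a ``pandas.DatetimeIndex``.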

doc/whats-new.rst

Lines changed: 10 additions & 0 deletions
@@ -34,6 +34,16 @@ v0.10.4 (unreleased)
 Enhancements
 ~~~~~~~~~~~~
 
+- Add an option for using a ``CFTimeIndex`` for indexing times with
+  non-standard calendars and/or outside the Timestamp-valid range; this index
+  enables a subset of the functionality of a standard
+  ``pandas.DatetimeIndex`` (:issue:`789`, :issue:`1084`, :issue:`1252`).
+  By `Spencer Clark <https://github.com/spencerkclark>`_ with help from
+  `Stephan Hoyer <https://github.com/shoyer>`_.
+- Allow for serialization of ``cftime.datetime`` objects (:issue:`789`,
+  :issue:`1084`, :issue:`2008`, :issue:`1252`) using the standalone ``cftime``
+  library. By `Spencer Clark
+  <https://github.com/spencerkclark>`_.
 - Support writing lists of strings as netCDF attributes (:issue:`2044`).
   By `Dan Nowacki <https://github.com/dnowacki-usgs>`_.
 - :py:meth:`~xarray.Dataset.to_netcdf(engine='h5netcdf')` now accepts h5py

xarray/coding/cftimeindex.py

Lines changed: 252 additions & 0 deletions
@@ -0,0 +1,252 @@
+from __future__ import absolute_import
+import re
+from datetime import timedelta
+
+import numpy as np
+import pandas as pd
+
+from xarray.core import pycompat
+from xarray.core.utils import is_scalar
+
+
+def named(name, pattern):
+    return '(?P<' + name + '>' + pattern + ')'
+
+
+def optional(x):
+    return '(?:' + x + ')?'
+
+
+def trailing_optional(xs):
+    if not xs:
+        return ''
+    return xs[0] + optional(trailing_optional(xs[1:]))
+
+
+def build_pattern(date_sep=r'\-', datetime_sep='T', time_sep=r'\:'):
+    pieces = [(None, 'year', r'\d{4}'),
+              (date_sep, 'month', r'\d{2}'),
+              (date_sep, 'day', r'\d{2}'),
+              (datetime_sep, 'hour', r'\d{2}'),
+              (time_sep, 'minute', r'\d{2}'),
+              (time_sep, 'second', r'\d{2}')]
+    pattern_list = []
+    for sep, name, sub_pattern in pieces:
+        pattern_list.append((sep if sep else '') + named(name, sub_pattern))
+        # TODO: allow timezone offsets?
+    return '^' + trailing_optional(pattern_list) + '$'
+
+
+_BASIC_PATTERN = build_pattern(date_sep='', time_sep='')
+_EXTENDED_PATTERN = build_pattern()
+_PATTERNS = [_BASIC_PATTERN, _EXTENDED_PATTERN]
+
+
+def parse_iso8601(datetime_string):
+    for pattern in _PATTERNS:
+        match = re.match(pattern, datetime_string)
+        if match:
+            return match.groupdict()
+    raise ValueError('no ISO-8601 match for string: %s' % datetime_string)
+
+
+def _parse_iso8601_with_reso(date_type, timestr):
+    default = date_type(1, 1, 1)
+    result = parse_iso8601(timestr)
+    replace = {}
+
+    for attr in ['year', 'month', 'day', 'hour', 'minute', 'second']:
+        value = result.get(attr, None)
+        if value is not None:
+            # Note ISO8601 conventions allow for fractional seconds.
+            # TODO: Consider adding support for sub-second resolution?
+            replace[attr] = int(value)
+            resolution = attr
+
+    return default.replace(**replace), resolution
+
+
+def _parsed_string_to_bounds(date_type, resolution, parsed):
+    """Generalization of
+    pandas.tseries.index.DatetimeIndex._parsed_string_to_bounds
+    for use with non-standard calendars and cftime.datetime
+    objects.
+    """
+    if resolution == 'year':
+        return (date_type(parsed.year, 1, 1),
+                date_type(parsed.year + 1, 1, 1) - timedelta(microseconds=1))
+    elif resolution == 'month':
+        if parsed.month == 12:
+            end = date_type(parsed.year + 1, 1, 1) - timedelta(microseconds=1)
+        else:
+            end = (date_type(parsed.year, parsed.month + 1, 1) -
+                   timedelta(microseconds=1))
+        return date_type(parsed.year, parsed.month, 1), end
+    elif resolution == 'day':
+        start = date_type(parsed.year, parsed.month, parsed.day)
+        return start, start + timedelta(days=1, microseconds=-1)
+    elif resolution == 'hour':
+        start = date_type(parsed.year, parsed.month, parsed.day, parsed.hour)
+        return start, start + timedelta(hours=1, microseconds=-1)
+    elif resolution == 'minute':
+        start = date_type(parsed.year, parsed.month, parsed.day, parsed.hour,
+                          parsed.minute)
+        return start, start + timedelta(minutes=1, microseconds=-1)
+    elif resolution == 'second':
+        start = date_type(parsed.year, parsed.month, parsed.day, parsed.hour,
+                          parsed.minute, parsed.second)
+        return start, start + timedelta(seconds=1, microseconds=-1)
+    else:
+        raise KeyError
+
+
+def get_date_field(datetimes, field):
+    """Adapted from pandas.tslib.get_date_field"""
+    return np.array([getattr(date, field) for date in datetimes])
+
+
+def _field_accessor(name, docstring=None):
+    """Adapted from pandas.tseries.index._field_accessor"""
+    def f(self):
+        return get_date_field(self._data, name)
+
+    f.__name__ = name
+    f.__doc__ = docstring
+    return property(f)
+
+
+def get_date_type(self):
+    return type(self._data[0])
+
+
+def assert_all_valid_date_type(data):
+    import cftime
+
+    sample = data[0]
+    date_type = type(sample)
+    if not isinstance(sample, cftime.datetime):
+        raise TypeError(
+            'CFTimeIndex requires cftime.datetime '
+            'objects. Got object of {}.'.format(date_type))
+    if not all(isinstance(value, date_type) for value in data):
+        raise TypeError(
+            'CFTimeIndex requires using datetime '
+            'objects of all the same type. Got\n{}.'.format(data))
+
+
+class CFTimeIndex(pd.Index):
+    year = _field_accessor('year', 'The year of the datetime')
+    month = _field_accessor('month', 'The month of the datetime')
+    day = _field_accessor('day', 'The days of the datetime')
+    hour = _field_accessor('hour', 'The hours of the datetime')
+    minute = _field_accessor('minute', 'The minutes of the datetime')
+    second = _field_accessor('second', 'The seconds of the datetime')
+    microsecond = _field_accessor('microsecond',
+                                  'The microseconds of the datetime')
+    date_type = property(get_date_type)
+
+    def __new__(cls, data):
+        result = object.__new__(cls)
+        assert_all_valid_date_type(data)
+        result._data = np.array(data)
+        return result
+
+    def _partial_date_slice(self, resolution, parsed):
+        """Adapted from
+        pandas.tseries.index.DatetimeIndex._partial_date_slice
+
+        Note that when using a CFTimeIndex, if a partial-date selection
+        returns a single element, it will never be converted to a scalar
+        coordinate; this is in slight contrast to the behavior when using
+        a DatetimeIndex, which sometimes will return a DataArray with a scalar
+        coordinate depending on the resolution of the datetimes used in
+        defining the index. For example:
+
+        >>> from cftime import DatetimeNoLeap
+        >>> import pandas as pd
+        >>> import xarray as xr
+        >>> da = xr.DataArray([1, 2],
+                              coords=[[DatetimeNoLeap(2001, 1, 1),
+                                       DatetimeNoLeap(2001, 2, 1)]],
+                              dims=['time'])
+        >>> da.sel(time='2001-01-01')
+        <xarray.DataArray (time: 1)>
+        array([1])
+        Coordinates:
+          * time     (time) object 2001-01-01 00:00:00
+        >>> da = xr.DataArray([1, 2],
+                              coords=[[pd.Timestamp(2001, 1, 1),
+                                       pd.Timestamp(2001, 2, 1)]],
+                              dims=['time'])
+        >>> da.sel(time='2001-01-01')
+        <xarray.DataArray ()>
+        array(1)
+        Coordinates:
+            time     datetime64[ns] 2001-01-01
+        >>> da = xr.DataArray([1, 2],
+                              coords=[[pd.Timestamp(2001, 1, 1, 1),
+                                       pd.Timestamp(2001, 2, 1)]],
+                              dims=['time'])
+        >>> da.sel(time='2001-01-01')
+        <xarray.DataArray (time: 1)>
+        array([1])
+        Coordinates:
+          * time     (time) datetime64[ns] 2001-01-01T01:00:00
+        """
+        start, end = _parsed_string_to_bounds(self.date_type, resolution,
+                                              parsed)
+        lhs_mask = (self._data >= start)
+        rhs_mask = (self._data <= end)
+        return (lhs_mask & rhs_mask).nonzero()[0]
+
+    def _get_string_slice(self, key):
+        """Adapted from pandas.tseries.index.DatetimeIndex._get_string_slice"""
+        parsed, resolution = _parse_iso8601_with_reso(self.date_type, key)
+        loc = self._partial_date_slice(resolution, parsed)
+        return loc
+
+    def get_loc(self, key, method=None, tolerance=None):
+        """Adapted from pandas.tseries.index.DatetimeIndex.get_loc"""
+        if isinstance(key, pycompat.basestring):
+            return self._get_string_slice(key)
+        else:
+            return pd.Index.get_loc(self, key, method=method,
+                                    tolerance=tolerance)
+
+    def _maybe_cast_slice_bound(self, label, side, kind):
+        """Adapted from
+        pandas.tseries.index.DatetimeIndex._maybe_cast_slice_bound"""
+        if isinstance(label, pycompat.basestring):
+            parsed, resolution = _parse_iso8601_with_reso(self.date_type,
+                                                          label)
+            start, end = _parsed_string_to_bounds(self.date_type, resolution,
+                                                  parsed)
+            if self.is_monotonic_decreasing and len(self):
+                return end if side == 'left' else start
+            return start if side == 'left' else end
+        else:
+            return label
+
+    # TODO: Add ability to use integer range outside of iloc?
+    # e.g. series[1:5].
+    def get_value(self, series, key):
+        """Adapted from pandas.tseries.index.DatetimeIndex.get_value"""
+        if not isinstance(key, slice):
+            return series.iloc[self.get_loc(key)]
+        else:
+            return series.iloc[self.slice_indexer(
+                key.start, key.stop, key.step)]
+
+    def __contains__(self, key):
+        """Adapted from
+        pandas.tseries.base.DatetimeIndexOpsMixin.__contains__"""
+        try:
+            result = self.get_loc(key)
+            return (is_scalar(result) or type(result) == slice or
+                    (isinstance(result, np.ndarray) and result.size))
+        except (KeyError, TypeError, ValueError):
+            return False
+
+    def contains(self, key):
+        """Needed for .loc based partial-string indexing"""
+        return self.__contains__(key)
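
To make the partial-string path in this module easier to follow, here is a minimal standalone sketch of the same idea: build a regex in which each finer-grained field is optional, take the finest matched field as the resolution, and turn the parsed date into inclusive bounds. It is not the code from the commit. The helper names `parse_with_resolution` and `month_bounds` are hypothetical, and the standard library's `datetime.datetime` is used as a stand-in for a `cftime` date type, so only the Gregorian calendar is covered here.

```python
import re
from datetime import datetime, timedelta

FIELDS = ['year', 'month', 'day', 'hour', 'minute', 'second']


def trailing_optional(xs):
    # Nest every later piece inside an optional group after the first one,
    # so '2001', '2001-12' and '2001-12-01' all match the same pattern.
    if not xs:
        return ''
    return xs[0] + '(?:' + trailing_optional(xs[1:]) + ')?'


def build_pattern(date_sep=r'\-', datetime_sep='T', time_sep=r'\:'):
    seps = [None, date_sep, date_sep, datetime_sep, time_sep, time_sep]
    widths = [4, 2, 2, 2, 2, 2]
    pieces = [(sep or '') + '(?P<%s>\\d{%d})' % (name, width)
              for sep, name, width in zip(seps, FIELDS, widths)]
    return '^' + trailing_optional(pieces) + '$'


# Basic format (no separators) is tried first, then the extended format.
PATTERNS = [build_pattern(date_sep='', time_sep=''), build_pattern()]


def parse_with_resolution(timestr, date_type=datetime):
    """Hypothetical helper: return (parsed date, finest matched field)."""
    for pattern in PATTERNS:
        match = re.match(pattern, timestr)
        if match:
            break
    else:
        raise ValueError('no ISO-8601 match for string: %s' % timestr)
    groups = match.groupdict()
    replace = {name: int(groups[name]) for name in FIELDS
               if groups[name] is not None}
    resolution = [name for name in FIELDS if name in replace][-1]
    return date_type(1, 1, 1).replace(**replace), resolution


def month_bounds(parsed, date_type=datetime):
    """Hypothetical helper: inclusive bounds for a month-resolution string."""
    start = date_type(parsed.year, parsed.month, 1)
    if parsed.month == 12:
        # December rolls over into January of the following year.
        next_month = date_type(parsed.year + 1, 1, 1)
    else:
        next_month = date_type(parsed.year, parsed.month + 1, 1)
    return start, next_month - timedelta(microseconds=1)


parsed, resolution = parse_with_resolution('2001-12')
start, end = month_bounds(parsed)
print(resolution, start, end)
```

Computing the end bound as "start of the next period minus one microsecond" avoids hard-coding month lengths, which is what lets the same logic work for calendars such as no-leap or 360-day when a `cftime` date type is passed instead.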
