
Commit 94e44ef

Authored by: pp-mo, bouweandela, HGWright, dependabot[bot], schlunma
Lazy netcdf saves (#5191)
* Basic functional lazy saving.
* Simplify function signature which upsets Sphinx.
* Non-lazy saves return nothing.
* Now fixed to enable use with process/distributed scheduling.
* Remove dask.utils.SerializableLock, which I think was a mistake.
* Make DefferedSaveWrapper use _thread_safe_nc.
* Fixes for non-lazy save.
* Avoid saver error when no deferred writes.
* Reorganise locking code, ready for shareable locks.
* Remove optional usage of 'filelock' for lazy saves.
* Document dask-specific locking; implement differently for threads or distributed schedulers.
* Minor fix for unit-tests.
* Pin libnetcdf to avoid problems -- see #5187.
* Minor test fix.
* Move DeferredSaveWrapper into _thread_safe_nc; replicate the NetCDFDataProxy fix; use one lock per Saver; add extra up-scaled test.
* Update lib/iris/fileformats/netcdf/saver.py
  Co-authored-by: Bouwe Andela <[email protected]>
* Update lib/iris/fileformats/netcdf/_dask_locks.py
  Co-authored-by: Bouwe Andela <[email protected]>
* Update lib/iris/fileformats/netcdf/saver.py
  Co-authored-by: Bouwe Andela <[email protected]>
* Small rename + reformat.
* Remove Saver lazy option; all lazy saves are delayed; factor out fillvalue checks and make them delayable.
* Repurposed 'test__FillValueMaskCheckAndStoreTarget' to 'test__data_fillvalue_check', since old class is gone.
* Disable (temporary) saver debug printouts.
* Fix test problems; Saver automatically completes to preserve existing direct usage (which is public API).
* Fix docstring error.
* Fix spurious error in old saver test.
* Fix Saver docstring.
* More robust exit for NetCDFWriteProxy operation.
* Fix doctests by making the Saver example functional.
* Improve docstrings; unify terminology; simplify non-lazy save call.
* Moved netcdf cell-method handling into nc_load_rules.helpers, and various tests into more specific test folders.
* Fix lockfiles and Makefile process.
* Add unit tests for routine _fillvalue_report().
* Remove debug-only code.
* Added tests for what the save function does with the 'compute' keyword.
* Fix mock-specific problems, small tidy.
* Restructure hierarchy of tests.unit.fileformats.netcdf.
* Tidy test docstrings.
* Correct test import.
* Avoid incorrect checking of byte data, and a numpy deprecation warning.
* Alter parameter names to make test reports clearer.
* Test basic behaviour of _lazy_stream_data; make 'Saver._delayed_writes' private.
* Add integration tests, and distributed dependency.
* Docstring fixes.
* Documentation section and whatsnew entry.
* Various fixes to whatsnew, docstrings and docs.
* Minor review changes, fix doctest.
* Arrange tests + results to organise by package-name alone.
* Review changes.
* Review changes.
* Enhance tests + debug.
* Support scheduler type 'single-threaded'; allow retries on delayed-save test.
* Improve test.
* Adding a whatsnew entry for 5224 (#5234)
  * Adding a whatsnew entry explaining 5224
  * Fixing link and format error
* Replacing numpy legacy printing with array2string and remaking results for dependent tests
* adding a whatsnew entry
* configure codecov
* remove results creation commit from blame
* fixing whatsnew entry
* Bump scitools/workflows from 2023.04.1 to 2023.04.2 (#5236)
  Bumps [scitools/workflows](https://github.com/scitools/workflows) from 2023.04.1 to 2023.04.2.
  - [Release notes](https://github.com/scitools/workflows/releases)
  - [Commits](SciTools/workflows@2023.04.1...2023.04.2)
  updated-dependencies:
  - dependency-name: scitools/workflows
    dependency-type: direct:production
  ...
  Signed-off-by: dependabot[bot] <[email protected]>
  Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Use real array for data of small netCDF variables. (#5229)
  * Small netCDF variable data is real.
  * Various test fixes.
  * More test fixing.
  * Fix printout in Mesh documentation.
  * Whatsnew + doctests fix.
  * Tweak whatsnew.
* Handle derived coordinates correctly in `concatenate` (#5096)
  * First working prototype of concatenate that handles derived coordinates correctly
  * Added checks for derived coord metadata during concatenation
  * Added tests
  * Fixed defaults
  * Added what's new entry
  * Optimized test coverage
* clarity on whatsnew entry contributors (#5240)
* Modernize and simplify iris.analysis._Groupby (#5015)
  * Modernize and simplify _Groupby
  * Rename variable to improve readability
  * Add a whatsnew entry
  * Add a type hint to _add_shared_coord
  * Add a test for iris.analysis._Groupby.__repr__
  Co-authored-by: Martin Yeo <[email protected]>
* Finalises Lazy Data documentation (#5137)
  * cube and io lazy data notes added
  * [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
  * Added comments within analysis, as well as palette and iterate, and what's new
  * fixed docstrings as requested in @trexfeathers review
  * reverted cube.py for time being
  * fixed flake8 issue
  * Lazy data second batch
  * [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
  * updated latest what's new
  * I almost hope this wasn't the fix, I'm such a moron
  * addressed review changes
  Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
  Co-authored-by: Bill Little <[email protected]>
* Fixes to _discontiguity_in_bounds (attempt 2) (#4975)
* update ci locks location (#5228)
* Updated environment lockfiles (#5211)
  Co-authored-by: Lockfile bot <[email protected]>
* Increase retries.
* Change debug to show which elements failed.
* update cf standard units (#5244)
  * update cf standard units
  * added whatsnew entry
  * Correct pull number
  Co-authored-by: Martin Yeo <[email protected]>
* libnetcdf <4.9 pin (#5242)
  * Pin libnetcdf<4.9 and update lock files.
  * What's New entry.
  * libnetcdf not available on PyPI.
* Fix for Pandas v2.0.
* Fix for Pandas v2.0.
* Avoid possible same-file crossover between tests.
* Ensure all-different testfiles; load all vars lazy.
* Revert changes to testing framework.
* Remove repeated line from requirements/py*.yml (?merge error), and re-fix lockfiles.
* Revert some more debug changes.
* Reorganise test for better code clarity.
* Use public 'Dataset.isopen()' instead of '._isopen'.
* Create output files in unique temporary directories.
* Tests for fileformats.netcdf._dask_locks.
* Fix attribution names.
* Fixed new py311 lockfile.
* Fix typos spotted by codespell.
* Add distributed test dep for python 3.11
* Fix lockfile for python 3.11

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: Bouwe Andela <[email protected]>
Co-authored-by: Henry Wright <[email protected]>
Co-authored-by: Henry Wright <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Manuel Schlund <[email protected]>
Co-authored-by: Bill Little <[email protected]>
Co-authored-by: Bouwe Andela <[email protected]>
Co-authored-by: Martin Yeo <[email protected]>
Co-authored-by: Elias <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: stephenworsley <[email protected]>
Co-authored-by: scitools-ci[bot] <107775138+scitools-ci[bot]@users.noreply.github.com>
Co-authored-by: Lockfile bot <[email protected]>
1 parent 949b296 commit 94e44ef

39 files changed: +1700 additions, -331 deletions

docs/src/userguide/real_and_lazy_data.rst

Lines changed: 40 additions & 2 deletions

@@ -6,6 +6,7 @@

 import dask.array as da
 import iris
+from iris.cube import CubeList
 import numpy as np


@@ -227,10 +228,47 @@ coordinates' lazy points and bounds:
 Dask Processing Options
 -----------------------

-Iris uses dask to provide lazy data arrays for both Iris cubes and coordinates,
-and for computing deferred operations on lazy arrays.
+Iris uses `Dask <https://docs.dask.org/en/stable/>`_ to provide lazy data arrays for
+both Iris cubes and coordinates, and for computing deferred operations on lazy arrays.

 Dask provides processing options to control how deferred operations on lazy arrays
 are computed. This is provided via the ``dask.set_options`` interface. See the
 `dask documentation <http://dask.pydata.org/en/latest/scheduler-overview.html>`_
 for more information on setting dask processing options.
+
+
+.. _delayed_netcdf_save:
+
+Delayed NetCDF Saving
+---------------------
+
+When saving data to NetCDF files, it is possible to *delay* writing lazy content to the
+output file, to be performed by `Dask <https://docs.dask.org/en/stable/>`_ later,
+thus enabling parallel save operations.
+
+This works in the following way:
+
+1. an :func:`iris.save` call is made, with a NetCDF file output and the additional
+   keyword ``compute=False``.
+   This is currently *only* available when saving to NetCDF, so it is documented in
+   the Iris NetCDF file format API. See: :func:`iris.fileformats.netcdf.save`.
+
+2. the call creates the output file, but does not fill in variables' data, where
+   the data is a lazy array in the Iris object. Instead, these variables are
+   initially created "empty".
+
+3. the :meth:`~iris.save` call returns a ``result`` which is a
+   :class:`~dask.delayed.Delayed` object.
+
+4. the save can be completed later by calling ``result.compute()``, or by passing it
+   to the :func:`dask.compute` call.
+
+The benefit of this is that costly data transfer operations can be performed in
+parallel with writes to other data files. Also, where array contents are calculated
+from shared lazy input data, these can be computed in parallel efficiently by Dask
+(i.e. without re-fetching), similar to what :meth:`iris.cube.CubeList.realise_data`
+can do.
+
+.. note::
+    This feature does **not** enable parallel writes to the *same* NetCDF output file.
+    That can only be done on certain operating systems, with a specially configured
+    build of the NetCDF C library, and is not supported by Iris at present.
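The four-step protocol described in the new documentation section can be sketched without Iris or Dask at all. The following is a toy, stdlib-only illustration of the ``compute=False`` pattern: `toy_save`, `ToyDelayed`, and the text "file format" are invented stand-ins for :func:`iris.save` and :class:`dask.delayed.Delayed`, not real Iris API.

```python
import os
import tempfile


class ToyDelayed:
    """Toy stand-in for dask.delayed.Delayed: holds work to be run later."""

    def __init__(self, func):
        self._func = func

    def compute(self):
        return self._func()


def toy_save(data, path, compute=True):
    """Sketch of the delayed-save protocol (not the real iris.save)."""
    # Step 2: the output file is created immediately, but the (lazy)
    # data is not written yet -- the variable is left "empty".
    with open(path, "w") as f:
        f.write("header\n")

    def fill_in():
        # The deferred data write, performed later by .compute().
        with open(path, "a") as f:
            f.write(f"data: {data}\n")

    if compute:
        # Non-lazy save: write everything now and return nothing.
        fill_in()
        return None
    # Step 3: return a Delayed-like object instead of writing the data.
    return ToyDelayed(fill_in)


path = os.path.join(tempfile.mkdtemp(), "out.txt")
result = toy_save([1, 2, 3], path, compute=False)  # step 1: compute=False
print(open(path).read())  # only the header so far; data not yet written
result.compute()          # step 4: complete the save later
print(open(path).read())  # now includes the data
```

With real Iris, the same shape applies: ``result = iris.save(cubes, "out.nc", compute=False)`` followed later by ``result.compute()``, or by handing several such results to one :func:`dask.compute` call so their data transfers run in parallel.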

docs/src/whatsnew/latest.rst

Lines changed: 29 additions & 2 deletions

@@ -30,7 +30,33 @@ This document explains the changes made to Iris for this release
 ✨ Features
 ===========

-#. N/A
+#. `@bsherratt`_ added support for plugins - see the corresponding
+   :ref:`documentation page<community_plugins>` for further information.
+   (:pull:`5144`)
+
+#. `@rcomer`_ enabled lazy evaluation of :obj:`~iris.analysis.RMS` calculations
+   with weights. (:pull:`5017`)
+
+#. `@schlunma`_ allowed the usage of cubes, coordinates, cell measures, or
+   ancillary variables as weights for cube aggregations
+   (:meth:`iris.cube.Cube.collapsed`, :meth:`iris.cube.Cube.aggregated_by`, and
+   :meth:`iris.cube.Cube.rolling_window`). This automatically adapts cube units
+   if necessary. (:pull:`5084`)
+
+#. `@lbdreyer`_ and `@trexfeathers`_ (reviewer) added :func:`iris.plot.hist`
+   and :func:`iris.quickplot.hist`. (:pull:`5189`)
+
+#. `@tinyendian`_ edited :func:`~iris.analysis.cartography.rotate_winds` to
+   enable lazy computation of rotated wind vector components (:issue:`4934`,
+   :pull:`4972`)
+
+#. `@ESadek-MO`_ updated to the latest CF Standard Names Table v80
+   (07 February 2023). (:pull:`5244`)
+
+#. `@pp-mo`_ and `@lbdreyer`_ supported delayed saving of lazy data, when writing to
+   the netCDF file format. See: :ref:`delayed netCDF saves <delayed_netcdf_save>`.
+   Also with significant input from `@fnattino`_.
+   (:pull:`5191`)


 🐛 Bugs Fixed

@@ -97,7 +123,8 @@ This document explains the changes made to Iris for this release
 Whatsnew author names (@github name) in alphabetical order. Note that,
 core dev names are automatically included by the common_links.inc:

-
+.. _@fnattino: https://github.com/fnattino
+.. _@tinyendian: https://github.com/tinyendian


 .. comment

lib/iris/fileformats/_nc_load_rules/helpers.py

Lines changed: 206 additions & 4 deletions

@@ -13,6 +13,8 @@
 build routines, and which it does not use.

 """
+import re
+from typing import List
 import warnings

 import cf_units
@@ -28,10 +30,6 @@
 import iris.exceptions
 import iris.fileformats.cf as cf
 import iris.fileformats.netcdf
-from iris.fileformats.netcdf import (
-    UnknownCellMethodWarning,
-    parse_cell_methods,
-)
 from iris.fileformats.netcdf.loader import _get_cf_var_data
 import iris.std_names
 import iris.util
@@ -184,6 +182,210 @@
 CF_VALUE_STD_NAME_PROJ_Y = "projection_y_coordinate"


+################################################################################
+# Handling of cell-methods.
+
+_CM_COMMENT = "comment"
+_CM_EXTRA = "extra"
+_CM_INTERVAL = "interval"
+_CM_METHOD = "method"
+_CM_NAME = "name"
+_CM_PARSE_NAME = re.compile(r"([\w_]+\s*?:\s+)+")
+_CM_PARSE = re.compile(
+    r"""
+    (?P<name>([\w_]+\s*?:\s+)+)
+    (?P<method>[\w_\s]+(?![\w_]*\s*?:))\s*
+    (?:
+        \(\s*
+        (?P<extra>.+)
+        \)\s*
+    )?
+    """,
+    re.VERBOSE,
+)
+
+# Cell methods.
+_CM_KNOWN_METHODS = [
+    "point",
+    "sum",
+    "mean",
+    "maximum",
+    "minimum",
+    "mid_range",
+    "standard_deviation",
+    "variance",
+    "mode",
+    "median",
+]
+
+
+def _split_cell_methods(nc_cell_methods: str) -> List[re.Match]:
+    """
+    Split a CF cell_methods attribute string into a list of zero or more cell
+    methods, each of which is then parsed with a regex to return a list of match
+    objects.
+
+    Args:
+
+    * nc_cell_methods: The value of the cell methods attribute to be split.
+
+    Returns:
+
+    * nc_cell_methods_matches: A list of the re.Match objects associated with
+      each parsed cell method
+
+    Splitting is done based on words followed by colons outside of any brackets.
+    Validation of anything other than being laid out in the expected format is
+    left to the calling function.
+    """
+    # Find name candidates
+    name_start_inds = []
+    for m in _CM_PARSE_NAME.finditer(nc_cell_methods):
+        name_start_inds.append(m.start())
+
+    # Remove those that fall inside brackets
+    bracket_depth = 0
+    for ind, cha in enumerate(nc_cell_methods):
+        if cha == "(":
+            bracket_depth += 1
+        elif cha == ")":
+            bracket_depth -= 1
+            if bracket_depth < 0:
+                msg = (
+                    "Cell methods may be incorrectly parsed due to mismatched "
+                    "brackets"
+                )
+                warnings.warn(msg, UserWarning, stacklevel=2)
+        if bracket_depth > 0 and ind in name_start_inds:
+            name_start_inds.remove(ind)
+
+    # List tuples of indices of starts and ends of the cell methods in the string
+    method_indices = []
+    for ii in range(len(name_start_inds) - 1):
+        method_indices.append((name_start_inds[ii], name_start_inds[ii + 1]))
+    method_indices.append((name_start_inds[-1], len(nc_cell_methods)))
+
+    # Index the string and match against each substring
+    nc_cell_methods_matches = []
+    for start_ind, end_ind in method_indices:
+        nc_cell_method_str = nc_cell_methods[start_ind:end_ind]
+        nc_cell_method_match = _CM_PARSE.match(nc_cell_method_str.strip())
+        if not nc_cell_method_match:
+            msg = (
+                f"Failed to fully parse cell method string: {nc_cell_methods}"
+            )
+            warnings.warn(msg, UserWarning, stacklevel=2)
+            continue
+        nc_cell_methods_matches.append(nc_cell_method_match)
+
+    return nc_cell_methods_matches
+
+
+class UnknownCellMethodWarning(Warning):
+    pass
+
+
+def parse_cell_methods(nc_cell_methods):
+    """
+    Parse a CF cell_methods attribute string into a tuple of zero or
+    more CellMethod instances.
+
+    Args:
+
+    * nc_cell_methods (str):
+        The value of the cell methods attribute to be parsed.
+
+    Returns:
+
+    * cell_methods
+        An iterable of :class:`iris.coords.CellMethod`.
+
+    Multiple coordinates, intervals and comments are supported.
+    If a method has a non-standard name a warning will be issued, but the
+    results are not affected.
+
+    """
+    cell_methods = []
+    if nc_cell_methods is not None:
+        for m in _split_cell_methods(nc_cell_methods):
+            d = m.groupdict()
+            method = d[_CM_METHOD]
+            method = method.strip()
+            # Check validity of method, allowing for multi-part methods
+            # e.g. mean over years.
+            method_words = method.split()
+            if method_words[0].lower() not in _CM_KNOWN_METHODS:
+                msg = "NetCDF variable contains unknown cell method {!r}"
+                warnings.warn(
+                    msg.format("{}".format(method_words[0])),
+                    UnknownCellMethodWarning,
+                )
+            d[_CM_METHOD] = method
+            name = d[_CM_NAME]
+            name = name.replace(" ", "")
+            name = name.rstrip(":")
+            d[_CM_NAME] = tuple([n for n in name.split(":")])
+            interval = []
+            comment = []
+            if d[_CM_EXTRA] is not None:
+                #
+                # tokenise the key words and field colon marker
+                #
+                d[_CM_EXTRA] = d[_CM_EXTRA].replace(
+                    "comment:", "<<comment>><<:>>"
+                )
+                d[_CM_EXTRA] = d[_CM_EXTRA].replace(
+                    "interval:", "<<interval>><<:>>"
+                )
+                d[_CM_EXTRA] = d[_CM_EXTRA].split("<<:>>")
+                if len(d[_CM_EXTRA]) == 1:
+                    comment.extend(d[_CM_EXTRA])
+                else:
+                    next_field_type = comment
+                    for field in d[_CM_EXTRA]:
+                        field_type = next_field_type
+                        index = field.rfind("<<interval>>")
+                        if index == 0:
+                            next_field_type = interval
+                            continue
+                        elif index > 0:
+                            next_field_type = interval
+                        else:
+                            index = field.rfind("<<comment>>")
+                            if index == 0:
+                                next_field_type = comment
+                                continue
+                            elif index > 0:
+                                next_field_type = comment
+                        if index != -1:
+                            field = field[:index]
+                        field_type.append(field.strip())
+            #
+            # cater for a shared interval over multiple axes
+            #
+            if len(interval):
+                if len(d[_CM_NAME]) != len(interval) and len(interval) == 1:
+                    interval = interval * len(d[_CM_NAME])
+            #
+            # cater for a shared comment over multiple axes
+            #
+            if len(comment):
+                if len(d[_CM_NAME]) != len(comment) and len(comment) == 1:
+                    comment = comment * len(d[_CM_NAME])
+            d[_CM_INTERVAL] = tuple(interval)
+            d[_CM_COMMENT] = tuple(comment)
+            cell_method = iris.coords.CellMethod(
+                d[_CM_METHOD],
+                coords=d[_CM_NAME],
+                intervals=d[_CM_INTERVAL],
+                comments=d[_CM_COMMENT],
+            )
+            cell_methods.append(cell_method)
+    return tuple(cell_methods)
+
+
 ################################################################################
 def build_cube_metadata(engine):
     """Add the standard meta data to the cube."""

lib/iris/fileformats/netcdf/__init__.py

Lines changed: 5 additions & 2 deletions

@@ -18,15 +18,18 @@
 # Note: *must* be done before importing from submodules, as they also use this !
 logger = iris.config.get_logger(__name__)

+# Note: these probably shouldn't be public, but for now they are.
+from .._nc_load_rules.helpers import (
+    UnknownCellMethodWarning,
+    parse_cell_methods,
+)
 from .loader import DEBUG, NetCDFDataProxy, load_cubes
 from .saver import (
     CF_CONVENTIONS_VERSION,
     MESH_ELEMENTS,
     SPATIO_TEMPORAL_AXES,
     CFNameCoordMap,
     Saver,
-    UnknownCellMethodWarning,
-    parse_cell_methods,
     save,
 )
