xarray.backends refactor (#2261)
* WIP: xarray.backends.file_manager for managing file objects. This is intended to replace both PickleByReconstructionWrapper and DataStorePickleMixin with something more compartmentalized. xref GH2121
* Switch rasterio to use FileManager
* lint fixes
* WIP: rewrite FileManager to always use an LRUCache
* Test coverage
* Don't use move_to_end
* minor clarification
* Switch FileManager.acquire() to a method
* Python 2 compat
* Update xarray.set_options() to add file_cache_maxsize and validation
* Add assert for FILE_CACHE.maxsize
* More docstring for FileManager
* Add accidentally omitted tests for LRUCache
* Adapt scipy backend to use FileManager
* Stickler fix
* Fix failure on Python 2.7
* Finish adjusting backends to use FileManager
* Fix bad import
* WIP on distributed
* More WIP
* Fix distributed write tests
* Fixes
* Minor fixup
* whats new
* More refactoring: remove state from backends entirely
* Cleanup
* Fix failing in-memory datastore tests
* Fix inaccessible datastore
* fix autoclose warnings
* Fix PyNIO failures
* No longer disable HDF5 file locking. We no longer need to explicitly set HDF5_USE_FILE_LOCKING='FALSE' because we properly close open files.
* whats new and default file cache size
* Whats new tweak
* Refactor default lock logic to backend classes
* Rename get_resource_lock -> get_write_lock
* Don't acquire unnecessary locks in __getitem__
* Fix bad merge
* Fix import
* Remove unreachable code
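For orientation, the mechanism this refactor introduces is a file manager backed by a single global least-recently-used cache of open file handles: files are re-opened on demand, and the least recently used handle is closed once the cache limit (configurable via xarray.set_options(file_cache_maxsize=...)) is reached. The sketch below only illustrates the idea and is not xarray's actual implementation; LRUCache, FILE_CACHE, and SketchFileManager are illustrative names.

import collections
import threading


class LRUCache:
    """Least-recently-used cache that calls on_evict on evicted values."""

    def __init__(self, maxsize, on_evict=None):
        self._cache = collections.OrderedDict()
        self._maxsize = maxsize
        self._on_evict = on_evict
        self._lock = threading.RLock()

    def __getitem__(self, key):
        with self._lock:
            value = self._cache[key]
            self._cache.move_to_end(key)  # mark key as most recently used
            return value

    def __setitem__(self, key, value):
        with self._lock:
            self._cache[key] = value
            self._cache.move_to_end(key)
            while len(self._cache) > self._maxsize:
                _, evicted = self._cache.popitem(last=False)
                if self._on_evict is not None:
                    self._on_evict(evicted)


# One global cache of open files, shared across all backend instances.
FILE_CACHE = LRUCache(maxsize=128, on_evict=lambda f: f.close())


class SketchFileManager:
    """Hypothetical manager: re-opens its file whenever it has been evicted."""

    def __init__(self, opener, *args, **kwargs):
        self._opener = opener  # e.g. netCDF4.Dataset, h5py.File, or open
        self._args = args
        self._kwargs = kwargs
        self._key = (opener, args, tuple(sorted(kwargs.items())))

    def acquire(self):
        """Return an open file object, re-opening it if necessary."""
        try:
            return FILE_CACHE[self._key]
        except KeyError:
            file = self._opener(*self._args, **self._kwargs)
            FILE_CACHE[self._key] = file
            return file

Because such a manager only needs to store the recipe for re-opening a file (the opener plus its arguments) rather than the open handle itself, it can also be pickled and sent to other processes, which is what simplifies the Dask Distributed write path compared to the old PickleByReconstructionWrapper/DataStorePickleMixin approach.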
1 parent 5b4d160 commit 289b377

28 files changed: +1496 -983 lines changed

asv_bench/asv.conf.json (+1)

@@ -64,6 +64,7 @@
         "scipy": [""],
         "bottleneck": ["", null],
         "dask": [""],
+        "distributed": [""],
     },
asv_bench/benchmarks/dataset_io.py (+41)

@@ -1,5 +1,7 @@
 from __future__ import absolute_import, division, print_function
 
+import os
+
 import numpy as np
 import pandas as pd
 
@@ -14,6 +16,9 @@
     pass
 
 
+os.environ['HDF5_USE_FILE_LOCKING'] = 'FALSE'
+
+
 class IOSingleNetCDF(object):
     """
     A few examples that benchmark reading/writing a single netCDF file with
@@ -405,3 +410,39 @@ def time_open_dataset_scipy_with_time_chunks(self):
         with dask.set_options(get=dask.multiprocessing.get):
             xr.open_mfdataset(self.filenames_list, engine='scipy',
                               chunks=self.time_chunks)
+
+
+def create_delayed_write():
+    import dask.array as da
+    vals = da.random.random(300, chunks=(1,))
+    ds = xr.Dataset({'vals': (['a'], vals)})
+    return ds.to_netcdf('file.nc', engine='netcdf4', compute=False)
+
+
+class IOWriteNetCDFDask(object):
+    timeout = 60
+    repeat = 1
+    number = 5
+
+    def setup(self):
+        requires_dask()
+        self.write = create_delayed_write()
+
+    def time_write(self):
+        self.write.compute()
+
+
+class IOWriteNetCDFDaskDistributed(object):
+    def setup(self):
+        try:
+            import distributed
+        except ImportError:
+            raise NotImplementedError
+        self.client = distributed.Client()
+        self.write = create_delayed_write()
+
+    def cleanup(self):
+        self.client.shutdown()
+
+    def time_write(self):
+        self.write.compute()
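For context, these new benchmarks exercise xarray's delayed-write path: passing compute=False to to_netcdf returns a Dask delayed object, and the file is only written when that object is computed, either locally or on a distributed cluster. A minimal standalone example of that pattern follows; the file name 'example.nc' and the array/chunk sizes are arbitrary.

import dask.array as da
import xarray as xr

# Dataset backed by a chunked dask array, so the write can be parallelized.
ds = xr.Dataset({'vals': (['a'], da.random.random(300, chunks=(30,)))})

# compute=False defers the write and returns a dask delayed object.
delayed_write = ds.to_netcdf('example.nc', engine='netcdf4', compute=False)

# Nothing has been written yet; trigger the actual write here (this uses a
# distributed scheduler if a distributed.Client is active).
delayed_write.compute()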

doc/api.rst (+3)

@@ -624,3 +624,6 @@ arguments for the ``from_store`` and ``dump_to_store`` Dataset methods:
    backends.H5NetCDFStore
    backends.PydapDataStore
    backends.ScipyDataStore
+   backends.FileManager
+   backends.CachingFileManager
+   backends.DummyFileManager

doc/whats-new.rst (+16 -3)

@@ -33,14 +33,27 @@ v0.11.0 (unreleased)
 Breaking changes
 ~~~~~~~~~~~~~~~~
 
+- Xarray's storage backends now automatically open and close files when
+  necessary, rather than requiring opening a file with ``autoclose=True``. A
+  global least-recently-used cache is used to store open files; the default
+  limit of 128 open files should suffice in most cases, but can be adjusted if
+  necessary with
+  ``xarray.set_options(file_cache_maxsize=...)``. The ``autoclose`` argument
+  to ``open_dataset`` and related functions has been deprecated and is now a
+  no-op.
+
+  This change, along with an internal refactor of xarray's storage backends,
+  should significantly improve performance when reading and writing
+  netCDF files with Dask, especially when working with many files or using
+  Dask Distributed. By `Stephan Hoyer <https://github.com/shoyer>`_
+
+Documentation
+~~~~~~~~~~~~~
 - Reduction of :py:meth:`DataArray.groupby` and :py:meth:`DataArray.resample`
   without dimension argument will change in the next release.
   Now we warn a FutureWarning.
   By `Keisuke Fujii <https://github.com/fujiisoup>`_.
 
-Documentation
-~~~~~~~~~~~~~
-
 Enhancements
 ~~~~~~~~~~~~

xarray/backends/__init__.py (+4)

@@ -4,6 +4,7 @@
 formats. They should not be used directly, but rather through Dataset objects.
 """
 from .common import AbstractDataStore
+from .file_manager import FileManager, CachingFileManager, DummyFileManager
 from .memory import InMemoryDataStore
 from .netCDF4_ import NetCDF4DataStore
 from .pydap_ import PydapDataStore
@@ -15,6 +16,9 @@
 
 __all__ = [
     'AbstractDataStore',
+    'FileManager',
+    'CachingFileManager',
+    'DummyFileManager',
     'InMemoryDataStore',
     'NetCDF4DataStore',
     'PydapDataStore',
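The three classes exported above form the new public hook for backend authors: a data store holds a file manager instead of an open file and calls acquire() whenever it needs the underlying handle. A rough usage sketch follows, assuming the netCDF4 library is installed and a local 'example.nc' exists; exact constructor signatures may differ in detail.

import netCDF4

from xarray.backends import CachingFileManager

# The manager stores how to open the file, not the open file itself; the
# handle it creates lives in xarray's global LRU cache of open files.
manager = CachingFileManager(netCDF4.Dataset, 'example.nc', mode='r')

nc = manager.acquire()   # opens example.nc, or reuses the cached handle
print(sorted(nc.variables))

manager.close()          # close explicitly when done with the file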
