Description
Note: this issue is mostly a TODO list for myself, but input would also be welcome.
In my latest PR #12, `Dataset` is a bit of a mess. It has redundant properties (`variables`, `dimensions`, `indices`), which make it unnecessarily complex to work with. Furthermore, the "datastore" abstraction is leaky, because not all datastores support all the operations (e.g., deleting or renaming variables).
I would like to simplify `Dataset` by restricting its contents to (ordered?) dictionaries of variables and attributes:
- The `dimensions` attribute still exists, but is read-only (in the public API) and always equivalent to the map from all variable dimensions to their sizes. It will no longer be possible to create a dimension without a variable.
  - Thus, if a variable is created with a dimension that is not already a dataset coordinate, a new variable for the coordinate will be created from scratch (defaulting to `np.arange(data.shape[axis])`; see the sketches after this list).
- All "coordinate" variables (i.e., 1-dimensional with name equal to their dimension) are cast to
pandas.Index
objects when stored in a dataset.- Note: Pandas index objects are
numpy.ndarray
subclasses (or at least API compatible) but are immutable. - When time axes are converted, their "units" attribute will be removed (since it is no longer descriptive).
- Note: Pandas index objects are
- The public interface to `variables` will be read-only. To modify a dataset's variables, use item syntax on the dataset (i.e., `dataset['foo'] = foo` or `del dataset['foo']`), which will validate and perform the appropriate conversions (see the `SketchDataset` example after this list):
  - Dimensions are checked for conflicts.
  - Coordinate variables are converted into `pandas.Index` objects (as noted above).
  - If a `DatasetArray` is assigned, its contents are merged into the dataset, with an exception raised if the merge cannot be done safely. (To assign only the `Array` object, use the appropriate property of the dataset array [1].)
  - If the item assigned is not an `Array` or `DatasetArray`, it can be an iterable of the form `(dimensions, data)` or `(dimensions, data, attributes)`, which is unpacked as the arguments to create a new `Array`.
  - Thus, we can get rid of `set_dimension`, `set_variable`, `create_variable`, etc.
- We can expose attributes (unprotected) under the attribute `attributes` or `metadata`.
- Loading and saving datasets in different file formats will have to create a new file from scratch [2]. But it's rare to want to actually mutate a netCDF file in place, and if you do, there are existing tools for that. I think it's really hard to get the store abstraction right, and we should probably leave that to specialized libraries.
- There is a case to be made that variable and attribute names should be required to be valid identifiers for typical file formats (minimally, strings). This could be implemented in a custom OrderedDict subclass (see the `VariablesDict` sketch below). Or the user could simply be expected to behave responsibly, with exceptions raised when trying to save something with invalid identifiers (depending on the file format). I am certainly opposed to extensive validation, which would slow things down (e.g., confirming that values are safe for netCDF3).
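To illustrate the coordinate handling above, here is what casting to `pandas.Index` implies in practice (plain numpy/pandas, no xray required; the `default_t` name is just for illustration). The default-coordinate behavior is simply `np.arange` over the axis length:

```python
import numpy as np
import pandas as pd

# a coordinate variable is 1-d and shares its name with its dimension;
# storing it as a pandas.Index makes lookups fast but the data immutable
y = pd.Index(['a', 'b', 'c'])
print(y[0], y.shape, len(y))  # mostly ndarray-compatible access

try:
    y[0] = 'z'
except TypeError as e:
    print(e)  # pandas refuses in-place modification of an Index

# a dimension created without an explicit coordinate would default to
# np.arange over that axis's length
data = np.random.randn(5, 3)
default_t = pd.Index(np.arange(data.shape[0]))
```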
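And here is a rough sketch of the item-syntax semantics described above. `SketchDataset` is a hypothetical stand-in, not the proposed implementation: it skips `Array`/`DatasetArray` handling and merging, and only shows the tuple unpacking, conflict checking, and coordinate conversion:

```python
import numpy as np
import pandas as pd

class SketchDataset(object):
    # hypothetical stand-in for illustration; not the real Dataset
    def __init__(self):
        self.variables = {}   # would be read-only in the public API
        self.dimensions = {}  # derived from variables; also read-only

    def __setitem__(self, name, value):
        # unpack (dimensions, data) or (dimensions, data, attributes)
        dims, data = value[0], np.asarray(value[1])
        attributes = value[2] if len(value) > 2 else {}
        dims = (dims,) if isinstance(dims, str) else tuple(dims)
        # dimensions are checked for conflicts with existing variables
        for dim, size in zip(dims, data.shape):
            if self.dimensions.setdefault(dim, size) != size:
                raise ValueError('conflicting sizes for dimension %r' % dim)
            # dimensions without a coordinate get a default np.arange one
            if dim not in self.variables:
                self.variables[dim] = ((dim,), pd.Index(np.arange(size)), {})
        # coordinate variables are cast to pandas.Index
        if dims == (name,):
            data = pd.Index(data)
        self.variables[name] = (dims, data, attributes)

    def __delitem__(self, name):
        del self.variables[name]

ds = SketchDataset()
ds['t'] = ('t', pd.date_range('2000-01-01', periods=5))
ds['foo'] = (('t', 'x'), np.random.randn(5, 10))
print(sorted(ds.variables))  # ['foo', 't', 'x'] -- 'x' was auto-created
del ds['foo']
```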
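Finally, if we did want minimal name validation, the OrderedDict subclass mentioned in the last bullet might look something like this (a sketch only; `VariablesDict` is a hypothetical name):

```python
from collections import OrderedDict

class VariablesDict(OrderedDict):
    # hypothetical sketch: require string keys, but leave format-specific
    # identifier rules (e.g., netCDF3 name restrictions) to the writers
    def __setitem__(self, key, value):
        if not isinstance(key, str):
            raise TypeError('variable names must be strings, got %r' % (key,))
        OrderedDict.__setitem__(self, key, value)

d = VariablesDict()
d['foo'] = 42   # fine
# d[0] = 42     # would raise TypeError
```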
Creating a new dataset should be as simple as:
```python
import numpy as np
import pandas as pd
import xray

variables = {'y': ('y', ['a', 'b', 'c']),
             't': ('t', pd.date_range('2000-01-01', periods=5)),
             'foo': (('t', 'x', 'y'), np.random.randn(5, 10, 3))}
attributes = {'title': 'nonsense'}

# from scratch
dataset = xray.Dataset(variables, attributes)

# or equivalently:
dataset = xray.Dataset()
for k, v in variables.items():
    dataset[k] = v
dataset.attributes = attributes
```
[1] This property should probably be renamed from `variable` to `array`. Also, perhaps we should add `values` as an alias of `data` (to mirror pandas); a sketch follows.
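For instance, if `values` were a simple alias (sketch only; whether it should ever differ from `data` is an open question):

```python
class Array(object):
    def __init__(self, data):
        self.data = data

    @property
    def values(self):
        # proposed alias of `data`, mirroring pandas
        return self.data
```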
[2] `Array` will need to be updated so it always copies on write if the underlying data is not stored as a numpy ndarray; a rough sketch of that check follows.
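A minimal sketch of that copy-on-write check (the `_writable_data` helper name is hypothetical):

```python
import numpy as np

class Array(object):
    def __init__(self, data):
        self._data = data  # may be a lazy, store-backed array

    def _writable_data(self):
        # hypothetical helper: before any in-place modification, load
        # non-ndarray (e.g., store-backed) data into a fresh numpy array
        # so the original source is never mutated
        if not isinstance(self._data, np.ndarray):
            self._data = np.array(self._data)
        return self._data
```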