Note: this issue is mostly a TODO list for myself, but input would also be welcome.
In my latest PR #12, Dataset is a bit of a mess. It has redundant properties (variables, dimensions, indices), which make it unnecessarily complex to work with. Furthermore, the "datastore" abstraction is leaky, because not all datastores support all the operations (e.g., deleting or renaming variables).
I would like to simplify Dataset by restricting its contents to (ordered?) dictionaries of variables and attributes:
The dimensions attribute still exists, but is read-only (in the public API) and always equivalent to the map from all variable dimensions to their shapes. It will no longer be possible to create a dimension without a variable.
Thus, if a variable is created with a dimension that is not already a dataset coordinate, a new variable for the coordinate will be created from scratch (defaulting to np.arange(data.shape[axis])).
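A minimal sketch of that default-coordinate behavior (the function name and signature here are illustrative, not the actual implementation):

```python
import numpy as np

def default_coordinates(dimensions, data, existing_coords):
    """Return a default coordinate array for each dimension lacking one."""
    new_coords = {}
    for axis, dim in enumerate(dimensions):
        if dim not in existing_coords:
            # default coordinate: 0, 1, ..., size - 1 along that axis
            new_coords[dim] = np.arange(data.shape[axis])
    return new_coords

# 'x' already has a coordinate, so only 'time' gets a default one
coords = default_coordinates(('time', 'x'), np.zeros((3, 4)),
                             {'x': np.arange(4)})
```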
All "coordinate" variables (i.e., 1-dimensional with name equal to their dimension) are cast to pandas.Index objects when stored in a dataset.
Note: Pandas index objects are numpy.ndarray subclasses (or at least API compatible) but are immutable.
When time axes are converted, their "units" attribute will be removed (since it is no longer descriptive).
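The relevant pandas behavior, for reference (this is just the public pandas API, not xray code): an Index supports ndarray-style indexing but rejects in-place modification.

```python
import numpy as np
import pandas as pd

coord = pd.Index(np.array([10, 20, 30]))
print(coord[1])        # ndarray-style indexing works

try:
    coord[0] = 0       # in-place modification is disallowed
    mutable = True
except TypeError:
    mutable = False
print('mutable:', mutable)
```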
The public interface to variables will be read-only. To modify a dataset's variables, use item syntax on the dataset (i.e., dataset['foo'] = foo or del dataset['foo']), which will validate and perform appropriate conversions:
Dimensions are checked for conflicts.
Coordinate variables are converted into pandas.Index objects (as noted above).
If a DatasetArray is assigned, its contents are merged into the dataset, with an exception raised if the merge cannot be done safely. (To assign only the Array object, use the appropriate property of the dataset array [1].)
If the item assigned is not an Array or DatasetArray, it can be an iterable of the form (dimensions, data) or (dimensions, data, attributes), which is unpacked as the arguments to create a new Array.
Thus, we can get rid of set_dimension, set_variable, create_variable, etc.
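The assignment rules above could look roughly like this; a minimal sketch under assumed internals (variables stored as (dimensions, data, attributes) tuples), not the actual implementation:

```python
import numpy as np
import pandas as pd

class SketchDataset:
    def __init__(self):
        self.variables = {}  # name -> (dimensions, data, attributes)

    @property
    def dimensions(self):
        # read-only view, derived entirely from the variables
        dims = {}
        for dimensions, data, _ in self.variables.values():
            for dim, size in zip(dimensions, data.shape):
                dims[dim] = size
        return dims

    def __setitem__(self, name, value):
        # unpack (dimensions, data) or (dimensions, data, attributes)
        dimensions, data, attributes = (tuple(value) + ({},))[:3]
        data = np.asarray(data)
        # 1. check dimensions for conflicts
        for dim, size in zip(dimensions, data.shape):
            if self.dimensions.get(dim, size) != size:
                raise ValueError('dimension %r conflicts' % dim)
        # 2. convert coordinate variables to pandas.Index
        if dimensions == (name,):
            data = pd.Index(data)
        self.variables[name] = (dimensions, data, attributes)

    def __delitem__(self, name):
        del self.variables[name]
```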
We can expose the attributes dictionary (unprotected) under the name attributes or metadata.
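Creating a new dataset should then be as simple as building those two dictionaries; a sketch, assuming a hypothetical Dataset(variables, attributes) constructor (the exact signature is not settled here):

```python
import numpy as np

# variables: name -> (dimensions, data[, attributes])
variables = {
    'time': (('time',), np.arange(10)),
    'temperature': (('time', 'x'), np.zeros((10, 5)), {'units': 'K'}),
}
attributes = {'title': 'example'}

# ds = Dataset(variables, attributes)  # hypothetical constructor
```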
Loading and saving datasets in different file formats will have to create a new file from scratch [2]. But it's rare to want to actually mutate a netCDF file in place, and if you do, there are existing tools to do that. I think it's really hard to get the store abstraction right, and we should probably leave that to specialized libraries.
There is a case to be made that variable and attribute names should be required to be valid identifiers for typical file formats (minimally, strings). This could be implemented in a custom OrderedDict subclass. Or the user could simply be expected to behave responsibly, and exceptions will be raised when trying to save something with invalid identifiers (depending on the file format). I am certainly opposed to extensive validation, which would slow things down (e.g., confirming that values are safe for netCDF3).
[1] This property should probably be renamed from variable to array. Also, perhaps we should add values as an alias of data (to mirror pandas).
[2] Array will need to be updated so it always copies on write if the underlying data is not stored as a numpy ndarray.
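A sketch of that copy-on-write rule (class and attribute names are hypothetical): materialize a private numpy copy before the first in-place write if the underlying data is anything else, e.g. a lazily loaded file variable.

```python
import numpy as np

class CopyOnWriteArray:
    def __init__(self, data):
        self._data = data  # any array-like, possibly lazy

    def __setitem__(self, key, value):
        if not isinstance(self._data, np.ndarray):
            # copy into a private numpy array before writing
            self._data = np.array(self._data)
        self._data[key] = value

    def __getitem__(self, key):
        return self._data[key]
```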