Simplified Dataset API #13

Closed
shoyer opened this issue Feb 15, 2014 · 1 comment

shoyer commented Feb 15, 2014

Note: this issue is mostly a TODO list for myself, but input would also be welcome.

In my latest PR #12, Dataset is a bit of a mess. It has redundant properties (variables, dimensions, indices), which make it unnecessarily complex to work with. Furthermore, the "datastore" abstraction is leaky, because not all datastores support all the operations (e.g., deleting or renaming variables).

I would like to simplify Dataset by restricting its contents to (ordered?) dictionaries of variables and attributes:

  • The dimensions attribute still exists, but is read-only (in the public API) and always equivalent to the map from all variable dimensions to their shapes. It will no longer be possible to create a dimension without a variable.
    • Thus, if a variable is created with a dimension that is not already a dataset coordinate, a new variable for the coordinate will be created from scratch (defaulting to np.arange(data.shape[axis])).
  • All "coordinate" variables (i.e., 1-dimensional with name equal to their dimension) are cast to pandas.Index objects when stored in a dataset.
    • Note: Pandas index objects are numpy.ndarray subclasses (or at least API compatible) but are immutable.
    • When time axes are converted, their "units" attribute will be removed (since it is no longer descriptive).
  • The public interface to variables will be read only. To modify a dataset's variables, use item syntax on the dataset (i.e., dataset['foo'] = foo or del dataset['foo']), which will validate and perform appropriate conversions:
    • Dimensions are checked for conflicts.
    • Coordinate variables are converted into pandas.Index objects (as noted above).
    • If a DatasetArray is assigned, its contents are merged into the dataset, with an exception raised if the merge cannot be done safely. (To assign only the Array object, use the appropriate property of the dataset array [1]).
    • If the item assigned is not an Array or DatasetArray, it can be an iterable of the form (dimensions, data) or (dimensions, data, attributes), which is unpacked as the arguments to create a new Array.
    • Thus, we can get rid of set_dimension, set_variable, create_variable, etc.
  • We can expose attributes (unprotected) under an attributes or metadata property.
  • Loading and saving datasets in different file formats will have to create a new file from scratch [2]. But it's rare to want to actually mutate a netCDF file in place, and if you do, there are existing tools to do that. I think it's really hard to get the store abstraction right, and we should probably leave that to specialized libraries.
  • There is a case to be made that variable and attribute names should be required to be valid identifiers for typical file formats (minimally, strings). This could be implemented in a custom OrderedDict subclass. Or the user could simply be expected to behave responsibly, and exceptions will be raised when trying to save something with invalid identifiers (depending on the file format). I am certainly opposed to extensive validation, which would slow things down (e.g., confirming that values are safe for netCDF3).
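Under these rules, the item-assignment path might look roughly like the following. This is only an illustrative sketch of the proposed semantics, not the actual implementation; SketchDataset and _as_variable are hypothetical names:

```python
import numpy as np
import pandas as pd

def _as_variable(name, value):
    """Unpack a (dimensions, data) or (dimensions, data, attributes)
    tuple into a (dims, data, attrs) triple, as the proposal describes."""
    if len(value) == 2:
        dims, data = value
        attrs = {}
    else:
        dims, data, attrs = value
    dims = (dims,) if isinstance(dims, str) else tuple(dims)
    data = np.asarray(data)
    # coordinate variables (1-d, name equal to their dimension)
    # are cast to immutable pandas.Index objects
    if len(dims) == 1 and dims[0] == name:
        data = pd.Index(data)
    return dims, data, attrs

class SketchDataset:
    """Hypothetical stand-in for the proposed Dataset: it stores only
    variables, and `dimensions` is derived rather than stored."""
    def __init__(self):
        self._variables = {}

    @property
    def dimensions(self):
        # read-only mapping from dimension names to sizes, always
        # equivalent to the union of all variable dimensions
        dims = {}
        for name, (var_dims, data, _) in self._variables.items():
            for d, size in zip(var_dims, data.shape):
                if d in dims and dims[d] != size:
                    raise ValueError(f'conflicting sizes for dimension {d!r}')
                dims[d] = size
        return dims

    def __setitem__(self, name, value):
        dims, data, attrs = _as_variable(name, value)
        # dimensions are checked for conflicts before storing
        existing = self.dimensions
        for d, size in zip(dims, data.shape):
            if d in existing and existing[d] != size:
                raise ValueError(f'conflicting sizes for dimension {d!r}')
        self._variables[name] = (dims, data, attrs)
        # create default coordinates (np.arange) for any new dimension
        for axis, d in enumerate(dims):
            if d not in self._variables:
                default = pd.Index(np.arange(data.shape[axis]))
                self._variables[d] = ((d,), default, {})

    def __delitem__(self, name):
        del self._variables[name]
```

With this sketch, assigning `('t', 'x')` data of shape (2, 3) to an empty dataset automatically creates 't' and 'x' coordinates, and assigning a 1-d variable named after its dimension yields a pandas.Index.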

Creating a new dataset should be as simple as:

import numpy as np
import pandas as pd
import xray

variables = {'y': ('y', ['a', 'b', 'c']),
             't': ('t', pd.date_range('2000-01-01', periods=5)),
             'foo': (('t', 'x', 'y'), np.random.randn(5, 3, 10))}
attributes = {'title': 'nonsense'}

# from scratch
dataset = xray.Dataset(variables, attributes)

# or equivalently:
dataset = xray.Dataset()
for k, v in variables.items():
    dataset[k] = v
dataset.attributes = attributes
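The name-validation option mentioned earlier (a custom OrderedDict subclass) could be prototyped roughly as follows. IdentifierDict is a hypothetical name, and this sketch only checks that keys are identifier-like strings, deliberately avoiding format-specific validation:

```python
from collections import OrderedDict

class IdentifierDict(OrderedDict):
    """Hypothetical sketch: an ordered mapping that rejects keys which
    are not identifier-like strings at insertion time, leaving any
    format-specific checks (e.g. netCDF3 safety) to save time."""
    def __setitem__(self, key, value):
        if not (isinstance(key, str) and key.isidentifier()):
            raise KeyError(f'{key!r} is not a valid variable name')
        super().__setitem__(key, value)
```

The alternative, per the bullet above, is to skip this entirely and let the file-format backends raise on invalid identifiers at save time.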

[1] This property should probably be renamed from variable to array. Also, perhaps we should add values as an alias of data (to mirror pandas).
[2] Array will need to be updated so it always copies on write if the underlying data is not stored as a numpy ndarray.
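The copy-on-write behavior described in footnote [2] could work roughly like this sketch; CopyOnWriteArray is a hypothetical name, standing in for data wrapped by Array that may be backed by a file store rather than an in-memory ndarray:

```python
import numpy as np

class CopyOnWriteArray:
    """Hypothetical sketch of footnote [2]: wrap data that is not yet an
    in-memory ndarray and copy it into one on the first write, so the
    underlying store is never mutated."""
    def __init__(self, data):
        self._data = data
        # an existing ndarray is already in memory; write through directly
        self._is_copy = isinstance(data, np.ndarray)

    def __getitem__(self, key):
        return self._data[key]

    def __setitem__(self, key, value):
        if not self._is_copy:
            # first write: pull the data into a fresh ndarray copy
            self._data = np.array(self._data)
            self._is_copy = True
        self._data[key] = value
```

Writes to non-ndarray-backed data land in a private copy, while plain ndarrays are mutated in place, matching the "copies on write if the underlying data is not stored as a numpy ndarray" rule.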

shoyer added a commit that referenced this issue Feb 16, 2014
Implements most of GitHub issue #13.
shoyer commented Feb 23, 2014

This was implemented in the referenced commit (now merged into master).

@shoyer shoyer closed this as completed Feb 23, 2014
jhamman pushed a commit that referenced this issue Oct 15, 2024
skip datatree zarr tests w/ zarr 3 for now