Simplified Dataset API #13

Closed
@shoyer

Note: this issue is mostly a TODO list for myself, but input would also be welcome.

In my latest PR #12, Dataset is a bit of a mess. It has redundant properties (variables, dimensions, indices), which make it unnecessarily complex to work with. Furthermore, the "datastore" abstraction is leaky, because not all datastores support all the operations (e.g., deleting or renaming variables).

I would like to simplify Dataset by restricting its contents to (ordered?) dictionaries of variables and attributes:

  • The dimensions attribute still exists, but is read-only (in the public API) and always equivalent to the map from all variable dimensions to their shapes. It will no longer be possible to create a dimension without a variable.
    • Thus, if a variable is created with a dimension that is not already a dataset coordinate, a new variable for the coordinate will be created from scratch (defaulting to np.arange(data.shape[axis])).
  • All "coordinate" variables (i.e., 1-dimensional with name equal to their dimension) are cast to pandas.Index objects when stored in a dataset.
    • Note: Pandas index objects are numpy.ndarray subclasses (or at least API compatible) but are immutable.
    • When time axes are converted, their "units" attribute will be removed (since it is no longer descriptive).
  • The public interface to variables will be read-only. To modify a dataset's variables, use item syntax on the dataset (i.e., dataset['foo'] = foo or del dataset['foo']), which will validate and perform appropriate conversions:
    • Dimensions are checked for conflicts.
    • Coordinate variables are converted into pandas.Index objects (as noted above).
    • If a DatasetArray is assigned, its contents are merged into the dataset, with an exception raised if the merge cannot be done safely. (To assign only the Array object, use the appropriate property of the dataset array [1]).
    • If the item assigned is not an Array or DatasetArray, it can be an iterable of the form (dimensions, data) or (dimensions, data, attributes), which is unpacked as the arguments to create a new Array.
    • Thus, we can get rid of set_dimension, set_variable, create_variable, etc.
  • We can expose attributes (unprotected) under the attribute attributes or metadata.
  • Loading and saving datasets in different file formats will have to create a new file from scratch [2]. But it's rare to want to actually mutate a netCDF file in place, and if you do, there are existing tools to do that. I think it's really hard to get the store abstraction right, and we should probably leave that to specialized libraries.
  • There is a case to be made that variable and attribute names should be required to be valid identifiers for typical file formats (minimally, strings). This could be implemented in a custom OrderedDict subclass. Or the user could simply be expected to behave responsibly, and exceptions will be raised when trying to save something with invalid identifiers (depending on the file format). I am certainly opposed to extensive validation, which would slow things down (e.g., confirming that values are safe for netCDF3).
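
The assignment rules in the list above can be sketched in a few lines. This is only an illustration under stated assumptions, not the planned implementation: SketchDataset and its private helper _sizes are hypothetical names, variables are stored as plain (dimensions, data, attributes) tuples instead of Array objects, and DatasetArray merging is left out.

```python
from collections import OrderedDict

import numpy as np
import pandas as pd


class SketchDataset:
    """Hypothetical stand-in for the proposed Dataset (not xray's code)."""

    def __init__(self, variables=None, attributes=None):
        self._variables = OrderedDict()
        self.attributes = OrderedDict(attributes or {})
        for name, value in (variables or {}).items():
            self[name] = value

    def _sizes(self, skip=None):
        # Map every dimension name to its size, optionally ignoring one variable.
        sizes = OrderedDict()
        for name, (dims, data, _) in self._variables.items():
            if name != skip:
                sizes.update(zip(dims, data.shape))
        return sizes

    @property
    def dimensions(self):
        # Read-only: recomputed from the variables on every access.
        return self._sizes()

    def __getitem__(self, name):
        return self._variables[name]

    def __setitem__(self, name, value):
        # Accepts (dimensions, data) or (dimensions, data, attributes).
        dims, data = value[0], np.asarray(value[1])
        attrs = dict(value[2]) if len(value) > 2 else {}
        dims = (dims,) if isinstance(dims, str) else tuple(dims)
        if data.ndim != len(dims):
            raise ValueError('expected one dimension name per axis')
        # Check the new variable against sizes implied by the other variables.
        known = self._sizes(skip=name)
        for dim, size in zip(dims, data.shape):
            if known.get(dim, size) != size:
                raise ValueError('conflicting sizes for dimension %r' % dim)
        if dims == (name,):
            # Coordinate variables are stored as immutable pandas.Index objects.
            data = pd.Index(data)
        self._variables[name] = (dims, data, attrs)
        # Any dimension without a coordinate gets a default np.arange one.
        for dim, size in zip(dims, data.shape):
            if dim not in self._variables:
                self._variables[dim] = ((dim,), pd.Index(np.arange(size)), {})


ds = SketchDataset({'y': ('y', ['a', 'b', 'c'])})
ds['foo'] = (('t', 'x', 'y'), np.random.randn(5, 10, 3))
print(dict(ds.dimensions))
```

Because dimensions is derived rather than stored, there is no way for it to drift out of sync with the variables, which is the main point of the proposal.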

Creating a new dataset should be as simple as:

import numpy as np
import pandas as pd
import xray

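# 'foo' must agree with the coordinates: 5 values along 't' and 3 along 'y'.
# 'x' has no coordinate variable, so one would be created with np.arange(10).
variables = {'y': ('y', ['a', 'b', 'c']),
             't': ('t', pd.date_range('2000-01-01', periods=5)),
             'foo': (('t', 'x', 'y'), np.random.randn(5, 10, 3))}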
attributes = {'title': 'nonsense'}

# from scratch
dataset = xray.Dataset(variables, attributes)

# or equivalently:
dataset = xray.Dataset()
for k, v in variables.items():
    dataset[k] = v
dataset.attributes = attributes
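
As a rough illustration of the time-axis conversion mentioned in the list above (coordinates become pandas.Index objects and lose their "units" attribute), something like the following could work. The helper decode_time_axis is hypothetical, and the netCDF-style units string ('days since 2000-01-01') and the two supported unit names are assumptions made for this sketch.

```python
import pandas as pd

# Assumption: only these two unit names are handled in this sketch.
_UNIT_CODES = {'days': 'D', 'seconds': 's'}


def decode_time_axis(values, attributes):
    """Hypothetical helper: numeric offsets plus 'units' -> DatetimeIndex."""
    attributes = dict(attributes)      # leave the caller's dict untouched
    units = attributes.pop('units')    # dropped: no longer descriptive
    unit_name, origin = units.split(' since ')
    index = pd.to_datetime(list(values), unit=_UNIT_CODES[unit_name],
                           origin=pd.Timestamp(origin))
    return index, attributes


index, attrs = decode_time_axis(
    [0, 1, 2], {'units': 'days since 2000-01-01', 'long_name': 'time'})
print(index)
print(attrs)
```

The resulting index is immutable, so item assignment on it raises a TypeError instead of silently changing a coordinate.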

[1] This property should probably be renamed from variable to array. Also, perhaps we should add values as an alias of data (to mirror pandas).
[2] Array will need to be updated so it always copies on write if the underlying data is not stored as a numpy ndarray.
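
A minimal sketch of the copy-on-write behavior described in [2], assuming a hypothetical CopyOnWriteArray wrapper and using a plain Python list as a stand-in for data that lives in a file-backed store:

```python
import numpy as np


class CopyOnWriteArray:
    """Hypothetical wrapper; the real Array class may look quite different."""

    def __init__(self, data):
        self._data = data  # an ndarray or any other array-like backend

    @property
    def data(self):
        return self._data

    def __getitem__(self, key):
        return np.asarray(self._data)[key]

    def __setitem__(self, key, value):
        if not isinstance(self._data, np.ndarray):
            # First write: copy into memory so the backend is never mutated.
            self._data = np.array(self._data)
        self._data[key] = value


backend = [1.0, 2.0, 3.0]  # stand-in for a variable stored on disk
arr = CopyOnWriteArray(backend)
arr[0] = 99.0
print(backend, arr.data)
```

Data that is already a numpy ndarray is modified in place; only foreign backends trigger the copy.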
