Simplified Dataset API #13

Closed
shoyer opened this issue Feb 15, 2014 · 1 comment

shoyer commented Feb 15, 2014

Note: this issue is mostly a TODO list for myself, but input would also be welcome.

In my latest PR #12, Dataset is a bit of a mess. It has redundant properties (variables, dimensions, indices), which make it unnecessarily complex to work with. Furthermore, the "datastore" abstraction is leaky, because not all datastores support all the operations (e.g., deleting or renaming variables).

I would like to simplify Dataset by restricting its contents to (ordered?) dictionaries of variables and attributes:

  • The dimensions attribute still exists, but is read-only (in the public API) and always equivalent to the map from all variable dimensions to their shapes. It will no longer be possible to create a dimension without a variable.
    • Thus, if a variable is created with a dimension that is not already a dataset coordinate, a new variable for the coordinate will be created from scratch (defaulting to np.arange(data.shape[axis])).
  • All "coordinate" variables (i.e., 1-dimensional with name equal to their dimension) are cast to pandas.Index objects when stored in a dataset.
    • Note: Pandas index objects are numpy.ndarray subclasses (or at least API compatible) but are immutable.
    • When time axes are converted, their "units" attribute will be removed (since it is no longer descriptive).
  • The public interface to variables will be read only. To modify a dataset's variables, use item syntax on the dataset (i.e., dataset['foo'] = foo or del dataset['foo']), which will validate and perform appropriate conversions:
    • Dimensions are checked for conflicts.
    • Coordinate variables are converted into pandas.Index objects (as noted above).
    • If a DatasetArray is assigned, its contents are merged into the dataset, with an exception raised if the merge cannot be done safely. (To assign only the Array object, use the appropriate property of the dataset array [1]).
    • If the item assigned is not an Array or DatasetArray, it can be an iterable of the form (dimensions, data) or (dimensions, data, attributes), which is unpacked as the arguments to create a new Array.
    • Thus, we can get rid of set_dimension, set_variable, create_variable, etc.
  • We can expose attributes (unprotected) under an attributes or metadata property.
  • Loading and saving datasets in different file formats will have to create a new file from scratch [2]. But it's rare to want to actually mutate a netCDF file in place, and if you do, there are existing tools to do that. I think it's really hard to get the store abstraction right, and we should probably leave that to specialized libraries.
  • There is a case to be made that variable and attribute names should be required to be valid identifiers for typical file formats (minimally, strings). This could be implemented in a custom OrderedDict subclass. Or the user could simply be expected to behave responsibly, and exceptions will be raised when trying to save something with invalid identifiers (depending on the file format). I am certainly opposed to extensive validation, which would slow things down (e.g., confirming that values are safe for netCDF3).
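Under these rules, the item-assignment path might look roughly like the following. This is only an illustrative sketch of the proposed semantics, not the actual implementation; SketchDataset and _as_variable are hypothetical names:

```python
import numpy as np
import pandas as pd

def _as_variable(name, value):
    """Unpack a (dimensions, data) or (dimensions, data, attributes)
    tuple into a (dims, data, attrs) triple, as the proposal describes."""
    if len(value) == 2:
        dims, data = value
        attrs = {}
    else:
        dims, data, attrs = value
    dims = (dims,) if isinstance(dims, str) else tuple(dims)
    data = np.asarray(data)
    # coordinate variables (1-d, name equal to their dimension)
    # are cast to immutable pandas.Index objects
    if len(dims) == 1 and dims[0] == name:
        data = pd.Index(data)
    return dims, data, attrs

class SketchDataset:
    """Hypothetical stand-in for the proposed Dataset: it stores only
    variables, and `dimensions` is derived rather than stored."""
    def __init__(self):
        self._variables = {}

    @property
    def dimensions(self):
        # read-only mapping from dimension names to sizes, always
        # equivalent to the union of all variable dimensions
        dims = {}
        for name, (var_dims, data, _) in self._variables.items():
            for d, size in zip(var_dims, data.shape):
                if d in dims and dims[d] != size:
                    raise ValueError(f'conflicting sizes for dimension {d!r}')
                dims[d] = size
        return dims

    def __setitem__(self, name, value):
        dims, data, attrs = _as_variable(name, value)
        # dimensions are checked for conflicts before storing
        existing = self.dimensions
        for d, size in zip(dims, data.shape):
            if d in existing and existing[d] != size:
                raise ValueError(f'conflicting sizes for dimension {d!r}')
        self._variables[name] = (dims, data, attrs)
        # create default coordinates (np.arange) for any new dimension
        for axis, d in enumerate(dims):
            if d not in self._variables:
                default = pd.Index(np.arange(data.shape[axis]))
                self._variables[d] = ((d,), default, {})

    def __delitem__(self, name):
        del self._variables[name]
```

With this sketch, assigning `('t', 'x')` data of shape (2, 3) to an empty dataset automatically creates 't' and 'x' coordinates, and assigning a 1-d variable named after its dimension yields a pandas.Index.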

Creating a new dataset should be as simple as:

import numpy as np
import pandas as pd
import xray

variables = {'y': ('y', ['a', 'b', 'c']),
             't': ('t', pd.date_range('2000-01-01', periods=5)),
             'foo': (('t', 'x', 'y'), np.random.randn(5, 3, 10))}
attributes = {'title': 'nonsense'}

# from scratch
dataset = xray.Dataset(variables, attributes)

# or equivalently:
dataset = xray.Dataset()
for k, v in variables.items():
    dataset[k] = v
dataset.attributes = attributes
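The name-validation option mentioned earlier (a custom OrderedDict subclass) could be prototyped roughly as follows. IdentifierDict is a hypothetical name, and this sketch only checks that keys are identifier-like strings, deliberately avoiding format-specific validation:

```python
from collections import OrderedDict

class IdentifierDict(OrderedDict):
    """Hypothetical sketch: an ordered mapping that rejects keys which
    are not identifier-like strings at insertion time, leaving any
    format-specific checks (e.g. netCDF3 safety) to save time."""
    def __setitem__(self, key, value):
        if not (isinstance(key, str) and key.isidentifier()):
            raise KeyError(f'{key!r} is not a valid variable name')
        super().__setitem__(key, value)
```

The alternative, per the bullet above, is to skip this entirely and let the file-format backends raise on invalid identifiers at save time.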

[1] This property should probably be renamed from variable to array. Also, perhaps we should add values as an alias of data (to mirror pandas).
[2] Array will need to be updated so it always copies on write if the underlying data is not stored as a numpy ndarray.
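The copy-on-write behavior described in footnote [2] could work roughly like this sketch; CopyOnWriteArray is a hypothetical name, standing in for data wrapped by Array that may be backed by a file store rather than an in-memory ndarray:

```python
import numpy as np

class CopyOnWriteArray:
    """Hypothetical sketch of footnote [2]: wrap data that is not yet an
    in-memory ndarray and copy it into one on the first write, so the
    underlying store is never mutated."""
    def __init__(self, data):
        self._data = data
        # an existing ndarray is already in memory; write through directly
        self._is_copy = isinstance(data, np.ndarray)

    def __getitem__(self, key):
        return self._data[key]

    def __setitem__(self, key, value):
        if not self._is_copy:
            # first write: pull the data into a fresh ndarray copy
            self._data = np.array(self._data)
            self._is_copy = True
        self._data[key] = value
```

Writes to non-ndarray-backed data land in a private copy, while plain ndarrays are mutated in place, matching the "copies on write if the underlying data is not stored as a numpy ndarray" rule.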

shoyer added a commit that referenced this issue Feb 16, 2014
Implements most of GitHub issue #13.
shoyer commented Feb 23, 2014

This was implemented in the referenced commit (now merged into master).

@shoyer shoyer closed this as completed Feb 23, 2014
jhamman pushed a commit that referenced this issue Oct 15, 2024
skip datatree zarr tests w/ zarr 3 for now