Shaping the future of Backends #8548
Comments
see also #5954 for a previous discussion of the |
Doesn't this already exist as
On the other hand this suggestion seems to be something that could not be immediately handled by the current |
Agree that these "transformers" are called "coders" ATM, linking this quite old proposal! #155
Can these transformers/coders just be new zarr codecs? Exposing xarray's decoding logic in a way that follows that interface would allow for zarr to become a "universal reader" - see zarr-developers/zarr-specs#303.
What is your issue?
Backends in xarray are used to read and write files (or in general objects) and transform them into useful xarray Datasets.
This issue will collect ideas on how to continuously improve them.
Current state
Along the reading and writing process there are many implicit and explicit configuration possibilities. There are many backend-specific options and many encoder- and decoder-specific options. Most of them are currently difficult or even impossible to discover.
There is the infamous `open_dataset` method, which can do everything, but there are also some specialized methods like `open_zarr` or `to_netcdf`.
The only really formalized way to extend xarray's capabilities is via the `BackendEntrypoint`, which currently covers only reading files. This has proven to work, and things are going so well that people are discussing getting rid of the special reading methods (#7495).
A major critique in this thread is again the discoverability of configuration options.
Problems
To name a few:
open_dataset
What already improved
The future
After listing all the problems, let's see how we can improve the situation and make backends an all-round solution for reading and writing all kinds of files.
What happens behind the scenes
In general the reading and writing of Datasets in xarray is a three-step process.
Probably you could consider combining the chunking and decoding as well as validation and encoding into a single logical step in the pipeline. This view should help decide how to set up a future architecture of backends.
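The reading and writing steps can be pictured with a toy model. The sketch below is only an illustration under stated assumptions: plain dictionaries stand in for `xarray.Dataset`, a single `scale_factor` attribute stands in for the full decoding logic, chunking is left out, and all function names are hypothetical rather than xarray API.

```python
# Toy model of the read/write pipeline. Plain dicts stand in for xarray
# objects; every name here is hypothetical and not part of xarray.


def load_from_store(raw_file):
    """Store step (read): copy the on-disk content into a "backend dataset"."""
    return {
        name: {"values": list(var["values"]), "attrs": dict(var["attrs"])}
        for name, var in raw_file.items()
    }


def decode(backend_ds):
    """Decode step: apply scale_factor and remember it as encoding."""
    decoded = {}
    for name, var in backend_ds.items():
        attrs = dict(var["attrs"])
        scale = attrs.pop("scale_factor", 1)
        decoded[name] = {
            "values": [v * scale for v in var["values"]],
            "attrs": attrs,
            "encoding": {"scale_factor": scale},
        }
    return decoded


def validate(dataset):
    """Validate step: refuse variable names that cannot be written."""
    for name in dataset:
        if not isinstance(name, str) or not name:
            raise ValueError(f"invalid variable name: {name!r}")


def encode(dataset):
    """Encode step: undo the scaling and restore the attribute."""
    encoded = {}
    for name, var in dataset.items():
        scale = var["encoding"].get("scale_factor", 1)
        attrs = dict(var["attrs"])
        if scale != 1:
            attrs["scale_factor"] = scale
        encoded[name] = {
            "values": [round(v / scale) for v in var["values"]],
            "attrs": attrs,
        }
    return encoded


def read_pipeline(raw_file):
    """Read direction: store, then decode."""
    return decode(load_from_store(raw_file))


def write_pipeline(dataset):
    """Write direction: validate, then encode (a store would persist the result)."""
    validate(dataset)
    return encode(dataset)
```

In this model the value returned by `load_from_store` and the value returned by `encode` have the same shape, so a read followed by a write reproduces the on-disk content.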
You can see that there is a common middle object in this process, an in-memory representation of the file on disk that sits between encoding/decoding and the abstract store. This is actually an `xarray.Dataset` and is internally called a "backend dataset".
`write_dataset` method
A quite natural extension of backends would be to implement a `write_dataset` method (name pending). This would allow backends to fulfill the complete right side of the pipeline.
Transformer class
For lack of a common word for a class that handles both "encoding" and "decoding", I will call it a transformer here.
Encoding and decoding are currently hardcoded in the respective `open_dataset` and `to_netcdf` methods.
One could imagine introducing a common class that handles both.
This class could handle the implemented CF or netCDF encoding conventions.
But it would also allow users to define their own storage conventions. (Why not create a custom transformer that adds indexes based on variable attributes?)
The possibilities are endless, and an interface that fulfills all the requirements still has to be found.
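To make the idea a bit more concrete, here is a minimal sketch of what a transformer could look like, using the index-from-attributes example mentioned above. Everything in it is an assumption for illustration: the `Transformer` base class, the `IndexFromAttrsTransformer` name, the `is_index` attribute, and the dict layout that stands in for a dataset. None of it exists in xarray.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


class Transformer(ABC):
    """Hypothetical base class: one object handles decoding and encoding."""

    @abstractmethod
    def decode(self, backend_ds: dict) -> dict:
        """Turn the on-disk representation into the user-facing one."""

    @abstractmethod
    def encode(self, dataset: dict) -> dict:
        """Inverse of decode: prepare a dataset for the store."""


@dataclass
class IndexFromAttrsTransformer(Transformer):
    """Custom convention: variables flagged with `attr` become indexes."""

    # The option is a constructor argument, so it shows up in the signature.
    attr: str = "is_index"

    def decode(self, backend_ds: dict) -> dict:
        variables = {}
        indexes = []
        for name, attrs in backend_ds["variables"].items():
            attrs = dict(attrs)
            # Remove the flag from the attributes and record the index instead.
            if attrs.pop(self.attr, False):
                indexes.append(name)
            variables[name] = attrs
        return {"variables": variables, "indexes": indexes}

    def encode(self, dataset: dict) -> dict:
        variables = {}
        for name, attrs in dataset["variables"].items():
            attrs = dict(attrs)
            # Write the flag back so the convention survives a round trip.
            if name in dataset["indexes"]:
                attrs[self.attr] = True
            variables[name] = attrs
        return {"variables": variables}
```

Keeping `decode` and `encode` on the same object means the two directions share their configuration, which is the main point of the proposal.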
This would homogenize the reading and writing process to
As a bonus, this would make the decoding options more discoverable, since they would become transformer arguments.
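One way to picture the homogenized process is a backend that implements both `open_dataset` and the proposed `write_dataset`, with a transformer sitting between it and the user. The sketch below is hypothetical throughout: `InMemoryBackend`, `UppercaseNames`, `load` and `save` are made-up names, a dict plays the role of the file system, and the method signatures are assumptions rather than the `BackendEntrypoint` API.

```python
class InMemoryBackend:
    """Hypothetical backend covering both directions; a dict plays the disk."""

    def __init__(self):
        self.files = {}

    def open_dataset(self, filename):
        # Return the "backend dataset": the content exactly as stored.
        return dict(self.files[filename])

    def write_dataset(self, backend_ds, filename):
        # Persist an already encoded "backend dataset".
        self.files[filename] = dict(backend_ds)


class UppercaseNames:
    """Toy transformer: names are stored lowercase and exposed uppercase."""

    def decode(self, backend_ds):
        return {name.upper(): values for name, values in backend_ds.items()}

    def encode(self, dataset):
        return {name.lower(): values for name, values in dataset.items()}


def load(filename, backend, transformer):
    """Read: backend first, then the transformer decodes."""
    return transformer.decode(backend.open_dataset(filename))


def save(dataset, filename, backend, transformer):
    """Write: the transformer encodes first, then the backend stores."""
    backend.write_dataset(transformer.encode(dataset), filename)
```

Reading and writing then become mirror images of each other, and swapping the backend or the transformer does not affect the other piece.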
The new interface then could be
while of course still allowing all options to be passed simply as kwargs (since this is still the easiest way of telling beginners how to open files).
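A minimal sketch of how both calling styles could coexist, assuming a hypothetical `transformer` argument. `CFTransformer` and `open_dataset_sketch` are made-up names; only the option names `mask_and_scale` and `decode_times` are borrowed from the existing `open_dataset` keywords. The function returns the resolved configuration instead of actually opening a file.

```python
from dataclasses import dataclass


@dataclass
class CFTransformer:
    """Hypothetical transformer whose fields are the decoding options."""

    mask_and_scale: bool = True
    decode_times: bool = True


def open_dataset_sketch(filename, engine=None, transformer=None, **decoder_kwargs):
    """Hypothetical signature: accept a transformer or plain decoding kwargs."""
    if transformer is not None and decoder_kwargs:
        raise TypeError("pass either a transformer or decoding kwargs, not both")
    if transformer is None:
        # Beginner-friendly path: kwargs are forwarded to the default
        # transformer, so an unknown option raises a TypeError right here.
        transformer = CFTransformer(**decoder_kwargs)
    return {"filename": filename, "engine": engine, "transformer": transformer}


# Both calls resolve to the same configuration.
explicit = open_dataset_sketch(
    "data.nc", engine="netcdf4", transformer=CFTransformer(decode_times=False)
)
simple = open_dataset_sketch("data.nc", engine="netcdf4", decode_times=False)
```

With this shape, the available options can be read off the transformer's signature, while the kwargs form stays available for quick use.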
The final improvement here would be to add additional entrypoints for these transformers ;)
Disclaimer
Now this issue is just a bunch of random ideas that require quite some refinement or they might even turn out to be nonsense.
So let's have an exciting discussion about these things :)
If you have something to add to the above points I will include your ideas as well. This is meant as a collection of ideas on how to improve our backends :)