
ENH: On the fly type conversion #94


Closed
jakirkham opened this issue Nov 30, 2016 · 12 comments
Labels
enhancement: New features or improvements
release notes done: Automatically applied to PRs which have release notes.

Comments

@jakirkham
Member

jakirkham commented Nov 30, 2016

h5py's Dataset has a nice feature where one can convert type on the fly as part of reading by using the astype context manager (e.g. int to float). Is there an equivalent way to do this with Zarr?

@alimanfoo
Member

alimanfoo commented Nov 30, 2016 via email

@jakirkham
Member Author

Was looking at some of the filters to get some ideas. The one point I'm getting stuck on is there are both encode and decode methods. While it is clear what decode means in this context (convert data to the specified type for use), it is not clear what encode means. In fact, we don't really want to ever use this filter to encode data. Is it ok to leave encode undefined?
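
One possible reading of the question above, as a minimal sketch rather than Zarr's actual filter API: `decode` casts stored data to the dtype wanted in memory, and `encode` is simply the inverse cast back to the storage dtype. The class name `AsTypeSketch` and its parameters `encode_dtype` and `decode_dtype` are hypothetical, and only NumPy is assumed.

```python
import numpy as np


class AsTypeSketch:
    """Hypothetical filter: storage dtype on encode, in-memory dtype on decode."""

    def __init__(self, encode_dtype, decode_dtype):
        self.encode_dtype = np.dtype(encode_dtype)
        self.decode_dtype = np.dtype(decode_dtype)

    def encode(self, buf):
        # Cast in-memory data to the dtype used for storage.
        return np.asarray(buf).astype(self.encode_dtype)

    def decode(self, buf):
        # Cast stored data to the dtype wanted for computation.
        return np.asarray(buf).astype(self.decode_dtype)


f = AsTypeSketch(encode_dtype="int16", decode_dtype="float32")
stored = np.array([1, 2, 3], dtype="int16")
decoded = f.decode(stored)     # float32 values for use in memory
roundtrip = f.encode(decoded)  # back to int16 for storage
```

Under this reading `encode` does not need to be left undefined, although a read-only use case would never call it. Note that the inverse cast is only lossless when the in-memory values are still representable in the storage dtype.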

@alimanfoo
Member

alimanfoo commented Dec 5, 2016 via email

@alimanfoo
Member

alimanfoo commented Dec 5, 2016 via email

@jakirkham
Member Author

Am I just overthinking this? Would something like this work?

>>> import numpy
>>> import zarr
>>>
>>> a = zarr.open_array('data.zarr', mode='r')
>>> a = zarr.array(a, dtype=numpy.float32)

@alimanfoo
Member

alimanfoo commented Dec 5, 2016 via email

@jakirkham
Member Author

Would it copy all of it at once or only pieces as requested?

@alimanfoo
Member

alimanfoo commented Dec 5, 2016 via email

@jakirkham
Member Author

To give more context, I have data that is stored on disk as int16. However, before doing any computation with it, I want to make sure it is cast to float32. We need floating-point precision, and anything above single precision is basically a waste, so letting things get promoted to double precision is bad.

If doing this conversion with the above pseudo-code will be slightly wasteful due to a small copying overhead, I can live with this for the interim. However, if the above pseudo-code results in copying all data from disk into memory, then that won't work.

Given this context, what do you think? Would this approach work or do I need to play with filters more?
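
The behavior being asked for (convert only the pieces that are requested, never the whole array) can be sketched with a thin wrapper around any sliceable array. This is an illustration under stated assumptions, not something Zarr provides under this name: `CastOnRead` is a hypothetical class, and a NumPy array stands in for the int16 data on disk.

```python
import numpy as np


class CastOnRead:
    """Hypothetical wrapper: casts only the selection that is actually read."""

    def __init__(self, source, dtype):
        self.source = source          # any object supporting NumPy-style slicing
        self.dtype = np.dtype(dtype)  # dtype wanted in memory
        self.shape = source.shape

    def __getitem__(self, selection):
        # Only the requested region is read from the source and converted.
        return np.asarray(self.source[selection]).astype(self.dtype)


# Stand-in for the stored int16 array (assumption: a real store would slice the same way).
on_disk = np.arange(12, dtype="int16").reshape(3, 4)
view = CastOnRead(on_disk, "float32")
block = view[0:2, 1:3]  # reads and converts a 2x2 region only
```

With a wrapper like this the copy overhead is limited to the region requested per read, and nothing is promoted to double precision along the way.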

@jakirkham
Member Author

jakirkham commented Dec 5, 2016

Oops, sorry, didn't see your comment until I posted that. Yeah was thinking about dask too, but I wanted to weigh Zarr on its own merits first. It's a bit hard to determine the effects on performance if I'm changing too much at once.

@alimanfoo
Member

alimanfoo commented Dec 5, 2016 via email

@jakirkham jakirkham mentioned this issue Dec 7, 2016
@jakirkham
Member Author

Yeah, I don't care so much about making views of Zarr arrays. Really am only interested in converting data on disk from one format to another when read into memory. Hoping this use case has a place in Zarr.

In any event, I have a rough implementation of something I think should help with this, but will need to play with it more to be sure. Please see PR ( https://github.com/alimanfoo/zarr/issues/96 ) for details.

Had a little trouble with it yesterday, which I have worked out today. I think some of the issues were related to registering a filter external to Zarr, and the others were subtle errors in making sure the right type was in the right place. The tests added in the proposal give me some confidence that the latter issues have been solved. As for the former, I may just not know the right way to register filters, which we can discuss in another thread.

@alimanfoo alimanfoo modified the milestone: v2.2 Dec 15, 2016
@alimanfoo alimanfoo added enhancement New features or improvements release notes done Automatically applied to PRs which have release notes. labels Nov 20, 2017