ENH: On the fly type conversion #94
Comments
Yes I think you could do this with filters, although I'm not sure what the
cleanest way would be. Will look at the h5py API and give it some thought.
…On Wed, 30 Nov 2016 at 22:30, jakirkham ***@***.***> wrote:
h5py's Dataset has a nice feature where one can convert type on the fly as
part of reading by using the astype context manager (e.g. int to float).
Is there an equivalent way to do this with Zarr?
|
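Was looking at some of the filters to get some ideas. The one point I'm getting stuck on is there are both encode and decode methods. While it is clear what decode means in this context (convert data to the specified type for use), it is not clear what encode means. In fact, we don't really want to ever use this filter to encode data. Is it ok to leave encode undefined?
|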
Yes I think so, encode can be a no-op, just pass buffer through unchanged.
There are other cases where decode is a no-op, e.g., quantize.
…On Mon, 5 Dec 2016 at 15:59, jakirkham ***@***.***> wrote:
Was looking at some of the filters to get some ideas. The one point I'm
getting stuck on is there are both encode and decode methods. While it is
clear what decode means in this context (convert data to the specified type
for use), it is not clear what encode means. In fact, we don't really
want to ever use this filter to encode data. Is it ok to leave encode
undefined?
|
...although might need some thought around whether you allow the resulting
array to be mutable. If read-only then should be fine for encode to be
no-op.
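A minimal sketch of the idea discussed above, assuming a filter is just an object with encode and decode methods. The class name AsTypeOnRead and its parameters are hypothetical, not part of Zarr's API; the point is only that encode passes the buffer through while decode does the conversion.

```python
import numpy as np


class AsTypeOnRead:
    """Hypothetical filter: pass data through on write, convert dtype on read."""

    def __init__(self, stored_dtype, read_dtype):
        self.stored_dtype = np.dtype(stored_dtype)
        self.read_dtype = np.dtype(read_dtype)

    def encode(self, buf):
        # No-op: the chunk is stored exactly as it was given.
        return buf

    def decode(self, buf):
        # Interpret the raw chunk in its stored dtype, then convert a copy.
        stored = np.frombuffer(buf, dtype=self.stored_dtype)
        return stored.astype(self.read_dtype)


chunk = np.arange(5, dtype="i4")   # stand-in for one stored chunk
f = AsTypeOnRead("i4", "f4")
raw = f.encode(chunk)              # same object, unchanged
out = f.decode(raw)                # float32 copy of the chunk
```

With a no-op encode, anything written back through the filter would be stored in the read dtype and later misread as the stored dtype, which is why the read-only caveat matters.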
|
Am I just overthinking this? Would something like this work?
>>> import numpy
>>> import zarr
>>>
>>> a = zarr.open_array('data.zarr', mode='r')
>>> a = zarr.array(a, dtype=numpy.float32)
|
That would copy the data into a new array.
|
Would it copy all of it at once or only pieces as requested? |
It would copy all.
I wonder if dask is the solution here. Can't check this works right now but
I'm thinking:
import dask.array as da
import zarr
z = zarr.zeros(1000, chunks=100, dtype='f8')
d = da.from_array(z, chunks=z.chunks)
d2 = d.astype('f4')  # lazy: nothing is converted yet
Then requesting any data via d2 would only convert the region requested and
compute on the fly.
|
To give more context, I have data that is stored to disk as
If doing this conversion with the above pseudo-code is slightly wasteful due to a small copying overhead, I can live with that for the interim. However, if it results in copying all the data from disk into memory, then that won't work. Given this context, what do you think? Would this approach work, or do I need to play with filters more?
|
Oops, sorry, didn't see your comment until I posted that. Yeah was thinking about dask too, but I wanted to weigh Zarr on its own merits first. It's a bit hard to determine the effects on performance if I'm changing too much at once. |
Fwiw I think something like this could be done in zarr, basically by making
a zarr array which is a view on another zarr array, except with an extra
filter performing the astype conversion on data being retrieved. There is
some machinery already in zarr for creating arrays that are a view on
another array's data, and some of that could be reused.
However, this is starting to cross the line into stuff that makes more sense
to handle at a higher layer in the stack, i.e., via dask.array. There could
be performance costs as well as benefits to using dask here.
Happy to discuss further if you'd like to explore this in zarr; I should have
more time next week to think about it.
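The "view with a conversion on retrieval" idea can be sketched roughly as follows. This is not Zarr's view machinery; AsTypeView is a hypothetical wrapper, and a plain NumPy array stands in for the stored array. It only illustrates that the conversion touches the region requested, not the whole array.

```python
import numpy as np


class AsTypeView:
    """Hypothetical read-only wrapper that converts dtype per request."""

    def __init__(self, source, dtype):
        self.source = source           # anything sliceable that returns arrays
        self.dtype = np.dtype(dtype)
        self.shape = source.shape

    def __getitem__(self, selection):
        # Only the selected region is read from the source and converted.
        return np.asarray(self.source[selection]).astype(self.dtype)


stored = np.arange(1000, dtype="i8")   # stand-in for an on-disk array
view = AsTypeView(stored, "f4")
part = view[10:20]                     # converts just these 10 elements
```

Because the wrapper defines no __setitem__, it stays read-only, which sidesteps the question of what a write through the converted type should mean.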
|
Yeah, I don't care so much about making views of Zarr arrays. I'm really only interested in converting data on disk from one format to another when it is read into memory. Hoping this use case has a place in Zarr.
In any event, I have a rough implementation of something I think should help with this, but will need to play with it more to be sure. Please see PR ( https://github.com/alimanfoo/zarr/issues/96 ) for details.
Had a little trouble with it yesterday, which I have worked out today. I think some of the issues were related to registering a filter external to Zarr, and the others were subtle errors in making sure the right type was in the right place. The tests added in the proposal give me some confidence that the latter issues have been solved. As for the former, I may just not know the right way to register filters, which we can discuss in another thread.
|
h5py's Dataset has a nice feature where one can convert type on the fly as part of reading by using the astype context manager (e.g. int to float). Is there an equivalent way to do this with Zarr?