Confused about Zarr2 to Zarr3 conversion #3024

Open
Fafa87 opened this issue Apr 28, 2025 · 2 comments

Comments

@Fafa87

Fafa87 commented Apr 28, 2025

I just wanted to leave a record of my experience with Zarr2 to Zarr3 conversion (including shards).

Basically I wanted to check out the new Zarr version - mainly for the sharding feature.
The idea was just to grab a Zarr2 array that I have and convert it to Zarr3 with x2 / x4 sharding to test the performance.

Well, it took a while, because the first thing that I did was just:

import zarr

# Open the existing Zarr2 group and grab its first array
store = zarr.storage.LocalStore(single_zarr_path)
group = zarr.open_group(store)
arrays_from_group = list(group.arrays())
data = arrays_from_group[0][1]

# Convert it straight to Zarr3
store_3 = zarr.storage.LocalStore(output_zarr3_path)
zarr_from_array = zarr.from_array(store_3, data=data, zarr_format=3)

It worked and I got a working Zarr3 array. So easy.

Then I noticed there is a sharding parameter there, so I thought I would just fill it in (with x4 shards) and get a sharded Zarr3:

store_shards = zarr.storage.LocalStore(output_zarr3sharded_path)
zarr_from_array = zarr.from_array(store_shards, data=data, shards=(1, 1, 1, 4096, 4096), zarr_format=3)

It did not work - although it looked like it did. When I looked at info_complete() it reported that the compression ratio went from 1.4 to over 5, which was suspicious. It turned out that most of the files were just empty. Then I wanted to compute the sum() of all pixels and it failed with an error about encoding / compression - so I went down that dead end.
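
A quick way to catch this kind of silent failure is to read a small region back right after the write and compare it against the source. A minimal sanity-check sketch, assuming the data and zarr_from_array variables from the snippets above:

import numpy as np

# One source chunk's worth of data from the first channel
sample = np.s_[0, 0, 0, :1024, :1024]
print(np.array_equal(data[sample], zarr_from_array[sample]))
# False (or a decompression error) here flags the broken conversion immediately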

Finally I got to the point where the correct (is it?) way to do it is to create a new empty array and copy the data:

zarr_with_sharding = zarr.create_array(store_shards, shape=data.shape, dtype=data.dtype, chunks=(1, 1, 1, 1024, 1024), shards=(1, 1, 1, 4096, 4096), zarr_format=3, overwrite=True)
zarr_with_sharding[:] = data[:]  # assuming that data is small enough to fit in memory
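
For arrays that do not fit in memory, the same copy could be done in shard-sized pieces instead of one big assignment. A minimal sketch under that assumption, reusing the shard shape from above:

import itertools

shard = (1, 1, 1, 4096, 4096)  # must match the shards= argument above

# Walk over shard-aligned slabs so each write touches whole shards only
ranges = [range(0, size, step) for size, step in zip(data.shape, shard)]
for origin in itertools.product(*ranges):
    sel = tuple(slice(o, min(o + step, size))
                for o, step, size in zip(origin, shard, data.shape))
    zarr_with_sharding[sel] = data[sel]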

Is there a guideline for people on how to convert their Zarr2 datasets to the new Zarr3 with sharding?
I did not find anything on that - the only thing I found is the (legacy?) https://github.com/ome/ome2024-ngff-challenge

@d-v-b
Contributor

d-v-b commented Apr 28, 2025

hi @Fafa87, sorry to hear about the confusion here, and thanks for writing this up as an issue. What you initially tried (using from_array with the original array, but a new sharding configuration) should work, so I'm thinking we have a bug in from_array.
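
If it helps to narrow this down, here is a minimal reproduction sketch with a small synthetic array (the in-memory stores, the shapes, and the assumption that create_array accepts zarr_format=2 are mine):

import numpy as np
import zarr

# Small Zarr2 source array with 1024x1024 chunks
src = zarr.create_array(zarr.storage.MemoryStore(), shape=(2048, 2048),
                        dtype="uint16", chunks=(1024, 1024), zarr_format=2)
src[:] = np.random.default_rng(0).integers(0, 2**16, size=(2048, 2048), dtype="uint16")

# Convert to Zarr3 with a single 2048x2048 shard
dst = zarr.from_array(zarr.storage.MemoryStore(), data=src,
                      shards=(2048, 2048), zarr_format=3)

print(np.array_equal(src[:], dst[:]))  # True expected; False or an error reproduces the issue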

@Fafa87
Author

Fafa87 commented Apr 28, 2025

Oh, then let me add more info about the datasets:

Zarr: 3.0.7

data (original Zarr2):

Type               : Array
Zarr format        : 2
Data type          : uint16
Shape              : (1, 3, 1, 11840, 59200)
Chunk shape        : (1, 1, 1, 1024, 1024)
Order              : C
Read-only          : False
Store type         : LocalStore
Filters            : ()
Compressors        : (Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0),)
No. bytes          : 4205568000 (3.9G)
No. bytes stored   : 3183305680
Storage ratio      : 1.3
Chunks Initialized : 0,

After conversion:

Type               : Array
Zarr format        : 3
Data type          : DataType.uint16
Shape              : (1, 3, 1, 11840, 59200)
Shard shape        : (1, 1, 1, 4096, 4096)
Chunk shape        : (1, 1, 1, 1024, 1024)
Order              : C
Read-only          : False
Store type         : LocalStore
Filters            : ()
Serializer         : BytesCodec(endian=<Endian.little: 'little'>)
Compressors        : (ZstdCodec(level=0, checksum=False),)
No. bytes          : 4205568000 (3.9G)
No. bytes stored   : 780577258
Storage ratio      : 5.4  <<<< !!
Shards Initialized : 135

Computing a sum using dask breaks:

import zarr
import dask.array as da

group = zarr.open(store_shards)
print("Sum: ", da.from_zarr(group).sum().compute())

--> RuntimeError: Zstd decompression error: b'Src size is incorrect'
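
To rule dask out, the same read could be attempted through zarr alone. A minimal sketch, reusing store_shards from above:

import zarr

arr = zarr.open_array(store_shards)
try:
    total = sum(int(arr[0, c, 0].sum()) for c in range(arr.shape[1]))
    print("Sum:", total)
except RuntimeError as err:
    # Failing here too would point at the stored shards rather than at dask
    print("Reading back through zarr also fails:", err)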
