
[V3] v2 -> v3 data migration #1798


Open
d-v-b opened this issue Apr 17, 2024 · 13 comments · Fixed by #2596
Labels
enhancement New features or improvements
Milestone

Comments

@d-v-b
Contributor

d-v-b commented Apr 17, 2024

We should invest in tools to make the v2 -> v3 conversion simple for people who are motivated to convert their data. A few high-level ideas:

  • A simple CLI that converts an array, or a group (recursive or not) from v2 to v3, in a new location.
    • Someone should investigate how complicated in-place conversions would be. On a local filesystem where mv is cheap, this could be attractive. V3 is designed to make array conversions easy, requiring only the creation of new metadata.
    • The CLI should use functions that are accessible from scripts that don't use the CLI. We can look at work @normanrz did in Zarrita.
  • Documentation of the key differences between zarr-python v2 and v3, and a migration guide. This should have its own page in the docs.
    • We should consider options for people who don't want to re-save their data. I'm not presently a kerchunk user, but I presume that kerchunk could map v2 to v3, for people who don't want to convert their data? cc @martindurant.
@jeromekelleher
Member

jeromekelleher commented Apr 17, 2024

Big +1 on this. I'm working on a conversion tool for large-scale genomics data (hundreds of TB) which is usually held in file systems (for the moment; it will probably migrate to object stores later on). A CLI tool that does an in-place migration from v2 to v3 would be a big help. I'm hoping to move to v3 early on, before too many datasets are converted into v2 format, so that most users won't ever need to know about v2.

My assumption was that the migration was largely a case of writing a new JSON metadata file per array, and should be possible to do both cheaply and safely?

@d-v-b
Contributor Author

d-v-b commented Apr 17, 2024

My assumption was that the migration was largely a case of writing a new JSON metadata file per array, and should be possible to do both cheaply and safely?

Yes, I think this is right. Besides the metadata, which will live in a completely new JSON document (zarr.json), V3 supports a backwards-compatible layout for the chunks.

@jeromekelleher
Member

Thanks yes, I've been aiming for v3 forwards compatibility by using "/" as the default dimension separator. Then, iterating over the chunks in the first dimension and renaming to have a "c" prefix should be relatively cheap (I forgot about this difference).

Is there some developer documentation with recommendations for forwards/backwards compatibility?

@normanrz
Member

  • Someone should investigate how complicated in-place conversions would be. On a local filesystem where mv is cheap, this could be attractive.

For most cases, the migration only requires adding zarr.json files throughout the hierarchy. There should be no need to even touch the chunk files. zarr.json and .zarray files can also live side-by-side. So, why would a mv be needed?
Chunks only need to be rewritten when an unsupported codec or filter is in use.
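To make the "only new metadata" point concrete, here is a minimal sketch of translating a v2 .zarray document into a v3 zarr.json document. The function name `v2_array_to_v3` and the dtype table are hypothetical and not part of zarr-python. The sketch assumes little-endian or single-byte dtypes, C order, no filters, and either no compressor or the numcodecs `gzip` compressor; anything else raises, since those are the cases where chunks might need rewriting. It also assumes a zero-like fill value when the v2 metadata has `fill_value: null`.

```python
# Partial map from v2 NumPy type strings to v3 data type names (sketch only).
_V2_TO_V3_DTYPE = {
    "|b1": "bool",
    "|i1": "int8", "|u1": "uint8",
    "<i2": "int16", "<i4": "int32", "<i8": "int64",
    "<u2": "uint16", "<u4": "uint32", "<u8": "uint64",
    "<f4": "float32", "<f8": "float64",
}


def v2_array_to_v3(zarray, zattrs=None):
    """Build a v3 zarr.json document (as a dict) from v2 .zarray/.zattrs dicts.

    Hypothetical helper: it only covers the simple cases listed above.
    """
    dtype = zarray["dtype"]
    if dtype not in _V2_TO_V3_DTYPE:
        raise ValueError(f"dtype not handled by this sketch: {dtype!r}")
    if zarray.get("order", "C") != "C":
        raise ValueError("only C-order arrays are handled by this sketch")
    if zarray.get("filters"):
        raise ValueError("filters are not handled by this sketch")

    # The array-to-bytes codec; endianness only matters for multi-byte types.
    if dtype.startswith("|"):
        codecs = [{"name": "bytes"}]
    else:
        codecs = [{"name": "bytes", "configuration": {"endian": "little"}}]

    compressor = zarray.get("compressor")
    if compressor is not None:
        if compressor.get("id") != "gzip":
            raise ValueError(f"compressor not handled: {compressor.get('id')!r}")
        codecs.append(
            {"name": "gzip", "configuration": {"level": compressor["level"]}}
        )

    fill_value = zarray.get("fill_value")
    if fill_value is None:
        # Assumption: v3 requires a fill value, so use a zero-like default.
        fill_value = False if dtype == "|b1" else 0

    return {
        "zarr_format": 3,
        "node_type": "array",
        "shape": list(zarray["shape"]),
        "data_type": _V2_TO_V3_DTYPE[dtype],
        "chunk_grid": {
            "name": "regular",
            "configuration": {"chunk_shape": list(zarray["chunks"])},
        },
        # The "v2" encoding keeps the existing chunk keys valid.
        "chunk_key_encoding": {
            "name": "v2",
            "configuration": {
                "separator": zarray.get("dimension_separator", "."),
            },
        },
        "fill_value": fill_value,
        "codecs": codecs,
        "attributes": dict(zattrs or {}),
    }
```

The returned dict could be written next to the existing .zarray file with `json.dump`, leaving the chunk files untouched.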

@d-v-b
Contributor Author

d-v-b commented Apr 17, 2024

So, why would a mv be needed?
Chunks only need to be rewritten when an unsupported codec or filter is in use.

This is correct. When I wrote up this issue, I forgot about the v2 chunk key encoding supported by v3 🤦
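For readers unfamiliar with the two chunk key encodings discussed here, the following sketch shows how a chunk's grid coordinates map to a storage key under each one, following the rules in the v3 specification. The helper name `chunk_key` is hypothetical.

```python
def chunk_key(coords, encoding="default", separator=None):
    """Return the storage key for a chunk at grid coordinates `coords`.

    Hypothetical helper illustrating the v3 "default" and "v2" encodings.
    """
    parts = [str(c) for c in coords]
    if encoding == "default":
        # Keys are prefixed with "c"; the default separator is "/".
        sep = "/" if separator is None else separator
        return sep.join(["c"] + parts)
    if encoding == "v2":
        # Keys match what v2 writes; the default separator is ".".
        sep = "." if separator is None else separator
        return sep.join(parts) if parts else "0"
    raise ValueError(f"unknown chunk key encoding: {encoding!r}")


print(chunk_key((0, 1, 2)))             # c/0/1/2
print(chunk_key((0, 1, 2), "v2"))       # 0.1.2
print(chunk_key((0, 1, 2), "v2", "/"))  # 0/1/2
```

With the "v2" encoding declared in zarr.json, the keys a v3 reader computes are the same ones the v2 writer used, which is why no rename is needed.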

@d-v-b
Contributor Author

d-v-b commented Apr 17, 2024

I updated the issue to be more accurate :)

@normanrz
Member

I agree that a CLI tool that can convert an entire hierarchy would be great!

@jhamman jhamman added the V3 label Apr 19, 2024
@jhamman jhamman added this to the After 3.0.0 milestone Apr 19, 2024
@jhamman
Member

jhamman commented Apr 19, 2024

Today I learned that there is a v1 to v2 migrator in the zarr-python codebase:

zarr-python/zarr/storage.py

Lines 1941 to 1956 in 6105ef2

def migrate_1to2(store):
    """Migrate array metadata in `store` from Zarr format version 1 to
    version 2.

    Parameters
    ----------
    store : Store
        Store to be migrated.

    Notes
    -----
    Version 1 did not support hierarchies, so this migration function will
    look for a single array in `store` and migrate the array metadata to
    version 2.

    """

@jhamman jhamman modified the milestones: After 3.0.0, 3.0.0 Apr 22, 2024
@jhamman jhamman moved this to Todo in Zarr-Python - 3.0 Apr 22, 2024
@dstansby dstansby removed the V3 label Dec 12, 2024
@dstansby dstansby changed the title [V3] v2 -> v3 migration [V3] v2 -> v3 datamigration Dec 16, 2024
@dstansby dstansby changed the title [V3] v2 -> v3 datamigration [V3] v2 -> v3 data migration Dec 16, 2024
@jhamman jhamman marked this as a duplicate of #2564 Dec 18, 2024
@dstansby dstansby marked this as not a duplicate of #2564 Dec 18, 2024
@jhamman jhamman marked this as not a duplicate of #2564 Dec 19, 2024
@dstansby dstansby added the enhancement New features or improvements label Dec 30, 2024
@jhamman jhamman modified the milestones: 3.0.0, After 3.0.0 Jan 3, 2025
@github-project-automation github-project-automation bot moved this from In Progress to Done in Zarr-Python - 3.0 Jan 3, 2025
@jhamman jhamman reopened this Jan 3, 2025
@jhamman
Member

jhamman commented Jan 3, 2025

#2596 was mislabeled as closing this. A migration tool would still be great!

@meteoDaniel

I am interested, too.

Without the opportunity to migrate data from v2 to v3, it would be a nightmare to deprecate zarr v2 support.

@eschechter

This gist is a first-pass effort at a migration script. My team has begun to use it to explore converting our existing zarr datasets, but it is currently limited in scope to zarrs stored on s3. I'd welcome any feedback or discussion about how this could be useful to other devs.

@meteoDaniel

@eschechter thanks for sharing!

So it is enough to adapt the metadata without touching the data itself?

@eschechter

That's correct @meteoDaniel - the zarr v3 conversion can be done with just the creation of zarr.json metadata files. No need to touch the data.
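As an illustration of the metadata-only approach on a local filesystem, here is a minimal sketch that walks a directory tree and writes a v3 zarr.json next to every v2 .zgroup file. It only handles groups; arrays would additionally need their .zarray translated. The function name `add_v3_group_metadata` is hypothetical, and this is not the gist mentioned above.

```python
import json
from pathlib import Path


def add_v3_group_metadata(root):
    """Write zarr.json beside every .zgroup under `root`; return paths written.

    Hypothetical helper: existing zarr.json files are left alone, and no
    v2 metadata or chunk files are modified or removed.
    """
    written = []
    for zgroup in sorted(Path(root).rglob(".zgroup")):
        node = zgroup.parent
        target = node / "zarr.json"
        if target.exists():
            continue  # never overwrite v3 metadata that is already there
        zattrs = node / ".zattrs"
        # v2 keeps user attributes in .zattrs; v3 embeds them in zarr.json.
        attributes = json.loads(zattrs.read_text()) if zattrs.exists() else {}
        doc = {"zarr_format": 3, "node_type": "group", "attributes": attributes}
        target.write_text(json.dumps(doc, indent=2))
        written.append(target)
    return written
```

Because the v2 files stay in place, the hierarchy remains readable by v2 tools after the function runs.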

Projects
Status: Done

7 participants