Skip to content

Normalize paths when creating StorePath #2850

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
Mar 4, 2025

Conversation

d-v-b
Copy link
Contributor

@d-v-b d-v-b commented Feb 19, 2025

In main we don't normalize paths when constructing an instance of StorePath, which means it's possible to create groups with names like a//b. This PR fixes this behavior by ensuring that the path attribute of StorePath is normalized with normalize_path in StorePath.__init__.

There is potentially a problem here if different store classes have different rules about path names. This argues against the existence of a separate StorePath class, and in favor of allowing Store instances to manage their current path.

@github-actions github-actions bot added the needs release notes Automatically applied to PRs which haven't added release notes label Feb 19, 2025
@dcherian
Copy link
Contributor

There is potentially a problem here if different store classes have different rules about path names.

The spec controls the characters that can be used, and the "separator" is configurable, so I'm finding it hard to see what other axis of flexbility exists.

# From https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#node-names
# 1. must not be the empty string ("")
# 2. must not include the character "/"
# 3. must not be a string composed only of period characters, e.g. "." or ".."
# 4. must not start with the reserved prefix "__"
zarr_key_chars = st.sampled_from(
".-0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz"
)
node_names = st.text(zarr_key_chars, min_size=1).filter(
lambda t: t not in (".", "..") and not t.startswith("__")
)

@d-v-b
Copy link
Contributor Author

d-v-b commented Feb 20, 2025

There is potentially a problem here if different store classes have different rules about path names.

The spec controls the characters that can be used, and the "separator" is configurable, so I'm finding it hard to see what other axis of flexbility exists.

# From https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#node-names
# 1. must not be the empty string ("")
# 2. must not include the character "/"
# 3. must not be a string composed only of period characters, e.g. "." or ".."
# 4. must not start with the reserved prefix "__"
zarr_key_chars = st.sampled_from(
".-0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz"
)
node_names = st.text(zarr_key_chars, min_size=1).filter(
lambda t: t not in (".", "..") and not t.startswith("__")
)

different key-value storage backends can set different rules about what keys are allowed. e.g., s3 and OSX might have different rules about file name length. It's not terrible if we end up parsing paths twice, but if there's a bug in one parsing path or the other, it would be annoying to look in two different classes to find that bug. I guess my broader complaint is that StorePath is a store with 1 extra property. Any logic we add to StorePath is logic that should exist on the store class itself, and this is duplicative. But we can live with it for now.

@aladinor
Copy link

I guess by merging this it will close #2830 right @d-v-b?

@d-v-b
Copy link
Contributor Author

d-v-b commented Feb 26, 2025

I believe so

@d-v-b d-v-b requested a review from jhamman February 26, 2025 17:07
@@ -277,6 +278,8 @@ def arrays(
if a.metadata.zarr_format == 3:
assert a.fill_value is not None
assert a.name is not None
assert a.path == normalize_path(array_path)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we delete line 287 then.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

@TomNicholas TomNicholas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we merge this? I think it's blocking DataTree compatibility.

@d-v-b
Copy link
Contributor Author

d-v-b commented Mar 4, 2025

@dcherian could you review this?

Copy link
Contributor

@dstansby dstansby left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Is this fixing a bug? (from the description it seems so?). If so, it should get a changelog entry explaining to users what's fixed. Also I left one minor test suggestion inline.

"""
Test basic properties of groups
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do think the old test here is useful (in addition to the updated test), because it shows explicitly what the values of all the properties are, making it much easier to see what's changed if they change (which they might have done with this PR). So I'd be pro just copy/pasting the old test back here along with the new test?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the new test is the same as the old test, but parametrized instead of 2 cases copy + pasted

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as a rule we should avoid tests that follow the "copy + pasted code blocks with minor modifications" pattern. these are huge source of friction when fixing bugs. that friction should be reduced by using parametrization where we can, as I did here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree that we shouldn't copy/paste with minor modifications, but I think one test that hard codes the results instead of using other properties is worthwhile to have.

@github-actions github-actions bot removed the needs release notes Automatically applied to PRs which haven't added release notes label Mar 4, 2025
@d-v-b d-v-b requested a review from dstansby March 4, 2025 17:46
@d-v-b d-v-b merged commit ee60792 into zarr-developers:main Mar 4, 2025
29 of 30 checks passed
@d-v-b d-v-b deleted the fix/node-name-resolution branch March 4, 2025 17:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants