Skip to content

Fixed MemoryStore.list_dir #2117

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Aug 26, 2024
Merged

Conversation

TomAugspurger
Copy link
Contributor

Ensures that nested children are listed properly.

Closes #2116

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/tutorial.rst
  • Changes documented in docs/release.rst
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

Ensures that nested children are listed properly.
await store.set("foo/c/d/2", Buffer.from_bytes(b"\x01"))
await store.set("foo/c/d/3", Buffer.from_bytes(b"\x01"))

keys_expected = ["foo"]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, this is failing on the S3 tests.

        keys_expected = ["foo"]
        keys_observed = [k async for k in store.list_dir("")]
>       assert set(keys_observed) == set(keys_expected), keys_observed
E       AssertionError: ['test/foo']

It's including self.path (the bucket name?) in listing. Based on my reading of the tests, that should not be included, but I'm not 100% sure. 90940a0 has that fix

Copy link
Contributor

@dstansby dstansby left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm struggling to get my head around the new logic. Reading #2116, is the goal here to just not return any duplicate values? If so, I wonder if it would be easier to keep track of the values that have been returned, and only yield a value if it hasn't been returned yet (otherwise, check the next value). This way it would also truly generate the values one at a time, instead of the new approach that computes all the keys before yielding them.

@TomAugspurger
Copy link
Contributor Author

Thanks for the review!

is the goal here to just not return any duplicate values?

Effectively yes. Duplicate keys were a consequence of the previous implementation not handling arrays nested in child groups properly.

If so, I wonder if it would be easier to keep track of the values that have been returned, and only yield a value if it hasn't been returned yet (otherwise, check the next value). This way it would also truly generate the values one at a time, instead of the new approach that computes all the keys before yielding them.

3c845d9 has an approach that's a middle ground between my previous fix and your suggestion. I think it's a lot simpler. It does still generate all the keys before yield them, but I'm not too worried about that. This is an in-memory store, so things should be pretty small and fast, and this matches the implementation of the non-prefix case.

Copy link
Contributor

@dstansby dstansby left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 looks good to me. I think if you drop the last bit of the assert statement, pytest is clever and will report the difference between the two sets in it's output when the test fails.

Copy link
Member

@jhamman jhamman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch @TomAugspurger!

Comment on lines 194 to 197
await store.set("foo/c/1", Buffer.from_bytes(b"\x01"))
await store.set("foo/c/d/1", Buffer.from_bytes(b"\x01"))
await store.set("foo/c/d/2", Buffer.from_bytes(b"\x01"))
await store.set("foo/c/d/3", Buffer.from_bytes(b"\x01"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think its super material to the test but I believe the intent of this test was to have foo be an array c be the subdirectory containing chunks. Perhaps it would be better to change this to something more like:

Suggested change
await store.set("foo/c/1", Buffer.from_bytes(b"\x01"))
await store.set("foo/c/d/1", Buffer.from_bytes(b"\x01"))
await store.set("foo/c/d/2", Buffer.from_bytes(b"\x01"))
await store.set("foo/c/d/3", Buffer.from_bytes(b"\x01"))
await store.set("foo/bar/c/1", Buffer.from_bytes(b"\x01"))
await store.set("foo/bar/c/2", Buffer.from_bytes(b"\x01"))
await store.set("foo/spam/c/1", Buffer.from_bytes(b"\x01"))
await store.set("foo/spam/c/2", Buffer.from_bytes(b"\x01"))

Copy link
Contributor Author

@TomAugspurger TomAugspurger Aug 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right... IIUC, if you have a key at foo/zarr.json with a node_type="array", that implies you won't have any keys at foo/bar/zarr.json (a group or a node), since arrays can't have children?

I tried to update this in 7414b10, to make sure that the test store we construct is at least compatible with how a Zarr hierarchy would be laid out.

/foo  # group
  /c  # array
/group-0
  /group-1
    /array-0
    /array-1
    /array-2

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great!

@jhamman jhamman merged commit 61683be into zarr-developers:v3 Aug 26, 2024
25 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

MemoryStore.list_dir incorrectly includes non-immediate children
3 participants