Structured Arrays #110

Closed
jakirkham opened this issue Feb 10, 2017 · 2 comments

@jakirkham
Member

Was interested in the possibility of storing structured arrays (a.k.a. record arrays or compound arrays) using Zarr. This is sort of related to PR ( https://github.com/alimanfoo/zarr/pull/84 ), but structured arrays are a simpler type. They also correspond to a NumPy array type and an HDF5 dataset type, so it might make sense to add similar support in Zarr. OTOH, in both HDF5 and Zarr it is possible to construct a group that contains the individual arrays, and at least with HDF5 this makes the data easier to view using HDFView. Am opening this issue to discuss and weigh different options regarding the storage of record arrays using Zarr.

ref: https://docs.scipy.org/doc/numpy/user/basics.rec.html
ref: https://support.hdfgroup.org/HDF5/Tutor/compound.html

@alimanfoo
Member

Zarr does support storing structured arrays, e.g.:

In [9]: import numpy as np

In [10]: import zarr

In [11]: a = np.array([(b'a', 1), (b'b', 2)], dtype=[('foo', 'S1'), ('bar', int)])

In [12]: z = zarr.array(a)

In [13]: z
Out[13]: 
Array((2,), [('foo', 'S1'), ('bar', '<i8')], chunks=(2,), order=C)
  nbytes: 18; nbytes_stored: 438; ratio: 0.0; initialized: 1/1
  compressor: Blosc(cname='lz4', clevel=5, shuffle=1)
  store: dict

However, one nice feature that h5py has which zarr doesn't currently have is the ability to load a specific field or fields, e.g., z[:, 'foo']. I have actually wanted this recently.
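For comparison, here is what field selection looks like on the in-memory NumPy array itself. This is only a sketch of the behaviour being asked for and makes no Zarr calls; it assumes an explicit `'<i8'` in place of `int` so that the item size is the same on every platform.

```python
import numpy as np

# Same records as in the session above, with the integer width spelled out
# (assumption: '<i8' instead of the platform-dependent `int`).
a = np.array([(b'a', 1), (b'b', 2)], dtype=[('foo', 'S1'), ('bar', '<i8')])

# Selecting one field gives an array with a simple dtype.
foo = a['foo']

# Selecting a list of fields gives a structured array limited to those fields.
sub = a[['foo', 'bar']]

# Each record is 1 byte (S1) + 8 bytes (<i8), so two records take 18 bytes,
# which matches the `nbytes: 18` reported for the Zarr array above.
record_size = a.dtype.itemsize
total_size = a.nbytes
```

With a structured dtype the fields are interleaved record by record, so reading only `foo` from stored chunks still means decoding whole records unless the storage layer supports field selection.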

FWIW regarding the choice of single array with structured dtype versus one array with simple dtype per column, I've generally found the latter (i.e., columnar) storage to be more flexible and more efficient for a variety of uses. However I still do use structured arrays too and so would be interested to support both patterns.
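To make the two layouts concrete, the sketch below splits a structured array into one simple-dtype array per field (the columnar pattern) and then rebuilds the structured array from those columns. It uses only NumPy; the names `columns` and `rebuilt` are illustrative, and how each column would then be written to a Zarr group is left out.

```python
import numpy as np

# Structured layout: one array whose records interleave the fields.
a = np.array([(b'a', 1), (b'b', 2)], dtype=[('foo', 'S1'), ('bar', '<i8')])

# Columnar layout: one contiguous array with a simple dtype per field.
# a[name] is a strided view into the records, so copy it to make it contiguous.
columns = {name: np.ascontiguousarray(a[name]) for name in a.dtype.names}

# Going back: allocate the structured array and fill it field by field.
rebuilt = np.empty(len(a), dtype=a.dtype)
for name, col in columns.items():
    rebuilt[name] = col
```

Each column can then be chunked, compressed, and read independently, which is one reason the columnar pattern tends to be more flexible; the structured form keeps each record's fields together.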

@jakirkham
Member Author

jakirkham commented Feb 10, 2017

Oh, that's good to know. Then maybe I'm running into a bug or an unsupported edge case. ( https://github.com/alimanfoo/zarr/issues/111 ) Will close this out.

> However, one nice feature that h5py has which zarr doesn't currently have is the ability to load a specific field or fields, e.g., z[:, 'foo']. I have actually wanted this recently.

Indeed, that would be a very nice feature for Zarr. Used this with h5py some myself.

Edit: Opened issue ( https://github.com/alimanfoo/zarr/issues/112 ) on this point.

> ...I've generally found the latter (i.e., columnar) storage to be more flexible and more efficient for a variety of uses. However I still do use structured arrays too and so would be interested to support both patterns.

By columnar, I'm assuming you mean having a single type for each array, in which case I do agree with you. Still, sometimes a structured array is just the right data structure for the problem.

jhamman added a commit to jhamman/zarr-python that referenced this issue Apr 20, 2024
…#110)

* feature(store): make list_* methods async generators

* Update src/zarr/v3/store/memory.py

* Apply suggestions from code review

- simplify code comments
- use `removeprefix` instead of `strip`

---------

Co-authored-by: Davis Bennett
d-v-b added a commit that referenced this issue May 15, 2024
* feat: functional .children method for groups

* changes necessary for correctly generating list of children

* add stand-alone test for group.children

* give type hints a glow-up

* test: use separate assert statements to avoid platform-dependent ordering issues

* test: put fixtures in conftest, add MemoryStore fixture

* docs: release notes

* test: remove prematurely-added mock s3 fixture

* chore: move v3 tests into v3 folder

* chore: type hints

* test: add schema for group method tests

* chore: add type for zarr_formats

* chore: remove localstore for now

* test: add __init__.py to support imports from top-level conftest.py, and add some docstrings, and remove redundant def

* fix: return valid JSON from GroupMetadata.to_bytes for v2 metadata

* fix: don't use a type as a value

* test: add getitem test

* fix: replace reference to nonexistent  method in  with , which does exist

* test: declare v3ness via directory structure, not test file name

* add a docstring to _get, and pass auto_mkdir to _put

* fix: add docstring to LocalStore.get_partial_values; adjust body of LocalStore.get_partial_values to properly handle the byte_range parameter of LocalStore.get.

* test: add tests for localstore init, set, get, get_partial

* fix: Rename children to members; AsyncGroup.members yields tuples of (name, AsyncArray / AsyncGroup) pairs; Group.members repackages these into a dict.

* fix: make Group.members return a tuple of str, Array | Group pairs

* fix: revert changes to synchronization code; this is churn that we need to deal with

* fix: remove pre-emptive fetching from group.open

* fix: use removeprefix (removes a substring) instead of strip (removes any member of a set); comment out / avoid tests that cannot pass right now; don't consider implicit groups for v2; check if prefix is present in storage before opening for Group.getitem

* xfail v2 tests that are sure to fail; add delitem tests; partition xfailing tests into subtests

* fix: handle byte_range[0] being None

* fix: adjust test for localstore.get to check that get on nonexistent keys returns None; correctly create intermediate directories when preparing test data in test_local_store_get_partial

* fix: add zarr_format parameter to array creation routines (which raises if zarr_format is not 3), and xfail the tests that will hit this condition. add tests for create_group, create_array, and update_attributes methods of asyncgroup.

* test: add group init test

* feature(store): make list_* methods async generators (#110)

* fix: define utility for converting asyncarray to array, and similar for group, largely to appease mypy

* chore: remove checks that only existed because of implicit groups

* chore: clean up docstring and modernize some type hints

* chore: move imports to top-level

* remove fixture files

* remove commented imports

* remove explicit asyncio marks; use __eq__ method of LocalStore for test

* rename test_storage to test_store

* modern type hints

---------

Co-authored-by: Joe Hamman