Skip to content

Conversation

d-v-b
Copy link
Contributor

@d-v-b d-v-b commented Jun 25, 2025

this PR fixes the issues reported in #3167.

There were two problems driving that issue:

  • zarr python was not using the string "str" to denote a variable length string data type. Instead, in main we use NumPy's behavior and map "str" to a fixed-length UTF-32 string, which is not a useful data type for most people.

    In this PR, when a user requests dtype=str or dtype="str" or dtype="string" they get a VariableLengthUTF8 dtype instead of the nearly-useless U dtype.

  • zarr python could create data types for length-0 scalars. Since these data types cannot be associated with arrays that contain any values, they are useless. See disallow 0-length fixed-size data types #3168. This is a quirk of NumPy that we don't need to support. This PR contains logic to ensure that we don't create data types for length-0 scalars.

note that while debugging this, I found that zarr-python 2.x could create arrays with where the dtype in metadata indicates length 0, but the actual array uses a dtype with size 1. We will need some support for this, but I don't think we need that in this PR.

@github-actions github-actions bot added the needs release notes Automatically applied to PRs which haven't added release notes label Jun 25, 2025
@d-v-b d-v-b requested a review from rabernat June 25, 2025 17:26
@d-v-b
Copy link
Contributor Author

d-v-b commented Jun 25, 2025

another important change: I altered the zarr v3 name of the VariableLengthUTF8 dtype to be "string" instead of "variable_length_utf8". This should resolve compatibility issues with zarr-python 3.0.8.

@github-actions github-actions bot removed the needs release notes Automatically applied to PRs which haven't added release notes label Jun 25, 2025
scalar_v2_params = ((FixedLengthUTF32(length=1), ""), (FixedLengthUTF32(length=2), "hi"))
scalar_v3_params = (
(FixedLengthUTF32(length=0), ""),
(FixedLengthUTF32(length=1), ""),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line leaves me a little confused on what the intended behavior for FIxedLengthString with an empty fill value is. or in general does the fill vlaue of a fixed length string need to be the same length?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going with numpy's semantics, which assigns the empty string to a sequence of null bytes:

>>> np.array([''], dtype="U1").tobytes()
b'\x00\x00\x00\x00'

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a situation where the array data type and the scalar data type (the thing you get when you index that array) are not the same:

>>> np.dtype('U0').type('a').dtype
dtype('<U1')
>>> np.dtype('U0').type('').dtype 
dtype('<U')
>>> np.array([''], dtype="U0").dtype   
dtype('<U1')
>>> np.array([''], dtype="U0")[0].dtype
dtype('<U')

Copy link
Contributor

@ianhi ianhi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good from me! Seems to fix the issue I was seeing with Xarray.

@d-v-b
Copy link
Contributor Author

d-v-b commented Jun 26, 2025

for posterity, numpy does permit the creation of arrays with a "V0" data type, i.e. raw bytes where each element has length 0:

>>> np.array([b''], dtype="V0").dtype
dtype('V')
>>> np.array([b''], dtype="V0").tobytes()
b''
>>> np.array([b'',b'',b'',b''], dtype="V0").tobytes()
b''

Arrays with this data type are all the empty byte string b"". I don't think this is meaningful for zarr.

The changes in this PR will make this data type unrepresentable in zarr python. I hope we are OK with this.

@d-v-b d-v-b requested a review from jhamman June 26, 2025 14:59
@d-v-b d-v-b merged commit bbdefac into zarr-developers:main Jun 27, 2025
55 of 70 checks passed
@d-v-b d-v-b added this to the 3.1.0 milestone Jun 30, 2025
Copy link

codecov bot commented Sep 25, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 94.47%. Comparing base (27615fd) to head (1ccfdd9).
⚠️ Report is 130 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3170      +/-   ##
==========================================
- Coverage   94.48%   94.47%   -0.01%     
==========================================
  Files          78       78              
  Lines        8590     8604      +14     
==========================================
+ Hits         8116     8129      +13     
- Misses        474      475       +1     
Files with missing lines Coverage Δ
src/zarr/core/dtype/__init__.py 98.14% <100.00%> (+0.10%) ⬆️
src/zarr/core/dtype/npy/bytes.py 96.87% <100.00%> (+0.10%) ⬆️
src/zarr/core/dtype/npy/string.py 97.05% <100.00%> (+0.04%) ⬆️
src/zarr/core/dtype/npy/structured.py 96.59% <100.00%> (+0.12%) ⬆️

... and 1 file with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants