Skip to content

serializing not a time (NaT) to metadata #3028

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
d-v-b opened this issue Apr 30, 2025 · 6 comments · May be fixed by #3031
Open

serializing not a time (NaT) to metadata #3028

d-v-b opened this issue Apr 30, 2025 · 6 comments · May be fixed by #3031

Comments

@d-v-b
Copy link
Contributor

d-v-b commented Apr 30, 2025

Numpy datetime and timedelta dtypes have a nan-like value called NaT. The NaT value internally represented by the 64 bit signed integer -9223372036854775808:

np.int64(-9223372036854775808).view('datetime64')
# np.datetime64('NaT')

The zarr v2 spec does not define a JSON-serializable fill value encoding for datetimes or timedeltas. In lieu of a spec, we can go by the implementation of zarr-python 2.x, which used the integer form of the datetime / timedelta for all fill value, including NaT:

# /// script
# requires-python = ">=3.11"
# dependencies = [
#     "zarr==2.18",
#     "numcodecs<0.16"
# ]
# ///


import zarr
import json
z = zarr.create((10,), dtype='>M8[s]', fill_value='NaT')
print(json.loads(z.store['.zarray'])['fill_value'])
# -9223372036854775808

But we have a property test that checks if the numpy NaT value is serialized to metadata as the string "NaT". This goes against the behavior of v2, according to which It should be serialized as an int.

@d-v-b
Copy link
Contributor Author

d-v-b commented May 1, 2025

@moradology I think this was added in #2802, was it intentional that we deviate from zarr-python 2 here?

@moradology
Copy link
Contributor

moradology commented May 1, 2025

It was intentional, though this was among the cases that I wasn't entirely certain about and which only came up after I implemented the property-based tests (it was one of the edge cases not handled as things were at the time). I'm afraid I wasn't sure of the best path here, but hopefully the explicitness of expression in the tests will still serve as a single point of reference on these kinds of questions going forward #2802 (comment)

Definitely reasonable to revert to the int-based representation to preserve compatibility with v2 though, if the slight improvement of representation is less important (as it seems to me is the case)

@d-v-b
Copy link
Contributor Author

d-v-b commented May 1, 2025

@moradology no worries, I didn't know until very recently what zarr-python 2 was actually doing in this case, and the real problem is that zarr-python's behavior was never documented properly.

I agree that the string literal "NaT" is a much better JSON serialization than an opaque integer, but I also think we should opt for backwards compatibility with data saved by zarr-python 2. I guess because we released #2802 already, there could be zarr v2 datetime data in the wild using "NaT" as a fill value, so we might have to decide between breaking things for those users vs supporting "NaT" as an alternative fill_value to the integer form (maybe only for reading). Not sure what the best course of action is here, besides taking whatever steps we need to be compatible with data saved by zarr-python 2

@moradology moradology linked a pull request May 1, 2025 that will close this issue
6 tasks
@moradology
Copy link
Contributor

OK, I've got a PR up that solves this problem by relaxing the deserialization behavior. Let me know what you think

@d-v-b
Copy link
Contributor Author

d-v-b commented May 2, 2025

good news @moradology -- I did some testing and it seems like zarr-python 2 works with either the int form of NaT or the string form. So I don't think #2802 introduced any compatibility issues.

Here's a script that demonstrates this:

# /// script
# requires-python = ">=3.11"
# dependencies = [
# "zarr==2.18",
# "numcodecs < 0.16"
# ]
# ///


import zarr
import json
import numpy as np

store = {}

meta_a = {
    'chunks': [2],
    'compressor': {'blocksize': 0, 'clevel': 5, 'cname': 'lz4', 'id': 'blosc', 'shuffle': 1},
    'dtype': '<M8[s]',
    'fill_value': -9223372036854775808,
    'filters': None,
    'order': 'C',
    'shape': [2],
    'zarr_format': 2
    }

# replace the integer form of nat with the string NaT
meta_b = meta_a | {"fill_value": "NaT"}

store['a/.zarray'] = json.dumps(meta_a).encode()
store['b/.zarray'] = json.dumps(meta_b).encode()

a = zarr.open(store, path='a', mode='a')
b = zarr.open(store, path='b', mode='a')

assert all(np.isnat(a[:]))
assert all(np.isnat(b[:]))

a[0] = np.datetime64(10, 's')
b[0] = np.datetime64(10, 's')

assert a[0] == b[0]

assert json.loads(store['a/.zarray'])['fill_value'] == meta_a['fill_value']
assert json.loads(store['b/.zarray'])["fill_value"] == meta_b['fill_value']
print('success')

@d-v-b
Copy link
Contributor Author

d-v-b commented May 2, 2025

we do still need to ensure that zarr-python 3 remains compatible with the int-encoded fill value, hence your PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants