JSON codec reshapes string arrays #76

Closed
jeromekelleher opened this issue Apr 27, 2018 · 2 comments
@jeromekelleher

This is carrying on from zarr-developers/zarr-python#258

I've tried to come up with a minimal example, but it's tricky to illustrate without showing the context. Here is an interaction with zarr with some instrumentation in the encode/decode methods for json.

import numcodecs
import zarr

z = zarr.empty(2, dtype=object, object_codec=numcodecs.JSON(), chunks=(1,))
z[0] = ["11"]
z[1] = ["1", "1"]

print(z[:]) # Borks

output:

INPUT: (1,)
INPUT: (1,)
OUTPUT: (1, 1)
OUTPUT: (1, 2)
Traceback (most recent call last):
  File "dev.py", line 34, in <module>
    print(z[:]) # Borks
  File "/home/jk/.local/lib/python3.5/site-packages/zarr/core.py", line 559, in __getitem__
    return self.get_basic_selection(selection, fields=fields)
  File "/home/jk/.local/lib/python3.5/site-packages/zarr/core.py", line 685, in get_basic_selection
    fields=fields)
  File "/home/jk/.local/lib/python3.5/site-packages/zarr/core.py", line 727, in _get_basic_selection_nd
    return self._get_selection(indexer=indexer, out=out, fields=fields)
  File "/home/jk/.local/lib/python3.5/site-packages/zarr/core.py", line 1015, in _get_selection
    drop_axes=indexer.drop_axes, fields=fields)
  File "/home/jk/.local/lib/python3.5/site-packages/zarr/core.py", line 1608, in _chunk_getitem
    chunk = self._decode_chunk(cdata)
  File "/home/jk/.local/lib/python3.5/site-packages/zarr/core.py", line 1751, in _decode_chunk
    chunk = chunk.reshape(self._chunks, order=self._order)
ValueError: cannot reshape array of size 2 into shape (1,)

The INPUT lines are the shapes of the input arrays to encode and the OUTPUT lines are the corresponding output shapes of the arrays from decode.

Problem description

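When calling numpy.array([["s1", "s2"], ["s3", "s4"]], dtype=object), numpy is quite aggressive about reshaping the array to store things more efficiently: the nested lists become a second dimension, so the result has shape (2, 2) rather than being a 1-D array of two lists.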
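The reshaping can be reproduced with numpy alone. The sketch below is only an illustration: the `rebuild` and `one_element_chunk` helpers are hypothetical, and `rebuild` assumes the codec turns a chunk into nested lists with `tolist()` and rebuilds it with `numpy.array()`, which may not match the real encode/decode code exactly.

```python
import numpy as np

# Equal-length nested lists are read as extra dimensions,
# even when dtype=object is requested.
a = np.array([["s1", "s2"], ["s3", "s4"]], dtype=object)
print(a.shape)  # (2, 2)


def one_element_chunk(value):
    """Build a 1-D object array of shape (1,) holding `value` as its only element."""
    chunk = np.empty(1, dtype=object)
    chunk[0] = value
    return chunk


def rebuild(chunk):
    """Hypothetical stand-in for a list-based round trip (assumption, not the codec itself)."""
    return np.array(chunk.tolist(), dtype=chunk.dtype)


first = rebuild(one_element_chunk(["11"]))
second = rebuild(one_element_chunk(["1", "1"]))
print(first.shape, second.shape)  # (1, 1) (1, 2)
```

Under that assumption the shapes line up with the INPUT and OUTPUT lines above: both chunks go in as (1,) and come back as (1, 1) and (1, 2), and the second one can no longer be reshaped to (1,).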

I've played around with this a fair bit, and I think the only options are to

  1. Drop the numpy dependency in the encoding and decoding steps for JSON (i.e., don't include the dtype in the JSON encoding), and provide the supplied argument directly to the JSON encoder (and conversely, directly return the value of json.loads() from decode).

  2. Also encode the input array shape in the JSON encoding.

Both of these options are ugly because they break backward compatibility. I'll make a PR for demonstrating option 2 in a minute for discussion.
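As a rough idea of what option 2 could look like, here is a minimal sketch for object arrays only. The function names and the payload layout (`items`, `dtype`, `shape`) are made up for illustration and are not the format used by numcodecs or by the PR.

```python
import json

import numpy as np


def encode_with_shape(arr):
    """Serialise an object array together with its dtype and shape (hypothetical format)."""
    payload = {
        "items": arr.ravel().tolist(),  # flat list; elements may themselves be lists
        "dtype": arr.dtype.str,
        "shape": list(arr.shape),
    }
    return json.dumps(payload)


def decode_with_shape(text):
    """Rebuild the array element by element so numpy cannot infer extra dimensions."""
    payload = json.loads(text)
    items = payload["items"]
    out = np.empty(len(items), dtype=payload["dtype"])
    for i, item in enumerate(items):
        out[i] = item  # a list is stored as a single object element
    return out.reshape(tuple(payload["shape"]))


chunk = np.empty(1, dtype=object)
chunk[0] = ["1", "1"]
restored = decode_with_shape(encode_with_shape(chunk))
print(restored.shape)  # (1,)
```

Filling a pre-allocated array one element at a time is what keeps the list intact here; passing the decoded list straight to `numpy.array()` would bring back the extra dimension.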

@alimanfoo

Thanks a lot for this. FWIW if there is no way to fix this without changing the encoded format then we can work through that, there's some information in the developer guide which is intended to cover this type of situation. Basically any change to the encoding format should be implemented via a new codec class with a new codec ID, so compatibility is preserved for any existing data using the old codec. But probably best to figure out the technical solution to the problem first, then deal with compatibility.

@alimanfoo alimanfoo added this to the 0.6.0 milestone May 13, 2018
@alimanfoo alimanfoo added the bug label May 13, 2018
@alimanfoo

Resolved via #77.
