Included array shape in JSON encoding. #77

Merged
alimanfoo merged 6 commits into zarr-developers:master from jeromekelleher:json-reshape
May 15, 2018

Conversation

jeromekelleher
Member

Issue #76

This PR is just for discussion.

@alimanfoo
Member

Thanks Jerome. Does this definitely resolve the problem in #76?

@jeromekelleher
Member Author

Thanks Jerome. Does this definitely resolve the problem in #76?

It solved it for me, anyway. I tried to come up with some test cases, but I couldn't figure out how to do it without calling np.array(data) first (which would already do the reshaping, and so make the test case irrelevant).

I think the problem exists more generally as well, as numpy will be quite aggressive about reshaping arrays when calling np.array(input_data) with other types of object arrays (not just arrays of strings).

@alimanfoo
Member

Yeah I'm guessing both JSON and MsgPack codecs will break when an object array contains lists or tuples or other sequence-like things because of this.

@jeromekelleher
Member Author

Just to update on this: I've monkey patched my local copy of the JSON class with this update, and it has definitely solved the problem for me. If you think that this solution is technically good @alimanfoo, I'm happy to help out with pushing through the implementation. I'll need some guidance on what needs doing though.

If there's another solution which doesn't involve breaking backward compatibility though, of course this would be better.

@alimanfoo
Member

Thanks Jerome, I've been trying to think of a clever way to do this but can't see a way around it. When converting object arrays to lists there is an ambiguity: without also knowing the shape it's not possible to correctly restore the original array. Just to illustrate for the record:

In [21]: a = np.empty((2, 2), dtype=object)

In [22]: a[0, 0] = 0

In [23]: a[0, 1] = 1

In [24]: a[1, 0] = 2

In [25]: a[1, 1] = 3

In [26]: a
Out[26]: 
array([[0, 1],
       [2, 3]], dtype=object)

In [27]: a.tolist()
Out[27]: [[0, 1], [2, 3]]

In [35]: b = np.empty(2, dtype=object)

In [36]: b[0] = [0, 1]

In [37]: b[1] = [2, 3]

In [38]: b
Out[38]: array([list([0, 1]), list([2, 3])], dtype=object)

In [39]: b.tolist()
Out[39]: [[0, 1], [2, 3]]

Your solution to append the shape to the list of items to be encoded seems good to me.

To take this forward it would be good to add some unit tests that expose this problem in the JSON and MsgPack codecs. It should be doable with some manually crafted arrays like the ones above; i.e., the array b, when passed through encode() and back through decode(), will end up with a different shape and size.
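
Something like the following rough sketch, say (the test name is just illustrative; it drives the current JSON codec directly):

import numpy as np
import numcodecs

def test_object_array_of_lists_round_trip():
    # An object array of equal-length lists changes shape when passed
    # through encode() and back through decode().
    a = np.empty(2, dtype=object)
    a[0] = [0, 1]
    a[1] = [2, 3]
    codec = numcodecs.JSON()
    b = codec.decode(codec.encode(a))
    # With the current codec the shape assertion fails: b comes back as (2, 2).
    assert a.shape == b.shape
    assert a.dtype == b.dtype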

The next thing to figure out is how to handle data format compatibility as gracefully as possible. Might be worth giving this some discussion, to work through the consequences.

The rule I set for numcodecs to maintain data format compatibility is that the encoding format associated with a given codec ID should never change. Codec IDs are associated with codec classes via the codec_id property, e.g., here. The codec ID is used by Zarr in the array metadata, and is the key used by the numcodecs registry, which Zarr uses to look up an implementation for a given codec. So if we change the codec format to append the shape, we also have to change the codec IDs, as these are effectively new codecs.

I had thought that for cases like this we could change the codec IDs to something like "json2" and "msgpack2". That is, the numcodecs library would provide implementations of new codecs "json2" and "msgpack2" which store the array shape in the encoded data, while continuing to provide implementations of the "json" and "msgpack" codecs so that old data that doesn't hit this problem can still be read, although we would somehow warn users not to use "json" or "msgpack" because these are broken for some cases. This is possibly a little ugly, but preserves format compatibility.
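
For reference, here is a minimal sketch of that lookup path using the existing "json" ID (under this option, "json2" and "msgpack2" would simply be registered alongside it):

import numcodecs

# The codec ID stored in the array metadata is the key into the numcodecs
# registry, which is how Zarr finds the implementation for a given codec.
codec = numcodecs.get_codec({'id': 'json'})
print(codec.codec_id)  # -> 'json'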

There is another approach we could consider here, because we know that the json and msgpack encodings are basically broken. We could change the implementation of the "json" and "msgpack" codecs as proposed, but keep the codec IDs unchanged. This would break data format compatibility in the backwards direction, in the sense that older versions of numcodecs would not be able to read data created by newer versions of the library with the "json" or "msgpack" codec. To avoid breaking compatibility in the forwards direction, i.e., to allow newer versions of numcodecs to continue reading data created by older versions, we could add some conditional logic inside decode() to check the type of the last item in the encoded list. If it's a string then the shape is missing, and we process using the old logic; if it's not a string, we assume it's the shape and process with the new logic. This breaks the codec ID rule, but may be better from the point of view of API simplicity (i.e., we can just keep a single implementation of the JSON codec).
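
A minimal sketch of that conditional decode logic, assuming for illustration a payload laid out as [item0, item1, ..., dtype_str] in the legacy format and [item0, item1, ..., dtype_str, shape] in the new format, with items flattened in C order (this is not the exact numcodecs wire format):

import json
import numpy as np

def decode(buf):
    items = json.loads(bytes(buf).decode('utf-8'))
    if isinstance(items[-1], str):
        # Legacy payload: the last entry is the dtype string and no shape is
        # stored, so fall back to the old (ambiguous) np.array() reconstruction.
        dtype = np.dtype(items.pop())
        return np.array(items, dtype=dtype)
    # New payload: the last entry is the shape, preceded by the dtype string.
    shape = tuple(items.pop())
    dtype = np.dtype(items.pop())
    arr = np.empty(shape, dtype=dtype)
    # Filling a pre-shaped array element by element sidesteps np.array()'s
    # aggressive reshaping of nested sequences.
    for idx, item in zip(np.ndindex(*shape), items):
        arr[idx] = item
    return arr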

Both solutions have pros and cons, would be interested to hear views.

cc @jakirkham

@jeromekelleher
Member Author

This all sounds sensible to me @alimanfoo. One further option to throw into the mix: make two new codecs Json and MessagePack which implement the new protocol, and just keep the existing JSON and MsgPack ones around as undocumented and deprecated. I think this is reasonably defensible for MessagePack since this is the official name of the protocol anyway. Json perhaps less so, depending on how strongly you feel about capitalisation...

@alimanfoo
Member

@jeromekelleher I think that's a variation on the first option. I.e., the first option says, numcodecs defines two codec formats for JSON, a legacy format registered under the "json" ID, and a new format registered under a new ID (e.g., "json2"). Under this option, numcodecs also provides two codec classes, one which implements the legacy "json" format, another which implements the new "json2" format. The next question is about API compatibility, i.e., what should these codec classes be called. Your suggestion is to have a class called Json implementing the new format, and keeping the old class called JSON implementing the old format. There are other choices we could make here, e.g., have a class called JSON implement the new format, and a class called something like LegacyJSON implementing the old format (and probably hidden in the API docs). These would have different consequences regarding how old code would run following a numcodecs upgrade. Not sure what is best yet, just trying to understand the options and consequences.

@jeromekelleher
Member Author

That's true @alimanfoo, it is a variant on the first option. The argument for renaming the classes implementing the JSON and MsgPack codecs to something approximately the same as the old classes is that:

  1. All old code will continue to work. People who currently use the MsgPack encoder without problems can continue to use it.
  2. In the case of MessagePack, new users will only ever see this codec documented, and won't need to know about MsgPack. I think renaming it to MessagePack is much better than MsgPack2 or whatever, as that would be confusing to new users of the API. (It's less clear for JSON -> Json.)

I don't think the IDs make any difference; we can call them json2, msgpack2 or whatever. Users will never see them, so they won't be confused by this. This strategy would give full forwards and backwards compatibility, which should be the aim I think.

@alimanfoo
Member

alimanfoo commented May 2, 2018 via email

@alimanfoo
Member

@jakirkham: @jeromekelleher and I just had a chat, we're going to go ahead with the proposal in this comment but let us know if you have any objections or concerns.

@jakirkham
Member

Sorry @alimanfoo. Have been a bit swamped of late.

The deprecation plan seems good.

Should we add warnings for the old formats? Migration steps for old to new formats? Anything else along these lines?

Would you like me to think about the implementation detail as well? Or will this discussion happen in a new (or revamped version of this) PR?

@alimanfoo
Member

alimanfoo commented May 3, 2018 via email

@jeromekelleher
Member Author

I've convinced myself that this isn't in fact a bug in numcodecs at all, and it's actually doing the right thing. If we provide numpy arrays as input, we always round-trip correctly. If we provide non-numpy inputs, then the result we get back is always equal to np.array(input_data), which is surely the correct behaviour. I've added some test cases to illustrate this here.

Great; no need for new codecs!

I think the answer to my problem in zarr is that we should force object arrays to always be 1D (but we should discuss that elsewhere).

@alimanfoo
Member

@jeromekelleher what about this case:

In [14]: a = np.empty(2, dtype=object)

In [15]: a[0] = [0, 1]

In [16]: a[1] = [3, 4]

In [17]: codec = numcodecs.JSON()

In [18]: b = codec.decode(codec.encode(a))

In [19]: a
Out[19]: array([list([0, 1]), list([3, 4])], dtype=object)

In [20]: b
Out[20]: 
array([[0, 1],
       [3, 4]], dtype=object)

In [21]: a.shape
Out[21]: (2,)

In [22]: b.shape
Out[22]: (2, 2)

@jeromekelleher
Member Author

OK, that clinches it; we definitely need new codecs. Thanks @alimanfoo, great test case!

@alimanfoo
Member

:-) I was trying to think of a way to avoid this upstream as you suggested, but this case of a 1D object array of lists (or tuples) where each list (or tuple) is the same length still causes a problem. Funnily enough, if the lists are not all the same length, round tripping works fine:

In [32]: a = np.empty(2, dtype=object)

In [33]: a[0] = [0, 1, 2]

In [34]: a[1] = [3, 4]

In [35]: a
Out[35]: array([list([0, 1, 2]), list([3, 4])], dtype=object)

In [36]: codec.decode(codec.encode(a))
Out[36]: array([list([0, 1, 2]), list([3, 4])], dtype=object)

So it is some special logic in np.array that we need to work around.
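
For the record, the plain numpy behaviour being worked around (nothing codec-specific):

import numpy as np

# Equal-length nested lists get merged into a 2-D object array...
print(np.array([[0, 1], [2, 3]], dtype=object).shape)     # (2, 2)
# ...whereas ragged lists are left as a 1-D object array of lists.
print(np.array([[0, 1, 2], [3, 4]], dtype=object).shape)  # (2,)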

@jeromekelleher
Member Author

I think this should do it for the JSON codec @alimanfoo; what do you think?

I've left the warnings in as a TODO as I'm not sure how you want to handle that. Also, I think you're in a better position than me in terms of documenting this; perhaps you could update this PR with the required docs changes? I'm happy to do the remaining legwork then for the MsgPack codec once we're all happy with the changes in place for JSON.

@alimanfoo
Member

Looks great. Happy to do the documentation, will push to this PR.

@alimanfoo
Member

We'll need to remember to add the new files generated by the backwards compatibility tests on the new codecs into this PR at some point, but can do that when coding work is done.

@jeromekelleher
Member Author

We'll need to remember to add the new files generated by the backwards compatibility tests on the new codecs into this PR at some point, but can do that when coding work is done.

I've just added the json2 files to this commit, as it seems like the logical place to put them.

@alimanfoo alimanfoo mentioned this pull request May 4, 2018
@alimanfoo
Member

Hi @jeromekelleher, I pushed some documentation on the JSON classes.

On reflection, I don't think we need to raise a deprecation warning. If someone has already used this codec to encode some data via a previous version of numcodecs, the chances are they haven't hit this issue, so no need to warn.

@alimanfoo alimanfoo added this to the 0.6.0 milestone May 13, 2018
@jeromekelleher
Member Author

Sounds good to me @alimanfoo. I'll apply the same changes for MsgPack and ping you back.

@alimanfoo
Member

alimanfoo commented May 14, 2018 via email

@jeromekelleher
Member Author

OK, that should do it I think. Over to you @alimanfoo!

@alimanfoo alimanfoo merged commit fc4196f into zarr-developers:master May 15, 2018
@alimanfoo
Member

Nice one, thanks @jeromekelleher.

@jeromekelleher jeromekelleher deleted the json-reshape branch May 15, 2018 11:59
@alimanfoo alimanfoo mentioned this pull request Nov 6, 2018