Included array shape in JSON encoding. #77

Merged
alimanfoo merged 6 commits into zarr-developers:master from jeromekelleher:json-reshape
May 15, 2018

Conversation

jeromekelleher
Member

Issue #76

This PR is just for discussion.

@alimanfoo
Member

Thanks Jerome. Does this definitely resolve the problem in #76?

@jeromekelleher
Member Author

Thanks Jerome. Does this definitely resolve the problem in #76?

It solved it for me, anyway. I tried to come up with some test cases, but I couldn't figure out how to do it without calling np.array(data) first (which would already do the reshaping, and so make the test case irrelevant).

I think the problem exists more generally as well, as numpy will be quite aggressive about reshaping arrays when calling np.array(input_data) with other types of object arrays (not just arrays of strings).

@alimanfoo
Member

Yeah I'm guessing both JSON and MsgPack codecs will break when an object array contains lists or tuples or other sequence-like things because of this.

@jeromekelleher
Member Author

Just to update on this: I've monkey patched my local copy of the JSON class with this update, and it has definitely solved the problem for me. If you think that this solution is technically good @alimanfoo, I'm happy to help out with pushing through the implementation. I'll need some guidance on what needs doing though.

If there's another solution which doesn't involve breaking backward compatibility though, of course this would be better.

@alimanfoo
Member

Thanks Jerome, I've been trying to think of a clever way to do this but can't see a way around it. When converting object arrays to lists there is an ambiguity: without also knowing the shape it's not possible to correctly restore the original array. Just to illustrate for the record:

In [21]: a = np.empty((2, 2), dtype=object)

In [22]: a[0, 0] = 0

In [23]: a[0, 1] = 1

In [24]: a[1, 0] = 2

In [25]: a[1, 1] = 3

In [26]: a
Out[26]: 
array([[0, 1],
       [2, 3]], dtype=object)

In [27]: a.tolist()
Out[27]: [[0, 1], [2, 3]]

In [35]: b = np.empty(2, dtype=object)

In [36]: b[0] = [0, 1]

In [37]: b[1] = [2, 3]

In [38]: b
Out[38]: array([list([0, 1]), list([2, 3])], dtype=object)

In [39]: b.tolist()
Out[39]: [[0, 1], [2, 3]]

Your solution to append the shape to the list of items to be encoded seems good to me.

To take this forward it would be good to add some unit tests that expose this problem in the JSON and MsgPack codecs. It should be doable with some manually crafted arrays like the ones above; i.e., the array b, when passed through encode() and back through decode(), will end up with a different shape and size.
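
Something like the following rough sketch, say (the test name is just illustrative; it drives the current JSON codec directly):

import numpy as np
import numcodecs

def test_object_array_of_lists_round_trip():
    # An object array of equal-length lists changes shape when passed
    # through encode() and back through decode().
    a = np.empty(2, dtype=object)
    a[0] = [0, 1]
    a[1] = [2, 3]
    codec = numcodecs.JSON()
    b = codec.decode(codec.encode(a))
    # With the current codec the shape assertion fails: b comes back as (2, 2).
    assert a.shape == b.shape
    assert a.dtype == b.dtype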

The next thing to figure out is how to handle data format compatibility as gracefully as possible. Might be worth giving this some discussion, to work through the consequences.

The rule I set for numcodecs to maintain data format compatibility is that the encoding format associated with a given codec ID should never change. Codec IDs are associated with codec classes via the codec_id property, e.g., here. The codec ID is used by Zarr in the array metadata, and is the key used by the numcodecs registry, which Zarr uses to look up an implementation for a given codec. So if we change the codec format to append the shape, we also have to change the codec IDs, as these are effectively new codecs.

I had thought that for cases like this we could change the codec IDs to something like "json2" and "msgpack2". That is, the numcodecs library would provide implementations of new codecs "json2" and "msgpack2" which store the array shape in the encoded data, while continuing to provide implementations of the "json" and "msgpack" codecs so that old data that doesn't hit this problem can still be read, although we would somehow warn users not to use "json" or "msgpack" because these are broken for some cases. This is possibly a little ugly, but preserves format compatibility.
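
For reference, here is a minimal sketch of that lookup path using the existing "json" ID (under this option, "json2" and "msgpack2" would simply be registered alongside it):

import numcodecs

# The codec ID stored in the array metadata is the key into the numcodecs
# registry, which is how Zarr finds the implementation for a given codec.
codec = numcodecs.get_codec({'id': 'json'})
print(codec.codec_id)  # -> 'json'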

There is another approach we could consider here, because we know that the json and msgpack encodings are basically broken. We could change the implementation of the "json" and "msgpack" codecs as proposed, but keep the codec IDs unchanged. This would break data format compatibility in the backwards direction, in the sense that older versions of numcodecs would not be able to read data created by newer versions of the library with the "json" or "msgpack" codec. To avoid breaking compatibility in the forwards direction, i.e., to allow newer versions of numcodecs to continue reading data created by older versions, we could add some conditional logic inside decode() to check the type of the last item in the encoded list. If it's a string then the shape is missing, and we process using the old logic; if it's not a string, we assume it's the shape and process with the new logic. This breaks the codec ID rule, but may be better from the point of view of API simplicity (i.e., we can just keep a single implementation of the JSON codec).
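
A minimal sketch of that conditional decode logic, assuming for illustration a payload laid out as [item0, item1, ..., dtype_str] in the legacy format and [item0, item1, ..., dtype_str, shape] in the new format, with items flattened in C order (this is not the exact numcodecs wire format):

import json
import numpy as np

def decode(buf):
    items = json.loads(bytes(buf).decode('utf-8'))
    if isinstance(items[-1], str):
        # Legacy payload: the last entry is the dtype string and no shape is
        # stored, so fall back to the old (ambiguous) np.array() reconstruction.
        dtype = np.dtype(items.pop())
        return np.array(items, dtype=dtype)
    # New payload: the last entry is the shape, preceded by the dtype string.
    shape = tuple(items.pop())
    dtype = np.dtype(items.pop())
    arr = np.empty(shape, dtype=dtype)
    # Filling a pre-shaped array element by element sidesteps np.array()'s
    # aggressive reshaping of nested sequences.
    for idx, item in zip(np.ndindex(*shape), items):
        arr[idx] = item
    return arr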

Both solutions have pros and cons, would be interested to hear views.

cc @jakirkham

@jeromekelleher
Member Author

This all sounds sensible to me @alimanfoo. One further option to throw into the mix: make two new codecs Json and MessagePack which implement the new protocol, and just keep the existing JSON and MsgPack ones around as undocumented and deprecated. I think this is reasonably defensible for MessagePack since this is the official name of the protocol anyway. Json perhaps less so, depending on how strongly you feel about capitalisation...

@alimanfoo
Member

@jeromekelleher I think that's a variation on the first option. I.e., the first option says, numcodecs defines two codec formats for JSON, a legacy format registered under the "json" ID, and a new format registered under a new ID (e.g., "json2"). Under this option, numcodecs also provides two codec classes, one which implements the legacy "json" format, another which implements the new "json2" format. The next question is about API compatibility, i.e., what should these codec classes be called. Your suggestion is to have a class called Json implementing the new format, and keeping the old class called JSON implementing the old format. There are other choices we could make here, e.g., have a class called JSON implement the new format, and a class called something like LegacyJSON implementing the old format (and probably hidden in the API docs). These would have different consequences regarding how old code would run following a numcodecs upgrade. Not sure what is best yet, just trying to understand the options and consequences.

@jeromekelleher
Member Author

That's true @alimanfoo, it is a variant on the first option. The argument for renaming the classes implementing the JSON and MsgPack codecs to something approximately the same as the old classes is that:

  1. All old code will continue to work. People who currently use the MsgPack encoder without problems can continue to use it.
  2. In the case of MessagePack, new users will only ever see this codec documented, and won't need to know about MsgPack. I think renaming it to MessagePack is much better than MsgPack2 or whatever, as that would be confusing to new users of the API. (It's less clear for JSON -> Json.)

I don't think the IDs make any difference; we can call them json2, msgpack2 or whatever. Users will never see them, so they won't be confused by this. This strategy would give full forwards and backwards compatibility, which should be the aim I think.

@alimanfoo
Member

alimanfoo commented May 2, 2018 via email

@alimanfoo
Member

@jakirkham: @jeromekelleher and I just had a chat, we're going to go ahead with the proposal in this comment but let us know if you have any objections or concerns.

@jakirkham
Member

Sorry @alimanfoo. Have been a bit swamped of late.

The deprecation plan seems good.

Should we add warnings for the old formats? Migration steps for old to new formats? Anything else along these lines?

Would you like me to think about the implementation detail as well? Or will this discussion happen in a new (or revamped version of this) PR?

@alimanfoo
Member

alimanfoo commented May 3, 2018 via email

@jeromekelleher
Member Author

I've convinced myself that this isn't in fact a bug in numcodecs at all, and it's actually doing the right thing. If we provide numpy arrays as input, we always round-trip correctly. If we provide non-numpy inputs, then the result we get back is always equal to np.array(input_data), which is surely the correct behaviour. I've added some test cases to illustrate this here.

Great; no need for new codecs!

I think the answer to my problem in zarr is that we should force object arrays to always be 1D (but we should discuss that elsewhere).

@alimanfoo
Member

@jeromekelleher what about this case:

In [14]: a = np.empty(2, dtype=object)

In [15]: a[0] = [0, 1]

In [16]: a[1] = [3, 4]

In [17]: codec = numcodecs.JSON()

In [18]: b = codec.decode(codec.encode(a))

In [19]: a
Out[19]: array([list([0, 1]), list([3, 4])], dtype=object)

In [20]: b
Out[20]: 
array([[0, 1],
       [3, 4]], dtype=object)

In [21]: a.shape
Out[21]: (2,)

In [22]: b.shape
Out[22]: (2, 2)

@jeromekelleher
Member Author

OK, that clinches it; we definitely need new codecs. Thanks @alimanfoo, great test case!

@alimanfoo
Member

:-) I was trying to think of a way to avoid this upstream as you suggested, but this case of a 1D object array of lists (or tuples) where each list (or tuple) is the same length still causes a problem. Funnily enough, if the lists are not all the same length, round tripping works fine:

In [32]: a = np.empty(2, dtype=object)

In [33]: a[0] = [0, 1, 2]

In [34]: a[1] = [3, 4]

In [35]: a
Out[35]: array([list([0, 1, 2]), list([3, 4])], dtype=object)

In [36]: codec.decode(codec.encode(a))
Out[36]: array([list([0, 1, 2]), list([3, 4])], dtype=object)

So it is some special logic in np.array that we need to work around.
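
For the record, the plain numpy behaviour being worked around (nothing codec-specific):

import numpy as np

# Equal-length nested lists get merged into a 2-D object array...
print(np.array([[0, 1], [2, 3]], dtype=object).shape)     # (2, 2)
# ...whereas ragged lists are left as a 1-D object array of lists.
print(np.array([[0, 1, 2], [3, 4]], dtype=object).shape)  # (2,)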

@jeromekelleher
Member Author

I think this should do it for the JSON codec @alimanfoo; what do you think?

I've left the warnings in as a TODO as I'm not sure how you want to handle that. Also, I think you're in a better position than me in terms of documenting this; perhaps you could update this PR with the required docs changes? I'm happy to do the remaining legwork then for the MsgPack codec once we're all happy with the changes in place for JSON.

@alimanfoo
Member

Looks great. Happy to do the documentation, will push to this PR.

@alimanfoo
Member

We'll need to remember to add the new files generated by the backwards compatibility tests on the new codecs into this PR at some point, but can do that when coding work is done.

@jeromekelleher
Member Author

We'll need to remember to add the new files generated by the backwards compatibility tests on the new codecs into this PR at some point, but can do that when coding work is done.

I've just added the json2 files to this commit, as it seems like the logical place to put them.

@alimanfoo alimanfoo mentioned this pull request May 4, 2018
@alimanfoo
Member

Hi @jeromekelleher, I pushed some documentation on the JSON classes.

On reflection, I don't think we need to raise a deprecation warning. If someone has already used this codec to encode some data via a previous version of numcodecs, the chances are they haven't hit this issue, so no need to warn.

@alimanfoo alimanfoo added this to the 0.6.0 milestone May 13, 2018
@jeromekelleher
Member Author

Sounds good to me @alimanfoo. I'll apply the same changes for MsgPack and ping you back.

@alimanfoo
Member

alimanfoo commented May 14, 2018 via email

@jeromekelleher
Member Author

OK, that should do it I think. Over to you @alimanfoo!

@alimanfoo alimanfoo merged commit fc4196f into zarr-developers:master May 15, 2018
@alimanfoo
Member

Nice one, thanks @jeromekelleher.

@jeromekelleher jeromekelleher deleted the json-reshape branch May 15, 2018 11:59
@alimanfoo alimanfoo mentioned this pull request Nov 6, 2018