Included array shape in JSON encoding. #77
Conversation
Thanks Jerome. Does this definitely resolve the problem in #76?
It solved it for me anyway. I tried to come up with some test cases, but I couldn't figure out how to do it without calling. I think the problem exists more generally as well, as numpy will be quite aggressive about reshaping arrays when calling np.array(input_data) with other types of object arrays (not just arrays of strings).
Yeah, I'm guessing both the JSON and MsgPack codecs will break when an object array contains lists, tuples or other sequence-like things, because of this.
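For context, a small editorial illustration (not code from this PR) of the numpy behaviour being referred to:

import numpy as np

# A decoded list of equal-length lists is reshaped by np.array() into a
# 2-D object array rather than a 1-D array containing two list objects.
items = [[0, 1], [2, 3]]
a = np.array(items, dtype=object)
print(a.shape)   # (2, 2) -- the original 1-D array of lists is lost

# Restoring the intended 1-D array requires element-wise assignment:
b = np.empty(2, dtype=object)
b[0], b[1] = items
print(b.shape)   # (2,)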
Just to update on this: I've monkey-patched my local copy of the JSON class with this update, and it has definitely solved the problem for me. If you think this solution is technically good @alimanfoo, I'm happy to help out with pushing through the implementation. I'll need some guidance on what needs doing though. If there's another solution which doesn't involve breaking backward compatibility, of course that would be better.
Thanks Jerome, I've been trying to think of a clever way to do this but can't see a way around it. When converting object arrays to lists there is an ambiguity: without also knowing the shape, it's not possible to correctly restore the original array. Just to illustrate for the record:

In [21]: a = np.empty((2, 2), dtype=object)
In [22]: a[0, 0] = 0
In [23]: a[0, 1] = 1
In [24]: a[1, 0] = 2
In [25]: a[1, 1] = 3
In [26]: a
Out[26]:
array([[0, 1],
       [2, 3]], dtype=object)
In [27]: a.tolist()
Out[27]: [[0, 1], [2, 3]]

In [35]: b = np.empty(2, dtype=object)
In [36]: b[0] = [0, 1]
In [37]: b[1] = [2, 3]
In [38]: b
Out[38]: array([list([0, 1]), list([2, 3])], dtype=object)
In [39]: b.tolist()
Out[39]: [[0, 1], [2, 3]]

Your solution to append the shape to the list of items to be encoded seems good to me. To take this forward it would be good to add some unit tests that expose this problem in the JSON and MsgPack codecs. It should be doable with some manually crafted arrays like the ones above.

The next thing to figure out is how to handle data format compatibility as gracefully as possible. Might be worth giving this some discussion, to work through the consequences.

The rule I set for numcodecs to maintain data format compatibility is that the encoding format associated with a given codec ID should never change. Codec IDs are associated with codec classes via the codec_id property, e.g., here. The codec ID is used by Zarr in the array metadata, and is the key used by numcodecs' registry, which Zarr uses to look up an implementation for a given codec. So if we change the codec format to append the shape, we also have to modify the codec IDs, as these are effectively new codecs.

I had thought that for cases like this we could change the codec IDs to something like "json2" and "msgpack2". That is, the numcodecs library would provide implementations of new codecs "json2" and "msgpack2" which store the array shape in the encoded data, as well as continuing to provide implementations of the "json" and "msgpack" codecs, so old data that doesn't encounter this problem could still be read, although we would somehow warn users not to use "json" or "msgpack" because these are broken for some cases. This is possibly a little ugly, but preserves format compatibility.

There is another approach we could consider here, because we know that the json and msgpack encodings are basically broken. We could change the implementation of the "json" and "msgpack" codecs as proposed, but keep the codec IDs unchanged. This would break data format compatibility in the backwards direction, in the sense that older versions of numcodecs would not be able to read data created by newer versions of the library with the "json" or "msgpack" codec. To avoid breaking compatibility in the forwards direction, i.e., to allow newer versions of numcodecs to continue reading data created by older versions, we could add some conditional logic inside decode() to check the type of the last item in the encoded list. If it's a string, then you know the shape is missing, and you attempt to process using the old logic. If it's not a string, assume it's the shape and process with the new logic. This breaks the codec ID rule, but may be better from the point of view of API simplicity (i.e., we can just keep a single implementation of the JSON codec).

Both solutions have pros and cons, would be interested to hear views. cc @jakirkham
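To make the two ideas above concrete (appending the shape on encode, and checking the type of the last encoded item on decode), here is a minimal editorial sketch. It is not the numcodecs implementation; the function names and the exact encoded layout are assumptions.

import json
import numpy as np

def encode_with_shape(arr):
    # Sketch: flatten the items, then append the dtype string and the shape
    # so that decode can restore the original array unambiguously.
    items = arr.ravel().tolist()
    items.append(arr.dtype.str)
    items.append(list(arr.shape))
    return json.dumps(items).encode('utf-8')

def decode_with_shape(buf):
    items = json.loads(buf.decode('utf-8'))
    if isinstance(items[-1], str):
        # Legacy layout: the last item is the dtype string and no shape was
        # stored, so fall back to the old behaviour.
        dtype = np.dtype(items.pop())
        return np.array(items, dtype=dtype)
    # New layout: the last item is the shape, the one before it the dtype.
    shape = items.pop()
    dtype = np.dtype(items.pop())
    out = np.empty(len(items), dtype=dtype)
    for i, item in enumerate(items):
        out[i] = item  # element-wise fill avoids np.array's reshaping
    return out.reshape(shape)

With this sketch, round-tripping the 1-D object array of lists from the example above would give back shape (2,) rather than (2, 2).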
This all sounds sensible to me @alimanfoo. One further option to throw into the mix: make two new codecs
@jeromekelleher I think that's a variation on the first option. I.e., the first option says numcodecs defines two codec formats for JSON: a legacy format registered under the "json" ID, and a new format registered under a new ID (e.g., "json2"). Under this option, numcodecs also provides two codec classes, one which implements the legacy "json" format and another which implements the new "json2" format. The next question is about API compatibility, i.e., what these codec classes should be called. Your suggestion is to have a class called
That's true @alimanfoo, it is a variant on the first option. The argument for renaming the classes implementing the JSON and MsgPack codecs to something approximately the same as the old classes is that:
1. All old code will continue to work. People who currently use the MsgPack encoder without problems can continue to use it.
2. In the case of MessagePack, new users will only ever see this codec documented, and don't need to know about MsgPack. I think renaming it to MessagePack is much better than MsgPack2 or whatever, as this will be confusing to new users of the API. (It's less clear for JSON -> Json.)
I don't think the IDs make any difference; we can call them json2, msgpack2 or whatever. Users will never see them, so they won't be confused by this. This strategy would give full forwards and backwards compatibility, which should be the aim I think.
What about the following...
- Rename the JSON codec class to LegacyJSON. In the registry, map the codec ID "json" to the LegacyJSON class. (A rough sketch of the registry mapping follows after this comment.)
- Add a new JSON codec class which implements the new encoding format. In the registry, map "json2" to this new class.
- After upgrading numcodecs, any previously written code using the JSON codec will continue to work without modification. Any data previously written with the "json" codec can be read. Any new data written after the upgrade will be written with the "json2" codec, so all potential consumers of the data would also need to upgrade numcodecs, otherwise they will get an error about the "json2" codec not being found. We'd need to clearly document all of this, of course.
- Do something analogous for MsgPack.
I think I prefer this to, e.g., keeping the JSON class implementing the "json" format and adding a new class with a new name implementing the "json2" format, because it requires no code change, just a package upgrade. The "json" encoding is basically broken, so we want people to start using the new codec.
I would also be a bit concerned about having classes named like "JSON" and "Json" in the API at the same time, even if the documentation is clear; the names are very similar and it would be easy to get confused about which is the newer one and which is the deprecated older one. Similar for "MessagePack" and "MsgPack", even if "MessagePack" is a better name - we should have called it that to start with.
Very happy to discuss though.
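As an editorial illustration of the registry mapping proposed above (a sketch only, with class bodies elided; the final layout in numcodecs may differ):

from numcodecs.abc import Codec
from numcodecs.registry import register_codec

class LegacyJSON(Codec):
    # Implements the old encoding format (no shape stored); kept so that
    # previously written data remains readable.
    codec_id = 'json'
    # encode()/decode() elided ...

class JSON(Codec):
    # Implements the new encoding format, which also stores the array shape.
    codec_id = 'json2'
    # encode()/decode() elided ...

register_codec(LegacyJSON)  # looked up via ID 'json'
register_codec(JSON)        # looked up via ID 'json2'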
@jakirkham: @jeromekelleher and I just had a chat; we're going to go ahead with the proposal in this comment, but let us know if you have any objections or concerns.
Sorry @alimanfoo. Have been a bit swamped of late. The deprecation plan seems good. Should we add warnings for the old formats? Migration steps for old to new formats? Anything else along these lines? Would you like me to think about the implementation detail as well? Or will this discussion happen in a new (or revamped version of this) PR?
@jakirkham no problem at all, thanks for taking a look.

A deprecation warning would probably be good. The best way to implement it might need a little thought. We definitely want to warn if someone tries to write new data with the old codec, since we want them to switch to the new one. I guess the easiest place to raise a DeprecationWarning would be in the LegacyJSON constructor. Hopefully that wouldn't be too noisy.

Re migration, I think it should be enough to just upgrade numcodecs. Anyone who previously used these codecs probably did not hit this issue, so there's no need to migrate any previously written data. Upgrading numcodecs will just ensure any new data is written with the new codec.

I think @jeromekelleher will push to this PR. Would be good to have your review when the proposed solution is implemented.
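If the constructor warning were implemented, it might look roughly like the following editorial sketch (the message wording is invented, and as noted later in the thread the warning was ultimately judged unnecessary):

import warnings
from numcodecs.abc import Codec

class LegacyJSON(Codec):
    codec_id = 'json'
    # encode()/decode() keeping the old, shape-less format are elided ...

    def __init__(self):
        # Nudge anyone constructing the legacy codec for new data towards
        # the new 'json2' codec.
        warnings.warn(
            "The legacy 'json' encoding cannot round-trip some object "
            "arrays; consider the new JSON codec ('json2') for new data.",
            DeprecationWarning,
            stacklevel=2,
        )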
(force-pushed: 59e21b3 to ddd4d8a)
I've convinced myself that this isn't a bug in numcodecs at all, in fact; it's actually doing the right thing. If we provide numpy arrays as input, we always round-trip correctly. If we provide non-numpy inputs, then the result we get back is always equal to
Great; no need for new codecs! I think the answer to my problem in zarr is that we should force object arrays to always be 1D (but we should discuss that elsewhere).
(force-pushed: ddd4d8a to 8cc7bc3)
@jeromekelleher what about this case:

In [14]: a = np.empty(2, dtype=object)
In [15]: a[0] = [0, 1]
In [16]: a[1] = [3, 4]
In [17]: codec = numcodecs.JSON()
In [18]: b = codec.decode(codec.encode(a))
In [19]: a
Out[19]: array([list([0, 1]), list([3, 4])], dtype=object)
In [20]: b
Out[20]:
array([[0, 1],
       [3, 4]], dtype=object)
In [21]: a.shape
Out[21]: (2,)
In [22]: b.shape
Out[22]: (2, 2)
OK, that clinches it; we definitely need new codecs. Thanks @alimanfoo, great test case!
:-) I was trying to think of a way to avoid this upstream as you suggested, but this case of a 1D object array of lists (or tuples), where each list (or tuple) is the same length, still causes a problem. Funnily enough, if the lists are not all the same length, round-tripping works fine:

In [32]: a = np.empty(2, dtype=object)
In [33]: a[0] = [0, 1, 2]
In [34]: a[1] = [3, 4]
In [35]: a
Out[35]: array([list([0, 1, 2]), list([3, 4])], dtype=object)
In [36]: codec.decode(codec.encode(a))
Out[36]: array([list([0, 1, 2]), list([3, 4])], dtype=object)

So it is some special logic in np.array that we need to work around.
I think this should do it for the JSON codec @alimanfoo; what do you think? I've left the warnings in as a TODO as I'm not sure how you want to handle that. Also, I think you're in a better position than me in terms of documenting this; perhaps you could update this PR with the required docs changes? I'm happy to do the remaining legwork then for the MsgPack codec once we're all happy with the changes in place for JSON.
Looks great. Happy to do the documentation, will push to this PR.
We'll need to remember to add the new files generated by the backwards compatibility tests on the new codecs into this PR at some point, but we can do that when the coding work is done.
(force-pushed: e0432ca to 1aeb277)
I've just added the json2 files to this commit, as it seems like the logical place to put them.
Hi @jeromekelleher, I pushed some documentation on the JSON classes. On reflection, I don't think we need to raise a deprecation warning. If someone has already used this codec to encode some data via a previous version of numcodecs, the chances are they haven't hit this issue, so no need to warn.
Sounds good to me @alimanfoo. I'll apply the same changes for MsgPack and ping you back.
Awesome, thanks.
(force-pushed: 67ff0fd to 71e8b3a)
OK, that should do it I think. Over to you @alimanfoo!
Nice one, thanks @jeromekelleher.
Issue #76
This PR is just for discussion.