
Fixing blosc encode error handling #81


Merged
merged 7 commits on Nov 29, 2018

Conversation

jeromekelleher
Member

WIP. Closes #80.

Some problems:

  1. Provoking the error requires making a 2G array, which CI providers might not like. I guess we'll have to see how this works out as it runs.
  2. More seriously, there is some problem with mutexes when blosc.use_threads is true. It looks like there is some library cleanup that should be performed after an error occurs?
  3. Currently the error message is written to stderr/stdout, which isn't ideal. There should be some mechanism for telling blosc not to write out error messages, and instead to retrieve the error message so we can put it into the exception.

TODO:

  • Unit tests and/or doctests in docstrings
  • tox -e py36 passes locally
  • tox -e py27 passes locally
  • Docstrings and API docs for any new/modified user-facing classes and functions
  • Changes documented in docs/release.rst
  • tox -e docs passes locally
  • AppVeyor and Travis CI passes
  • Test coverage to 100% (Coveralls passes)

@jeromekelleher
Member Author

Looks like CI providers are OK with making a 2GB array, so that's good. We'll need to take a bit of care on 32 bit builds on Windows, but this seems manageable.

The bigger issue is that there are hard limits on the buffer size for other compressors too (LZ4 does, at least; I haven't gone through the others). LZ4 fails correctly, but doesn't give any error message, which is poor from a user perspective.

I guess the only good way to do this is to have a max_buffer_size for each codec, and check this ourselves, raising an explanatory exception.
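A check like this could look roughly as follows. This is a minimal sketch; `Codec`, `max_buffer_size` and `_check_buffer_size` are illustrative names, not the actual numcodecs API:

```python
# Hypothetical sketch of a per-codec buffer size check. Each codec
# subclass would override max_buffer_size with the limit of its
# underlying compression library.

class Codec:
    # Conservative default; individual codecs override this.
    max_buffer_size = 2**31 - 1

    def _check_buffer_size(self, buf):
        # memoryview gives us the size in bytes of any buffer-like input.
        nbytes = memoryview(buf).nbytes
        if nbytes > self.max_buffer_size:
            raise ValueError(
                'codec does not support buffers of > %d bytes' %
                self.max_buffer_size)

    def encode(self, buf):
        self._check_buffer_size(buf)
        # ... hand off to the underlying compression library ...
        return bytes(buf)
```

The point is that the check happens in Python, before the C library ever sees the buffer, so we can raise an explanatory exception instead of a cryptic library error.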

@alimanfoo
Member

Thanks @jeromekelleher.

Adding a max_buffer_size for each codec and doing our own checks sounds like a practical way forward.

Re point (2), I haven't seen that before. @FrancescAlted, is there something we should be doing to clean up after a blosc internal error has occurred, when using blosc with multiple threads and global state (blosc_compress(), blosc_decompress())?

Re point (3), @FrancescAlted could we discuss how blosc reports errors and if there is a better way to enable applications like numcodecs to capture and propagate appropriately? Happy to raise an issue on the c-blosc repo if you think appropriate.

@FrancescAlted

Hi. I don't have that much experience with crash recovery (Blosc crashes very few times, if any, for me), but I'd say that a blosc_destroy() followed by blosc_init() would be enough for a 'recovery'.
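That destroy-then-init recovery could be wrapped around the compress call; a minimal sketch, with stand-in callables for blosc_destroy()/blosc_init() (assumed here to be exposed by the Python wrapper, e.g. as numcodecs.blosc.destroy() and numcodecs.blosc.init()):

```python
# Sketch of the recovery pattern described above: if a blosc call fails,
# tear down the library's global state and re-initialise it before
# re-raising, so that later calls start from a clean slate. The
# destroy/init callables stand in for blosc_destroy()/blosc_init().

def compress_with_recovery(compress, buf, destroy, init):
    try:
        return compress(buf)
    except RuntimeError:
        # Reset blosc's global state, then propagate the error.
        destroy()
        init()
        raise
```

In numcodecs this would presumably be wired up inside the Cython layer rather than in pure Python, but the ordering is the point: destroy, then init, then let the error propagate.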

Regarding the error messages: yes, currently a description (more or less accurate) is sent to stderr and a negative value is returned. Fixing this, while feasible, would require introducing a couple of APIs and a better catalog of the different errors (and messages) that can occur internally. While not difficult, this would take quite a bit of time, so I'd be happy if anybody were interested in doing a PR to fix it.

@jeromekelleher
Member Author

Thanks for the clarifications @FrancescAlted, this is very helpful.

I agree with @alimanfoo, in that I think the best way forward here is to keep the max_buffer_size for each codec and check for it. Other errors from blosc seem pretty unlikely, so let's not worry too much about how to generate error messages and so on.

@jeromekelleher
Member Author

ps. I'm happy to do the coding for this, but might let some other PRs get merged before taking it up again.

@alimanfoo
Member

Thanks @jeromekelleher, SGTM.

@alimanfoo added this to the 0.6.0 milestone on May 13, 2018
@alimanfoo
Member

Hi @jeromekelleher, would you be interested in taking this up again? We're pushing towards a release and would be great to include this.

@jakirkham
Member

jakirkham commented Nov 6, 2018

We're discussing possibly dropping Windows 32-bit in issue ( #97 ). That might free us from figuring out 32-bit errors here.

Edit: Though we could just xfail these new tests on Windows 32-bit.
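The xfail suggestion could be as simple as keying off the interpreter's pointer size. A sketch, assuming pytest; the test name is made up:

```python
import sys

# Detect a 32-bit interpreter via the pointer size: on 32-bit builds
# sys.maxsize is 2**31 - 1, on 64-bit builds it is 2**63 - 1.
PY_32BIT = sys.maxsize <= 2**32

# In a real test module this would decorate the large-buffer tests:
# @pytest.mark.xfail(PY_32BIT, reason='cannot allocate a 2 GiB buffer on 32-bit')
# def test_encode_large_buffer():
#     ...
```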

@jeromekelleher
Member Author

OK, I can take a look here @alimanfoo and see if there's something I can do cleanly fairly quickly. What's the timeline for the release?

@alimanfoo
Member

Thanks @jeromekelleher. No specific timeline but there's only this and a couple of other maintenance issues then I think would be at a nice point to release.

@jeromekelleher
Member Author

I've made a pass at handling this in a reasonably general way @alimanfoo. We could embed the check down in the C code or in the guts of each codec separately, which would be simpler and more efficient. However, it would be very difficult to test this. The testing hack I have here is a bit messy, but at least we get coverage on the actual mechanism. Having to create > 2GiB to provoke this will be a mess on CI, as we'll regularly have failures when the tests happen to run on a machine that's a bit more memory constrained or whatever.

If you like the approach, I'd need to add calls to self._check_buffer_size(buf) to all the other codecs as well and generalise the test case somehow. We should also find out what the limits for the other codecs are, but I bet 2GiB is a safe and sensible limit.

@jeromekelleher
Member Author

Ah, good ole Python 2. I'm sure this is fixable anyway if we like the general approach.

@alimanfoo
Member

Hi @jeromekelleher, I think the approach looks great, thanks.

I think it would be worth holding further work on this until we get either of #128 or #121 merged, those PRs are both attempts to simplify the handling and normalisation of different possible input types. Once one of those is merged, we could then rebase this PR, which would simplify a little and provide a consistent base to work from across other codecs. It would also enable a fix to the PY27 failures on travis about .nbytes being missing on memoryview, as we will be normalising all inputs to numpy arrays, so you can be sure an .nbytes property is present.

@jeromekelleher
Member Author

SGTM @alimanfoo, would you mind pinging me when the infrastructure is in place?

@alimanfoo
Member

@jeromekelleher will do, thanks.

@jakirkham
Member

Went ahead and merged master into this PR and updated it to use ensure_ndarray instead of memoryview to see if that works. Hope that is ok.

@alimanfoo changed the title from "First pass for fixing blosc encode error handling." to "Fixing blosc encode error handling" on Nov 28, 2018
@alimanfoo
Member

Thanks @jakirkham.

FWIW I think it would be OK to restrict this to just Blosc for the current PR. Playing around with some other codecs now, looks like at least Zlib can handle larger inputs, but no idea how big.

@alimanfoo
Member

alimanfoo commented Nov 28, 2018

For the record...

LZ4 has a max input buffer size of 0x7E000000 (somewhere around 2**31 - 40000000, not sure exactly where). A larger buffer gives RuntimeError: LZ4 compression error: 0. Probably worth adding a buffer size check and giving a better error.
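For reference, the limit quoted above can be pinned down exactly: lz4.h defines LZ4_MAX_INPUT_SIZE as 0x7E000000, which works out to 2**31 - 2**25, i.e. 2 GiB minus 32 MiB:

```python
# The LZ4 input limit from lz4.h, and its exact relationship to 2**31.
LZ4_MAX_INPUT_SIZE = 0x7E000000

print(LZ4_MAX_INPUT_SIZE)          # 2113929216
print(2**31 - 2**25)               # 2113929216
print(LZ4_MAX_INPUT_SIZE / 2**30)  # 1.96875 (GiB)
```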

Zstd, Zlib, LZMA and BZ2 all work for buffers larger than 2**31; I don't know what the max is.

@jeromekelleher
Member Author

Thanks for the update @jakirkham, this is much neater.

@alimanfoo, how about we have a default max_buffer_size of 2**63 - 1 (practically unlimited) and set the buffer sizes for LZ4 and Blosc appropriately? I think it is good to keep this in the ABC so that we can test the limit checking properly, otherwise it's a real mess trying to provoke the error conditions.

@alimanfoo
Member

alimanfoo commented Nov 28, 2018 via email

@jeromekelleher
Member Author

Cool. I'm happy to finish this one up, then. One quick question: how do I make a test that will run for all of the different encoders?

@alimanfoo
Member

Thanks @jeromekelleher. Currently, any tests to be run on multiple codecs are defined in numcodecs/tests/common.py, e.g., check_config() etc. To run them still needs an explicit test in the test module for each codec, e.g., each codec's test module has a test called test_config() which internally calls check_config().
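The pattern described above looks roughly like this (a sketch modelled on numcodecs/tests/common.py; the body of check_config is simplified for illustration and is not the real implementation):

```python
# Generic check, defined once in a shared module (e.g. tests/common.py)
# and reusable across all codecs. A codec in numcodecs carries a
# codec_id attribute and a get_config() method that includes it.

def check_config(codec):
    config = codec.get_config()
    assert config['id'] == codec.codec_id

# Each codec's test module then has an explicit test that calls it,
# e.g. in test_blosc.py:
#
# def test_config():
#     check_config(Blosc())
```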

@jeromekelleher
Member Author

I had a change of heart here @alimanfoo, and changed this to only check max_buffer_size in the LZ4 and Blosc codecs. The reasoning is that it seemed like a lot of code changes for no real benefit, since we can probably assume that any Python interfaces to these codecs are already checking for this sort of thing. Also, there's some overhead in calling check_ndarray over and over again, which also seems pointless. To keep everything in one place, I added a max_buffer_size argument to ensure_ndarray, which we use in these C codecs.

What do you think?

@jakirkham
Member

Maybe this check should live in ensure_contiguous_ndarray? That's where other similar checks are already handled.

@jeromekelleher
Member Author

Maybe this check should live in ensure_contiguous_ndarray? That's where other similar checks are already handled.

Good point --- presumably these codecs are requiring contiguity in the C code anyway, so we're not making any extra requirements. @alimanfoo?
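The contract being discussed could look something like this. A simplified sketch only; the real numcodecs helper works on numpy arrays, whereas this version uses a plain memoryview to keep the illustration self-contained:

```python
# Illustrative version of a contiguity-plus-size check: reject
# non-contiguous input, and optionally enforce a per-codec size limit,
# all before the buffer is handed to the C compression library.

def ensure_contiguous_buffer(buf, max_buffer_size=None):
    mem = memoryview(buf)
    if not mem.contiguous:
        raise ValueError('an array with contiguous memory is required')
    if max_buffer_size is not None and mem.nbytes > max_buffer_size:
        raise ValueError(
            'codec does not support buffers of > %d bytes' %
            max_buffer_size)
    return mem
```

Folding the size limit into the same helper that enforces contiguity keeps all input validation in one place, which is the design point being made above.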

@alimanfoo
Member

alimanfoo commented Nov 28, 2018 via email

Commit: "Reverted earlier changes to the ABC layer as converting to ndarray and checking buffer size seems pointless and inefficient."
@jeromekelleher
Member Author

OK, I think we're ready to go @alimanfoo and @jakirkham. I can squash down to one commit if you prefer (seemed impolite to nuke @jakirkham's commit!).

@alimanfoo
Member

alimanfoo commented Nov 29, 2018

Many thanks @jeromekelleher.

There is one small thing worth mentioning. Currently a call to ensure_contiguous_array() is being made down inside the Buffer convenience class here which is used inside the blosc, lz4 and zstd cython modules. If we add a call to ensure_contiguous_array() higher up within the codec's encode() and decode() methods, then doing so within the Buffer class becomes redundant.

Probably worth keeping things tight and avoiding duplicated calls. Suggest the simplest thing to do would be to remove the ensure_contiguous_array() call inside the Buffer class constructor. A call to ensure_contiguous_array() would need to be added inside the Zstd class encode() and decode() methods.

@alimanfoo
Member


Otherwise I think it's good to go.

@jeromekelleher
Member Author

OK, that's done @alimanfoo. There was a slight complication: when decoding into a buffer, the output also needed to be converted to a contiguous array, which Buffer had been doing automatically. I just put in explicit calls to fix it up, and it seems fine now.

@jeromekelleher
Member Author

Hmm, seems like some PY2 test specialisations aren't needed either now. Nice side effect I guess.

@alimanfoo
Member

Actually the PY2 test failures are an indicator that the vlen codecs also need to make use of ensure_contiguous_array() during decode. Admittedly the test failures are cryptic, and the actual tests that would expose the need for this are not currently there (e.g., what happens if you pass array.array on PY2 to vlen codec decode() methods; these do not expose the new-style buffer interface).

I think a decent solution to this would be to add ensure_contiguous_array() calls into the decode methods on the three vlen codec classes, right before instantiating Buffer, and leave the tests as they are.

@jeromekelleher
Member Author

I see, thanks @alimanfoo. That's done now.

Out of curiosity, is there some quirk in Cython that won't let you use superclasses? The decode methods look identical here and look like they could be refactored into a VlenCodec superclass easily enough.

@alimanfoo
Member

Thanks @jeromekelleher.

You're right about the code duplication, the decode methods are almost identical, except for one line where they construct an item to be placed in the output array. AFAIK super-classes are fine, and there is clearly an opportunity for refactoring here. Suggest we deal with that as a separate issue.
