
Fixing blosc encode error handling #81


Merged
merged 7 commits on Nov 29, 2018

Conversation

jeromekelleher
Member

WIP. Closes #80.

Some problems:

  1. Provoking the error requires making a 2G array, which CI providers might not like. I guess we'll have to see how this works out as it runs.
  2. More seriously, there is some problem with mutexes when blosc.use_threads is true. It looks like there is some library cleanup that should be performed after an error occurs?
  3. Currently the error message is written to stderr/stdout, which isn't ideal. There should be some mechanism for telling blosc not to write out error messages, and instead to retrieve the error message so we can put it into the exception.

TODO:

  • Unit tests and/or doctests in docstrings
  • tox -e py36 passes locally
  • tox -e py27 passes locally
  • Docstrings and API docs for any new/modified user-facing classes and functions
  • Changes documented in docs/release.rst
  • tox -e docs passes locally
  • AppVeyor and Travis CI passes
  • Test coverage to 100% (Coveralls passes)

@jeromekelleher
Member Author

Looks like CI providers are OK with making a 2GB array, so that's good. We'll need to take a bit of care on 32 bit builds on Windows, but this seems manageable.

The bigger issue is that there are hard limits on the buffer size for other compressors too (LZ4 does, at least; I haven't gone through the others). LZ4 fails correctly, but doesn't give any error message, which is poor from a user perspective.

I guess the only good way to do this is to have a max_buffer_size for each codec, and check this ourselves, raising an explanatory exception.
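A check like this could look roughly as follows. This is a minimal sketch; `Codec`, `max_buffer_size` and `_check_buffer_size` are illustrative names, not the actual numcodecs API:

```python
# Hypothetical sketch of a per-codec buffer size check. Each codec
# subclass would override max_buffer_size with the limit of its
# underlying compression library.

class Codec:
    # Conservative default; individual codecs override this.
    max_buffer_size = 2**31 - 1

    def _check_buffer_size(self, buf):
        # memoryview gives us the size in bytes of any buffer-like input.
        nbytes = memoryview(buf).nbytes
        if nbytes > self.max_buffer_size:
            raise ValueError(
                'codec does not support buffers of > %d bytes' %
                self.max_buffer_size)

    def encode(self, buf):
        self._check_buffer_size(buf)
        # ... hand off to the underlying compression library ...
        return bytes(buf)
```

The point is that the check happens in Python, before the C library ever sees the buffer, so we can raise an explanatory exception instead of a cryptic library error.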

@alimanfoo
Member

Thanks @jeromekelleher.

Adding a max_buffer_size for each codec and doing our own checks sounds like a practical way forward.

Re point (2), I haven't seen that before. @FrancescAlted, is there something we should be doing to clean up after a blosc internal error has occurred, when using blosc with multiple threads and global state (blosc_compress(), blosc_decompress())?

Re point (3), @FrancescAlted could we discuss how blosc reports errors and if there is a better way to enable applications like numcodecs to capture and propagate appropriately? Happy to raise an issue on the c-blosc repo if you think appropriate.

@FrancescAlted

Hi. I don't have that much experience with crash recovery (Blosc crashes very few times, if any, for me), but I'd say that a blosc_destroy() followed by blosc_init() would be enough for a 'recovery'.
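That destroy-then-init recovery could be wrapped around the compress call; a minimal sketch, with stand-in callables for blosc_destroy()/blosc_init() (assumed here to be exposed by the Python wrapper, e.g. as numcodecs.blosc.destroy() and numcodecs.blosc.init()):

```python
# Sketch of the recovery pattern described above: if a blosc call fails,
# tear down the library's global state and re-initialise it before
# re-raising, so that later calls start from a clean slate. The
# destroy/init callables stand in for blosc_destroy()/blosc_init().

def compress_with_recovery(compress, buf, destroy, init):
    try:
        return compress(buf)
    except RuntimeError:
        # Reset blosc's global state, then propagate the error.
        destroy()
        init()
        raise
```

In numcodecs this would presumably be wired up inside the Cython layer rather than in pure Python, but the ordering is the point: destroy, then init, then let the error propagate.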

Regarding the error messages: yes, currently a description (more or less accurate) is sent to stderr and a negative value is returned. Fixing this, while feasible, would require introducing a couple of APIs and a better catalog of the different errors (and messages) that can occur internally. While not difficult, this would take quite a bit of time, so I'd be happy if anybody were interested in doing a PR to fix it.

@jeromekelleher
Member Author

Thanks for the clarifications @FrancescAlted, this is very helpful.

I agree with @alimanfoo, in that I think the best way forward here is to keep the max_buffer_size for each codec and check for it. Other errors from blosc seem pretty unlikely, so let's not worry too much about how to generate error messages and so on.

@jeromekelleher
Member Author

ps. I'm happy to do the coding for this, but might let some other PRs get merged before taking it up again.

@alimanfoo
Member

Thanks @jeromekelleher, SGTM.

@alimanfoo added this to the 0.6.0 milestone on May 13, 2018
@alimanfoo
Member

Hi @jeromekelleher, would you be interested in taking this up again? We're pushing towards a release and would be great to include this.

@jakirkham
Member

jakirkham commented Nov 6, 2018

We're discussing possibly dropping Windows 32-bit in issue ( #97 ). That might free us from figuring out 32-bit errors here.

Edit: Though we could just xfail these new tests on Windows 32-bit.
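The xfail suggestion could be as simple as keying off the interpreter's pointer size. A sketch, assuming pytest; the test name is made up:

```python
import sys

# Detect a 32-bit interpreter via the pointer size: on 32-bit builds
# sys.maxsize is 2**31 - 1, on 64-bit builds it is 2**63 - 1.
PY_32BIT = sys.maxsize <= 2**32

# In a real test module this would decorate the large-buffer tests:
# @pytest.mark.xfail(PY_32BIT, reason='cannot allocate a 2 GiB buffer on 32-bit')
# def test_encode_large_buffer():
#     ...
```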

@jeromekelleher
Member Author

OK, I can take a look here @alimanfoo and see if there's something I can do cleanly fairly quickly. What's the timeline for the release?

@alimanfoo
Member

Thanks @jeromekelleher. No specific timeline but there's only this and a couple of other maintenance issues then I think would be at a nice point to release.

@jeromekelleher
Member Author

I've made a pass at handling this in a reasonably general way @alimanfoo. We could embed the check down in the C code or in the guts of each codec separately, which would be simpler and more efficient. However, it would be very difficult to test this. The testing hack I have here is a bit messy, but at least we get coverage on the actual mechanism. Having to create > 2GiB to provoke this will be a mess on CI, as we'll regularly have failures when the tests happen to run on a machine that's a bit more memory constrained or whatever.

If you like the approach, I'd need to add calls to self._check_buffer_size(buf) to all the other codecs as well and generalise the test case somehow. We should also find out what the limits for the other codecs are, but I bet 2GiB is a safe and sensible limit.

@jeromekelleher
Member Author

Ah, good ole Python 2. I'm sure this is fixable anyway if we like the general approach.

@alimanfoo
Member

Hi @jeromekelleher, I think the approach looks great, thanks.

I think it would be worth holding further work on this until we get either of #128 or #121 merged, those PRs are both attempts to simplify the handling and normalisation of different possible input types. Once one of those is merged, we could then rebase this PR, which would simplify a little and provide a consistent base to work from across other codecs. It would also enable a fix to the PY27 failures on travis about .nbytes being missing on memoryview, as we will be normalising all inputs to numpy arrays, so you can be sure an .nbytes property is present.

@jeromekelleher
Member Author

SGTM @alimanfoo, would you mind pinging me when the infrastructure is in place?

@alimanfoo
Member

@jeromekelleher will do, thanks.

@jakirkham
Member

Went ahead and merged master into this PR and updated it to use ensure_ndarray instead of memoryview to see if that works. Hope that is ok.

@alimanfoo changed the title from "First pass for fixing blosc encode error handling." to "Fixing blosc encode error handling" on Nov 28, 2018
@alimanfoo
Member

Thanks @jakirkham.

FWIW I think it would be OK to restrict this to just Blosc for the current PR. Playing around with some other codecs now, looks like at least Zlib can handle larger inputs, but no idea how big.

@alimanfoo
Member

alimanfoo commented Nov 28, 2018

For the record...

LZ4 has a max input buffer size of 0x7E000000 (somewhere around 2**31 - 40000000, not sure exactly where). A larger buffer gives RuntimeError: LZ4 compression error: 0. Probably worth adding a buffer size check and giving a better error.
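For reference, the limit quoted above can be pinned down exactly: lz4.h defines LZ4_MAX_INPUT_SIZE as 0x7E000000, which works out to 2**31 - 2**25, i.e. 2 GiB minus 32 MiB:

```python
# The LZ4 input limit from lz4.h, and its exact relationship to 2**31.
LZ4_MAX_INPUT_SIZE = 0x7E000000

print(LZ4_MAX_INPUT_SIZE)          # 2113929216
print(2**31 - 2**25)               # 2113929216
print(LZ4_MAX_INPUT_SIZE / 2**30)  # 1.96875 (GiB)
```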

Zstd, Zlib, LZMA and BZ2 all work for buffers larger than 2**31; I don't know what the max is.

@jeromekelleher
Member Author

Thanks for the update @jakirkham, this is much neater.

@alimanfoo, how about we have a default max_buffer_size of 2**63 - 1 (practically unlimited) and set the buffer sizes for LZ4 and Blosc appropriately? I think it is good to keep this in the ABC so that we can test the limit checking properly, otherwise it's a real mess trying to provoke the error conditions.

@alimanfoo
Member

alimanfoo commented Nov 28, 2018 via email

@jeromekelleher
Member Author

Cool. I'm happy to finish this one up, then. One quick question: how do I make a test that will run for all of the different encoders?

@alimanfoo
Member

Thanks @jeromekelleher. Currently, any tests to be run on multiple codecs are defined in numcodecs/tests/common.py, e.g., check_config() etc. To run them still needs an explicit test in the test module for each codec, e.g., each codec's test module has a test called test_config() which internally calls check_config().
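The pattern described above looks roughly like this (a sketch modelled on numcodecs/tests/common.py; the body of check_config is simplified for illustration and is not the real implementation):

```python
# Generic check, defined once in a shared module (e.g. tests/common.py)
# and reusable across all codecs. A codec in numcodecs carries a
# codec_id attribute and a get_config() method that includes it.

def check_config(codec):
    config = codec.get_config()
    assert config['id'] == codec.codec_id

# Each codec's test module then has an explicit test that calls it,
# e.g. in test_blosc.py:
#
# def test_config():
#     check_config(Blosc())
```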

@jeromekelleher
Member Author

I had a change of heart here @alimanfoo, and changed this to only check max_buffer_size in the LZ4 and Blosc codecs. The reasoning is that it seemed like a lot of code changes for no real benefit, since we can probably assume that any Python interfaces to these codecs are already checking for this sort of thing. Also, there's some overhead in calling check_ndarray over and over again, which also seems pointless. To keep everything in one place, I added a max_buffer_size argument to ensure_ndarray, which we use in these C codecs.

What do you think?

@jakirkham
Member

Maybe this check should live in ensure_contiguous_ndarray? That's where other similar checks are already handled.

@jeromekelleher
Member Author

Maybe this check should live in ensure_contiguous_ndarray? That's where other similar checks are already handled.

Good point --- presumably these codecs are requiring contiguity in the C code anyway, so we're not making any extra requirements. @alimanfoo?
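The contract being discussed could look something like this. A simplified sketch only; the real numcodecs helper works on numpy arrays, whereas this version uses a plain memoryview to keep the illustration self-contained:

```python
# Illustrative version of a contiguity-plus-size check: reject
# non-contiguous input, and optionally enforce a per-codec size limit,
# all before the buffer is handed to the C compression library.

def ensure_contiguous_buffer(buf, max_buffer_size=None):
    mem = memoryview(buf)
    if not mem.contiguous:
        raise ValueError('an array with contiguous memory is required')
    if max_buffer_size is not None and mem.nbytes > max_buffer_size:
        raise ValueError(
            'codec does not support buffers of > %d bytes' %
            max_buffer_size)
    return mem
```

Folding the size limit into the same helper that enforces contiguity keeps all input validation in one place, which is the design point being made above.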

@alimanfoo
Member

alimanfoo commented Nov 28, 2018 via email

Commit: "Reverted earlier changes to the ABC layer as converting to ndarray and checking buffer size seems pointless and inefficient."
@jeromekelleher
Member Author

OK, I think we're ready to go @alimanfoo and @jakirkham. I can squash down to one commit if you prefer (seemed impolite to nuke @jakirkham's commit!).

@alimanfoo
Member

alimanfoo commented Nov 29, 2018

Many thanks @jeromekelleher.

There is one small thing worth mentioning. Currently a call to ensure_contiguous_array() is being made down inside the Buffer convenience class here which is used inside the blosc, lz4 and zstd cython modules. If we add a call to ensure_contiguous_array() higher up within the codec's encode() and decode() methods, then doing so within the Buffer class becomes redundant.

Probably worth keeping things tight and avoiding duplicated calls. Suggest the simplest thing to do would be to remove the ensure_contiguous_array() call inside the Buffer class constructor. A call to ensure_contiguous_array() would need to be added inside the Zstd class encode() and decode() methods.

@alimanfoo
Member


Otherwise I think it's good to go.

@jeromekelleher
Member Author

OK, that's done @alimanfoo. There was a slight complication: when decoding into a buffer, the output also needed to be converted to a contiguous array, which Buffer had been doing automatically. I just put in explicit calls to fix it up, and it seems fine now.

@jeromekelleher
Member Author

Hmm, seems like some PY2 test specialisations aren't needed either now. Nice side effect I guess.

@alimanfoo
Member

Actually the PY2 test failures are an indicator that the vlen codecs also need to make use of ensure_contiguous_array() during decode. Admittedly the test failures are cryptic, and the actual tests that would expose the need for this are not currently there (e.g., what happens if you pass array.array on PY2 to vlen codec decode() methods; these do not expose the new-style buffer interface).

I think a decent solution to this would be to add ensure_contiguous_array() calls into the decode methods on the three vlen codec classes, right before instantiating Buffer, and leave the tests as they are.

@jeromekelleher
Member Author

I see, thanks @alimanfoo. That's done now.

Out of curiosity, is there some quirk in Cython that won't let you use superclasses? The decode methods look identical here and look like they could be refactored into a VlenCodec superclass easily enough.

@alimanfoo
Member

Thanks @jeromekelleher.

You're right about the code duplication, the decode methods are almost identical, except for one line where they construct an item to be placed in the output array. AFAIK super-classes are fine, and there is clearly an opportunity for refactoring here. Suggest we deal with that as a separate issue.
