Default compression settings #8
Comments
Thanks @mrocklin, nice thought. Do you think you know enough to be able to propose a concrete implementation of that function? Or would it need some discussion and/or input from others?
These were the defaults that I was using in castra. They came from some ad-hoc benchmarking on the NYC Taxi dataset. Mostly I found that, for floating point values, intense compression was of marginal value.

```python
import numpy as np
import bloscpack

def blosc_args(dt):
    # moderate compression + byte shuffle for ints and datetimes
    if np.issubdtype(dt, np.integer):
        return bloscpack.BloscArgs(dt.itemsize, clevel=3, shuffle=True)
    if np.issubdtype(dt, np.datetime64):
        return bloscpack.BloscArgs(dt.itemsize, clevel=3, shuffle=True)
    # light compression, no shuffle for floats
    if np.issubdtype(dt, np.floating):
        return bloscpack.BloscArgs(dt.itemsize, clevel=1, shuffle=False)
    # no opinion for other dtypes
    return None
```
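A quick way to sanity-check the dispatch above (assuming bloscpack is installed alongside numpy):

```python
# Uses blosc_args and the imports from the snippet above.
for dt in [np.dtype('i8'), np.dtype('M8[ns]'), np.dtype('f8'), np.dtype('S10')]:
    print(dt, blosc_args(dt))  # S10 falls through and returns None
```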
Do you apply this for all compressors or just blosclz?
Castra only used blosclz, I think.
I wouldn't take too much from that project. The general lesson learned was that compression was far more useful on ints/datetimes than on the floating point data I was examining at the time.
Thanks, it's well worth having this knowledge captured somewhere at least.
I don't feel I have enough experience with a range of datasets to be confident about defining good compression defaults based on dtype alone at the moment. In my limited experience a lot also depends on the correlation structure in the data, so patterns from one dataset may not generalise to another. I propose to close this issue for now and reopen it in future if clear recommendations emerge, supported by experience with a variety of data. If anyone finds this in the meantime, feel free to add thoughts on what the default compression settings should be. Defaults are currently fixed as the blosclz compressor with compression level 5 and the byte shuffle filter.
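For concreteness, a dtype-driven default along these lines might look like the sketch below, written against the numcodecs `Blosc` codec that zarr wraps; the `clevel`/`shuffle` choices simply mirror the castra numbers above and the fixed default just described, and are illustrative assumptions rather than measured recommendations.

```python
import numpy as np
from numcodecs import Blosc

def default_compressor(dtype):
    """Illustrative dtype -> Blosc mapping; levels are assumptions."""
    dtype = np.dtype(dtype)
    if np.issubdtype(dtype, np.integer) or np.issubdtype(dtype, np.datetime64):
        # moderate compression plus byte shuffle, per the castra numbers
        return Blosc(cname='blosclz', clevel=3, shuffle=Blosc.SHUFFLE)
    if np.issubdtype(dtype, np.floating):
        # light compression, no shuffle: floats gained little in castra
        return Blosc(cname='blosclz', clevel=1, shuffle=Blosc.NOSHUFFLE)
    # anything else: keep the current fixed default
    return Blosc(cname='blosclz', clevel=5, shuffle=Blosc.SHUFFLE)
```

Such a function could then feed array creation, e.g. `zarr.array(data, compressor=default_compressor(data.dtype))`.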
I wonder if @falted could jump in here with a few sentences about how he would set compression defaults knowing only the dtype.
A pretty obvious rule would be to use the bitshuffle filter instead of byte shuffle for single-byte dtypes.
Revisiting these issues I stumbled upon this (BTW, @falted is not my nickname on GitHub). I agree that making too many assumptions about compression parameters based on dtype is risky. In fact, even though the shuffle filter is active by default in Blosc, it is not unusual to find datasets that work better without it. Also, I'm +1 on Alistair's suggestion to activate the bitshuffle filter for single-byte dtypes (though I am not sure whether even that would be beneficial for string data).
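Whether shuffle helps is cheap to check per dataset; a minimal comparison of compressed sizes under each shuffle mode (assuming numcodecs is installed) might look like:

```python
import numpy as np
from numcodecs import Blosc

# Substitute your own array to see which shuffle setting actually wins.
data = np.random.randint(0, 1000, size=1_000_000, dtype='i8')

for name, shuffle in [('noshuffle', Blosc.NOSHUFFLE),
                      ('shuffle', Blosc.SHUFFLE),
                      ('bitshuffle', Blosc.BITSHUFFLE)]:
    codec = Blosc(cname='blosclz', clevel=5, shuffle=shuffle)
    ratio = len(codec.encode(data)) / data.nbytes
    print(f"{name}: compressed to {ratio:.3f} of original")
```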
Default compressors are being changed at the moment over at #2470 - I'd encourage folks to take a look there and chip in with any feedback.
I have noticed performance increases in other projects when I choose default compression settings based on dtype.
Optimal compression settings depend strongly on bit patterns, and dtypes are a strong signal for bit-pattern characteristics. For example, integers often benefit more from compression than floats, and datetimes are often nearly sorted and so benefit more from shuffle.
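To see why nearly-sorted datetimes respond so well to shuffle, consider the byte planes of one day of per-second timestamps (a sketch, assuming little-endian 8-byte values):

```python
import numpy as np

# One day of per-second timestamps: sorted, and consecutive values differ
# only in their low-order bytes.
ts = np.arange('2015-01-01', '2015-01-02', dtype='datetime64[s]')
raw = ts.view('u1').reshape(-1, 8)  # 8 bytes per value, little-endian

# Byte shuffle stores each byte position contiguously; the high-order
# byte planes here are nearly constant and compress to almost nothing.
for i in range(8):
    print(f"byte plane {i}: {len(np.unique(raw[:, i]))} distinct values")
```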
It might improve performance to change the compression defaults in `defaults.py` to come from a function that takes the dtype as an input.