Default compression settings #8

Open

mrocklin opened this issue Dec 23, 2015 · 11 comments

@mrocklin
Contributor

I have noticed performance increases in other projects when I choose default compression settings based on dtype.

Optimal compression settings depend strongly on bit patterns, and data types are a strong signal of bit-pattern characteristics. For example, integers often benefit more from compression than floats, and datetimes are often nearly sorted and so benefit more from the shuffle filter.

It might improve performance to change the compression defaults in defaults.py to come from a function that takes the dtype as an input.
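For concreteness, here's a minimal sketch of what such a function could look like, returning a plain dict of settings (the function name, return format, and clevel values are illustrative assumptions, not a proposal for zarr's actual API):

import numpy as np

def default_compression(dtype):
    # Illustrative only: map a dtype to default compression settings.
    dtype = np.dtype(dtype)
    if np.issubdtype(dtype, np.integer) or np.issubdtype(dtype, np.datetime64):
        # Ints and datetimes tend to compress well; shuffle helps nearly-sorted data.
        return {'clevel': 5, 'shuffle': True}
    if np.issubdtype(dtype, np.floating):
        # Heavy compression of floating-point noise rarely pays off.
        return {'clevel': 1, 'shuffle': False}
    # Fall back to general-purpose defaults for everything else.
    return {'clevel': 5, 'shuffle': True}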

@alimanfoo
Member

Thanks @mrocklin, nice thought. Do you think you know enough to be able to propose a concrete implementation of that function? Or would it need some discussion and/or input from others?

@mrocklin
Contributor Author

These were the defaults that I was using in castra. They came from some ad-hoc benchmarking on the NYCTaxi dataset. Mostly I found that, for floating point values, intense compression was of marginal value.

import numpy as np
import bloscpack

def blosc_args(dt):
    # Ints and datetimes compressed well in the NYC Taxi benchmarks.
    if np.issubdtype(dt, np.integer):
        return bloscpack.BloscArgs(dt.itemsize, clevel=3, shuffle=True)
    if np.issubdtype(dt, np.datetime64):
        return bloscpack.BloscArgs(dt.itemsize, clevel=3, shuffle=True)
    # Floats saw little benefit from heavier compression, so stay cheap.
    if np.issubdtype(dt, np.floating):
        return bloscpack.BloscArgs(dt.itemsize, clevel=1, shuffle=False)
    return None

@alimanfoo
Member

Do you apply this for all compressors or just blosclz?

@mrocklin
Contributor Author

Castra only used blosclz, I think.

@mrocklin
Contributor Author

I wouldn't take too much from that project. The general lesson was that compression was far more useful on ints/datetimes than on the floating-point data I was working with at the time.

@alimanfoo
Member

Thanks, it's well worth having this knowledge captured somewhere, even if only in documentation. I have other snippets like this: for (integer) genotype data, zlib level 1 gets you good compression at good speed, while increasing the compression level above that adds very little and slows things down a lot. This may not generalise to other datasets, of course.
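A quick way to sanity-check that kind of claim on synthetic data; the skewed small-integer array below is my own stand-in for genotype calls, not a real dataset:

import time
import zlib
import numpy as np

# Synthetic stand-in for genotype calls: single-byte ints, heavily skewed to 0.
rng = np.random.default_rng(0)
data = rng.choice(np.array([0, 0, 0, 0, 1, 2], dtype='i1'), size=10_000_000).tobytes()

for level in (1, 3, 6, 9):
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    print(f"level {level}: {len(data) / len(compressed):.1f}x ratio in {elapsed:.2f}s")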

@alimanfoo changed the title from "Change compression default based on datatype" to "Default compression settings" on Apr 11, 2016
@alimanfoo
Member

I don't feel I have enough experience with a range of datasets to be confident about defining good compression defaults based on dtype alone at the moment. In my limited experience, a lot also depends on the correlation structure in the data, so patterns from one dataset may not generalise to another.

I propose to close this issue for now, but reopen it in future if clear recommendations emerge, supported by experience with a variety of data.

If anyone finds this in the meantime, feel free to add thoughts on what the default compression settings should be. The defaults are currently fixed: the blosclz compressor with compression level 5 and the byte shuffle filter.
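For anyone who would rather set these explicitly than rely on the defaults, here's a sketch against the zarr v2 / numcodecs API (treat the exact spelling as an assumption; newer zarr versions configure compressors differently):

import numpy as np
import zarr
from numcodecs import Blosc

# Spell out the current defaults explicitly: blosclz, compression level 5, byte shuffle.
compressor = Blosc(cname='blosclz', clevel=5, shuffle=Blosc.SHUFFLE)
z = zarr.array(np.arange(1_000_000), chunks=100_000, compressor=compressor)
print(z.info)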

@mrocklin
Contributor Author

I wonder if @falted could jump in here with a few sentences about how he would set compression defaults knowing only the dtype.

@alimanfoo
Member

A pretty obvious rule would be to use the bitshuffle filter instead of byte shuffle for single-byte dtypes.
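That rule is easy to bolt onto a dtype-based helper. A sketch using numcodecs' shuffle constants (the helper name is mine, not an existing API):

import numpy as np
from numcodecs import Blosc

def default_shuffle(dtype):
    # Byte shuffle is a no-op when itemsize == 1, so use bitshuffle there instead.
    if np.dtype(dtype).itemsize == 1:
        return Blosc.BITSHUFFLE
    return Blosc.SHUFFLE

compressor = Blosc(cname='blosclz', clevel=5, shuffle=default_shuffle('i1'))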

@FrancescAlted

Reviewing these issues I stumbled upon this (BTW, @falted is not my nickname on GitHub). I agree that making too many assumptions about compression parameters based on dtype is risky. In fact, even though shuffle is active by default in Blosc, it is not unusual to find datasets that work better without it.

Also, I'm +1 on Alistair's suggestion to activate the bitshuffle filter for single-byte dtypes (though I am not even sure whether this would be beneficial for string data).

@dstansby
Contributor

Default compressors are being changed at the moment over at #2470 - I'd encourage folks to take a look there and chip in with any feedback.
