Skip to content

Create bitpacking filter for biallelic diploid datasets #80

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
eric-czech opened this issue Jul 30, 2020 · 5 comments
Open

Create bitpacking filter for biallelic diploid datasets #80

eric-czech opened this issue Jul 30, 2020 · 5 comments
Labels
good first issue Good for newcomers IO Issues related to reading and writing common third-party file formats

Comments

@eric-czech
Copy link
Collaborator

We should add a bitpacking numcodecs filter for biallelic diploid data since it makes for a substantial improvement over zarr's default compression. Here's an example: https://nbviewer.jupyter.org/github/related-sciences/gwas-analysis/blob/master/notebooks/platform/xarray/io.ipynb (at the end, results are ~%40 smaller).

That same filter won't work since calls are in a 3D array now, so we'll have to rethink the packing scheme.

@hammer hammer added the IO Issues related to reading and writing common third-party file formats label Jul 30, 2020
@hammer
Copy link
Contributor

hammer commented Jul 31, 2020

@alimanfoo @daletovar we thought this task might be a good one for Quansight to pick up. Any interest?

@jeromekelleher
Copy link
Collaborator

Sounds like a good idea to me. This does sound like something that should be in numcodecs, though, right?

Also, it doesn't feel like this is on the critical path for getting an initial release done - it will make our on-file storage a bit more compact, but won't affect anything else much, right?

@eric-czech
Copy link
Collaborator Author

This does sound like something that should be in numcodecs, though, right?

Possibly? An alternative that made less sense when we were working with PLINK data directly (as alternate allele counts) would be to encode our calls as booleans and use PackBits as is. This is pretty straightforward and putting a little thought into the syntax would be something like:

# Convert  unphased diploid biallelic calls to 2-bit bools
hap0, hap1 = ds.call_genotype[..., 0], ds.call_genotype[..., 1]
bit0 = (hap0 < 0) | (hap0 != hap1) # Missing or heterozygous
bit1 = hap0 == 0 # Homozygous major or minor
ds['call_genotype_bits'] = xr.concat([bit0, bit1], dim='bits')
xr.to_zarr(ds, encoding={'call_genotype_bits': {'filters': [PackBits()]}})

I suspect there's a way to do it in a single pass without creating two in-memory arrays and one more generic interface that numcodecs could support to cover this would be to have a bit packing filter optimized for 8-bit integers in a pre-specified range (something like -1 to 2 in this case for the 4 PLINK call states). We could call a function like https://github.com/pystatgen/sgkit/issues/85 first, then feed it to this filter. Decoding back to the 3D format would still be non-trivial but we'll probably need a function for that at some point anyhow.

Also, it doesn't feel like this is on the critical path for getting an initial release done - it will make our on-file storage a bit more compact, but won't affect anything else much, right?

I'm probably biased in trying to optimize for https://github.com/pystatgen/sgkit/issues/67 but I think there may be an argument in improving storage for data specific to PLINK since our current format with the separate genotype mask and extra ploidy dimension is so wasteful by comparison. I'm not sure if the 50%+ savings in storage and network transfers is enough to call it critical for everyone though.

@daletovar
Copy link
Collaborator

@hammer, yes I could add something like this if @alimanfoo gives the okay. @eric-czech's idea looks reasonable to me.

@jeromekelleher
Copy link
Collaborator

I think this is a useful feature we will want sooner or later - we can decide later where it should live. In the absence of @alimanfoo, I think it'd be a good thing for @daletovar to look at.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Good for newcomers IO Issues related to reading and writing common third-party file formats
Projects
None yet
Development

No branches or pull requests

4 participants