Create bitpacking filter for biallelic diploid datasets #80

eric-czech · 2020-07-30T19:15:04Z

We should add a bitpacking numcodecs filter for biallelic diploid data since it makes for a substantial improvement over zarr's default compression. Here's an example: https://nbviewer.jupyter.org/github/related-sciences/gwas-analysis/blob/master/notebooks/platform/xarray/io.ipynb (at the end, results are ~40% smaller).

That same filter won't work since calls are in a 3D array now, so we'll have to rethink the packing scheme.

hammer · 2020-07-31T16:11:12Z

@alimanfoo @daletovar we thought this task might be a good one for Quansight to pick up. Any interest?

jeromekelleher · 2020-08-03T11:02:30Z

Sounds like a good idea to me. This does sound like something that should be in numcodecs, though, right?

Also, it doesn't feel like this is on the critical path for getting an initial release done - it will make our on-file storage a bit more compact, but won't affect anything else much, right?

eric-czech · 2020-08-03T12:10:54Z

> This does sound like something that should be in numcodecs, though, right?

Possibly? An alternative, which made less sense when we were working with PLINK data directly (as alternate allele counts), would be to encode our calls as booleans and use PackBits as is. This is pretty straightforward; with a little thought put into the syntax, it would look something like:

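import xarray as xr
from numcodecs import PackBits

# Convert unphased diploid biallelic calls to two boolean bit planes
# (assumes heterozygous calls are stored as (0, 1), so they stay distinct from missing)
hap0, hap1 = ds.call_genotype[..., 0], ds.call_genotype[..., 1]
bit0 = (hap0 < 0) | (hap0 != hap1)  # Missing or heterozygous
bit1 = hap0 == 0  # Homozygous major or minor (when bit0 is unset)
ds['call_genotype_bits'] = xr.concat([bit0, bit1], dim='bits')
# "store" is a placeholder for the target Zarr path or store
ds.to_zarr(store, encoding={'call_genotype_bits': {'filters': [PackBits()]}})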
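
To make the two-bit encoding concrete, here is a minimal NumPy-only sketch of the round trip back to the 3D layout. It is an illustration rather than sgkit or numcodecs code: the toy array, the `decode` helper, and the convention that heterozygous calls are always stored as (0, 1) and missing calls as (-1, -1) are all assumptions, and `np.packbits` stands in for what PackBits would do at write time.

```python
import numpy as np

# Toy call_genotype array with shape (variants, samples, ploidy); -1 marks a missing allele.
# Assumption: heterozygous calls are stored as (0, 1), never (1, 0).
gt = np.array(
    [[[0, 0], [0, 1], [1, 1], [-1, -1]],
     [[1, 1], [0, 0], [-1, -1], [0, 1]]],
    dtype=np.int8,
)

hap0, hap1 = gt[..., 0], gt[..., 1]
bit0 = (hap0 < 0) | (hap0 != hap1)  # missing or heterozygous
bit1 = hap0 == 0                    # first allele is allele 0


def decode(bit0, bit1):
    """Hypothetical helper: rebuild the 3D array from the two bit planes."""
    out = np.empty(bit0.shape + (2,), dtype=np.int8)
    # bit0 unset: homozygous, allele chosen by bit1 (set -> 0, unset -> 1)
    hom = np.where(bit1, 0, 1)
    # bit0 set: heterozygous if bit1 is set, otherwise missing
    out[..., 0] = np.where(bit0, np.where(bit1, 0, -1), hom)
    out[..., 1] = np.where(bit0, np.where(bit1, 1, -1), hom)
    return out


roundtrip = decode(bit0, bit1)
# Stand-in for the PackBits filter: 16 booleans collapse into 2 bytes
packed = np.packbits(np.stack([bit0, bit1]))
```

If heterozygous calls can appear as (1, 0), they would collide with the missing state under this scheme, so the calls would need to be normalized first.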

I suspect there's a way to do it in a single pass without creating two in-memory arrays. A more generic interface that numcodecs could support to cover this would be a bit-packing filter optimized for 8-bit integers in a pre-specified range (something like -1 to 2 in this case, for the 4 PLINK call states). We could call a function like https://github.com/pystatgen/sgkit/issues/85 first, then feed the result to this filter. Decoding back to the 3D format would still be non-trivial, but we'll probably need a function for that at some point anyhow.
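
As a rough illustration of the range-based idea, the sketch below packs 8-bit integers from -1 to 2 into 2 bits each, four values per byte. The names `pack_2bit` and `unpack_2bit` and the `offset` parameter are hypothetical and not part of numcodecs; an actual filter would presumably subclass `numcodecs.abc.Codec` and would also need to restore the original array shape.

```python
import numpy as np


def pack_2bit(values, offset=1):
    """Pack int8 values in [-offset, 3 - offset] into 2 bits each (4 per byte)."""
    codes = (np.asarray(values, dtype=np.int8).ravel() + offset).astype(np.uint8)
    if codes.size and codes.max() > 3:
        raise ValueError("values fall outside the 2-bit range")
    # Pad with zero codes so the length is a multiple of 4
    pad = (-codes.size) % 4
    codes = np.concatenate([codes, np.zeros(pad, dtype=np.uint8)]).reshape(-1, 4)
    # The first value of each group occupies the lowest two bits of the byte
    return codes[:, 0] | (codes[:, 1] << 2) | (codes[:, 2] << 4) | (codes[:, 3] << 6)


def unpack_2bit(packed, n, offset=1):
    """Inverse of pack_2bit; n is the number of original values."""
    packed = np.asarray(packed, dtype=np.uint8)
    shifts = np.array([0, 2, 4, 6], dtype=np.uint8)
    codes = (packed[:, None] >> shifts) & 3
    # Drop the padding, then shift codes back into the original value range
    return codes.ravel()[:n].astype(np.int8) - offset


# Alternate allele counts, with -1 for missing (the 4 PLINK call states)
counts = np.array([0, 1, 2, -1, 2, 0], dtype=np.int8)
packed = pack_2bit(counts)
restored = unpack_2bit(packed, counts.size)
```

Because the input here is a single array of call states, this avoids building two separate boolean arrays, at the cost of needing the element count (or shape) to undo the padding on decode.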

> Also, it doesn't feel like this is on the critical path for getting an initial release done - it will make our on-file storage a bit more compact, but won't affect anything else much, right?

I'm probably biased in trying to optimize for https://github.com/pystatgen/sgkit/issues/67, but I think there may be an argument for improving storage of PLINK-specific data, since our current format, with the separate genotype mask and extra ploidy dimension, is so wasteful by comparison. I'm not sure the 50%+ savings in storage and network transfers is enough to call it critical for everyone, though.

daletovar · 2020-08-04T19:26:54Z

@hammer, yes I could add something like this if @alimanfoo gives the okay. @eric-czech's idea looks reasonable to me.

jeromekelleher · 2020-08-05T08:31:29Z

I think this is a useful feature we will want sooner or later - we can decide later where it should live. In the absence of @alimanfoo, I think it'd be a good thing for @daletovar to look at.

hammer added the "IO" label (Issues related to reading and writing common third-party file formats) Jul 30, 2020

hammer added the "good first issue" label (Good for newcomers) Aug 5, 2020

goanpeca mentioned this issue Aug 21, 2020

#80 Create bitpacking filter for biallelic diploid datasets goanpeca/sgkit-clone#80

Open

eric-czech mentioned this issue Aug 22, 2020

Create utilities or add examples that demonstrate how to store bgen zarr efficiently sgkit-dev/sgkit-bgen#14

Open

eric-czech mentioned this issue Sep 3, 2020

Add PC Relate #228

Merged

tomwhite mentioned this issue Oct 13, 2022

maximise lossless compression of vcf_to_zarr #925

Closed
