-
Notifications
You must be signed in to change notification settings - Fork 35
Create bitpacking filter for biallelic diploid datasets #80
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
@alimanfoo @daletovar we thought this task might be a good one for Quansight to pick up. Any interest? |
Sounds like a good idea to me. This does sound like something that should be in Also, it doesn't feel like this is on the critical path for getting an initial release done - it will make our on-file storage a bit more compact, but won't affect anything else much, right? |
Possibly? An alternative that made less sense when we were working with PLINK data directly (as alternate allele counts) would be to encode our calls as booleans and use PackBits as is. This is pretty straightforward and putting a little thought into the syntax would be something like: # Convert unphased diploid biallelic calls to 2-bit bools
hap0, hap1 = ds.call_genotype[..., 0], ds.call_genotype[..., 1]
bit0 = (hap0 < 0) | (hap0 != hap1) # Missing or heterozygous
bit1 = hap0 == 0 # Homozygous major or minor
ds['call_genotype_bits'] = xr.concat([bit0, bit1], dim='bits')
xr.to_zarr(ds, encoding={'call_genotype_bits': {'filters': [PackBits()]}}) I suspect there's a way to do it in a single pass without creating two in-memory arrays and one more generic interface that numcodecs could support to cover this would be to have a bit packing filter optimized for 8-bit integers in a pre-specified range (something like -1 to 2 in this case for the 4 PLINK call states). We could call a function like https://github.com/pystatgen/sgkit/issues/85 first, then feed it to this filter. Decoding back to the 3D format would still be non-trivial but we'll probably need a function for that at some point anyhow.
I'm probably biased in trying to optimize for https://github.com/pystatgen/sgkit/issues/67 but I think there may be an argument in improving storage for data specific to PLINK since our current format with the separate genotype mask and extra ploidy dimension is so wasteful by comparison. I'm not sure if the 50%+ savings in storage and network transfers is enough to call it critical for everyone though. |
@hammer, yes I could add something like this if @alimanfoo gives the okay. @eric-czech's idea looks reasonable to me. |
I think this is a useful feature we will want sooner or later - we can decide later where it should live. In the absence of @alimanfoo, I think it'd be a good thing for @daletovar to look at. |
We should add a bitpacking numcodecs filter for biallelic diploid data since it makes for a substantial improvement over zarr's default compression. Here's an example: https://nbviewer.jupyter.org/github/related-sciences/gwas-analysis/blob/master/notebooks/platform/xarray/io.ipynb (at the end, results are ~%40 smaller).
That same filter won't work since calls are in a 3D array now, so we'll have to rethink the packing scheme.
The text was updated successfully, but these errors were encountered: