Vcfzarr concatenate and rechunk #324
Conversation
I'm making a note to myself here that I'm increasingly curious about whether a library or collection of libraries that sits between Dask/Xarray/Zarr and sgkit/pangeo/satpy/metpy/astropy would be a reasonable thing to build.
fwiw I'd love to talk about this more on our call tomorrow.
I haven't found that to be a problem in concatenating datasets and rechunking them with Dask yet (it has worked at least), though it certainly worries me that it will be constantly necessary as a part of https://github.com/pystatgen/sgkit/issues/298. I imagined this kind of rechunking with Dask would be very common in a QC workflow. Did you start off using Dask to rechunk and have trouble, or go right for a more optimal solution from the start?
I started off using Dask to rechunk, but hit memory problems. I would rather have avoided writing this code, since ideally Dask rechunk would handle this case better, but it's probably worth keeping around until there's a more reliable way of doing it.
I found the following thread which mentions the
I opened an issue on the Dask project that dives into this problem in more detail.
Force-pushed from 37e43de to 4b46a59
Discussion upstream (in dask/dask#6745 and linked issues) has made it clear that there are no easy fixes for this problem. It's possible that Dask fixes and/or tuning will produce a solution, but in the meantime I think the pragmatic course of action is to merge the functionality in this PR. I have refactored it slightly to make it possible for the caller to choose which concatenate and rechunk algorithm to use (the optimized one introduced in this PR is the default; the other one is the Xarray/Dask internal one). This also makes it easier to remove the optimized one in the future if there is a Dask solution. I have tested this code again on the MalariaGEN data. The notebook is here.
Codecov Report
@@ Coverage Diff @@
## master #324 +/- ##
==========================================
+ Coverage 95.40% 95.52% +0.12%
==========================================
Files 31 31
Lines 2109 2233 +124
==========================================
+ Hits 2012 2133 +121
- Misses 97 100 +3
Continue to review full report at Codecov.
…ndency by vcfzarr_to_zarr
LGTM.
Do we have follow-up issues for all the unresolved points raised?
Thanks @jeromekelleher. I've opened #365 and #366 to track outstanding work.
OK, thanks. I'd say add the merge label if no one objects by tomorrow morning.
sgkit/io/vcfzarr_reader.py
Outdated
    grouped_by_contig: bool = False,
    consolidated: bool = False,
    tempdir: Optional[PathType] = None,
    concat_algorithm: Optional[str] = None,
You could probably use a Literal type here to make it more user-friendly and safe.
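For illustration, a minimal sketch of the suggestion: replacing a plain Optional[str] with typing.Literal so invalid algorithm names are caught by type checkers. The alias name and the simplified function below are hypothetical; only the parameter name concat_algorithm comes from the PR.

```python
from typing import Literal, Optional

# Hypothetical alias: the only non-default algorithm name is assumed here
# to be "xarray_internal"; the PR itself types this as Optional[str].
ConcatAlgorithm = Literal["xarray_internal"]

def choose_algorithm(concat_algorithm: Optional[ConcatAlgorithm] = None) -> str:
    # None selects the optimized default introduced in this PR;
    # "xarray_internal" selects the Xarray/Dask built-in path.
    return "optimized" if concat_algorithm is None else concat_algorithm
```

With this annotation, a call like `choose_algorithm("xaray_internal")` (a typo) is flagged by mypy rather than failing at runtime.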
sgkit/io/vcfzarr_reader.py
Outdated
        if var == "variant_id":
            ds.attrs["max_variant_id_length"] = max_len
        if var == "variant_allele":
nit: should be elif, or:
if var in {"variant_id", "variant_allele"}:
ds.attrs[f"max_{var}_length"] = max_len
This addresses parts of #56, and was motivated by the need to convert MalariaGEN data in VCF Zarr format to sgkit's representation.
The difficult case is where each contig is stored in a separate Zarr group, since the arrays need to be concatenated (along the variants dimension), and then uniformly rechunked so they can be stored in Zarr. Rechunker doesn't handle the case where Zarr arrays are concatenated (via Dask), since the input chunk sizes are not uniform (each Zarr array will have a final chunk that is less than the chunk size, so stacking them introduces non-uniformity).
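The non-uniformity described above can be illustrated with plain chunk arithmetic (a sketch, not sgkit code): each per-contig array of length n with chunk size c ends in a ragged final chunk of n % c, so concatenating several such arrays yields a chunk sequence Rechunker cannot treat as uniform.

```python
# Sketch: the chunk sizes produced by concatenating per-contig Zarr arrays
# along the variants dimension. Each input contributes full chunks plus a
# possibly short final chunk, so the combined sequence is non-uniform.
def concat_chunks(lengths, chunk_size):
    chunks = []
    for n in lengths:
        full, rem = divmod(n, chunk_size)
        chunks.extend([chunk_size] * full)
        if rem:
            chunks.append(rem)
    return chunks

print(concat_chunks([10, 7], 4))  # → [4, 4, 2, 4, 3]
```

The short chunks of size 2 and 3 in the middle and at the end are what break the uniform-chunking assumption.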
Furthermore, rechunking with Dask using the fix in dask/dask#5105 (comment) doesn't work either, since Dask caches input chunks that are split across output chunks (this was not the problem addressed in that issue).
This PR introduces a new concatenate_and_rechunk function that implements the solution outlined in the same issue: dask/dask#5105 (comment). The idea is to create an empty output array and fill it in by slicing from the Zarr stores directly. This means that memory usage is very low, and the whole thing is embarrassingly parallel. (The downside is that it operates at the Zarr level, not the Xarray level.)

The concatenate_and_rechunk function is used in a new vcfzarr_to_zarr function that converts (scikit-allel) VCF Zarr format to sgkit's Zarr representation.

This code will be useful for the native VCF support, where we need to concatenate and rechunk multiple Zarr files, but I haven't integrated it there yet.