Vcfzarr concatenate and rechunk #324


Merged · 7 commits into sgkit-dev:master on Nov 3, 2020

Conversation

tomwhite (Collaborator)

This addresses parts of #56, and was motivated by the need to convert MalariaGEN data in VCF Zarr format to sgkit's representation.

The difficult case is where each contig is stored in a separate Zarr group, since the arrays need to be concatenated (along the variants dimension) and then uniformly rechunked so they can be stored in Zarr. Rechunker doesn't handle the case where Zarr arrays are concatenated (via Dask), since the input chunk sizes are not uniform: each Zarr array has a final chunk that is smaller than the chunk size, so stacking them introduces non-uniformity.
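To make the non-uniformity concrete, here is a minimal illustration (not from the PR) of what happens when two per-contig arrays with ragged final chunks are concatenated:

import dask.array as da

# Two "contigs" with a nominal chunk size of 100 but ragged final chunks.
a = da.zeros(250, chunks=100)  # chunk sizes: (100, 100, 50)
b = da.zeros(330, chunks=100)  # chunk sizes: (100, 100, 100, 30)

c = da.concatenate([a, b])
print(c.chunks)  # ((100, 100, 50, 100, 100, 100, 30),) -- non-uniform

Writing c out with da.to_zarr fails at this point, because Zarr requires regular chunking.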

Furthermore, using Dask to rechunk with the fix in dask/dask#5105 (comment) doesn't work either, since Dask caches input chunks that are split across output chunks (this was not the problem addressed in that issue).

This PR introduces a new concatenate_and_rechunk function that implements the solution outlined in the same issue: dask/dask#5105 (comment). The idea is to create an empty output array and fill it in by slicing the Zarr stores directly. This means memory usage is very low, and the whole thing is embarrassingly parallel. (The downside is that it operates at the Zarr level, not the Xarray level.)
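A minimal sketch of the idea, assuming 1-d inputs (the function name and structure here are illustrative, not the PR's actual implementation):

import dask
import numpy as np
import zarr

def concatenate_and_rechunk_sketch(inputs, chunk_size, store):
    # Concatenate 1-d Zarr arrays along axis 0 and write a uniformly
    # chunked output by slicing the input stores directly.
    lengths = [z.shape[0] for z in inputs]
    total = sum(lengths)
    offsets = np.cumsum([0] + lengths)  # start of each input within the output
    out = zarr.open(store, mode="w", shape=(total,),
                    chunks=(chunk_size,), dtype=inputs[0].dtype)

    @dask.delayed
    def fill(start, stop):
        # Build one output chunk from whichever inputs overlap [start, stop).
        parts = []
        for z, off in zip(inputs, offsets):
            lo, hi = max(start, off), min(stop, off + z.shape[0])
            if lo < hi:
                parts.append(z[lo - off : hi - off])
        out[start:stop] = np.concatenate(parts)

    tasks = [fill(s, min(s + chunk_size, total))
             for s in range(0, total, chunk_size)]
    dask.compute(*tasks)  # tasks are independent: embarrassingly parallel

Each task touches at most one output chunk plus the input slices that feed it, which is why peak memory stays low.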

The concatenate_and_rechunk function is used in a new vcfzarr_to_zarr function that converts (scikit-allel) VCF Zarr format to sgkit's Zarr format.
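A hypothetical invocation might look like the following; the import path is inferred from the file list in the coverage report below, the positional arguments and paths are assumptions, and only the keyword names come from the signature shown in the review comments:

from sgkit.io.vcfzarr_reader import vcfzarr_to_zarr

vcfzarr_to_zarr(
    "malariagen.vcfzarr",     # scikit-allel VCF Zarr, one group per contig (assumed path)
    "malariagen.sgkit.zarr",  # output in sgkit's representation (assumed path)
    grouped_by_contig=True,
    concat_algorithm=None,    # None selects the optimized algorithm from this PR
)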

This code will be useful for the native VCF support where we need to concatenate and rechunk multiple Zarr files, but I haven't integrated it there yet.

hammer (Contributor) commented on Oct 14, 2020

I'm making a note to myself here that this concatenate_and_rechunk is an example of a function that logically lives above the Dask/Xarray/Zarr stack, not within any individual subproject, and whose utility is not limited to the statistical genetics domain.

I'm increasingly curious about whether a library or collection of libraries that sit between Dask/Xarray/Zarr and sgkit/pangeo/satpy/metpy/astropy would be a reasonable thing to build.

eric-czech (Collaborator)

fwiw I'd love to talk about this more on our call tomorrow:

Furthermore, using Dask to rechunk with the fix in dask/dask#5105 (comment) doesn't work either, since Dask caches input chunks that are split across output chunks (this was not the problem addressed in that issue).

I haven't found that to be a problem when concatenating datasets and rechunking them with Dask yet (it has worked, at least), though it certainly worries me, since this kind of rechunking will be constantly necessary as part of https://github.com/pystatgen/sgkit/issues/298. I imagined doing this kind of rechunking with Dask would be very common in a QC workflow.

Did you start off using Dask to rechunk and have trouble or go right for a more optimal solution from the start?

tomwhite (Collaborator, Author)

Did you start off using Dask to rechunk and have trouble or go right for a more optimal solution from the start?

I started off using Dask to rechunk, but hit memory problems. I would rather have avoided writing this code, since ideally Dask rechunk would handle this case better, but it's probably worth keeping around until there's a more reliable way of doing it.

tomwhite (Collaborator, Author)

I found the following thread, which mentions the block_size_limit keyword to Dask's rechunk. It might be helpful for controlling memory usage, although it's not obvious (to me) what to set it to in general:

https://stackoverflow.com/questions/55324848/for-a-bigger-than-memory-dask-array-of-size-m-n-how-to-re-chunk-from-chunks/55413100
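For reference, this is roughly how the keyword is passed; the array shapes and the byte limit below are arbitrary:

import dask.array as da

x = da.zeros((40_000, 40_000), chunks=(10_000, 10_000))
# block_size_limit caps the size, in bytes, of blocks created while
# planning the rechunk; a good value is workload-dependent.
y = x.rechunk((40_000, 100), block_size_limit=100_000_000)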

tomwhite (Collaborator, Author)

I opened an issue on the Dask project that dives into this problem in more detail.

tomwhite force-pushed the vcfzarr-concatenate-and-rechunk branch from 37e43de to 4b46a59 on November 2, 2020
tomwhite (Collaborator, Author) commented on Nov 2, 2020

Discussion upstream (in dask/dask#6745 and linked issues) has made it clear that there are no easy fixes for this problem. It's possible that Dask fixes and/or tuning will produce a solution, but in the meantime I think the pragmatic course of action is to merge the functionality in this PR.

I have refactored it slightly to make it possible for the caller to choose which concatenate-and-rechunk algorithm to use (the optimized one introduced in this PR is the default; the other is the Xarray/Dask internal one). This also makes it easier to remove the optimized one in the future if a Dask solution emerges.

I have tested this code again on the MalariaGEN data. The notebook is here.

tomwhite marked this pull request as ready for review on November 2, 2020
codecov-io commented on Nov 2, 2020

Codecov Report

Merging #324 into master will increase coverage by 0.12%.
The diff coverage is 97.97%.


@@            Coverage Diff             @@
##           master     #324      +/-   ##
==========================================
+ Coverage   95.40%   95.52%   +0.12%     
==========================================
  Files          31       31              
  Lines        2109     2233     +124     
==========================================
+ Hits         2012     2133     +121     
- Misses         97      100       +3     
Impacted Files               Coverage Δ
sgkit/io/vcfzarr_reader.py   98.13% <97.80%> (-1.87%) ⬇️
sgkit/io/utils.py            98.70% <98.14%> (-1.30%) ⬇️
sgkit/io/vcf/__init__.py     54.54% <100.00%> (+4.54%) ⬆️
sgkit/io/vcf/vcf_reader.py   100.00% <100.00%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

jeromekelleher (Collaborator) left a comment:

LGTM.

Do we have follow-up issues for all the unresolved points raised?

tomwhite (Collaborator, Author) commented on Nov 2, 2020

Thanks @jeromekelleher. I've opened #365 and #366 to track outstanding work.

jeromekelleher (Collaborator)

OK, thanks. I'd say add the merge label if no one objects by tomorrow morning.

grouped_by_contig: bool = False,
consolidated: bool = False,
tempdir: Optional[PathType] = None,
concat_algorithm: Optional[str] = None,

Review comment (Collaborator):

You could probably use a Literal type here to make it more user-friendly and safe.
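A sketch of what that suggestion could look like; the literal values are illustrative, since the actual algorithm names aren't visible in this hunk:

from typing import Literal, Optional

# Hypothetical algorithm names, for illustration only.
ConcatAlgorithm = Literal["optimized", "xarray_internal"]

def example_signature(concat_algorithm: Optional[ConcatAlgorithm] = None) -> None:
    ...  # a type checker now rejects misspelled algorithm names at call sites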


if var == "variant_id":
    ds.attrs["max_variant_id_length"] = max_len
if var == "variant_allele":

Review comment (Collaborator):

nit: should be elif, or:

if var in {"variant_id", "variant_allele"}:
    ds.attrs[f"max_{var}_length"] = max_len

tomwhite added the auto-merge label on Nov 3, 2020
mergify[bot] merged commit 03415a2 into sgkit-dev:master on Nov 3, 2020