Vcfzarr concatenate and rechunk #324


Merged · 7 commits into sgkit-dev:master on Nov 3, 2020

Conversation

tomwhite (Collaborator)

This addresses parts of #56, and was motivated by the need to convert MalariaGEN data in VCF Zarr format to sgkit's representation.

The difficult case is where each contig is stored in a separate Zarr group, since the arrays need to be concatenated (along the variants dimension) and then uniformly rechunked so they can be stored in Zarr. Rechunker doesn't handle the case where Zarr arrays are concatenated (via Dask), since the input chunk sizes are not uniform: each Zarr array has a final chunk that is smaller than the chunk size, so stacking them introduces non-uniformity.
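To make the non-uniformity concrete, here is a minimal illustration (not from the PR) of what happens when two per-contig arrays with ragged final chunks are concatenated:

import dask.array as da

# Two "contigs" with a nominal chunk size of 100 but ragged final chunks.
a = da.zeros(250, chunks=100)  # chunk sizes: (100, 100, 50)
b = da.zeros(330, chunks=100)  # chunk sizes: (100, 100, 100, 30)

c = da.concatenate([a, b])
print(c.chunks)  # ((100, 100, 50, 100, 100, 100, 30),) -- non-uniform

Writing c out with da.to_zarr fails at this point, because Zarr requires regular chunking.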

Furthermore, using Dask to rechunk with the fix in dask/dask#5105 (comment) doesn't work either, since Dask caches input chunks that are split across output chunks (this was not the problem addressed in that issue).

This PR introduces a new concatenate_and_rechunk function that implements the solution outlined in the same issue: dask/dask#5105 (comment). The idea is to create an empty output array and fill it in by slicing the Zarr stores directly. This means memory usage is very low, and the whole thing is embarrassingly parallel. (The downside is that it operates at the Zarr level, not the Xarray level.)
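A minimal sketch of the idea, assuming 1-d inputs (the function name and structure here are illustrative, not the PR's actual implementation):

import dask
import numpy as np
import zarr

def concatenate_and_rechunk_sketch(inputs, chunk_size, store):
    # Concatenate 1-d Zarr arrays along axis 0 and write a uniformly
    # chunked output by slicing the input stores directly.
    lengths = [z.shape[0] for z in inputs]
    total = sum(lengths)
    offsets = np.cumsum([0] + lengths)  # start of each input within the output
    out = zarr.open(store, mode="w", shape=(total,),
                    chunks=(chunk_size,), dtype=inputs[0].dtype)

    @dask.delayed
    def fill(start, stop):
        # Build one output chunk from whichever inputs overlap [start, stop).
        parts = []
        for z, off in zip(inputs, offsets):
            lo, hi = max(start, off), min(stop, off + z.shape[0])
            if lo < hi:
                parts.append(z[lo - off : hi - off])
        out[start:stop] = np.concatenate(parts)

    tasks = [fill(s, min(s + chunk_size, total))
             for s in range(0, total, chunk_size)]
    dask.compute(*tasks)  # tasks are independent: embarrassingly parallel

Each task touches at most one output chunk plus the input slices that feed it, which is why peak memory stays low.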

The concatenate_and_rechunk function is used in a new vcfzarr_to_zarr function that converts (scikit-allel) VCF Zarr format to sgkit's Zarr format.
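A hypothetical invocation might look like the following; the import path is inferred from the file list in the coverage report below, the positional arguments and paths are assumptions, and only the keyword names come from the signature shown in the review comments:

from sgkit.io.vcfzarr_reader import vcfzarr_to_zarr

vcfzarr_to_zarr(
    "malariagen.vcfzarr",     # scikit-allel VCF Zarr, one group per contig (assumed path)
    "malariagen.sgkit.zarr",  # output in sgkit's representation (assumed path)
    grouped_by_contig=True,
    concat_algorithm=None,    # None selects the optimized algorithm from this PR
)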

This code will be useful for the native VCF support where we need to concatenate and rechunk multiple Zarr files, but I haven't integrated it there yet.

hammer (Contributor) commented on Oct 14, 2020

I'm making a note to myself here that this concatenate_and_rechunk is an example of a function that logically lives above the Dask/Xarray/Zarr stack, not within any individual subproject, and whose utility is not limited to the statistical genetics domain.

I'm increasingly curious about whether a library or collection of libraries that sit between Dask/Xarray/Zarr and sgkit/pangeo/satpy/metpy/astropy would be a reasonable thing to build.

eric-czech (Collaborator)

fwiw I'd love to talk about this more on our call tomorrow:

Furthermore, using Dask to rechunk with the fix in dask/dask#5105 (comment) doesn't work either, since Dask caches input chunks that are split across output chunks (this was not the problem addressed in that issue).

I haven't found that to be a problem when concatenating datasets and rechunking them with Dask yet (it has worked, at least), though it certainly worries me, since this kind of rechunking will be constantly necessary as part of https://github.com/pystatgen/sgkit/issues/298. I imagined doing this kind of rechunking with Dask would be very common in a QC workflow.

Did you start off using Dask to rechunk and have trouble or go right for a more optimal solution from the start?

tomwhite (Collaborator, Author)

Did you start off using Dask to rechunk and have trouble or go right for a more optimal solution from the start?

I started off using Dask to rechunk, but hit memory problems. I would rather have avoided writing this code, since ideally Dask rechunk would handle this case better, but it's probably worth keeping around until there's a more reliable way of doing it.

tomwhite (Collaborator, Author)

I found the following thread, which mentions the block_size_limit keyword to Dask's rechunk. It might be helpful for controlling memory usage, although it's not obvious (to me) what to set it to in general:

https://stackoverflow.com/questions/55324848/for-a-bigger-than-memory-dask-array-of-size-m-n-how-to-re-chunk-from-chunks/55413100
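For reference, this is roughly how the keyword is passed; the array shapes and the byte limit below are arbitrary:

import dask.array as da

x = da.zeros((40_000, 40_000), chunks=(10_000, 10_000))
# block_size_limit caps the size, in bytes, of blocks created while
# planning the rechunk; a good value is workload-dependent.
y = x.rechunk((40_000, 100), block_size_limit=100_000_000)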

tomwhite (Collaborator, Author)

I opened an issue on the Dask project that dives into this problem in more detail.

tomwhite force-pushed the vcfzarr-concatenate-and-rechunk branch from 37e43de to 4b46a59 on November 2, 2020
tomwhite (Collaborator, Author) commented on Nov 2, 2020

Discussion upstream (in dask/dask#6745 and linked issues) has made it clear that there are no easy fixes for this problem. It's possible that Dask fixes and/or tuning will produce a solution, but in the meantime I think the pragmatic course of action is to merge the functionality in this PR.

I have refactored it slightly to make it possible for the caller to choose which concatenate-and-rechunk algorithm to use (the optimized one introduced in this PR is the default; the other is the Xarray/Dask internal one). This also makes it easier to remove the optimized one in the future if a Dask solution emerges.

I have tested this code again on the MalariaGEN data. The notebook is here.

tomwhite marked this pull request as ready for review on November 2, 2020
codecov-io commented on Nov 2, 2020

Codecov Report

Merging #324 into master will increase coverage by 0.12%.
The diff coverage is 97.97%.


@@            Coverage Diff             @@
##           master     #324      +/-   ##
==========================================
+ Coverage   95.40%   95.52%   +0.12%     
==========================================
  Files          31       31              
  Lines        2109     2233     +124     
==========================================
+ Hits         2012     2133     +121     
- Misses         97      100       +3     
Impacted Files               Coverage Δ
sgkit/io/vcfzarr_reader.py   98.13% <97.80%> (-1.87%) ⬇️
sgkit/io/utils.py            98.70% <98.14%> (-1.30%) ⬇️
sgkit/io/vcf/__init__.py     54.54% <100.00%> (+4.54%) ⬆️
sgkit/io/vcf/vcf_reader.py   100.00% <100.00%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

jeromekelleher (Collaborator) left a comment:

LGTM.

Do we have follow-up issues for all the unresolved points raised?

tomwhite (Collaborator, Author) commented on Nov 2, 2020

Thanks @jeromekelleher. I've opened #365 and #366 to track outstanding work.

jeromekelleher (Collaborator)

OK, thanks. I'd say add the merge label if no one objects by tomorrow morning.

grouped_by_contig: bool = False,
consolidated: bool = False,
tempdir: Optional[PathType] = None,
concat_algorithm: Optional[str] = None,

Review comment (Collaborator):

You could probably use a Literal type here to make it more user-friendly and safe.
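A sketch of what that suggestion could look like; the literal values are illustrative, since the actual algorithm names aren't visible in this hunk:

from typing import Literal, Optional

# Hypothetical algorithm names, for illustration only.
ConcatAlgorithm = Literal["optimized", "xarray_internal"]

def example_signature(concat_algorithm: Optional[ConcatAlgorithm] = None) -> None:
    ...  # a type checker now rejects misspelled algorithm names at call sites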


if var == "variant_id":
    ds.attrs["max_variant_id_length"] = max_len
if var == "variant_allele":

Review comment (Collaborator):

nit: should be elif, or:

if var in {"variant_id", "variant_allele"}:
    ds.attrs[f"max_{var}_length"] = max_len

tomwhite added the auto-merge label on Nov 3, 2020
mergify[bot] merged commit 03415a2 into sgkit-dev:master on Nov 3, 2020