Skip to content

Windowing functions should be lazy #340

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
tomwhite opened this issue Oct 20, 2020 · 3 comments
Closed

Windowing functions should be lazy #340

tomwhite opened this issue Oct 20, 2020 · 3 comments

Comments

@tomwhite
Copy link
Collaborator

See https://github.com/pystatgen/sgkit/pull/303#discussion_r507906940

@tomwhite tomwhite mentioned this issue Oct 20, 2020
@tomwhite
Copy link
Collaborator Author

Currently, window variables (window_{contig,start,stop}) are numpy arrays, which as Eric points out in the comment do not scale well to 100M variants. I think there are two things to do:

  1. Make the windowing functions construct Dask arrays for the window variables so that they can be computed more efficiently.
  2. Make window computations (window_statistic and ld_matrix) use Dask arrays.

For 1. we need a Dask implementation of np.searchsorted, both for finding contig boundaries, and for finding windows by genomic position. See dask/dask#7696

2 is more work, since we can't materialize the window arrays as we do now. Instead, the strategy is probably: rechunk the window arrays to match the target variable chunks (e.g. genotypes for computing some of the popgen stats), then use map_overlap to process the target variable and the window definitions chunk by chunk. We'll also need a Dask delayed version of this in cases where we can't use map_overlap, such as when using window_by_position where the overlap depth is not bounded. This is similar to the current implementation in ld_matrix, which has the additional complication of returning a Dask dataframe, rather than a Dask array like map_overlap does.

So it might be worth having a couple of utility functions, map_windows and a more flexible Dask delayed version.

@jeromekelleher
Copy link
Collaborator

It's not obvious to me that there really is a scaling issue here - have we done some experiments to see what sort of time these windowing functions take? Sounds like the dask versions are doing to be fairly tricky so might be worth doing a few quick checks before embarking on it?

@tomwhite
Copy link
Collaborator Author

tomwhite commented Jan 4, 2023

Closing this for the time being, can re-open if we see scalability issues.

@tomwhite tomwhite closed this as completed Jan 4, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants