Skip to content

Port function for calculating genomic windows #281

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
eric-czech opened this issue Sep 25, 2020 · 4 comments
Closed

Port function for calculating genomic windows #281

eric-czech opened this issue Sep 25, 2020 · 4 comments

Comments

@eric-czech
Copy link
Collaborator

The axis_intervals function may be a good place to start for https://github.com/pystatgen/sgkit/issues/229. I already started lifting this over on a branch for https://github.com/pystatgen/sgkit/issues/31 at some point, so I'll prioritize getting that in for review soon so we can see if there's anything important missing.

Let me know if this is blocking you at all @tomwhite. I'm happy to coordinate on it if you're looking to start working on windowing right away.

@tomwhite
Copy link
Collaborator

Thanks @eric-czech. It would be good to get something in soon. Having said that, a while ago I was looking at how to port scikit-allel's moving_statistic function to Dask, so I have a bare-bones implementation of windowing that uses Dask's map_overlap, and which I can use for the time being. I think axis_intervals is more fully-featured, but it would be interesting to compare the two approaches (Dask delayed vs map_overlap).

@eric-czech
Copy link
Collaborator Author

Having said that, a while ago I was looking at how to port scikit-allel's moving_statistic function to Dask, so I have a bare-bones implementation of windowing that uses Dask's map_overlap, and which I can use for the time being.

👍 Something of an outstanding question for me is whether or not chromosomes should be reflected in chunking, or if they should be passed to these operations and dealt with at a higher level. Given that it's pretty straightforward to put a chunk boundary at chromosome boundaries, I thought that was a nicer solution though I don't imagine it will ever work well with managed rechunking solutions like those mentioned in dask/dask#6597. It would have to know which small chunks to merge into larger preceding chunks.

It would be nice if there was an efficient way to do a groupby and then apply chromosome-agnostic functions to those groups, but that performed really terribly in my early gwas-analysis experiments. Perhaps it's worth pushing upstream on an Xarray grouping operation that is optimized for pre-sorted groups (and probably relies on chunking like we will likely have to)? Maybe that effort already exists.

I think axis_intervals is more fully-featured, but it would be interesting to compare the two approaches (Dask delayed vs map_overlap).

I'm definitely curious if the map_overlap -> map_blocks -> blockwise code ends up being more efficient than a loop that fetches the chunks individually. It would be great if we didn't also need chunk-specific variables like the window-specific ones you mentioned in https://github.com/pystatgen/sgkit/issues/229#issuecomment-702164597. I was assuming the axis_intervals port would create both and the chunk-specific ones would be a bit awkward at the user level since it's an implementation detail whereas windows can useful to analyze on their own IMO if defined by physical distance.

@tomwhite tomwhite mentioned this issue Oct 5, 2020
@tomwhite
Copy link
Collaborator

tomwhite commented Oct 5, 2020

Something of an outstanding question for me is whether or not chromosomes should be reflected in chunking, or if they should be passed to these operations and dealt with at a higher level.

I feel that chunking is a lower-level concern, and shouldn't be tied to the data representation itself. In other words, it should be possible to rechunk a dataset without it changing the result of running a function on the data. It might be more or less efficient, but the result should stay the same.

So I think chunks don’t have to respect contig boundaries, whereas windows (intervals) do. The approach in #303 will allow windows to be defined that take contigs into account - although the implementation doesn't yet do that. (Of course, we'll need efficient primitives to compute over windows, and there may be work to do there.)

@tomwhite
Copy link
Collaborator

tomwhite commented Aug 3, 2021

Fixed in #341

@tomwhite tomwhite closed this as completed Aug 3, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants