-
Notifications
You must be signed in to change notification settings - Fork 35
Port function for calculating genomic windows #281
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Thanks @eric-czech. It would be good to get something in soon. Having said that, a while ago I was looking at how to port scikit-allel's |
👍 Something of an outstanding question for me is whether or not chromosomes should be reflected in chunking, or if they should be passed to these operations and dealt with at a higher level. Given that it's pretty straightforward to put a chunk boundary at chromosome boundaries, I thought that was a nicer solution though I don't imagine it will ever work well with managed rechunking solutions like those mentioned in dask/dask#6597. It would have to know which small chunks to merge into larger preceding chunks. It would be nice if there was an efficient way to do a groupby and then apply chromosome-agnostic functions to those groups, but that performed really terribly in my early gwas-analysis experiments. Perhaps it's worth pushing upstream on an Xarray grouping operation that is optimized for pre-sorted groups (and probably relies on chunking like we will likely have to)? Maybe that effort already exists.
I'm definitely curious if the map_overlap -> map_blocks -> blockwise code ends up being more efficient than a loop that fetches the chunks individually. It would be great if we didn't also need chunk-specific variables like the window-specific ones you mentioned in https://github.com/pystatgen/sgkit/issues/229#issuecomment-702164597. I was assuming the |
I feel that chunking is a lower-level concern, and shouldn't be tied to the data representation itself. In other words, it should be possible to rechunk a dataset without it changing the result of running a function on the data. It might be more or less efficient, but the result should stay the same. So I think chunks don’t have to respect contig boundaries, whereas windows (intervals) do. The approach in #303 will allow windows to be defined that take contigs into account - although the implementation doesn't yet do that. (Of course, we'll need efficient primitives to compute over windows, and there may be work to do there.) |
Fixed in #341 |
The axis_intervals function may be a good place to start for https://github.com/pystatgen/sgkit/issues/229. I already started lifting this over on a branch for https://github.com/pystatgen/sgkit/issues/31 at some point, so I'll prioritize getting that in for review soon so we can see if there's anything important missing.
Let me know if this is blocking you at all @tomwhite. I'm happy to coordinate on it if you're looking to start working on windowing right away.
The text was updated successfully, but these errors were encountered: