-
Notifications
You must be signed in to change notification settings - Fork 35
Rename window to window_by_index, and add window_by_position #581
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
5f8eac7
to
c413978
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Some comments.
sgkit/window.py
Outdated
@@ -12,27 +13,32 @@ | |||
# Window definition (user code) | |||
|
|||
|
|||
def window( | |||
def window_by_index( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a strong opinion here, but have you considered window_by_variant
? The index
suffix here might refer to a number of things, and there would be a nice symmetry then between window_by_variant
and window_by_position
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hadn't thought of it, but window_by_variant
might be better.
ds: Dataset, | ||
*, | ||
size: int, | ||
offset: int = 0, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could we add a num
(or similar) option which is mutually exclusive to size
that gives the number of windows? This would be really useful for plotting - I want to plot around 200 windows, e.g.
So the size
argument corresponds to numpy.arange and the num
argument corresponds to numpy.linspace
?
Haven't thought through how offset
fits in here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
window_by_index
uses np.arange
, so perhaps a num
argument makes more sense there?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could want "num" in either - maybe I want n
windows with the same number of variants (or as close as possible) or I want n
equally spaced windows along the genome. Both are valid things, I think, and would be useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True. I think we could add num
later, since it's fairly easy to work around. For window_by_index
, you can set size=ds.dims["variants"] // num
. For window_by_position
you can work out the size by adding up the lengths of the contigs and dividing by num
.
However, this does raise a difference in the types of windowing when measuring by genomic position. The window_by_position
function here produces one window per variant (position), which is what we want for computing ld_matrix
(which came from Eric's implementation, and was influenced by Hail's locus_windows). This is subtly different to chopping the genome up into equally spaced windows measured by number of base pairs. I think we need both, but I'm not sure what the second one looks like from an API point of view (or what it should be called!) - particularly as it probably involves #434 as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking about this more, I think the two ways of looking at windowing when measuring by genomic position can be generalized. I've introduced a variable called window_start_position
that can be used to set window starts - so setting it to variant_position
for example will create a window per variant (the ld_matrix
case). Alternatively, leaving it unset will produce equally-spaced windows of size size
across the genome (Jerome's case).
How does this look?
When discussing this issue in last week's meeting, it was clear that there was some confusion about naming and what these functions do. I've added some examples and more documentation which hopefully should help. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delay in getting back on this @tomwhite, but I really wanted to get what you're doing here. I think I'm still missing something basic though - see the comment above.
Dammit, github lost my comment! Basically I'm confused about why For window_by_position, consider the following example. What if we wanted to plot diversity in windows along the genome for two different datasets. If the window start/stop coords are based on the variant positions, then this will be quite messy. I feel like I must be misunderstanding something though, so apologies for being slow here.. |
I think the answer to why
Would using BTW I'm away all next week, but look forward to discussing this more when I get back! |
It probably would, but what if I wanted to compare the values directly? I think it'd definitely want the windows to be strictly comparable across the two datasets, which they wouldn't be if the coordinates didn't align. I think it's worth at least thinking through the alternative approach to windowing where we store the index of the window each variant belongs to.
Have fun! And yes, I think it's definitely worth talking this through in more detail... |
Windows can overlap, so a variant may fall in an arbitrary number of windows. So I'm not sure how useful it would be as a representation. |
This sounds to me like a partitionining, which I suppose could be considered a special case of windowing. I can imagine certain workloads where we'd want to request a partioning that obeys certain characteristics, so perhaps we should file a separate issue to design an API and representation for that use case? |
We can do this within a single contig with something like the following:
So it should be possible to plot diversity for two datasets with something similar, although doing it for multiple contigs would be more complicated. |
Right, I think this is what was bothering me and what I hadn't grokked. So if windows can overlap (and yes, it's better that they can), then what we have is a good approach. We'll probably want some recipes and maybe some utility functions for doing simple plotting like I was talking about above, but these can be added later. Thanks for your patience! |
Thanks @jeromekelleher. (It wasn't until I started trying to implement your suggestion that I realised that overlapping windows caused a problem.) |
Codecov Report
@@ Coverage Diff @@
## main #581 +/- ##
=========================================
Coverage 100.00% 100.00%
=========================================
Files 35 35
Lines 2754 2771 +17
=========================================
+ Hits 2754 2771 +17
Continue to review full report at Codecov.
|
This is a draft implementation of the design sketched in https://github.com/pystatgen/sgkit/issues/341#issuecomment-847140397
There is still work to change existing
window
references towindow_by_index
. Also, the windowing code does not use Dask arrays, see #340 for follow on work to fix that.