Rename window to window_by_index, and add window_by_position #581

tomwhite · 2021-05-25T12:12:25Z

This is a draft implementation of the design sketched in https://github.com/pystatgen/sgkit/issues/341#issuecomment-847140397

There is still work to change existing window references to window_by_index. Also, the windowing code does not use Dask arrays, see #340 for follow on work to fix that.

jeromekelleher

LGTM! Some comments.

jeromekelleher · 2021-06-02T18:11:05Z

sgkit/window.py

@@ -12,27 +13,32 @@
 # Window definition (user code)


-def window(
+def window_by_index(


Not a strong opinion here, but have you considered window_by_variant? The index suffix here might refer to a number of things, and there would be a nice symmetry then between window_by_variant and window_by_position.

I hadn't thought of it, but window_by_variant might be better.

jeromekelleher · 2021-06-02T18:17:54Z

sgkit/window.py

+    ds: Dataset,
+    *,
+    size: int,
+    offset: int = 0,


could we add a num (or similar) option which is mutually exclusive to size that gives the number of windows? This would be really useful for plotting - I want to plot around 200 windows, e.g.

So the size argument corresponds to numpy.arange and the num argument corresponds to numpy.linspace?

Haven't thought through how offset fits in here.

window_by_index uses np.arange, so perhaps a num argument makes more sense there?

You could want "num" in either - maybe I want n windows with the same number of variants (or as close as possible) or I want n equally spaced windows along the genome. Both are valid things, I think, and would be useful.

True. I think we could add num later, since it's fairly easy to work around. For window_by_index, you can set size=ds.dims["variants"] // num. For window_by_position you can work out the size by adding up the lengths of the contigs and dividing by num.

However, this does raise a difference in the types of windowing when measuring by genomic position. The window_by_position function here produces one window per variant (position), which is what we want for computing ld_matrix (which came from Eric's implementation, and was influenced by Hail's locus_windows). This is subtly different to chopping the genome up into equally spaced windows measured by number of base pairs. I think we need both, but I'm not sure what the second one looks like from an API point of view (or what it should be called!) - particularly as it probably involves #434 as well.

Thinking about this more, I think the two ways of looking at windowing when measuring by genomic position can be generalized. I've introduced a variable called window_start_position that can be used to set window starts - so setting it to variant_position for example will create a window per variant (the ld_matrix case). Alternatively, leaving it unset will produce equally-spaced windows of size size across the genome (Jerome's case).

How does this look?

tomwhite · 2021-06-14T15:49:42Z

When discussing this issue in last week's meeting, it was clear that there was some confusion about naming and what these functions do. I've added some examples and more documentation which hopefully should help.

jeromekelleher

Sorry for the delay in getting back on this @tomwhite, but I really wanted to get what you're doing here. I think I'm still missing something basic though - see the comment above.

jeromekelleher · 2021-06-17T17:36:29Z

Dammit, github lost my comment! Basically I'm confused about why window_start and window_stop aren't genome coordinates, and why we won't have a window_index variable (which gives the index of the window for each variant). OK, it's a bit more storage space, but it would be simpler to work with, wouldn't it?

For window_by_position, consider the following example. What if we wanted to plot diversity in windows along the genome for two different datasets. If the window start/stop coords are based on the variant positions, then this will be quite messy.

I feel like I must be misunderstanding something though, so apologies for being slow here..

tomwhite · 2021-06-18T10:18:11Z

I think the answer to why window_start and window_stop aren't genome coordinates is because we started with windowing by variant (index) and not taking coordinates into account. (I think of them as "internal" variables that other methods use.) I'm not sure that storing them as genome coordinates helps much though, since they would have to be converted to indexes at some point anyway. And we can support genome coordinates by using np.searchsorted to compute window_start and window_stop from coordinates (as this PR does).

What if we wanted to plot diversity in windows along the genome for two different datasets. If the window start/stop coords are based on the variant positions, then this will be quite messy.

Would using window_by_position on each dataset, then plotting each (on the same chart) not work? It might be worth trying an example to see how it looks.

BTW I'm away all next week, but look forward to discussing this more when I get back!

jeromekelleher · 2021-06-18T10:59:54Z

Would using window_by_position on each dataset, then plotting each (on the same chart) not work? It might be worth trying an example to see how it looks.

It probably would, but what if I wanted to compare the values directly? I think it'd definitely want the windows to be strictly comparable across the two datasets, which they wouldn't be if the coordinates didn't align.

I think it's worth at least thinking through the alternative approach to windowing where we store the index of the window each variant belongs to.

BTW I'm away all next week, but look forward to discussing this more when I get back!

Have fun! And yes, I think it's definitely worth talking this through in more detail...

tomwhite · 2021-06-28T15:06:22Z

I think it's worth at least thinking through the alternative approach to windowing where we store the index of the window each variant belongs to.

Windows can overlap, so a variant may fall in an arbitrary number of windows. So I'm not sure how useful it would be as a representation.

hammer · 2021-06-28T15:24:25Z

I think it's worth at least thinking through the alternative approach to windowing where we store the index of the window each variant belongs to.

This sounds to me like a partitionining, which I suppose could be considered a special case of windowing. I can imagine certain workloads where we'd want to request a partioning that obeys certain characteristics, so perhaps we should file a separate issue to design an API and representation for that use case?

tomwhite · 2021-06-28T15:46:45Z

I think it'd definitely want the windows to be strictly comparable across the two datasets, which they wouldn't be if the coordinates didn't align.

We can do this within a single contig with something like the following:

>>> import sgkit as sg
>>> import numpy as np
>>> 
>>> # first dataset
>>> ds1 = sg.simulate_genotype_call_dataset(n_variant=5, n_sample=3, seed=0)
>>> ds1["variant_position"] = (["variants"], np.array([1, 4, 6, 8, 12]))
>>> ds1 = sg.window_by_position(ds1, size=5, offset=1)
>>> ds1
<xarray.Dataset>
Dimensions:             (windows: 3, variants: 5, alleles: 2, samples: 3, ploidy: 2)
Dimensions without coordinates: windows, variants, alleles, samples, ploidy
Data variables:
    window_contig       (windows) int64 0 0 0
    window_start        (windows) int64 0 2 4
    window_stop         (windows) int64 2 4 5
    variant_contig      (variants) int64 0 0 0 0 0
    variant_position    (variants) int64 1 4 6 8 12
    variant_allele      (variants, alleles) |S1 b'C' b'T' b'C' ... b'A' b'A'
    sample_id           (samples) <U2 'S0' 'S1' 'S2'
    call_genotype       (variants, samples, ploidy) int8 0 0 1 0 1 ... 0 1 1 1 1
    call_genotype_mask  (variants, samples, ploidy) bool False False ... False
Attributes:
    contigs:  [0]
>>> 
>>> # second dataset with different number of variants, and none in first window
>>> ds2 = sg.simulate_genotype_call_dataset(n_variant=4, n_sample=3, seed=1)
>>> ds2["variant_position"] = (["variants"], np.array([6, 11, 13, 17]))
>>> ds2 = sg.window_by_position(ds2, size=5, offset=1)
>>> ds2
<xarray.Dataset>
Dimensions:             (windows: 4, variants: 4, alleles: 2, samples: 3, ploidy: 2)
Dimensions without coordinates: windows, variants, alleles, samples, ploidy
Data variables:
    window_contig       (windows) int64 0 0 0 0
    window_start        (windows) int64 0 0 1 3
    window_stop         (windows) int64 0 1 3 4
    variant_contig      (variants) int64 0 0 0 0
    variant_position    (variants) int64 6 11 13 17
    variant_allele      (variants, alleles) |S1 b'T' b'C' b'T' ... b'A' b'T'
    sample_id           (samples) <U2 'S0' 'S1' 'S2'
    call_genotype       (variants, samples, ploidy) int8 1 0 1 0 1 ... 0 1 1 0 0
    call_genotype_mask  (variants, samples, ploidy) bool False False ... False
Attributes:
    contigs:  [0]
>>> 
>>> # positions of variants in window 2 (0-based) for each dataset
>>> ds1.sel(dict(variants=slice(ds1.window_start[2].values, ds1.window_stop[2].values))).variant_position.values
array([12])
>>> ds2.sel(dict(variants=slice(ds2.window_start[2].values, ds2.window_stop[2].values))).variant_position.values
array([11, 13])

So it should be possible to plot diversity for two datasets with something similar, although doing it for multiple contigs would be more complicated.

jeromekelleher · 2021-06-29T09:02:31Z

Windows can overlap, so a variant may fall in an arbitrary number of windows. So I'm not sure how useful it would be as a representation.

Right, I think this is what was bothering me and what I hadn't grokked. So if windows can overlap (and yes, it's better that they can), then what we have is a good approach.

We'll probably want some recipes and maybe some utility functions for doing simple plotting like I was talking about above, but these can be added later. Thanks for your patience!

tomwhite · 2021-06-29T15:04:13Z

Right, I think this is what was bothering me and what I hadn't grokked. So if windows can overlap (and yes, it's better that they can), then what we have is a good approach.

Thanks @jeromekelleher. (It wasn't until I started trying to implement your suggestion that I realised that overlapping windows caused a problem.)

codecov-commenter · 2021-07-05T08:55:21Z

Codecov Report

Merging #581 (f93e0d5) into main (57f36ab) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff            @@
##              main      #581   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           35        35           
  Lines         2754      2771   +17     
=========================================
+ Hits          2754      2771   +17

Impacted Files	Coverage Δ
sgkit/stats/ld.py	`100.00% <ø> (ø)`
sgkit/stats/popgen.py	`100.00% <ø> (ø)`
sgkit/__init__.py	`100.00% <100.00%> (ø)`
sgkit/window.py	`100.00% <100.00%> (ø)`

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 57f36ab...f93e0d5. Read the comment docs.

tomwhite force-pushed the window_by branch 2 times, most recently from 5f8eac7 to c413978 Compare May 25, 2021 12:59

jeromekelleher reviewed Jun 2, 2021

View reviewed changes

tomwhite force-pushed the window_by branch from c413978 to f9abeb8 Compare June 8, 2021 16:37

tomwhite added 3 commits June 14, 2021 16:33

Rename window to window_by_index, and add window_by_position

c72d871

Rename window_by_index to window_by_variant

9e63a71

window_start_position

790d7d5

tomwhite force-pushed the window_by branch from f9abeb8 to c69a1e8 Compare June 14, 2021 15:47

tomwhite marked this pull request as ready for review June 14, 2021 15:49

tomwhite added 2 commits June 14, 2021 17:00

Add examples for window_by_variant and window_by_position

9e8ca42

Remove references to renamed window function.

96731b8

tomwhite force-pushed the window_by branch from c69a1e8 to 96731b8 Compare June 14, 2021 16:01

jeromekelleher reviewed Jun 17, 2021

View reviewed changes

jeromekelleher approved these changes Jun 29, 2021

View reviewed changes

tomwhite added the auto-merge Auto merge label for mergify test flight label Jul 5, 2021

Merge branch 'main' into window_by

f93e0d5

mergify bot merged commit a7cb326 into sgkit-dev:main Jul 5, 2021

tomwhite mentioned this pull request Jul 8, 2021

Flexible specification of windows #341

Closed

Rename window to window_by_index, and add window_by_position #581

Rename window to window_by_index, and add window_by_position #581

Uh oh!

Conversation

tomwhite commented May 25, 2021

Uh oh!

jeromekelleher left a comment

Choose a reason for hiding this comment

Uh oh!

jeromekelleher Jun 2, 2021

Choose a reason for hiding this comment

Uh oh!

tomwhite Jun 3, 2021

Choose a reason for hiding this comment

Uh oh!

jeromekelleher Jun 2, 2021

Choose a reason for hiding this comment

Uh oh!

tomwhite Jun 3, 2021

Choose a reason for hiding this comment

Uh oh!

jeromekelleher Jun 3, 2021

Choose a reason for hiding this comment

Uh oh!

tomwhite Jun 7, 2021

Choose a reason for hiding this comment

Uh oh!

tomwhite Jun 8, 2021

Choose a reason for hiding this comment

Uh oh!

tomwhite commented Jun 14, 2021

Uh oh!

jeromekelleher left a comment

Choose a reason for hiding this comment

Uh oh!

jeromekelleher commented Jun 17, 2021

Uh oh!

tomwhite commented Jun 18, 2021

Uh oh!

jeromekelleher commented Jun 18, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

tomwhite commented Jun 28, 2021

Uh oh!

hammer commented Jun 28, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

tomwhite commented Jun 28, 2021

Uh oh!

jeromekelleher commented Jun 29, 2021

Uh oh!

tomwhite commented Jun 29, 2021

Uh oh!

codecov-commenter commented Jul 5, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

jeromekelleher commented Jun 18, 2021 •

edited

Loading

hammer commented Jun 28, 2021 •

edited

Loading

codecov-commenter commented Jul 5, 2021 •

edited

Loading