Skip to content

Rename window to window_by_index, and add window_by_position #581

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Jul 5, 2021

Conversation

tomwhite
Copy link
Collaborator

This is a draft implementation of the design sketched in https://github.com/pystatgen/sgkit/issues/341#issuecomment-847140397

There is still work to change existing window references to window_by_index. Also, the windowing code does not use Dask arrays, see #340 for follow on work to fix that.

@tomwhite tomwhite force-pushed the window_by branch 2 times, most recently from 5f8eac7 to c413978 Compare May 25, 2021 12:59
Copy link
Collaborator

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Some comments.

sgkit/window.py Outdated
@@ -12,27 +13,32 @@
# Window definition (user code)


def window(
def window_by_index(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a strong opinion here, but have you considered window_by_variant? The index suffix here might refer to a number of things, and there would be a nice symmetry then between window_by_variant and window_by_position.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hadn't thought of it, but window_by_variant might be better.

ds: Dataset,
*,
size: int,
offset: int = 0,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we add a num (or similar) option which is mutually exclusive to size that gives the number of windows? This would be really useful for plotting - I want to plot around 200 windows, e.g.

So the size argument corresponds to numpy.arange and the num argument corresponds to numpy.linspace?

Haven't thought through how offset fits in here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

window_by_index uses np.arange, so perhaps a num argument makes more sense there?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could want "num" in either - maybe I want n windows with the same number of variants (or as close as possible) or I want n equally spaced windows along the genome. Both are valid things, I think, and would be useful.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True. I think we could add num later, since it's fairly easy to work around. For window_by_index, you can set size=ds.dims["variants"] // num. For window_by_position you can work out the size by adding up the lengths of the contigs and dividing by num.

However, this does raise a difference in the types of windowing when measuring by genomic position. The window_by_position function here produces one window per variant (position), which is what we want for computing ld_matrix (which came from Eric's implementation, and was influenced by Hail's locus_windows). This is subtly different to chopping the genome up into equally spaced windows measured by number of base pairs. I think we need both, but I'm not sure what the second one looks like from an API point of view (or what it should be called!) - particularly as it probably involves #434 as well.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about this more, I think the two ways of looking at windowing when measuring by genomic position can be generalized. I've introduced a variable called window_start_position that can be used to set window starts - so setting it to variant_position for example will create a window per variant (the ld_matrix case). Alternatively, leaving it unset will produce equally-spaced windows of size size across the genome (Jerome's case).

How does this look?

@tomwhite
Copy link
Collaborator Author

When discussing this issue in last week's meeting, it was clear that there was some confusion about naming and what these functions do. I've added some examples and more documentation which hopefully should help.

@tomwhite tomwhite marked this pull request as ready for review June 14, 2021 15:49
Copy link
Collaborator

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay in getting back on this @tomwhite, but I really wanted to get what you're doing here. I think I'm still missing something basic though - see the comment above.

@jeromekelleher
Copy link
Collaborator

Dammit, github lost my comment! Basically I'm confused about why window_start and window_stop aren't genome coordinates, and why we won't have a window_index variable (which gives the index of the window for each variant). OK, it's a bit more storage space, but it would be simpler to work with, wouldn't it?

For window_by_position, consider the following example. What if we wanted to plot diversity in windows along the genome for two different datasets. If the window start/stop coords are based on the variant positions, then this will be quite messy.

I feel like I must be misunderstanding something though, so apologies for being slow here..

@tomwhite
Copy link
Collaborator Author

I think the answer to why window_start and window_stop aren't genome coordinates is because we started with windowing by variant (index) and not taking coordinates into account. (I think of them as "internal" variables that other methods use.) I'm not sure that storing them as genome coordinates helps much though, since they would have to be converted to indexes at some point anyway. And we can support genome coordinates by using np.searchsorted to compute window_start and window_stop from coordinates (as this PR does).

What if we wanted to plot diversity in windows along the genome for two different datasets. If the window start/stop coords are based on the variant positions, then this will be quite messy.

Would using window_by_position on each dataset, then plotting each (on the same chart) not work? It might be worth trying an example to see how it looks.

BTW I'm away all next week, but look forward to discussing this more when I get back!

@jeromekelleher
Copy link
Collaborator

jeromekelleher commented Jun 18, 2021

Would using window_by_position on each dataset, then plotting each (on the same chart) not work? It might be worth trying an example to see how it looks.

It probably would, but what if I wanted to compare the values directly? I think it'd definitely want the windows to be strictly comparable across the two datasets, which they wouldn't be if the coordinates didn't align.

I think it's worth at least thinking through the alternative approach to windowing where we store the index of the window each variant belongs to.

BTW I'm away all next week, but look forward to discussing this more when I get back!

Have fun! And yes, I think it's definitely worth talking this through in more detail...

@tomwhite
Copy link
Collaborator Author

I think it's worth at least thinking through the alternative approach to windowing where we store the index of the window each variant belongs to.

Windows can overlap, so a variant may fall in an arbitrary number of windows. So I'm not sure how useful it would be as a representation.

@hammer
Copy link
Contributor

hammer commented Jun 28, 2021

I think it's worth at least thinking through the alternative approach to windowing where we store the index of the window each variant belongs to.

This sounds to me like a partitionining, which I suppose could be considered a special case of windowing. I can imagine certain workloads where we'd want to request a partioning that obeys certain characteristics, so perhaps we should file a separate issue to design an API and representation for that use case?

@tomwhite
Copy link
Collaborator Author

I think it'd definitely want the windows to be strictly comparable across the two datasets, which they wouldn't be if the coordinates didn't align.

We can do this within a single contig with something like the following:

>>> import sgkit as sg
>>> import numpy as np
>>> 
>>> # first dataset
>>> ds1 = sg.simulate_genotype_call_dataset(n_variant=5, n_sample=3, seed=0)
>>> ds1["variant_position"] = (["variants"], np.array([1, 4, 6, 8, 12]))
>>> ds1 = sg.window_by_position(ds1, size=5, offset=1)
>>> ds1
<xarray.Dataset>
Dimensions:             (windows: 3, variants: 5, alleles: 2, samples: 3, ploidy: 2)
Dimensions without coordinates: windows, variants, alleles, samples, ploidy
Data variables:
    window_contig       (windows) int64 0 0 0
    window_start        (windows) int64 0 2 4
    window_stop         (windows) int64 2 4 5
    variant_contig      (variants) int64 0 0 0 0 0
    variant_position    (variants) int64 1 4 6 8 12
    variant_allele      (variants, alleles) |S1 b'C' b'T' b'C' ... b'A' b'A'
    sample_id           (samples) <U2 'S0' 'S1' 'S2'
    call_genotype       (variants, samples, ploidy) int8 0 0 1 0 1 ... 0 1 1 1 1
    call_genotype_mask  (variants, samples, ploidy) bool False False ... False
Attributes:
    contigs:  [0]
>>> 
>>> # second dataset with different number of variants, and none in first window
>>> ds2 = sg.simulate_genotype_call_dataset(n_variant=4, n_sample=3, seed=1)
>>> ds2["variant_position"] = (["variants"], np.array([6, 11, 13, 17]))
>>> ds2 = sg.window_by_position(ds2, size=5, offset=1)
>>> ds2
<xarray.Dataset>
Dimensions:             (windows: 4, variants: 4, alleles: 2, samples: 3, ploidy: 2)
Dimensions without coordinates: windows, variants, alleles, samples, ploidy
Data variables:
    window_contig       (windows) int64 0 0 0 0
    window_start        (windows) int64 0 0 1 3
    window_stop         (windows) int64 0 1 3 4
    variant_contig      (variants) int64 0 0 0 0
    variant_position    (variants) int64 6 11 13 17
    variant_allele      (variants, alleles) |S1 b'T' b'C' b'T' ... b'A' b'T'
    sample_id           (samples) <U2 'S0' 'S1' 'S2'
    call_genotype       (variants, samples, ploidy) int8 1 0 1 0 1 ... 0 1 1 0 0
    call_genotype_mask  (variants, samples, ploidy) bool False False ... False
Attributes:
    contigs:  [0]
>>> 
>>> # positions of variants in window 2 (0-based) for each dataset
>>> ds1.sel(dict(variants=slice(ds1.window_start[2].values, ds1.window_stop[2].values))).variant_position.values
array([12])
>>> ds2.sel(dict(variants=slice(ds2.window_start[2].values, ds2.window_stop[2].values))).variant_position.values
array([11, 13])

So it should be possible to plot diversity for two datasets with something similar, although doing it for multiple contigs would be more complicated.

@jeromekelleher
Copy link
Collaborator

Windows can overlap, so a variant may fall in an arbitrary number of windows. So I'm not sure how useful it would be as a representation.

Right, I think this is what was bothering me and what I hadn't grokked. So if windows can overlap (and yes, it's better that they can), then what we have is a good approach.

We'll probably want some recipes and maybe some utility functions for doing simple plotting like I was talking about above, but these can be added later. Thanks for your patience!

@tomwhite
Copy link
Collaborator Author

Right, I think this is what was bothering me and what I hadn't grokked. So if windows can overlap (and yes, it's better that they can), then what we have is a good approach.

Thanks @jeromekelleher. (It wasn't until I started trying to implement your suggestion that I realised that overlapping windows caused a problem.)

@tomwhite tomwhite added the auto-merge Auto merge label for mergify test flight label Jul 5, 2021
@codecov-commenter
Copy link

codecov-commenter commented Jul 5, 2021

Codecov Report

Merging #581 (f93e0d5) into main (57f36ab) will not change coverage.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff            @@
##              main      #581   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           35        35           
  Lines         2754      2771   +17     
=========================================
+ Hits          2754      2771   +17     
Impacted Files Coverage Δ
sgkit/stats/ld.py 100.00% <ø> (ø)
sgkit/stats/popgen.py 100.00% <ø> (ø)
sgkit/__init__.py 100.00% <100.00%> (ø)
sgkit/window.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 57f36ab...f93e0d5. Read the comment docs.

@mergify mergify bot merged commit a7cb326 into sgkit-dev:main Jul 5, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-merge Auto merge label for mergify test flight
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants