Mean of windowed popgen stats #662

timothymillar · 2021-09-09T22:44:19Z

Currently the windowed aggregation of statistics in popgen.py is hard-coded to use np.sum [1, 2, 3, 4]. Would it be possible to make this aggregation optional or have a span_normalise argument as in Tskit?

Another option would be to record the number of loci in each window so that the user can manual average the sums. Or to return both a stat_sum and stat_mean variable by default.

The text was updated successfully, but these errors were encountered:

jeromekelleher · 2021-09-15T14:13:44Z

This sounds like an excellent idea @timothymillar - do you think span_normalise would be the appropriate thing here? I guess sometimes we'll want to normalise by the sequence length (number of bases) but also want to normalise by the number of variants.

It's not obvious, is it?

timothymillar · 2021-09-15T21:32:11Z

I guess the issue with using span_normalise to alter the output is that any downstream methods can't determine if and how a variable has been normalised. So it's not safe to reuse the same variable name for what is what is actually a different variable.

One option is to create variable name for each of the normalisation options though that may result in too many variables?

The simpler option would be to just provide the denominators, e.g. window_n_bases and window_n_variants, so that a user can normalise the result themselves. ~~Currently it's easy to compute window_n_bases from window_start and window_stop but not easy to get window_n_variants~~.

Edit: Actually it's easy to get the number of variants but not as easy to get the base length.

jeromekelleher · 2021-09-16T08:40:57Z

I agree, I think the right option is to provide the denominators.

timothymillar · 2021-09-29T07:47:49Z

Looking into this it's not completely clear to me what should be reported as the base length of each window. If we use the window_start and window_stop indices to calculate the base length from variant_position then we get the distance between the first and last variants within the window rather than its intended size. This could be misleading, for example if one was aiming to calculate the frequency of variable loci in fixed sized windows.

If I use window_by_variant then I'd expect the reported base length to be the distance between the first and last variant of each window. However, if I used window_by_position I would expect the base length to be the specified size except where a window extends over the end of a chromosome. If a window extends beyond a chromosome then I'd expect the length to be the size of the intercept between the window and the chromosome. This suggests the base lengths need to be calculated within each windowing function.

jeromekelleher · 2021-09-29T11:05:47Z

Hmm, tricky! @tomwhite, any thoughts here?

tomwhite · 2021-10-04T16:50:50Z

If I use window_by_variant then I'd expect the reported base length to be the distance between the first and last variant of each window.

Agreed. This should be fairly straightforward to calculate:

positions = ds.variant_position.values
base_lengths = positions[ds[window_stop].values - 1] + 1 - (positions[ds[window_start].values])

However, if I used window_by_position I would expect the base length to be the specified size except where a window extends over the end of a chromosome. If a window extends beyond a chromosome then I'd expect the length to be the size of the intercept between the window and the chromosome.

So any windows that extend over the chromosome length needs to be clipped. We don't have chromosome lengths stored anywhere at the moment (see #464), but assuming we had that information, we could find the base lengths with something like the following (not extensively tested):

contig_lengths = np.array([15, 58]) # get from VCF metadata

contig_ids = np.arange(n_contigs)
variant_contig = ds["variant_contig"]
contig_starts = np.searchsorted(variant_contig.values, contig_ids)
contig_stops = np.searchsorted(variant_contig.values, contig_ids, side="right")

max_lengths = np.empty_like(ds[window_start].values)
for i in range(n_contigs):
    max_lengths[contig_starts[i]:contig_stops[i]] = contig_lengths[i]

positions = ds.variant_position.values
window_start_positions = positions[ds[window_start].values]
window_stop_positions = window_start_positions + size
window_stop_positions = np.clip(window_stop_positions, None, max_lengths)
base_lengths = window_stop_positions - window_start_positions

Should the window functions add a window_base_length variable?

timothymillar · 2021-10-04T20:36:04Z

Should the window functions add a window_base_length variable?

Just a thought, would it be worth adding window_base_start and window_base_stop instead? That would allow a more direct translation to and from BED, GFF and etc in future, if that's useful?

tomwhite · 2021-10-05T07:56:16Z

Just a thought, would it be worth adding window_base_start and window_base_stop instead?

Yes, that's better. Perhaps window_start_position and window_stop_position to echo the variant_position variable?

This will take us into coordinate system territory (#434). (I realised the code I posted also needs to clip start positions to be at least 0 or 1, depending on the coordinate system in use.)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Mean of windowed popgen stats #662

Mean of windowed popgen stats #662

timothymillar commented Sep 9, 2021 •

edited

Loading

jeromekelleher commented Sep 15, 2021

Uh oh!

timothymillar commented Sep 15, 2021 •

edited

Loading

Uh oh!

jeromekelleher commented Sep 16, 2021

Uh oh!

timothymillar commented Sep 29, 2021

Uh oh!

jeromekelleher commented Sep 29, 2021

Uh oh!

tomwhite commented Oct 4, 2021

Uh oh!

timothymillar commented Oct 4, 2021

Uh oh!

tomwhite commented Oct 5, 2021

Uh oh!

Mean of windowed popgen stats #662

Mean of windowed popgen stats #662

Comments

timothymillar commented Sep 9, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

jeromekelleher commented Sep 15, 2021

Uh oh!

timothymillar commented Sep 15, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jeromekelleher commented Sep 16, 2021

Uh oh!

timothymillar commented Sep 29, 2021

Uh oh!

jeromekelleher commented Sep 29, 2021

Uh oh!

tomwhite commented Oct 4, 2021

Uh oh!

timothymillar commented Oct 4, 2021

Uh oh!

tomwhite commented Oct 5, 2021

Uh oh!

timothymillar commented Sep 9, 2021 •

edited

Loading

timothymillar commented Sep 15, 2021 •

edited

Loading