Skip to content

Mean of windowed popgen stats #662

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
timothymillar opened this issue Sep 9, 2021 · 8 comments
Open

Mean of windowed popgen stats #662

timothymillar opened this issue Sep 9, 2021 · 8 comments

Comments

@timothymillar
Copy link
Collaborator

timothymillar commented Sep 9, 2021

Currently the windowed aggregation of statistics in popgen.py is hard-coded to use np.sum [1, 2, 3, 4]. Would it be possible to make this aggregation optional or have a span_normalise argument as in Tskit?

Another option would be to record the number of loci in each window so that the user can manual average the sums. Or to return both a stat_sum and stat_mean variable by default.

@jeromekelleher
Copy link
Collaborator

This sounds like an excellent idea @timothymillar - do you think span_normalise would be the appropriate thing here? I guess sometimes we'll want to normalise by the sequence length (number of bases) but also want to normalise by the number of variants.

It's not obvious, is it?

@timothymillar
Copy link
Collaborator Author

timothymillar commented Sep 15, 2021

I guess the issue with using span_normalise to alter the output is that any downstream methods can't determine if and how a variable has been normalised. So it's not safe to reuse the same variable name for what is what is actually a different variable.

One option is to create variable name for each of the normalisation options though that may result in too many variables?

The simpler option would be to just provide the denominators, e.g. window_n_bases and window_n_variants, so that a user can normalise the result themselves. Currently it's easy to compute window_n_bases from window_start and window_stop but not easy to get window_n_variants.

Edit: Actually it's easy to get the number of variants but not as easy to get the base length.

@jeromekelleher
Copy link
Collaborator

I agree, I think the right option is to provide the denominators.

@timothymillar
Copy link
Collaborator Author

Looking into this it's not completely clear to me what should be reported as the base length of each window. If we use the window_start and window_stop indices to calculate the base length from variant_position then we get the distance between the first and last variants within the window rather than its intended size. This could be misleading, for example if one was aiming to calculate the frequency of variable loci in fixed sized windows.

If I use window_by_variant then I'd expect the reported base length to be the distance between the first and last variant of each window. However, if I used window_by_position I would expect the base length to be the specified size except where a window extends over the end of a chromosome. If a window extends beyond a chromosome then I'd expect the length to be the size of the intercept between the window and the chromosome. This suggests the base lengths need to be calculated within each windowing function.

@jeromekelleher
Copy link
Collaborator

Hmm, tricky! @tomwhite, any thoughts here?

@tomwhite
Copy link
Collaborator

tomwhite commented Oct 4, 2021

If I use window_by_variant then I'd expect the reported base length to be the distance between the first and last variant of each window.

Agreed. This should be fairly straightforward to calculate:

positions = ds.variant_position.values
base_lengths = positions[ds[window_stop].values - 1] + 1 - (positions[ds[window_start].values])

However, if I used window_by_position I would expect the base length to be the specified size except where a window extends over the end of a chromosome. If a window extends beyond a chromosome then I'd expect the length to be the size of the intercept between the window and the chromosome.

So any windows that extend over the chromosome length needs to be clipped. We don't have chromosome lengths stored anywhere at the moment (see #464), but assuming we had that information, we could find the base lengths with something like the following (not extensively tested):

contig_lengths = np.array([15, 58]) # get from VCF metadata

contig_ids = np.arange(n_contigs)
variant_contig = ds["variant_contig"]
contig_starts = np.searchsorted(variant_contig.values, contig_ids)
contig_stops = np.searchsorted(variant_contig.values, contig_ids, side="right")

max_lengths = np.empty_like(ds[window_start].values)
for i in range(n_contigs):
    max_lengths[contig_starts[i]:contig_stops[i]] = contig_lengths[i]

positions = ds.variant_position.values
window_start_positions = positions[ds[window_start].values]
window_stop_positions = window_start_positions + size
window_stop_positions = np.clip(window_stop_positions, None, max_lengths)
base_lengths = window_stop_positions - window_start_positions

Should the window functions add a window_base_length variable?

@timothymillar
Copy link
Collaborator Author

Should the window functions add a window_base_length variable?

Just a thought, would it be worth adding window_base_start and window_base_stop instead? That would allow a more direct translation to and from BED, GFF and etc in future, if that's useful?

@tomwhite
Copy link
Collaborator

tomwhite commented Oct 5, 2021

Just a thought, would it be worth adding window_base_start and window_base_stop instead?

Yes, that's better. Perhaps window_start_position and window_stop_position to echo the variant_position variable?

This will take us into coordinate system territory (#434). (I realised the code I posted also needs to clip start positions to be at least 0 or 1, depending on the coordinate system in use.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants