Mean of windowed popgen stats #662
Comments
This sounds like an excellent idea @timothymillar - do you think It's not obvious, is it?
I guess the issue with using One option is to create variable name for each of the normalisation options though that may result in too many variables? The simpler option would be to just provide the denominators, e.g.
Edit: Actually it's easy to get the number of variants but not as easy to get the base length.
I agree, I think the right option is to provide the denominators.
Looking into this it's not completely clear to me what should be reported as the base length of each window. If we use the If I use
Hmm, tricky! @tomwhite, any thoughts here?
Agreed. This should be fairly straightforward to calculate:

    positions = ds.variant_position.values
    base_lengths = positions[ds[window_stop].values - 1] + 1 - positions[ds[window_start].values]
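
To make the calculation above concrete, here is a minimal NumPy sketch on toy data. The positions and window index ranges are made up for illustration; in the thread they would come from `ds.variant_position`, `ds[window_start]` and `ds[window_stop]`.

```python
import numpy as np

# Hypothetical toy data: positions of six variants on a single contig,
# and half-open index ranges [start, stop) of two windows over them.
positions = np.array([3, 7, 12, 20, 21, 40])
window_starts = np.array([0, 3])
window_stops = np.array([3, 6])

# Span from the first to the last variant in each window, inclusive.
base_lengths = positions[window_stops - 1] + 1 - positions[window_starts]
print(base_lengths)  # [10 21]
```

Note that this measures the distance between the outermost variants in each window, so it assumes every window contains at least one variant.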
So any windows that extend beyond the chromosome length need to be clipped. We don't have chromosome lengths stored anywhere at the moment (see #464), but assuming we had that information, we could find the base lengths with something like the following (not extensively tested):

    contig_lengths = np.array([15, 58])  # get from VCF metadata
    window_starts = ds[window_start].values
    # contig length for each window, looked up via the window's first variant
    window_contigs = ds["variant_contig"].values[window_starts]
    max_lengths = contig_lengths[window_contigs]
    positions = ds.variant_position.values
    window_start_positions = positions[window_starts]
    # `size` is the window size in bases
    window_stop_positions = window_start_positions + size
    window_stop_positions = np.clip(window_stop_positions, None, max_lengths)
    base_lengths = window_stop_positions - window_start_positions

Should the window functions add a
Just a thought, would it be worth adding
Yes, that's better. Perhaps This will take us into coordinate system territory (#434). (I realised the code I posted also needs to clip start positions to be at least 0 or 1, depending on the coordinate system in use.)
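
As a rough sketch of clipping both ends, assuming we already know the nominal start coordinate of each window and the length of its contig, something like the following could work. The function name `clipped_base_lengths` and the `first_position` parameter are hypothetical, not part of any existing API.

```python
import numpy as np

def clipped_base_lengths(start_positions, size, contig_lengths, first_position=1):
    """Length in bases of fixed-size windows, clipped to their contig.

    start_positions: nominal start coordinate of each window
    size: nominal window size in bases
    contig_lengths: length of the contig each window lies on (one per window)
    first_position: 1 for 1-based coordinates, 0 for 0-based (an assumption
        the caller has to make until the coordinate system is settled)
    """
    start_positions = np.asarray(start_positions)
    contig_lengths = np.asarray(contig_lengths)
    stops = start_positions + size  # exclusive stop coordinate
    # Windows cannot start before the first base or end after the last one.
    starts = np.clip(start_positions, first_position, None)
    stops = np.clip(stops, None, contig_lengths + first_position)
    return stops - starts

# 1-based contig of length 15, window size 10, first window offset to -4.
print(clipped_base_lengths([-4, 6], 10, [15, 15]))  # [ 5 10]
```

With 0-based coordinates the same contig split at 0 and 10 gives lengths 10 and 5, so the clipped lengths sum to the contig length in both conventions.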
Currently the windowed aggregation of statistics in popgen.py is hard-coded to use np.sum [1, 2, 3, 4]. Would it be possible to make this aggregation optional or have a span_normalise argument as in Tskit?

Another option would be to record the number of loci in each window so that the user can manually average the sums. Or to return both a stat_sum and stat_mean variable by default.
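
To illustrate the last two options, here is a minimal NumPy sketch that returns the per-window sum, the number of loci in each window, and the mean derived from them. It works on plain arrays rather than an sgkit dataset, and the function name `windowed_sum_and_mean` is hypothetical.

```python
import numpy as np

def windowed_sum_and_mean(values, window_starts, window_stops):
    """Per-window sum, locus count and mean of a per-variant statistic.

    Windows are half-open index ranges [start, stop) over `values`.
    """
    values = np.asarray(values, dtype=float)
    window_starts = np.asarray(window_starts)
    window_stops = np.asarray(window_stops)
    sums = np.array(
        [values[a:b].sum() for a, b in zip(window_starts, window_stops)]
    )
    counts = window_stops - window_starts
    # Empty windows have no defined mean, so report NaN for them.
    with np.errstate(invalid="ignore", divide="ignore"):
        means = np.where(counts > 0, sums / counts, np.nan)
    return sums, counts, means

sums, counts, means = windowed_sum_and_mean([1, 2, 3, 4, 6], [0, 3], [3, 5])
print(sums, counts, means)  # [ 6. 10.] [3 2] [2. 5.]
```

Returning the counts alongside the sums lets the user pick the denominator, which is the trade-off discussed in the comments.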