-
Notifications
You must be signed in to change notification settings - Fork 35
Simulate genotypes for unit tests #23
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
+1 @eric-czech when we figure out the right toolset for (test) data generation it could be valuable to expose them as strategies in |
I'm impressed you've gotten this far with the new undocumented APIs @eric-czech! You're just missing a "length" parameter to tell it how much sequence to simulate: model = msprime.Demography.island_model(2, migration_rate=1e-3, Ne=10**4)
samples = model.sample(10, 10) # 10 haploids from each pop
ts = msprime.simulate(
samples=samples,
demography=model,
length=10**6,
recombination_rate=1e-8
)
ts = msprime.mutate(ts, model=msprime.JukesCantor(), rate=1e-8, discrete=True)
print(ts.num_sites)
for variant in ts.variants():
print(int(variant.site.position), variant.alleles, variant.genotypes, sep="\t") This gives ~3000 sites, that look like
Using the new APIs we can simulate mutations at discrete sites, with lots of different mutation models. We're using Jukes-Cantor here, as it's likely fine for the vast majority of stuff. We can easily bias towards different bases, etc, if this ever becomes something we want. These discrete mutation models mean that we can have multiple mutations per site. This has happened at site 642507 above, which has three alleles, but only two of them are actually inherited by the samples. This is something to watch out for, I think. Re HWE, I'll have a bit of a think about that. |
Aha! Thanks, works for me.
Good to know, will definitely watch out for that.
Most of the tests I've been thinking of (not just HWE) would ideally operate by knowing going into the simulation which variants or samples would have certain properties coming out of it. For example, I'd like to say that variants with indexes 1000 through 2000 end up with genotypes across samples that are in HWE so that I can check that HWE tests for those same variants come back not significant. Is that kind of a priori knowledge about samples/variants at odds with how coalescent simulations work? I haven't had a chance to learn much about them yet so forgive my ignorance. |
Yea, I don't see why not |
You can't really say anything about particular sites if they've been thrown randomly on the trees - each site is draw from the underlying mutation process. For HWE, I guess if you pair up the haploid samples within a population which has been isolated for a long time, then they should come out in HWE. Whereas, if you pair up samples between populations, they should not. Something like this: import msprime
import numpy as np
model = msprime.Demography.island_model(2, migration_rate=0, Ne=10**4)
# The populations are isolated until 100k generations ago
model.events.append(
msprime.MassMigration(time=100000, source=1, dest=0))
samples = model.sample(10, 10) # 10 haploids from each pop
ts = msprime.simulate(
samples=samples,
demography=model,
length=10**6,
recombination_rate=1e-8
)
ts = msprime.mutate(ts, model=msprime.JukesCantor(), rate=1e-8, discrete=True)
G = ts.genotype_matrix()
# Within pop-diploids:
W = [(G[:,2 * j], G[:, 2 * j + 1]) for j in range(10)]
# Between pop-diploids:
B = [(G[:,j], G[:, 10 + j]) for j in range(10)]
for j in range(10):
print("B", np.sum(B[j][0] != B[j][1]))
print("W", np.sum(W[j][0] != W[j][1])) Here I'm just printing out the number of differences between the two haploids in our psuedo diploids. Unsurprisingly, we get way more differences between the between-pop diploids. I think something like this should be helpful? It's these sort of scenarious we want to create anyway, with old splits between the populations ensuring that there's lots of divergence, but still with a fair bit of shared diversity because of deep coalescence. For creating sites that are in HWE and not, you could mix and match the bits of the genotype matrix afterwards to create the patterns that you want, I think. Does this help? |
Thanks @jeromekelleher! Very helpful example.
Makes sense, I think it would certainly be interesting to relate exact HWE test statistic distributions to much more intuitive changes in an underlying simulation model like that.
I'm assuming this determines when the two populations split right? What happens if you don't specify this for a 2 population island model?
I think there are several different characteristics of an allele that HWE filters are attempting to eliminate and I would say that example is simulating an extreme form of something like assortative mating (which is one of those characteristics)? I can imagine how I might be able to emulate some others by mixing up the genotype matrix, but the two more common ones I'm aware of are selection and inbreeding. It would be useful to see how HWE tests behave as those things are increased/decreased along some continuous spectrum. Would you happen to know off-hand of good examples of how I could parameterize those things in a model? This getting away from the simple unit testing case a bit but I can see how this could shape up into some deeper tests. Or it could make for great examples of how something like an HWE statistic works and what it can do (for our docs). |
Sorry @eric-czech this one sunk to the bottom for a while.
That's right, yeah. In an island model the samples will coalesce eventually, because as lineages migrate between the two populations they both have to be in the same population to be able to coalesce. Basically, the branches of the trees will be longer, and the signals of geographic structure will be much less obvious.
Neither of these are things we can do directly with a simple coalescent model, but approximations are available. The easiest thing to do is to use a forwards-time simulator like SLiM or fwdpy11. I wouldn't imagine picking out signals with this sort of subtlety is something that we could do in unit tests though - this seems more like something we'd do as part of a validation suite. For unit tests, I think we would be happy if we can get fairly crude expected results when we have extreme structure/very strong signals. For a validation suite, we'd want to have a more nuanced set of tests, where results might be plots of various forms rather than simple yes-no assertions. Do we have an issue talking about a validation suite (as opposed to performance benchmarks)? |
Not that I know of. Maybe it would make sense to colocate benchmarking, validation, and public dataset documentation/scripts in a separate repo (making them all related to https://github.com/pystatgen/sgkit/pull/9)? |
That’s my preferred outcome though I am not a developer so if there are good reasons to keep it in a folder in this repo that’s fine too. |
I initially suggested putting them in this repo, but I'm not opposed to moving them out. I can see that if there were large datasets in the validation repo it would be good to keep it separate so new developers to the main repo can submit a bugfix PR without having to check out lots of unrelated stuff. But that may not be a very compelling reason. We have yet to build Sphinx documentation from multiple repos, so it would be good to figure out how to do that as a part of setting up a separate repo. |
I agree that's critical for the IO repos but for I'm not sure we need to include the validation repo documentation on the main documentation website. It may end up making sense to do so, though. |
We have |
I think we can - closing. |
Uh oh!
There was an error while loading. Please reload this page.
I'm trying to simulate some genotype calls for unit tests using a simple model from
msprime
. At @jeromekelleher's suggestion, I tried this:What is
demography
supposed to be? I saw a "FIXME" for documenting it insimulate
but I wasn't sure what to set it to if not theDemography
instance that comes back fromisland_model
.Also, is this why I'm not getting any variants back out?
I am ultimately trying to simulate hardy weinberg equilibrium. I'd like to have some fraction of the variants be in or near perfect equilibrium and the remainder way out of it. If you get a chance, could I get an assist @jeromekelleher?
The text was updated successfully, but these errors were encountered: