Skip to content

allele frequency function #504

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
petrelharp opened this issue Mar 30, 2020 · 16 comments
Closed

allele frequency function #504

petrelharp opened this issue Mar 30, 2020 · 16 comments
Labels
enhancement New feature or request future Issues that are closed as they are not planned in the medium-term, but which are still desirable. Python API Issue is about the Python API statistics

Comments

@petrelharp
Copy link
Contributor

petrelharp commented Mar 30, 2020

We should implement the TreeSequence.allele_frequencies(sample_sets) function, which returns a numpy array of (non-ancestral allele frequencies) x (sample_sets).

Here's an implementation:

def allele_frequencies(ts, sample_sets=None):
    if sample_sets is None:
       sample_sets = [ts.samples()] 
    n = np.array([len(x) for x in sample_sets])
    def f(x):
       return x / n
    return ts.sample_count_stat(sample_sets, f, len(sample_sets), windows='sites', polarised=True, mode='site', strict=False, span_normalise=False)

Edit: originally this omitted span_normalise=False.

@jeromekelleher
Copy link
Member

Nice, good idea.

@jeromekelleher jeromekelleher added this to the Python 0.2.4 milestone Mar 30, 2020
@molpopgen
Copy link
Member

It may be nice to optionally return the joint fs as a sparse matrix.

@petrelharp
Copy link
Contributor Author

It may be nice to optionally return the joint fs as a sparse matrix.

I'm not sure what you mean here?

@molpopgen
Copy link
Member

I assume your return value is the marginal fs in each sample set, but sometimes you want the joint fs to instead, such that fs[2,3] is the number of mutations present twice in sample set 0 and three times in set 1. Or maybe I'm not understanding what you are returning here.

@petrelharp
Copy link
Contributor Author

That sounds like the AFS? Which we already have? Here I'm wanting to return the list of allele frequencies for each SNP.

@molpopgen
Copy link
Member

Gotcha--my mistake.

@jeromekelleher
Copy link
Member

That sounds like the AFS? Which we already have? Here I'm wanting to return the list of allele frequencies for each SNP.

Triggered my AFS implementation PTSD there - would not want to do that again!

@molpopgen
Copy link
Member

molpopgen commented Mar 31, 2020 via email

@petrelharp
Copy link
Contributor Author

This computes the total frequency of all mutations that are segregating in the samples, by site. People will also want to know the frequencies of mutations, separately: we should have another method that returns something of length equal to the number of mutations. I don't think we can compute this easily with the general stat framework, but here is a method to pull out this information for the mutations at a particular site:

def mut_counts(site):
    counts = {}
    t = ts.at(site.position)
    for m in site.mutations:
        counts[m.id] = t.num_samples(m.node)
        if m.parent != tskit.NULL:
            assert(m.parent in counts)
            counts[m.id] -= counts[m.parent]
    return counts

@jeromekelleher jeromekelleher added enhancement New feature or request Python API Issue is about the Python API labels Sep 29, 2020
@jeromekelleher jeromekelleher removed this from the Python upcoming milestone Sep 29, 2020
@petrelharp
Copy link
Contributor Author

petrelharp commented Jun 1, 2021

Another related thing: it'd be nice to have a method that returns a numpy array giving the number of alleles at each site, as discussed in this discussion.

@benjeffery benjeffery added this to the Python upcoming milestone Jun 9, 2021
@jeromekelleher
Copy link
Member

jeromekelleher commented Aug 4, 2021

Here's a function I just wrote which I think would be useful to have in the library in some form:

def count_site_alleles(ts, tree, site):
    counts = collections.Counter({site.ancestral_state: ts.num_samples})
    for m in site.mutations:
        current_state = site.ancestral_state
        if m.parent != tskit.NULL:
            current_state = ts.mutation(m.parent).derived_state
        # Silent mutations do nothing
        if current_state != m.derived_state:
            num_samples = tree.num_samples(m.node)
            counts[m.derived_state] += num_samples
            counts[current_state] -= num_samples
    return counts

def count_ancestral(ts):
    num_ancestral = np.zeros(ts.num_sites, dtype=int)
    for tree in ts.trees():
        for site in tree.sites():
            counts = count_site_alleles(ts, tree, site)
            num_ancestral[site.id] = counts[site.ancestral_state]
    return num_ancestral

@jeromekelleher
Copy link
Member

So, should we add a method count_alleles(site) to the Tree class? So the code above would look like

def count_ancestral(ts):
    num_ancestral = np.zeros(ts.num_sites, dtype=int)
    for tree in ts.trees():
        for site in tree.sites():
            counts = tree.count_alleles(site)
            num_ancestral[site.id] = counts[site.ancestral_state]
    return num_ancestral

We might want a more general method that can do this within specified sample sets, but this couldn't be a method of the Tree like this, so I think this simpler version is a useful building block to have anyway.

Site could either be an instance of Site or be an integer ID. If the site is not in the current trees interval we raise an error.

@petrelharp
Copy link
Contributor Author

I like that idea!

@jeromekelleher
Copy link
Member

Made a new issue in #1610. I guess we should keep this issue open in case there's higher-level counting operations we want to do?

@mufernando
Copy link
Member

@petrelharp your original allele_frequencies function should specify ts.sample_count_stat(..., span_normalise=False)!

@petrelharp
Copy link
Contributor Author

doh! I will edit.

@benjeffery benjeffery removed this from the Python upcoming milestone Jun 10, 2025
@benjeffery benjeffery added the future Issues that are closed as they are not planned in the medium-term, but which are still desirable. label Jun 10, 2025
@benjeffery benjeffery closed this as not planned Won't fix, can't repro, duplicate, stale Jun 10, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request future Issues that are closed as they are not planned in the medium-term, but which are still desirable. Python API Issue is about the Python API statistics
Projects
None yet
Development

No branches or pull requests

5 participants