Obtain a list of trees based on edge ID or node ID #2507

Proteios1998 · 2022-09-14T14:04:51Z

Proteios1998
Sep 14, 2022

Hi, I wonder how I can efficiently obtain the list of trees given a list of edge IDs or node IDs. I want to look at the number of samples below selected edges (or the child nodes of those edges), using the num_samples function of each tree. Every time I get a new edge, I need to seek to the tree based on the edge coordinates, which is not efficient. Are there alternative ways to do that? Thank you very much!

Answered by jeromekelleher

Sep 15, 2022

jeromekelleher · 2022-09-15T08:13:16Z

jeromekelleher
Sep 15, 2022
Maintainer

Hi @Proteios1998 👋, welcome to tskit-dev!

I moved this to our Discussion board as I think it's a good question, and could be useful to other folks to see the answer.

First, I guess it would be helpful if you could show us some code that does what you want inefficiently, in as simple a way as you can. That'll help us understand exactly what it is you want to do.

1 reply

Proteios1998 Sep 15, 2022
Author

Thank you Jerome! What I want to do is summarize the distribution of the number of samples per edge that carries mutations. For this, I need the tree information so that I can use the tree.num_samples() function.

My first attempt is the code below:

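    import pandas as pd

    def samples_per_mutated_edge(ts):  # wrapper name added so that `return` is valid
        # Unique IDs of the edges that carry at least one mutation
        edge = pd.DataFrame([k.edge for k in ts.mutations()], columns=["edge"])
        n_samples = []
        for i in sorted(edge.edge.unique()):
            e = ts.edge(i)
            # Seek to the tree covering the edge midpoint (not truncated to int,
            # so the position stays inside the edge for non-integer coordinates)
            tree = ts.at((e.left + e.right) / 2)
            n_samples.append(tree.num_samples(e.child))
        return n_samples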

This is inefficient because for every edge I first need to find the tree, and that lookup takes a lot of time. Could the tree ID be added to each edge so that I don't have to look it up myself?
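A note on storing a tree ID per edge: an edge covers an interval [left, right) that can span several adjacent trees, so a single tree ID per edge would not be well defined in general. The sketch below shows one way to map an edge interval to the range of tree indices it overlaps, using plain Python. The helper `tree_index_range` and the breakpoint values are hypothetical; in practice the breakpoints would be assumed to come from something like `list(ts.breakpoints())`.

```python
from bisect import bisect_left, bisect_right


def tree_index_range(breakpoints, left, right):
    """Return (first, last) indices of the trees overlapped by [left, right).

    Tree i is assumed to cover [breakpoints[i], breakpoints[i + 1]).
    """
    first = bisect_right(breakpoints, left) - 1
    last = bisect_left(breakpoints, right) - 1
    return first, last


# Hypothetical breakpoints describing four trees on a genome of length 100
breakpoints = [0.0, 10.0, 25.0, 40.0, 100.0]

# A hypothetical edge covering [10, 40) overlaps the second and third trees
print(tree_index_range(breakpoints, 10.0, 40.0))
```

An edge that persists along the whole genome would map to every tree, which is one reason a single left-to-right pass over the trees tends to be simpler than per-edge lookups.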

I came up with another way to do this:

    import pandas as pd

    def sample_count_distribution(ts):  # wrapper name added so that `return` is valid
        n_samples = []
        for tree in ts.trees():
            # Unique nodes that carry at least one mutation in this tree
            nodes = {k.node for k in tree.mutations()}
            for i in sorted(nodes):
                n_samples.append(tree.num_samples(i))
        # Tabulate how often each sample count occurs
        df_samples = pd.DataFrame(n_samples, columns=["Num_Samples"])
        samples = pd.DataFrame(df_samples.value_counts())
        samples.columns = ["Counts"]
        samples.reset_index(inplace=True)
        return samples

I think this method is much more efficient than the previous one (I also timed it, and it is fast) because I don't need to visit the same tree again and again. But I also wonder whether there is a better way to retrieve the number of samples from the tree sequence. Thank you very much!

jeromekelleher · 2022-09-15T13:53:46Z

jeromekelleher
Sep 15, 2022
Maintainer

Would something like this do the trick?

import numpy as np

# Total number of samples below each mutation on an edge
edge_mutation_samples = np.zeros(ts.num_edges)
# Total number of mutations that fall on each edge
edge_mutations = np.zeros(ts.num_edges)

for tree in ts.trees():
    for mut in tree.mutations():
        edge_mutations[mut.edge] += 1
        edge_mutation_samples[mut.edge] += tree.num_samples(mut.node)

print(edge_mutations)
print(edge_mutation_samples)

keep = edge_mutations > 0
print(edge_mutation_samples[keep] / edge_mutations[keep])

For a little example I made, this gives:

[0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 3. 2. 0.]
[ 0.  0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  2. 18.  4.  0.]
[1. 2. 6. 2.]
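To get from these arrays to the distribution asked about in the question, the per-edge averages can be tabulated. The sketch below is only an illustration in plain Python: it copies the two arrays printed above as lists rather than computing them from a tree sequence, and the variable names `samples_per_edge` and `distribution` are made up for this example. True division is used because an edge can span several trees, so the average over its mutations is not assumed to be an integer.

```python
from collections import Counter

# Values copied from the example output above (18 edges)
edge_mutations = [0, 0, 0, 1] + [0] * 10 + [1, 3, 2, 0]
edge_mutation_samples = [0, 0, 0, 1] + [0] * 10 + [2, 18, 4, 0]

# Average number of samples below the mutations on each edge,
# keeping only edges that carry at least one mutation
samples_per_edge = [
    total / count
    for total, count in zip(edge_mutation_samples, edge_mutations)
    if count > 0
]

# How many mutated edges have each average sample count
distribution = Counter(samples_per_edge)
for num_samples, num_edges in sorted(distribution.items()):
    print(num_samples, num_edges)
```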

3 replies

Proteios1998 Sep 15, 2022
Author

I think this is similar to my second method. It should be the most efficient approach as long as there is no way to store the tree ID for each edge. I am not sure how many trees have no mutations. If the number of such trees is small, then I don't think we need another layer to store the tree ID for edges or mutations. But if the number is large, storing it could save a lot of running time, since we would not have to iterate over all trees in the tree sequence (that is, we could remove the outer for loop).
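One way to check how many trees carry no mutations, without building any tree, is to locate each mutated position among the tree breakpoints. The following is a minimal sketch under stated assumptions: the breakpoints and positions are hypothetical stand-ins for values that might be read from `ts.breakpoints()` and from the positions of the sites that have mutations, and `tree_index_of_positions` is an illustrative helper, not a tskit function.

```python
from bisect import bisect_right


def tree_index_of_positions(breakpoints, positions):
    """For each position, return the index i of the tree whose interval
    [breakpoints[i], breakpoints[i + 1]) contains it."""
    return [bisect_right(breakpoints, pos) - 1 for pos in positions]


# Hypothetical data: four trees and four mutated positions
breakpoints = [0.0, 10.0, 25.0, 40.0, 100.0]
mutation_positions = [3.0, 12.5, 13.0, 99.0]

tree_ids = tree_index_of_positions(breakpoints, mutation_positions)
num_trees = len(breakpoints) - 1
trees_with_mutations = set(tree_ids)

# Fraction of trees on which no mutation falls
empty_fraction = 1 - len(trees_with_mutations) / num_trees
print(tree_ids, empty_fraction)
```

If the empty fraction turns out to be small, skipping trees would save little compared with a single pass over all of them.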

jeromekelleher Sep 15, 2022
Maintainer

I don't think there's a more efficient way than this tbh - left-to-right iteration over the trees using ts.trees() is very fast, much faster than seeking to specific trees.

Proteios1998 Sep 15, 2022
Author

I see! Thank you very much!
