-
Notifications
You must be signed in to change notification settings - Fork 77
Description
An operation that came up as being a useful feature in the recent PopSim meeting is to be able to count the number of extant lineages in each tree at a given time. This is equivalent to (a) deleting all edges older than t and truncating all edges intersecting with it (this would require the planned time
argument to delete_interval
in #382); and (b) counting the number of roots in each tree. We could define the operation by just doing this, but it seems a bit inefficient, especially if we do this for lots of different times.
Here's a way we can do it more efficiently:
def count_roots(ts, t):
time = ts.tables.nodes.time
num_children = np.zeros(ts.num_nodes, dtype=int)
num_roots = ts.num_samples
ret = np.zeros(ts.num_trees, dtype=int)
j = 0
for _, edges_out, edges_in in ts.edge_diffs():
for edge in edges_out:
if time[edge.parent] <= t:
num_children[edge.parent] -= 1
if num_children[edge.parent] == 1:
num_roots += 1
for edge in edges_in:
if time[edge.parent] <= t:
num_children[edge.parent] += 1
if num_children[edge.parent] == 2:
num_roots -= 1
ret[j] = num_roots
j += 1
return ret
We can code this up pretty efficiently in C, and return the numpy array. Should we do it this way? Should we call the method TreeSequence.num_roots(time=None)
, which returns a numpy array like the above? (If time is None, it defaults to the oldest time in the ts)