Efficient graph APIs #2869
Replies: 20 comments 28 replies
-
Here's an ARG-type algorithm that fits well: def descendant_span(ts, arg, u):
"""
Return an array giving the total sequence lenght over which
each node in the tree sequence descends from the specified node.
"""
total_descending = np.zeros(ts.num_nodes)
stack = [(u, 0, ts.sequence_length)]
total_descending[u] = ts.sequence_length # NOTE questionable quick hack
while len(stack) > 0:
u, left, right = stack.pop()
# NOTE: if we had an index here sorted by left coord
# we could binary search to first match, and could
# break once e_right > left (I think?)
for e in range(*arg.child_range[u]):
e_left = ts.edges_left[e]
e_right = ts.edges_right[e]
if e_right > left and right > e_left:
inter_left = max(e_left, left)
inter_right = min(e_right, right)
e_child = ts.edges_child[e]
total_descending[e_child] += inter_right - inter_left
stack.append((e_child, inter_left, inter_right))
return total_descending
def descendant_span_tree(ts, u):
total_descending = np.zeros(ts.num_nodes)
for tree in ts.trees():
descendants = tree.preorder(u)
total_descending[descendants] += tree.span
return total_descending
for u in range(ts.num_nodes):
d1 = descendant_span(ts, arg, u)
d2 = descendant_span_tree(ts, u)
np.testing.assert_array_equal(d1, d2) I think this is basically what you were looking for at some point @hyanwong? This would be substantially faster than using set-ops, I expect (you just filter by descendant_span > 0 to find nodes that descend somewhere) I think the ARG algorithm could be made quicker with an index and some more logic, but it would probably be good to run on something larger before making any solid judgements. |
Beta Was this translation helpful? Give feedback.
-
The main tsdate algorithm just requires running through the all the edges for a parent, ordering parents by time (easy), or all the edges for a child (ordering children by time, then by parent id: requires a sort like the one here). To traverse the DAG efficiently, visiting children before parents (or vice versa), might we require the new tables to be ordered by node time? This structure might make it slightly easier, but I don't think there's much in it? We would require the secondary sort key to be the node ID, not the left coordinate, though. |
Beta Was this translation helpful? Give feedback.
-
Indexing into the In tsdate we rely, roughly, on the fact that edges are sorted first by parent (time), then by child ID, then by left coord. I think that the list of sequential parent edges for a child should therefore be sorted by parent ID, and only after by left coord? That we we can do a You could imagine wrapping this in something so that you could easily traverse through the parent-child (or child-parent) links without caring about the intervals. |
Beta Was this translation helpful? Give feedback.
-
For the list of descendants of a node u, you could imagine wanting either the list of samples that have inherited anything from u (which is what you have), or the list of descendants of u regardless of whether they have inherited anything from it. This latter list is actually easier to construct, as you don't have any intervals to keep track of. Another interesting (related) one is, for a set of samples s , to find the most recent common ancestor anywhere in the genome, or the most recent common ancestor in the ARG regardless of genomic inheritance. |
Beta Was this translation helpful? Give feedback.
-
Here's some corresponding operations for ancestors: def ancestor_span(ts, arg, u):
"""
Return an array giving the total sequence lenght over which
each node in the tree sequence is an ancestor to the specified node.
"""
total_ancestral = np.zeros(ts.num_nodes)
stack = [(u, 0, ts.sequence_length)]
total_ancestral[u] = ts.sequence_length # NOTE questionable quick hack
while len(stack) > 0:
u, left, right = stack.pop()
# NOTE: if we had an index here sorted by left coord
# we could binary search to first match, and could
# break once e_right > left (I think?)
for j in range(*arg.parent_range[u]):
e = arg.parent_index[j]
e_left = ts.edges_left[e]
e_right = ts.edges_right[e]
if e_right > left and right > e_left:
inter_left = max(e_left, left)
inter_right = min(e_right, right)
e_parent = ts.edges_parent[e]
total_ancestral[e_parent] += inter_right - inter_left
stack.append((e_parent, inter_left, inter_right))
return total_ancestral
def ancestor_span_tree(ts, u):
total_ancestral = np.zeros(ts.num_nodes)
for tree in ts.trees():
v = u
while v != -1:
total_ancestral[v] += tree.span
v = tree.parent(v)
return total_ancestral
def graph_ancestors(ts, arg, u):
"""
Return boolean array marking whether a node is a graph ancestor
of the specified node u. Note that this does not require that the
node inherited any genetic material.
"""
is_ancestor = np.zeros(ts.num_nodes, dtype=bool)
is_ancestor[u] = True
stack = [u]
while len(stack) > 0:
u = stack.pop()
for j in range(*arg.parent_range[u]):
e = arg.parent_index[j]
e_parent = ts.edges_parent[e]
if not is_ancestor[e_parent]:
# Note: setting is_ancestor here because we can
# push the same node on the stack multiple times otherwise
is_ancestor[e_parent] = True
stack.append(e_parent)
return is_ancestor
for u in range(ts.num_nodes):
d1 = ancestor_span(ts, arg, u)
d2 = ancestor_span_tree(ts, u)
np.testing.assert_array_equal(d1, d2) Note these are basically symmetric with the |
Beta Was this translation helpful? Give feedback.
-
We can use the def graph_mrca(ts, arg, u, v):
u_ancestors = graph_ancestors(ts, arg, u)
v_ancestors = graph_ancestors(ts, arg, v)
common = np.logical_and(u_ancestors, v_ancestors)
min_index = np.argmin(ts.nodes_time[common])
return np.where(common)[0][min_index] You could implement it a bit better, but the time difference wouldn't be that much. The "real" MRCA is much harder because you have to propagate the combined ancestral material through the graph for u and v. What you're doing, essentially, is simplifying the ARG wrt to the samples u and v, so implementing in terms of simplify would be a reasonable thing to do (if not optimal). |
Beta Was this translation helpful? Give feedback.
-
I assume most of these algorithms would work fine if we had a "non-sample-resolved" ARG, e.g. from an unsimplified forward simulation? We'd have lots of "hanging topology", but the interval tracking stuff should still work OK. |
Beta Was this translation helpful? Give feedback.
-
So, the interesting algorithmic question that's emerging here is whether sorting parent/child intervals by left coordinate for would actually make the fundamental Are there graphs in which we have enough intervals per parent/child pair to make sorting worthwhile? It's not clear to me. |
Beta Was this translation helpful? Give feedback.
-
Here's a version in which we capture the sub-ARG descending from a given individual as an edge table: def descendant_intervals(ts, arg, u):
descending = tskit.EdgeTable()
stack = [(u, 0, ts.sequence_length)]
while len(stack) > 0:
u, left, right = stack.pop()
for e in range(*arg.child_range[u]):
e_left = ts.edges_left[e]
e_right = ts.edges_right[e]
if e_right > left and right > e_left:
inter_left = max(e_left, left)
inter_right = min(e_right, right)
e_child = ts.edges_child[e]
descending.add_row(inter_left, inter_right, u, e_child)
stack.append((e_child, inter_left, inter_right))
return descending This is basically the same as |
Beta Was this translation helpful? Give feedback.
-
The main use for these ARG algorithms are when we want to perform node-focussed analysis (e.g. we have a big ARG, and want to know specific things about a node or set of nodes and their ancestors/descendants). We discussed a few interesting use cases:
|
Beta Was this translation helpful? Give feedback.
-
I just thought of another case where vertical traversal might make more sense that left-right traversing. Imagine a large number of |
Beta Was this translation helpful? Give feedback.
-
This is very nice! Tell me if this is right - the point is essentially providing efficient ways to look up "who are the parents" and "who are the children" of a given node(and on which intervals)? Thus making bottom-up or top-down iteration easier? And so it's 'just' indexing into the same data structure, but by saying "ARG API" you're saying "ARG" (as opposed to "tree sequence") to connote thinking in the up/down direction, rather than left/right? I like it, although I worry a bit that the ARG/tree sequence terminology doesn't overlap terribly well with our definitions of those? Another set of operations are IBD-related ones. |
Beta Was this translation helpful? Give feedback.
-
Here's another useful node-focussed algorithm: give 2 samples, which part(s) of the genome have the most recent MRCA. I.e. which pieces of the genome are the closest, and what ancestral node(s) do those correspond to? |
Beta Was this translation helpful? Give feedback.
-
I'm looking at shared recombinations at the moment, and, for a specific node which has multiple ARG parents, where the parent changes from A to B at a specific breakpoint x, I want to find the "MRCA" shared between parent A in the left tree and parent B in the right tree. I also needed to do exactly this for the sc2ts tree, to find the "age" of the recombinant parents. So I suspect this might be a commonly-used ARG operation? Basically: how distantly related are two parents at a breakpoint position? At the moment I have to jump to the tree at the breakpoint, find the chain of parents from B, then switch to the previous tree and find the other chain from A. There's probably a neater way to do this on a per-node basis using the ARG structures than by building the entire tree every time, right? |
Beta Was this translation helpful? Give feedback.
-
Hey folks, just getting to this now, it's been a hectic few weeks here.
Will weigh in once I've parsed all you've already done here!
…On Wed, Nov 15, 2023, 5:49 AM Jerome Kelleher ***@***.***> wrote:
I doubt either of those is much quicker than just looking at the tree on
the left and right of the breakpoint, though.
—
Reply to this email directly, view it on GitHub
<https://urldefense.com/v3/__https://github.com/tskit-dev/tskit/discussions/2869*discussioncomment-7577258__;Iw!!K-Hz7m0Vt54!hpem6iY0PcerQ7cRAYZnZ6wARUKq0BMvFI2cSrPJwfd8jqOiul_Yf5uKTc0CnAl0f4XWy50nu6QSyePGcJ9GmFn8$>,
or unsubscribe
<https://urldefense.com/v3/__https://github.com/notifications/unsubscribe-auth/AEHOXQTK4QZABDWIT6GX4H3YETCADAVCNFSM6AAAAAA7GDTAA6VHI2DSMVQWIX3LMV43SRDJONRXK43TNFXW4Q3PNVWWK3TUHM3TKNZXGI2TQ__;!!K-Hz7m0Vt54!hpem6iY0PcerQ7cRAYZnZ6wARUKq0BMvFI2cSrPJwfd8jqOiul_Yf5uKTc0CnAl0f4XWy50nu6QSyePGcJx3k62Y$>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
Beta Was this translation helpful? Give feedback.
-
I just noticed that in the original definition of I am coming round to the idea that the next sort key after this should be the left position, as this allows more efficient processing of intervals. We should probably sort by both right coordinate and parent ID after than, so that we enforce a canonical ordering? |
Beta Was this translation helpful? Give feedback.
-
Here's another nice algorithm to count the number of samples reachable from each node in the graph: #2882 (comment) |
Beta Was this translation helpful? Give feedback.
-
Another easy thing to write using parent indexes is a |
Beta Was this translation helpful? Give feedback.
-
Just pinging @kitchensjn here, as he has been thinking about ARG traversals and suchlike, and might have some useful comments. I think we intend to implement something like this for tskit 1.0, right? So finding users to test this functionality would be good. I wonder if we could implement a non-documented API first, so that users like James could try it out in beta versions of software like the tskit_arg_visualizer (not that that particular one usually needs to be efficient, I think, unless it's doing something funky with the sc2ts ARG?) |
Beta Was this translation helpful? Give feedback.
-
Thanks for pointing me to this @hyanwong! Happy to help test things out. For sparg, we often thought about the different paths of inheritance that a sample had within an ARG. You could do this across all of the trees and then remove the duplicated paths, but I find it easier to work with the graph in this case. Something like... def unique_ancestral_lineages(ts, arg, u):
"""
Returns list of ancestral lineages above a specified node. Note that this
does not require that the node inherited any genetic material.
"""
parent_range = range(*arg.parent_range[u])
if parent_range == range(-1,-1):
return [np.array([u])]
lineages = []
previous = []
for j in parent_range:
e = arg.parent_index[j]
e_parent = ts.edges_parent[e]
if e_parent not in previous: #ignore duplicate child-parent edges
previous.append(e_parent)
for line in unique_ancestral_lineages(ts, arg, e_parent):
lineages.append(np.append(np.array([u]), line))
return lineages Similar to Do you think that there will/should be constraints for the recombination node format used in the graph API? I'd imagine it would be difficult to enforce given the attachment to the edge table, but I've found the 2-RE versus 1-RE formatting differences even more critical when working with the graph object. Yan recently pointed out an issue in one of the visualizer's conversion functions that relates to this ambiguity in formatting. And for a more relevant example here... say I wanted to calculate the shared time between lineages |
Beta Was this translation helpful? Give feedback.
Uh oh!
There was an error while loading. Please reload this page.
-
We don't have much support at the moment for ways to view the whole tree sequence as a graph - everything is very much focused on the "sequence of trees" view. We should add some graph-level functionality. One way we can do this is to add an
ARG
class (and correspondingtsk_arg_t
C struct) which is analogous to the Tree class: a view on the underlying data model with some additional data structures to facilitate computations.Before getting into details, it's important to set out some requirements for the low-level implementation. We can certainly build higher-level Python stuff on top of this (like export to networkx, etc), but we have to do it in a way that's efficient at the low-level first.
The first two requirements rule out quite a lot of possibilities when representing a graph structure. An adjacency matrix would work, but would need a lot of memory and this wouldn't solve the problem of how to annotate the edges with inheritance intervals.
After some thought, I think something like this is as good as we're going to do:
(The eagle-eyed will notice that the
lexsort
in here violates property 3, but we can get back to that later).The idea is that we use the existing edge table to represent the inheritance intervals. Ultimately, we have to attach O(num_edges) information to the graph, so we may as well use what we already have. Each node is associated with a set of indexes into the edge table, those which define either the parents or children of that node. Because the edge table is already sorted by
parent
ID (through the sortedness requirements) we can represent all the child edges of a node by two numbers: thestart
andstop
values for the range of indexes.To do the same for finding the parents of a node, we have to build an additional index into the edge table (
parent_index
above).Let's see an example of this working:
gives
So far so good. The most basic application I can think of for an ARG data structure is to recover the local tree at a given position, so:
Note that here we've make the left coordinate secondary sorting criterion in the
parent_index
which makes finding the edge for a given position potentially binary searchable.Additional indexes
The only real problem with this approach I think is that it requires us to do an expensive sort on the edge table. I don't see any way around this, if we want efficient access to the parent edges for a given node. In principle, we can build this extra index in the build_index function, and store on disk. We deliberately left the indexing open-ended to facilitate adding additional indexes later. Questions about whether to build this index by default etc would need to answered.
Secondary sort key?
Assuming we are building an extra index for ARG traversal, we could need to decide what the secondary sort key is. Should we group the parent edges for a given node by ID or left coordinate? Left coordinate seems more useful to me (see above) but I guess it would be useful to keep all the edges for a given parent node adjacent when doing a graph traversal?
If we decide to sort by left coordinate for the required new
parent_index
, we would have to consider also computing and storing the equivalentchild_index
, which has the left coordinate as the secondary sort key, rather thanchild
node. This is in principle much cheaper though, as the edge table is already sorted by parent ID and so you would just need to do lots of much smaller sorts within the edges for a given parent.Applications
There's no point in building these APIs if we don't have some applications. The only things I can think of (besides tree construction, which may potentially be faster than what we have) are:
@hyanwong @nspope - you're in the weeds with tsdate right now, would the structures above make your lives any easier?
Beta Was this translation helpful? Give feedback.
All reactions