Efficient graph APIs #2869

jeromekelleher · 2023-11-10T12:36:48Z

jeromekelleher
Nov 10, 2023
Maintainer

We don't have much support at the moment for ways to view the whole tree sequence as a graph - everything is very much focused on the "sequence of trees" view. We should add some graph-level functionality. One way we can do this is to add an ARG class (and corresponding tsk_arg_t C struct) which is analogous to the Tree class: a view on the underlying data model with some additional data structures to facilitate computations.

Before getting into details, it's important to set out some requirements for the low-level implementation. We can certainly build higher-level Python stuff on top of this (like export to networkx, etc), but we have to do it in a way that's efficient at the low-level first.

1. Data structures used must be shared between the C and Python implementations.
2. The Python version must be "numba friendly" (this is key).
3. An ARG instance should be derivable from a TreeSequence in O(num edges) time.

The first two requirements rule out quite a lot of possibilities when representing a graph structure. An adjacency matrix would work, but would need a lot of memory and this wouldn't solve the problem of how to annotate the edges with inheritance intervals.

After some thought, I think something like this is as good as we're going to do:

import msprime
import numpy as np
import dataclasses


@dataclasses.dataclass
class ARG:
    parent_range: list
    child_range: list
    parent_index: list

    def __str__(self):
        s = "id\tchild\tparent\n"
        for j in range(len(self.parent_range)):
            s += f"{j}\t" f"{self.child_range[j]}\t" f"{self.parent_range[j]}\t" "\n"
        return s


def make_arg(ts):
    arg = ARG(
        child_range=np.zeros((ts.num_nodes, 2), dtype=np.int32) - 1,
        parent_range=np.zeros((ts.num_nodes, 2), dtype=np.int32) - 1,
        parent_index=None,
    )
    last_parent = -1
    for edge in ts.edges():
        if edge.parent != last_parent:
            arg.child_range[edge.parent, 0] = edge.id
            if last_parent != -1:
                arg.child_range[last_parent, 1] = edge.id
        last_parent = edge.parent
    if last_parent != -1:
        arg.child_range[last_parent, 1] = ts.num_edges

    # Group together all parent edges for a given child.
    # Here we're sorting by left-coordinate first, so that we can find the
    # parent of a node at a given position quickly.
    arg.parent_index = np.lexsort((ts.edges_parent, ts.edges_left, ts.edges_child))

    last_child = -1
    for j, e in enumerate(arg.parent_index):
        edge = ts.edge(e)
        if edge.child != last_child:
            arg.parent_range[edge.child, 0] = j
            if last_child != -1:
                arg.parent_range[last_child, 1] = j
        last_child = edge.child

    if last_child != -1:
        arg.parent_range[last_child, 1] = ts.num_edges
    return arg

(The eagle-eyed will notice that the lexsort in here violates the third requirement, O(num edges) construction, but we can get back to that later).

The idea is that we use the existing edge table to represent the inheritance intervals. Ultimately, we have to attach O(num_edges) information to the graph, so we may as well use what we already have. Each node is associated with a set of indexes into the edge table, those which define either the parents or children of that node. Because the edge table is already sorted by parent ID (through the sortedness requirements) we can represent all the child edges of a node by two numbers: the start and stop values for the range of indexes.
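As a rough illustration of the "two numbers per node" idea, here is a vectorised sketch that finds the runs of equal parent in the edge table's parent column. It assumes only that edges with the same parent are contiguous (which the sortedness requirements give us); `child_range_from_runs` is a made-up name, and the array is the parent column of the example edge table shown in this post.

```python
import numpy as np


def child_range_from_runs(edges_parent, num_nodes):
    """Return a (num_nodes, 2) array of [start, stop) edge IDs per parent."""
    edges_parent = np.asarray(edges_parent)
    child_range = np.full((num_nodes, 2), -1, dtype=np.int32)
    num_edges = len(edges_parent)
    if num_edges == 0:
        return child_range
    # A new run starts at edge 0 and wherever the parent value changes.
    starts = np.concatenate(([0], np.flatnonzero(np.diff(edges_parent)) + 1))
    # Each run stops where the next one starts; the last stops at num_edges.
    stops = np.concatenate((starts[1:], [num_edges]))
    parents = edges_parent[starts]
    child_range[parents, 0] = starts
    child_range[parents, 1] = stops
    return child_range


# Parent column of the 24-row example edge table in this post.
edges_parent = np.array(
    [4, 5, 6, 6, 7, 7, 8, 9, 10, 10, 11, 11, 12, 13, 14, 15, 16, 16, 17, 17, 17, 17, 18, 18]
)
child_range = child_range_from_runs(edges_parent, num_nodes=19)
```

Nodes that never appear as a parent (the samples here) keep the `[-1, -1]` sentinel, so `range(*child_range[u])` is empty for them.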

To do the same for finding the parents of a node, we have to build an additional index into the edge table (parent_index above).

Let's see an example of this working:

ts = msprime.sim_ancestry(
    4,
    ploidy=1,
    recombination_rate=0.1,
    sequence_length=10,
    random_seed=1234,
    record_full_arg=True,
)

print(ts.draw_text())
print(ts.tables.edges)

arg = make_arg(ts)

print("parent index =\n", arg.parent_index)
print(arg)
for u in range(ts.num_nodes):
    for e in range(*arg.child_range[u]):
        assert ts.edges_parent[e] == u
    for j in range(*arg.parent_range[u]):
        e = arg.parent_index[j]
        assert ts.edges_child[e] == u

gives

2.02┊         ┊         ┊    18   ┊         ┊  
    ┊         ┊         ┊   ┏━┻━┓ ┊         ┊  
1.67┊    17   ┊         ┊   ┃  17 ┊    17   ┊  
    ┊   ┏━┻━┓ ┊         ┊   ┃   ┃ ┊   ┏━┻━┓ ┊  
1.48┊  16   ┃ ┊         ┊   ┃   ┃ ┊  16   ┃ ┊  
    ┊   ┃   ┃ ┊         ┊   ┃   ┃ ┊   ┃   ┃ ┊  
1.46┊   ┃   ┃ ┊         ┊  14   ┃ ┊  15   ┃ ┊  
    ┊   ┃   ┃ ┊         ┊   ┃   ┃ ┊   ┃   ┃ ┊  
1.15┊  12   ┃ ┊         ┊  13   ┃ ┊  13   ┃ ┊  
    ┊   ┃   ┃ ┊         ┊   ┃   ┃ ┊   ┃   ┃ ┊  
0.77┊   ┃  11 ┊         ┊   ┃  11 ┊   ┃  11 ┊  
    ┊   ┃   ┃ ┊         ┊   ┃   ┃ ┊   ┃   ┃ ┊  
0.72┊  10   ┃ ┊    10   ┊  10   ┃ ┊  10   ┃ ┊  
    ┊   ┃   ┃ ┊   ┏━┻━┓ ┊  ┏┻━┓ ┃ ┊  ┏┻━┓ ┃ ┊  
0.44┊   ┃   8 ┊   ┃   9 ┊  ┃  9 ┃ ┊  ┃  9 ┃ ┊  
    ┊   ┃   ┃ ┊   ┃   ┃ ┊  ┃  ┃ ┃ ┊  ┃  ┃ ┃ ┊  
0.37┊   7   ┃ ┊   7   ┃ ┊  7  ┃ ┃ ┊  7  ┃ ┃ ┊  
    ┊  ┏┻━┓ ┃ ┊  ┏┻━┓ ┃ ┊ ┏┻┓ ┃ ┃ ┊ ┏┻┓ ┃ ┃ ┊  
0.12┊  6  ┃ ┃ ┊  6  ┃ ┃ ┊ 6 ┃ ┃ ┃ ┊ 6 ┃ ┃ ┃ ┊  
    ┊ ┏┻┓ ┃ ┃ ┊ ┏┻┓ ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┊  
0.06┊ ┃ 4 ┃ ┃ ┊ ┃ 4 ┃ ┃ ┊ ┃ ┃ ┃ 5 ┊ ┃ ┃ ┃ 5 ┊ 
    ┊ ┃ ┃ ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┊                                                                                                                                                 
0.00┊ 0 1 2 3 ┊ 0 1 2 3 ┊ 0 2 3 1 ┊ 0 2 3 1 ┊ 
    0         1         5         7        10 

╔══╤════╤═════╤══════╤═════╤════════╗
║id│left│right│parent│child│metadata║
╠══╪════╪═════╪══════╪═════╪════════╣
║0 │   0│    5│     4│    1│        ║
║1 │   5│   10│     5│    1│        ║
║2 │   0│   10│     6│    0│        ║
║3 │   0│    5│     6│    4│        ║
║4 │   0│   10│     7│    2│        ║
║5 │   0│   10│     7│    6│        ║
║6 │   0│    1│     8│    3│        ║
║7 │   1│   10│     9│    3│        ║
║8 │   0│   10│    10│    7│        ║
║9 │   1│   10│    10│    9│        ║
║10│   5│   10│    11│    5│        ║
║11│   0│    1│    11│    8│        ║
║12│   0│    1│    12│   10│        ║
║13│   5│   10│    13│   10│        ║
║14│   5│    7│    14│   13│        ║
║15│   7│   10│    15│   13│        ║
║16│   0│    1│    16│   12│        ║
║17│   7│   10│    16│   15│        ║
║18│   0│    1│    17│   11│        ║
║19│   5│   10│    17│   11│        ║
║20│   0│    1│    17│   16│        ║
║21│   7│   10│    17│   16│        ║
║22│   5│    7│    18│   14│        ║
║23│   5│    7│    18│   17│        ║
╚══╧════╧═════╧══════╧═════╧════════╝

parent index =                                 
 [ 2  0  1  4  6  7  3 10  5  8 11  9 12 13 18 19 16 14 15 22 17 20 21 23]                     
id      child   parent                         
0       [-1 -1] [0 1]                          
1       [-1 -1] [1 3]                          
2       [-1 -1] [3 4]                          
3       [-1 -1] [4 6]                          
4       [0 1]   [6 7]                          
5       [1 2]   [7 8]                          
6       [2 4]   [8 9]                          
7       [4 6]   [ 9 10]                        
8       [6 7]   [10 11]                        
9       [7 8]   [11 12]                        
10      [ 8 10] [12 14]                        
11      [10 12] [14 16]                        
12      [12 13] [16 17]                        
13      [13 14] [17 19]                        
14      [14 15] [19 20]                        
15      [15 16] [20 21]                        
16      [16 18] [21 23]                        
17      [18 22] [23 24]                        
18      [22 24] [-1 -1]

So far so good. The most basic application I can think of for an ARG data structure is to recover the local tree at a given position, so:

def build_tree(ts, arg, x):
    """
    Returns the tree ancestral to samples for position x.
    """
    pi = np.zeros(ts.num_nodes, np.int32) - 1
    for u in ts.samples():
        while u != -1 and pi[u] == -1:
            v = -1
            # NOTE: these are sorted by left coordinate, so we could binary search on x
            for j in range(*arg.parent_range[u]):
                e = arg.parent_index[j]
                if ts.edges_left[e] <= x < ts.edges_right[e]:
                    v = ts.edges_parent[e]
                    pi[u] = v
                    break
            u = v
    return pi

for x in ts.breakpoints(as_array=True)[:-1]:
    pi = build_tree(ts, arg, x)
    np.testing.assert_array_equal(pi, ts.at(x).parent_array[:-1])

Note that here we've made the left coordinate the secondary sorting criterion in the parent_index, which makes finding the edge for a given position potentially binary searchable.
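A minimal sketch of what that binary search could look like, written as a plain loop so that it stays numba friendly. It assumes the parent intervals of a single child do not overlap and are sorted by left coordinate within the `[start, stop)` slice of the index; `find_parent_edge` and the toy arrays are hypothetical, not part of any existing API.

```python
import numpy as np


def find_parent_edge(x, start, stop, parent_index, edges_left, edges_right):
    """Return the edge ID covering position x among slots [start, stop), or -1."""
    lo, hi = start, stop
    # Find the first slot whose left coordinate is strictly greater than x.
    while lo < hi:
        mid = (lo + hi) // 2
        if edges_left[parent_index[mid]] <= x:
            lo = mid + 1
        else:
            hi = mid
    # The only candidate is the slot just before that one.
    if lo > start:
        e = parent_index[lo - 1]
        if x < edges_right[e]:
            return int(e)
    return -1


# Toy data: slots 1..3 of the index hold the parent edges of one child,
# covering [0, 4), [5, 7) and [7, 10), with a gap on [4, 5).
edges_left = np.array([0.0, 5.0, 0.0, 7.0])
edges_right = np.array([4.0, 7.0, 10.0, 10.0])
parent_index = np.array([2, 0, 1, 3])
start, stop = 1, 4

hits = [
    find_parent_edge(x, start, stop, parent_index, edges_left, edges_right)
    for x in (0.0, 4.5, 6.0, 7.0)
]
```

In `build_tree` this would replace the inner linear scan over `arg.parent_range[u]`; whether it pays off depends on how many parent edges a node typically has.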

Additional indexes

The only real problem with this approach, I think, is that it requires us to do an expensive sort on the edge table. I don't see any way around this if we want efficient access to the parent edges for a given node. In principle, we can build this extra index in the build_index function and store it on disk. We deliberately left the indexing open-ended to facilitate adding additional indexes later. Questions about whether to build this index by default etc. would need to be answered.

Secondary sort key?

Assuming we are building an extra index for ARG traversal, we would need to decide what the secondary sort key is. Should we group the parent edges for a given node by ID or left coordinate? Left coordinate seems more useful to me (see above) but I guess it would be useful to keep all the edges for a given parent node adjacent when doing a graph traversal?

If we decide to sort by left coordinate for the required new parent_index, we would have to consider also computing and storing the equivalent child_index, which has the left coordinate as the secondary sort key, rather than child node. This is in principle much cheaper though, as the edge table is already sorted by parent ID and so you would just need to do lots of much smaller sorts within the edges for a given parent.
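For comparison, here is a sketch of the "secondary key is edge ID" option. If we only need to group edges by child and are happy to keep them in edge-table order within each child, a stable counting sort does the job in O(num edges + num nodes) without a comparison sort. This is an illustration under that assumption only; `parent_index_by_counting_sort` is a made-up name, and it does not give left-coordinate order across different parents in general.

```python
import numpy as np


def parent_index_by_counting_sort(edges_child, num_nodes):
    """Group edge IDs by child, keeping edge-table order within each child."""
    edges_child = np.asarray(edges_child)
    num_edges = len(edges_child)
    counts = np.bincount(edges_child, minlength=num_nodes)
    # start[c] is the first slot in parent_index used by child c.
    start = np.zeros(num_nodes + 1, dtype=np.int64)
    start[1:] = np.cumsum(counts)
    parent_range = np.full((num_nodes, 2), -1, dtype=np.int64)
    has_parent = counts > 0
    parent_range[has_parent, 0] = start[:-1][has_parent]
    parent_range[has_parent, 1] = start[1:][has_parent]
    # Drop each edge into the next free slot for its child.
    next_slot = start[:-1].copy()
    parent_index = np.empty(num_edges, dtype=np.int64)
    for e in range(num_edges):
        c = edges_child[e]
        parent_index[next_slot[c]] = e
        next_slot[c] += 1
    return parent_index, parent_range


# Child column of the 24-row example edge table from the first post.
edges_child = np.array(
    [1, 1, 0, 4, 2, 6, 3, 3, 7, 9, 5, 8, 10, 10, 13, 13, 12, 15, 11, 11, 16, 16, 14, 17]
)
parent_index, parent_range = parent_index_by_counting_sort(edges_child, 19)
```

On this small example the result happens to coincide with the lexsort-based parent_index printed above, because no child has parent edges whose edge-ID order and left-coordinate order disagree.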

Applications

There's no point in building these APIs if we don't have some applications. The only things I can think of (besides tree construction, which may potentially be faster than what we have) are:

Make simplify slightly simpler
tsdate?

@hyanwong @nspope - you're in the weeds with tsdate right now, would the structures above make your lives any easier?

jeromekelleher · 2023-11-10T15:05:42Z

jeromekelleher
Nov 10, 2023
Maintainer Author

Here's an ARG-type algorithm that fits well:

def descendant_span(ts, arg, u):
    """
    Return an array giving the total sequence length over which
    each node in the tree sequence descends from the specified node.
    """
    total_descending = np.zeros(ts.num_nodes)
    stack = [(u, 0, ts.sequence_length)]
    total_descending[u] = ts.sequence_length  # NOTE questionable quick hack
    while len(stack) > 0:
        u, left, right = stack.pop()
        # NOTE: if we had an index here sorted by left coord
        # we could binary search to first match, and could
        # break once e_right > left (I think?)
        for e in range(*arg.child_range[u]):
            e_left = ts.edges_left[e]
            e_right = ts.edges_right[e]
            if e_right > left and right > e_left:
                inter_left = max(e_left, left)
                inter_right = min(e_right, right)
                e_child = ts.edges_child[e]
                total_descending[e_child] += inter_right - inter_left
                stack.append((e_child, inter_left, inter_right))
    return total_descending


def descendant_span_tree(ts, u):
    total_descending = np.zeros(ts.num_nodes)
    for tree in ts.trees():
        descendants = tree.preorder(u)
        total_descending[descendants] += tree.span
    return total_descending

for u in range(ts.num_nodes):
    d1 = descendant_span(ts, arg, u)
    d2 = descendant_span_tree(ts, u)
    np.testing.assert_array_equal(d1, d2)

I think this is basically what you were looking for at some point @hyanwong? This would be substantially faster than using set-ops, I expect (you just filter by descendant_span > 0 to find nodes that descend somewhere)

I think the ARG algorithm could be made quicker with an index and some more logic, but it would probably be good to run on something larger before making any solid judgements.

3 replies

jeromekelleher Nov 10, 2023
Maintainer Author

Just tried a quick numba'd version of this:

import numba


@numba.njit
def _descendant_span(u, num_nodes, sequence_length, arg_child_range,
                    edges_left, edges_right, edges_child):
    total_descending = np.zeros(num_nodes)
    stack = [(u, 0, sequence_length)]
    while len(stack) > 0:
        u, left, right = stack.pop()
        for e in range(arg_child_range[u, 0], arg_child_range[u, 1]):
            e_left = edges_left[e]
            e_right = edges_right[e]
            if e_right > left and right > e_left:
                inter_left = max(e_left, left)
                inter_right = min(e_right, right)
                e_child = edges_child[e]
                total_descending[e_child] += inter_right - inter_left
                stack.append((e_child, inter_left, inter_right))
    return total_descending

def descendant_span_numba(ts, arg, u):
    """
    Return an array giving the total sequence length over which
    each node in the tree sequence descends from the specified node.
    """
    return _descendant_span(u, ts.num_nodes, ts.sequence_length,
                            arg.child_range, ts.edges_left, ts.edges_right, 
                            ts.edges_child)

On a SARS-CoV-2 ARG (780k nodes) the Python version takes 2.6 seconds to run for the "root" node (should be worst case scenario) and the numba version takes 33ms (after warming the jit).

The (fairly fast) tree-by-tree method above takes 14 seconds.

hyanwong Nov 10, 2023
Maintainer

I don't really understand how this works without maintaining an interval library, or assuming some sort of nestedness within the inherited intervals, but I think I'm missing something. Maybe you could explain it to me on the board?

jeromekelleher Nov 10, 2023
Maintainer Author

No need for an interval library, we're doing it ourselves (there's tons of interval stuff in, e.g., simplify). We're just computing interval intersections one-by-one here.
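To make the "one-by-one" intersection concrete, here is a tiny sketch of the test used inside descendant_span, pulled out into a helper and applied along a single hypothetical path of edges. The `intersect` name and the path are made up for illustration.

```python
def intersect(left, right, e_left, e_right):
    """Overlap of [left, right) and [e_left, e_right), or None if disjoint."""
    if e_right > left and right > e_left:
        return max(e_left, left), min(e_right, right)
    return None


# Start with the whole sequence at the focal node, then follow three
# edges downwards. Each step keeps only the part of the current interval
# that the next edge actually transmits.
interval = (0, 10)
path = [(0, 5), (3, 8), (4, 10)]
spans = []
for e_left, e_right in path:
    interval = intersect(*interval, e_left, e_right)
    spans.append(interval)
```

Each stack entry in descendant_span carries one such interval, so a node reached along several paths is simply visited several times, once per inherited piece. That is why no nestedness assumption or interval library is needed.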

hyanwong · 2023-11-10T15:33:13Z

hyanwong
Nov 10, 2023
Maintainer

@hyanwong @nspope - you're in the weeds with tsdate right now, would the structures above make your lives any easier?

The main tsdate algorithm just requires running through all the edges for a parent, ordering parents by time (easy), or all the edges for a child (ordering children by time, then by parent id: requires a sort like the one here). To traverse the DAG efficiently, visiting children before parents (or vice versa), might we require the new tables to be ordered by node time?

This structure might make it slightly easier, but I don't think there's much in it? We would require the secondary sort key to be the node ID, not the left coordinate, though.
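For what it's worth, a children-before-parents visiting order does not strictly need the tables themselves to be reordered: sorting node IDs by time gives one. A small sketch, assuming every parent is strictly older than its children (as tskit requires); the function name and toy arrays are hypothetical.

```python
import numpy as np


def children_first_order(nodes_time):
    """Node IDs ordered youngest first, so children precede their parents."""
    return np.argsort(nodes_time, kind="stable")


# Toy graph: node 4 is the parent of 0 and 1, node 3 is the parent of 4 and 2.
nodes_time = np.array([0.0, 0.0, 0.0, 1.5, 0.7])
edges_parent = np.array([4, 4, 3, 3])
edges_child = np.array([0, 1, 4, 2])

order = children_first_order(nodes_time)
# rank[u] is the position of node u in the visiting order.
rank = np.empty_like(order)
rank[order] = np.arange(len(order))
children_before_parents = bool(np.all(rank[edges_child] < rank[edges_parent]))
```

Reversing `order` gives the parents-before-children traversal.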

0 replies

hyanwong · 2023-11-10T15:57:49Z

hyanwong
Nov 10, 2023
Maintainer

Indexing into the tskit edges is clever. It feels to me like you really want to be indexing into the graph edges (each of which might have several intervals), but we don't have such a table structure in tskit. I guess it's fine indexing into the "edge-intervals" table, but you might need to uniquify by e.g. parent ID if you want to find the IDs of all the parent nodes of a focal node u.

In tsdate we rely, roughly, on the fact that edges are sorted first by parent (time), then by child ID, then by left coord. I think that the list of sequential parent edges for a child should therefore be sorted by parent ID, and only after that by left coord? That way we can do a for parent, edges in itertools.groupby(edges_for_this_child). This is essentially doing the uniquifying-by-parent that I mentioned above.

You could imagine wrapping this in something so that you could easily traverse through the parent-child (or child-parent) links without caring about the intervals.

2 replies

jeromekelleher Nov 10, 2023
Maintainer Author

Indexing into the tskit edges is clever. It feels to me like you really want to be indexing into the graph edges (each of which might have several intervals), but we don't have such a table structure in tskit. I guess it's fine indexing into the "edge-intervals" table, but you might need to uniquify by e.g. parent ID if you want to find the IDs of all the parent nodes of a focal node u.

I thought hard about this, and I don't see a way of separating the graph edges from the inheritance intervals while still maintaining an efficient, numba-able structure.

hyanwong Nov 10, 2023
Maintainer

I'm glad you thought about it: I can well believe it would kill efficiency. If the edges within the parent_range are sorted by child id first (i.e. child ID is your secondary sort key), then at least you don't need to sort the edge-intervals to get the list of child IDs: you just skip until the child ID changes (equivalent to wrapping in itertools.groupby)

We could probably make a wrapper function that does this skipping and which is numba-able?
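Something like the following might do as a first sketch of that skipping wrapper. It relies on the edges within a parent's range being sorted by child ID, which the existing edge sortedness requirements give us for child_range; `distinct_children` is a hypothetical name, and it is written as a plain loop with numba in mind rather than actually tested under numba.

```python
import numpy as np


def distinct_children(start, stop, edges_child):
    """List each child once for the edges in [start, stop)."""
    out = []
    last = -1
    for e in range(start, stop):
        c = edges_child[e]
        # Edges for the same child are adjacent, so only record changes.
        if c != last:
            out.append(int(c))
            last = c
    return out


# Child column of the 24-row example edge table from the first post.
edges_child = np.array(
    [1, 1, 0, 4, 2, 6, 3, 3, 7, 9, 5, 8, 10, 10, 13, 13, 12, 15, 11, 11, 16, 16, 14, 17]
)
# Node 17 has child_range [18, 22) in that example: four edges, two children.
children_of_17 = distinct_children(18, 22, edges_child)
```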

hyanwong · 2023-11-10T16:48:12Z

hyanwong
Nov 10, 2023
Maintainer

For the list of descendants of a node u, you could imagine wanting either the list of samples that have inherited anything from u (which is what you have), or the list of descendants of u regardless of whether they have inherited anything from it. This latter list is actually easier to construct, as you don't have any intervals to keep track of.

Another interesting (related) one is, for a set of samples s, to find the most recent common ancestor anywhere in the genome, or the most recent common ancestor in the ARG regardless of genomic inheritance.

4 replies

jeromekelleher Nov 10, 2023
Maintainer Author

For the list of descendants of a node u, you could imagine wanting either the list of samples that have inherited anything from u (which is what you have), or the list of descendants of u regardless of whether they have inherited anything from it. This latter list is actually easier to construct, as you don't have any intervals to keep track of.

Here's the graph descendants (although I wonder what it would actually be useful for?)

def graph_descendants(ts, arg, u):
    """
    Return boolean array marking whether a node is a graph descendant
    of the specified node u. Note that this does not require that the
    node inherited any genetic material.
    """
    is_descendant = np.zeros(ts.num_nodes, dtype=bool)
    stack = [u]
    while len(stack) > 0:
        u = stack.pop()
        for e in range(*arg.child_range[u]):
            e_child = ts.edges_child[e]
            if not is_descendant[e_child]:
                # Note: setting is_descendant here because we can
                # push the same node on the stack multiple times otherwise
                is_descendant[e_child] = True
                stack.append(e_child)
    return is_descendant

hyanwong Nov 10, 2023
Maintainer

Yes, thanks, that's a nice simple one. Re usage, we might want to find approximations to pedigree relations from an ARG, without caring about genetic inheritance, right? Like we might want to know the biparental MRCA from a non-sample-resolved forward simulation.

petrelharp Nov 14, 2023
Maintainer

FYI, another way of getting the graph descendants is to make the adjacency matrix (as a sparse matrix) and then do matrix multiplication: if A[i,j] is equal to the number of edges from i (as parent) to j (as child) then the k-th matrix power of A gives the number of length-k paths from i to j; to get the descendants of a particular node u you take a row vector with 1 in the u-th slot and right-multiply by A a bunch. No idea if this is faster, but it's nice to lean on sparse matrix libraries.

jeromekelleher Nov 14, 2023
Maintainer Author

Unlikely to be faster I would imagine, but it's a great way to test!
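A small sketch of the matrix idea as a cross-check for graph_descendants. It uses a dense numpy matrix purely to keep the example dependency-free; on a real ARG you would want a sparse matrix library instead. The function name and the toy edge arrays are hypothetical.

```python
import numpy as np


def graph_descendants_matrix(edges_parent, edges_child, num_nodes, u):
    """Boolean array of graph descendants of u via repeated vector-matrix products."""
    # A[i, j] counts the edges from parent i to child j.
    A = np.zeros((num_nodes, num_nodes), dtype=np.int64)
    np.add.at(A, (edges_parent, edges_child), 1)
    reached = np.zeros(num_nodes, dtype=bool)
    frontier = np.zeros(num_nodes, dtype=np.int64)
    frontier[u] = 1
    while frontier.any():
        # One step down the graph from every node in the frontier.
        step = (frontier @ A) > 0
        new = step & ~reached
        reached |= new
        frontier = new.astype(np.int64)
    return reached


# Toy graph: 5 -> {4, 2}, 4 -> {0, 1}, 6 -> {3}.
edges_parent = np.array([4, 4, 5, 5, 6])
edges_child = np.array([0, 1, 4, 2, 3])
descendants_of_5 = np.flatnonzero(
    graph_descendants_matrix(edges_parent, edges_child, 7, 5)
)
```

As in graph_descendants above, the focal node itself is not marked.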

jeromekelleher · 2023-11-13T11:37:46Z

jeromekelleher
Nov 13, 2023
Maintainer Author

Here's some corresponding operations for ancestors:

def ancestor_span(ts, arg, u):
    """
    Return an array giving the total sequence length over which
    each node in the tree sequence is an ancestor to the specified node.
    """
    total_ancestral = np.zeros(ts.num_nodes)
    stack = [(u, 0, ts.sequence_length)]
    total_ancestral[u] = ts.sequence_length  # NOTE questionable quick hack
    while len(stack) > 0:
        u, left, right = stack.pop()
        # NOTE: parent_index is sorted by left coord within each child,
        # so we could binary search to the first match and
        # break once e_left >= right (I think?)
        for j in range(*arg.parent_range[u]):
            e = arg.parent_index[j]
            e_left = ts.edges_left[e]
            e_right = ts.edges_right[e]
            if e_right > left and right > e_left:
                inter_left = max(e_left, left)
                inter_right = min(e_right, right)
                e_parent = ts.edges_parent[e]
                total_ancestral[e_parent] += inter_right - inter_left
                stack.append((e_parent, inter_left, inter_right))
    return total_ancestral


def ancestor_span_tree(ts, u):
    total_ancestral = np.zeros(ts.num_nodes)
    for tree in ts.trees():
        v = u
        while v != -1:
            total_ancestral[v] += tree.span
            v = tree.parent(v)
    return total_ancestral

def graph_ancestors(ts, arg, u):
    """
    Return boolean array marking whether a node is a graph ancestor
    of the specified node u. Note that this does not require that the
    node inherited any genetic material.
    """
    is_ancestor = np.zeros(ts.num_nodes, dtype=bool)
    is_ancestor[u] = True
    stack = [u]
    while len(stack) > 0:
        u = stack.pop()
        for j in range(*arg.parent_range[u]):
            e = arg.parent_index[j]
            e_parent = ts.edges_parent[e]
            if not is_ancestor[e_parent]:
                # Note: setting is_ancestor here because we can
                # push the same node on the stack multiple times otherwise
                is_ancestor[e_parent] = True
                stack.append(e_parent)
    return is_ancestor

for u in range(ts.num_nodes):
    d1 = ancestor_span(ts, arg, u)
    d2 = ancestor_span_tree(ts, u)
    np.testing.assert_array_equal(d1, d2)

Note these are basically symmetric with the descendants version, except for the direction we travel through the graph.

0 replies

jeromekelleher · 2023-11-13T11:43:19Z

jeromekelleher
Nov 13, 2023
Maintainer Author

We can use the graph_ancestors to define the "graph MRCA" (the most recent common ancestor not necessarily sharing any ancestral material):

def graph_mrca(ts, arg, u, v):
    u_ancestors = graph_ancestors(ts, arg, u)
    v_ancestors = graph_ancestors(ts, arg, v)
    common = np.logical_and(u_ancestors, v_ancestors)
    min_index = np.argmin(ts.nodes_time[common])
    return np.where(common)[0][min_index]

You could implement it a bit better, but the time difference wouldn't be that much.

The "real" MRCA is much harder because you have to propagate the combined ancestral material through the graph for u and v. What you're doing, essentially, is simplifying the ARG with respect to the samples u and v, so implementing in terms of simplify would be a reasonable thing to do (if not optimal).

0 replies

hyanwong · 2023-11-13T11:46:24Z

hyanwong
Nov 13, 2023
Maintainer

I assume most of these algorithms would work fine if we had a "non-sample-resolved" ARG, e.g. from an unsimplified forward simulation? We'd have lots of "hanging topology", but the interval tracking stuff should still work OK.

2 replies

jeromekelleher Nov 13, 2023
Maintainer Author

These algorithms should be fully general, but I haven't thought about the specifics. There may be some corner cases lurking if you ran them on an unsimplified prospective ARG, but generally I'd expect them to work.

hyanwong Nov 13, 2023
Maintainer

Yep, that was my thought too. If we do implement anything, we should run tests on both simplified and unsimplified / prospective genealogies, to check we get the same answers.

jeromekelleher · 2023-11-13T11:57:34Z

jeromekelleher
Nov 13, 2023
Maintainer Author

So, the interesting algorithmic question that's emerging here is whether sorting parent/child intervals by left coordinate would actually make the fundamental ancestor_span and descendant_span operations more efficient. It seems clear that we can implement "graph" level stuff efficiently with O(num nodes) space, which is fine when we're doing these O(num nodes) operations anyway.

Are there graphs in which we have enough intervals per parent/child pair to make sorting worthwhile? It's not clear to me.

5 replies

hyanwong Nov 13, 2023
Maintainer

I'm not sure how much efficiency we would gain. I think the main gain is in syntactic simplicity, so we need to think very hard about how we present the API to the user.

For instance, I see something like the following as canonical:

for node_obj in ts.arg.nodes(time="asc"):
    for arg_edge in node_obj.parent_arg_edges():
        parent = arg_edge.parent
        child = node_obj.id
        for interval in arg_edge.intervals():
            # Do something with the intervals

(and similarly with node_obj.child_arg_edges)

jeromekelleher Nov 13, 2023
Maintainer Author

That's a separate issue - we're not talking about high-level APIs yet: this is about how we support efficient ARG operations. Python loops are by definition not efficient.

hyanwong Nov 13, 2023
Maintainer

Right, but accessing the intervals for a given parent/child combination is not efficient if the indexes have the secondary sort key of left, I think?

jeromekelleher Nov 13, 2023
Maintainer Author

I don't understand - all of the above examples are doing this efficiently? It's just the ordering of the intervals within a parent-child combination that's in question.

hyanwong Jan 20, 2024
Maintainer

I thought that if the first sort key was parent and the secondary sort key was left, then to find all the edges with the same parent AND child, you would need to traverse all the edges for that parent? So the intervals for a fixed parent/child combination are not adjacent?
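A toy illustration of the adjacency point, using np.lexsort (whose last key is the primary one). The four edges are hypothetical: one parent with two children, each inheriting two separate intervals.

```python
import numpy as np

# Hypothetical edges for a single parent (node 9), in edge-table order:
# child 3 inherits [0, 2) and [4, 6), child 4 inherits [2, 4) and [6, 8).
edges_parent = np.array([9, 9, 9, 9])
edges_child = np.array([3, 3, 4, 4])
edges_left = np.array([0, 4, 2, 6])

# Primary key parent, secondary key left coordinate.
by_left = np.lexsort((edges_left, edges_parent))
# Primary key parent, then child, then left (the existing edge-table order).
by_child = np.lexsort((edges_left, edges_child, edges_parent))

children_by_left = edges_child[by_left]
children_by_child = edges_child[by_child]
```

With left as the secondary key the children alternate (3, 4, 3, 4), so the intervals for one parent/child pair are interleaved; with child as the secondary key they come out grouped (3, 3, 4, 4).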

jeromekelleher · 2023-11-13T15:04:25Z

jeromekelleher
Nov 13, 2023
Maintainer Author

Here's a version in which we capture the sub-ARG descending from a given individual as an edge table:

import tskit


def descendant_intervals(ts, arg, u):
    descending = tskit.EdgeTable()
    stack = [(u, 0, ts.sequence_length)]
    while len(stack) > 0:
        u, left, right = stack.pop()
        for e in range(*arg.child_range[u]):
            e_left = ts.edges_left[e]
            e_right = ts.edges_right[e]
            if e_right > left and right > e_left:
                inter_left = max(e_left, left)
                inter_right = min(e_right, right)
                e_child = ts.edges_child[e]
                descending.add_row(inter_left, inter_right, u, e_child)
                stack.append((e_child, inter_left, inter_right))
    return descending

This is basically the same as descendant_span except we keep track of the actual intervals that are percolating down through the ARG instead of summing up their spans.

0 replies

hyanwong · 2023-11-13T15:33:41Z

hyanwong
Nov 13, 2023
Maintainer

The main use for these ARG algorithms is when we want to perform node-focussed analysis (e.g. we have a big ARG, and want to know specific things about a node or set of nodes and their ancestors/descendants). We discussed a few interesting use cases:

For the "intervals_descend_from" case:
1. Identify a "focal node" or set of "focal nodes" and create the subgraph of all descendant nodes. E.g. agricultural datasets: pick "bull X" 50 generations ago and ask which bits of the present-day genome descend from that bull (or Genghis Khan, etc.). Or pick a set (e.g. "all human ancestors identified as Neanderthals") and look at all the pieces descended from them in the present-day population. Similarly with SARS-CoV-2: which pieces of a given variant are present in e.g. today's isolates?
2. The subgraph formed like this could potentially be quite fragmented: lots of trees will have isolated nodes. It is interesting to look at the variance in isolatedness. E.g. do most Europeans share the same sections of Neanderthal genome, or does every modern human have a different segment? How does this change depending on the coalescence times in each local tree?
3. Are there interesting viz differences between the sub-ARGs, e.g. for Neanderthal vs. non-Neanderthal fragments?

Other algorithms:
1. Running top-to-bottom algorithms looking for metrics as a function of time, e.g. looking for boundaries of changes in effective population size, or migration rate over time.

Plotting:
1. Can we easily isolate a subgraph of the sc2ts graph, e.g. all the ancestors and descendants of a focal node for (say) 20 days before and after the node time, for plotting?

1 reply

jeromekelleher Nov 13, 2023
Maintainer Author

Can we easily isolate a subgraph of the sc2ts graph, e.g. all the ancestors and descendants of a focal node for (say) 20 days before and after the node time, for plotting?

Interesting one. Easy way to do it (not checking actual syntax):

descending = descendant_intervals(ts, arg, u)
ancestral = ancestral_intervals(ts, arg, u)
tables = ts.tables.copy()
tables.edges = ancestral + descending # This doesn't work, but you get the idea
tables.delete_older(ts.nodes_time[u] + 20)
tables.delete_younger(ts.nodes_time[u] - 20) # method doesn't exist
tables.sort() # Needed?
subts = tables.tree_sequence()
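Since delete_younger doesn't exist, one way to express the time-window step without it is as a boolean mask over edges, computed from the node times. This is only a sketch on plain arrays: `edges_in_time_window` is a made-up helper, and how the mask would then be applied to a table collection is left open.

```python
import numpy as np


def edges_in_time_window(edges_parent, edges_child, nodes_time, t_min, t_max):
    """Boolean mask of edges whose parent and child both lie in [t_min, t_max]."""
    parent_time = nodes_time[edges_parent]
    child_time = nodes_time[edges_child]
    parent_ok = (parent_time >= t_min) & (parent_time <= t_max)
    child_ok = (child_time >= t_min) & (child_time <= t_max)
    return parent_ok & child_ok


# Toy chain 4 -> 3 -> 2 -> 1 -> 0 with the focal node 2 at time 25.
nodes_time = np.array([0.0, 10.0, 25.0, 40.0, 70.0])
edges_parent = np.array([1, 2, 3, 4])
edges_child = np.array([0, 1, 2, 3])
u = 2
keep = edges_in_time_window(
    edges_parent, edges_child, nodes_time, nodes_time[u] - 20, nodes_time[u] + 20
)
```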

hyanwong · 2023-11-13T15:34:07Z

hyanwong
Nov 13, 2023
Maintainer

I just thought of another case where vertical traversal might make more sense than left-right traversal. Imagine a large number of simulated chromosomes (e.g. Daiki's approximation to the infinitesimal model with no recombination within each chromosome), but where we don't have the underlying pedigree. The left-right traversal will have a large number of edge changes between each tree, but there will be a lot of "switching back" to the original parent. This is a case where there will be lots of intervals per parent/child combination, and it might make sense to consider e.g. all the children of a specific parent, or all the parents of a specific child, then iterate (or skip) through the intervals.

0 replies

petrelharp · 2023-11-14T05:48:13Z

petrelharp
Nov 14, 2023
Maintainer

This is very nice! Tell me if this is right - the point is essentially providing efficient ways to look up "who are the parents" and "who are the children" of a given node (and on which intervals)? Thus making bottom-up or top-down iteration easier? And so it's 'just' indexing into the same data structure, but by saying "ARG API" you're saying "ARG" (as opposed to "tree sequence") to connote thinking in the up/down direction, rather than left/right? I like it, although I worry a bit that the ARG/tree sequence terminology doesn't overlap terribly well with our definitions of those?

Another set of operations are IBD-related ones.

5 replies

jeromekelleher Nov 14, 2023
Maintainer Author

Yes, that's basically it I think. The current TreeSequence APIs are very good at doing tree-by-tree stuff, but not good at all at answering questions about particular nodes. Viewing the structure as a graph rather than a sequence of trees makes these time-order operations simpler and more efficient. @gtsambos has already done a bunch of pioneering work on bottom-to-top algorithms (including IBD), so she might have some thoughts?

The downside is that we would need to build some additional indexes to fully support the forward-time-order operations (see the initial post above). We might be able to make some backwards time operations a bit more efficient by sorting intervals within a parent node by left coordinate, however.

The nomenclature is really hard here. Perhaps it would be better to reserve the term "ARG" as the general, slightly vague thing, and refer to this specific structure here (the view on the edge table, plus indexes) as an "InheritanceGraph" or something? It's more descriptive and may be less confusing in the long run.

So, the ARG is the general thing, and a succinct tree sequence (nodes + edge table + other tables) is the specific encoding of an ARG. A TreeSequence is a view on the node and edge tables with two additional left-to-right sort-indexes on the edges which facilitates fast tree-by-tree operations. An InheritanceGraph is also a view on the edge table, with two additional forward-and-backwards time sorting-indexes into the edges.

This probably would be less confusing than "An ARG is the general thing, and succinct... A TreeSequence is a view... An ARG is another view ..."

We don't have to, and probably shouldn't try to, sort out nomenclature here. I'm just trying to clarify what the structure above is for and why we might want to do it.

petrelharp Nov 14, 2023
Maintainer

+1 for Inheritance Graph. Maybe even +2. Agreed we don't have to sort that out here - I just needed to get straight in my head what the big picture goal here was.

petrelharp Nov 14, 2023
Maintainer

Also - I am still impressed by how clever and simple this idea is.

jeromekelleher Nov 14, 2023
Maintainer Author

It's neat isn't it? I feel like simplify would be much simpler to think about using these tools.

petrelharp Nov 14, 2023
Maintainer

Totally. I guess that simplify is already reasonably natural because the ordering in the tables derives from the natural ordering of coalescent simulations, which is back-in-time. Pushing things forwards in time, on the other hand...

hyanwong · 2023-11-14T21:38:41Z

hyanwong
Nov 14, 2023
Maintainer

Here's another useful node-focussed algorithm: given 2 samples, which part(s) of the genome have the most recent common ancestor? I.e. which pieces of the genome are the closest, and what ancestral node(s) do those correspond to?

3 replies

jeromekelleher Nov 15, 2023
Maintainer Author

That's the same question as the genetic mrca isn't it? With the addition of returning the actual intervals, which you would have to track anyway.

hyanwong Nov 15, 2023
Maintainer

Oh yes, sorry. I had lost track of the discussion on that point. The complications are (a) there could be multiple MRCAs at the same time and (b) I guess it would be useful to do a trick like the one in the normal MRCA function which doesn't require us to find all the ancestors but use the time ordering of parents to work out which interval we should look at next, to have a chance of finding the most recent common ancestor, without having to go all the way back to the root.

For a pair of nodes (especially if they are closely related), I think this would be more efficient than simplifying wrt the two nodes? Although I'm not sure of the internal details of simplify.

jeromekelleher Nov 15, 2023
Maintainer Author

Yes, I'm sure it could be implemented more efficiently, but it's basically the same algorithm

hyanwong · 2023-11-15T10:37:14Z

hyanwong
Nov 15, 2023
Maintainer

I'm looking at shared recombinations at the moment, and, for a specific node which has multiple ARG parents, where the parent changes from A to B at a specific breakpoint x, I want to find the "MRCA" shared between parent A in the left tree and parent B in the right tree. I also needed to do exactly this for the sc2ts tree, to find the "age" of the recombinant parents. So I suspect this might be a commonly-used ARG operation? Basically: how distantly related are two parents at a breakpoint position?

At the moment I have to jump to the tree at the breakpoint, find the chain of parents from B, then switch to the previous tree and find the other chain from A. There's probably a neater way to do this on a per-node basis using the ARG structures than by building the entire tree every time, right?
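For reference, the two-tree approach described here can be sketched as below. This is only an illustration working on plain parent arrays (with -1 marking a root), such as one might take from `tree.parent_array` for the trees either side of the breakpoint; the function names and the toy arrays are hypothetical.

```python
def ancestor_chain(parent, u):
    """Return u followed by all of its ancestors in one tree (-1 marks a root)."""
    chain = []
    while u != -1:
        chain.append(u)
        u = parent[u]
    return chain


def breakpoint_mrca(left_parent, right_parent, a, b):
    """
    Youngest node lying both above parent `a` in the left-hand tree and
    above parent `b` in the right-hand tree, or -1 if there is none.
    """
    above_a = set(ancestor_chain(left_parent, a))
    # The chain from b is ordered from young to old, so the first hit
    # is the most recent shared node.
    for v in ancestor_chain(right_parent, b):
        if v in above_a:
            return v
    return -1


# Toy example: node 0 has parent 4 to the left of the breakpoint and
# parent 5 to the right of it; both parents descend from node 6.
left_parent = [4, 4, 5, 5, 6, 6, -1]
right_parent = [5, 4, 5, 5, 6, 6, -1]
print(breakpoint_mrca(left_parent, right_parent, 4, 5))  # 6
```

A graph-based version would replace the two parent arrays with lookups through the per-node parent ranges, so that neither tree needs to be built.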

2 replies

jeromekelleher Nov 15, 2023
Maintainer Author

You could use the graph_mrca function above to get a most recent common ancestor: #2869 (comment)

A minor tweak would give you all graph most recent common ancestors, if multiple exist at the same time.

That's probably what you want?

If you want the genetic MRCAs, then you'd need to use genetic_mrca I guess.

jeromekelleher Nov 15, 2023
Maintainer Author

I doubt either of those is much quicker than just looking at the tree on the left and right of the breakpoint, though.

gtsambos · 2023-11-21T20:46:13Z

gtsambos
Nov 21, 2023
Collaborator

Hey folks, just getting to this now, it's been a hectic few weeks here. Will weigh in once I've parsed all you've already done here!

0 replies

hyanwong · 2023-12-15T09:11:47Z

hyanwong
Dec 15, 2023
Maintainer

I just noticed that in the original definition of parent_index the edges are sorted by the ID of the child as primary key. We could use the standard edges convention that they are ordered first by time of child, and then by child ID? What are the pros/cons for this?

I am coming round to the idea that the next sort key after this should be the left position, as this allows more efficient processing of intervals. We should probably sort by both right coordinate and parent ID after that, so that we enforce a canonical ordering?
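As a rough sketch of what that canonical ordering could look like, the following builds a parent index sorted by child ID, then left, then right, then parent ID, along with the per-node ranges. The function name `build_parent_index` and the toy edge columns are assumptions for illustration only; it uses child ID as the primary key (not child time) and fills nodes without parents with -1, matching the initial post.

```python
import numpy as np


def build_parent_index(left, right, parent, child, num_nodes):
    """Return (parent_index, parent_range) for the given edge columns."""
    # np.lexsort treats the LAST key as the primary one, so this sorts by
    # child, then left, then right, then parent.
    index = np.lexsort((parent, right, left, child)).astype(np.int32)
    sorted_child = child[index]
    nodes = np.arange(num_nodes)
    start = np.searchsorted(sorted_child, nodes, side="left")
    stop = np.searchsorted(sorted_child, nodes, side="right")
    parent_range = np.stack([start, stop], axis=1).astype(np.int32)
    # Nodes that never appear as a child get the (-1, -1) sentinel.
    parent_range[start == stop] = -1
    return index, parent_range


# Toy edge columns (hypothetical): node 0 has parent 3 on [0, 10) and
# parent 2 on [10, 20), but the edges are listed in the other order.
left = np.array([10, 0, 0, 0, 0], dtype=np.float64)
right = np.array([20, 10, 20, 20, 20], dtype=np.float64)
parent = np.array([2, 3, 3, 4, 4], dtype=np.int32)
child = np.array([0, 0, 1, 2, 3], dtype=np.int32)

index, parent_range = build_parent_index(left, right, parent, child, num_nodes=5)
print(index.tolist())  # [1, 0, 2, 3, 4]
```

With the left coordinate as the second key, the intervals for each child come out in genome order, which is what makes skipping through them cheap.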

0 replies

jeromekelleher · 2023-12-22T10:05:23Z

jeromekelleher
Dec 22, 2023
Maintainer Author

Here's another nice algorithm to count the number of samples reachable from each node in the graph: #2882 (comment)
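For readers who don't follow the link, here is a minimal sketch of one way such a count could be done. It is not necessarily the algorithm in the linked comment. It assumes the edges are sorted by parent time, as the edge table requires, and it ignores inheritance intervals entirely. The function name and the toy graph are hypothetical.

```python
def reachable_sample_counts(num_nodes, samples, edges_parent, edges_child):
    """
    Count the distinct samples reachable by following edges downwards
    from each node. Each node keeps a bitset (a Python int) with one bit
    per sample, so a sample reached along several paths is counted once.
    """
    reach = [0] * num_nodes
    for k, u in enumerate(samples):
        reach[u] = 1 << k
    # Edges are assumed sorted by parent time, so every child's bitset
    # is complete before it is merged into any of its parents.
    for p, c in zip(edges_parent, edges_child):
        reach[p] |= reach[c]
    return [bin(bits).count("1") for bits in reach]


# Toy graph: sample 0 has two parents (2 and 3) which both lead to node 5.
edges_parent = [2, 3, 4, 4, 5, 5]
edges_child = [0, 0, 1, 2, 3, 4]
counts = reachable_sample_counts(6, [0, 1], edges_parent, edges_child)
print(counts)  # [1, 1, 1, 1, 2, 2]
```

Simply summing child counts would report 3 for node 5 here, because sample 0 is reachable along two paths. Python ints are not numba friendly, so a real implementation would need a different set representation.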

0 replies

hyanwong · 2024-01-20T20:19:51Z

hyanwong
Jan 20, 2024
Maintainer

Another easy thing to write using parent indexes is a sample_resolve algorithm, which prunes down edges by removing material that is non-ancestral to a set of samples (but which keeps all nodes, and also doesn't squash edges together). Based on Jerome's code above I wrote a more elaborate version for the generalised inheritance graph structure (which includes inversions etc) in https://github.com/hyanwong/GeneticInheritanceGraphLibrary/blob/4aa7a922ca873fc82759ae34b9f1070b3f2afd32/GeneticInheritanceGraphLibrary/graph.py#L211
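To illustrate the idea (this is a simplified sketch, not the linked implementation, and it works on plain tuples instead of the parent index), sample resolution can be done by passing ancestral intervals upwards from the samples. All names here are hypothetical.

```python
def merge(intervals):
    """Union of (left, right) intervals, joining any that overlap or touch."""
    merged = []
    for left, right in sorted(intervals):
        if merged and left <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], right))
        else:
            merged.append((left, right))
    return merged


def sample_resolve(edges, node_time, samples, sequence_length):
    """
    Trim (left, right, parent, child) edges to the material ancestral to
    `samples`. Nodes are untouched and edges are never squashed together.
    """
    ancestral = {u: [(0, sequence_length)] for u in samples}
    out = []
    # Visit edges youngest child first, so that a child's ancestral
    # material is complete before it is passed up to its parents.
    order = sorted(edges, key=lambda e: (node_time[e[3]], e[3], e[0]))
    for left, right, parent, child in order:
        for a_left, a_right in merge(ancestral.get(child, [])):
            lo, hi = max(left, a_left), min(right, a_right)
            if lo < hi:
                out.append((lo, hi, parent, child))
                ancestral.setdefault(parent, []).append((lo, hi))
    return out


# Toy example: resolve with respect to sample 1 only.
node_time = [0, 0, 1, 2]
edges = [(0, 10, 2, 0), (0, 5, 2, 1), (5, 10, 3, 1), (0, 10, 3, 2)]
resolved = sample_resolve(edges, node_time, samples=[1], sequence_length=10)
print(resolved)  # [(0, 5, 2, 1), (5, 10, 3, 1), (0, 5, 3, 2)]
```

The edge from node 2 to node 3 is cut down to [0, 5) because node 2 only carries material from sample 1 on that interval, and the edge above sample 0 is dropped altogether.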

0 replies

hyanwong · 2025-04-14T19:43:24Z

hyanwong
Apr 14, 2025
Maintainer

Just pinging @kitchensjn here, as he has been thinking about ARG traversals and suchlike, and might have some useful comments. I think we intend to implement something like this for tskit 1.0, right? So finding users to test this functionality would be good. I wonder if we could implement a non-documented API first, so that users like James could try it out in beta versions of software like the tskit_arg_visualizer (not that that particular one usually needs to be efficient, I think, unless it's doing something funky with the sc2ts ARG?)

0 replies

kitchensjn · 2025-04-17T01:33:46Z

kitchensjn
Apr 17, 2025

Thanks for pointing me to this @hyanwong! Happy to help test things out.

For sparg, we often thought about the different paths of inheritance that a sample had within an ARG. You could do this across all of the trees and then remove the duplicated paths, but I find it easier to work with the graph in this case. Something like...

import numpy as np


def unique_ancestral_lineages(ts, arg, u):
    """
    Returns list of ancestral lineages above a specified node. Note that this
    does not require that the node inherited any genetic material.
    """
    parent_range = range(*arg.parent_range[u])
    if len(parent_range) == 0:  # (-1, -1) means u has no parents
        return [np.array([u])]
    lineages = []
    seen = set()
    for j in parent_range:
        e = arg.parent_index[j]
        e_parent = ts.edges_parent[e]
        if e_parent not in seen:  # ignore duplicate child-parent edges
            seen.add(e_parent)
            # Prepend u to every lineage found above this parent
            for line in unique_ancestral_lineages(ts, arg, e_parent):
                lineages.append(np.append(np.array([u]), line))
    return lineages

Similar to graph_descendants() from above, this function doesn't take into account whether the node inherited genetic material along that path, just that there is a possible connection. I've also ignored duplicated lineages caused by edges appearing multiple times in the edge table. Having the lists of lineages makes it pretty straightforward to calculate the amount of shared time between lineages in the ARG, even if those lineages are found in different trees. This is not the function sparg uses but is similar in concept. We instead calculated the shared time matrix at the same time as identifying the lineages, but that is just for efficiency reasons when working with so many paths.

Do you think that there will/should be constraints for the recombination node format used in the graph API? I'd imagine it would be difficult to enforce given the attachment to the edge table, but I've found the 2-RE versus 1-RE formatting differences even more critical when working with the graph object. Yan recently pointed out an issue in one of the visualizer's conversion functions that relates to this ambiguity in formatting. And for a more relevant example here... say I wanted to calculate the shared time between lineages [ 1 4 6 7 10 13 14 18] and [ 1 5 11 17 18] in the example ARG at the top of the discussion. Since this ARG uses the 2-RE format, we know that nodes 4 and 5 refer to the same recombination node, so we need to add the length of the edge from 1 to 4/5 to the shared time between the lineages. But if we don't have prior information about the formatting, we might just as easily say 4 and 5 are different so the edges from 1 to 4 and 1 to 5 are both unshared. I expect that this will come up with many of the graph API methods since we would want them to handle msprime.sim_ancestry(...,record_full_arg=True) outputs as well as simplified graphs, but I also don't have any great thoughts on how best to handle it.
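To show how the 2-RE ambiguity changes the answer, here is a small sketch of a shared-time calculation between two lineages. This is not the sparg function. The `alias` argument (mapping one node of a 2-RE pair onto the other) is a hypothetical way of supplying the "prior information about the formatting", and the toy lineages and times are made up and do not come from the example ARG.

```python
def shared_time(lineage_a, lineage_b, node_time, alias=None):
    """
    Total length of the branches that two lineages (listed from the sample
    upwards) have in common. `alias` optionally maps a node ID to the node
    it should be treated as identical to.
    """
    alias = alias or {}

    def branches(lineage):
        canon = [alias.get(u, u) for u in lineage]
        # Each branch is a (child, parent) pair; drop any branch that
        # collapses to a single node after aliasing.
        return {(c, p) for c, p in zip(canon[:-1], canon[1:]) if c != p}

    shared = branches(lineage_a) & branches(lineage_b)
    return sum(node_time[p] - node_time[c] for c, p in shared)


# Toy example: nodes 1 and 2 are the two halves of one recombination event.
node_time = [0.0, 1.0, 1.0, 2.0, 3.0, 4.0]
lineage_a = [0, 1, 3, 5]
lineage_b = [0, 2, 4, 5]

print(shared_time(lineage_a, lineage_b, node_time))  # 0
print(shared_time(lineage_a, lineage_b, node_time, alias={2: 1}))  # 1.0
```

Without the alias the branches from node 0 to nodes 1 and 2 look unshared; with it, the branch below the recombination node is counted once as shared time.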

1 reply

jeromekelleher Apr 17, 2025
Maintainer Author

Good points @kitchensjn, we'll definitely need to have a clear answer to the annoying 1-vs-2 RE node question (ideally, just dealing with it) in the proposed graph APIs
