tskit's `genetic_relatedness()` versus eGRM (Fan et al. 2022) #2603

grahamgower · 2022-10-11T07:37:47Z

grahamgower
Oct 11, 2022
Collaborator

Hello tskitters,

I've inferred tree sequences using tsinfer/tsdate for a chicken dataset of 674 individuals. I also have trees from relate (but output below is for tsinfer/tsdate trees). I've chosen the chromosome with the smallest trees file (chr16), and calculated the genetic relatedness matrix (GRM) using tskit's genetic_relatedness() and Fan et al.'s egrm package.

1. What is the difference between the GRM obtained from tskit (using `mode="branch"`, script reproduced below) and egrm? If someone on the street asked me, I'd say they should be doing essentially the same thing, but I don't grok the stats framework and/or the `genetic_relatedness()` docs (they're too general and/or abstract for me).
2. Why is there such a huge discrepancy in resources used by tskit compared with eGRM? tskit took 170 minutes, and egrm took under 8 minutes. I didn't think to record the memory usage, but saw in top that tskit was hitting 17 GB (resident), while egrm didn't seem to go much beyond 120 MB.

$ tskit info chr16.dated.trees
╔═══════════════════════════╗
║TreeSequence               ║
╠═══════════════╤═══════════╣
║Trees          │      23646║
╟───────────────┼───────────╢
║Sequence Length│    2844601║
╟───────────────┼───────────╢
║Time Units     │generations║
╟───────────────┼───────────╢
║Sample Nodes   │       1348║
╟───────────────┼───────────╢
║Total Size     │   25.8 MiB║
╚═══════════════╧═══════════╝
╔═══════════╤══════╤═════════╤════════════╗
║Table      │Rows  │Size     │Has Metadata║
╠═══════════╪══════╪═════════╪════════════╣
║Edges      │549516│ 16.8 MiB│          No║
╟───────────┼──────┼─────────┼────────────╢
║Individuals│   674│ 31.1 KiB│         Yes║
╟───────────┼──────┼─────────┼────────────╢
║Migrations │     0│  8 Bytes│          No║
╟───────────┼──────┼─────────┼────────────╢
║Mutations  │ 25334│915.4 KiB│          No║
╟───────────┼──────┼─────────┼────────────╢
║Nodes      │ 36377│  2.7 MiB│         Yes║
╟───────────┼──────┼─────────┼────────────╢
║Populations│     5│143 Bytes│         Yes║
╟───────────┼──────┼─────────┼────────────╢
║Provenances│     8│  4.5 KiB│          No║
╟───────────┼──────┼─────────┼────────────╢
║Sites      │ 25334│  1.2 MiB│         Yes║
╚═══════════╧══════╧═════════╧════════════╝

# grm.py
import sys

import numpy as np
import tskit


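def genetic_relatedness_matrix(ts, sample_sets, mode):
    # Compute relatedness for every pair (i, j) with i <= j in a single call,
    # then mirror the upper triangle to obtain the full symmetric matrix.
    n = len(sample_sets)
    triu_indices = np.triu_indices(n)
    indexes = np.transpose(triu_indices)  # one (i, j) row per pair
    K = np.zeros((n, n))
    K[triu_indices] = ts.genetic_relatedness(
        sample_sets, indexes, mode=mode, proportion=False, span_normalise=False
    )
    # Copy the strictly upper triangle into the lower triangle.
    K = K + np.triu(K, 1).transpose()
    return K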


if __name__ == "__main__":
    if len(sys.argv) != 3:
        print(f"usage: {sys.argv[0]} file.trees relatedness.txt")
        sys.exit(1)

    ts_file = sys.argv[1]
    out_file = sys.argv[2]
    ts = tskit.load(ts_file)
    sample_sets = [ind.nodes for ind in ts.individuals()]
    K = genetic_relatedness_matrix(ts, sample_sets, "branch")
    np.savetxt(out_file, K)

$ time python grm.py chr16.dated.trees grm.chr16.txt

real    170m51.057s
user    169m58.389s
sys     0m39.733s

$ pip install egrm
...

$ time trees2egrm --c --output-format numpy chr16.dated.trees
[2022-10-11 08:57:07 - INFO] Beginning importing tree sequence at chr16.dated.trees
[2022-10-11 08:57:07 - INFO] Finished importing tree sequence at chr16.dated.trees
[2022-10-11 08:57:07 - INFO] Constructing genetic map
[2022-10-11 08:57:07 - INFO] Beginning eGRM estimation
100%|██████████████████████████████| 23646/23646 [07:41<00:00, 51.21it/
[2022-10-11 09:04:49 - INFO] Finished eGRM estimation
[2022-10-11 09:04:49 - INFO] Beginning export of eGRM
[2022-10-11 09:04:49 - INFO] Finished export of eGRM
[2022-10-11 09:04:49 - INFO] Finished! :D

real    7m43.445s
user    7m42.186s
sys     0m7.699s

petrelharp · 2022-10-11T13:36:47Z

petrelharp
Oct 11, 2022
Maintainer

Let's see - without looking up the details, I think that:

1. Conceptually, tskit's relatedness is like the "number of shared SNPs", while the eGRM computes the sum over shared SNPs of 1/(p (1-p)), where p is the allele frequency. (Well, they compute the expectation of this value, given the trees, under infinite-sites mutations.) Both are standard normalizations for the GRM; the motivation for upweighting lower-frequency SNPs is that older/more common SNPs are less likely to explain a lot of variance (under stabilizing selection anyhow).
2. The eGRM software estimates the branch length quantity by simulating lots of mutations and throwing them down on the trees, then counting them up. This can be done pretty efficiently. And, over in tskit we perhaps erred in judgement in a memory-speed tradeoff.
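To make the two normalizations in point 1 concrete, here is a minimal NumPy sketch on a made-up haploid genotype matrix. It is only an illustration of "shared SNPs" versus "shared SNPs weighted by 1/(p (1-p))"; the function name `snp_grms` and the toy data are hypothetical, and this is not the exact computation performed by either tskit or egrm.

```python
import numpy as np


def snp_grms(G):
    """Return (unweighted, weighted) centred GRMs summed over sites.

    G is a (num_sites, num_samples) array of 0/1 haploid genotypes.
    """
    G = np.asarray(G, dtype=float)
    p = G.mean(axis=1, keepdims=True)       # derived allele frequency per site
    keep = ((p > 0) & (p < 1)).ravel()      # 1/(p(1-p)) is undefined otherwise
    X = (G - p)[keep]                       # centre each site
    p = p[keep]
    unweighted = X.T @ X                    # every SNP counts equally
    weighted = (X / (p * (1 - p))).T @ X    # rarer SNPs are upweighted
    return unweighted, weighted


# Toy data: 4 sites (rows) by 4 haploid samples (columns).
G = np.array([
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [1, 1, 1, 0],
])
K_plain, K_weighted = snp_grms(G)
```

In this toy case the singleton at the second site contributes 0.5625 to `K_plain[0, 0]` but 3.0 to `K_weighted[0, 0]`, which is the upweighting of low-frequency variants described above.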

6 replies

grahamgower Oct 12, 2022
Collaborator Author

Thanks @petrelharp, that's really useful. For (2), I don't think eGRM is simulating mutations except for when it does a Variance calculation - which I'm not using here. As far as I can tell from the source code, it keeps a running tally of the relatedness matrix as it visits each tree in the sequence, and traverses each node in a tree to add the "branch area" for each branch above the given node, which is added into the matrix cells for the subtended samples. I don't think it even takes advantage of node/branch sharing across consecutive trees, so possibly there's further room for optimisation here.
https://github.com/Ephraim-usc/egrm/blob/d59bb20882db23b5514987f6cdc18e79e2add28d/egrm/egrm.py#L88-L106
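The per-tree, per-node accumulation described above can be sketched roughly as follows. This is not the egrm source: it uses a hypothetical plain-dict tree representation instead of tskit, and it leaves out the frequency-dependent weights and the centring mentioned elsewhere in this thread, so it only tallies raw "branch area" (branch length times tree span) for the samples below each branch.

```python
import numpy as np


def branch_area_matrix(trees, num_samples):
    """Tally unweighted, uncentred branch area shared by each sample pair.

    trees is an iterable of dicts (a made-up format for this sketch):
      "span"   - genomic length covered by the tree
      "parent" - {node: parent}, with the root absent from the keys
      "time"   - {node: node time}
    Samples are assumed to be nodes 0 .. num_samples - 1.
    """
    K = np.zeros((num_samples, num_samples))
    for tree in trees:
        parent, time = tree["parent"], tree["time"]
        # Walk up from every sample to find the samples below each branch.
        below = {}
        for s in range(num_samples):
            u = s
            while u in parent:
                below.setdefault(u, []).append(s)
                u = parent[u]
        # Add each branch's area to the cells of the samples it subtends.
        for u, samples in below.items():
            area = (time[parent[u]] - time[u]) * tree["span"]
            idx = np.array(samples)
            K[np.ix_(idx, idx)] += area
    return K


# Two hypothetical trees over samples 0..3 (internal nodes 4, 5, 6).
trees = [
    {  # ((0,1)4,(2,3)5)6
        "span": 10.0,
        "parent": {0: 4, 1: 4, 2: 5, 3: 5, 4: 6, 5: 6},
        "time": {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 1.0, 5: 1.0, 6: 2.0},
    },
    {  # (((0,1)4,2)5,3)6
        "span": 5.0,
        "parent": {0: 4, 1: 4, 4: 5, 2: 5, 5: 6, 3: 6},
        "time": {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 1.0, 5: 2.0, 6: 3.0},
    },
]
K = branch_area_matrix(trees, num_samples=4)
```

Each tree is processed from scratch here, so nothing is reused between consecutive trees; that is the missed optimisation opportunity mentioned above.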

brieuclehmann Oct 13, 2022
Collaborator

That's right, eGRM traverses through nodes and through trees. I am wondering whether some of the difference is coming from the fact that for each node, they only update the GRM for those samples that are subtended by that node: see https://github.com/Ephraim-usc/egrm/blob/d59bb20882db23b5514987f6cdc18e79e2add28d/egrm/egrm.py#L214. This is accounted for post-hoc by centering the resulting GRM: https://github.com/Ephraim-usc/egrm/blob/d59bb20882db23b5514987f6cdc18e79e2add28d/egrm/egrm.py#L227-L228

This follows from this derivation in their paper:

On the other hand, in some sense, the stats framework keeps track of all the samples under all of the nodes. Presumably this is the cause of the memory issues, and I'm guessing the time differences too.
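As a small numerical check of the "update only the subtended samples, then centre afterwards" idea: assuming the post-hoc step is ordinary double-centring (an assumption here, not something verified against the egrm source), centring the accumulated matrix at the end gives the same result as centring each branch's indicator vector before taking the outer product. The branch indicators and weights below are made up.

```python
import numpy as np

n = 5
# Hypothetical branches: 1 marks the samples subtended by the branch.
branches = [
    np.array([1.0, 1.0, 0.0, 0.0, 0.0]),
    np.array([0.0, 0.0, 1.0, 1.0, 1.0]),
    np.array([1.0, 0.0, 0.0, 0.0, 0.0]),
]
weights = [2.0, 0.5, 3.0]  # arbitrary per-branch weights (e.g. branch area)

# Dense route: centre every indicator first, touching all n samples.
direct = sum(w * np.outer(x - x.mean(), x - x.mean())
             for w, x in zip(weights, branches))

# Sparse route: only the subtended samples get non-zero updates ...
sparse = sum(w * np.outer(x, x) for w, x in zip(weights, branches))
# ... and the centring is applied once at the end.
H = np.eye(n) - np.ones((n, n)) / n
posthoc = H @ sparse @ H

same = np.allclose(direct, posthoc)
```

This works because centring is linear: `H @ np.outer(x, x) @ H` equals the outer product of `H @ x` with itself, and `H @ x` is just `x` minus its mean.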

grahamgower Oct 13, 2022
Collaborator Author

One additional trick that eGRM uses is to calculate the haploid GRM, and once complete this is used to calculate the GRM for diploid individuals. I think this should generalise to arbitrary sample sets too - the result will always be a linear combination of the haploid GRM, right?
https://github.com/Ephraim-usc/egrm/blob/d59bb20882db23b5514987f6cdc18e79e2add28d/bin/trees2egrm#L188-L193
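A minimal sketch of that linear-combination idea, assuming the per-set value is simply the sum of the haploid entries over all pairs of nodes in the two sets (the normalisation actually used by egrm or tskit may differ, for example an average instead of a sum). The function name `collapse_to_sets` and the toy matrix are hypothetical.

```python
import numpy as np


def collapse_to_sets(K_hap, sample_sets):
    """Combine a haploid (per-node) GRM into a per-sample-set GRM.

    A[i, j] is 1 if node j belongs to sample set i, so A @ K_hap @ A.T
    sums K_hap over every pair of nodes drawn from sets i and k.
    """
    A = np.zeros((len(sample_sets), K_hap.shape[0]))
    for i, nodes in enumerate(sample_sets):
        A[i, list(nodes)] = 1.0
    return A @ K_hap @ A.T


# Toy symmetric haploid matrix for 4 nodes, i.e. 2 diploid individuals.
M = np.arange(16, dtype=float).reshape(4, 4)
K_hap = M + M.T
sample_sets = [[0, 1], [2, 3]]  # analogous to [ind.nodes for ind in ts.individuals()]
K_dip = collapse_to_sets(K_hap, sample_sets)
```

Nothing in this construction requires sets of size two, so the same matrix product applies to arbitrary sample sets.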

brieuclehmann Oct 13, 2022
Collaborator

Ah yes good point. We deal with this directly/internally within the call to general_stat, though it’s not obvious to me whether doing the haploid -> diploid transformation afterwards instead would necessarily speed things up.

petrelharp Oct 13, 2022
Maintainer

> On the other hand, in some sense, the stats framework keeps track of all the samples under all of the nodes. Presumably this is the cause of the memory issues, and I'm guessing the time differences too.

This could be a source of time differences, but is not the source of the memory differences.

jeromekelleher · 2022-10-13T14:33:08Z

jeromekelleher
Oct 13, 2022
Maintainer

@grahamgower - I've put together a version that does things in a naive but fast way in this gist. It definitely uses dramatically less memory and should be faster than the current tskit approach. Curious to see how it stacks up in your example - any chance you could try it out please?

6 replies

grahamgower Oct 14, 2022
Collaborator Author

A small wrinkle is that there seem to be multiple roots in some of the trees.
$ python -c "import collections, tskit; ts=tskit.load('chr16.dated.trees'); print(collections.Counter(t.num_roots for t in ts.trees()))"
Counter({1: 23644, 1348: 2})

Oh, I must still be asleep, because I misread this as "1348 trees have 2 roots", when it actually says "2 trees have 1348 roots". These are the first and last trees in the sequence, so should definitely be excluded here (the first tree spans more than 20% of the chromosome!).

jeromekelleher Oct 14, 2022
Maintainer

Hooray! Thanks @grahamgower, great to know we're in the right ballpark with this version.

You can safely skip these trees as they contribute 0 branch length. The script should do `if tree.num_roots == ts.num_samples: continue` or something along those lines.

jeromekelleher Oct 14, 2022
Maintainer

Here's another version @grahamgower, if you'd like to try it out. Probably a bit quicker, but not quite as numba-d so would expect a much bigger difference if it was done in C.

grahamgower Oct 17, 2022
Collaborator Author

Hmm, well that version took 2 hours (~200 MB RAM again). I guess some profiling is needed because I'd expect it to be quicker too.

jeromekelleher Oct 17, 2022
Maintainer

Thanks, good to know. 👍
