
Makes stats cache optional #2135


Conversation

jeromekelleher (Member)

Alternative approach for #1937

jeromekelleher marked this pull request as draft on February 22, 2022, 17:37

codecov bot commented Feb 22, 2022

Codecov Report

Merging #2135 (aad375b) into main (3631396) will not change coverage.
The diff coverage is n/a.


@@           Coverage Diff           @@
##             main    #2135   +/-   ##
=======================================
  Coverage   93.37%   93.37%           
=======================================
  Files          27       27           
  Lines       25591    25591           
  Branches     1163     1163           
=======================================
  Hits        23895    23895           
  Misses       1666     1666           
  Partials       30       30           
Flag            Coverage Δ
c-tests         92.33% <ø> (ø)
lwt-tests       89.05% <ø> (ø)
python-c-tests  72.11% <ø> (ø)
python-tests    98.89% <ø> (ø)

Flags with carried forward coverage won't be shown.


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

jeromekelleher (Member, Author) commented Feb 23, 2022

This is tricky. We definitely want to keep a cache of some sort, because otherwise we evaluate summary(u) many more times (not just twice, I realise: we also evaluate summary(u) twice for each branch we traverse up the tree as we add and remove edges), and the summary_func is already one of the major contributors to run time.

What we could do is have a slightly more intelligent cache where we only store the values of summary(u) for nodes that are currently in the tree, which would reduce the storage a bit (at the cost of making things a bit more complicated).
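A minimal sketch of that node-scoped cache idea, in Python. All names here (NodeScopedCache, summary_func) are illustrative, not the actual tskit API, and the real implementation would live in the C library:

```python
class NodeScopedCache:
    """Caches summary(u) only for nodes currently in the tree.

    An entry is evicted as soon as the last edge touching a node is
    removed, bounding storage by the number of live nodes.
    """

    def __init__(self, summary_func):
        self.summary_func = summary_func
        self.values = {}    # node id -> cached summary(u)
        self.refcount = {}  # node id -> number of edges keeping it live

    def add_edge(self, parent, child):
        for u in (parent, child):
            self.refcount[u] = self.refcount.get(u, 0) + 1

    def remove_edge(self, parent, child):
        for u in (parent, child):
            self.refcount[u] -= 1
            if self.refcount[u] == 0:
                # Node left the tree: drop its cached value.
                del self.refcount[u]
                self.values.pop(u, None)

    def summary(self, u):
        # Evaluate summary_func at most once per node while it is live.
        if u not in self.values:
            self.values[u] = self.summary_func(u)
        return self.values[u]

    def invalidate(self, u):
        # Call when the state below u changes, forcing re-evaluation.
        self.values.pop(u, None)
```

The trade-off mentioned above shows up in remove_edge: eviction keeps storage proportional to the live tree, at the cost of re-evaluating summary(u) if a node later re-enters the tree.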

But then, for a coalescent simulation we only have O(n + t) nodes, so we're only really going to benefit from this for relatively small n and large t, say n = 1000 and t = 10^6. The reality is that we need O(n^3) storage to do this, because we need to cache O(n^2) space for each node currently in the tree, unless we want to evaluate the O(n^2) summary function quite a few more times. Although, I guess the actual evaluation of branch_length * summary[u] is also O(n^2), so maybe there's not much difference really.

Any thoughts @petrelharp?

petrelharp (Contributor)

Here are some thoughts:

  • The caching is saving us only a constant multiple on the number of summary_func evaluations since, as you say, it's updated twice for each added/removed edge; but we're going to have to evaluate summary_func again for all such updated nodes anyway. However, this constant might be substantial.
  • So, we could avoid caching entirely if we were instead clever about which nodes needed to be updated, and in which order. For instance, if we kept track of which nodes needed their contributions updated and then evaluated summary_func for those just once after edges had been added and removed, then we'd have nearly the minimal number of required summary_func calls (we'd have one extra call per node: for instance, we'd call summary_func(u) when u is added and also when it is removed).
  • I don't know how calling summary_func compares to looking up the pre-computed value? The latter presumably costs something, especially when the table of such things gets big?
  • An alternative might be something along the lines of this algorithm: I think it'd be possible to compute the n^2 covariance matrix by caching at each node a vector of length n (something like the covariance of that node with all samples over some set of SNPs) and then propagating these length-n vectors downwards (as described in that link) until they get to the samples; the length-n vector at each sample would then be a row of the matrix. I'm unclear right now how this would compare to what we're talking about, or if it would generalize beyond covariance.
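The second bullet above (track which nodes were touched while edges are added and removed, then call summary_func once per touched node after the transition is complete) could be sketched like this. All names are illustrative and the tree is represented as a plain child-to-parent dict, not the tskit data structures:

```python
def apply_edge_diffs(parent, edges_out, edges_in, summary_func, contribution):
    """Refresh per-node contributions after one tree transition.

    parent: dict mapping child -> parent (mutated in place).
    edges_out / edges_in: lists of (parent, child) edges removed / added.
    contribution: dict mapping node -> cached summary value to refresh.
    Returns the set of nodes whose contribution was recomputed.
    """
    dirty = set()
    # Remove outgoing edges; every ancestor of a removed edge has a
    # changed subtree, so mark the whole path to the root as dirty.
    for p, c in edges_out:
        del parent[c]
        u = p
        while u is not None:
            dirty.add(u)
            u = parent.get(u)
    # Add incoming edges, again marking the path to the root.
    for p, c in edges_in:
        parent[c] = p
        u = p
        while u is not None:
            dirty.add(u)
            u = parent.get(u)
    # One summary_func call per affected node, after all edits are done.
    for u in dirty:
        contribution[u] = summary_func(u)
    return dirty
```

A node on the overlapping part of an out-path and an in-path lands in the dirty set once, so summary_func runs once for it per transition rather than once per edge event, which is the saving the bullet describes.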
