
Makes stats cache optional #2135


Conversation

jeromekelleher (Member)

Alternative approach for #1937

jeromekelleher marked this pull request as draft on February 22, 2022, 17:37

codecov bot commented Feb 22, 2022

Codecov Report

Merging #2135 (aad375b) into main (3631396) will not change coverage.
The diff coverage is n/a.


@@           Coverage Diff           @@
##             main    #2135   +/-   ##
=======================================
  Coverage   93.37%   93.37%           
=======================================
  Files          27       27           
  Lines       25591    25591           
  Branches     1163     1163           
=======================================
  Hits        23895    23895           
  Misses       1666     1666           
  Partials       30       30           
Flag            Coverage Δ
c-tests         92.33% <ø> (ø)
lwt-tests       89.05% <ø> (ø)
python-c-tests  72.11% <ø> (ø)
python-tests    98.89% <ø> (ø)

Flags with carried forward coverage won't be shown.


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

jeromekelleher (Member, Author) commented Feb 23, 2022

This is tricky. We definitely want to keep a cache of some sort, because otherwise we evaluate summary(u) many more times (not just twice, I realise: we also evaluate summary(u) twice for each branch we traverse up the tree as we add and remove edges), and the summary_func is already one of the major contributors to run time.

What we could do is have a slightly more intelligent cache where we only store the values of summary(u) for nodes that are currently in the tree, which would reduce the storage a bit (at the cost of making things a bit more complicated).
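A minimal sketch of that node-scoped cache idea, in Python. All names here (NodeScopedCache, summary_func) are illustrative, not the actual tskit API, and the real implementation would live in the C library:

```python
class NodeScopedCache:
    """Caches summary(u) only for nodes currently in the tree.

    An entry is evicted as soon as the last edge touching a node is
    removed, bounding storage by the number of live nodes.
    """

    def __init__(self, summary_func):
        self.summary_func = summary_func
        self.values = {}    # node id -> cached summary(u)
        self.refcount = {}  # node id -> number of edges keeping it live

    def add_edge(self, parent, child):
        for u in (parent, child):
            self.refcount[u] = self.refcount.get(u, 0) + 1

    def remove_edge(self, parent, child):
        for u in (parent, child):
            self.refcount[u] -= 1
            if self.refcount[u] == 0:
                # Node left the tree: drop its cached value.
                del self.refcount[u]
                self.values.pop(u, None)

    def summary(self, u):
        # Evaluate summary_func at most once per node while it is live.
        if u not in self.values:
            self.values[u] = self.summary_func(u)
        return self.values[u]

    def invalidate(self, u):
        # Call when the state below u changes, forcing re-evaluation.
        self.values.pop(u, None)
```

The trade-off mentioned above shows up in remove_edge: eviction keeps storage proportional to the live tree, at the cost of re-evaluating summary(u) if a node later re-enters the tree.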

But then, for a coalescent simulation we only have O(n + t) nodes, so we're only really going to benefit from this for relatively small n and large t, say n = 1000 and t = 10^6. The reality is that we need O(n^3) storage to do this, because we need to cache O(n^2) space for each node currently in the tree, unless we want to evaluate the O(n^2) summary function quite a few more times. Although, I guess the actual evaluation of branch_length * summary[u] is also O(n^2), so maybe there's not much difference really.

Any thoughts @petrelharp?

petrelharp (Contributor)

Here are some thoughts:

  • The caching is saving us only a constant multiple on the number of summary_func evaluations since, as you say, it's updated twice for each added/removed edge; but we're going to have to evaluate summary_func again for all such updated nodes anyway. However, this constant might be substantial.
  • So, we could avoid caching entirely if we were instead clever about which nodes needed to be updated, and in which order. For instance, if we kept track of which nodes needed their contributions updated and then evaluated summary_func for those just once after edges had been added and removed, then we'd have nearly the minimal number of required summary_func calls (we'd have one extra call per node: for instance, we'd call summary_func(u) when u is added and also when it is removed).
  • I don't know how calling summary_func compares to looking up the pre-computed value? The latter presumably costs something, especially when the table of such things gets big?
  • An alternative might be something along the lines of this algorithm: I think it'd be possible to compute the n^2 covariance matrix by caching at each node a vector of length n (something like the covariance of that node with all samples over some set of SNPs) and then propagating these length-n vectors downwards (as described in that link) until they get to the samples; the length-n vector at each sample would then be a row of the matrix. I'm unclear right now how this would compare to what we're talking about, or if it would generalize beyond covariance.
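The second bullet above (track which nodes were touched while edges are added and removed, then call summary_func once per touched node after the transition is complete) could be sketched like this. All names are illustrative and the tree is represented as a plain child-to-parent dict, not the tskit data structures:

```python
def apply_edge_diffs(parent, edges_out, edges_in, summary_func, contribution):
    """Refresh per-node contributions after one tree transition.

    parent: dict mapping child -> parent (mutated in place).
    edges_out / edges_in: lists of (parent, child) edges removed / added.
    contribution: dict mapping node -> cached summary value to refresh.
    Returns the set of nodes whose contribution was recomputed.
    """
    dirty = set()
    # Remove outgoing edges; every ancestor of a removed edge has a
    # changed subtree, so mark the whole path to the root as dirty.
    for p, c in edges_out:
        del parent[c]
        u = p
        while u is not None:
            dirty.add(u)
            u = parent.get(u)
    # Add incoming edges, again marking the path to the root.
    for p, c in edges_in:
        parent[c] = p
        u = p
        while u is not None:
            dirty.add(u)
            u = parent.get(u)
    # One summary_func call per affected node, after all edits are done.
    for u in dirty:
        contribution[u] = summary_func(u)
    return dirty
```

A node on the overlapping part of an out-path and an in-path lands in the dirty set once, so summary_func runs once for it per transition rather than once per edge event, which is the saving the bullet describes.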
