Cohort grouping for popgen #260

Conversation
Hm, I haven't tested this, but it's what came to mind first as a possibility, borrowing some tricks from regenie and Tim's PR:

```python
# n = samples, c = cohorts, k = alleles
@numba.guvectorize(
    ["void(int8[:, :], int32[:], int32[:], int32[:, :])"],
    "(n, k),(n),(c)->(c, k)",
)
def count_cohort_alleles(ac, cohorts, _, out):
    out[:, :] = 0  # This is (cohorts, alleles)
    for i in range(ac.shape[0]):
        out[cohorts[i]] += ac[i]

ds = count_call_alleles(ds)
AC, SC = ds.call_allele_count, ds.sample_cohort
C = da.empty(n_cohorts, dtype=np.uint8)

# Call directly if there is no chunking in core dimensions
count_cohort_alleles(AC.chunk((None, -1, -1)), SC.chunk((-1,)), C)
# shape = (variants, cohorts, alleles)

# OR run on individual blocks for better scalability
AC = da.map_blocks(count_cohort_alleles, AC, SC, C, chunks=(n_cohorts, n_alleles), ...)
# shape = (variants, cohorts * AC.numblocks[0], alleles * AC.numblocks[1])

# Stack the blocks and sum across them
# (which will only work because each chunk is guaranteed to have the same size)
AC = da.stack([AC.blocks[:, i] for i in range(AC.numblocks[1])]).sum(axis=0)
# shape = (variants, cohorts, alleles)
```

I would guess that this is a good bit better than any pure dask/xarray method if it works.
That's great @eric-czech - thank you. I've incorporated the vectorized function in the latest update.
In today's call we discussed what should happen for the n > 2 case. One option would be to have a new variable for each combination (named like …).
Thanks @jeromekelleher. I have updated this PR with approach 2 from your suggestion. Fst returns a square matrix, where the entries above the diagonal are the Fst values for cohorts i and j. Using xarray coordinates, we can refer to the values using cohort names. In this example the names are:

```python
>>> cohort_names = ["co_0", "co_1"]
>>> ds = ds.assign_coords({"cohorts_a": cohort_names, "cohorts_b": cohort_names})
>>> ds = Fst(ds)  # compute Fst and return a new dataset
>>> ds["stat_Fst"]
<xarray.DataArray 'stat_Fst' (cohorts_a: 2, cohorts_b: 2)>
array([[       nan, 0.02538071],
       [       nan,        nan]])
Coordinates:
  * cohorts_a  (cohorts_a) object 'co_0' 'co_1'
  * cohorts_b  (cohorts_b) object 'co_0' 'co_1'
```

You can use the cohort names to get a particular Fst value:

```python
>>> ds["stat_Fst"].sel(cohorts_a="co_0", cohorts_b="co_1").values
0.02538071065989833
```

The same indexing works for the other pairwise statistics, such as divergence.
```python
assert_array_shape(AC, n_variants, n_cohorts, n_alleles)

new_ds = Dataset({"cohort_allele_count": (("variants", "cohorts", "alleles"), AC)})
return conditional_merge_datasets(ds, new_ds, merge)
```
@tomwhite A quick question: I am trying to understand what this function (`count_cohort_alleles`) is trying to do mathematically. Is there a reference implementation or formula I can look at to understand this better? Or is it just an attempt to do a better rewrite of an old implementation?
@aktech Good question (and this function definitely needs documentation!). It's needed because `count_call_alleles` and `count_variant_alleles` return allele counts for each sample and for all samples (respectively), whereas in this PR we need something between the two, since we now have the concept of a "cohort" (a set of samples): so we need per-cohort counts.

The counts array computed here has shape `(n_variants, n_cohorts, n_alleles)`, so we have a count for each variant/cohort/allele combination. Have a look at the tests (e.g. `test_count_cohort_alleles__multi_variant_multi_sample`), and perhaps try calling the allele counting functions in a Python REPL to get a feel for what they are doing.
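For intuition, here is a hypothetical numpy-only sketch of what per-cohort counting computes (not the sgkit implementation; the array values are made up):

```python
import numpy as np

# Hypothetical example: 2 variants, 4 samples, 2 alleles.
# call_allele_count[v, s, k] = count of allele k in sample s at variant v.
call_allele_count = np.array(
    [[[2, 0], [1, 1], [1, 1], [0, 2]],
     [[1, 1], [2, 0], [0, 2], [0, 2]]]
)
sample_cohort = np.array([0, 0, 1, 1])  # first two samples in cohort 0

n_cohorts = 2
cohort_allele_count = np.zeros((2, n_cohorts, 2), dtype=int)
for c in range(n_cohorts):
    # Sum allele counts over the samples belonging to cohort c
    cohort_allele_count[:, c, :] = call_allele_count[:, sample_cohort == c, :].sum(axis=1)

print(cohort_allele_count.shape)  # (n_variants, n_cohorts, n_alleles)
```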
This is very beautiful @tomwhite, I don't really have any substantive input. My only question really is: should we fill the lower diagonal of the output matrix also?
Thanks @jeromekelleher. I agree on the point about filling the lower diagonal - the only reason I haven't is because I'm not sure whether xarray/dask would do the computations twice.
What about using numpy tricks like these? That should avoid redoing the computations. I guess as a general pattern we should just run the calculations again though - for some functions, …
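A minimal sketch of one such numpy trick (an assumption on my part, since the tricks linked above aren't quoted here): mirror the computed upper triangle into the lower triangle instead of recomputing it:

```python
import numpy as np

fst = np.array([[np.nan, 0.02538071],
                [np.nan, np.nan]])

# Copy the upper triangle into the lower triangle in place
i, j = np.tril_indices_from(fst, k=-1)
fst[i, j] = fst[j, i]
print(fst)
# [[       nan 0.02538071]
#  [0.02538071        nan]]
```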
This PR has conflicts, @tomwhite please rebase and push updated version 🙏
This latest update uses a generalized ufunc for computing divergence, which should be more efficient than the previous non-vectorized implementation.

I think this is ready for review - changing the version of Fst (to hudson) can be done separately, as can being able to specify subsets of cohorts. We might want to discuss the naming convention for cohort dimensions.
Looks great!
```python
n_alleles = ds.dims["alleles"]

ds = count_call_alleles(ds)
AC, SC = da.asarray(ds.call_allele_count), da.asarray(ds.sample_cohort)
```
Should we standardize on using `ds.call_allele_count` vs `ds['call_allele_count']`? I've been trying to use the former in examples only and stick to the latter in code, but not for great reasons other than consistency and hedging against potentially having variable names that aren't legal attributes or conflict with something in the xarray namespace. Presumably not using attributes becomes easier to standardize on with https://github.com/pystatgen/sgkit/pull/276.
I'm not sure. It's certainly shorter to say `ds.call_allele_count`, but this form can't be used in assignments (`ds.call_allele_count = ...`).

I've changed the accessors in this PR, but we should probably discuss our code standards a bit more, and what we recommend for user code.
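A minimal sketch of the distinction being discussed (my own illustration, not from the PR): attribute access works for reading, but xarray requires item-style assignment for writing:

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {"call_allele_count": (("variants", "samples"), np.zeros((2, 3), dtype=int))}
)

ac = ds.call_allele_count      # attribute access works for reading
ac = ds["call_allele_count"]   # item access also works for reading

ds["call_allele_count"] = ac + 1  # item-style assignment updates the variable
# ds.call_allele_count = ac + 1   # AttributeError: xarray requires item-style assignment
```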
```
Returns
-------
Dataset containing variable `cohort_allele_count` of allele counts with
shape (variants, cohorts, alleles) and values corresponding to
the number of non-missing occurrences of each allele.
```
I'm wondering if this shouldn't also return a (variants, cohorts) array that contains the number of present samples used in each aggregation. It doesn't do it now, but if `count_call_alleles` returned -1 for all alleles when any (or potentially all) input calls were missing, then I can see a count of the samples with non-negative allele counts being a useful downstream denominator.

Is that not important with the stats functions in `popgen.py` now? It may make sense to put a warning on those functions and log an issue related to supporting it. It seems like ignoring the missing values only during the original allele count must introduce some kind of bias through all those aggregations.
Good point. I haven't thought about missingness with respect to the meaning of the functions though. Opened #290
sgkit/stats/popgen.py (outdated):

```python
# c = cohorts
@guvectorize(  # type: ignore
    ["void(float64[:], float64[:,:])"],
```
I think float16 and float32 overloads would make sense here. I haven't tested it, but it looks like it should be possible to get a float32 result out of `diversity` by feeding it allele counts that are float32 (or float16). That seems like a good way for users to control floating point precision, so I think we should try to propagate those types like numpy does where we can, especially since `np.sum` does that and this function is similar.
Added a float32 overload. Numba doesn't have float16 as far as I can see: https://numba.pydata.org/numba-doc/dev/reference/types.html#numbers
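For reference, a minimal sketch of how multiple dtype overloads look on a guvectorize function, so the result dtype follows the input dtype (illustrative names only, not the PR's code):

```python
import numpy as np
from numba import guvectorize

@guvectorize(
    ["void(float32[:], float32[:])",
     "void(float64[:], float64[:])"],
    "(n)->()",
    nopython=True,
)
def total(x, out):
    # Sum the core dimension; numba picks the overload matching the input dtype
    out[0] = 0.0
    for i in range(x.shape[0]):
        out[0] += x[i]

print(total(np.arange(4, dtype=np.float32)).dtype)  # float32
print(total(np.arange(4, dtype=np.float64)).dtype)  # float64
```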
Ah right, ty
sgkit/tests/test_popgen.py (outdated):

```
@@ -37,39 +40,70 @@ def ts_to_dataset(ts, samples=None):
def test_diversity(size):
```
Given that `map_blocks` is underlying these stats, I think chunking on the variants and samples dimensions would be a useful parameter for these tests. I'm imagining something like:

```python
@pytest.mark.parametrize("size", [2, 3, 10, 100])
@pytest.mark.parametrize("chunks", [(-1, -1), (10, 10)])
def test_diversity(size, chunks):
    ts = msprime.simulate(size, length=100, mutation_rate=0.05, random_seed=42)
    ds = ts_to_dataset(ts).chunk(dict(zip(["samples", "variants"], chunks)))
```
Good call. Adding this test uncovered two bugs: one to do with the wrong `numblocks` index being used, and another where the `guvectorize` function fails if the input is chunked along the samples dimension. I haven't been able to figure out how to fix the latter yet, so I'm going to leave it for the moment. (Popgen doesn't need to chunk along the samples dimension just yet.)
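One practical workaround (my suggestion, not from the PR) is to rechunk to a single chunk along samples before calling the stats functions:

```python
# Assuming an xarray Dataset `ds` chunked as {"variants": 10, "samples": 10}:
ds = ds.chunk({"samples": -1})  # one chunk spanning all samples
```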
I've added warnings to the documentation about not supporting chunking in the samples dimension.
Very nice @tomwhite! Those will be great examples to follow.
I had a few suggestions/questions so let me know when you'd like me to take a look again.
Commits in this update:

- Use guvectorize'd implementation of count_cohort_alleles suggested by Eric Czech
- Fix test_Tajimas_D
- Avoid divide by zero error
- Remove check for number of samples, since it is now handled by where clause
- Return dataset for diversity
- Return dataset for divergence
- Return dataset for Fst
- Return dataset for Tajimas_D
- Cohort coordinates
- Use xarray's combine_nested to build grid of cohort combinations
- Use conditional_merge_datasets
- Fix Fst implementation for n_cohorts > 2, and parameterize tests for n_cohorts; return values in upper and lower diagonal array
- Vectorized divergence
- Vectorized diversity pairs
- Add doc for count_cohort_alleles
Thanks for the review @eric-czech. I have addressed your comments (and left some until later), so it should be ready for another look.

Looks like we're ready to merge?
This is an initial attempt at #240, which I'm posting as a draft for discussion.

The main idea (following @eric-czech's https://github.com/pystatgen/sgkit/issues/224#issuecomment-685061129) is to pass a single dataset to the `diversity`, `divergence`, and `Fst` functions that includes a `sample_cohort` variable indicating which cohort each sample belongs to (using an index: 0 = first cohort, 1 = second cohort, etc.). Then there is a new `count_cohort_alleles` function that counts the number of each allele for each variant and cohort, and which is used by `diversity` etc. to perform computations within/between cohorts.

A few questions for discussion:

- Is there a better way to implement `count_cohort_alleles`? I tried one using `groupby`, and one that should perform better using Dask's `map_blocks` (but that could also be much improved).
- How does `sample_cohort` get specified? The tests just set it manually at the moment (see the sketch below).

Also, there's still more to do to get some of the tests passing, and to convert `Tajimas_D` to use the same approach.
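For the second question, a hypothetical sketch of the manual assignment the tests use (the variable name `sample_cohort` is from this PR; the toy dataset is assumed):

```python
import numpy as np
import xarray as xr

# A toy dataset with 6 samples (stand-in for a real sgkit dataset)
ds = xr.Dataset(coords={"samples": np.arange(6)})

# Assign the first three samples to cohort 0 and the rest to cohort 1
ds["sample_cohort"] = xr.DataArray(np.array([0, 0, 0, 1, 1, 1]), dims="samples")
```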