Cohort grouping for popgen #260

Conversation
Hm, I haven't tested this, but it's what came to mind first as a possibility, borrowing some tricks from regenie and Tim's PR:

```python
# n = samples, c = cohorts, k = alleles
@numba.guvectorize(
    ["void(int8[:, :], int32[:], int32[:], int32[:, :])"],
    "(n, k),(n),(c)->(c, k)",
)
def count_cohort_alleles(ac, cohorts, _, out):
    out[:, :] = 0  # This is (cohorts, alleles)
    for i in range(ac.shape[0]):
        out[cohorts[i]] += ac[i]

ds = count_call_alleles(ds)
AC, SC = ds.call_allele_count, ds.sample_cohort
C = da.empty(n_cohorts, dtype=np.uint8)

# Call directly if there is no chunking in core dimensions
count_cohort_alleles(AC.chunk((None, -1, -1)), SC.chunk((-1,)), C)
# shape = (variants, cohorts, alleles)

# OR run on individual blocks for better scalability
AC = da.map_blocks(count_cohort_alleles, AC, SC, C, chunks=(n_cohorts, n_alleles), ...)
# shape = (variants, cohorts * AC.numblocks[0], alleles * AC.numblocks[1])

# Stack the blocks and sum across them
# (which will only work because each chunk is guaranteed to have the same size)
AC = da.stack([AC.blocks[:, i] for i in range(AC.numblocks[1])]).sum(axis=0)
# shape = (variants, cohorts, alleles)
```

I would guess that this is a good bit better than any pure dask/xarray method if it works.
That's great @eric-czech - thank you. I've incorporated the vectorized function in the latest update.
In today's call we discussed what should happen for the n > 2 case. One option would be to have a new variable for each combination (named like …).
Thanks @jeromekelleher. I have updated this PR with approach 2 from your suggestion. Fst returns a square matrix, where the entries above the diagonal are the Fst values for cohorts i and j. Using xarray coordinates, we can refer to the values using cohort names. In this example the names are:

```python
>>> cohort_names = ["co_0", "co_1"]
>>> ds = ds.assign_coords({"cohorts_a": cohort_names, "cohorts_b": cohort_names})
>>> ds = Fst(ds)  # compute Fst and return a new dataset
>>> ds["stat_Fst"]
<xarray.DataArray 'stat_Fst' (cohorts_a: 2, cohorts_b: 2)>
array([[       nan, 0.02538071],
       [       nan,        nan]])
Coordinates:
  * cohorts_a  (cohorts_a) object 'co_0' 'co_1'
  * cohorts_b  (cohorts_b) object 'co_0' 'co_1'
```

You can use the cohort names to get a particular Fst value:

```python
>>> ds["stat_Fst"].sel(cohorts_a="co_0", cohorts_b="co_1").values
0.02538071065989833
```

The same indexing works for the other pairwise statistics, such as divergence.
```python
assert_array_shape(AC, n_variants, n_cohorts, n_alleles)

new_ds = Dataset({"cohort_allele_count": (("variants", "cohorts", "alleles"), AC)})
return conditional_merge_datasets(ds, new_ds, merge)
```
@tomwhite A quick question: I am trying to understand what this function (`count_cohort_alleles`) is trying to do mathematically. Is there a reference implementation or formula I can look at to understand this better? Or is it just an attempt to do a better rewrite of an old implementation?
@aktech Good question (and this function definitely needs documentation!). It's needed because `count_call_alleles` and `count_variant_alleles` return allele counts for each sample and for all samples (respectively), whereas in this PR we need something between the two, since we now have the concept of a "cohort" (a set of samples): so we need per-cohort counts.

The counts array computed here has shape `(n_variants, n_cohorts, n_alleles)`, so we have a count for each variant/cohort/allele combination. Have a look at the tests (e.g. `test_count_cohort_alleles__multi_variant_multi_sample`), and perhaps try calling the allele counting functions in a Python REPL to get a feel for what they are doing.
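For intuition, here is a hypothetical numpy-only sketch of what per-cohort counting computes (not the sgkit implementation; the array values are made up):

```python
import numpy as np

# Hypothetical example: 2 variants, 4 samples, 2 alleles.
# call_allele_count[v, s, k] = count of allele k in sample s at variant v.
call_allele_count = np.array(
    [[[2, 0], [1, 1], [1, 1], [0, 2]],
     [[1, 1], [2, 0], [0, 2], [0, 2]]]
)
sample_cohort = np.array([0, 0, 1, 1])  # first two samples in cohort 0

n_cohorts = 2
cohort_allele_count = np.zeros((2, n_cohorts, 2), dtype=int)
for c in range(n_cohorts):
    # Sum allele counts over the samples belonging to cohort c
    cohort_allele_count[:, c, :] = call_allele_count[:, sample_cohort == c, :].sum(axis=1)

print(cohort_allele_count.shape)  # (n_variants, n_cohorts, n_alleles)
```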
This is very beautiful @tomwhite, I don't really have any substantive input. My only question really is: should we fill the lower diagonal of the output matrix also?
Thanks @jeromekelleher. I agree on the point about filling the lower diagonal - the only reason I haven't is because I'm not sure whether xarray/dask would do the computations twice.
What about using numpy tricks like these? That should avoid redoing the computations. I guess as a general pattern we should just run the calculations again though - for some functions, …
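A minimal sketch of one such numpy trick (an assumption on my part, since the tricks linked above aren't quoted here): mirror the computed upper triangle into the lower triangle instead of recomputing it:

```python
import numpy as np

fst = np.array([[np.nan, 0.02538071],
                [np.nan, np.nan]])

# Copy the upper triangle into the lower triangle in place
i, j = np.tril_indices_from(fst, k=-1)
fst[i, j] = fst[j, i]
print(fst)
# [[       nan 0.02538071]
#  [0.02538071        nan]]
```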
This PR has conflicts, @tomwhite please rebase and push updated version 🙏
This latest update uses a generalized ufunc for computing divergence, which should be more efficient than the previous non-vectorized implementation.

I think this is ready for review - changing the version of Fst (to hudson) can be done separately, as can being able to specify subsets of cohorts. We might want to discuss the naming convention for cohort dimensions.
Looks great!
```python
n_alleles = ds.dims["alleles"]

ds = count_call_alleles(ds)
AC, SC = da.asarray(ds.call_allele_count), da.asarray(ds.sample_cohort)
```
Should we standardize on using `ds.call_allele_count` vs `ds['call_allele_count']`? I've been trying to use the former in examples only and stick to the latter in code, but not for great reasons other than consistency and hedging against potentially having variable names that aren't legal attributes or conflict with something in the xarray namespace. Presumably not using attributes becomes easier to standardize on with https://github.com/pystatgen/sgkit/pull/276.
I'm not sure. It's certainly shorter to say `ds.call_allele_count`, but this form can't be used in assignments (`ds.call_allele_count = ...`).

I've changed the accessors in this PR, but we should probably discuss our code standards a bit more, and what we recommend for user code.
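A minimal sketch of the distinction being discussed (my own illustration, not from the PR): attribute access works for reading, but xarray requires item-style assignment for writing:

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {"call_allele_count": (("variants", "samples"), np.zeros((2, 3), dtype=int))}
)

ac = ds.call_allele_count      # attribute access works for reading
ac = ds["call_allele_count"]   # item access also works for reading

ds["call_allele_count"] = ac + 1  # item-style assignment updates the variable
# ds.call_allele_count = ac + 1   # AttributeError: xarray requires item-style assignment
```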
```
Returns
-------
Dataset containing variable `cohort_allele_count` of allele counts with
shape (variants, cohorts, alleles) and values corresponding to
the number of non-missing occurrences of each allele.
```
I'm wondering if this shouldn't also return a (variants, cohorts) array that contains the number of present samples used in each aggregation. It doesn't do it now, but if `count_call_alleles` returned -1 for all alleles when any (or potentially all) input calls were missing, then I can see a count of the samples with non-negative allele counts being a useful downstream denominator.

Is that not important with the stats functions in `popgen.py` now? It may make sense to put a warning on those functions and log an issue related to supporting it. It seems like ignoring the missing values only during the original allele count must introduce some kind of bias through all those aggregations.
Good point. I haven't thought about missingness with respect to the meaning of the functions though. Opened #290
sgkit/stats/popgen.py (outdated):

```python
# c = cohorts
@guvectorize(  # type: ignore
    ["void(float64[:], float64[:,:])"],
```
I think float16 and float32 overloads would make sense here. I haven't tested it, but it looks like it should be possible to get a float32 result out of `diversity` by feeding it allele counts that are float32 (or float16). That seems like a good way for users to control floating point precision, so I think we should try to propagate those types like numpy does where we can, especially since `np.sum` does that and this function is similar.
Added a float32 overload. Numba doesn't have float16 as far as I can see: https://numba.pydata.org/numba-doc/dev/reference/types.html#numbers
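For reference, a minimal sketch of how multiple dtype overloads look on a guvectorize function, so the result dtype follows the input dtype (illustrative names only, not the PR's code):

```python
import numpy as np
from numba import guvectorize

@guvectorize(
    ["void(float32[:], float32[:])",
     "void(float64[:], float64[:])"],
    "(n)->()",
    nopython=True,
)
def total(x, out):
    # Sum the core dimension; numba picks the overload matching the input dtype
    out[0] = 0.0
    for i in range(x.shape[0]):
        out[0] += x[i]

print(total(np.arange(4, dtype=np.float32)).dtype)  # float32
print(total(np.arange(4, dtype=np.float64)).dtype)  # float64
```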
Ah right, ty
sgkit/tests/test_popgen.py (outdated):

```
@@ -37,39 +40,70 @@ def ts_to_dataset(ts, samples=None):
def test_diversity(size):
```
Given that `map_blocks` is underlying these stats, I think chunking on the variants and samples dimensions would be a useful parameter for these tests. I'm imagining something like:

```python
@pytest.mark.parametrize("size", [2, 3, 10, 100])
@pytest.mark.parametrize("chunks", [(-1, -1), (10, 10)])
def test_diversity(size, chunks):
    ts = msprime.simulate(size, length=100, mutation_rate=0.05, random_seed=42)
    ds = ts_to_dataset(ts).chunk(dict(zip(["samples", "variants"], chunks)))
```
Good call. Adding this test uncovered two bugs: one to do with the wrong `numblocks` index being used, and another where the `guvectorize` function fails if the input is chunked along the samples dimension. I haven't been able to figure out how to fix the latter yet, so I'm going to leave it for the moment. (Popgen doesn't need to chunk along the samples dimension just yet.)
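One practical workaround (my suggestion, not from the PR) is to rechunk to a single chunk along samples before calling the stats functions:

```python
# Assuming an xarray Dataset `ds` chunked as {"variants": 10, "samples": 10}:
ds = ds.chunk({"samples": -1})  # one chunk spanning all samples
```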
I've added warnings to the documentation about not supporting chunking in the samples dimension.
Very nice @tomwhite! Those will be great examples to follow.
I had a few suggestions/questions so let me know when you'd like me to take a look again.
Commits in this update:

- Use guvectorize'd implementation of count_cohort_alleles suggested by Eric Czech
- Fix test_Tajimas_D
- Avoid divide by zero error
- Remove check for number of samples, since it is now handled by where clause
- Return dataset for diversity
- Return dataset for divergence
- Return dataset for Fst
- Return dataset for Tajimas_D
- Cohort coordinates
- Use xarray's combine_nested to build grid of cohort combinations
- Use conditional_merge_datasets
- Fix Fst implementation for n_cohorts > 2, and parameterize tests for n_cohorts; return values in upper and lower diagonal array
- Vectorized divergence
- Vectorized diversity pairs
- Add doc for count_cohort_alleles
Thanks for the review @eric-czech. I have addressed your comments (and left some until later), so it should be ready for another look.

Looks like we're ready to merge?
This is an initial attempt at #240, which I'm posting as a draft for discussion.

The main idea (following @eric-czech's https://github.com/pystatgen/sgkit/issues/224#issuecomment-685061129) is to pass a single dataset to the `diversity`, `divergence`, and `Fst` functions that includes a `sample_cohort` variable indicating which cohort each sample belongs to (using an index: 0 = first cohort, 1 = second cohort, etc.). Then there is a new `count_cohort_alleles` function that counts the number of each allele for each variant and cohort, and which is used by `diversity` etc. to perform computations within/between cohorts.

A few questions for discussion:

- Is there a better way to implement `count_cohort_alleles`? I tried one using `groupby`, and one that should perform better using Dask's `map_blocks` (but that could also be much improved).
- How does `sample_cohort` get specified? The tests just set it manually at the moment (see the sketch below).

Also, there's still more to do to get some of the tests passing, and to convert `Tajimas_D` to use the same approach.
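For the second question, a hypothetical sketch of the manual assignment the tests use (the variable name `sample_cohort` is from this PR; the toy dataset is assumed):

```python
import numpy as np
import xarray as xr

# A toy dataset with 6 samples (stand-in for a real sgkit dataset)
ds = xr.Dataset(coords={"samples": np.arange(6)})

# Assign the first three samples to cohort 0 and the rest to cohort 1
ds["sample_cohort"] = xr.DataArray(np.array([0, 0, 0, 1, 1, 1]), dims="samples")
```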