Skip to content

Add count_call_alternate_alleles function #282

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
6 tasks
eric-czech opened this issue Sep 25, 2020 · 3 comments
Open
6 tasks

Add count_call_alternate_alleles function #282

eric-czech opened this issue Sep 25, 2020 · 3 comments
Labels
core operations Issues related to domain-specific functionality such as LD pruning, PCA, association testing, etc.

Comments

@eric-czech
Copy link
Collaborator

eric-czech commented Sep 25, 2020

I'm not so sure about that name but it seems like the most obvious choice given how our other counting functions are named.

This should do what https://scikit-allel.readthedocs.io/en/stable/model/ndarray.html#allel.Genotypes.to_n_alt does.

Likely tasks:

  • Ensure that count_call_alleles has been run
  • Set dtype so counts aren't int64
  • Use the above to sum non-reference allele counts (e.g. ds.call_allele_count[:, 1:].sum(dim='alleles'))
    • For handling missing data, it may make sense to add another numba gufunc or have an option that tells call_allele_count to count missing alleles too. That could get tricky with https://github.com/pystatgen/sgkit/issues/243 though, so it is probably even better if the function relies on the missingness mask instead (= slightly less efficient but more readable code).
  • Document the fact that we always assume the first allele is the reference allele in this function
  • Decide what to do with missing data, my preference is:
    • Default to the behavior of allel.Genotypes.to_n_alt(fill=-1), meaning that partially or completely missed calls result in a -1 count
    • Not even have an option to fill missing values with 0 -- I'm not sure why that's the default in scikit-allel. @alimanfoo is there an important application for that?
  • Change reference in PCA https://github.com/pystatgen/sgkit/pull/262
@eric-czech eric-czech added the core operations Issues related to domain-specific functionality such as LD pruning, PCA, association testing, etc. label Sep 25, 2020
@eric-czech eric-czech mentioned this issue Sep 29, 2020
18 tasks
@timothymillar
Copy link
Collaborator

@eric-czech I think this could be calculated directly from genotype-calls similar to scikit-allel (source) this should also work with #243.
There is a bit of discussion about identifying partial/missing genotypes in this issue #223.

Following the scikit-allel code an implementation could look something like:

alternate_counts = (ds['genotype_calls'] > 0).sum(axis=-1)
partial = patial_genotype_calls(ds['genotype_calls'])
alternate_counts = xr.where(partial, fill, alternate_counts)

The function patial_genotype_calls (or a better name) should be aware of mixed-ploidy genotypes and only return True if a genotype contained a unknown allele -1.

@eric-czech
Copy link
Collaborator Author

So I'm following, you're saying patial_genotype_calls = lambda ds: (ds.call_genotype == -1).any(dim='ploidy') right (ignoring non-alleles)?

I would have no need for it, but would you ever use a switch that allows for partial calls to be included in the alternate counts? In other words, it would only return the fill value if all calls were missing rather than any one of them.

@timothymillar
Copy link
Collaborator

So I'm following, you're saying patial_genotype_calls = lambda ds: (ds.call_genotype == -1).any(dim='ploidy') right (ignoring non-alleles)?

Yes exactly.

would you ever use a switch that allows for partial calls to be included in the alternate counts?

I don't actually have a use case for either at the moment, just suggesting an implementation.

The scikit-allel version does actually count alleles in partial genotypes if fill=0 because of the if statement (possibly a bug?):

>>> import allel
>>> g = allel.GenotypeArray([
...     [[1,1],[1,1]],
...     [[1,1],[1,-1]]
... ])
>>> g.to_n_alt(fill=0)
array([[2, 2],
       [2, 1]], dtype=int8)

vs

>>> g.to_n_alt(fill=-1)
array([[ 2,  2],
       [ 2, -1]], dtype=int8)

Another option is to always count the alleles of partial genotypes (only returning the fill value if all calls are missing) and leave it up to the user to filter out the partial genotypes as in #223. This may be preferable to handling partial genotypes in multiple functions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core operations Issues related to domain-specific functionality such as LD pruning, PCA, association testing, etc.
Projects
None yet
Development

No branches or pull requests

2 participants