Skip to content

Filtering partial genotype calls #223

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
timothymillar opened this issue Sep 1, 2020 · 4 comments
Closed

Filtering partial genotype calls #223

timothymillar opened this issue Sep 1, 2020 · 4 comments
Labels
core operations Issues related to domain-specific functionality such as LD pruning, PCA, association testing, etc.

Comments

@timothymillar
Copy link
Collaborator

It's often necessary to filter out partial genotype calls in order to avoid bias in derived statistics such as allele frequencies.

For example (at a single locus) the set of 3 tetraploid genotype calls 0/0/0/1 0/1/./. 0/0/1/1 would be filtered to 0/0/0/1 ./././. 0/0/1/1 and be treated as 2 observations for the purpose of calculating summary statistics.

In scikit-allel this can be achieved by setting the mask attribute which indicates genotype calls to be excluded from further calculations (It's worth noting that in scikit-allel the mask is of shape (variants, samples) unlike the sgkit mask which is of shape (variants, samples, ploidy) and is used in a different context).

A consideration is that the current API design is to accumulate calculated statistics into a single DataSet object (issue #103). Therefore any statistics calculated prior to removing partial genotypes are likely to be inaccurate after the removal of partial genotypes.

One approach would be to create a "new" dataset that does not copy additional values from the original.

def filter_partial_calls(ds: DataSet) -> DataSet:
    G = ds.call_genotype
    F = (G < 0).any(axis=-1)
    G1 = xr.where(F, -1, G)

    return sgkit.create_genotype_call_dataset(
        variant_contig_names=ds.contigs,
        variant_contig=ds.variant_contig,
        variant_position=ds.variant_position,
        variant_alleles=ds.variant_allele,
        sample_id=ds.sample_id,
        call_genotype=G1,
        call_genotype_phased=ds.call_genotype_phased,
        variant_id=ds.variant_id,
    )

This could also be broken into 2 more general functions

def partial_calls(ds: Dataset, merge: bool = True) -> Dataset:
    # Returns ds with array of booleans called `call_is_partial` with shape (variant, sample)
    ...

def filter_calls(ds: DataSet, bools: str) -> DataSet:
    # Filters calls based on array `bools` with shape (variant, sample)
    ...
@jeromekelleher
Copy link
Collaborator

Sounds good @timothymillar. I wonder how these specific considerations for filtering out different types of calls feed into the more general problem of grouping things. We started a discussion in this thread about what such an xarray-idiomatic grouping API might look like. Is this related, or a side-issue do you think?

@jeromekelleher
Copy link
Collaborator

Hmm, I guess the distinction here is that we're interested there in grouping by a set of samples that have some fixed properties across all variants, but here we're filtering out particular calls here. Still, it seems to come back to the fundamental question of "do we create a new subset dataset or augment the current one".

@hammer hammer added the core operations Issues related to domain-specific functionality such as LD pruning, PCA, association testing, etc. label Sep 1, 2020
@timothymillar
Copy link
Collaborator Author

it seems to come back to the fundamental question of "do we create a new subset dataset or augment the current one".

Yes and I can already see that a flaw with my approach above is that there may be additional arrays within a dataset that will not be invalidated and should be carried across (e.g. a pedigree based kinship matrix). So perhaps the filtered array should be merged with a copy of the original dataset and we simply add a warning in the docstring.

timothymillar added a commit to timothymillar/sgkit that referenced this issue Oct 6, 2020
timothymillar added a commit to timothymillar/sgkit that referenced this issue Oct 19, 2020
timothymillar added a commit to timothymillar/sgkit that referenced this issue Oct 19, 2020
timothymillar added a commit to timothymillar/sgkit that referenced this issue Oct 20, 2020
mergify bot pushed a commit that referenced this issue Oct 20, 2020
@tomwhite
Copy link
Collaborator

Fixed in #308

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core operations Issues related to domain-specific functionality such as LD pruning, PCA, association testing, etc.
Projects
None yet
Development

No branches or pull requests

4 participants