Filtering partial genotype calls #223

timothymillar · 2020-09-01T08:59:13Z

It's often necessary to filter out partial genotype calls in order to avoid bias in derived statistics such as allele frequencies.

For example (at a single locus) the set of 3 tetraploid genotype calls 0/0/0/1 0/1/./. 0/0/1/1 would be filtered to 0/0/0/1 ./././. 0/0/1/1 and be treated as 2 observations for the purpose of calculating summary statistics.

In scikit-allel this can be achieved by setting the mask attribute which indicates genotype calls to be excluded from further calculations (It's worth noting that in scikit-allel the mask is of shape (variants, samples) unlike the sgkit mask which is of shape (variants, samples, ploidy) and is used in a different context).

A consideration is that the current API design is to accumulate calculated statistics into a single DataSet object (issue #103). Therefore any statistics calculated prior to removing partial genotypes are likely to be inaccurate after the removal of partial genotypes.

One approach would be to create a "new" dataset that does not copy additional values from the original.

def filter_partial_calls(ds: DataSet) -> DataSet:
    G = ds.call_genotype
    F = (G < 0).any(axis=-1)
    G1 = xr.where(F, -1, G)

    return sgkit.create_genotype_call_dataset(
        variant_contig_names=ds.contigs,
        variant_contig=ds.variant_contig,
        variant_position=ds.variant_position,
        variant_alleles=ds.variant_allele,
        sample_id=ds.sample_id,
        call_genotype=G1,
        call_genotype_phased=ds.call_genotype_phased,
        variant_id=ds.variant_id,
    )

This could also be broken into 2 more general functions

def partial_calls(ds: Dataset, merge: bool = True) -> Dataset:
    # Returns ds with array of booleans called `call_is_partial` with shape (variant, sample)
    ...

def filter_calls(ds: DataSet, bools: str) -> DataSet:
    # Filters calls based on array `bools` with shape (variant, sample)
    ...

The text was updated successfully, but these errors were encountered:

jeromekelleher · 2020-09-01T11:11:42Z

Sounds good @timothymillar. I wonder how these specific considerations for filtering out different types of calls feed into the more general problem of grouping things. We started a discussion in this thread about what such an xarray-idiomatic grouping API might look like. Is this related, or a side-issue do you think?

jeromekelleher · 2020-09-01T11:15:09Z

Hmm, I guess the distinction here is that we're interested there in grouping by a set of samples that have some fixed properties across all variants, but here we're filtering out particular calls here. Still, it seems to come back to the fundamental question of "do we create a new subset dataset or augment the current one".

timothymillar · 2020-09-01T21:04:27Z

it seems to come back to the fundamental question of "do we create a new subset dataset or augment the current one".

Yes and I can already see that a flaw with my approach above is that there may be additional arrays within a dataset that will not be invalidated and should be carried across (e.g. a pedigree based kinship matrix). So perhaps the filtered array should be merged with a copy of the original dataset and we simply add a warning in the docstring.

tomwhite · 2020-11-10T16:20:42Z

Fixed in #308

hammer added the core operations Issues related to domain-specific functionality such as LD pruning, PCA, association testing, etc. label Sep 1, 2020

timothymillar mentioned this issue Sep 10, 2020

Mixed ploidy genotype_call_dataset #243

Closed

timothymillar mentioned this issue Sep 30, 2020

Add count_call_alternate_alleles function #282

Open

6 tasks

timothymillar added a commit to timothymillar/sgkit that referenced this issue Oct 6, 2020

Add filter_partial_calls function sgkit-dev#223

df82b6a

timothymillar mentioned this issue Oct 6, 2020

Filter partial calls #308

Merged

timothymillar added a commit to timothymillar/sgkit that referenced this issue Oct 19, 2020

Add filter_partial_calls function sgkit-dev#223

ce92917

timothymillar added a commit to timothymillar/sgkit that referenced this issue Oct 19, 2020

Add filter_partial_calls function sgkit-dev#223

72e5d4f

timothymillar added a commit to timothymillar/sgkit that referenced this issue Oct 20, 2020

Add filter_partial_calls function sgkit-dev#223

0b95056

mergify bot pushed a commit that referenced this issue Oct 20, 2020

Add filter_partial_calls function #223

94610c5

tomwhite closed this as completed Nov 10, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Filtering partial genotype calls #223

Filtering partial genotype calls #223

timothymillar commented Sep 1, 2020

jeromekelleher commented Sep 1, 2020

Uh oh!

jeromekelleher commented Sep 1, 2020

Uh oh!

timothymillar commented Sep 1, 2020

Uh oh!

tomwhite commented Nov 10, 2020

Uh oh!

Filtering partial genotype calls #223

Filtering partial genotype calls #223

Comments

timothymillar commented Sep 1, 2020

jeromekelleher commented Sep 1, 2020

Uh oh!

jeromekelleher commented Sep 1, 2020

Uh oh!

timothymillar commented Sep 1, 2020

Uh oh!

tomwhite commented Nov 10, 2020

Uh oh!