Filtering partial genotype calls #223
Comments
Sounds good @timothymillar. I wonder how these specific considerations for filtering out different types of calls feed into the more general problem of grouping things. We started a discussion in this thread about what such an xarray-idiomatic grouping API might look like. Is this related, or a side issue, do you think?
Hmm, I guess the distinction is that there we're interested in grouping by a set of samples that have some fixed properties across all variants, whereas here we're filtering out particular calls. Still, it seems to come back to the fundamental question of "do we create a new subset dataset or augment the current one".
Yes, and I can already see that a flaw with my approach above is that there may be additional arrays within a dataset that will not be invalidated and should be carried across (e.g. a pedigree-based kinship matrix). So perhaps the filtered array should be merged with a copy of the original dataset and we simply add a warning in the docstring.
Fixed in #308
It's often necessary to filter out partial genotype calls in order to avoid bias in derived statistics such as allele frequencies.
For example (at a single locus) the set of 3 tetraploid genotype calls
0/0/0/1 0/1/./. 0/0/1/1
would be filtered to
0/0/0/1 ./././. 0/0/1/1
and be treated as 2 observations for the purpose of calculating summary statistics.
In scikit-allel this can be achieved by setting the `mask` attribute, which indicates genotype calls to be excluded from further calculations. (It's worth noting that in scikit-allel the `mask` is of shape `(variants, samples)`, unlike the sgkit `mask`, which is of shape `(variants, samples, ploidy)` and is used in a different context.)
A consideration is that the current API design is to accumulate calculated statistics into a single Dataset object (issue #103). Therefore any statistics calculated prior to removing partial genotypes are likely to be inaccurate after the removal of partial genotypes.
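To make the tetraploid example above concrete, here is a minimal NumPy sketch of the filtering step on a bare genotype array. It is not the code from this issue or from sgkit: the function name `mask_partial_calls` is hypothetical, and it assumes missing alleles are encoded as `-1` in an integer array of shape `(variants, samples, ploidy)`.

```python
import numpy as np


def mask_partial_calls(genotypes, missing=-1):
    """Return a copy in which every partial call is set entirely to missing.

    Hypothetical helper for illustration only.
    genotypes: integer array of shape (variants, samples, ploidy),
    with `missing` (assumed to be -1) marking an unknown allele.
    """
    genotypes = np.asarray(genotypes)
    is_missing = genotypes == missing
    # A call is partial if some, but not all, of its alleles are missing.
    partial = is_missing.any(axis=-1) & ~is_missing.all(axis=-1)
    filtered = genotypes.copy()
    # Boolean indexing over (variants, samples) selects whole calls.
    filtered[partial] = missing
    return filtered


# The single-locus example: 0/0/0/1  0/1/./.  0/0/1/1
calls = np.array([[[0, 0, 0, 1], [0, 1, -1, -1], [0, 0, 1, 1]]])
filtered = mask_partial_calls(calls)

# Number of fully called genotypes per variant after filtering.
n_called = (filtered >= 0).all(axis=-1).sum(axis=-1)
```

Returning a copy rather than modifying the input keeps the original calls available, which leaves open the question of whether the result goes into a new dataset or is merged back into the existing one.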
One approach would be to create a "new" dataset that does not copy additional values from the original.
This could also be broken into 2 more general functions