Skip to content

Consider standardizing the API of pca and pc_relate #1077

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
timothymillar opened this issue Apr 27, 2023 · 4 comments
Open

Consider standardizing the API of pca and pc_relate #1077

timothymillar opened this issue Apr 27, 2023 · 4 comments
Labels
question Further information is requested

Comments

@timothymillar
Copy link
Collaborator

These were implemented early on, before the schema/variables where standardized. I'm not sure if there's any appetite to update them, but here are a few observations:

  • Both use custom methods to calculate an equivalent to call_dosage.

    • pca creates an undocumented variable called "call_alternate_allele_count" from call_allele_count (line)
    • pc_relate internally creates alternate allele counts directly from the call_genotype (line)
    • They would likely be more flexible if defined in terms of the call_dosage variable
    • call_dosage could be automatically computed from call_genotype for backwards compatibility, see Add count_call_alternate_alleles function #282
  • pc_relate does its own filtering by MAF. This doesn't seem to be the approach taken elsewhere, but I'm not sure that we have a "standard" approach to filtering?

I was looking at the implementation after considering having "pc-relate" as an estimator option in genomic_relationship. AFAICT this should work so long as a sample_pc parameter was added (similar to ancestral_frequency for the VanRaden estimator). This would return relationships (as opposed to pc_relate_phi which is an estimate of kinship).

@timothymillar timothymillar added the question Further information is requested label Apr 27, 2023
@hammer
Copy link
Contributor

hammer commented Apr 27, 2023

cc @ravwojdyla and @eric-czech in case they have thoughts

@ravwojdyla
Copy link
Collaborator

pc_relate does its own filtering by MAF. This doesn't seem to be the approach taken elsewhere, but I'm not sure that we have a "standard" approach to filtering?

@timothymillar the sgkit pc_relate was based of off the Hail implementation (https://github.com/pystatgen/sgkit/issues/24) also compared to GENESIS, which both handle MAF filter. +1 to standardizing, please let me know if I can help in any way.

@eric-czech
Copy link
Collaborator

pca creates an undocumented variable called "call_alternate_allele_count" from call_allele_count

+1 to standardizing as much as possible as well, but it will be important to keep in mind that the default scaler for pca only works with true alt allele counts. The relevant docs for it: https://github.com/pystatgen/sgkit/blob/0bccf6a660d04c95c1ac73a2408cc0deb11f27a9/sgkit/stats/preprocessing.py#L15-L23

IIRC this was to replicate scikit-allel PCA and get close to parity with hwe_normalized_pca by default. If we wanted to make this more generic so that it could run on dosages, then I would suggest this:

  1. We leave pca as it is
  2. We add two overload functions like decompose_call_alternate_alleles and decompose_dosages (naming consistent with sgkit.stats.aggregation as well as familiar for scikit-learn users) that take away the choice for variable and scaler, and call pca with appropriate settings for them

please let me know if I can help in any way.

Ditto! Let me know if any of that isn't clear or you see a better way.

@timothymillar
Copy link
Collaborator Author

Thanks @eric-czech, decompose_dosages is a good name. Just a thought, the call_dosage variable can optionally be an integer type. So, we could potentially still use it with the PattersonScaler and raise an error if it contains floats. Integer alternate allele counts for biallelic variants can already be computed using convert_call_to_index.

So, we could have a general API where the default variable is call_dosage. If absent, call_dosage is automatically computed as an integer array using convert_call_to_index. An error is raised If the patterson scalar is used and call_dosage is already an array of floats.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

4 participants