-
Notifications
You must be signed in to change notification settings - Fork 35
PCA implementation #262
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PCA implementation #262
Conversation
6d198c6
to
865c4e8
Compare
aaf48d1
to
40e9612
Compare
fyi @jeromekelleher I didn't use msprime for anything here because the features mentioned in https://github.com/pystatgen/sgkit/issues/23#issue-653340614 still aren't released afaik. Looks like it's been quite a while since the last one (Dec 2019). Is that likely to change anytime soon? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM @eric-czech, just a minor point on my aversion to hard-coding default values in the method signature. I know \approx nothing about PCA though, so pinging @alimanfoo for a review.
sgkit/stats/pca.py
Outdated
n_components: int = 10, | ||
*, | ||
ploidy: Optional[int] = None, | ||
scaler: Union[BaseEstimator, str] = "patterson", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tend to prefer having "None" as the default in the signature as it gives a bit more flexibility, both in the context of other parameters and over time as the API evolves. Are you sure that "patterson" will always be the right default value, in every context?
Same for other value here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm I can't think of a scenario where the default scaler becomes a choice made based on other inputs, but making it more flexible in case one comes up sounds good to me. It definitely makes sense for the algorithm
parameter. I changed both in pystatgen/sgkit@28a9b49.
There'll be a beta in the coming weeks, but hard to know when we'll ship a final release. I think you made the right choice, we can follow up with more detailed tests later. |
This PR has conflicts, @eric-czech please rebase and push updated version 🙏 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great @eric-czech. No real substantive comments from me. It's good to see lots of very comprehensive tests here, and also that most of the PCA implementation details are upstream.
sgkit/stats/pca.py
Outdated
>>> import xarray as xr | ||
>>> import numpy as np | ||
>>> import sgkit as sg | ||
>>> from sgkit.testing import simulate_genotype_call_dataset |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function is in the top-level now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -1,6 +1,7 @@ | |||
numpy | |||
xarray | |||
dask[array] | |||
dask-ml |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This pulls in scikit-learn
, Dask distributed and some other dependencies. That's probably OK, but thinking about if there's any way to minimise transitive dependencies here.
@jeromekelleher should I wait for @alimanfoo to review or merge? Should have asked on the call. |
Codecov Report
@@ Coverage Diff @@
## master #262 +/- ##
==========================================
+ Coverage 97.12% 97.37% +0.25%
==========================================
Files 16 18 +2
Lines 1042 1144 +102
==========================================
+ Hits 1012 1114 +102
Misses 30 30
Continue to review full report at Codecov.
|
This PR has conflicts, @eric-czech please rebase and push updated version 🙏 |
No, if we haven't heard from him in a few days don't block (same goes for me, over the next few weeks). Merge away here I'd say. |
sgkit/stats/pca.py
Outdated
"`ploidy` must be specified explicitly when not present in dataset dimensions" | ||
) | ||
ploidy = ds.dims["ploidy"] | ||
if scaler is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: you could also say: scaler = scaler or "patterson"
, same for algorithm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed
sgkit/stats/pca.py
Outdated
) | ||
if algorithm is None: | ||
algorithm = "tsqr" | ||
if algorithm not in ["tsqr", "randomized"]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: ["tsqr", "randomized"]
should be a set {"tsqr", "randomized"}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed
return conditional_merge_datasets(ds, new_ds, merge) | ||
|
||
|
||
def pca( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this function use sgkit.variables
, which would mean:
- using variable references instead of strings
- validate input
- validate output
- update the docs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added the variables (pystatgen/sgkit@05fa77d#diff-cd77642bae81c13ab776800efe4ba498900a284bc2e4741d943b727f60b8b7e1R250-R284) and updated references (pystatgen/sgkit@05fa77d#diff-e66844198337edbcfa8f27fbcbcc58ef011272669fe9295119b0847c1b7cdf69R134). Squashed one too many commits before rebasing on the current master so I lost a clean delta, but that was basically all I changed.
sgkit/tests/test_preprocessing.py
Outdated
if use_nan and ac.dtype.kind != "f": | ||
return | ||
if use_nan: | ||
# Test that nan and negative sentinel values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there is something wrong with this comment (?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good call, fixed
|
||
|
||
def pca_stats(ds: Dataset, est: BaseEstimator, *, merge: bool = True) -> Dataset: | ||
""" Extract attributes from PCA estimator """ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: when you have a single line docstring, you add a space prefix/suffix, why? Is this documented somewhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR has conflicts, @eric-czech please rebase and push updated version 🙏 |
Codecov Report
@@ Coverage Diff @@
## master #262 +/- ##
==========================================
+ Coverage 96.35% 96.55% +0.19%
==========================================
Files 26 28 +2
Lines 1866 1972 +106
==========================================
+ Hits 1798 1904 +106
Misses 68 68
Continue to review full report at Codecov.
|
Implementation for https://github.com/pystatgen/sgkit/issues/95.
Notes:
Upstream Fixes
Some upstream fixes necessary for this were:
Discussion
TODO
count_call_alternate_alleles
into issue