PCA #227
Throwing this back in here to show that we still care. I have some tests demonstrating that dask-ml PCA fails depending on the shape of the data.
There are two problems I see with dask-ml PCA so far:
Both of those are problems we should solve regardless of whether we subclass, vendor in, or omit code in sgkit. At the moment, it doesn't look like they'll be hard to fix upstream. To me, the simplest way to support PCA is to:
I can imagine that we might eventually find some reason to want to do PCA differently than how dask-ml and dask currently support it, but I can't see any reason why those things would only apply to genetics. Let me know if you disagree @jeromekelleher / @jerowe. I guess I wouldn't be too surprised if we one day vendor in a bunch of external code like this, though I'm struggling to find a reason why it's necessary now.
I'll go with whatever the vote is, but my personal vote is that we should support PCA. The GenotypePCA class is 74 lines of code, with 9 of those being a call to super in the init function so we have the same defaults as the GenotypePCA. The rest is just porting over docstrings so it's obvious to users where the differences are. I'd consider PCA to be a baseline in terms of core functionality myself. We can still subclass dask-ml PCA and just change the
Transposing the input and supporting sign determinacy aren't specific to our domain though. Fixing svd_flip (dask/dask-ml#732) won't take much except a little time to figure out where dask/dask#3576 (comment) is at. I put in dask/dask#6591 yesterday to try to address the deeper svd issue with column chunkings (unrelated to svd_flip). If we would like to wrap this all up for users now, then this interface would be consistent with the rest of the API:

```python
from sklearn.base import BaseEstimator as Estimator

def pca(ds: Dataset, n_components: int, est: Optional[Estimator] = None, ...) -> Tuple[Dataset, Estimator]:
    # Make sure allele count is present
    if est is None:
        est = ...  # Create and fit scaling / PCA pipeline estimator (sklearn or Dask-ML, depending on chunking)
    # Attach results to Dataset
    return ds, est
```

Then we could wrap the utilities rather than extending them in a way that involves copy+pasting so much code in order to work around domain-agnostic upstream limitations.

EDIT: I can also see value in instead starting with a function like
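
For anyone following the sign determinacy point above, here is a minimal NumPy-only sketch of what an svd_flip-style step is meant to achieve. It is not the dask-ml or scikit-learn implementation: `flip_signs` is a hypothetical helper written for illustration, and the convention it uses (make the largest-magnitude entry of each column of `u` positive) is an assumption about one reasonable way to pin the signs down.

```python
import numpy as np


def flip_signs(u, vt):
    """Hypothetical helper: fix the sign of each component so that the
    largest-magnitude entry in every column of u is positive."""
    max_rows = np.argmax(np.abs(u), axis=0)
    signs = np.sign(u[max_rows, np.arange(u.shape[1])])
    signs[signs == 0] = 1.0  # guard against an all-zero column
    # Flip column j of u and row j of vt together so u @ diag(s) @ vt is unchanged.
    return u * signs, vt * signs[:, np.newaxis]


rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4))
u, s, vt = np.linalg.svd(x, full_matrices=False)

# A second, equally valid decomposition: negate some components arbitrarily.
arbitrary = np.array([1.0, -1.0, -1.0, 1.0])
u_alt, vt_alt = u * arbitrary, vt * arbitrary[:, np.newaxis]

u1, vt1 = flip_signs(u, vt)
u2, vt2 = flip_signs(u_alt, vt_alt)
```

Both decompositions reconstruct `x`, but only after the flip do they agree component by component, which is why results can differ in sign between backends or chunkings when this step is missing or applied inconsistently.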
Closing this in favor of #262
I couldn't push to pystatgen/sgkit so I'm reopening this PR here.
Continues - https://github.com/pystatgen/sgkit/pull/123
Linked Issue - https://github.com/pystatgen/sgkit/issues/95