model_builder scikit-learn integration #155
Thanks for checking it out @pdb5627, this is highly relevant for us. The issue with accepting X and y like sklearn does is that PyMC models do not necessarily have just a single input and a single output dataset, so we opted for something more general at the cost of breaking with the sklearn API. Having a deeper think about this, however: there will always be data for a likelihood. As you can tell, that's a nut we haven't been able to crack. Do you have any ideas?
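*Editor's note: a minimal sketch, not from the thread, of the kind of model that breaks the single X/y assumption: two likelihoods conditioned on observed arrays of different shapes. All names and data here are illustrative.*

```python
import numpy as np
import pymc as pm

counts = np.array([3, 7, 2, 9])            # first observed dataset
measurements = np.random.normal(size=50)   # second, differently shaped dataset

with pm.Model():
    rate = pm.Exponential("rate", 1.0)
    mu = pm.Normal("mu", 0.0, 10.0)
    # Two likelihoods over data of different shapes; neither maps
    # cleanly onto sklearn's single (X, y) pair.
    pm.Poisson("obs_counts", mu=rate, observed=counts)
    pm.Normal("obs_measurements", mu=mu, sigma=1.0, observed=measurements)
```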
In general I think scikit-learn can handle multiple inputs and multiple outputs just fine, but it does expect the shapes to match. Maybe one way to handle it would be to encapsulate the input data handling separately from the model parameter inference, so that subclasses could customize the way input parameters are passed if necessary. So the `fit` method could look something like this:

```python
def fit(self, X, y, **kwargs):
    # Normalize X and y to an np.array or pd.DataFrame or whatever
    X, y = self._validate_input(X, y, **kwargs)
    # Build the pymc model. Must be implemented by subclasses.
    self.build_model(X, y, **kwargs)
    # Sample the model
    self.idata = self.sample_model(**kwargs)
    self.is_fitted_ = True
    return self
```

The places where the data structure used for X and y will matter would be in `_validate_input` and `build_model`.
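*Editor's note: a hedged sketch of what the `_validate_input` hook above might do, assuming scikit-learn is treated as an optional dependency (an idea raised later in this thread). The class name and the fallback branch are illustrative, not existing API.*

```python
import numpy as np

class SklearnStyleModelBuilder:  # illustrative container for the hook
    def _validate_input(self, X, y, **kwargs):
        """Coerce X and y to arrays, using sklearn's validator when available."""
        try:
            from sklearn.utils.validation import check_X_y
            return check_X_y(X, y, accept_sparse=False)
        except ImportError:
            # Plain numpy coercion when scikit-learn is not installed.
            return np.asarray(X), np.asarray(y)
```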
@pdb5627 Good points, but then isn't just overriding `.fit()` with the X, y call structure already possible?
In its current state, `.fit` also has the inference calls in it, so overriding it means copying all that code and missing out on potential future changes/improvements to that code. I guess it's debatable whether the "default" should be scikit-learn compatible or not. I think being able to use scikit-learn for scaling, feature encoding, transformation, etc. would be helpful for many models, to reduce boilerplate for those things.
One way to avoid having to choose one way or the other would be to have two variants available to subclass from: one that is "scikit-like" but not fully compatible with nor dependent on scikit-learn, and another that can easily be integrated into a scikit-learn workflow, possibly with scikit-learn as an optional dependency (e.g. for handling data validation).
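*Editor's note: a rough, hypothetical skeleton of the two-variant idea; none of these class names come from the actual codebase.*

```python
class ModelBuilder:
    """General variant: accepts arbitrarily many named data arrays."""

    # build_model / sample_model are supplied by concrete subclasses.
    def fit(self, data: dict, **kwargs):
        self.build_model(data, **kwargs)
        self.idata = self.sample_model(**kwargs)
        self.is_fitted_ = True
        return self


class SklearnModelBuilder(ModelBuilder):
    """Sklearn-style variant: adapts fit(X, y) onto the general interface."""

    def fit(self, X, y, **kwargs):
        # Map the (X, y) pair onto the base class's data dict.
        return super().fit({"X": X, "y": y}, **kwargs)
```

The sklearn-style class stays a thin adapter, so models with more general data needs lose nothing by the base class remaining dict-based.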
What do you see as the advantages of the current interface? How important is it not to break it?
@pdb5627 I think that's worth a try, having sklearn-compatibility would be a huge plus. Want to do a PR?
I think it's actually something very close to what we're already doing now. In the latest (yet unreleased) stage of the model builder, the build_model function does all that is mentioned anyway; the only difference I see is that we're not accepting the predictors and the observed data separately (I assume that's what the y is supposed to be here). With that in mind, if the data split is all that is needed, it seems like an obvious choice: all classes that are meant to inherit from model builder will need their own data preprocessing method anyway, so projects that lean more towards scikit-learn will follow the same convention, just in a slightly different way. @pdb5627 I'm happy to collaborate on this project if you'd like some assistance, and I'd love to have a quick call with you to discuss it further.
@michaelraczycki Thanks for the offer of assistance. I put together a draft PR based on what I had in mind. Maybe you could take a look and see what you think, then if you want we can find a time to talk.
Is it possible to bring back the model builder (or another class) that accepts a general number of variables rather than just X and y? So having a general-purpose class alongside the sklearn-compatible one.
@theorashid Yes, I think we should have 2 classes. Is that something you could make a PR on?
I could, but is there someone better placed who has been working on these classes for the past year? There's also an open PR #249 that I don't want to clash with. If not, I would be happy to pick it up later in the month, with some direction on which parts of the old ModelBuilder/BayesianEstimator classes to bring back. |
@theorashid Let's try and get #249 merged to unblock this effort. Would be great to get your help on this.
I am working on making an existing scikit-learn model pipeline produce probabilistic output. To do that, I used model_builder to make a pymc model that could integrate into a scikit-learn `Pipeline`, including standardization of inputs and outputs. However, I find that the current API doesn't seem suitable for this. I made my own modifications to the `ModelBuilder` class and the example `LinearModel` subclass to get it to work. I think the main change was to have the `fit` and `predict` methods take X and y as separate parameters rather than as members of a data dict with specially-named keys. My reference for the scikit-learn estimator API is the scikit-learn documentation and the template for `TemplateEstimator`.

I might very well be on the wrong track (or at least on a different one than what model_builder intends), but what I came up with seems to work for applying `sklearn.preprocessing.StandardScaler` to inputs, and to point outputs using `sklearn.compose.TransformedTargetRegressor`. These seem like reasonable goals for `ModelBuilder` subclasses to be able to integrate with, so maybe tests and/or examples of such would be good.

Any thoughts? I'm happy to contribute what I can.
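*Editor's note: a minimal sketch of the stated integration goal, assuming `LinearModel` exposes sklearn-style `fit(X, y)` and `predict(X)`; the wiring shown is illustrative, not the actual example code from the repo.*

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# LinearModel: the hypothetical sklearn-compatible ModelBuilder subclass.
model = TransformedTargetRegressor(
    regressor=Pipeline([
        ("scale_X", StandardScaler()),   # standardize inputs
        ("pymc_model", LinearModel()),
    ]),
    transformer=StandardScaler(),        # standardize y; predictions are inverse-transformed
)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([1.0, -2.0]) + rng.normal(scale=0.1, size=100)

model.fit(X, y)
y_point = model.predict(X)  # point predictions back on the original y scale
```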