model_builder scikit-learn integration #155


Closed
pdb5627 opened this issue Apr 27, 2023 · 11 comments · Fixed by #161

Comments

@pdb5627
Contributor

pdb5627 commented Apr 27, 2023

I am working on making an existing scikit-learn model pipeline produce probabilistic output. To do that, I used model_builder to make a pymc model that could integrate into a scikit-learn Pipeline, including standardization of inputs and outputs. However, I find that the current API doesn't seem suitable for this. I made my own modifications to the ModelBuilder class and example LinearModel subclass to get it to work. I think the main change was to have the fit and predict methods take X and y as separate parameters rather than as members of a data dict with specially-named keys. My reference for the scikit-learn estimator API is the scikit-learn documentation and template for TemplateEstimator.

I may very well be on the wrong track (or at least on a different one than what model_builder intends), but what I came up with seems to work for applying sklearn.preprocessing.StandardScaler to inputs, and to point outputs via sklearn.compose.TransformedTargetRegressor. These seem like reasonable integration goals for ModelBuilder subclasses, so maybe tests and/or examples of this would be good.
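
For concreteness, the kind of composition I'm aiming for looks roughly like the sketch below (just a sketch: it assumes the modified LinearModel exposes a sklearn-style fit(X, y)/predict(X) interface, which is not what model_builder currently provides):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.compose import TransformedTargetRegressor

# Hypothetical: LinearModel is the ModelBuilder subclass with the
# sklearn-style fit(X, y)/predict(X) interface described above.
inner = Pipeline([
    ("scale_X", StandardScaler()),   # standardize inputs
    ("model", LinearModel()),        # PyMC model wrapped as an estimator
])
model = TransformedTargetRegressor(
    regressor=inner,
    transformer=StandardScaler(),    # standardize the target as well
)
# model.fit(X, y); model.predict(X_new) returns point predictions
# back-transformed to the original scale.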

Any thoughts? I'm happy to contribute what I can.

@twiecki
Member

twiecki commented Apr 27, 2023

Thanks for checking it out @pdb5627, this is highly relevant for us. The issue with accepting X and y like sklearn does is that PyMC models do not necessarily have just a single input and a single output, so we opted for something more general at the cost of breaking with the sklearn API.

Thinking about this more, however: there will always be data for a likelihood, the y, and almost always there will be input data X. There might be multiple X inputs, however. In that case, maybe the user just grabs the relevant columns from X? But they could have different shapes.

As you can tell, that's a nut we haven't been able to crack. Do you have any ideas?

CC @michaelraczycki

@pdb5627
Contributor Author

pdb5627 commented Apr 27, 2023

In general I think scikit-learn can handle multiple inputs and multiple outputs just fine, but it does expect the shapes to match. Maybe one way to handle it would be to encapsulate the input data handling separately from the model parameter inference, so that subclasses could customize how input parameters are passed if necessary. The fit method would then be a template method, something like this:

def fit(self, X, y, **kwargs):
    # Normalize X and y to an np.array or pd.DataFrame or whatever
    X, y = self._validate_input(X, y, **kwargs)
    # Build the pymc model. Must be implemented by subclasses.
    self.build_model(X, y, **kwargs)
    # Sample the model
    self.idata = self.sample_model(**kwargs)
 
    self.is_fitted_ = True
    return self

The places where the data structure used for X and y matters are build_model and _data_setter. The subclass implements those, so if breaking compatibility with scikit-learn is needed, the subclass can decide to do so. I don't think enabling compatibility with scikit-learn restricts generality, except that X and y have to be separate parameters.
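
As a rough illustration only (this is not the current model_builder API; the variable names and the pm.MutableData/pm.set_data usage are just one way a subclass could do it), the example LinearModel might look something like:

import numpy as np
import pymc as pm

class LinearModel(ModelBuilder):
    def build_model(self, X, y, **kwargs):
        with pm.Model() as self.model:
            x = pm.MutableData("x", np.asarray(X).squeeze())
            y_obs = pm.MutableData("y", np.asarray(y))
            intercept = pm.Normal("intercept", 0, 10)
            slope = pm.Normal("slope", 0, 10)
            sigma = pm.HalfNormal("sigma", 1)
            # shape=x.shape lets the observed node resize when new X is set
            pm.Normal("obs", intercept + slope * x, sigma,
                      observed=y_obs, shape=x.shape)

    def _data_setter(self, X, y=None):
        with self.model:
            pm.set_data({"x": np.asarray(X).squeeze()})
            if y is not None:
                pm.set_data({"y": np.asarray(y)})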

@twiecki
Member

twiecki commented Apr 27, 2023

@pdb5627 Good points, but then isn't just overriding .fit() with the X, y call structure already possible?

@pdb5627
Contributor Author

pdb5627 commented Apr 28, 2023 via email

@twiecki
Member

twiecki commented Apr 28, 2023

@pdb5627 I think that's worth a try, having sklearn-compatibility would be a huge plus. Want to do a PR?

@michaelraczycki
Collaborator


I think it's actually very close to what we're already doing now. In the latest (as yet unreleased) stage of the model builder, the build_model function already does everything mentioned; the only difference I see is that we're not accepting separated predictors and observed data (since I assume that's what y is supposed to be here). With that in mind, if the data split is all that's needed, it seems like an obvious choice: all classes that inherit from model builder will need their own data preprocessing method anyway, so projects that lean more towards scikit-learn will follow the same convention, just in a slightly different way.

@pdb5627 I'm happy to collaborate on this if you'd like some assistance. I'd love to have a quick call with you to discuss it further.

@pdb5627
Contributor Author

pdb5627 commented Apr 30, 2023

@michaelraczycki Thanks for the offer of assistance. I put together a draft PR based on what I had in mind. Maybe you could take a look and see what you think, then if you want we can find a time to talk.

@theorashid
Contributor

Is it possible to bring back the model builder (or another class) that accepts a general number of variables rather than just X and y?

So having a SklearnEstimator (or BayesianEstimator) that inherits from ModelBuilder. Or whatever you want to call it; basically, bringing it back but with all the edits that have been made since.

@twiecki
Member

twiecki commented Nov 1, 2023

@theorashid Yes, I think we should have 2 classes. Is that something you could make a PR on?

@theorashid
Contributor

I could, but is there someone better placed who has been working on these classes for the past year? There's also an open PR #249 that I don't want to clash with.

If not, I would be happy to pick it up later in the month, with some direction on which parts of the old ModelBuilder/BayesianEstimator classes to bring back.

@twiecki
Member

twiecki commented Nov 1, 2023

@theorashid Let's try to get #249 merged to unblock this effort. It would be great to get your help on this. I think ModelBuilder should have the fit() API from before, and BayesianEstimator the sklearn-like API.
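
Roughly what I have in mind (just a sketch of the split; the names come from this thread, not from an existing implementation):

# Hypothetical skeleton, not an actual pymc-experimental API.
class ModelBuilder:
    def fit(self, data: dict, **kwargs):
        """General interface: arbitrary named inputs and observed data."""
        ...

class BayesianEstimator(ModelBuilder):
    def fit(self, X, y, **kwargs):
        """sklearn-like interface: separate X and y, returns self."""
        ...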
