model_builder scikit-learn integration #155


Closed
pdb5627 opened this issue Apr 27, 2023 · 11 comments · Fixed by #161

Comments

@pdb5627
Contributor

pdb5627 commented Apr 27, 2023

I am working on making an existing scikit-learn model pipeline produce probabilistic output. To do that, I used model_builder to make a pymc model that could integrate into a scikit-learn Pipeline, including standardization of inputs and outputs. However, I find that the current API doesn't seem suitable for this. I made my own modifications to the ModelBuilder class and example LinearModel subclass to get it to work. I think the main change was to have the fit and predict methods take X and y as separate parameters rather than as members of a data dict with specially-named keys. My reference for the scikit-learn estimator API is the scikit-learn documentation and template for TemplateEstimator.

I may very well be on the wrong track (or at least on a different one than what model_builder intends), but what I came up with seems to work for applying sklearn.preprocessing.StandardScaler to inputs, and to point outputs via sklearn.compose.TransformedTargetRegressor. These seem like reasonable integration goals for ModelBuilder subclasses, so maybe tests and/or examples of this would be good.
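
For concreteness, the kind of composition I'm aiming for looks roughly like the sketch below (just a sketch: it assumes the modified LinearModel exposes a sklearn-style fit(X, y)/predict(X) interface, which is not what model_builder currently provides):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.compose import TransformedTargetRegressor

# Hypothetical: LinearModel is the ModelBuilder subclass with the
# sklearn-style fit(X, y)/predict(X) interface described above.
inner = Pipeline([
    ("scale_X", StandardScaler()),   # standardize inputs
    ("model", LinearModel()),        # PyMC model wrapped as an estimator
])
model = TransformedTargetRegressor(
    regressor=inner,
    transformer=StandardScaler(),    # standardize the target as well
)
# model.fit(X, y); model.predict(X_new) returns point predictions
# back-transformed to the original scale.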

Any thoughts? I'm happy to contribute what I can.

@twiecki
Member

twiecki commented Apr 27, 2023

Thanks for checking it out @pdb5627, this is highly relevant for us. The issue with accepting X and y like sklearn does is that PyMC models do not necessarily have just a single input and a single output, so we opted for something more general at the cost of breaking with the sklearn API.

Thinking about this more, however: there will always be data for a likelihood, the y, and almost always there will be input data X. There might be multiple X inputs, however. In that case, maybe the user just grabs the relevant columns from X? But they could have different shapes.

As you can tell, that's a nut we haven't been able to crack. Do you have any ideas?

CC @michaelraczycki

@pdb5627
Contributor Author

pdb5627 commented Apr 27, 2023

In general I think scikit-learn can handle multiple inputs and multiple outputs just fine, but it does expect the shapes to match. Maybe one way to handle it would be to encapsulate the input data handling separately from the model parameter inference, so that subclasses could customize how input parameters are passed if necessary. The fit method would then be a template method, something like this:

def fit(self, X, y, **kwargs):
    # Normalize X and y to an np.array or pd.DataFrame or whatever
    X, y = self._validate_input(X, y, **kwargs)
    # Build the pymc model. Must be implemented by subclasses.
    self.build_model(X, y, **kwargs)
    # Sample the model
    self.idata = self.sample_model(**kwargs)
 
    self.is_fitted_ = True
    return self

The places where the data structure used for X and y matters are build_model and _data_setter. The subclass implements those, so if breaking compatibility with scikit-learn is needed, the subclass can decide to do so. I don't think enabling compatibility with scikit-learn restricts generality, except that X and y have to be separate parameters.
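
As a rough illustration only (this is not the current model_builder API; the variable names and the pm.MutableData/pm.set_data usage are just one way a subclass could do it), the example LinearModel might look something like:

import numpy as np
import pymc as pm

class LinearModel(ModelBuilder):
    def build_model(self, X, y, **kwargs):
        with pm.Model() as self.model:
            x = pm.MutableData("x", np.asarray(X).squeeze())
            y_obs = pm.MutableData("y", np.asarray(y))
            intercept = pm.Normal("intercept", 0, 10)
            slope = pm.Normal("slope", 0, 10)
            sigma = pm.HalfNormal("sigma", 1)
            # shape=x.shape lets the observed node resize when new X is set
            pm.Normal("obs", intercept + slope * x, sigma,
                      observed=y_obs, shape=x.shape)

    def _data_setter(self, X, y=None):
        with self.model:
            pm.set_data({"x": np.asarray(X).squeeze()})
            if y is not None:
                pm.set_data({"y": np.asarray(y)})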

@twiecki
Member

twiecki commented Apr 27, 2023

@pdb5627 Good points, but then isn't just overriding .fit() with the X, y call structure already possible?

@pdb5627
Contributor Author

pdb5627 commented Apr 28, 2023 via email

@twiecki
Member

twiecki commented Apr 28, 2023

@pdb5627 I think that's worth a try, having sklearn-compatibility would be a huge plus. Want to do a PR?

@michaelraczycki
Collaborator


I think it's actually very close to what we're already doing now. In the latest (as yet unreleased) stage of the model builder, the build_model function already does everything mentioned; the only difference I see is that we're not accepting separated predictors and observed data (since I assume that's what y is supposed to be here). With that in mind, if the data split is all that's needed, it seems like an obvious choice: all classes that inherit from model builder will need their own data preprocessing method anyway, so projects that lean more towards scikit-learn will follow the same convention, just in a slightly different way.

@pdb5627 I'm happy to collaborate on this if you'd like some assistance. I'd love to have a quick call with you to discuss it further.

@pdb5627
Contributor Author

pdb5627 commented Apr 30, 2023

@michaelraczycki Thanks for the offer of assistance. I put together a draft PR based on what I had in mind. Maybe you could take a look and see what you think, then if you want we can find a time to talk.

@theorashid
Contributor

Is it possible to bring back the model builder (or another class) that accepts a general number of variables rather than just X and y?

So having a SklearnEstimator (or BayesianEstimator) that inherits from ModelBuilder. Or whatever you want to call it; basically, bringing it back but with all the edits that have been made since.

@twiecki
Member

twiecki commented Nov 1, 2023

@theorashid Yes, I think we should have 2 classes. Is that something you could make a PR on?

@theorashid
Contributor

I could, but is there someone better placed who has been working on these classes for the past year? There's also an open PR #249 that I don't want to clash with.

If not, I would be happy to pick it up later in the month, with some direction on which parts of the old ModelBuilder/BayesianEstimator classes to bring back.

@twiecki
Member

twiecki commented Nov 1, 2023

@theorashid Let's try to get #249 merged to unblock this effort. It would be great to get your help on this. I think ModelBuilder should have the fit() API from before, and BayesianEstimator the sklearn-like API.
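
Roughly what I have in mind (just a sketch of the split; the names come from this thread, not from an existing implementation):

# Hypothetical skeleton, not an actual pymc-experimental API.
class ModelBuilder:
    def fit(self, data: dict, **kwargs):
        """General interface: arbitrary named inputs and observed data."""
        ...

class BayesianEstimator(ModelBuilder):
    def fit(self, X, y, **kwargs):
        """sklearn-like interface: separate X and y, returns self."""
        ...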
