
Changes show_models() function to return a dictionary of models in ensemble #1321


Merged: 8 commits merged into automl:development on Dec 25, 2021

Conversation

Contributor

@sagar-kaushik sagar-kaushik commented Nov 25, 2021

Summary

Currently the show_models() function returns a str that has to be manually parsed, with no way to access the models. I have changed it so that it returns a dictionary of the models in the ensemble along with their information. This helps fix issue #1298 and the issues mentioned inside that thread.

What's changed

Calling show_models() now returns a dictionary where the keys are model_ids. Each entry is a model dictionary which contains the following:

  • model_id
  • rank
  • ensemble_weight
  • data_preprocessor
  • balancing
  • feature_preprocessor
  • regressor or classifier (autosklearn wrapped model)
  • sklearn_model

Example

import sklearn.datasets
import sklearn.metrics
import sklearn.model_selection

import autosklearn.regression

X, y = sklearn.datasets.load_diabetes(return_X_y=True)

X_train, X_test, y_train, y_test = \
    sklearn.model_selection.train_test_split(X, y, random_state=1)

automl = autosklearn.regression.AutoSklearnRegressor(
    time_left_for_this_task=120,
    per_run_time_limit=30,
    tmp_folder='/tmp/autosklearn_regression_example_tmp',
)
automl.fit(X_train, y_train, dataset_name='diabetes')

ensemble_dict = automl.show_models()
print(ensemble_dict)

Output:

{
25: {'model_id': 25.0, 'rank': 1.0, 'ensemble_weight': 0.46, 'data_preprocessor': <autosklearn.pipeline.components.data_preprocessing.DataPreprocessorChoice object at 0x7ff2c06588d0>, 'feature_preprocessor': <autosklearn.pipeline.components.feature_preprocessing.FeaturePreprocessorChoice object at 0x7ff2c057bd50>, 'regressor': <autosklearn.pipeline.components.regression.RegressorChoice object at 0x7ff2c057ba90>, 'sklearn_model': SGDRegressor(alpha=0.0006517033225329654, epsilon=0.012150149892783745,
             eta0=0.016444224834275295, l1_ratio=1.7462342366289323e-09,
             loss='epsilon_insensitive', max_iter=16, penalty='elasticnet',
             power_t=0.21521743568582094, random_state=1,
             tol=0.002431731981071206, warm_start=True)}, 
6: {'model_id': 6.0, 'rank': 2.0, 'ensemble_weight': 0.32, 'data_preprocessor': <autosklearn.pipeline.components.data_preprocessing.DataPreprocessorChoice object at 0x7ff2c05b3f50>, 'feature_preprocessor': <autosklearn.pipeline.components.feature_preprocessing.FeaturePreprocessorChoice object at 0x7ff2c065c990>, 'regressor': <autosklearn.pipeline.components.regression.RegressorChoice object at 0x7ff2c057ba10>, 'sklearn_model': ARDRegression(alpha_1=0.0003701926442639788, alpha_2=2.2118001735899097e-07,
              copy_X=False, lambda_1=1.2037591637980971e-06,
              lambda_2=4.358378124977852e-09,
              threshold_lambda=1136.5286041327277, tol=0.021944240404849075)}, ....
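
For example (a small follow-on sketch, not part of the PR itself, using only the keys shown above), the top-ranked model and its fitted sklearn estimator can be pulled out of the returned dictionary like this:

best = min(ensemble_dict.values(), key=lambda m: m['rank'])  # rank 1 is the best model
print(best['model_id'], best['ensemble_weight'])
print(best['sklearn_model'])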

@sagar-kaushik sagar-kaushik changed the title Change show_models() function to return a dictionary of models in ensemble Changes show_models() function to return a dictionary of models in ensemble Nov 25, 2021
@codecov

codecov bot commented Nov 25, 2021

Codecov Report

Merging #1321 (827dc81) into development (f22c986) will decrease coverage by 0.01%.
The diff coverage is 98.07%.

Impacted file tree graph

@@               Coverage Diff               @@
##           development    #1321      +/-   ##
===============================================
- Coverage        88.06%   88.05%   -0.02%     
===============================================
  Files              140      140              
  Lines            11163    10993     -170     
===============================================
- Hits              9831     9680     -151     
+ Misses            1332     1313      -19     
Impacted Files Coverage Δ
autosklearn/automl.py 88.05% <97.56%> (+0.95%) ⬆️
autosklearn/estimators.py 93.77% <100.00%> (+0.35%) ⬆️
autosklearn/util/single_thread_client.py 90.90% <0.00%> (-2.28%) ⬇️
...tosklearn/metalearning/metafeatures/metafeature.py 61.00% <0.00%> (-1.14%) ⬇️
autosklearn/evaluation/abstract_evaluator.py 92.27% <0.00%> (-0.78%) ⬇️
autosklearn/pipeline/components/regression/mlp.py 95.31% <0.00%> (-0.53%) ⬇️
autosklearn/metrics/__init__.py 90.97% <0.00%> (-0.31%) ⬇️
autosklearn/pipeline/components/base.py 78.78% <0.00%> (-0.26%) ⬇️
...sklearn/pipeline/components/regression/__init__.py 83.52% <0.00%> (-0.20%) ⬇️
... and 45 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f22c986...827dc81. Read the comment docs.

table.sort_values(by='cost', inplace=True)
table['rank']=range(1,len(table)+1)

for (_, model_id, _), model in self.models_.items():
Contributor

self.models_ may be empty if a user specifies cross-validation, in which case self.cv_models_ is set. We want to unify them into one thing, but at the moment that doesn't happen.

self.cv_models_

For this a test probably needs to be created, I will elaborate on creating a test in a moment.

Contributor Author

@sagar-kaushik sagar-kaushik Nov 27, 2021


Did not know that. So do you want me to put some condition to check for this and use either self.models_ or self.cv_models_ inside the loop, depending on whichever is non-empty, or should I just create the test for now?

Contributor


Yeah, you can either check if the resampling_strategy is cv, or, more usefully, just check which one is non-empty.
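
For illustration, the "check which one is non-empty" approach might look roughly like this (a hedged sketch, not the code that was eventually merged):

# Fall back to cv_models_ when models_ is empty (i.e. when cross-validation was used)
models = self.models_ if self.models_ else self.cv_models_
for (_, model_id, _), model in models.items():
    ...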

model_dict[step[0]]=step[1]

#Adding sklearn model to the model dictionary
model_dict['sklearn_model']=model.steps[-1][1].choice.estimator
Contributor


This is sort of duplicated from including the step above, but I agree with your choice here: it lets a user specifically get either the auto-sklearn wrapped version or the sklearn estimator, which I suspect is of more relevance to people. No change needed, just commenting that I agree with it.

However I might still explicitly split out the tuple

model_type, autosklearn_wrapped_model = model.steps[-1]
model_dict['sklearn_model'] = autosklearn_wrapped_model.choice.estimator

If the automatic checks give you a hard time with model_type not being used, you can do the following

autosklearn_wrapped_model = model.steps[-1][1]

@eddiebergman
Contributor

eddiebergman commented Nov 26, 2021

All in all, this looks very useful, and thanks for contributing! I apologize for the lengthy comments, but since it's your first time contributing, I just want to give as much info as possible. Contributing to any other large repo will likely be quite similar, and hopefully, if you'd like to contribute any more nice features, you'll be well aware of the steps!

This part of the contribution guide will be helpful for the next steps.

Documentation

The function as it was before your fix wasn't very useful, so the documentation wasn't well written. Now that it is useful, it would be good to add documentation so that users will actually want to use it. Check the linked section for how to build and view it locally.

Code Style

We follow a code style so that the code looks mostly the same and is easier to read. You can see the failures in the Checks tab at the top; here's a link to the failed test. Check the linked section for how to run it locally to save time.

Testing

As mentioned in one comment, I'm not sure this will work if a user uses cv as their resampling_strategy. This is why we need to have some tests somewhere.

I would recommend mostly copying the scaffold of this test with an added parameterization. The scaffold for it would look something like:

@pytest.mark.parametrize('estimator', [AutoSklearnClassifier])
@pytest.mark.parametrize('resampling_strategy', ['holdout'])
@pytest.mark.parametrize('X', [
    np.asarray([[1, 1, 1]] * 50 + [[2, 2, 2]] * 50)
])
@pytest.mark.parametrize('y', [
    np.asarray([1] * 50 + [2] * 50)
])
def test_show_models_with_holdout(
    tmp_dir: str,
    dask_client: dask.distributed.Client,
    estimator: AutoSklearnEstimator,
    resampling_strategy: str,
    X: np.ndarray,
    y: np.ndarray
) -> None:
    """
    Parameters
    ----------
    tmp_dir: str
        The temporary directory to use for this test
        
    dask_client: dask.distributed.Client
         The dask client to use for this test
         
    estimator: AutoSklearnEstimator
         The estimator to train
         
    resampling_strategy: str
         The resampling strategy to use
    
    X: np.ndarray
        The X data to use for this estimator
        
    y: np.ndarray
         The targets to use for this estimator
 
    Expects
    -------
    * Something you expect to be true
    * Something else you expect to be true
    """
    automl = estimator(
        time_left_for_this_task=60, # Run a little longer for CV
        per_run_time_limit=5,
        tmp_folder=tmp_dir,
        resampling_strategy=resampling_strategy,
        dask_client=dask_client,
    )
    automl.fit(X, y)

    output = automl.show_models()
    
    # assert all the keys we expect are in there and that they have values
    # assert that for AutoSklearnRegressor we have a "regressor" entry, and for AutoSklearnClassifier a "classifier" entry
    # assert whatever else you think should be checked
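
For reference, one way those assertions might look, using a set comparison over the keys documented in the PR description (a hypothetical sketch; whether a "regressor" or "classifier" entry appears depends on the estimator class):

    # Keys every entry should have, regardless of task type
    expected = {'model_id', 'rank', 'ensemble_weight', 'data_preprocessor',
                'feature_preprocessor', 'sklearn_model'}
    for model_id, model in output.items():
        assert expected <= model.keys()
        assert model['model_id'] == model_id
        assert 'classifier' in model or 'regressor' in model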

@sagar-kaushik
Contributor Author

sagar-kaushik commented Nov 26, 2021

Thank you so much @eddiebergman for considering that I am new at this and giving me all the details. I will make all the changes you have mentioned and get back to you asap.

@sagar-kaushik
Contributor Author

@eddiebergman I have made all the changes that you suggested, except for implementing the test. I will do that in a few days. Please check if the rest of the things seem in order to you.

@sagar-kaushik
Contributor Author

sagar-kaushik commented Nov 29, 2021

Hello @eddiebergman! So I checked what happens if we use 'cv' as the resampling_strategy. It works fine except that the sklearn model shows 'None'. What can be done here? And do you think a test is still necessary for this condition?

Also, I have added a condition to check if table_dict is empty, which means no model was found inside the ensemble. I faced this issue when I reduced 'time_left_for_this_task' to 30 (too low, I know). So if the ensemble contains nothing, a ValueError is raised with this message:

ValueError: No model found. Try increasing 'time_left_for_this_task'.

Is that alright?

@eddiebergman
Contributor

eddiebergman commented Nov 29, 2021

Hi @userfindingself,

So it seems weird that the sklearn_model shows None when using cv; it definitely exists somewhere, as we predict with it, but it needs to be found. Once you push your latest changes I can have a look.

Hmmm, there should always be at least one model. In the case that there was not enough time to find a good configuration, we use sklearn's Dummy<X>, where X is Classifier or Regressor.

For extra background info, we have two ensemble classes Ensemble found in ensemble_selection.py and SingleBest in single_best.py.

Regarding table_dict, I do agree that you should leave the check in, and I think an error is what should be raised, so that's a nice idea. I would perhaps change it to a RuntimeError, as a ValueError is usually based on parameters (the user's fault), whereas a RuntimeError is usually something that is the library's fault.

ValueError: Raised when a function gets an argument of correct type but improper value.

RuntimeError: Raised when an error is detected that doesn’t fall in any of the other categories. The associated value is a string indicating what precisely went wrong.
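
Concretely, the check could stay as it is with just the exception class swapped; roughly (a sketch reusing the table_dict name and message from the discussion above):

if not table_dict:
    raise RuntimeError("No model found. Try increasing 'time_left_for_this_task'.")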

You can push your latest changes and I can try it out and point you in the right direction from there! :)

@sagar-kaushik
Contributor Author

sagar-kaushik commented Nov 29, 2021

Yeah, it's pretty weird because the rest of the model steps were there. The 'autosklearn wrapped model' doesn't give any 'sklearn model' with .choice.estimator if the cross-validation strategy is used.

Not sure how the dictionary was empty then, maybe you can try to recreate it?

And thanks! Changed it to RuntimeError. I haven't changed anything else, but I have still pushed the latest code.

@eddiebergman
Contributor

Thanks, I will take a look tomorrow :)

@sagar-kaushik
Contributor Author

Hey @eddiebergman, so what do you think we should do about the cv resampling strategy case?

@eddiebergman
Contributor

Hi @userfindingself,

Sorry for the delay, busy week. I will take a look now :)

@eddiebergman
Contributor

eddiebergman commented Dec 3, 2021

So I was playing around with it and we now need to make a decision about how to best return results for a cv trained model.

In short, we can't keep a unified dict because there is simply more that needs to be returned for cv models if we want to be fully transparent.


For some background, when trained with resampling_strategy = "cv", the models returned by cv_models_.values() are VotingClassifier and VotingRegressor instances, which contain the configurations. These ids and hyperparameters are exactly the same as the ones you'll see in models_.items(), but those appear to be untrained, whereas the cv_models_ contain the trained pipelines.

print(model.automl_.cv_models_)

{(1, 8, 0.0): VotingRegressor(estimators=None),
 (1, 11, 0.0): VotingRegressor(estimators=None),
 (1, 15, 0.0): VotingRegressor(estimators=None),
 (1, 14, 0.0): VotingRegressor(estimators=None),
 (1, 5, 0.0): VotingRegressor(estimators=None)}

You'll notice that the estimators parameter is None. That's because we slightly abuse the sklearn VotingX classes and set their private variable estimators_, with the trailing _. This is because the VotingX classes don't accept pretrained models by default so we manually set them somewhere.
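
As a toy illustration of that trick (the DummyRegressor stand-ins here are purely for demonstration and are not what auto-sklearn actually stores):

import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import VotingRegressor

X, y = np.random.rand(20, 3), np.random.rand(20)

# Fit some models separately, then attach them to an otherwise unfitted VotingRegressor
pretrained = [DummyRegressor().fit(X, y) for _ in range(3)]
voter = VotingRegressor(estimators=None)  # repr shows estimators=None, as in the output above
voter.estimators_ = pretrained            # set the fitted attribute directly
print(voter.predict(X)[:3])               # predict() now averages over the attached models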

So if we look at one of them in particular:

cv_model_zero = list(model.automl_.cv_models_.values())[0]
# VotingRegressor(estimators=None)

Printing out their estimators and viewing their hyperparameters, you can see they're all identical (scroll right for some pleasant text alignment).

cv_model_zero.estimators_
[SimpleRegressionPipeline({'data_preprocessor:__choice__': 'feature_type', 'feature_preprocessor:__choice__': 'feature_agglomeration', 'regressor:__choice__': 'gaussian_process', 'data_preprocessor:feature_type:categorical_transformer:categorical_encoding:__choice__': 'one_hot_encoding', 'data_preprocessor:feature_type:categorical_transformer:category_coalescence:__choice__': 'no_coalescense', 'data_preprocessor:feature_type:numerical_transformer:imputation:strategy': 'mean', 'data_preprocessor:feature_type:numerical_transformer:rescaling:__choice__': 'quantile_transformer', 'feature_preprocessor:feature_agglomeration:affinity': 'euclidean', 'feature_preprocessor:feature_agglomeration:linkage': 'average', 'feature_preprocessor:feature_agglomeration:n_clusters': 107, 'feature_preprocessor:feature_agglomeration:pooling_func': 'median', 'regressor:gaussian_process:alpha': 0.42928092501196696, 'regressor:gaussian_process:thetaL': 1.4435895787652725e-07, 'regressor:gaussian_process:thetaU': 8.108685026706572, 'data_preprocessor:feature_type:numerical_transformer:rescaling:quantile_transformer:n_quantiles': 268, 'data_preprocessor:feature_type:numerical_transformer:rescaling:quantile_transformer:output_distribution': 'uniform'},
 dataset_properties={
   'task': 4,
   'sparse': False,
   'multioutput': False,
   'target_type': 'regression',
   'signed': False}),
 SimpleRegressionPipeline({'data_preprocessor:__choice__': 'feature_type', 'feature_preprocessor:__choice__': 'feature_agglomeration', 'regressor:__choice__': 'gaussian_process', 'data_preprocessor:feature_type:categorical_transformer:categorical_encoding:__choice__': 'one_hot_encoding', 'data_preprocessor:feature_type:categorical_transformer:category_coalescence:__choice__': 'no_coalescense', 'data_preprocessor:feature_type:numerical_transformer:imputation:strategy': 'mean', 'data_preprocessor:feature_type:numerical_transformer:rescaling:__choice__': 'quantile_transformer', 'feature_preprocessor:feature_agglomeration:affinity': 'euclidean', 'feature_preprocessor:feature_agglomeration:linkage': 'average', 'feature_preprocessor:feature_agglomeration:n_clusters': 107, 'feature_preprocessor:feature_agglomeration:pooling_func': 'median', 'regressor:gaussian_process:alpha': 0.42928092501196696, 'regressor:gaussian_process:thetaL': 1.4435895787652725e-07, 'regressor:gaussian_process:thetaU': 8.108685026706572, 'data_preprocessor:feature_type:numerical_transformer:rescaling:quantile_transformer:n_quantiles': 268, 'data_preprocessor:feature_type:numerical_transformer:rescaling:quantile_transformer:output_distribution': 'uniform'},
 dataset_properties={
   'task': 4,
   'sparse': False,
   'multioutput': False,
   'target_type': 'regression',
   'signed': False}),
 SimpleRegressionPipeline({'data_preprocessor:__choice__': 'feature_type', 'feature_preprocessor:__choice__': 'feature_agglomeration', 'regressor:__choice__': 'gaussian_process', 'data_preprocessor:feature_type:categorical_transformer:categorical_encoding:__choice__': 'one_hot_encoding', 'data_preprocessor:feature_type:categorical_transformer:category_coalescence:__choice__': 'no_coalescense', 'data_preprocessor:feature_type:numerical_transformer:imputation:strategy': 'mean', 'data_preprocessor:feature_type:numerical_transformer:rescaling:__choice__': 'quantile_transformer', 'feature_preprocessor:feature_agglomeration:affinity': 'euclidean', 'feature_preprocessor:feature_agglomeration:linkage': 'average', 'feature_preprocessor:feature_agglomeration:n_clusters': 107, 'feature_preprocessor:feature_agglomeration:pooling_func': 'median', 'regressor:gaussian_process:alpha': 0.42928092501196696, 'regressor:gaussian_process:thetaL': 1.4435895787652725e-07, 'regressor:gaussian_process:thetaU': 8.108685026706572, 'data_preprocessor:feature_type:numerical_transformer:rescaling:quantile_transformer:n_quantiles': 268, 'data_preprocessor:feature_type:numerical_transformer:rescaling:quantile_transformer:output_distribution': 'uniform'},
 dataset_properties={
   'task': 4,
   'sparse': False,
   'multioutput': False,
   'target_type': 'regression',
   'signed': False}),
 SimpleRegressionPipeline({'data_preprocessor:__choice__': 'feature_type', 'feature_preprocessor:__choice__': 'feature_agglomeration', 'regressor:__choice__': 'gaussian_process', 'data_preprocessor:feature_type:categorical_transformer:categorical_encoding:__choice__': 'one_hot_encoding', 'data_preprocessor:feature_type:categorical_transformer:category_coalescence:__choice__': 'no_coalescense', 'data_preprocessor:feature_type:numerical_transformer:imputation:strategy': 'mean', 'data_preprocessor:feature_type:numerical_transformer:rescaling:__choice__': 'quantile_transformer', 'feature_preprocessor:feature_agglomeration:affinity': 'euclidean', 'feature_preprocessor:feature_agglomeration:linkage': 'average', 'feature_preprocessor:feature_agglomeration:n_clusters': 107, 'feature_preprocessor:feature_agglomeration:pooling_func': 'median', 'regressor:gaussian_process:alpha': 0.42928092501196696, 'regressor:gaussian_process:thetaL': 1.4435895787652725e-07, 'regressor:gaussian_process:thetaU': 8.108685026706572, 'data_preprocessor:feature_type:numerical_transformer:rescaling:quantile_transformer:n_quantiles': 268, 'data_preprocessor:feature_type:numerical_transformer:rescaling:quantile_transformer:output_distribution': 'uniform'},
 dataset_properties={
   'task': 4,
   'sparse': False,
   'multioutput': False,
   'target_type': 'regression',
   'signed': False}),
 SimpleRegressionPipeline({'data_preprocessor:__choice__': 'feature_type', 'feature_preprocessor:__choice__': 'feature_agglomeration', 'regressor:__choice__': 'gaussian_process', 'data_preprocessor:feature_type:categorical_transformer:categorical_encoding:__choice__': 'one_hot_encoding', 'data_preprocessor:feature_type:categorical_transformer:category_coalescence:__choice__': 'no_coalescense', 'data_preprocessor:feature_type:numerical_transformer:imputation:strategy': 'mean', 'data_preprocessor:feature_type:numerical_transformer:rescaling:__choice__': 'quantile_transformer', 'feature_preprocessor:feature_agglomeration:affinity': 'euclidean', 'feature_preprocessor:feature_agglomeration:linkage': 'average', 'feature_preprocessor:feature_agglomeration:n_clusters': 107, 'feature_preprocessor:feature_agglomeration:pooling_func': 'median', 'regressor:gaussian_process:alpha': 0.42928092501196696, 'regressor:gaussian_process:thetaL': 1.4435895787652725e-07, 'regressor:gaussian_process:thetaU': 8.108685026706572, 'data_preprocessor:feature_type:numerical_transformer:rescaling:quantile_transformer:n_quantiles': 268, 'data_preprocessor:feature_type:numerical_transformer:rescaling:quantile_transformer:output_distribution': 'uniform'},
 dataset_properties={
   'task': 4,
   'sparse': False,
   'multioutput': False,
   'target_type': 'regression',
   'signed': False})]

The problem here is that each pipeline step will actually be trained slightly differently, as each pipeline here has the same hyperparameters but is trained on separate folds of the data. Hence, they will produce slightly different output given the same input, which one do we return to the user?

To keep roughly the same interface as for when holdout is used, we'll probably need to complicate things slightly as there is simply just more to include for cv_models_.

I would propose:

{
    <model_id> : {
        "model_id": <model_id>,
        "rank": <rank>,
        ...
        "estimators": [
            {
                "data_preprocessor": <>,
                "feature_preprocessor": <>,
                "classifier"/"regressor": <>,
                "sklearn_model": <>
            }
        ]
    }
}

len(models[id]["estimators"]) == len(cv_model_zero.estimators_)

I would leave the implementation and testing with you, if you like, so you have full control over how it's done :)

Just a brief snippet to guide things:

is_cv = self.resampling_strategy == "cv"
models = self.cv_models_ if is_cv else self.models_
for (_, model_id, _), model in models.items():
    
    ... # Same for setting previous steps
    
    if is_cv:
        ... # populate the "estimators" value
    else:
        ... # as before    

@eddiebergman
Contributor

Also a brief thing I noticed: it appears the "model_id" value in the dict is a float; we probably want to convert this to an int.

@sagar-kaushik
Contributor Author

Also a brief thing I noticed: it appears the "model_id" value in the dict is a float; we probably want to convert this to an int.

Yeah, I tried doing a simple type conversion to make it an int but that didn't work. I will look into it again.

I will need to delay the 'cv' resampling a bit because I am also busy for the next few days.

Thanks for explaining how exactly the pipelines are stored. :)

@sagar-kaushik
Contributor Author

Hello @eddiebergman! I have changed the data type of "model_id" from float to int.

Can you tell me what you meant here:

Hence, they will produce slightly different output given the same input, which one do we return to the user?

Do you think we should select any model?

Also, in your proposal of the dictionary for cv models, the models are inside a dict which is the value of the "estimators" key. Should I do the same for the 'holdout' strategy, for uniformity? Thanks!

@eddiebergman
Contributor

eddiebergman commented Dec 12, 2021

Can you tell me what you meant here:

Hence, they will produce slightly different output given the same input, which one do we return to the user?

  • Suppose you have two pipelines with the same hyperparameters and setup. If you train both pipelines with different data, they will produce different output given the same input. Hence, we would need to return both pipelines to be fully transparent. The difference here is that we have one pipeline per cv fold.

Do you think we should select any model?

If by any you mean should we even return a model, definitely. I also think we should return the VotingX model created by cross validation.

I'm not really sure it makes sense for uniformity purposes. Either way, the end user of this dict will have to be aware of the difference, as using 'holdout' would lead them through the unintuitive ["estimators"][0]["sklearn_model"] step in show_models()[id]["estimators"][0]["sklearn_model"] just to check a holdout model.

To be complete, here is my suggested setup, but feel free to suggest your own. In general, I don't think there's a way to have uniform access given the choice of "classifier"/"regressor" and "cv"/"holdout". Keeping a contract of uniformity would also make this difficult to update in the future if we were to implement something such as stacking, for example. This is an example of where clear documentation of what is to be expected is crucial.

# "holdout", as it was before
{
    <model_id> : {
        "model_id": <>,
        "rank": <>,
        "ensemble_weight": <>,
        "data_preprocessor": <>,
        "feature_preprocessor": <>,
        "classifier"/"regressor": <>,
        "sklearn_model": <>,
    }
}

# "cv", with the added "voting_model" key-val
{
    <model_id> : {
        "model_id": <>,
        "rank": <>,
        "ensemble_weight": <>,
        "voting_model": <VotingX, the model full of each pipeline in "estimators">
        "estimators": [
            {
                "data_preprocessor": <>,
                "feature_preprocessor": <>,
                "classifier"/"regressor": <>,
                "sklearn_model": <>
            }
        ]
    }
}
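
A hypothetical sketch of how the cv branch could fill in that structure, reusing the step iteration from the holdout case (illustrative names only, not the merged implementation):

if is_cv:
    # 'model' here is the VotingRegressor/VotingClassifier holding one pipeline per fold
    model_dict['voting_model'] = model
    estimators = []
    for fold_pipeline in model.estimators_:
        estimator_dict = {}
        for step_name, step in fold_pipeline.steps:
            estimator_dict[step_name] = step
        # as in the holdout case, the final step wraps the actual sklearn estimator
        estimator_dict['sklearn_model'] = fold_pipeline.steps[-1][1].choice.estimator
        estimators.append(estimator_dict)
    model_dict['estimators'] = estimators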

@sagar-kaushik
Contributor Author

The difference here is that we have one pipeline per cv fold.

Yes, so should we return all the pipelines? If we have say 'n' cv folds, do we need to return 'n' pipelines?

If by any you mean should we even return a model, definitely.

I wanted to know which model should be returned because as you have mentioned, there's one for each cv fold. I should have been more specific, sorry.

To be more clear, model.automl_.cv_models_ contains trained pipelines.
Using your example,

cv_model_zero = list(model.automl_.cv_models_.values())[0]
cv_model_zero.estimators_

would give many models, each trained on a different cv fold. My question is the same as what you had originally written:

which one do we return to the user?

I think your suggested implementation is perfectly fine. :)
I will start working on it as soon as I understand which model is to be returned.

@eddiebergman
Contributor

Apologies, I'm not entirely sure where the confusion is: "estimators" in the dictionary above is a list, with one entry for each model in the VotingX. If there's still something that's not clear, can you copy the dict format above and indicate which part needs clarification?

@sagar-kaushik
Contributor Author

Oh I am extremely sorry, I just didn't notice that a list is being returned. I will code it now, thank you!

@sagar-kaushik
Contributor Author

Hello! I am getting the following error when I try to train the AutoSklearnRegressor with the X and y arrays you provided:

ValueError: AutoMLRegressor does not support task binary

What is the issue? And is it okay if I use any other dataset for this task?

@eddiebergman
Contributor

Yeah, that's fine, use a different y value. Essentially, if there are only two numeric values, it's auto-detected as a binary classification task.

@sagar-kaushik
Contributor Author

Hey @eddiebergman. I have created the two tests. I have done this for the first time, so please tell me if it looks alright to you.

Contributor

@eddiebergman eddiebergman left a comment


Generally good; the main change is just to do a set comparison rather than a list comparison.

@sagar-kaushik
Contributor Author

I have made the changes. Thanks for being helpful! :)

@eddiebergman
Contributor

Just waiting on the tests to finish. @mfeurer, would you like to review this before merging? I'm happy with it as it is :)

Contributor

@mfeurer mfeurer left a comment


Thank you very much for the contribution. This looks very nice, but I think you also need to update the examples where they describe the use of show_models. I just had a brief look and you need to update:

@sagar-kaushik
Contributor Author

Hello @mfeurer and @eddiebergman! Merry Christmas!

Sorry, I was a bit busy for the last few days, but I have made changes to the function description in the examples. I noticed that print(automl.show_models()) doesn't print very nicely; the formatting is messed up a bit because it is a dict, I think. Should I print each entry iteratively instead?

@eddiebergman
Contributor

There's a builtin Python solution for pretty printing things. I would go with this, and then I think we're happy with the PR :)

from pprint import pprint
pprint(automl.show_models(), indent=4)

However, dictionaries give no guarantee on the order of things, but that's fine; it's just for demonstration purposes, and get_models_with_weights() can be used for getting a high-level look.

@sagar-kaushik
Contributor Author

Great! I have used pretty printing for all the occurrences of show_models() in examples. Thank you so much for helping me along the way with my first open source contribution, I really learnt a lot. :)

@eddiebergman eddiebergman merged commit 84cabf0 into automl:development Dec 25, 2021
@eddiebergman
Contributor

Hi @userfindingself,

All looks good to me, I've merged it with development and it'll be available in the next release 0.15.0. We'll make sure to add you into the release notes as well :)

I expect this release will be around January. If you look at our PRs which could affect performance, we need to make sure none of them change performance too negatively, so we're doing some extensive testing with automlbenchmark, hence the delay.

Thanks for your contribution! Please feel free to contribute again if you ever feel like it. I'm happy to help along and it would be a lot smoother now that you know how it works for our setup :)

Happy Holidays ☃️

P.S. If you have time, we would also appreciate a little feedback on how contributing was for you: what was good, what was bad, what resources were we lacking to help you, or if anything was frustrating about the process. Any thoughts you think might help us improve both your experience and encourage new-to-open-source contributors. My email is on my profile if you have the time and would like to share :)

github-actions bot pushed a commit that referenced this pull request Dec 25, 2021
@sagar-kaushik
Contributor Author

Yes, I would love to contribute more. Currently I am preparing for job interviews and as soon as I am done with those, I will have some leeway to start contributing again. Thank you for all your help!

And yes I will surely connect with you for the feedback. But honestly, I don't think any improvement is needed/possible. :)

Happy Holidays!

@sagar-kaushik sagar-kaushik deleted the my_branch branch December 25, 2021 16:12
@eddiebergman eddiebergman linked an issue Jan 2, 2022 that may be closed by this pull request
@eddiebergman eddiebergman mentioned this pull request Jan 24, 2022
eddiebergman pushed a commit that referenced this pull request Jan 25, 2022
…semble (#1321)

* Changed show_models() function to return a dictionary of models in the ensemble instead of a string
@eddiebergman eddiebergman mentioned this pull request Jan 25, 2022
eddiebergman pushed a commit that referenced this pull request Aug 18, 2022
…semble (#1321)

* Changed show_models() function to return a dictionary of models in the ensemble instead of a string
Successfully merging this pull request may close these issues.

Improve user method of seeing pipelines generated