Changes show_models() function to return a dictionary of models in ensemble #1321
Conversation
Changed show_models() function to return a dictionary of models in the ensemble instead of a string
Codecov Report
```
@@            Coverage Diff             @@
##           development    #1321      +/-   ##
===============================================
- Coverage        88.06%   88.05%   -0.02%
===============================================
  Files              140      140
  Lines            11163    10993     -170
===============================================
- Hits              9831     9680     -151
+ Misses            1332     1313      -19
```
autosklearn/automl.py
Outdated
```python
table.sort_values(by='cost', inplace=True)
table['rank'] = range(1, len(table) + 1)

for (_, model_id, _), model in self.models_.items():
```
`self.models_` may be empty if a user specifies using cross-validation, in which case `self.cv_models_` is set. We want to unify them into one thing, but at the moment that doesn't happen.

For this, a test probably needs to be created; I will elaborate on creating a test in a moment.
Did not know that. So do you want me to put some condition to check for this and use either `self.models_` or `self.cv_models_` inside the loop, depending on whichever is non-empty, or should I just create the test for now?
Yeah, you can either check if the `resampling_strategy` is `"cv"`, or, probably more useful, just check which one is empty.
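A minimal sketch of that second option, assuming the two attributes named above (placement inside `show_models()` is hypothetical):

```python
# Use whichever store is populated; self.cv_models_ is only set when the
# user trained with cross-validation.
models = self.models_ if self.models_ else self.cv_models_
for (_, model_id, _), model in models.items():
    ...
```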
autosklearn/automl.py
Outdated
```python
model_dict[step[0]] = step[1]

# Adding sklearn model to the model dictionary
model_dict['sklearn_model'] = model.steps[-1][1].choice.estimator
```
This is sort of duplicated from including the step above, but I agree with your choice here: it lets a user specifically get the auto-sklearn wrapped version or the sklearn estimator, which I suspect is of more relevance to people. No change needed, just commenting that I agree with it.

However, I might still explicitly split out the tuple:

```python
model_type, autosklearn_wrapped_model = model.steps[-1]
model_dict['sklearn_model'] = autosklearn_wrapped_model.choice.estimator
```

If the automatic checks give you a hard time with `model_type` not being used, you can do the following:

```python
autosklearn_wrapped_model = model.steps[-1][1]
```
All in all, this looks very useful; thanks for contributing! I apologize for the lengthy comments, but since it's your first time contributing, I just want to give as much info as possible. Contributing to any other large repo will likely be quite similar and hopefully, if you'd like to contribute any more nice features, you'll be well aware of the steps! This part of the contribution guide will be helpful for the next steps.

Documentation

The function as it used to be before you fixed it wasn't very good, so the documentation wasn't well written. Now that it seems useful, it'll be good to add documentation so that users will actually want to use it. Check the linked section for how to build and view it locally.

Code Style

We follow a code style so that the code looks mostly the same and is easier to read. You can see the details when you check the linked section of the contribution guide.

Testing

As mentioned in one comment, I'm not sure this will work if a user uses cross-validation. I would recommend mostly copying the scaffold of this test with an added parameterization. The scaffold for it would look something like:

```python
@pytest.mark.parametrize('estimator', [AutoSklearnClassifier])
@pytest.mark.parametrize('resampling_strategy', ['holdout'])
@pytest.mark.parametrize('X', [
    np.asarray([[1, 1, 1]] * 50 + [[2, 2, 2]] * 50)
])
@pytest.mark.parametrize('y', [
    np.asarray([1] * 50 + [2] * 50)
])
def test_show_models_with_holdout(
    tmp_dir: str,
    dask_client: dask.distributed.Client,
    estimator: AutoSklearnEstimator,
    resampling_strategy: str,
    X: np.ndarray,
    y: np.ndarray
) -> None:
    """
    Parameters
    ----------
    tmp_dir: str
        The temporary directory to use for this test

    dask_client: dask.distributed.Client
        The dask client to use for this test

    estimator: AutoSklearnEstimator
        The estimator to train

    resampling_strategy: str
        The resampling strategy to use

    X: np.ndarray
        The X data to use for this estimator

    y: np.ndarray
        The targets to use for this estimator

    Expects
    -------
    * Something you expect to be true
    * Something else you expect to be true
    """
    automl = estimator(
        time_left_for_this_task=60,  # Run a little longer for CV
        per_run_time_limit=5,
        tmp_folder=tmp_dir,
        resampling_strategy=resampling_strategy,
        dask_client=dask_client,
    )
    automl.fit(X, y)

    output = automl.show_models()

    # assert all the keys we expect are in there and that they have values
    # assert when estimator == AutoSklearnRegressor, we have a "regressor" entry and for classifier
    # assert whatever else you think should be checked
```
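A concrete version of those assertion placeholders might look like the following (key names taken from the dict layout discussed in this thread; illustrative only):

```python
expected_keys = {"model_id", "rank", "ensemble_weight", "sklearn_model"}
for model_id, model_dict in output.items():
    # every entry should carry the expected keys, with a matching id
    assert expected_keys <= set(model_dict.keys())
    assert model_dict["model_id"] == model_id
```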
Thank you so much @eddiebergman for considering that I am new at this and giving me all the details. I will make all the changes you have mentioned and get back to you asap.
@eddiebergman I have made all the changes that you had suggested, except for implementing the test. I will do that in a few days. Please check if the rest of the things seem in order to you.
Hello @eddiebergman! So I checked what happens if we use `cv`. Also, I have added a condition to check if `self.models_` is empty and to use `self.cv_models_` in that case.

Is that alright?
Hi @userfindingself,

So it seems weird that the dictionary was empty; hmmm, there should always be at least one model. In the case that there was not enough time to find a good configuration, we use a dummy model.

For extra background info, we have two ensemble classes.

You can push your latest changes and I can try it out and point you in the right direction from there! :)
Yeah, it's pretty weird because the rest of it works fine. Not sure how the dictionary was empty then, maybe you can try to recreate it? And thanks! I have changed it as suggested.
Thanks, I will take a look tomorrow :)
Hey @eddiebergman, so what do you think we should do about `self.cv_models_`?
Hi @userfindingself, Sorry for the delay, busy week. I will take a look now :)
So I was playing around with it and we now need to make a decision about how to best return results for a cv-trained model. In short, we can't keep a unified dict because there is just more difference for `cv`.

For some background, when trained with `cv`:

```python
print(model.automl_.cv_models_)
{(1, 8, 0.0): VotingRegressor(estimators=None),
 (1, 11, 0.0): VotingRegressor(estimators=None),
 (1, 15, 0.0): VotingRegressor(estimators=None),
 (1, 14, 0.0): VotingRegressor(estimators=None),
 (1, 5, 0.0): VotingRegressor(estimators=None)}
```

You'll notice the `estimators=None`. So if we look at one of them in particular:

```python
cv_model_zero = list(model.automl_.cv_models_.values())[0]
# VotingRegressor(estimators=None)
```

Printing out their estimators and viewing their hyperparameters, you can see they're all identical (scroll right for some pleasant text alignment).

```python
cv_model_zero.estimators_
[SimpleRegressionPipeline({'data_preprocessor:__choice__': 'feature_type', 'feature_preprocessor:__choice__': 'feature_agglomeration', 'regressor:__choice__': 'gaussian_process', 'data_preprocessor:feature_type:categorical_transformer:categorical_encoding:__choice__': 'one_hot_encoding', 'data_preprocessor:feature_type:categorical_transformer:category_coalescence:__choice__': 'no_coalescense', 'data_preprocessor:feature_type:numerical_transformer:imputation:strategy': 'mean', 'data_preprocessor:feature_type:numerical_transformer:rescaling:__choice__': 'quantile_transformer', 'feature_preprocessor:feature_agglomeration:affinity': 'euclidean', 'feature_preprocessor:feature_agglomeration:linkage': 'average', 'feature_preprocessor:feature_agglomeration:n_clusters': 107, 'feature_preprocessor:feature_agglomeration:pooling_func': 'median', 'regressor:gaussian_process:alpha': 0.42928092501196696, 'regressor:gaussian_process:thetaL': 1.4435895787652725e-07, 'regressor:gaussian_process:thetaU': 8.108685026706572, 'data_preprocessor:feature_type:numerical_transformer:rescaling:quantile_transformer:n_quantiles': 268, 'data_preprocessor:feature_type:numerical_transformer:rescaling:quantile_transformer:output_distribution': 'uniform'},
   dataset_properties={
     'task': 4,
     'sparse': False,
     'multioutput': False,
     'target_type': 'regression',
     'signed': False}),
 # ... four more identical SimpleRegressionPipeline entries, one per remaining fold
]
```

The problem here is that each pipeline will actually be trained slightly differently: each pipeline here has the same hyperparameters but is trained on separate folds of the data. Hence, they will produce slightly different output given the same input; which one do we return to the user?

To keep roughly the same interface as for when holdout is used, we'll probably need to complicate things slightly, as there is simply just more to include for `cv`. I would propose:

```
[
    <model_id> : {
        "model_id": <model_id>,
        "rank": <rank>,
        ...
        "estimators": [
            {
                "data_preprocessor": <>,
                "feature_preprocessor": <>,
                "classifier"/"regressor": <>,
                "sklearn_model": <>
            }
        ]
    }
]
```

Where `len(models[id]["estimators"]) == len(cv_model_zero.estimators_)`.

I would leave the implementation and testing with you, if you like, so you have full control over how it's done :) Just a brief snippet to guide things:

```python
is_cv = self.resampling_strategy == "cv"
models = self.cv_models_ if is_cv else self.models_

for (_, model_id, _), model in models.items():
    ...  # Same for setting previous steps
    if is_cv:
        ...  # populate the "estimators" value
    else:
        ...  # as before
```
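A slightly fuller sketch of that `if is_cv` branch, assuming the per-fold pipelines live in `model.estimators_` as shown above (names here are illustrative, not the final implementation):

```python
if is_cv:
    estimators = []
    for fold_pipeline in model.estimators_:  # one fitted pipeline per fold
        estimator_dict = {name: step for name, step in fold_pipeline.steps}
        # Unwrap the raw sklearn estimator from the auto-sklearn choice object
        estimator_dict["sklearn_model"] = fold_pipeline.steps[-1][1].choice.estimator
        estimators.append(estimator_dict)
    model_dict["estimators"] = estimators
```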
Also, a brief thing I noticed: it appears one of the returned values uses a numpy type rather than a native Python one.
Yeah, I tried doing a simple type conversion to fix that. I will need to delay the remaining work a little. Thanks for explaining how exactly the pipelines are stored. :)
Hello @eddiebergman! I have changed the data type. Can you tell me what you meant here:

Do you think we should select any model? Also, in your proposal of the dictionary for cv models, the models are inside a list.
I'm not really sure if it makes sense for uniformity purposes; either way, the end user using this dict will have to be aware of the difference when using `cv`. To be complete, here is my suggested setup, but feel free to suggest your own. In general, I don't think there's a way to have uniform access given the choice of `resampling_strategy`.

```
# "holdout", as it was before
[
    <model_id> : {
        "model_id": <>,
        "rank": <>,
        "ensemble_weight": <>,
        "data_preprocessor": <>,
        "feature_preprocessor": <>,
        "classifier"/"regressor": <>,
        "sklearn_model": <>,
    }
]

# "cv", with the added "voting_model" key-val
[
    <model_id> : {
        "model_id": <>,
        "rank": <>,
        "ensemble_weight": <>,
        "voting_model": <VotingX, the model full of each pipeline in "estimators">,
        "estimators": [
            {
                "data_preprocessor": <>,
                "feature_preprocessor": <>,
                "classifier"/"regressor": <>,
                "sklearn_model": <>
            }
        ]
    }
]
```
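Under that layout, end-user code would branch on the presence of "estimators"; a sketch assuming the keys proposed above (not part of the PR itself):

```python
models = automl.show_models()
for model_id, info in models.items():
    if "estimators" in info:  # trained with resampling_strategy="cv"
        sklearn_models = [est["sklearn_model"] for est in info["estimators"]]
    else:  # "holdout": a single pipeline per model
        sklearn_models = [info["sklearn_model"]]
    print(model_id, sklearn_models)
```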
Yes, so should we return all the pipelines? If we have, say, 5 folds, there would be 5 pipelines for each model.

I wanted to know which model should be returned because, as you have mentioned, there's one for each cv fold. I should have been more specific, sorry. To be more clear,

```python
cv_model_zero = list(model.automl_.cv_models_.values())[0]
cv_model_zero.estimators_
```

would give many models, each trained on a different cv fold. My question is the same as what you had originally written: "which one do we return to the user?"

I think your suggested implementation is perfectly fine. :)
Apologies, I'm not entirely sure where the confusion is; in the proposed setup, all of the pipelines are returned, since the "estimators" value is a list containing one entry per fold.
Oh I am extremely sorry, I just didn't notice that a list is being returned. I will code it now, thank you!
Hello! I am getting the following error when I try to train the AutoSklearnRegressor with the test data:

```
ValueError: AutoMLRegressor does not support task binary
```

What is the issue? And is it okay if I use any other dataset for this task?
Yeah, that's fine, use a different y value; essentially, if there are only two numeric values, it's autodetected as a binary classification task.
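For instance (an illustrative target, not from the PR):

```python
import numpy as np

# Three distinct numeric values, so the task is no longer detected as binary
y = np.asarray([1.0] * 34 + [2.0] * 33 + [3.0] * 33)
```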
Hey @eddiebergman. I have created the two tests. I have done this for the first time, so please tell me if they look alright to you.
Generally good; the main change is just to do a `set` comparison rather than a `list` comparison.
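For instance (model ids invented for illustration):

```python
# A list comparison is order-sensitive, so it can fail even when the
# right models are present:
assert list(output.keys()) == [2, 5, 7]

# A set comparison checks membership only, which is what the test cares about:
assert set(output.keys()) == {2, 5, 7}
```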
I have made the changes. Thanks for being helpful! :)
Just waiting on the tests to finish. @mfeurer, would you like to review this before merging? I'm happy with it as it is :)
Thank you very much for the contribution. This looks very nice, but I think you also need to update the examples where they describe the use of `show_models()`. I just had a brief look and you need to update:
Hello @mfeurer and @eddiebergman! Merry Christmas! Sorry, I was a bit busy for the last few days, but I have made changes to the function description in the examples. I noticed that the returned dictionary doesn't print very nicely in the examples; is there a recommended way to display it?
A builtin solution in Python for pretty printing things. I would go with this, and then I think we're happy with the PR :)

```python
from pprint import pprint

pprint(automl.show_models(), indent=4)
```

However, dictionaries give no guarantee on the order of things, but that's fine; it's just for demonstration purposes, and `get_models_with_weights` can be used for getting a high-level look.
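If a weight-ordered, high-level view is wanted instead (a possibility, not something this PR changes), `get_models_with_weights()` returns `(weight, model)` pairs:

```python
# Print the ensemble members from highest to lowest weight
for weight, model in automl.get_models_with_weights():
    print(f"{weight:.3f}  {model}")
```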
Great! I have used pretty printing for all the occurrences of `show_models()`.
Hi @userfindingself,

All looks good to me; I've merged it with development and it'll be available in the next release. I expect this release will be around January. If you look at our PRs which could affect performance, we need to make sure none of them change performance too negatively, so we're doing some extensive testing with automlbenchmark, hence the delay.

Thanks for your contribution! Please feel free to contribute again if you ever feel like it. I'm happy to help along and it would be a lot smoother now that you know how it works for our setup :)

Happy Holidays ☃️

P.S. If you have time, we would also appreciate a little feedback on how contributing was for you: what was good, what was bad, what resources were we lacking to help you, or if anything was frustrating about the process. Any thoughts you think might help us improve both your experience and encourage new-to-open-source contributors. My email is on my profile if you have the time and would like to share :)
Changes show_models() function to return a dictionary of models in ensemble (#1321)
Yes, I would love to contribute more. Currently I am preparing for job interviews, and as soon as I am done with those, I will have some leeway to start contributing again. Thank you for all your help! And yes, I will surely connect with you for the feedback. But honestly, I don't think any improvement is needed/possible. :) Happy Holidays!
Changes show_models() function to return a dictionary of models in ensemble (#1321) * Changed show_models() function to return a dictionary of models in the ensemble instead of a string
Summary

Currently the `show_models()` function returns a `str` that has to be manually parsed, with no way to access the models. I have changed it so that it returns a dictionary of the models in the ensemble and their information. This helps fix issue #1298 and the issues mentioned inside that thread.

What's changed

Using `show_models()` will return a dictionary where the key is the `model_id`. Each entry is a model dictionary which contains the following: the `model_id`, `rank`, and `ensemble_weight`, the pipeline steps (such as `data_preprocessor`, `feature_preprocessor`, and `classifier`/`regressor`), and the `sklearn_model` (for `cv`, a `voting_model` and an `estimators` list instead).

Example
Output:
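A hypothetical illustration of the new return value (all values invented; real output depends on the fitted ensemble):

```python
from pprint import pprint

pprint(automl.show_models(), indent=4)
# {   2: {   'model_id': 2,
#            'rank': 1,
#            'ensemble_weight': 0.38,
#            'data_preprocessor': <...>,
#            'feature_preprocessor': <...>,
#            'classifier': <...>,
#            'sklearn_model': RandomForestClassifier(...)},
#     5: {   'model_id': 5,
#            'rank': 2,
#            'ensemble_weight': 0.62,
#            ...}}
```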