Changes show_models() function to return a dictionary of models in ensemble #1321
Conversation
Changed show_models() function to return a dictionary of models in the ensemble instead of a string
Codecov Report
```
@@            Coverage Diff             @@
##           development    #1321      +/-   ##
===============================================
- Coverage        88.06%   88.05%   -0.02%
===============================================
  Files              140      140
  Lines            11163    10993     -170
===============================================
- Hits              9831     9680     -151
+ Misses            1332     1313      -19
```
autosklearn/automl.py
Outdated
```python
table.sort_values(by='cost', inplace=True)
table['rank'] = range(1, len(table) + 1)

for (_, model_id, _), model in self.models_.items():
```
`self.models_` may be empty if a user specifies using cross-validation, in which case `self.cv_models_` is set. We want to unify them into one thing, but at the moment that doesn't happen.

For this, a test probably needs to be created; I will elaborate on creating a test in a moment.
Did not know that. So do you want me to put some condition to check for this and use either `self.models_` or `self.cv_models_` inside the loop, depending on whichever is non-empty, or should I just create the test for now?
Yeah, you can either check if the `resampling_strategy` is `"cv"`, or, probably more useful, just check which one is empty.
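A minimal sketch of that second option, assuming the two attributes named above (placement inside `show_models()` is hypothetical):

```python
# Use whichever store is populated; self.cv_models_ is only set when the
# user trained with cross-validation.
models = self.models_ if self.models_ else self.cv_models_
for (_, model_id, _), model in models.items():
    ...
```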
autosklearn/automl.py
Outdated
```python
model_dict[step[0]] = step[1]

# Adding sklearn model to the model dictionary
model_dict['sklearn_model'] = model.steps[-1][1].choice.estimator
```
This is sort of duplicated from including the step above, but I agree with your choice here: it lets a user specifically get the auto-sklearn wrapped version or the sklearn estimator, which I suspect is of more relevance to people. No change needed, just commenting that I agree with it.

However, I might still explicitly split out the tuple:

```python
model_type, autosklearn_wrapped_model = model.steps[-1]
model_dict['sklearn_model'] = autosklearn_wrapped_model.choice.estimator
```

If the automatic checks give you a hard time with `model_type` not being used, you can do the following:

```python
autosklearn_wrapped_model = model.steps[-1][1]
```
All in all, this looks very useful; thanks for contributing! I apologize for the lengthy comments, but since it's your first time contributing, I just want to give as much info as possible. Contributing to any other large repo will likely be quite similar and hopefully, if you'd like to contribute any more nice features, you'll be well aware of the steps! This part of the contribution guide will be helpful for the next steps.

Documentation

The function as it used to be before you fixed it wasn't very good, so the documentation wasn't well written. Now that it seems useful, it'll be good to add documentation so that users will actually want to use it. Check the linked section for how to build and view it locally.

Code Style

We follow a code style so that the code looks mostly the same and is easier to read. You can see the details when you check the linked section of the contribution guide.

Testing

As mentioned in one comment, I'm not sure this will work if a user uses cross-validation. I would recommend mostly copying the scaffold of this test with an added parameterization. The scaffold for it would look something like:

```python
@pytest.mark.parametrize('estimator', [AutoSklearnClassifier])
@pytest.mark.parametrize('resampling_strategy', ['holdout'])
@pytest.mark.parametrize('X', [
    np.asarray([[1, 1, 1]] * 50 + [[2, 2, 2]] * 50)
])
@pytest.mark.parametrize('y', [
    np.asarray([1] * 50 + [2] * 50)
])
def test_show_models_with_holdout(
    tmp_dir: str,
    dask_client: dask.distributed.Client,
    estimator: AutoSklearnEstimator,
    resampling_strategy: str,
    X: np.ndarray,
    y: np.ndarray
) -> None:
    """
    Parameters
    ----------
    tmp_dir: str
        The temporary directory to use for this test

    dask_client: dask.distributed.Client
        The dask client to use for this test

    estimator: AutoSklearnEstimator
        The estimator to train

    resampling_strategy: str
        The resampling strategy to use

    X: np.ndarray
        The X data to use for this estimator

    y: np.ndarray
        The targets to use for this estimator

    Expects
    -------
    * Something you expect to be true
    * Something else you expect to be true
    """
    automl = estimator(
        time_left_for_this_task=60,  # Run a little longer for CV
        per_run_time_limit=5,
        tmp_folder=tmp_dir,
        resampling_strategy=resampling_strategy,
        dask_client=dask_client,
    )
    automl.fit(X, y)

    output = automl.show_models()

    # assert all the keys we expect are in there and that they have values
    # assert when estimator == AutoSklearnRegressor, we have a "regressor" entry and for classifier
    # assert whatever else you think should be checked
```
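A concrete version of those assertion placeholders might look like the following (key names taken from the dict layout discussed in this thread; illustrative only):

```python
expected_keys = {"model_id", "rank", "ensemble_weight", "sklearn_model"}
for model_id, model_dict in output.items():
    # every entry should carry the expected keys, with a matching id
    assert expected_keys <= set(model_dict.keys())
    assert model_dict["model_id"] == model_id
```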
Thank you so much @eddiebergman for considering that I am new at this and giving me all the details. I will make all the changes you have mentioned and get back to you asap.
@eddiebergman I have made all the changes that you had suggested, except for implementing the test. I will do that in a few days. Please check if the rest of the things seem in order to you.
Hello @eddiebergman! So I checked what happens if we use `cv`. Also, I have added a condition to check if `self.models_` is empty and to use `self.cv_models_` in that case.

Is that alright?
Hi @userfindingself,

So it seems weird that the dictionary was empty; hmmm, there should always be at least one model. In the case that there was not enough time to find a good configuration, we use a dummy model.

For extra background info, we have two ensemble classes.

You can push your latest changes and I can try it out and point you in the right direction from there! :)
Yeah, it's pretty weird because the rest of it works fine. Not sure how the dictionary was empty then, maybe you can try to recreate it? And thanks! I have changed it as suggested.
Thanks, I will take a look tomorrow :)
Hey @eddiebergman, so what do you think we should do about `self.cv_models_`?
Hi @userfindingself, Sorry for the delay, busy week. I will take a look now :)
So I was playing around with it and we now need to make a decision about how to best return results for a cv-trained model. In short, we can't keep a unified dict because there is just more difference for `cv`.

For some background, when trained with `cv`:

```python
print(model.automl_.cv_models_)
{(1, 8, 0.0): VotingRegressor(estimators=None),
 (1, 11, 0.0): VotingRegressor(estimators=None),
 (1, 15, 0.0): VotingRegressor(estimators=None),
 (1, 14, 0.0): VotingRegressor(estimators=None),
 (1, 5, 0.0): VotingRegressor(estimators=None)}
```

You'll notice the `estimators=None`. So if we look at one of them in particular:

```python
cv_model_zero = list(model.automl_.cv_models_.values())[0]
# VotingRegressor(estimators=None)
```

Printing out their estimators and viewing their hyperparameters, you can see they're all identical (scroll right for some pleasant text alignment).

```python
cv_model_zero.estimators_
[SimpleRegressionPipeline({'data_preprocessor:__choice__': 'feature_type', 'feature_preprocessor:__choice__': 'feature_agglomeration', 'regressor:__choice__': 'gaussian_process', 'data_preprocessor:feature_type:categorical_transformer:categorical_encoding:__choice__': 'one_hot_encoding', 'data_preprocessor:feature_type:categorical_transformer:category_coalescence:__choice__': 'no_coalescense', 'data_preprocessor:feature_type:numerical_transformer:imputation:strategy': 'mean', 'data_preprocessor:feature_type:numerical_transformer:rescaling:__choice__': 'quantile_transformer', 'feature_preprocessor:feature_agglomeration:affinity': 'euclidean', 'feature_preprocessor:feature_agglomeration:linkage': 'average', 'feature_preprocessor:feature_agglomeration:n_clusters': 107, 'feature_preprocessor:feature_agglomeration:pooling_func': 'median', 'regressor:gaussian_process:alpha': 0.42928092501196696, 'regressor:gaussian_process:thetaL': 1.4435895787652725e-07, 'regressor:gaussian_process:thetaU': 8.108685026706572, 'data_preprocessor:feature_type:numerical_transformer:rescaling:quantile_transformer:n_quantiles': 268, 'data_preprocessor:feature_type:numerical_transformer:rescaling:quantile_transformer:output_distribution': 'uniform'},
   dataset_properties={
     'task': 4,
     'sparse': False,
     'multioutput': False,
     'target_type': 'regression',
     'signed': False}),
 # ... four more identical SimpleRegressionPipeline entries, one per remaining fold
]
```

The problem here is that each pipeline will actually be trained slightly differently: each pipeline here has the same hyperparameters but is trained on separate folds of the data. Hence, they will produce slightly different output given the same input; which one do we return to the user?

To keep roughly the same interface as for when holdout is used, we'll probably need to complicate things slightly, as there is simply just more to include for `cv`. I would propose:

```
[
    <model_id> : {
        "model_id": <model_id>,
        "rank": <rank>,
        ...
        "estimators": [
            {
                "data_preprocessor": <>,
                "feature_preprocessor": <>,
                "classifier"/"regressor": <>,
                "sklearn_model": <>
            }
        ]
    }
]
```

Where `len(models[id]["estimators"]) == len(cv_model_zero.estimators_)`.

I would leave the implementation and testing with you, if you like, so you have full control over how it's done :) Just a brief snippet to guide things:

```python
is_cv = self.resampling_strategy == "cv"
models = self.cv_models_ if is_cv else self.models_

for (_, model_id, _), model in models.items():
    ...  # Same for setting previous steps
    if is_cv:
        ...  # populate the "estimators" value
    else:
        ...  # as before
```
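A slightly fuller sketch of that `if is_cv` branch, assuming the per-fold pipelines live in `model.estimators_` as shown above (names here are illustrative, not the final implementation):

```python
if is_cv:
    estimators = []
    for fold_pipeline in model.estimators_:  # one fitted pipeline per fold
        estimator_dict = {name: step for name, step in fold_pipeline.steps}
        # Unwrap the raw sklearn estimator from the auto-sklearn choice object
        estimator_dict["sklearn_model"] = fold_pipeline.steps[-1][1].choice.estimator
        estimators.append(estimator_dict)
    model_dict["estimators"] = estimators
```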
Also, a brief thing I noticed: it appears one of the returned values uses a numpy type rather than a native Python one.
Yeah, I tried doing a simple type conversion to fix that. I will need to delay the remaining work a little. Thanks for explaining how exactly the pipelines are stored. :)
Hello @eddiebergman! I have changed the data type. Can you tell me what you meant here:

Do you think we should select any model? Also, in your proposal of the dictionary for cv models, the models are inside a list.
I'm not really sure if it makes sense for uniformity purposes; either way, the end user using this dict will have to be aware of the difference when using `cv`. To be complete, here is my suggested setup, but feel free to suggest your own. In general, I don't think there's a way to have uniform access given the choice of `resampling_strategy`.

```
# "holdout", as it was before
[
    <model_id> : {
        "model_id": <>,
        "rank": <>,
        "ensemble_weight": <>,
        "data_preprocessor": <>,
        "feature_preprocessor": <>,
        "classifier"/"regressor": <>,
        "sklearn_model": <>,
    }
]

# "cv", with the added "voting_model" key-val
[
    <model_id> : {
        "model_id": <>,
        "rank": <>,
        "ensemble_weight": <>,
        "voting_model": <VotingX, the model full of each pipeline in "estimators">,
        "estimators": [
            {
                "data_preprocessor": <>,
                "feature_preprocessor": <>,
                "classifier"/"regressor": <>,
                "sklearn_model": <>
            }
        ]
    }
]
```
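Under that layout, end-user code would branch on the presence of "estimators"; a sketch assuming the keys proposed above (not part of the PR itself):

```python
models = automl.show_models()
for model_id, info in models.items():
    if "estimators" in info:  # trained with resampling_strategy="cv"
        sklearn_models = [est["sklearn_model"] for est in info["estimators"]]
    else:  # "holdout": a single pipeline per model
        sklearn_models = [info["sklearn_model"]]
    print(model_id, sklearn_models)
```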
Yes, so should we return all the pipelines? If we have, say, 5 folds, there would be 5 pipelines for each model.

I wanted to know which model should be returned because, as you have mentioned, there's one for each cv fold. I should have been more specific, sorry. To be more clear,

```python
cv_model_zero = list(model.automl_.cv_models_.values())[0]
cv_model_zero.estimators_
```

would give many models, each trained on a different cv fold. My question is the same as what you had originally written: "which one do we return to the user?"

I think your suggested implementation is perfectly fine. :)
Apologies, I'm not entirely sure where the confusion is; in the proposed setup, all of the pipelines are returned, since the "estimators" value is a list containing one entry per fold.
Oh I am extremely sorry, I just didn't notice that a list is being returned. I will code it now, thank you!
Hello! I am getting the following error when I try to train the AutoSklearnRegressor with the test data:

```
ValueError: AutoMLRegressor does not support task binary
```

What is the issue? And is it okay if I use any other dataset for this task?
Yeah, that's fine, use a different y value; essentially, if there are only two numeric values, it's autodetected as a binary classification task.
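For instance (an illustrative target, not from the PR):

```python
import numpy as np

# Three distinct numeric values, so the task is no longer detected as binary
y = np.asarray([1.0] * 34 + [2.0] * 33 + [3.0] * 33)
```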
Hey @eddiebergman. I have created the two tests. I have done this for the first time, so please tell me if they look alright to you.
Generally good; the main change is just to do a `set` comparison rather than a `list` comparison.
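For instance (model ids invented for illustration):

```python
# A list comparison is order-sensitive, so it can fail even when the
# right models are present:
assert list(output.keys()) == [2, 5, 7]

# A set comparison checks membership only, which is what the test cares about:
assert set(output.keys()) == {2, 5, 7}
```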
I have made the changes. Thanks for being helpful! :)
Just waiting on the tests to finish. @mfeurer, would you like to review this before merging? I'm happy with it as it is :)
Thank you very much for the contribution. This looks very nice, but I think you also need to update the examples where they describe the use of `show_models()`. I just had a brief look and you need to update:
Hello @mfeurer and @eddiebergman! Merry Christmas! Sorry, I was a bit busy for the last few days, but I have made changes to the function description in the examples. I noticed that the returned dictionary doesn't print very nicely in the examples; is there a recommended way to display it?
A builtin solution in Python for pretty printing things. I would go with this, and then I think we're happy with the PR :)

```python
from pprint import pprint

pprint(automl.show_models(), indent=4)
```

However, dictionaries give no guarantee on the order of things, but that's fine; it's just for demonstration purposes, and `get_models_with_weights` can be used for getting a high-level look.
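If a weight-ordered, high-level view is wanted instead (a possibility, not something this PR changes), `get_models_with_weights()` returns `(weight, model)` pairs:

```python
# Print the ensemble members from highest to lowest weight
for weight, model in automl.get_models_with_weights():
    print(f"{weight:.3f}  {model}")
```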
Great! I have used pretty printing for all the occurrences of `show_models()`.
Hi @userfindingself,

All looks good to me; I've merged it with development and it'll be available in the next release. I expect this release will be around January. If you look at our PRs which could affect performance, we need to make sure none of them change performance too negatively, so we're doing some extensive testing with automlbenchmark, hence the delay.

Thanks for your contribution! Please feel free to contribute again if you ever feel like it. I'm happy to help along and it would be a lot smoother now that you know how it works for our setup :)

Happy Holidays ☃️

P.S. If you have time, we would also appreciate a little feedback on how contributing was for you: what was good, what was bad, what resources were we lacking to help you, or if anything was frustrating about the process. Any thoughts you think might help us improve both your experience and encourage new-to-open-source contributors. My email is on my profile if you have the time and would like to share :)
Changes show_models() function to return a dictionary of models in ensemble (#1321)
Yes, I would love to contribute more. Currently I am preparing for job interviews, and as soon as I am done with those, I will have some leeway to start contributing again. Thank you for all your help! And yes, I will surely connect with you for the feedback. But honestly, I don't think any improvement is needed/possible. :) Happy Holidays!
Changes show_models() function to return a dictionary of models in ensemble (#1321) * Changed show_models() function to return a dictionary of models in the ensemble instead of a string
Summary

Currently the `show_models()` function returns a `str` that has to be manually parsed, with no way to access the models. I have changed it so that it returns a dictionary of the models in the ensemble and their information. This helps fix issue #1298 and the issues mentioned inside that thread.

What's changed

Using `show_models()` will return a dictionary where the key is the `model_id`. Each entry is a model dictionary which contains the following: the `model_id`, `rank`, and `ensemble_weight`, the pipeline steps (such as `data_preprocessor`, `feature_preprocessor`, and `classifier`/`regressor`), and the `sklearn_model` (for `cv`, a `voting_model` and an `estimators` list instead).

Example
Output:
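A hypothetical illustration of the new return value (all values invented; real output depends on the fitted ensemble):

```python
from pprint import pprint

pprint(automl.show_models(), indent=4)
# {   2: {   'model_id': 2,
#            'rank': 1,
#            'ensemble_weight': 0.38,
#            'data_preprocessor': <...>,
#            'feature_preprocessor': <...>,
#            'classifier': <...>,
#            'sklearn_model': RandomForestClassifier(...)},
#     5: {   'model_id': 5,
#            'rank': 2,
#            'ensemble_weight': 0.62,
#            ...}}
```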