Identify the best model #1206

Closed
kieran199 opened this issue Aug 5, 2021 · 14 comments

@kieran199

Hello there,

I've been through all the examples, and it's not entirely clear to me how to identify the best model.

If I print the leaderboard and select the model I am interested in, how do I then find the following for that model? I have the model ID from the leaderboard - where do I use it?

  • Model type
  • Hyper parameters used
  • Any pre-processing steps that auto-sklearn used

Thanks a lot for the help in advance

@eddiebergman
Contributor

eddiebergman commented Aug 5, 2021

Hi @kieran199,

We're currently working on changing the external API to make it more user-friendly, and we agree it's not so easy at the moment.

The current solution to get the model is as follows:

import sklearn.model_selection
from sklearn import datasets
from autosklearn.classification import AutoSklearnClassifier

X, y = datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, random_state=1)

clf = AutoSklearnClassifier(time_left_for_this_task=120, per_run_time_limit=30)
clf.fit(X_train, y_train)

wanted_model_id = ...  # the model id you picked from the leaderboard
wanted_model = None

# Models are keyed internally by (seed, model_id, budget)
for (seed, model_id, budget), model in clf.automl_.models_.items():
    if model_id == wanted_model_id:
        wanted_model = model

From there you can query the resulting sklearn Pipeline further to get the information you need.
There are also some more parameters to leaderboard() that give some information on the model type and the preprocessing used; see the sketch below.
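
For example, a minimal sketch (it assumes wanted_model behaves like a standard sklearn Pipeline, as in the snippet above, and that your auto-sklearn version supports the detailed leaderboard flag):

# Hyperparameters of the selected pipeline, via the standard sklearn API
print(wanted_model.get_params())

# The individual steps of the pipeline (preprocessing and the final estimator)
for name, step in wanted_model.steps:
    print(name, step)

# A more detailed leaderboard, with extra columns about each model
print(clf.leaderboard(detailed=True))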

The issue at the moment is that the internal keys used to identify models consist of (seed, model_id, budget), which is more than an end user should really need to know about. In your case, as in many others, you only really want to use the model_id.

Rest assured, we will bring some nicer public API changes for accessing the internals of auto-sklearn, but for now this is the best solution I can offer you.

@kieran199
Author

kieran199 commented Aug 5, 2021

Thanks very much for your reply. I am getting:

AttributeError: 'SimpleClassificationPipeline' object has no attribute 'automl_'

when using clf.automl_.models_.items()

Do you know why this may be?

Also - I note that when I print the model, I get the output below. If I decided (for whatever reason) I wanted to change the model selected from gaussian_nb to something else on the leaderboard, how would I do that?

SimpleClassificationPipeline({
  'balancing:strategy': 'weighting',
  'classifier:__choice__': 'gaussian_nb',
  'data_preprocessing:categorical_transformer:categorical_encoding:__choice__': 'encoding',
  'data_preprocessing:categorical_transformer:category_coalescence:__choice__': 'minority_coalescer',
  'data_preprocessing:numerical_transformer:imputation:strategy': 'median',
  'data_preprocessing:numerical_transformer:rescaling:__choice__': 'quantile_transformer',
  'feature_preprocessor:__choice__': 'select_rates_classification',
  'data_preprocessing:categorical_transformer:category_coalescence:minority_coalescer:minimum_fraction': 0.009151554238227241,
  'data_preprocessing:numerical_transformer:rescaling:quantile_transformer:n_quantiles': 1726,
  'data_preprocessing:numerical_transformer:rescaling:quantile_transformer:output_distribution': 'normal',
  'feature_preprocessor:select_rates_classification:alpha': 0.027868485240680432,
  'feature_preprocessor:select_rates_classification:score_func': 'chi2',
  'feature_preprocessor:select_rates_classification:mode': 'fpr'},
dataset_properties={
  'task': 1,
  'sparse': False,
  'multilabel': False,
  'multiclass': False,
  'target_type': 'classification',
  'signed': False})

@kieran199
Author

Also - one more question :) - normally, I would pickle a model and call it in production when new data arrives.

Will pickling this model also pickle the preprocessing steps?

@eddiebergman
Contributor

> Thanks very much for your reply. I am getting:
>
> AttributeError: 'SimpleClassificationPipeline' object has no attribute 'automl_'
>
> when using clf.automl_.models_.items()
>
> Do you know why this may be?

It sounds like you pickled part of the whole object? If so, can you provide the code for how you did that? The above example works in an IPython session for me.

> Also - one more question :) - normally, I would pickle a model and call it in production when new data arrives.
>
> Will pickling this model also pickle the preprocessing steps?

The Pipeline object you get at the end consists of all the steps used in the process, so yes, this includes the preprocessing steps; see the sketch below.
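
For example, a minimal sketch using the standard pickle module (the filename automl.pkl is just an illustration; clf and X_test come from the earlier snippet):

import pickle

# Dump the fitted object (ensemble, pipelines and their preprocessing) to disk
with open('automl.pkl', 'wb') as f:
    pickle.dump(clf, f)

# Later, e.g. in production: load it back and predict on new data
with open('automl.pkl', 'rb') as f:
    loaded_clf = pickle.load(f)

predictions = loaded_clf.predict(X_test)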

> Also - I note that when I print the model, I get the output below. If I decided (for whatever reason) I wanted to change the model selected from gaussian_nb to something else on the leaderboard, how would I do that?

> SimpleClassificationPipeline({'balancing:strategy': 'weighting', 'classifier:__choice__': 'gaussian_nb', ...})

You would have to load a separate model from the leaderboard. The whole pipeline was built around gaussian_nb, including hyperparameters that are not valid for other model types, so no, there is no meaningful way to drop in a different model type as a replacement. If you want to fit a specific pipeline, you can copy the configuration manually into sklearn and train the pipeline that way; a rough sketch of that is below.
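
For instance, a rough approximation of the printed configuration in plain scikit-learn (a sketch only, not auto-sklearn's internal components; it ignores balancing and categorical handling, reuses X_train and y_train from the earlier snippet, and substitutes f_classif for chi2 because chi2 requires non-negative inputs, which the quantile transformer's normal output would violate):

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import QuantileTransformer
from sklearn.feature_selection import SelectFpr, f_classif
from sklearn.naive_bayes import GaussianNB

pipeline = Pipeline([
    # 'imputation:strategy': 'median'
    ('imputation', SimpleImputer(strategy='median')),
    # 'rescaling:__choice__': 'quantile_transformer' with n_quantiles=1726, output_distribution='normal'
    ('rescaling', QuantileTransformer(n_quantiles=1726, output_distribution='normal')),
    # 'select_rates_classification' with mode='fpr' and alpha below (score_func swapped as noted above)
    ('feature_selection', SelectFpr(score_func=f_classif, alpha=0.027868485240680432)),
    # 'classifier:__choice__': 'gaussian_nb'
    ('classifier', GaussianNB()),
])

pipeline.fit(X_train, y_train)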

@kieran199
Author

Hi there, I haven't pickled anything yet - all I've done so far is the code below. Running your script immediately afterwards gives me that error.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import accuracy_score
import autosklearn.classification

# features and output are my own dataset, defined earlier (not shown)
train_features, test_features, train_labels, test_labels = train_test_split(features, output, test_size=0.2, random_state=42)
model = autosklearn.classification.AutoSklearnClassifier()
model.fit(train_features, train_labels)
predictions = model.predict(test_features)

@eddiebergman
Contributor

eddiebergman commented Aug 5, 2021

I also used the variable model in my snippet (inside the for loop); you'll have to change that in either my snippet or yours.

@kieran199
Author

Yeah, I had changed that already to x :(

@eddiebergman
Contributor

I don't know what to tell you: the snippet above works, so I would imagine there is an error in how you copied, pasted, and renamed variables. This kind of thing falls outside the scope of the help we can provide, but if you post the full code you are using I'm happy to have a look, and then close the issue if that answers all your questions.

@kieran199
Author

Ah OK, I see. I just re-ran it and it worked (I am not sure why).

I promise this is the last question :) :)

Is there a way to return a single model as the output (the most accurate one), rather than a dictionary of many models?

@eddiebergman
Contributor

eddiebergman commented Aug 5, 2021

You can use the leaderboard to identify the most accurate model:

clf.leaderboard(
    ensemble_only=False,  # Also include models that did not make it into the final ensemble
    sort_by='cost'        # The loss on the validation set
)

You can identify the best model by the rank column; a small sketch is below.
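
A minimal sketch (it assumes leaderboard() returns a pandas DataFrame indexed by model id with a rank column, and reuses the lookup loop from the earlier snippet):

lb = clf.leaderboard(sort_by='cost')

# The row with rank 1 has the lowest validation cost; its index is the model id
best_model_id = lb[lb['rank'] == 1].index[0]

# Look the corresponding pipeline up in the internal model store
best_model = None
for (seed, model_id, budget), model in clf.automl_.models_.items():
    if model_id == best_model_id:
        best_model = model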

Again, I highly recommend reading the leaderboard API documentation to know what kind of information you can extract.

@kieran199
Author

And the number 1 model in the leaderboard is always the one selected?

So if I then pickled the result of the code below, would it be rank 1 of the leaderboard?

model = autosklearn.classification.AutoSklearnClassifier()
model.fit(train_features, train_labels)

@eddiebergman
Contributor

eddiebergman commented Aug 5, 2021

Auto-sklearn selects an ensemble of models, not a single model. This isn't clear from the initial documentation that users come across, so I'll take a note to update that!

Every model shown in leaderboard() is in the ensemble, and ensemble_weight indicates how strongly that model is weighted within the ensemble; see the sketch below.
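
A minimal sketch (it assumes the leaderboard DataFrame exposes rank, ensemble_weight, type and cost columns, and reuses clf and X_test from earlier):

lb = clf.leaderboard()

# Each row is one member of the final ensemble; ensemble_weight is its weight in the vote
print(lb[['rank', 'ensemble_weight', 'type', 'cost']])

# Predictions always come from the weighted ensemble, not from any single pipeline
predictions = clf.predict(X_test)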

@kieran199
Author

Oh I see - that's interesting. So it will use a combination of all the models it produces, with a different weight assigned to each.

That makes sense - I hadn't picked that up from the documentation. Is there a high-level overview which covers how it works? The manual didn't help in this regard.

@eddiebergman
Contributor

Noted, we'll try to make that clearer for users in the future.

For now the best comprehensive overview is given by the two papers associated with auto-sklearn, which may be a bit dense:

  1. https://papers.nips.cc/paper/2015/file/11d0e6287202fced83f79975ec59a3a6-Paper.pdf
  2. https://arxiv.org/pdf/2007.04074.pdf
