Get Selected Features by Preprocessing Steps #524

Closed

teresaconc opened this issue Aug 7, 2018 · 8 comments

@teresaconc
Contributor

Hi, I'm interested in using the models that were found in a more standalone way. Is there any way to get the selected features when the preprocessing step is a feature selection algorithm?

So far my approach is:

  1. Build a Pipeline with all the preprocessing steps that were chosen in the model
  2. Set the default parameters of the configuration space to the ones given by the results
  3. Call fit_transformer to transform the dataset (before the estimator is applied)

This is similar to what you do in the function test_weighting_effect (in the test file test_balancing.py). The problem is that when I call the fit_transformer method, the transformed data comes back as a numpy array rather than a dataframe (without the column headers), so I can't tell which features were kept and which were removed.

Is there any way I can accomplish this, perhaps more easily than with this approach?

@mfeurer
Contributor

mfeurer commented Aug 8, 2018

Hi, I'm not sure I can follow the steps you describe. Do you want to find the features that were selected by a fitted configuration? If so, you can get them with this piece of code:

import sklearn.model_selection
import sklearn.datasets

import autosklearn.classification


def main():
    X, y = sklearn.datasets.load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = \
        sklearn.model_selection.train_test_split(X, y, random_state=1)

    automl = autosklearn.classification.AutoSklearnClassifier(
        time_left_for_this_task=15,
        per_run_time_limit=5,
        tmp_folder='/tmp/autosklearn_holdout_example_tmp',
        output_folder='/tmp/autosklearn_holdout_example_out',
        disable_evaluator_output=False,
        # 'holdout' with 'train_size'=0.67 is the default argument setting
        # for AutoSklearnClassifier. It is explicitly specified in this example
        # for demonstrational purpose.
        resampling_strategy='holdout',
        resampling_strategy_arguments={'train_size': 0.67},
        include_preprocessors=['select_percentile_classification'],
        include_estimators=['random_forest']
    )
    automl.fit(X_train, y_train, dataset_name='breast_cancer')

    # Iterate all models used in the final ensemble
    for weight, model in automl.get_models_with_weights():
        # Obtain the step of the underlying scikit-learn pipeline
        print(model.steps[-2])
        # Obtain the scores of the current feature selector
        print(model.steps[-2][-1].choice.preprocessor.scores_)
        # Obtain the percentile configured by Auto-sklearn
        print(model.steps[-2][-1].choice.preprocessor.percentile)


if __name__ == '__main__':
    main()

If that's not the case, please provide a brief code example of what you would like to do, and I can fill in the missing pieces.

@teresaconc
Contributor Author

Yeah, that's pretty much it! The steps attribute was the key. Thanks!
What I was ultimately trying to get is model.steps[-2][-1].choice.preprocessor.get_support(), which returns a boolean array marking the chosen features. Most of the preprocessors have that method built in.
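For completeness, a rough sketch of mapping that mask back to column names (untested; df_train is a placeholder for the original pandas DataFrame with named columns, and it assumes earlier pipeline steps such as one-hot encoding did not change the number of columns):

import numpy as np

# Assumes `automl` was fitted as in the snippet above and that the
# original training data was a pandas DataFrame with named columns.
feature_names = np.asarray(df_train.columns)

for weight, model in automl.get_models_with_weights():
    # The feature selector sits one step before the final estimator.
    selector = model.steps[-2][-1].choice.preprocessor
    mask = selector.get_support()  # boolean array, True = feature kept
    print('kept:   ', feature_names[mask])
    print('dropped:', feature_names[~mask])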

Thank you!

@aimanakheel

Hi @mfeurer, @teresaconc,
I have been trying to use the above solution to get the features and feature importance scores used by the final model.

The above solution gives the feature importance scores, but not the features those scores correspond to.

I ask because I am passing categorical features via automl.fit(df_train, y_train, feat_type=feature_types), and the features in df_train end up very different from the features actually used for training.

I would appreciate any guidance to get Features and Feature importance scores used by the final model.

Thanks.

@mfeurer
Contributor

mfeurer commented Jan 2, 2019

In scikit-learn this could be done via the get_feature_names method. However, Auto-sklearn does not implement this. I would be very happy about a contribution of this functionality, though.
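For context, here is roughly how get_feature_names behaves on a plain scikit-learn transformer (a minimal sketch; note that newer scikit-learn releases renamed this method to get_feature_names_out):

from sklearn.feature_extraction.text import CountVectorizer

# A transformer that creates new columns and can name them.
vec = CountVectorizer()
vec.fit(['the cat sat', 'the dog barked'])

# One name per output column, in column order.
print(vec.get_feature_names())
# ['barked', 'cat', 'dog', 'sat', 'the']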

@aimanakheel

aimanakheel commented Jan 2, 2019

I would love to contribute. How can I help?

Can you point me in the right direction?

@mfeurer
Contributor

mfeurer commented Jan 4, 2019

All relevant pipeline code lives in the pipeline subpackage: https://github.com/automl/auto-sklearn/tree/development/autosklearn/pipeline
base.py contains basic code, classification.py and regression.py contain code specific to the two task types.

I must admit that I don't know exactly how get_feature_names works, and scikit-learn's Pipeline class itself doesn't implement it. Maybe figuring out how this is supposed to be used with a scikit-learn pipeline would be a good first step?
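One possible starting point (an untested sketch, not existing Auto-sklearn API): walk a fitted pipeline and propagate the feature names step by step, using whatever introspection each transformer offers:

import numpy as np

def propagate_feature_names(pipeline, input_names):
    """Best-effort tracking of feature names through a fitted pipeline."""
    names = np.asarray(input_names)
    for step_name, transformer in pipeline.steps[:-1]:  # skip the estimator
        if hasattr(transformer, 'get_support'):
            # Feature selectors keep a subset of the incoming columns.
            names = names[transformer.get_support()]
        elif hasattr(transformer, 'get_feature_names'):
            # Transformers that create new columns (e.g. one-hot encoding).
            names = np.asarray(transformer.get_feature_names())
        # Otherwise assume the step maps columns one-to-one (e.g. scaling).
    return names

For Auto-sklearn, the actual scikit-learn object is wrapped one level deeper (transformer.choice.preprocessor, as in the snippet above), so a real implementation would have to unwrap that first.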

@kevinsay

kevinsay commented Jan 9, 2019

@aimanakheel can you share the code for getting feature importance? thanks!

@aimanakheel

@kevinsay
So far I was able to get the following:

My New Input:

# Grab the first model in the final ensemble.
pipeline = list(automl.automl.models.values())[0]
print(pipeline)

# The estimator wrapper chosen by Auto-sklearn.
xg_class = pipeline.final_estimator.choice
print('-----------------------')
print(xg_class)

# The underlying scikit-learn estimator.
xg_estimator = xg_class.estimator
print('-----------------------')
print(xg_estimator)

# Per-feature weights of the fitted estimator.
feature_importance = xg_estimator.coef_
print('-----------------------')
print(feature_importance)

My Output:

[screenshot of the printed pipeline, chosen estimator, and coefficient array]
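Note that coef_ gives one value per feature in the transformed space (i.e. after encoding and feature selection), so relating those numbers back to the original columns still requires the selector mask (get_support()) discussed above.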
