Obtain feature list after ensemble classification #719


Closed
rcuocolo opened this issue Aug 30, 2019 · 5 comments



rcuocolo commented Aug 30, 2019

I ran auto-sklearn and obtained a 3-model ensemble for classifying my data. For reporting, and to better understand the process, I would like to know which features were selected for the classification.
I already tried the code in #524, but was not able to obtain the feature names (from the column headers of my data set).

This is the code I am currently employing for the classification:

import pandas as pd
import sklearn.model_selection
import sklearn.metrics
import autosklearn.classification
from sklearn import preprocessing

X = pd.read_csv('df.csv', index_col='A')
le = preprocessing.LabelEncoder()
for column_name in X.columns:
    if X[column_name].dtype == object:
        X[column_name] = le.fit_transform(X[column_name])
y = X.Infiltration
X_train, X_test, y_train, y_test = \
    sklearn.model_selection.train_test_split(X, y, random_state=3, test_size=0.2)

automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=180,
    per_run_time_limit=60,
    ensemble_size=3,
    ensemble_nbest=50,
    resampling_strategy='cv',
    resampling_strategy_arguments={'folds': 10},
    )

automl.fit(X_train.copy(), y_train.copy(), dataset_name='test')
automl.refit(X_train.copy(), y_train.copy())

predictions = automl.predict(X_test)
probabilities = automl.predict_proba(X_test)

What can I add to obtain the desired output?

mfeurer (Contributor) commented Jan 7, 2020

Could you please post a fully reproducible example in which you apply the code from issue #524 and it fails?

Also, you might have to cast your data to a NumPy array before passing it to Auto-sklearn.
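For illustration, the suggested cast could look like this (a minimal sketch; the DataFrame and its column names are made up to stand in for the CSV used above):

```python
import numpy as np
import pandas as pd

# Hypothetical data frame standing in for the CSV loaded in the question
df = pd.DataFrame({"feat_a": [1.0, 2.0, 3.0], "feat_b": [4.0, 5.0, 6.0]})

# Cast to a plain NumPy array before handing the data to Auto-sklearn
X = df.to_numpy()

# The array loses the column labels, so keep them separately if you want
# to map importances or selector scores back to feature names later
feature_names = list(df.columns)
```

Keeping `feature_names` alongside the array is what lets you recover "which features were selected" afterwards, since the fitted pipeline only sees column indices.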

domainoverflow commented Feb 2, 2020

Hi @mfeurer, thanks in advance for your valuable time.
I am having the same problem: I am trying to get the feature_importances_ or coef_, but I can't. The only difference is that I am using the Regressor rather than the Classifier. I get an R2 score and the predictions work well, but I need to access the feature importances. I have tried many approaches from this thread and from #524.
I would be grateful if you could point me in the right direction.

import autosklearn.classification
import autosklearn.regression
import sklearn.model_selection
import sklearn.datasets
import sklearn.metrics
import pandas as pd
import sys
import numpy as np
import matplotlib
import matplotlib.pyplot as plt

url = '/home/path_to/dataset.csv'
full_data = pd.read_csv(url)
full_data[['feature1','feature2','feature3','target_y_feature']]
y = full_data["target_y_feature"]
X = full_data.drop(["target_y_feature"], axis=1)
#y = y.to_numpy()
#X = X.to_numpy()
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size=0.3, random_state=42)
 
automl = autosklearn.regression.AutoSklearnRegressor(ensemble_size=1,time_left_for_this_task=220,per_run_time_limit=60,initial_configurations_via_metalearning=0)
automl.fit(X_train, y_train)
y_hat = automl.predict(X_test)
#print("Accuracy score", sklearn.metrics.accuracy_score(y_test, y_hat))
predictions=y_hat
print(automl.show_models())
print("R2 score:", sklearn.metrics.r2_score(y_test, predictions))
print("ground")
print(y_test)
print("predictions") 
print(predictions)


for weight, model in automl.get_models_with_weights():
    # Obtain the step of the underlying scikit-learn pipeline
    print(model.steps[-2])
    # Obtain the scores of the current feature selector
    print(model.steps[-2][-1].choice.preprocessor.scores_)
    # Obtain the percentile configured by Auto-sklearn
    print(model.steps[-2][-1].choice.preprocessor.percentile)
 

#automl.get_models_with_weights()

But I get AttributeError: 'int' object has no attribute 'scores_'
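That AttributeError typically means the pipeline in question has no fitted feature selector in that slot (with 'no_preprocessing', as in the show_models() output below, the slot can hold a plain value instead of a fitted object). One way around it is a defensive lookup; this is a generic sketch (the helper name is made up, and it only mirrors the `step.choice.preprocessor.scores_` attribute chain used above):

```python
def safe_scores(step):
    """Return the feature-selector scores for a pipeline step, or None.

    Follows the step.choice.preprocessor.scores_ chain used above, but
    returns None instead of raising when any link in the chain is missing
    (e.g. when the pipeline was fitted with 'no_preprocessing').
    """
    choice = getattr(step, "choice", None)
    preprocessor = getattr(choice, "preprocessor", None)
    return getattr(preprocessor, "scores_", None)
```

With this guard, the loop over `automl.get_models_with_weights()` can simply skip members whose result is None rather than crashing on the first pipeline without a selector.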

I also tried following the example from @teresaconc

pipeline = list(automl._automl.models.values())[0]
print(pipeline)

but get

AttributeError: 'list' object has no attribute 'models'

whereas if I do

pipeline = list(automl._automl)
print(pipeline)

I get

RuntimeError: scikit-learn estimators should always specify their parameters in the signature of their init (no varargs). <class 'autosklearn.automl.AutoMLRegressor'> with constructor (self, *args, **kwargs) doesn't follow this convention.

For ensemble_size = 1 I have the following:

[(1.000000, SimpleRegressionPipeline({'categorical_encoding:__choice__': 'one_hot_encoding', 'imputation:strategy': 'mean', 'preprocessor:__choice__': 'no_preprocessing', 'regressor:__choice__': 'random_forest', 'rescaling:__choice__': 'standardize', 'categorical_encoding:one_hot_encoding:use_minimum_fraction': 'True', 'regressor:random_forest:bootstrap': 'True', 'regressor:random_forest:criterion': 'mse', 'regressor:random_forest:max_depth': 'None', 'regressor:random_forest:max_features': 1.0, 'regressor:random_forest:max_leaf_nodes': 'None', 'regressor:random_forest:min_impurity_decrease': 0.0, 'regressor:random_forest:min_samples_leaf': 1, 'regressor:random_forest:min_samples_split': 2, 'regressor:random_forest:min_weight_fraction_leaf': 0.0, 'regressor:random_forest:n_estimators': 100, 'categorical_encoding:one_hot_encoding:minimum_fraction': 0.01},
dataset_properties={
  'task': 4,
  'sparse': False,
  'multilabel': False,
  'multiclass': False,
  'target_type': 'regression',
  'signed': False})),
]

I also tried casting the pandas DataFrame to a NumPy array (commented out above in the code), but with the same outcome.

I would be grateful if you could point me to accessing the coefficients / feature importance.

Thank you.
PS: another way of asking this would be: how could I get the feature_importances_ from the regression example auto-sklearn/examples/example_regression.py? Thanks for your time.
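Since the single ensemble member in the show_models() output above is a random forest, its fitted final step should expose feature_importances_ just like a plain scikit-learn RandomForestRegressor. A standalone sketch of reading importances from such a model (synthetic data and made-up feature names, to keep it runnable without auto-sklearn):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = rng.rand(200, 3)
# The target depends mostly on the first column, so that feature
# should dominate the learned importances
y = 5.0 * X[:, 0] + 0.5 * X[:, 1] + 0.01 * rng.rand(200)

rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# One importance value per input column; the values sum to 1.0
importances = rf.feature_importances_
for name, imp in zip(["feat_a", "feat_b", "feat_c"], importances):
    print(name, round(imp, 3))
```

The same attribute lives on the final estimator inside each auto-sklearn pipeline; the work is in navigating the pipeline wrappers to reach it, which is what the pipeline-components example linked later in this thread demonstrates.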

@akshayparanjape

During prediction, I get an error, "DataFrame object has no attribute 'dtype'", when passing a pandas DataFrame as input. A pandas DataFrame has no attribute 'dtype', only 'dtypes'.
Can you let me know if this is a bug, or whether I am doing something wrong?
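The attribute distinction can be checked directly (a minimal sketch with a throwaway frame):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

# A DataFrame exposes per-column dtypes, not a single dtype
print(df.dtypes)             # one dtype per column
print(df["a"].dtype)         # an individual column (a Series) does have .dtype
print(hasattr(df, "dtype"))  # the attribute the error message refers to
```

So the error arises when code written for a Series (or NumPy array) receives a whole DataFrame, which is consistent with the maintainer's reply below that pandas input was not yet supported on master.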

mfeurer (Contributor) commented Aug 4, 2020

We now have an example showing how to obtain information from the trained pipelines: https://automl.github.io/auto-sklearn/development/examples/example_get_pipeline_components.html

This is currently in the development branch only but will be available in the next release.

@akshayparanjape we did not test pandas support in the master branch. Please use NumPy arrays there. We will support pandas DataFrames in the next release.

mfeurer (Contributor) commented Sep 2, 2020

Hi everyone, the previously mentioned example is now available in the master branch and in the main documentation: https://automl.github.io/auto-sklearn/master/examples/40_advanced/example_get_pipeline_components.html#sphx-glr-examples-40-advanced-example-get-pipeline-components-py

Please reopen if this issue is still of interest to you and you need help adapting it to a specific model.
