Description
Describe the bug
When building extensions to auto-sklearn, one has to "register" them with .add_components.
I'm building a library that wraps auto-sklearn with various extensions, and I provide users with a script in which they can specify which components should be "turned on".
When I use this script with multiprocessing (passing n_jobs > 1 to the AutoSklearnClassifier), all of my runs crash because the worker processes don't see the custom components I added.
To Reproduce
Below is a simplified scenario (see the .py files to reproduce at the end of this section), using auto-sklearn's example code for extending data preprocessors. I added a boolean switch, PROTECTED_C, to show the two ways of writing this code.
If you run the script with n_jobs = 1, there are no issues.
However, if you run the script with n_jobs > 1, i.e. python main.py -n 2, then whether you observe an issue depends on PROTECTED_C:
- if PROTECTED_C = True: all auto-sklearn runs fail. The output is below; note how the avail_preprocessors: ... line printed by the worker processes is missing NoPreprocessing. Please see the comments in main.py below, but I think that this is how one should structure driver scripts that use auto-sklearn, and thus the failed runs are unexpected behavior:
protected set-up code? True
avail_preprocessors: ['feature_type']
parsed args: Namespace(mode='kitchen-sink', n_jobs=2)
Configuring automl...
(__main__) avail_preprocessors: ['feature_type', 'NoPreprocessing']
protected set-up code? True
avail_preprocessors: ['feature_type']
[WARNING] [2022-11-09 23:05:02,291:Client-EnsembleBuilder] No runs were available to build an ensemble from
protected set-up code? True
avail_preprocessors: ['feature_type']
[WARNING] [2022-11-09 23:05:05,060:Client-EnsembleBuilder] No runs were available to build an ensemble from
protected set-up code? True
avail_preprocessors: ['feature_type']
[WARNING] [2022-11-09 23:05:07,842:Client-EnsembleBuilder] No runs were available to build an ensemble from
protected set-up code? True
avail_preprocessors: ['feature_type']
[WARNING] [2022-11-09 23:05:07,984:Client-EnsembleBuilder] No runs were available to build an ensemble from
protected set-up code? True
avail_preprocessors: ['feature_type']
[WARNING] [2022-11-09 23:05:08,072:Client-EnsembleBuilder] No runs were available to build an ensemble from
protected set-up code? True
avail_preprocessors: ['feature_type']
#########
CRASHED
ValueError("The provided component 'NoPreprocessing' for the key 'data_preprocessor' in the 'include' argument is not valid. The supported components for the step 'data_preprocessor' for this task are ['feature_type']")
#########
CRASHED
ValueError("The provided component 'NoPreprocessing' for the key 'data_preprocessor' in the 'include' argument is not valid. The supported components for the step 'data_preprocessor' for this task are ['feature_type']")
#########
CRASHED
ValueError("The provided component 'NoPreprocessing' for the key 'data_preprocessor' in the 'include' argument is not valid. The supported components for the step 'data_preprocessor' for this task are ['feature_type']")
#########
CRASHED
ValueError("The provided component 'NoPreprocessing' for the key 'data_preprocessor' in the 'include' argument is not valid. The supported components for the step 'data_preprocessor' for this task are ['feature_type']")
#########
CRASHED
ValueError("The provided component 'NoPreprocessing' for the key 'data_preprocessor' in the 'include' argument is not valid. The supported components for the step 'data_preprocessor' for this task are ['feature_type']")
auto-sklearn results:
Dataset name: ef3cd4b2-6082-11ed-8f9c-0242ac110002
Metric: accuracy
Number of target algorithm runs: 5
Number of successful target algorithm runs: 0
Number of crashed target algorithm runs: 5
Number of target algorithms that exceeded the time limit: 0
Number of target algorithms that exceeded the memory limit: 0
- if PROTECTED_C = False: no issues, but this relies on an unusually structured script. Output:
protected set-up code? False
parsed args: Namespace(mode='kitchen-sink', n_jobs=2)
Configuring automl...
avail_preprocessors: ['feature_type', 'NoPreprocessing']
(__main__) avail_preprocessors: ['feature_type', 'NoPreprocessing']
protected set-up code? False
parsed args: Namespace(mode='kitchen-sink', n_jobs=2)
Configuring automl...
avail_preprocessors: ['feature_type', 'NoPreprocessing']
protected set-up code? False
parsed args: Namespace(mode='kitchen-sink', n_jobs=2)
Configuring automl...
avail_preprocessors: ['feature_type', 'NoPreprocessing']
protected set-up code? False
parsed args: Namespace(mode='kitchen-sink', n_jobs=2)
Configuring automl...
avail_preprocessors: ['feature_type', 'NoPreprocessing']
protected set-up code? False
parsed args: Namespace(mode='kitchen-sink', n_jobs=2)
Configuring automl...
avail_preprocessors: ['feature_type', 'NoPreprocessing']
auto-sklearn results:
Dataset name: 0a97408d-6083-11ed-9024-0242ac110002
Metric: accuracy
Best validation score: 0.943262
Number of target algorithm runs: 5
Number of successful target algorithm runs: 5
Number of crashed target algorithm runs: 0
Number of target algorithms that exceeded the time limit: 0
Number of target algorithms that exceeded the memory limit: 0
main.py:
PROTECTED_C = True
print(f"protected set-up code? {PROTECTED_C}")
print()
from smac.tae import StatusType
import autosklearn.classification
from autosklearn.pipeline.components.data_preprocessing import DataPreprocessorChoice
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
import sklearn.metrics
import parse_args

def configure_automl(args):
    print("Configuring automl...")
    # ... "register" custom components with auto-sklearn, depending on args
    # For example, suppose a "simple" mode is requested, with no preprocessing
    from no_preprocessing import NoPreprocessing
    autosklearn.pipeline.components.data_preprocessing.add_preprocessor(NoPreprocessing)


if not PROTECTED_C:
    # This would be weird...it only makes sense to parse args if this module is being executed
    args = parse_args.parse_args()
    configure_automl(args)

# Runs on every import of this module, including the re-import done by each worker process
avail_preprocessors = list(DataPreprocessorChoice.get_components())
print(f"avail_preprocessors: {avail_preprocessors}")

if __name__ == "__main__":
    if PROTECTED_C:
        # This is where I would expect to see argparsing logic, since it's only relevant if this script is being executed
        args = parse_args.parse_args()
        configure_automl(args)

    avail_preprocessors = list(DataPreprocessorChoice.get_components())
    print(f"(__main__) avail_preprocessors: {avail_preprocessors}")

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y)

    clf = autosklearn.classification.AutoSklearnClassifier(
        time_left_for_this_task=120,
        include={"data_preprocessor": ["NoPreprocessing"]},
        # The two flags below are provided to speed up calculations
        # Not recommended for a real implementation
        initial_configurations_via_metalearning=0,
        smac_scenario_args={"runcount_limit": 5},
        n_jobs=args.n_jobs,
    )
    clf.fit(X_train, y_train)
    print()

    # Print out the error messages from crashed runs
    for run_key in clf.automl_.runhistory_.data:
        run_val = clf.automl_.runhistory_.data[run_key]
        if run_val.status == StatusType.CRASHED:
            print("#########")
            print("CRASHED")
            print(run_val.additional_info['error'])
            print()

    print(clf.sprint_statistics())

no_preprocessing.py:
from typing import Optional
from ConfigSpace.configuration_space import ConfigurationSpace
from autosklearn.askl_typing import FEAT_TYPE_TYPE
from autosklearn.pipeline.components.base import AutoSklearnPreprocessingAlgorithm
from autosklearn.pipeline.constants import SPARSE, DENSE, UNSIGNED_DATA, INPUT

class NoPreprocessing(AutoSklearnPreprocessingAlgorithm):
    def __init__(self, **kwargs):
        """This preprocessor does not change the data"""
        # Some internal checks make sure parameters are set
        for key, val in kwargs.items():
            setattr(self, key, val)

    def fit(self, X, Y=None):
        return self

    def transform(self, X):
        return X

    @staticmethod
    def get_properties(dataset_properties=None):
        return {
            "shortname": "NoPreprocessing",
            "name": "NoPreprocessing",
            "handles_regression": True,
            "handles_classification": True,
            "handles_multiclass": True,
            "handles_multilabel": True,
            "handles_multioutput": True,
            "is_deterministic": True,
            "input": (SPARSE, DENSE, UNSIGNED_DATA),
            "output": (INPUT,),
        }

    @staticmethod
    def get_hyperparameter_search_space(
        feat_type: Optional[FEAT_TYPE_TYPE] = None, dataset_properties=None
    ):
        return ConfigurationSpace()  # Return an empty configuration space, as there are no hyperparameters

parse_args.py:
import argparse

def parse_args():
    parser = argparse.ArgumentParser(
        description="Investigate issues with Third Party components and concurrency",
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )
    parser.add_argument(
        "-n", "--n_jobs", type=int, default=1,
        help="The number of jobs to run in parallel for fit(). -1 means using all processors.",
    )
    parser.add_argument(
        "-m", "--mode", type=str,
        choices=["kitchen-sink", "very-simple", "interpretable-models"],
        default="kitchen-sink",
        help="Dictates what is included or not in the search space of auto-sklearn.",
    )
    args = parser.parse_args()
    print(f"parsed args: {args}")
    return args

Environment and installation:
I'm running all this using the auto-sklearn docker image built off master, mentioned here
- OS == Linux (Docker)
- Python version == 3.8.10
- Auto-sklearn version == 0.15.0
Notes
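I think this is because of how multiprocessing works when using the "spawn" start method: each worker process re-imports main.py, but the code under the if __name__ == "__main__": guard does not run there, so the custom components are never registered in the workers. This would match the output above, where the workers print the module-level lines but list only 'feature_type'.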
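To illustrate the suspected mechanism without auto-sklearn, here is a minimal sketch. It assumes that a freshly started worker re-executes the main script under the module name "__mp_main__" (which is how the standard library's multiprocessing prepares the main module for non-fork start methods) and imitates that step with runpy instead of starting real processes. DRIVER_SOURCE, REGISTRY, register and registry_seen_as are hypothetical names made up for this example; they are not part of auto-sklearn.

```python
import runpy
import tempfile
import textwrap
from pathlib import Path

# Hypothetical stand-in for a driver script such as main.py: a module-level
# registry that is only extended inside the __main__ guard.
DRIVER_SOURCE = textwrap.dedent(
    '''
    REGISTRY = ["feature_type"]

    def register(name):
        REGISTRY.append(name)

    if __name__ == "__main__":
        register("NoPreprocessing")
    '''
)


def registry_seen_as(run_name):
    """Execute the driver under the given module name and return its registry."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "driver.py"
        path.write_text(DRIVER_SOURCE)
        module_globals = runpy.run_path(str(path), run_name=run_name)
    return module_globals["REGISTRY"]


# How `python driver.py` executes the script in the parent process.
parent_view = registry_seen_as("__main__")
# How a worker process would re-import the same script (assumption stated above).
worker_view = registry_seen_as("__mp_main__")

print("parent:", parent_view)  # parent: ['feature_type', 'NoPreprocessing']
print("worker:", worker_view)  # worker: ['feature_type']
```

The worker-side registry lacks the guarded registration, which mirrors the avail_preprocessors: ['feature_type'] lines in the PROTECTED_C = True output. It is only an analogy for the real component registry, not a trace of auto-sklearn's internals.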
Maybe these portions of the codebase are relevant to this: