Describe the bug
When building extensions to auto-sklearn, one has to "register" them with the add_* functions (e.g. add_preprocessor).
I'm building a library that surrounds auto-sklearn with various extensions, and I provide the user with a script where they can specify which components should be "turned on".
When I use this script with multiprocessing (passing n_jobs > 1 to the AutoSklearnClassifier), all of my runs crash because the worker processes don't see the custom components I added.
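For concreteness, "registering" here means the same add_preprocessor call that appears in main.py further down:

import autosklearn.pipeline.components.data_preprocessing
from no_preprocessing import NoPreprocessing

# Make the custom component visible to auto-sklearn's data-preprocessing step
autosklearn.pipeline.components.data_preprocessing.add_preprocessor(NoPreprocessing)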
To Reproduce
Below is a simplified scenario (see the .py files to reproduce at the end of this section), using auto-sklearn's example code for extending data preprocessors. I added a boolean switch, PROTECTED_C, to show the two ways of writing this code.
If you run the script with n_jobs = 1, there are no issues.
However, if you run the script with n_jobs > 1, e.g. python main.py -n 2, then you observe an issue depending on PROTECTED_C:
If PROTECTED_C = True: all auto-sklearn runs fail; the output is below. Note how the avail_preprocessors: ... printed by the worker processes is missing NoPreprocessing. Please see the comments in main.py below, but I think this is how driver scripts that use auto-sklearn should be structured, and thus the failed runs are unexpected behavior:
protected set-up code? True
avail_preprocessors: ['feature_type']
parsed args: Namespace(mode='kitchen-sink', n_jobs=2)
Configuring automl...
(__main__) avail_preprocessors: ['feature_type', 'NoPreprocessing']
protected set-up code? True
avail_preprocessors: ['feature_type']
[WARNING] [2022-11-09 23:05:02,291:Client-EnsembleBuilder] No runs were available to build an ensemble from
protected set-up code? True
avail_preprocessors: ['feature_type']
[WARNING] [2022-11-09 23:05:05,060:Client-EnsembleBuilder] No runs were available to build an ensemble from
protected set-up code? True
avail_preprocessors: ['feature_type']
[WARNING] [2022-11-09 23:05:07,842:Client-EnsembleBuilder] No runs were available to build an ensemble from
protected set-up code? True
avail_preprocessors: ['feature_type']
[WARNING] [2022-11-09 23:05:07,984:Client-EnsembleBuilder] No runs were available to build an ensemble from
protected set-up code? True
avail_preprocessors: ['feature_type']
[WARNING] [2022-11-09 23:05:08,072:Client-EnsembleBuilder] No runs were available to build an ensemble from
protected set-up code? True
avail_preprocessors: ['feature_type']
#########
CRASHED
ValueError("The provided component 'NoPreprocessing' for the key 'data_preprocessor' in the 'include' argument is not valid. The supported components for the step 'data_preprocessor' for this task are ['feature_type']")
#########
CRASHED
ValueError("The provided component 'NoPreprocessing' for the key 'data_preprocessor' in the 'include' argument is not valid. The supported components for the step 'data_preprocessor' for this task are ['feature_type']")
#########
CRASHED
ValueError("The provided component 'NoPreprocessing' for the key 'data_preprocessor' in the 'include' argument is not valid. The supported components for the step 'data_preprocessor' for this task are ['feature_type']")
#########
CRASHED
ValueError("The provided component 'NoPreprocessing' for the key 'data_preprocessor' in the 'include' argument is not valid. The supported components for the step 'data_preprocessor' for this task are ['feature_type']")
#########
CRASHED
ValueError("The provided component 'NoPreprocessing' for the key 'data_preprocessor' in the 'include' argument is not valid. The supported components for the step 'data_preprocessor' for this task are ['feature_type']")
auto-sklearn results:
Dataset name: ef3cd4b2-6082-11ed-8f9c-0242ac110002
Metric: accuracy
Number of target algorithm runs: 5
Number of successful target algorithm runs: 0
Number of crashed target algorithm runs: 5
Number of target algorithms that exceeded the time limit: 0
Number of target algorithms that exceeded the memory limit: 0
If PROTECTED_C = False: no issues, but this relies on an unusually structured script. Output:
protected set-up code? False
parsed args: Namespace(mode='kitchen-sink', n_jobs=2)
Configuring automl...
avail_preprocessors: ['feature_type', 'NoPreprocessing']
(__main__) avail_preprocessors: ['feature_type', 'NoPreprocessing']
protected set-up code? False
parsed args: Namespace(mode='kitchen-sink', n_jobs=2)
Configuring automl...
avail_preprocessors: ['feature_type', 'NoPreprocessing']
protected set-up code? False
parsed args: Namespace(mode='kitchen-sink', n_jobs=2)
Configuring automl...
avail_preprocessors: ['feature_type', 'NoPreprocessing']
protected set-up code? False
parsed args: Namespace(mode='kitchen-sink', n_jobs=2)
Configuring automl...
avail_preprocessors: ['feature_type', 'NoPreprocessing']
protected set-up code? False
parsed args: Namespace(mode='kitchen-sink', n_jobs=2)
Configuring automl...
avail_preprocessors: ['feature_type', 'NoPreprocessing']
auto-sklearn results:
Dataset name: 0a97408d-6083-11ed-9024-0242ac110002
Metric: accuracy
Best validation score: 0.943262
Number of target algorithm runs: 5
Number of successful target algorithm runs: 5
Number of crashed target algorithm runs: 0
Number of target algorithms that exceeded the time limit: 0
Number of target algorithms that exceeded the memory limit: 0
main.py:
PROTECTED_C = True
print(f"protected set-up code? {PROTECTED_C}")
print()

from smac.tae import StatusType

import autosklearn.classification
from autosklearn.pipeline.components.data_preprocessing import DataPreprocessorChoice
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
import sklearn.metrics

import parse_args


def configure_automl(args):
    print("Configuring automl...")
    # ... "register" custom components with auto-sklearn, depending on args
    # For example, suppose a "simple" mode is requested, with no preprocessing
    from no_preprocessing import NoPreprocessing

    autosklearn.pipeline.components.data_preprocessing.add_preprocessor(NoPreprocessing)


if not PROTECTED_C:
    # This would be weird... it only makes sense to parse args if this module is being executed
    args = parse_args.parse_args()
    configure_automl(args)

avail_preprocessors = list(DataPreprocessorChoice.get_components())
print(f"avail_preprocessors: {avail_preprocessors}")

if __name__ == "__main__":
    if PROTECTED_C:
        # This is where I would expect to see arg-parsing logic, since it's only relevant if this script is being executed
        args = parse_args.parse_args()
        configure_automl(args)

    avail_preprocessors = list(DataPreprocessorChoice.get_components())
    print(f"(__main__) avail_preprocessors: {avail_preprocessors}")

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    clf = autosklearn.classification.AutoSklearnClassifier(
        time_left_for_this_task=120,
        include={"data_preprocessor": ["NoPreprocessing"]},
        # The two flags below are provided to speed up calculations
        # Not recommended for a real implementation
        initial_configurations_via_metalearning=0,
        smac_scenario_args={"runcount_limit": 5},
        n_jobs=args.n_jobs,
    )
    clf.fit(X_train, y_train)
    print()

    # Print out the error messages from crashed runs
    for run_key in clf.automl_.runhistory_.data:
        run_val = clf.automl_.runhistory_.data[run_key]
        if run_val.status == StatusType.CRASHED:
            print("#########")
            print("CRASHED")
            print(run_val.additional_info["error"])
            print()

    print(clf.sprint_statistics())
no_preprocessing.py:
from typing import Optional

from ConfigSpace.configuration_space import ConfigurationSpace

from autosklearn.askl_typing import FEAT_TYPE_TYPE
from autosklearn.pipeline.components.base import AutoSklearnPreprocessingAlgorithm
from autosklearn.pipeline.constants import SPARSE, DENSE, UNSIGNED_DATA, INPUT


class NoPreprocessing(AutoSklearnPreprocessingAlgorithm):
    def __init__(self, **kwargs):
        """This preprocessor does not change the data"""
        # Some internal checks make sure parameters are set
        for key, val in kwargs.items():
            setattr(self, key, val)

    def fit(self, X, Y=None):
        return self

    def transform(self, X):
        return X

    @staticmethod
    def get_properties(dataset_properties=None):
        return {
            "shortname": "NoPreprocessing",
            "name": "NoPreprocessing",
            "handles_regression": True,
            "handles_classification": True,
            "handles_multiclass": True,
            "handles_multilabel": True,
            "handles_multioutput": True,
            "is_deterministic": True,
            "input": (SPARSE, DENSE, UNSIGNED_DATA),
            "output": (INPUT,),
        }

    @staticmethod
    def get_hyperparameter_search_space(
        feat_type: Optional[FEAT_TYPE_TYPE] = None, dataset_properties=None
    ):
        return ConfigurationSpace()  # Return an empty configuration as there is None
parse_args.py:
import argparse


def parse_args():
    parser = argparse.ArgumentParser(
        description="Investigate issues with Third Party components and concurrency",
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )
    parser.add_argument("-n", "--n_jobs", type=int, default=1,
                        help="The number of jobs to run in parallel for fit(). -1 means using all processors.")
    parser.add_argument("-m", "--mode", type=str,
                        choices=["kitchen-sink", "very-simple", "interpretable-models"],
                        default="kitchen-sink",
                        help="Dictates what is included or not in the search space of auto-sklearn.")
    args = parser.parse_args()
    print(f"parsed args: {args}")
    return args
Environment and installation:
I'm running all this using the auto-sklearn docker image built off master, mentioned here
OS == Linux (Docker)
Python version == 3.8.10
Auto-sklearn version == 0.15.0
Notes
I think this is because of how multiprocessing works with the "spawn" start method: spawned workers re-import the driver module from scratch, so any component registration that happens only inside the if __name__ == "__main__": block never runs in them, and the registry they see is the default one.
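As a sanity check of this theory, here is a minimal sketch, independent of auto-sklearn, of how registration done under the __main__ guard gets lost under "spawn" (REGISTRY is a hypothetical stand-in for the component registry):

# spawn_demo.py -- hypothetical, minimal illustration; not one of the repro scripts.
# Under "spawn", a worker re-imports this module from scratch: module-level code
# runs again in the child, but the __main__-guarded block below does not.
import multiprocessing as mp

REGISTRY = ["feature_type"]  # stand-in for auto-sklearn's component registry


def show_registry():
    print(f"worker sees: {REGISTRY}")


if __name__ == "__main__":
    # Registration done only under the __main__ guard is invisible to spawned
    # workers, mirroring how add_preprocessor() gets "lost" when PROTECTED_C = True.
    REGISTRY.append("NoPreprocessing")
    mp.set_start_method("spawn")
    p = mp.Process(target=show_registry)
    p.start()
    p.join()  # prints: worker sees: ['feature_type']
    print(f"parent sees: {REGISTRY}")  # ['feature_type', 'NoPreprocessing']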
Maybe these portions of the codebase are relevant to this:
auto-sklearn/autosklearn/evaluation/__init__.py (line 388 in 5c69ddf)
auto-sklearn/autosklearn/evaluation/abstract_evaluator.py (line 291 in a7f73f1)
Sorry to see that this issue has come back. I'm aware there are some issues related to multiprocessing; I thought at one point this had been fixed, but maybe not, or maybe it regressed. As you can imagine, multiprocessing testing like this can be a bit complicated, so I appreciate the scripts.
I'm currently working on updating auto-sklearn to the latest scikit-learn and our other core dependencies, but these scripts will be helpful when I get a chance to look at this!