Third Party Components not shared with spawned child processes when n_jobs > 1 #1607

Open
AmirAlavi opened this issue Nov 9, 2022 · 1 comment

Comments

@AmirAlavi

Describe the bug

When building extensions to auto-sklearn, one has to "register" them with the relevant add_* function (e.g. add_preprocessor, as in main.py below).

I'm building a library that surrounds auto-sklearn with various extensions, and I provide the user with a script, where they can specify which components should be "turned on".

When I use this script with multiprocessing (passing n_jobs > 1 to the AutoSklearnClassifier), all of my runs crash, because they don't see the custom components I added.

To Reproduce

Below is a simplified scenario (see the .py files to reproduce at the end of this section), using Auto-sklearn's example code for extending Data Preprocessors. I added a boolean switch, PROTECTED_C, to show the two ways of writing this code.

If you run the script with n_jobs = 1, there are no issues.

However, if you run the script with n_jobs > 1 (e.g. python main.py -n 2), you observe an issue depending on PROTECTED_C:

  • if PROTECTED_C = True: all auto-sklearn runs fail. In the output below, note that the avail_preprocessors: ... line printed by the worker processes is missing NoPreprocessing. As the comments in main.py explain, I think this is how driver scripts that use auto-sklearn should be structured, so the failed runs are unexpected behavior:
protected set-up code? True

avail_preprocessors: ['feature_type']
parsed args: Namespace(mode='kitchen-sink', n_jobs=2)
Configuring automl...
(__main__) avail_preprocessors: ['feature_type', 'NoPreprocessing']
protected set-up code? True
avail_preprocessors: ['feature_type']
[WARNING] [2022-11-09 23:05:02,291:Client-EnsembleBuilder] No runs were available to build an ensemble from
protected set-up code? True
avail_preprocessors: ['feature_type']
[WARNING] [2022-11-09 23:05:05,060:Client-EnsembleBuilder] No runs were available to build an ensemble from
protected set-up code? True
avail_preprocessors: ['feature_type']
[WARNING] [2022-11-09 23:05:07,842:Client-EnsembleBuilder] No runs were available to build an ensemble from
protected set-up code? True
avail_preprocessors: ['feature_type']
[WARNING] [2022-11-09 23:05:07,984:Client-EnsembleBuilder] No runs were available to build an ensemble from
protected set-up code? True
avail_preprocessors: ['feature_type']
[WARNING] [2022-11-09 23:05:08,072:Client-EnsembleBuilder] No runs were available to build an ensemble from
protected set-up code? True
avail_preprocessors: ['feature_type']

#########
CRASHED
ValueError("The provided component 'NoPreprocessing' for the key 'data_preprocessor' in the 'include' argument is not valid. The supported components for the step 'data_preprocessor' for this task are ['feature_type']")

#########
CRASHED
ValueError("The provided component 'NoPreprocessing' for the key 'data_preprocessor' in the 'include' argument is not valid. The supported components for the step 'data_preprocessor' for this task are ['feature_type']")

#########
CRASHED
ValueError("The provided component 'NoPreprocessing' for the key 'data_preprocessor' in the 'include' argument is not valid. The supported components for the step 'data_preprocessor' for this task are ['feature_type']")

#########
CRASHED
ValueError("The provided component 'NoPreprocessing' for the key 'data_preprocessor' in the 'include' argument is not valid. The supported components for the step 'data_preprocessor' for this task are ['feature_type']")

#########
CRASHED
ValueError("The provided component 'NoPreprocessing' for the key 'data_preprocessor' in the 'include' argument is not valid. The supported components for the step 'data_preprocessor' for this task are ['feature_type']")

auto-sklearn results:
  Dataset name: ef3cd4b2-6082-11ed-8f9c-0242ac110002
  Metric: accuracy
  Number of target algorithm runs: 5
  Number of successful target algorithm runs: 0
  Number of crashed target algorithm runs: 5
  Number of target algorithms that exceeded the time limit: 0
  Number of target algorithms that exceeded the memory limit: 0
  • if PROTECTED_C = False: no issues, but this relies on an unusually structured script. Output:
protected set-up code? False

parsed args: Namespace(mode='kitchen-sink', n_jobs=2)
Configuring automl...
avail_preprocessors: ['feature_type', 'NoPreprocessing']
(__main__) avail_preprocessors: ['feature_type', 'NoPreprocessing']
protected set-up code? False
parsed args: Namespace(mode='kitchen-sink', n_jobs=2)
Configuring automl...
avail_preprocessors: ['feature_type', 'NoPreprocessing']
protected set-up code? False
parsed args: Namespace(mode='kitchen-sink', n_jobs=2)
Configuring automl...
avail_preprocessors: ['feature_type', 'NoPreprocessing']
protected set-up code? False
parsed args: Namespace(mode='kitchen-sink', n_jobs=2)
Configuring automl...
avail_preprocessors: ['feature_type', 'NoPreprocessing']
protected set-up code? False
parsed args: Namespace(mode='kitchen-sink', n_jobs=2)
Configuring automl...
avail_preprocessors: ['feature_type', 'NoPreprocessing']

auto-sklearn results:
  Dataset name: 0a97408d-6083-11ed-9024-0242ac110002
  Metric: accuracy
  Best validation score: 0.943262
  Number of target algorithm runs: 5
  Number of successful target algorithm runs: 5
  Number of crashed target algorithm runs: 0
  Number of target algorithms that exceeded the time limit: 0
  Number of target algorithms that exceeded the memory limit: 0

main.py:

PROTECTED_C = True
print(f"protected set-up code? {PROTECTED_C}")
print()

from smac.tae import StatusType

import autosklearn.classification
from autosklearn.pipeline.components.data_preprocessing import DataPreprocessorChoice

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
import sklearn.metrics

import parse_args


def configure_automl(args):
    print("Configuring automl...")
    # ... "register" custom components with auto-sklearn, depending on args
    # For example, suppose a "simple" mode is requested, with no preprocessing
    from no_preprocessing import NoPreprocessing
    autosklearn.pipeline.components.data_preprocessing.add_preprocessor(NoPreprocessing)

if not PROTECTED_C:
    # This would be weird...it only makes sense to parse args if this module is being executed
    args = parse_args.parse_args()
    configure_automl(args)

avail_preprocessors = list(DataPreprocessorChoice.get_components())
print(f"avail_preprocessors: {avail_preprocessors}")

if __name__ == "__main__":
    if PROTECTED_C:
        # This is where I would expect to see argparsing logic, since it's only relevant if this script is being executed
        args = parse_args.parse_args()
        configure_automl(args)
    
    avail_preprocessors = list(DataPreprocessorChoice.get_components())
    print(f"(__main__) avail_preprocessors: {avail_preprocessors}")
    
    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    
    clf = autosklearn.classification.AutoSklearnClassifier(
        time_left_for_this_task=120,
        include={"data_preprocessor": ["NoPreprocessing"]},
        # The two flags below are provided to speed up calculations
        # Not recommended for a real implementation
        initial_configurations_via_metalearning=0,
        smac_scenario_args={"runcount_limit": 5},
        n_jobs=args.n_jobs
    )
    clf.fit(X_train, y_train)
    
    print()
    # Print out the error messages from crashed runs
    for run_key in clf.automl_.runhistory_.data:
        run_val = clf.automl_.runhistory_.data[run_key]
        if run_val.status == StatusType.CRASHED:
            print("#########")
            print("CRASHED")
            print(run_val.additional_info['error'])
            print()
            
    print(clf.sprint_statistics())

no_preprocessing.py:

from typing import Optional

from ConfigSpace.configuration_space import ConfigurationSpace

from autosklearn.askl_typing import FEAT_TYPE_TYPE
from autosklearn.pipeline.components.base import AutoSklearnPreprocessingAlgorithm
from autosklearn.pipeline.constants import SPARSE, DENSE, UNSIGNED_DATA, INPUT


class NoPreprocessing(AutoSklearnPreprocessingAlgorithm):
    def __init__(self, **kwargs):
        """This preprocessor does not change the data"""
        # Some internal checks make sure parameters are set
        for key, val in kwargs.items():
            setattr(self, key, val)

    def fit(self, X, Y=None):
        return self

    def transform(self, X):
        return X

    @staticmethod
    def get_properties(dataset_properties=None):
        return {
            "shortname": "NoPreprocessing",
            "name": "NoPreprocessing",
            "handles_regression": True,
            "handles_classification": True,
            "handles_multiclass": True,
            "handles_multilabel": True,
            "handles_multioutput": True,
            "is_deterministic": True,
            "input": (SPARSE, DENSE, UNSIGNED_DATA),
            "output": (INPUT,),
        }

    @staticmethod
    def get_hyperparameter_search_space(
        feat_type: Optional[FEAT_TYPE_TYPE] = None, dataset_properties=None
    ):
        return ConfigurationSpace()  # Return an empty configuration space, as there are no hyperparameters

parse_args.py:

import argparse

def parse_args():
    parser = argparse.ArgumentParser(
        description="Investigate issues with Third Party components and concurrency",
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )
    parser.add_argument("-n", "--n_jobs", type=int, default=1, help="The number of jobs to run in parallel for fit(). -1 means using all processors.")
    parser.add_argument("-m", "--mode", type=str, choices=["kitchen-sink", "very-simple", "interpretable-models"], default="kitchen-sink", help="Dictates what is included or not in the search space of auto-sklearn.")
    args = parser.parse_args()
    print(f"parsed args: {args}")
    return args

Environment and installation:

I'm running all this using the auto-sklearn docker image built off master, mentioned here

  • OS == Linux (Docker)
  • Python version == 3.8.10
  • Auto-sklearn version == 0.15.0

Notes

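I think this is because of how multiprocessing works with the "spawn" start method: each worker starts a fresh interpreter and re-imports the main module, so anything under the if __name__ == "__main__": guard, including the call that registers my components, never runs in the workers.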
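The effect of the "spawn" start method can be reproduced without auto-sklearn. The sketch below is only an illustration under stated assumptions: REGISTRY is a hypothetical stand-in for auto-sklearn's component registry, and worker_view and run_demo are made-up names. It writes a small driver script to a temporary directory and runs it, so the driver has a real main module for the spawned worker to re-import.

```python
import subprocess
import sys
import tempfile
import textwrap
from pathlib import Path

# Hypothetical driver script. REGISTRY stands in for a component registry;
# nothing here calls auto-sklearn.
DRIVER = textwrap.dedent('''
    import multiprocessing as mp

    REGISTRY = ["feature_type"]

    def worker_view(_):
        # Runs in the spawned worker and reports what that process sees.
        return list(REGISTRY)

    if __name__ == "__main__":
        # Registration happens only in the parent, as with PROTECTED_C = True.
        REGISTRY.append("NoPreprocessing")
        with mp.get_context("spawn").Pool(1) as pool:
            child = pool.map(worker_view, [0])[0]
        print("parent:", REGISTRY)
        print("child:", child)
''')


def run_demo():
    """Run the driver in a fresh interpreter and return its output lines."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "driver.py"
        path.write_text(DRIVER)
        result = subprocess.run(
            [sys.executable, str(path)],
            capture_output=True, text=True, check=True,
        )
    return result.stdout.splitlines()


if __name__ == "__main__":
    for line in run_demo():
        print(line)
```

The parent line lists both entries while the child line lists only 'feature_type', which mirrors the avail_preprocessors output above: the spawned worker re-imports the driver, rebuilds REGISTRY from scratch, and skips the guarded block.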

Maybe these portions of the codebase are relevant to this:

additional_components=autosklearn.pipeline.components.base._addons,

# Add 3rd-party components to the list of 3rd-party components in case this
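The two snippets above seem to hint at shipping the registered add-ons to the worker and registering them again there. The sketch below is not auto-sklearn's code; it only illustrates that general idea on a toy registry. REGISTRY, worker_view, additional_components (as used here) and run_passing_demo are all hypothetical names chosen for the illustration.

```python
import subprocess
import sys
import tempfile
import textwrap
from pathlib import Path

# Hypothetical driver script: the parent passes its extra registry entries
# to the worker as an argument, and the worker re-registers them.
DRIVER = textwrap.dedent('''
    import multiprocessing as mp

    REGISTRY = ["feature_type"]

    def worker_view(additional_components):
        # Re-register whatever the parent shipped along with the task.
        for name in additional_components:
            if name not in REGISTRY:
                REGISTRY.append(name)
        return list(REGISTRY)

    if __name__ == "__main__":
        REGISTRY.append("NoPreprocessing")
        addons = [name for name in REGISTRY if name != "feature_type"]
        with mp.get_context("spawn").Pool(1) as pool:
            child = pool.apply(worker_view, (addons,))
        print("child:", child)
''')


def run_passing_demo():
    """Run the driver in a fresh interpreter and return its output lines."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "driver.py"
        path.write_text(DRIVER)
        result = subprocess.run(
            [sys.executable, str(path)],
            capture_output=True, text=True, check=True,
        )
    return result.stdout.splitlines()


if __name__ == "__main__":
    for line in run_passing_demo():
        print(line)
```

Here the worker reports both entries even though registration happened under the guard. The toy passes plain strings; real components are classes, so they would have to be picklable, which in practice means importable from a module the worker can also import (as no_preprocessing.py is).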

@eddiebergman
Contributor

Hi @AmirAlavi,

Sorry to see that this issue has come back. I'm aware there are some issues related to multiprocessing. I thought this had been fixed at one point; maybe it wasn't, or maybe it regressed. As you can imagine, testing multiprocessing behavior like this can be a bit complicated, so I appreciate the scripts.

I'm currently working on updating auto-sklearn to the latest scikit-learn and our other core dependencies, but these scripts will be helpful when I get a chance to look at this!

Many thanks,
Eddie
