Third Party Components not shared with spawned child processes when n_jobs > 1

## Describe the bug ##
When building extensions to auto-sklearn, one has to "register" them with `.add_components`.

I'm building a library that surrounds auto-sklearn with various extensions, and I provide the user with a script, where they can specify which components should be "turned on".

When I use this script with multiprocessing (passing `n_jobs > 1` to the AutoSklearnClassifier), all of my runs crash, because they don't see the custom components I added.

## To Reproduce ##
Below is a simplified scenario (see the .py files to reproduce at the end of this section), using auto-sklearn's example code for extending Data Preprocessors. I added a boolean switch, `PROTECTED_C`, to show the two ways of writing this code.

If you run the script with `n_jobs = 1`, there are no issues.

However, if you run the script with `n_jobs > 1`, e.g. `python main.py -n 2`, then the outcome depends on `PROTECTED_C`:

- if `PROTECTED_C` = `True`: all auto-sklearn runs fail. In the output below, note that the `avail_preprocessors: ...` line printed by the worker processes is missing `NoPreprocessing`. **Please see the comments in `main.py` below; I think this is how one should structure driver scripts that use auto-sklearn, so the failed runs are unexpected behavior**:
```
protected set-up code? True

avail_preprocessors: ['feature_type']
parsed args: Namespace(mode='kitchen-sink', n_jobs=2)
Configuring automl...
(__main__) avail_preprocessors: ['feature_type', 'NoPreprocessing']
protected set-up code? True
avail_preprocessors: ['feature_type']
[WARNING] [2022-11-09 23:05:02,291:Client-EnsembleBuilder] No runs were available to build an ensemble from
protected set-up code? True
avail_preprocessors: ['feature_type']
[WARNING] [2022-11-09 23:05:05,060:Client-EnsembleBuilder] No runs were available to build an ensemble from
protected set-up code? True
avail_preprocessors: ['feature_type']
[WARNING] [2022-11-09 23:05:07,842:Client-EnsembleBuilder] No runs were available to build an ensemble from
protected set-up code? True
avail_preprocessors: ['feature_type']
[WARNING] [2022-11-09 23:05:07,984:Client-EnsembleBuilder] No runs were available to build an ensemble from
protected set-up code? True
avail_preprocessors: ['feature_type']
[WARNING] [2022-11-09 23:05:08,072:Client-EnsembleBuilder] No runs were available to build an ensemble from
protected set-up code? True
avail_preprocessors: ['feature_type']

#########
CRASHED
ValueError("The provided component 'NoPreprocessing' for the key 'data_preprocessor' in the 'include' argument is not valid. The supported components for the step 'data_preprocessor' for this task are ['feature_type']")

#########
CRASHED
ValueError("The provided component 'NoPreprocessing' for the key 'data_preprocessor' in the 'include' argument is not valid. The supported components for the step 'data_preprocessor' for this task are ['feature_type']")

#########
CRASHED
ValueError("The provided component 'NoPreprocessing' for the key 'data_preprocessor' in the 'include' argument is not valid. The supported components for the step 'data_preprocessor' for this task are ['feature_type']")

#########
CRASHED
ValueError("The provided component 'NoPreprocessing' for the key 'data_preprocessor' in the 'include' argument is not valid. The supported components for the step 'data_preprocessor' for this task are ['feature_type']")

#########
CRASHED
ValueError("The provided component 'NoPreprocessing' for the key 'data_preprocessor' in the 'include' argument is not valid. The supported components for the step 'data_preprocessor' for this task are ['feature_type']")

auto-sklearn results:
  Dataset name: ef3cd4b2-6082-11ed-8f9c-0242ac110002
  Metric: accuracy
  Number of target algorithm runs: 5
  Number of successful target algorithm runs: 0
  Number of crashed target algorithm runs: 5
  Number of target algorithms that exceeded the time limit: 0
  Number of target algorithms that exceeded the memory limit: 0
```
- if `PROTECTED_C` = `False`: no issues, but this relies on an unusually structured script that parses arguments at module level. Output:
```
protected set-up code? False

parsed args: Namespace(mode='kitchen-sink', n_jobs=2)
Configuring automl...
avail_preprocessors: ['feature_type', 'NoPreprocessing']
(__main__) avail_preprocessors: ['feature_type', 'NoPreprocessing']
protected set-up code? False
parsed args: Namespace(mode='kitchen-sink', n_jobs=2)
Configuring automl...
avail_preprocessors: ['feature_type', 'NoPreprocessing']
protected set-up code? False
parsed args: Namespace(mode='kitchen-sink', n_jobs=2)
Configuring automl...
avail_preprocessors: ['feature_type', 'NoPreprocessing']
protected set-up code? False
parsed args: Namespace(mode='kitchen-sink', n_jobs=2)
Configuring automl...
avail_preprocessors: ['feature_type', 'NoPreprocessing']
protected set-up code? False
parsed args: Namespace(mode='kitchen-sink', n_jobs=2)
Configuring automl...
avail_preprocessors: ['feature_type', 'NoPreprocessing']

auto-sklearn results:
  Dataset name: 0a97408d-6083-11ed-9024-0242ac110002
  Metric: accuracy
  Best validation score: 0.943262
  Number of target algorithm runs: 5
  Number of successful target algorithm runs: 5
  Number of crashed target algorithm runs: 0
  Number of target algorithms that exceeded the time limit: 0
  Number of target algorithms that exceeded the memory limit: 0
```

main.py:
```python
PROTECTED_C = True
print(f"protected set-up code? {PROTECTED_C}")
print()

from smac.tae import StatusType

import autosklearn.classification
from autosklearn.pipeline.components.data_preprocessing import DataPreprocessorChoice

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
import sklearn.metrics

import parse_args


def configure_automl(args):
    print("Configuring automl...")
    # ... "register" custom components with auto-sklearn, depending on args
    # For example, suppose a "simple" mode is requested, with no preprocessing
    from no_preprocessing import NoPreprocessing
    autosklearn.pipeline.components.data_preprocessing.add_preprocessor(NoPreprocessing)

if not PROTECTED_C:
    # This would be weird...it only makes sense to parse args if this module is being executed
    args = parse_args.parse_args()
    configure_automl(args)

avail_preprocessors = list(DataPreprocessorChoice.get_components())
print(f"avail_preprocessors: {avail_preprocessors}")

if __name__ == "__main__":
    if PROTECTED_C:
        # This is where I would expect to see argparsing logic, since it's only relevant if this script is being executed
        args = parse_args.parse_args()
        configure_automl(args)
    
    avail_preprocessors = list(DataPreprocessorChoice.get_components())
    print(f"(__main__) avail_preprocessors: {avail_preprocessors}")
    
    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    
    clf = autosklearn.classification.AutoSklearnClassifier(
        time_left_for_this_task=120,
        include={"data_preprocessor": ["NoPreprocessing"]},
        # The two flags below are provided to speed up calculations
        # Not recommended for a real implementation
        initial_configurations_via_metalearning=0,
        smac_scenario_args={"runcount_limit": 5},
        n_jobs=args.n_jobs
    )
    clf.fit(X_train, y_train)
    
    print()
    # Print out the error messages from crashed runs
    for run_key in clf.automl_.runhistory_.data:
        run_val = clf.automl_.runhistory_.data[run_key]
        if run_val.status == StatusType.CRASHED:
            print("#########")
            print("CRASHED")
            print(run_val.additional_info['error'])
            print()
            
    print(clf.sprint_statistics())
```

no_preprocessing.py:
```python
from typing import Optional

from ConfigSpace.configuration_space import ConfigurationSpace

from autosklearn.askl_typing import FEAT_TYPE_TYPE
from autosklearn.pipeline.components.base import AutoSklearnPreprocessingAlgorithm
from autosklearn.pipeline.constants import SPARSE, DENSE, UNSIGNED_DATA, INPUT


class NoPreprocessing(AutoSklearnPreprocessingAlgorithm):
    def __init__(self, **kwargs):
        """This preprocessor does not change the data"""
        # Some internal checks make sure parameters are set
        for key, val in kwargs.items():
            setattr(self, key, val)

    def fit(self, X, Y=None):
        return self

    def transform(self, X):
        return X

    @staticmethod
    def get_properties(dataset_properties=None):
        return {
            "shortname": "NoPreprocessing",
            "name": "NoPreprocessing",
            "handles_regression": True,
            "handles_classification": True,
            "handles_multiclass": True,
            "handles_multilabel": True,
            "handles_multioutput": True,
            "is_deterministic": True,
            "input": (SPARSE, DENSE, UNSIGNED_DATA),
            "output": (INPUT,),
        }

    @staticmethod
    def get_hyperparameter_search_space(
        feat_type: Optional[FEAT_TYPE_TYPE] = None, dataset_properties=None
    ):
        return ConfigurationSpace()  # Return an empty configuration space, as there are no hyperparameters

```

parse_args.py:
```python
import argparse

def parse_args():
    parser = argparse.ArgumentParser(
        description="Investigate issues with Third Party components and concurrency",
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )
    parser.add_argument("-n", "--n_jobs", type=int, default=1, help="The number of jobs to run in parallel for fit(). -1 means using all processors.")
    parser.add_argument("-m", "--mode", type=str, choices=["kitchen-sink", "very-simple", "interpretable-models"], default="kitchen-sink", help="Dictates what is included or not in the search space of auto-sklearn.")
    args = parser.parse_args()
    print(f"parsed args: {args}")
    return args
```

## Environment and installation: ##
I'm running all of this in the auto-sklearn Docker image built off master, as mentioned [here](https://automl.github.io/auto-sklearn/master/installation.html).

* OS == Linux (Docker)
* Python version == 3.8.10
* Auto-sklearn version == 0.15.0

## Notes ##
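I think this comes from how multiprocessing starts the worker processes. With a start method such as "spawn", each child re-imports `main.py` from scratch but does not run the `if __name__ == "__main__":` block, so any component registered inside that block never gets registered in the child. This matches the output above: the workers re-print the module-level lines but only see `['feature_type']`.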

Maybe these portions of the codebase are relevant to this:

https://github.com/automl/auto-sklearn/blob/5c69ddf4584c5c7c4977203a1a579d042c6e3716/autosklearn/evaluation/__init__.py#L388

https://github.com/automl/auto-sklearn/blob/a7f73f1563a25a74200692f615fa44b34a8a942c/autosklearn/evaluation/abstract_evaluator.py#L291
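
The suspected mechanism can be reproduced without auto-sklearn. The sketch below is only an illustration under assumptions: `REGISTRY`, `worker`, and `run_demo` are hypothetical names, the plain list stands in for auto-sklearn's component registry, and the "spawn" start method is chosen explicitly (which start method auto-sklearn actually uses for its workers is not confirmed here). It writes a tiny script to a temporary file and runs it, because a spawned child needs a real main module to re-import.

```python
import subprocess
import sys
import tempfile
import textwrap
from pathlib import Path

# A stand-alone script that mimics the PROTECTED_C = True layout of main.py.
DEMO = textwrap.dedent('''
    import multiprocessing as mp

    # Hypothetical stand-in for a module-level component registry.
    REGISTRY = ["feature_type"]

    def worker(_):
        # Runs in the child: report what this process sees.
        return list(REGISTRY)

    if __name__ == "__main__":
        # Registration guarded by __main__, so only the parent runs it.
        REGISTRY.append("NoPreprocessing")
        ctx = mp.get_context("spawn")
        with ctx.Pool(1) as pool:
            child_view = pool.map(worker, [0])[0]
        print("parent:", REGISTRY)
        print("child:", child_view)
''')


def run_demo() -> str:
    """Write DEMO to a temporary file, run it, and return its stdout."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "spawn_demo.py"
        script.write_text(DEMO)
        result = subprocess.run(
            [sys.executable, str(script)],
            capture_output=True,
            text=True,
            check=True,
        )
        return result.stdout


if __name__ == "__main__":
    print(run_demo())
```

The parent reports both entries while the child reports only `feature_type`, which mirrors the `avail_preprocessors` lines in the failing run. Moving the `REGISTRY.append(...)` call to module level makes the child see both entries, which corresponds to the `PROTECTED_C = False` layout.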
