
Commit 96de622

[Add] documentation and example for parallel computation (#322)
* Add documentation and example for parallel computation
* Update examples/40_advanced/example_parallel_n_jobs.py
1 parent 28e1d47 commit 96de622

File tree

3 files changed, +89 -4 lines changed


docs/manual.rst

Lines changed: 21 additions & 0 deletions
@@ -48,3 +48,24 @@ Auto-PyTorch allows users to inspect the training results and statistics. The fo
>>> automl = TabularClassificationTask()
>>> automl.fit(X_train, y_train)
>>> automl.show_models()

Parallel computation
====================

In its default mode, *Auto-PyTorch* already uses two cores: the first is used for model building, the second for building an ensemble every time a new machine learning model has finished training.

Nevertheless, *Auto-PyTorch* also supports parallel Bayesian optimization via `Dask.distributed <https://distributed.dask.org/>`_. By providing the argument ``n_jobs`` at estimator construction, one can control the number of cores available to *Auto-PyTorch* (as shown in the example :ref:`sphx_glr_examples_40_advanced_example_parallel_n_jobs.py`). When multiple cores are available, *Auto-PyTorch* creates one worker per core and uses the available workers both to search for better machine learning models and to build an ensemble with them, until the time resource is exhausted.
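
For instance, a minimal sketch of such a construction (the full, runnable version is the example referenced above) would be:

.. code-block:: python

    from autoPyTorch.api.tabular_classification import TabularClassificationTask

    # One worker is created per requested job.
    api = TabularClassificationTask(n_jobs=2)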

**Note:** *Auto-PyTorch* requires all workers to have access to a shared file system for storing training data and models.

*Auto-PyTorch* employs `threadpoolctl <https://github.com/joblib/threadpoolctl/>`_ to control the number of threads used by scientific libraries such as numpy and scikit-learn. This is done exclusively while models are being built, not during inference: each pipeline may use at most one thread during training, but this limitation is not enforced at prediction and scoring time. You can control the number of resources employed by the pipelines by setting the following variables in your environment, prior to running *Auto-PyTorch*:

.. code-block:: shell-session

    $ export OPENBLAS_NUM_THREADS=1
    $ export MKL_NUM_THREADS=1
    $ export OMP_NUM_THREADS=1
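
Alternatively, the same limit can be applied from within a Python process. A minimal sketch using ``threadpoolctl`` directly (the library mentioned above) would be:

.. code-block:: python

    from threadpoolctl import threadpool_limits

    # Cap the BLAS/OpenMP thread pools at one thread inside this block.
    with threadpool_limits(limits=1):
        ...  # numpy / scikit-learn calls run single-threaded here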

For further information about how scikit-learn handles multiprocessing, please check the `Parallelism, resource management, and configuration <https://scikit-learn.org/stable/computing/parallelism.html>`_ documentation from the library.

examples/40_advanced/README.txt

Lines changed: 0 additions & 4 deletions
@@ -6,7 +6,3 @@ Advanced Tabular Dataset Examples
 =================================

 Advanced examples for using *Auto-PyTorch* on tabular datasets.
-We explain
-1. How to customise the search space
-2. How to split the data according to different resampling strategies
-3. How to visualize the results of Auto-PyTorch

examples/40_advanced/example_parallel_n_jobs.py

Lines changed: 68 additions & 0 deletions
@@ -0,0 +1,68 @@
"""
======================
Tabular Classification
======================

The following example shows how to fit a sample classification model
in parallel on 2 cores with AutoPyTorch.
"""
import os
import tempfile as tmp
import warnings

os.environ['JOBLIB_TEMP_FOLDER'] = tmp.gettempdir()
os.environ['OMP_NUM_THREADS'] = '1'
os.environ['OPENBLAS_NUM_THREADS'] = '1'
os.environ['MKL_NUM_THREADS'] = '1'

warnings.simplefilter(action='ignore', category=UserWarning)
warnings.simplefilter(action='ignore', category=FutureWarning)

import sklearn.datasets
import sklearn.model_selection

from autoPyTorch.api.tabular_classification import TabularClassificationTask

if __name__ == '__main__':
    ############################################################################
    # Data Loading
    # ============
    X, y = sklearn.datasets.fetch_openml(data_id=40981, return_X_y=True, as_frame=True)
    X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
        X,
        y,
        random_state=1,
    )

    ############################################################################
    # Build and fit a classifier
    # ==========================
    api = TabularClassificationTask(
        n_jobs=2,
        seed=42,
    )

    ############################################################################
    # Search for an ensemble of machine learning algorithms
    # =====================================================
    api.search(
        X_train=X_train,
        y_train=y_train,
        X_test=X_test.copy(),
        y_test=y_test.copy(),
        optimize_metric='accuracy',
        total_walltime_limit=300,
        func_eval_time_limit_secs=50,
        # Each one of the 2 jobs is allocated 3GB
        memory_limit=3072,
    )

    ############################################################################
    # Print the final ensemble performance
    # ====================================
    print(api.run_history, api.trajectory)
    y_pred = api.predict(X_test)
    score = api.score(y_pred, y_test)
    print(score)
    # Print the final ensemble built by AutoPyTorch
    print(api.show_models())
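
A note on the structure of the example: the ``if __name__ == '__main__'`` guard is required because Dask (and joblib's process-based backends) may spawn worker processes that re-import the main module, and the ``*_NUM_THREADS`` variables are set before numpy and scikit-learn are imported because the underlying BLAS/OpenMP libraries typically read them at load time.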

0 commit comments
