Commit 56e6ac0

mfeurer authored and eddiebergman committed
Undo accidental commit
1 parent 2007204 commit 56e6ac0

2 files changed: +18 -20 lines changed

doc/faq.rst

Lines changed: 15 additions & 19 deletions
@@ -31,30 +31,26 @@ General
 Optionally, you can measure the ability of this fitted model to generalize to unseen data by
 providing an optional testing pair (X_test/Y_test). For further details, please refer to the
 Example :ref:`sphx_glr_examples_40_advanced_example_pandas_train_test.py`.
+Supported formats for these training and testing pairs are: np.ndarray,
+pd.DataFrame, scipy.sparse.csr_matrix and python lists.

-Regarding the features, there are multiple things to consider:
+If your data contains categorical values (in the features or targets), autosklearn will automatically encode your
+data using a `sklearn.preprocessing.LabelEncoder <https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html>`_
+for unidimensional data and a `sklearn.preprocessing.OrdinalEncoder <https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html>`_
+for multidimensional data.
+
+Regarding the features, there are two methods to guide *auto-sklearn* to properly encode categorical columns:

 * Providing a X_train/X_test numpy array with the optional flag feat_type. For further details, you
 can check the Example :ref:`sphx_glr_examples_40_advanced_example_feature_types.py`.
-* You can provide a pandas DataFrame with properly formatted columns. If a column has numerical
-dtype, *auto-sklearn* will not encode it and it will be passed directly to scikit-learn. *auto-sklearn*
-supports both categorical or string as column type. Please ensure that you are using the correct
-dtype for your task. By default *auto-sklearn* treats object and string columns as strings and
-encodes the data using `sklearn.feature_extraction.text.CountVectorizer <https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html>`_
-* If your data contains categorical values (in the features or targets), ensure that you explicitly label them as categorical.
-data labeled as categorical is encoded by using a `sklearn.preprocessing.LabelEncoder <https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html>`_
-for unidimensional data and a `sklearn.preprodcessing.OrdinalEncoder <https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html>`_ for multidimensional data.
-* For further details on how to properly encode your data, you can check the Pandas Example
-`Working with categorical data <https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html>`_). If you are working with time series, it is recommended that you follow this approach
+* You can provide a pandas DataFrame, with properly formatted columns. If a column has numerical
+dtype, *auto-sklearn* will not encode it and it will be passed directly to scikit-learn. If the
+column has a categorical/boolean class, it will be encoded. If the column is of any other type
+(Object or Timeseries), an error will be raised. For further details on how to properly encode
+your data, you can check the Pandas Example
+`Working with categorical data <https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html>`_).
+If you are working with time series, it is recommended that you follow this approach
 `Working with time data <https://stats.stackexchange.com/questions/311494/>`_.
-* If you prefer not using the string option at all you can disable this option. In this case
-objects, strings and categorical columns are encoded as categorical.
-
-.. code:: python
-
-import autosklearn.classification
-automl = autosklearn.classification.AutoSklearnClassifier(allow_string_features=False)
-automl.fit(X_train, y_train)

 Regarding the targets (y_train/y_test), if the task involves a classification problem, such features will be
 automatically encoded. It is recommended to provide both y_train and y_test during fit, so that a common encoding
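
The DataFrame route described in the added doc/faq.rst lines can be sketched roughly as follows. This is only an illustration using pandas: the column names and values are made up, and the commented-out ``fit`` call assumes an already constructed ``AutoSklearnClassifier`` named ``automl`` and a matching ``y_train``, neither of which appears in this commit.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
X_train = pd.DataFrame(
    {
        "age": [23, 35, 41, 29],                        # numerical dtype
        "city": ["Berlin", "Paris", "Berlin", "Rome"],  # plain strings
        "is_member": [True, False, True, True],         # boolean dtype
    }
)

# Columns we (the user) know to be categorical. According to the added text,
# columns of "any other type" raise an error, so cast them explicitly.
categorical_columns = ["city"]
for column in categorical_columns:
    X_train[column] = X_train[column].astype("category")

# Inspect the resulting dtypes before handing the frame to auto-sklearn.
print(X_train.dtypes)

# Assumed usage, not executed here:
# automl.fit(X_train, y_train)
```

Casting up front keeps the encoding decision visible in user code instead of relying on how a given *auto-sklearn* version treats object columns.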

doc/manual.rst

Lines changed: 3 additions & 1 deletion
@@ -317,12 +317,14 @@ Other
 Optionally, you can measure the ability of this fitted model to generalize to unseen data by
 providing an optional testing pair (X_test/Y_test). For further details, please refer to the
 Example :ref:`sphx_glr_examples_40_advanced_example_pandas_train_test.py`.
+Supported formats for these training and testing pairs are: np.ndarray,
+pd.DataFrame, scipy.sparse.csr_matrix and python lists.

 Regarding the features, there are multiple things to consider:

 * Providing a X_train/X_test numpy array with the optional flag feat_type. For further details, you
 can check the Example :ref:`sphx_glr_examples_40_advanced_example_feature_types.py`.
-* You can provide a pandas DataFrame with properly formatted columns. If a column has numerical
+* You can provide a pandas DataFrame, with properly formatted columns. If a column has numerical
 dtype, *auto-sklearn* will not encode it and it will be passed directly to scikit-learn. *auto-sklearn*
 supports both categorical or string as column type. Please ensure that you are using the correct
 dtype for your task. By default *auto-sklearn* treats object and string columns as strings and
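
The numpy route with ``feat_type`` mentioned in this hunk might look like the sketch below. The data is invented, and both the ``"Numerical"``/``"Categorical"`` labels and the ``feat_type`` keyword on ``fit`` are assumptions based on the referenced feature-types example rather than anything shown in this commit.

```python
import numpy as np

# Hypothetical toy features: the middle column holds integer category codes.
X_train = np.array(
    [
        [0.5, 0, 3.2],
        [1.5, 2, 0.1],
        [2.5, 1, 7.7],
    ]
)

# One label per feature column, in column order (assumed label strings).
feat_type = ["Numerical", "Categorical", "Numerical"]

# Cheap sanity check before fitting: the list must cover every column.
if len(feat_type) != X_train.shape[1]:
    raise ValueError("feat_type needs exactly one entry per feature column")

# Assumed usage, not executed here:
# automl.fit(X_train, y_train, feat_type=feat_type)
```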
