doc/faq.rst
Lines changed: 15 additions & 19 deletions
@@ -31,30 +31,26 @@ General
 Optionally, you can measure the ability of this fitted model to generalize to unseen data by
 providing an optional testing pair (X_test/Y_test). For further details, please refer to the
 Example :ref:`sphx_glr_examples_40_advanced_example_pandas_train_test.py`.
+Supported formats for these training and testing pairs are: np.ndarray,
+pd.DataFrame, scipy.sparse.csr_matrix and Python lists.
 
-Regarding the features, there are multiple things to consider:
+If your data contains categorical values (in the features or targets), *auto-sklearn* will automatically encode your
+data using a `sklearn.preprocessing.LabelEncoder <https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html>`_
+for unidimensional data and a `sklearn.preprocessing.OrdinalEncoder <https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html>`_
+for multidimensional data.
+
+Regarding the features, there are two methods to guide *auto-sklearn* to properly encode categorical columns:
 
 * Providing an X_train/X_test numpy array with the optional flag feat_type. For further details, you
 can check the Example :ref:`sphx_glr_examples_40_advanced_example_feature_types.py`.
-* You can provide a pandas DataFrame with properly formatted columns. If a column has numerical
-dtype, *auto-sklearn* will not encode it and it will be passed directly to scikit-learn. *auto-sklearn*
-supports both categorical or string as column type. Please ensure that you are using the correct
-dtype for your task. By default *auto-sklearn* treats object and string columns as strings and
-encodes the data using `sklearn.feature_extraction.text.CountVectorizer <https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html>`_
-* If your data contains categorical values (in the features or targets), ensure that you explicitly label them as categorical.
-data labeled as categorical is encoded by using a `sklearn.preprocessing.LabelEncoder <https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html>`_
-for unidimensional data and a `sklearn.preprodcessing.OrdinalEncoder <https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html>`_ for multidimensional data.
-* For further details on how to properly encode your data, you can check the Pandas Example
-`Working with categorical data <https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html>`_). If you are working with time series, it is recommended that you follow this approach
+* You can provide a pandas DataFrame with properly formatted columns. If a column has numerical
+dtype, *auto-sklearn* will not encode it and it will be passed directly to scikit-learn. If the
+column has a categorical or boolean dtype, it will be encoded. If the column is of any other type
+(object or time series), an error will be raised. For further details on how to properly encode
+your data, you can check the pandas example
+`Working with categorical data <https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html>`_.
+If you are working with time series, it is recommended that you follow this approach:
 `Working with time data <https://stats.stackexchange.com/questions/311494/>`_.
-* If you prefer not using the string option at all you can disable this option. In this case
-objects, strings and categorical columns are encoded as categorical.
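
The updated FAQ text says that DataFrame columns with a numerical dtype are passed through, categorical or boolean columns are encoded, and any other type raises an error. The following is a minimal sketch of how a DataFrame might be prepared before fitting, under that description. It uses only pandas and does not call *auto-sklearn*; the toy data, the column names, and the helper `columns_needing_cast` are hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
X_train = pd.DataFrame({
    "age": [23, 41, 35],                    # numerical column
    "city": ["Berlin", "Paris", "Berlin"],  # string column: not numerical, boolean, or categorical
    "member": [True, False, True],          # boolean column
})


def columns_needing_cast(df):
    """Return the columns that are neither numerical, boolean, nor categorical."""
    return [
        name
        for name in df.columns
        if not (
            pd.api.types.is_numeric_dtype(df[name])
            or pd.api.types.is_bool_dtype(df[name])
            or isinstance(df[name].dtype, pd.CategoricalDtype)
        )
    ]


# Cast the remaining columns to the pandas "category" dtype, one at a time,
# so that each column is replaced by its categorical version.
to_cast = columns_needing_cast(X_train)
for name in to_cast:
    X_train[name] = X_train[name].astype("category")
```

Casting explicitly keeps the decision about which columns are categorical in the caller's hands. Whether a plain cast is appropriate depends on the data: for example, free-text or timestamp columns would likely need different handling.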