Update FAQ with text stuff #1500

Merged
merged 2 commits into from
Jun 9, 2022
34 changes: 19 additions & 15 deletions doc/faq.rst
@@ -31,26 +31,30 @@ General
Optionally, you can measure the ability of this fitted model to generalize to unseen data by
providing an optional testing pair (X_test/Y_test). For further details, please refer to the
Example :ref:`sphx_glr_examples_40_advanced_example_pandas_train_test.py`.
Supported formats for these training and testing pairs are: np.ndarray,
pd.DataFrame, scipy.sparse.csr_matrix and Python lists.

If your data contains categorical values (in the features or targets), autosklearn will automatically encode your
data using a `sklearn.preprocessing.LabelEncoder <https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html>`_
for unidimensional data and a `sklearn.preprocessing.OrdinalEncoder <https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html>`_
for multidimensional data.
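
As a rough illustration of the difference between the two encoders named above, the following sketch applies them directly with scikit-learn. It only shows what each encoder does in isolation; the toy values are made up for illustration, and it is an assumption (not something this FAQ states) that *auto-sklearn* calls them in exactly this way internally.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Unidimensional data (for example a target vector):
# LabelEncoder maps each distinct value to an integer, in sorted order.
y = ["cat", "dog", "cat", "bird"]
y_encoded = LabelEncoder().fit_transform(y)

# Multidimensional data (for example a feature matrix):
# OrdinalEncoder encodes every column independently.
X = np.array([["red", "small"], ["blue", "large"], ["red", "large"]])
X_encoded = OrdinalEncoder().fit_transform(X)

print(y_encoded)
print(X_encoded)
```

Both encoders assign integer codes following the sorted order of the observed values, so the codes carry no meaning beyond identifying the category.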

Regarding the features, there are two methods to guide *auto-sklearn* to properly encode categorical columns:
Regarding the features, there are multiple things to consider:

* Providing an X_train/X_test numpy array with the optional flag feat_type. For further details, you
can check the Example :ref:`sphx_glr_examples_40_advanced_example_feature_types.py`.
* You can provide a pandas DataFrame, with properly formatted columns. If a column has numerical
dtype, *auto-sklearn* will not encode it and it will be passed directly to scikit-learn. If the
column has a categorical/boolean class, it will be encoded. If the column is of any other type
(Object or Timeseries), an error will be raised. For further details on how to properly encode
your data, you can check the Pandas Example
`Working with categorical data <https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html>`_.
If you are working with time series, it is recommended that you follow the approach described in
* You can provide a pandas DataFrame with properly formatted columns. If a column has numerical
dtype, *auto-sklearn* will not encode it and it will be passed directly to scikit-learn. *auto-sklearn*
supports both categorical and string column types. Please ensure that you are using the correct
dtype for your task. By default, *auto-sklearn* treats object and string columns as strings and
encodes the data using `sklearn.feature_extraction.text.CountVectorizer <https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html>`_.
* If your data contains categorical values (in the features or targets), ensure that you explicitly label them as categorical.
Data labeled as categorical is encoded by using a `sklearn.preprocessing.LabelEncoder <https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html>`_
for unidimensional data and a `sklearn.preprocessing.OrdinalEncoder <https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html>`_ for multidimensional data.
* For further details on how to properly encode your data, you can check the Pandas Example
`Working with categorical data <https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html>`_. If you are working with time series, it is recommended that you follow the approach described in
`Working with time data <https://stats.stackexchange.com/questions/311494/>`_.
* If you prefer not to use the string option at all, you can disable it. In this case,
object, string and categorical columns are all encoded as categorical.

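.. code:: python

    import autosklearn.classification

    # X_train and y_train are assumed to be defined already.
    # With string features disabled, object, string and categorical
    # columns are all encoded as categorical.
    automl = autosklearn.classification.AutoSklearnClassifier(allow_string_features=False)
    automl.fit(X_train, y_train)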
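
The advice above about using the correct dtype can be sketched with plain pandas. This is a minimal example under stated assumptions: the column names and values are hypothetical, and it only shows how to label columns before passing the DataFrame to ``fit``, not what *auto-sklearn* does with them afterwards.

```python
import pandas as pd

# Hypothetical toy data: one numerical, one categorical and one free-text column.
df = pd.DataFrame({
    "age": [25, 32, 47],
    "color": ["red", "blue", "red"],
    "review": ["works well", "stopped working", "great value"],
})

# Explicitly label the categorical column so that it is not treated as text.
df["color"] = df["color"].astype("category")

# Explicitly label the free-text column as string.
df["review"] = df["review"].astype("string")

print(df.dtypes)
```

Without the explicit ``astype("category")`` call, the ``color`` column would not carry a categorical dtype and, according to the text above, would be treated as a string column by default.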

Regarding the targets (y_train/y_test), if the task involves a classification problem, the targets will be
automatically encoded. It is recommended to provide both y_train and y_test during fit, so that a common encoding
6 changes: 2 additions & 4 deletions doc/manual.rst
@@ -317,20 +317,18 @@ Other
Optionally, you can measure the ability of this fitted model to generalize to unseen data by
providing an optional testing pair (X_test/Y_test). For further details, please refer to the
Example :ref:`sphx_glr_examples_40_advanced_example_pandas_train_test.py`.
Supported formats for these training and testing pairs are: np.ndarray,
pd.DataFrame, scipy.sparse.csr_matrix and Python lists.

Regarding the features, there are multiple things to consider:

* Providing an X_train/X_test numpy array with the optional flag feat_type. For further details, you
can check the Example :ref:`sphx_glr_examples_40_advanced_example_feature_types.py`.
* You can provide a pandas DataFrame, with properly formatted columns. If a column has numerical
* You can provide a pandas DataFrame with properly formatted columns. If a column has numerical
dtype, *auto-sklearn* will not encode it and it will be passed directly to scikit-learn. *auto-sklearn*
supports both categorical and string column types. Please ensure that you are using the correct
dtype for your task. By default, *auto-sklearn* treats object and string columns as strings and
encodes the data using `sklearn.feature_extraction.text.CountVectorizer <https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html>`_.
* If your data contains categorical values (in the features or targets), ensure that you explicitly label them as categorical.
data labeled as categorical is encoded by using a `sklearn.preprocessing.LabelEncoder <https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html>`_
Data labeled as categorical is encoded by using a `sklearn.preprocessing.LabelEncoder <https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html>`_
for unidimensional data and a `sklearn.preprocessing.OrdinalEncoder <https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html>`_ for multidimensional data.
* For further details on how to properly encode your data, you can check the Pandas Example
`Working with categorical data <https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html>`_. If you are working with time series, it is recommended that you follow this approach