From 60023d35321c74ce23711364048ec58107effc28 Mon Sep 17 00:00:00 2001
From: Matthias Feurer <feurerm@informatik.uni-freiburg.de>
Date: Wed, 8 Jun 2022 12:50:53 +0200
Subject: [PATCH 1/2] Update FAQ with text stuff

---
 doc/faq.rst    | 34 +++++++++++++++++++---------------
 doc/manual.rst |  4 +---
 2 files changed, 20 insertions(+), 18 deletions(-)

diff --git a/doc/faq.rst b/doc/faq.rst
index 23a0124ce1..255ce4c76d 100644
--- a/doc/faq.rst
+++ b/doc/faq.rst
@@ -31,26 +31,30 @@ General
     Optionally, you can measure the ability of this fitted model to generalize to unseen data by
     providing an optional testing pair (X_test/Y_test). For further details, please refer to the
     Example :ref:`sphx_glr_examples_40_advanced_example_pandas_train_test.py`.
-    Supported formats for these training and testing pairs are: np.ndarray,
-    pd.DataFrame, scipy.sparse.csr_matrix and python lists.
 
-    If your data contains categorical values (in the features or targets), autosklearn will automatically encode your
-    data using a `sklearn.preprocessing.LabelEncoder <https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html>`_
-    for unidimensional data and a `sklearn.preprocessing.OrdinalEncoder <https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html>`_
-    for multidimensional data.
-
-    Regarding the features, there are two methods to guide *auto-sklearn* to properly encode categorical columns:
+    Regarding the features, there are multiple things to consider:
 
     * Providing a X_train/X_test numpy array with the optional flag feat_type. For further details, you
       can check the Example :ref:`sphx_glr_examples_40_advanced_example_feature_types.py`.
-    * You can provide a pandas DataFrame, with properly formatted columns. If a column has numerical
-      dtype, *auto-sklearn* will not encode it and it will be passed directly to scikit-learn. If the
-      column has a categorical/boolean class, it will be encoded. If the column is of any other type
-      (Object or Timeseries), an error will be raised. For further details on how to properly encode
-      your data, you can check the Pandas Example
-      `Working with categorical data <https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html>`_).
-      If you are working with time series, it is recommended that you follow this approach
+    * You can provide a pandas DataFrame with properly formatted columns. If a column has numerical
+      dtype, *auto-sklearn* will not encode it and it will be passed directly to scikit-learn. *auto-sklearn*
+      supports both categorical and string column types. Please ensure that you are using the correct
+      dtype for your task. By default *auto-sklearn* treats object and string columns as strings and
+      encodes the data using `sklearn.feature_extraction.text.CountVectorizer <https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html>`_
+    * If your data contains categorical values (in the features or targets), ensure that you explicitly label them as categorical.
+      data labeled as categorical is encoded by using a `sklearn.preprocessing.LabelEncoder <https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html>`_
+      for unidimensional data and a `sklearn.preprodcessing.OrdinalEncoder <https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html>`_ for multidimensional data.
+    * For further details on how to properly encode your data, you can check the Pandas Example
+      `Working with categorical data <https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html>`_). If you are working with time series, it is recommended that you follow this approach
       `Working with time data <https://stats.stackexchange.com/questions/311494/>`_.
+    * If you prefer not to use the string option at all, you can disable it. In this case,
+      object, string and categorical columns are all encoded as categorical.
+
+    .. code:: python
+
+        import autosklearn.classification
+        automl = autosklearn.classification.AutoSklearnClassifier(allow_string_features=False)
+        automl.fit(X_train, y_train)
 
     Regarding the targets (y_train/y_test), if the task involves a classification problem, such features will be
     automatically encoded. It is recommended to provide both y_train and y_test during fit, so that a common encoding
diff --git a/doc/manual.rst b/doc/manual.rst
index c3a37e19e3..e2b7e4d556 100644
--- a/doc/manual.rst
+++ b/doc/manual.rst
@@ -317,14 +317,12 @@ Other
     Optionally, you can measure the ability of this fitted model to generalize to unseen data by
     providing an optional testing pair (X_test/Y_test). For further details, please refer to the
     Example :ref:`sphx_glr_examples_40_advanced_example_pandas_train_test.py`.
-    Supported formats for these training and testing pairs are: np.ndarray,
-    pd.DataFrame, scipy.sparse.csr_matrix and python lists.
 
     Regarding the features, there are multiple things to consider:
 
     * Providing a X_train/X_test numpy array with the optional flag feat_type. For further details, you
       can check the Example :ref:`sphx_glr_examples_40_advanced_example_feature_types.py`.
-    * You can provide a pandas DataFrame, with properly formatted columns. If a column has numerical
+    * You can provide a pandas DataFrame with properly formatted columns. If a column has numerical
       dtype, *auto-sklearn* will not encode it and it will be passed directly to scikit-learn. *auto-sklearn*
       supports both categorical or string as column type. Please ensure that you are using the correct
       dtype for your task. By default *auto-sklearn* treats object and string columns as strings and

From cc3e529154d8b505dfa17c819cc1cbeadd6f2f21 Mon Sep 17 00:00:00 2001
From: Matthias Feurer <feurerm@informatik.uni-freiburg.de>
Date: Wed, 8 Jun 2022 13:33:58 +0200
Subject: [PATCH 2/2] Take suggestions into account

---
 doc/faq.rst    | 2 +-
 doc/manual.rst | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)
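
Note on the dtype guidance in the FAQ text (not part of the commit): the
advice to "explicitly label" columns can be sketched with plain pandas as
below. This is a minimal sketch under stated assumptions: the toy data and
the column names (age, color, review) are hypothetical and only serve to
show one categorical and one free-text column. How auto-sklearn then treats
each dtype is as described in the FAQ text, and is not exercised here.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
X_train = pd.DataFrame(
    {
        "age": [25, 32, 47],
        "color": ["red", "blue", "red"],
        "review": ["works well", "broke after a week", "good value"],
    }
)

# Explicitly label the low-cardinality column as categorical.
X_train["color"] = X_train["color"].astype("category")

# Explicitly label the free-text column with pandas' string dtype.
X_train["review"] = X_train["review"].astype("string")

# The numerical column keeps its numerical dtype and needs no labeling.
print(X_train.dtypes)

# The frame would then be passed to fit() as in the FAQ snippet, e.g.
# AutoSklearnClassifier(allow_string_features=False).fit(X_train, y_train)
```

Setting the dtype on the DataFrame keeps the column-type decision next to
the data, so no separate feat_type argument has to be kept in sync.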

diff --git a/doc/faq.rst b/doc/faq.rst
index 255ce4c76d..56fa360937 100644
--- a/doc/faq.rst
+++ b/doc/faq.rst
@@ -42,7 +42,7 @@ General
       dtype for your task. By default *auto-sklearn* treats object and string columns as strings and
       encodes the data using `sklearn.feature_extraction.text.CountVectorizer <https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html>`_
     * If your data contains categorical values (in the features or targets), ensure that you explicitly label them as categorical.
-      data labeled as categorical is encoded by using a `sklearn.preprocessing.LabelEncoder <https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html>`_
+      Data labeled as categorical is encoded by using a `sklearn.preprocessing.LabelEncoder <https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html>`_
       for unidimensional data and a `sklearn.preprodcessing.OrdinalEncoder <https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html>`_ for multidimensional data.
     * For further details on how to properly encode your data, you can check the Pandas Example
       `Working with categorical data <https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html>`_). If you are working with time series, it is recommended that you follow this approach
diff --git a/doc/manual.rst b/doc/manual.rst
index e2b7e4d556..7cdb162881 100644
--- a/doc/manual.rst
+++ b/doc/manual.rst
@@ -328,7 +328,7 @@ Other
       dtype for your task. By default *auto-sklearn* treats object and string columns as strings and
       encodes the data using `sklearn.feature_extraction.text.CountVectorizer <https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html>`_
     * If your data contains categorical values (in the features or targets), ensure that you explicitly label them as categorical.
-      data labeled as categorical is encoded by using a `sklearn.preprocessing.LabelEncoder <https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html>`_
+      Data labeled as categorical is encoded by using a `sklearn.preprocessing.LabelEncoder <https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html>`_
       for unidimensional data and a `sklearn.preprodcessing.OrdinalEncoder <https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html>`_ for multidimensional data.
     * For further details on how to properly encode your data, you can check the Pandas Example
       `Working with categorical data <https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html>`_). If you are working with time series, it is recommended that you follow this approach