From 60023d35321c74ce23711364048ec58107effc28 Mon Sep 17 00:00:00 2001
From: Matthias Feurer
Date: Wed, 8 Jun 2022 12:50:53 +0200
Subject: [PATCH 1/2] Update FAQ with text stuff

---
 doc/faq.rst | 34 +++++++++++++++++++---------------
 doc/manual.rst | 4 +---
 2 files changed, 20 insertions(+), 18 deletions(-)

diff --git a/doc/faq.rst b/doc/faq.rst
index 23a0124ce1..255ce4c76d 100644
--- a/doc/faq.rst
+++ b/doc/faq.rst
@@ -31,26 +31,30 @@ General
  Optionally, you can measure the ability of this fitted model to generalize to unseen data by providing an optional testing pair (X_test/Y_test). For further details, please refer to the Example :ref:`sphx_glr_examples_40_advanced_example_pandas_train_test.py`.
- Supported formats for these training and testing pairs are: np.ndarray,
- pd.DataFrame, scipy.sparse.csr_matrix and python lists.
- If your data contains categorical values (in the features or targets), autosklearn will automatically encode your
- data using a `sklearn.preprocessing.LabelEncoder `_
- for unidimensional data and a `sklearn.preprocessing.OrdinalEncoder `_
- for multidimensional data.
-
- Regarding the features, there are two methods to guide *auto-sklearn* to properly encode categorical columns:
+ Regarding the features, there are multiple things to consider:
  * Providing a X_train/X_test numpy array with the optional flag feat_type. For further details, you can check the Example :ref:`sphx_glr_examples_40_advanced_example_feature_types.py`.
- * You can provide a pandas DataFrame, with properly formatted columns. If a column has numerical
- dtype, *auto-sklearn* will not encode it and it will be passed directly to scikit-learn. If the
- column has a categorical/boolean class, it will be encoded. If the column is of any other type
- (Object or Timeseries), an error will be raised. For further details on how to properly encode
- your data, you can check the Pandas Example
- `Working with categorical data `_).
- If you are working with time series, it is recommended that you follow this approach
+ * You can provide a pandas DataFrame with properly formatted columns. If a column has numerical
+ dtype, *auto-sklearn* will not encode it and it will be passed directly to scikit-learn. *auto-sklearn*
+ supports both categorical or string as column type. Please ensure that you are using the correct
+ dtype for your task. By default *auto-sklearn* treats object and string columns as strings and
+ encodes the data using `sklearn.feature_extraction.text.CountVectorizer `_
+ * If your data contains categorical values (in the features or targets), ensure that you explicitly label them as categorical.
+ data labeled as categorical is encoded by using a `sklearn.preprocessing.LabelEncoder `_
+ for unidimensional data and a `sklearn.preprodcessing.OrdinalEncoder `_ for multidimensional data.
+ * For further details on how to properly encode your data, you can check the Pandas Example
+ `Working with categorical data `_). If you are working with time series, it is recommended that you follow this approach
  `Working with time data `_.
+ * If you prefer not using the string option at all you can disable this option. In this case
+ objects, strings and categorical columns are encoded as categorical.
+
+ .. code:: python
+
+ import autosklearn.classification
+ automl = autosklearn.classification.AutoSklearnClassifier(allow_string_features=False)
+ automl.fit(X_train, y_train)
  Regarding the targets (y_train/y_test), if the task involves a classification problem, such features will be automatically encoded. It is recommended to provide both y_train and y_test during fit, so that a common encoding
diff --git a/doc/manual.rst b/doc/manual.rst
index c3a37e19e3..e2b7e4d556 100644
--- a/doc/manual.rst
+++ b/doc/manual.rst
@@ -317,14 +317,12 @@ Other
  Optionally, you can measure the ability of this fitted model to generalize to unseen data by providing an optional testing pair (X_test/Y_test).
  For further details, please refer to the Example :ref:`sphx_glr_examples_40_advanced_example_pandas_train_test.py`.
- Supported formats for these training and testing pairs are: np.ndarray,
- pd.DataFrame, scipy.sparse.csr_matrix and python lists.
  Regarding the features, there are multiple things to consider:
  * Providing a X_train/X_test numpy array with the optional flag feat_type. For further details, you can check the Example :ref:`sphx_glr_examples_40_advanced_example_feature_types.py`.
- * You can provide a pandas DataFrame, with properly formatted columns. If a column has numerical
+ * You can provide a pandas DataFrame with properly formatted columns. If a column has numerical
  dtype, *auto-sklearn* will not encode it and it will be passed directly to scikit-learn. *auto-sklearn*
  supports both categorical or string as column type. Please ensure that you are using the correct
  dtype for your task. By default *auto-sklearn* treats object and string columns as strings and

From cc3e529154d8b505dfa17c819cc1cbeadd6f2f21 Mon Sep 17 00:00:00 2001
From: Matthias Feurer
Date: Wed, 8 Jun 2022 13:33:58 +0200
Subject: [PATCH 2/2] Take suggestions into account

---
 doc/faq.rst | 2 +-
 doc/manual.rst | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/doc/faq.rst b/doc/faq.rst
index 255ce4c76d..56fa360937 100644
--- a/doc/faq.rst
+++ b/doc/faq.rst
@@ -42,7 +42,7 @@ General
  dtype for your task. By default *auto-sklearn* treats object and string columns as strings and
  encodes the data using `sklearn.feature_extraction.text.CountVectorizer `_
  * If your data contains categorical values (in the features or targets), ensure that you explicitly label them as categorical.
- data labeled as categorical is encoded by using a `sklearn.preprocessing.LabelEncoder `_
+ Data labeled as categorical is encoded by using a `sklearn.preprocessing.LabelEncoder `_
  for unidimensional data and a `sklearn.preprodcessing.OrdinalEncoder `_ for multidimensional data.
  * For further details on how to properly encode your data, you can check the Pandas Example
  `Working with categorical data `_). If you are working with time series, it is recommended that you follow this approach
diff --git a/doc/manual.rst b/doc/manual.rst
index e2b7e4d556..7cdb162881 100644
--- a/doc/manual.rst
+++ b/doc/manual.rst
@@ -328,7 +328,7 @@ Other
  dtype for your task. By default *auto-sklearn* treats object and string columns as strings and
  encodes the data using `sklearn.feature_extraction.text.CountVectorizer `_
  * If your data contains categorical values (in the features or targets), ensure that you explicitly label them as categorical.
- data labeled as categorical is encoded by using a `sklearn.preprocessing.LabelEncoder `_
+ Data labeled as categorical is encoded by using a `sklearn.preprocessing.LabelEncoder `_
  for unidimensional data and a `sklearn.preprodcessing.OrdinalEncoder `_ for multidimensional data.
  * For further details on how to properly encode your data, you can check the Pandas Example
  `Working with categorical data `_). If you are working with time series, it is recommended that you follow this approach
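
Note on the documented behavior: the updated text asks users to label categorical columns explicitly and says object and string columns are treated as text by default. A minimal sketch of that labeling step, assuming pandas is installed, could look like the following. The toy data and column names are hypothetical and not part of the patch; the auto-sklearn call is only mentioned in a comment and is not executed here.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
X_train = pd.DataFrame(
    {
        "age": [23, 41, 35],
        "city": ["Berlin", "Paris", "Berlin"],
        "review": ["great product", "arrived late", "would buy again"],
    }
)

# Explicitly label the categorical column so it is not mistaken for free text.
X_train["city"] = X_train["city"].astype("category")

# Mark the free-text column with pandas' dedicated string dtype.
X_train["review"] = X_train["review"].astype("string")

# Inspect the resulting dtypes before handing the frame to a fit() call.
is_categorical = isinstance(X_train["city"].dtype, pd.CategoricalDtype)
is_text = pd.api.types.is_string_dtype(X_train["review"])
is_numeric = pd.api.types.is_numeric_dtype(X_train["age"])
print(X_train.dtypes)

# According to the patched FAQ text, passing allow_string_features=False to
# AutoSklearnClassifier would instead encode object, string and categorical
# columns as categorical.
```

Checking the dtypes up front like this is one way to follow the FAQ's advice to "ensure that you are using the correct dtype for your task" before fitting.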