How to apply a custom preprocessor to only specified features #1110

stepthom · 2021-03-31T12:54:02Z

I would like to extend auto-sklearn to handle datasets with both numerical and textual features. In particular, I want to implement a custom preprocessor that takes a textual feature and applies a TF-IDF transformation.

This has raised a few concerns/questions in my head:

Since auto-sklearn does not accept features of type object, I will have to cast my text feature to type category, but I do not want the standard categorical preprocessors (e.g., one-hot encoding) to be executed on this text feature by accident. Is there a way to achieve this?
How can I be sure that my custom preprocessor is only executed on my textual feature, and not on the other (numeric) features?

If the above is simply not possible with the current auto-sklearn architecture, would you be interested in a pull request that extends auto-sklearn to handle textual features?


mfeurer · 2021-04-01T08:13:00Z

Unfortunately, this is not possible yet. However, it would be great to explore whether this is possible and we would be very happy about your work on this.

Let me share a few thoughts:

Would it be okay for you to use pandas' new string type as this is unambiguous?
I'd suggest hard-coding TF-IDF + truncated SVD as the only option for handling text data in Auto-sklearn and extending it from there. (In parallel, we'll work on passing custom pipelines, so these two should complement each other, but they might also require some work to coordinate.) That looks very much aligned with what you want to achieve, too.
I'm not sure how to handle text features in meta-learning yet, but we can figure that out later.
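For readers following along, here is a minimal sketch of the idea in plain scikit-learn, outside of Auto-sklearn: the pandas string dtype marks the text column, and a `ColumnTransformer` applies TF-IDF + truncated SVD to that column only. The toy data, the column names, and the choice of `StandardScaler` for the numeric columns are assumptions made for illustration; this is not how Auto-sklearn implements it.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: one text column and two numeric columns.
df = pd.DataFrame(
    {
        "review": [
            "great product, works well",
            "terrible quality, broke fast",
            "works as described",
            "bad packaging but great product",
        ],
        "price": [10.0, 25.5, 7.25, 12.0],
        "rating": [5, 1, 4, 3],
    }
)

# The dedicated pandas string dtype marks the column as text unambiguously.
df["review"] = df["review"].astype("string")

# Route columns by dtype so each preprocessor sees only its own features.
text_cols = [c for c in df.columns if isinstance(df[c].dtype, pd.StringDtype)]
numeric_cols = [c for c in df.columns if pd.api.types.is_numeric_dtype(df[c])]

# TfidfVectorizer expects a 1-D sequence of documents, so each text column is
# selected by its scalar name (not a list) and gets its own sub-pipeline.
transformers = [
    (
        f"text_{col}",
        make_pipeline(TfidfVectorizer(), TruncatedSVD(n_components=2, random_state=0)),
        col,
    )
    for col in text_cols
]
transformers.append(("numeric", StandardScaler(), numeric_cols))

preprocessor = ColumnTransformer(transformers)
features = preprocessor.fit_transform(df)
print(features.shape)  # 2 SVD components + 2 scaled numeric columns per row
```

Because the text column is addressed explicitly by name, the numeric preprocessor never touches it, and no categorical encoder is involved at all.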

stepthom · 2021-04-05T12:46:31Z

Thank you @mfeurer. Your suggestions about the string type and TF-IDF+Truncated SVD make a lot of sense. 👍 From there, it will be easy to add other dimensionality-reduction techniques (e.g., NMF, ICA, LDA) in subsequent PRs.

I will have time to work on this new feature in a couple of weeks. As this would be my first contribution to this project, I will update you here first with a rough work plan before I proceed.
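As a rough illustration of why swapping the dimensionality-reduction step should be straightforward, the sketch below keeps TF-IDF fixed and makes the reducer a parameter. It uses plain scikit-learn; `make_text_pipeline` and the toy documents are hypothetical names and data introduced for this example, not part of Auto-sklearn.

```python
from sklearn.decomposition import NMF, TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline


def make_text_pipeline(reducer):
    """Hypothetical helper: TF-IDF followed by an interchangeable reducer."""
    return Pipeline([("tfidf", TfidfVectorizer()), ("reduce", reducer)])


# Hypothetical toy documents.
docs = [
    "great product, works well",
    "terrible quality, broke fast",
    "works as described",
    "bad packaging but great product",
]

reducers = {
    "svd": TruncatedSVD(n_components=2, random_state=0),
    # NMF requires non-negative input, which TF-IDF weights satisfy.
    "nmf": NMF(n_components=2, random_state=0, max_iter=500),
}

shapes = {}
for name, reducer in reducers.items():
    reduced = make_text_pipeline(reducer).fit_transform(docs)
    shapes[name] = reduced.shape
    print(name, reduced.shape)
```

Both reducers here accept the sparse TF-IDF matrix directly. scikit-learn's `FastICA` expects dense input, so an ICA variant would likely need an extra densifying step.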

mfeurer · 2021-04-06T18:12:23Z

> I will update you here first with a rough work plan before I proceed.

Great, just give a ping when you are ready; we'll do our best to respond in a timely manner.

eddiebergman · 2021-11-17T10:18:37Z

This is currently being addressed and tested with PR #1300

BradKML · 2022-10-27T09:20:18Z

@stepthom, thanks for the note on ICA. I am using it to solve other problems in survey-based regression, and I am looking at how it can be tied to PCA with regard to dimension count.

mfeurer added the "enhancement" label (A new improvement or feature) on Apr 1, 2021

eddiebergman mentioned this issue Jul 21, 2023

What's in store for Auto-Sklearn? -- From the Developers #1677

Open
