How to apply a custom preprocessor to only specified features #1110

Open
stepthom opened this issue Mar 31, 2021 · 5 comments
Labels
enhancement A new improvement or feature

Comments

@stepthom

I would like to extend auto-sklearn to handle datasets with both numerical and textual features. In particular, I want to implement a custom preprocessor that takes a textual feature and applies a TF-IDF transformation.

This has raised a few concerns/questions in my head:

  • Since auto-sklearn does not accept features of type object, I will have to cast my text feature to type category, but I do not want the standard categorical preprocessors (e.g., OHE) to be executed on this text feature by accident. Is there a way to achieve this?
  • How can I be sure that my custom preprocessor is executed only on my textual feature, and not on the other (numeric) features?

If the above is simply not possible with the current auto-sklearn architecture, would you be interested in a pull request that extends auto-sklearn to handle textual features?
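
For reference, the per-column routing asked about above can be sketched in plain scikit-learn (not auto-sklearn) with `ColumnTransformer`. This is only an illustration of the idea under stated assumptions: the column names and toy data are hypothetical, and nothing here reflects how auto-sklearn builds its pipelines internally.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: one free-text column and two numeric columns.
df = pd.DataFrame({
    "review": ["good product", "bad product", "good value", "bad service"],
    "price": [10.0, 12.5, 8.0, 20.0],
    "rating": [5, 1, 4, 2],
})

preprocessor = ColumnTransformer(
    transformers=[
        # A scalar column name passes a 1-D series, which TfidfVectorizer expects.
        ("text", TfidfVectorizer(), "review"),
        # A list of column names passes a 2-D frame, which StandardScaler expects.
        ("num", StandardScaler(), ["price", "rating"]),
    ],
    sparse_threshold=0.0,  # always return a dense array
)

# TF-IDF columns come first, followed by the scaled numeric columns.
X = preprocessor.fit_transform(df)
```

Each transformer only ever sees the columns named in its entry, so the TF-IDF step never touches the numeric features and the scaler never touches the text.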

@mfeurer
Contributor

mfeurer commented Apr 1, 2021

Unfortunately, this is not possible yet. However, it would be great to explore whether this is possible and we would be very happy about your work on this.

Let me share a few thoughts:

  • Would it be okay for you to use pandas' new string type, since it is unambiguous?
  • I'd suggest hard-coding TF-IDF + truncated SVD as the only option for handling text data in Auto-sklearn and extending it from there. (In parallel, we'll work on passing custom pipelines, so the two efforts should complement each other, but they might also require some coordination.) That looks very much aligned with what you want to achieve, too.
  • I'm not sure how to handle text features in meta-learning yet, but we can figure that out later.
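
A minimal sketch of how the first two suggestions could fit together, again in plain scikit-learn rather than auto-sklearn: text columns are identified by pandas' `string` dtype, each one goes through TF-IDF followed by truncated SVD, and everything else is passed through unchanged. The data, the helper name `make_text_pipeline`, and the choice of two SVD components are assumptions made for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Hypothetical toy data; the text column is explicitly given the pandas string dtype.
df = pd.DataFrame({
    "comment": pd.array(
        ["fast delivery", "slow delivery", "fast refund", "slow refund", "great support"],
        dtype="string",
    ),
    "amount": [3.0, 7.5, 1.0, 4.2, 9.9],
})

# The string dtype marks text columns unambiguously (unlike object or category).
text_cols = [c for c in df.columns if isinstance(df[c].dtype, pd.StringDtype)]


def make_text_pipeline():
    # Hypothetical helper: TF-IDF followed by truncated SVD, as suggested above.
    return make_pipeline(TfidfVectorizer(), TruncatedSVD(n_components=2, random_state=0))


preprocessor = ColumnTransformer(
    # One TF-IDF + SVD pipeline per text column, selected by scalar column name.
    transformers=[(f"text_{c}", make_text_pipeline(), c) for c in text_cols],
    remainder="passthrough",  # non-text columns are appended unchanged
)

X = preprocessor.fit_transform(df)
```

Because selection is driven by the dtype, a column cast to `category` would not be picked up by the text pipeline, and a `string` column would not be mistaken for a categorical one.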

@mfeurer added the enhancement label Apr 1, 2021

@stepthom
Author

stepthom commented Apr 5, 2021

Thank you @mfeurer. Your suggestions about the string type and TF-IDF+Truncated SVD make a lot of sense. 👍 From there, it will be easy to add other dimensionality-reduction techniques (e.g., NMF, ICA, LDA) in subsequent PRs.
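
As a rough illustration of why swapping the reduction step should be straightforward, here is a sketch in plain scikit-learn where the reducer after TF-IDF is a replaceable pipeline step. The documents, the step names, and the hyperparameters are hypothetical; only TruncatedSVD and NMF are shown.

```python
from sklearn.decomposition import NMF, TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus.
docs = [
    "fast delivery and good price",
    "slow delivery and bad price",
    "good support and fast refund",
    "bad support and slow refund",
    "good price and good support",
    "bad price and bad support",
]

# Candidate reducers; TF-IDF output is non-negative, so NMF can consume it directly.
reducers = {
    "svd": TruncatedSVD(n_components=2, random_state=0),
    "nmf": NMF(n_components=2, init="nndsvda", max_iter=500, random_state=0),
}

shapes = {}
for name, reducer in reducers.items():
    # Only the "reduce" step changes between variants.
    pipe = Pipeline([("tfidf", TfidfVectorizer()), ("reduce", reducer)])
    shapes[name] = pipe.fit_transform(docs).shape
```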

I will have time to work on this new feature in a couple of weeks. As this would be my first contribution to this project, I will update you here first with a rough work plan before I proceed.

@mfeurer
Contributor

mfeurer commented Apr 6, 2021

> I will update you here first with a rough work plan before I proceed.

Great, just give a ping when you are ready; we'll do our best to respond in a timely manner.

@eddiebergman
Contributor

This is currently being addressed and tested with PR #1300

@BradKML

BradKML commented Oct 27, 2022

@stepthom thanks for the note on ICA. I am using it to solve other problems in survey-based regression, and I am looking at how it can be tied to PCA with regard to dimension count.

4 participants