How to apply a custom preprocessor to only specified features #1110
Comments
Unfortunately, this is not possible yet. However, it would be great to explore whether this is possible and we would be very happy about your work on this. Let me share a few thoughts:
Thank you @mfeurer. Your suggestions about the string type and TF-IDF + Truncated SVD make a lot of sense. 👍 From there, it will be easy to add other dimensionality-reduction techniques (e.g., NMF, ICA, LDA) in subsequent PRs. I will have time to work on this new feature in a couple of weeks. As this would be my first contribution to this project, I will update you here first with a rough work plan before I proceed.
Great, just give a ping when you are ready; we'll do our best to respond in a timely manner.
This is currently being addressed and tested with PR #1300.
@stepthom thanks for the note on ICA. I am using it to solve other problems in survey-based regression, and I am looking at how it can be tied to PCA with regard to the number of dimensions.
I would like to extend auto-sklearn to handle datasets with both numerical and textual features. In particular, I want to implement a custom preprocessor that can take a textual feature and apply a TFIDF transformation.
This has raised a few concerns/questions in my head:
`object`, I will have cast my text feature to type `category`, but I do not want the standard categorical preprocessors (e.g., OHE) to be executed on this text feature by accident. Is there a way to achieve this?
If the above is simply not possible with the current auto-sklearn architecture, would you be interested in a pull request that would extend auto-sklearn to handle textual features?
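For reference, this is roughly the behaviour being asked for, sketched with plain scikit-learn rather than auto-sklearn. It is not auto-sklearn's API, only an illustration of routing a TF-IDF step (followed by Truncated SVD) to a single text column while the numerical columns get their own preprocessing. The column names `age`, `income`, and `review` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two numerical features and one free-text feature.
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29],
    "income": [40.0, 52.5, 88.0, 91.2, 60.3, 45.1],
    "review": [
        "great product, works well",
        "terrible support, very slow",
        "works well and arrived fast",
        "slow delivery, terrible packaging",
        "great value, fast delivery",
        "very slow and poor support",
    ],
})
y = np.array([1, 0, 1, 0, 1, 0])

# Text branch: TF-IDF followed by Truncated SVD to get a small dense block.
text_branch = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svd", TruncatedSVD(n_components=2, random_state=0)),
])

# Each transformer only sees the columns it is given, so the text column
# never reaches the numerical (or any categorical) preprocessing.
preprocessor = ColumnTransformer([
    ("num", StandardScaler(), ["age", "income"]),
    # A scalar column name passes a 1-D Series, which TfidfVectorizer expects.
    ("text", text_branch, "review"),
])

features = preprocessor.fit_transform(df)
print(features.shape)  # 2 scaled numerical columns + 2 SVD components

model = Pipeline([("prep", preprocessor), ("clf", LogisticRegression())])
model.fit(df, y)
predictions = model.predict(df)
```

Selecting the text column by name is what keeps it away from the other preprocessors here; auto-sklearn would need an equivalent notion of a text feature type to do the same inside its search space.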