Text preprocessing V2 TODOs #1373

Open
9 of 11 tasks
mfeurer opened this issue Jan 19, 2022 · 4 comments

@mfeurer
Contributor

mfeurer commented Jan 19, 2022

This is a list of follow-up tasks to #1300.

General implementation

  • Improve text example to include more meaningful dataset
  • Improve text example to contain links to further material that describes how we handle text data, for example this
  • Rename hyperparameters following this comment
  • Potentially move the text feature reduction to a different module
  • Discuss handling of pandas dtype object: can we default it to string or categorical?
  • Add a parameter to allow for text processing (default to True)
  • Discuss text feature support in the manual
  • Improve the way feature types are passed to the meta-feature computation (search for the following todo: Todo make this more cohesive to the overall structure (quick bug fix))
  • Fix "Unused hyperparameters remain active when datasets are purely categorical or purely numerical" (#741)

Hyperparameter space

  • Benchmark whether TF/IDF should be applied on a per-sample or per-feature level (see Text Processing #1300 (comment))
  • Improve text feature reduction upper and lower bound
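
One possible reading of the per-sample versus per-feature question can be sketched with a toy TF-IDF written in plain Python. The interpretation itself is an assumption and is not taken from the linked #1300 discussion: "per-feature" here means one TF-IDF model per text column, and "per-sample" means joining all text columns of a row into a single document. The `tfidf` helper, the `rows` data, and the column names are hypothetical and are not auto-sklearn code.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf(documents: List[str]) -> List[Dict[str, float]]:
    """Toy TF-IDF: term frequency times a smoothed inverse document frequency."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Number of documents in which each token appears at least once.
    doc_freq = Counter(token for tokens in tokenized for token in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            token: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[token])) + 1)
            for token, count in counts.items()
        })
    return weights


# Hypothetical dataset: each row (sample) has two text columns (features).
rows = [
    {"title": "cheap flights", "body": "book cheap flights today"},
    {"title": "hotel deals", "body": "cheap hotel rooms"},
]

# Per-feature (assumed meaning): one TF-IDF model per text column,
# so every column gets its own document-frequency statistics.
per_feature = {
    col: tfidf([row[col] for row in rows]) for col in ("title", "body")
}

# Per-sample (assumed meaning): join all text columns of a row into one
# document and fit a single TF-IDF model over those joined documents.
per_sample = tfidf([" ".join(row.values()) for row in rows])

print(per_feature["title"][0]["cheap"], per_sample[0]["cheap"])
```

Under this reading, the same token can receive different weights in the two setups, because both the document length and the document frequency change. A benchmark would have to compare the resulting downstream performance rather than the weights themselves.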
@mfeurer mfeurer added the enhancement A new improvement or feature label Jan 19, 2022
@Louquinze
Collaborator

Louquinze commented Feb 15, 2022

I cannot find "Improve the way feature types are passed to the meta-feature computation (search for the following todo: Todo make this more cohesive to the overall structure (quick bug fix))" in the open TODOs.

edit: metafeatures.py:1089

@mfeurer
Contributor Author

mfeurer commented Feb 15, 2022

I think the comment right now says: TODO make this more cohesive to the overall structure (quick bug fix)

Louquinze added a commit to Louquinze/auto-sklearn that referenced this issue Feb 21, 2022
automl#1373 (comment)

this commit fixes the first 3 bullet points on the to do list.

1. rename hyperparameter "ngram_range" --> "ngram_upper_bound"
   this includes changing all *csv and *json files

2. Create a new text preprocessing example, example_text_preprocessing.py, which features the 20 Newsgroups dataset

The import in example_text_preprocessing.py is too long, but I cannot come up with a good solution.
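
The rename across all *csv and *json files could be scripted roughly as follows. This is a minimal sketch, assuming a plain textual substitution is sufficient and that the files are UTF-8 encoded. The function name `rename_hyperparameter` and the directory argument are hypothetical; the commit itself may have made the change differently.

```python
from pathlib import Path
from typing import List

OLD_NAME = "ngram_range"
NEW_NAME = "ngram_upper_bound"


def rename_hyperparameter(
    root: Path, old: str = OLD_NAME, new: str = NEW_NAME
) -> List[Path]:
    """Replace `old` with `new` in every *.csv and *.json file below `root`.

    Returns the list of files that were actually modified.
    """
    changed = []
    for pattern in ("*.csv", "*.json"):
        for path in sorted(root.rglob(pattern)):
            text = path.read_text(encoding="utf-8")
            if old in text:
                path.write_text(text.replace(old, new), encoding="utf-8")
                changed.append(path)
    return changed
```

Calling `rename_hyperparameter(Path("path/to/metalearning/files"))` with a placeholder path rewrites matching files in place. A plain substring replacement also touches any longer identifier that contains the old name, so reviewing the resulting diff before committing is advisable.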
Louquinze added a commit to Louquinze/auto-sklearn that referenced this issue Feb 24, 2022
include feedback from 02.24.
@Louquinze
Collaborator

Can we rename the point "Potentially move the text feature reduction to a different module" to "Potentially rename the text feature reduction to a different module"?

@mfeurer
Contributor Author

mfeurer commented Feb 25, 2022

Sure

Louquinze added a commit to Louquinze/auto-sklearn that referenced this issue Feb 25, 2022
mfeurer pushed a commit that referenced this issue Mar 2, 2022
* rename "ngram_range" to "ngram_upper_bound"; this includes renaming it in all *csv and *json files for metalearning

* handle the following issue
#1373 (comment)

this commit fixes the first 3 bullet points on the to do list.

1. rename hyperparameter "ngram_range" --> "ngram_upper_bound"
   this includes changing all *csv and *json files

2. Create a new text preprocessing example, example_text_preprocessing.py, which features the 20 Newsgroups dataset

The import in example_text_preprocessing.py is too long, but I cannot come up with a good solution.

include feedback from 02.24.

* limit 20NG to 5 labels. automl.leaderboard has problems if the ensemble contains only one model. Therefore we reduced the problem complexity

* limit 20NG to 2 labels. automl.leaderboard has problems if the ensemble contains only one model. Therefore we reduced the problem complexity
Louquinze added a commit to Louquinze/auto-sklearn that referenced this issue May 17, 2022
Louquinze added a commit to Louquinze/auto-sklearn that referenced this issue May 19, 2022
@eddiebergman eddiebergman linked a pull request Jun 10, 2022 that will close this issue
@eddiebergman eddiebergman added this to the V0.15 milestone Jun 10, 2022
eddiebergman pushed a commit that referenced this issue Aug 18, 2022