How to weight a given class? [class balancing] #1596

Open
UnixJunkie opened this issue Oct 10, 2022 · 4 comments
Labels
enhancement (A new improvement or feature), question

Comments

@UnixJunkie
Contributor

Is it possible to give a list of weights that should be tried for a given class?
I have some data where very heavy reweighting of the under-represented class is necessary to get any good classifier.

I don't know in advance which weight to use; apparently it depends on the ML method being used.
So, this is another hyperparameter that needs to be optimized.

@eddiebergman added the enhancement and question labels Oct 19, 2022
@aron-bram
Collaborator

aron-bram commented Oct 20, 2022

Hi,

Unfortunately, we do not provide a way to supply a list of class weights to be tried out during the optimization process.

However, by default auto-sklearn should handle imbalance in the dataset: the search also includes estimators that use sample/class weights, with the weights set to the inverse of each class's frequency (refer to Balancing for the implementation).
This is similar to how sklearn's "balanced" value for the class_weight parameter works with some estimators.

May I ask what performance you reached on this dataset using auto-sklearn, and how it compared to other methods?
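For reference, the inverse-frequency heuristic can be sketched in a few lines. This follows the formula scikit-learn documents for its balanced class weighting, n_samples / (n_classes * class_count); the helper name `balanced_class_weights` and the toy labels are assumptions made for illustration, and this is not auto-sklearn's own Balancing code.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: n_samples / (n_classes * class_count)}.

    Hypothetical helper: mirrors the inverse-frequency formula that
    scikit-learn documents for balanced class weights.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Toy imbalanced labels: 90 negatives, 10 positives (made-up data).
labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)
print(weights)
```

With a 90/10 split the minority class gets nine times the weight of the majority class. That ratio is fixed by the class counts alone, which is why it can only serve as a starting point.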

In general, an alternative would be to oversample the under-represented class or to undersample the over-represented one. Not sure if this is a good enough option for you, though.
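A minimal sketch of random oversampling, assuming plain Python lists for features and labels. The function name `random_oversample` and the toy data are hypothetical; in practice this would be applied to the training split only, so that duplicated rows never leak into validation data.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of smaller classes until every
    class has as many rows as the largest one (hypothetical helper)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, cnt in counts.items():
        # Rows belonging to this class, used as the sampling pool.
        pool = [r for r, y in zip(rows, labels) if y == cls]
        extra = [rng.choice(pool) for _ in range(target - cnt)]
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


# Toy data: 8 rows of class 0 and 2 rows of class 1 (made-up values).
rows = [[i] for i in range(10)]
labels = [0] * 8 + [1] * 2
res_rows, res_labels = random_oversample(rows, labels)
print(Counter(res_labels))
```

Undersampling is the mirror image: draw `min(counts.values())` rows from each class instead of padding up to the maximum.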

Or you could define a custom metric in auto-sklearn that takes the class imbalance into account.

You may also be interested in defining your own balancing component (see the "Extending Auto-Sklearn with Classification Component" example).

I hope I could help, and please feel free to follow up on it.

Let me know @eddiebergman if I forgot about something.

@UnixJunkie
Contributor Author

Class weight is just another hyperparameter that needs to be optimized on some datasets, with some ML methods (like SVM).
Using the inverse of the class frequency is just an initial guess, and sometimes it is very far from what optimization would give you.

auto-sklearn failed miserably on this dataset, while by hand I could optimize a model using liblinear (with very strong class weighting for the under-represented class). auto-sklearn's AUC was 0.5; mine was 0.58 (yes, it is a hard binary classification dataset).
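Outside auto-sklearn, treating the minority-class weight as a tunable hyperparameter could look roughly like the sketch below, which uses plain scikit-learn. Everything specific here is an assumption for illustration: the synthetic dataset, the candidate weights, and `LogisticRegression(solver="liblinear")` as a stand-in for a liblinear-based model. A real search would more likely sample the weight from a continuous range than scan a short grid.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic imbalanced binary problem (about 5% positives); made-up data.
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.95, 0.05], random_state=0
)

# Candidate weights for the minority class (label 1); the majority
# class weight stays at 1.0. The candidate values are arbitrary.
candidate_weights = (1.0, 5.0, 20.0, 100.0)
param_grid = {"class_weight": [{0: 1.0, 1: w} for w in candidate_weights]}

search = GridSearchCV(
    LogisticRegression(solver="liblinear"),
    param_grid,
    scoring="roc_auc",  # AUC, the metric discussed in this thread
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X, y)

best_minority_weight = search.best_params_["class_weight"][1]
print(best_minority_weight, search.best_score_)
```

The selected weight depends on both the data and the estimator, so it has to be searched per model rather than fixed up front.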

Trying to resample the classes doesn't help on this dataset. I tried bagging for class balancing.

There are already metrics in there that take class imbalance into account (e.g. AUC is fine if you output probabilities).

FYI, caret lets users pass the class weights to try for every method that supports class weights.
However, caret doesn't do it right: the weight should be optimized like all other hyperparameters, not scanned by the user.

@aron-bram
Collaborator

We do realize that handling it as a hyperparameter would improve results on such extremely unbalanced datasets. It just hasn't been a priority for us given the lack of such requests.
But thank you for your suggestion, it indeed has the potential to improve the library.

We will consider adding this as a floating-point hyperparameter that could be used by the Balancing class. However, I cannot yet give you an exact date by which this feature will be included, unfortunately.
Is this an urgent issue for you?

If so, then you could implement your own balancing class as indicated at the bottom of my previous answer. This is far from being the optimal solution, but it should work. I can try to give you a hint on how to achieve this with a dummy implementation soon.

Thank you for your patience.

@UnixJunkie
Contributor Author

This is not urgent; auto-sklearn fails on this dataset, so I don't use it.


3 participants