Samplers / pipelines for imbalanced datasets #317

Open
TomAugspurger opened this issue Jul 27, 2018 · 16 comments
Labels
Roadmap Larger, self-contained pieces of work.

Comments

@TomAugspurger
Member

TomAugspurger commented Jul 27, 2018

Imbalanced datasets, where the classes have very different occurrence rates, can show up in large datasets as well.

There are many strategies for dealing with imbalanced data. imbalanced-learn (http://contrib.scikit-learn.org/imbalanced-learn/stable/api.html) implements a set of them, some of which could be scaled to large datasets with dask.

@TomAugspurger TomAugspurger added the Roadmap Larger, self-contained pieces of work. label Aug 16, 2018
@sephib

sephib commented Feb 27, 2020

Hi,
I think most of the change would be introducing support for fit_resample and fit_sample into the fit_transform method.
I'll be happy to assist on this issue.

@TomAugspurger
Member Author

@sephib do you have any examples of fit_resample and fit_sample? I'm not familiar with them.

@sephib

sephib commented Mar 3, 2020

The core fit_resample function lives in imblearn/base.py.
It is used throughout the imblearn library - for example, here is the implementation within the imblearn pipeline.
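
To make the contract concrete, here is a minimal sketch (the toy dataset and RandomUnderSampler are just illustrations): unlike a transformer, a sampler returns a resampled (X, y) pair and may change the number of rows.

from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler

# 90/10 imbalanced toy dataset
X, y = make_classification(weights=[0.9, 0.1], n_samples=1000, random_state=0)

sampler = RandomUnderSampler(random_state=0)
X_res, y_res = sampler.fit_resample(X, y)  # balanced classes, fewer rows
# sampler.transform(X) would raise AttributeError: samplers define no transform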

@TomAugspurger
Member Author

TomAugspurger commented Mar 3, 2020 via email

@sephib

sephib commented Mar 4, 2020

Currently, when I run dask-ml with an imblearn pipeline, I get an error:

AttributeError: 'FunctionSampler' object has no attribute 'transform'

This comes from the fit_transform function in dask_ml/model_selection/methods.py, which looks for a fit_transform attribute, or for fit and transform attributes (which in imblearn are "converted" to fit_resample).
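
Roughly, the dispatch there looks like this (a simplified sketch, not the actual dask-ml source):

def fit_transform(est, X, y, **fit_params):
    # dask-ml assumes the transformer interface...
    if hasattr(est, "fit_transform"):
        Xt = est.fit_transform(X, y, **fit_params)
    else:
        est.fit(X, y, **fit_params)
        # ...but samplers only define fit_resample, so this attribute
        # lookup fails with the AttributeError above
        Xt = est.transform(X)
    return Xt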

@TomAugspurger
Member Author

It would help to have a minimal example: http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports

@sephib

sephib commented Mar 9, 2020

Hi
Here is sample code that gets past dask_ml/model_selection/methods.py. Unfortunately it still fails inside imblearn/base.py, but I think that may be an issue with the example itself.

This is after amending the file to use

from imblearn.pipeline import Pipeline

instead of

from sklearn.pipeline import Pipeline

and adding these lines to the fit_transform function after line 260:

elif hasattr(est, "fit_resample"):
    Xt = est.fit_resample(X, y, **fit_params)
The sample code:

from sklearn.model_selection import train_test_split as tts
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.metrics import classification_report
from sklearn.neighbors import KNeighborsClassifier as KNN
from imblearn.pipeline import make_pipeline
from imblearn.under_sampling import (EditedNearestNeighbours,
                                     RepeatedEditedNearestNeighbours)
import dask_ml.model_selection as dcv
from sklearn.model_selection import GridSearchCV

# Generate the dataset
X, y = make_classification(n_classes=2, class_sep=1.25, weights=[0.3, 0.7],
                           n_informative=3, n_redundant=1, flip_y=0,
                           n_features=5, n_clusters_per_class=1,
                           n_samples=5000, random_state=10)

# Instantiate a PCA object for the sake of easy visualisation
pca = PCA(n_components=2)

# Create the samplers
enn = EditedNearestNeighbours()
renn = RepeatedEditedNearestNeighbours()

# Create the classifier
knn = KNN(n_neighbors=1)

# Make the splits
X_train, X_test, y_train, y_test = tts(X, y, random_state=42)

# Add one transformer and two samplers to the pipeline object
pipeline = make_pipeline(pca, enn, renn, knn)
param_grid = {"pca__n_components": [1, 2, 3]}

# grid = GridSearchCV(pipeline, param_grid=param_grid)
grid = dcv.GridSearchCV(pipeline, param_grid=param_grid)

grid.fit(X_train, y_train)

Any input would be appreciated.

@TomAugspurger
Member Author

Thanks. So the issue is with dask_ml.model_selection.GridSearchCV? I'm confused about how this would work with scikit-learn, since (AFAIK) fit_resample isn't part of their API.

@sephib

sephib commented Mar 9, 2020

That's the magic of imblearn.pipeline (if you comment out the dcv.GridSearchCV and un-comment the sklearn GridSearchCV, the code runs without any errors).

@TomAugspurger
Member Author

TomAugspurger commented Mar 10, 2020 via email

@glemaitre
Contributor

@TomAugspurger

I started a POC adapting our RandomUnderSampler to natively support dask arrays and dataframes (in/out): scikit-learn-contrib/imbalanced-learn#777

I think we can do something similar for both RandomOverSampler and ClusterCentroids. Neither relies on kNN, which makes it possible for them to work in a distributed setting. The other methods rely on kNN, and I am not sure it would be easy to do much for those.
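
For intuition, here is a rough sketch of why random under-sampling maps onto dask so naturally (a hypothetical helper, not the PR's code): it only needs per-class fractional sampling.

import dask.dataframe as dd

def random_under_sample(ddf, label_col, random_state=0):
    # the smallest class sets the target size; sample each class down to it
    counts = ddf[label_col].value_counts().compute()
    n_min = counts.min()
    parts = [
        ddf[ddf[label_col] == cls].sample(frac=n_min / n, random_state=random_state)
        for cls, n in counts.items()
    ]
    return dd.concat(parts)

Note that dask's sample(frac=...) works per partition, so the resulting counts are approximate - acceptable for random sampling, but not a basis for the kNN-based methods.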

Regarding the integration with the imbalanced-learn Pipeline: our implementation is essentially scikit-learn's, but we check whether a sampler is within the pipeline. This check looks for the fit_resample attribute, which is applied only during fit of the pipeline. Thus, I would say you can safely use imblearn.Pipeline as a replacement for sklearn.Pipeline.
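
Schematically, the fit path does something like this (a hypothetical simplification, not the library's code):

def fit_steps(steps, X, y):
    # steps: list of (name, estimator) pairs, as in a Pipeline
    # (final-estimator handling omitted for brevity)
    Xt, yt = X, y
    for name, step in steps:
        if hasattr(step, "fit_resample"):
            # sampler: applied only while fitting, skipped at predict time
            Xt, yt = step.fit_resample(Xt, yt)
        else:
            Xt = step.fit_transform(Xt, yt)
    return Xt, yt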

I was wondering if you would have a bit of time to check that, on the dask side, we are not doing something stupid in the above PR (I am not yet super familiar with distributed computation).

@sephib

sephib commented Nov 6, 2020

> Regarding the integration with the imbalanced-learn Pipeline: our implementation is essentially scikit-learn's, but we check whether a sampler is within the pipeline. This check looks for the fit_resample attribute, which is applied only during fit of the pipeline. Thus, I would say you can safely use imblearn.Pipeline as a replacement for sklearn.Pipeline.

@TomAugspurger is a PR still relevant? If so, I'll be happy to get some guidance.

@TomAugspurger
Member Author

TomAugspurger commented Nov 6, 2020 via email

@sephib

sephib commented Nov 7, 2020

I guess we can see how @glemaitre's PR goes and then check whether there is anything left to do on the dask-ml side.

@vishalvvs

Does imblearn support Dask natively?
I have been using joblib with parallel_backend="dask", but it does not seem to parallelize my tasks.
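
For reference, the joblib route I used looks roughly like this (a sketch assuming a local cluster and the grid object from the example above): it distributes scikit-learn's internal joblib calls (e.g. the CV fits), but each individual fit still runs on in-memory numpy arrays.

import joblib
from dask.distributed import Client

client = Client()  # local cluster; point at a real scheduler if available
with joblib.parallel_backend("dask"):
    grid.fit(X_train, y_train)  # grid / data as defined earlier in this thread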

@Jose-Bastos

Any updates on this? For example, could I use RandomOverSampler with @glemaitre's PR, perhaps with minor changes? Thank you in advance!
