Samplers / pipelines for imbalanced datasets #317

Open
TomAugspurger opened this issue Jul 27, 2018 · 16 comments
Labels
Roadmap Larger, self-contained pieces of work.

Comments

@TomAugspurger
Member

TomAugspurger commented Jul 27, 2018

Imbalanced datasets, where the classes have very different occurrence rates, can show up in large datasets as well.

There are many strategies for dealing with imbalanced data. imbalanced-learn (http://contrib.scikit-learn.org/imbalanced-learn/stable/api.html) implements a set of them, some of which could be scaled to large datasets with dask.

@TomAugspurger TomAugspurger added the Roadmap Larger, self-contained pieces of work. label Aug 16, 2018
@sephib

sephib commented Feb 27, 2020

Hi,
I think most of the change would be introducing support for fit_resample and fit_sample into the fit_transform method.
I'll be happy to assist on this issue.

@TomAugspurger
Member Author

@sephib do you have any examples of fit_resample and fit_sample? I'm not familiar with them.

@sephib

sephib commented Mar 3, 2020

The core fit_resample function lives in imblearn/base.py.
It is used throughout the imblearn library - for example, here is the implementation within the imblearn pipeline.
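
To make the contract concrete, here is a minimal sketch (the toy dataset and RandomUnderSampler are just illustrations): unlike a transformer, a sampler returns a resampled (X, y) pair and may change the number of rows.

from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler

# 90/10 imbalanced toy dataset
X, y = make_classification(weights=[0.9, 0.1], n_samples=1000, random_state=0)

sampler = RandomUnderSampler(random_state=0)
X_res, y_res = sampler.fit_resample(X, y)  # balanced classes, fewer rows
# sampler.transform(X) would raise AttributeError: samplers define no transform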

@TomAugspurger
Member Author

TomAugspurger commented Mar 3, 2020 via email

@sephib

sephib commented Mar 4, 2020

Currently, when I run dask-ml with an imblearn pipeline, I get an error:

AttributeError: 'FunctionSampler' object has no attribute 'transform'

This comes from the fit_transform function in dask_ml/model_selection/methods.py, which looks for a fit_transform attribute, or for fit and transform attributes (which in imblearn are "converted" to fit_resample).
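
Roughly, the dispatch there looks like this (a simplified sketch, not the actual dask-ml source):

def fit_transform(est, X, y, **fit_params):
    # dask-ml assumes the transformer interface...
    if hasattr(est, "fit_transform"):
        Xt = est.fit_transform(X, y, **fit_params)
    else:
        est.fit(X, y, **fit_params)
        # ...but samplers only define fit_resample, so this attribute
        # lookup fails with the AttributeError above
        Xt = est.transform(X)
    return Xt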

@TomAugspurger
Member Author

It would help to have a minimal example: http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports

@sephib

sephib commented Mar 9, 2020

Hi
Here is sample code that gets past dask_ml/model_selection/methods.py. Unfortunately it still fails inside imblearn/base.py, but I think that may be an issue with the example itself.

This is after amending the file to use

from imblearn.pipeline import Pipeline

instead of

from sklearn.pipeline import Pipeline

and adding these lines to the fit_transform function after line 260:

elif hasattr(est, "fit_resample"):
    Xt = est.fit_resample(X, y, **fit_params)
The sample code:

from sklearn.model_selection import train_test_split as tts
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.metrics import classification_report
from sklearn.neighbors import KNeighborsClassifier as KNN
from imblearn.pipeline import make_pipeline
from imblearn.under_sampling import (EditedNearestNeighbours,
                                     RepeatedEditedNearestNeighbours)
import dask_ml.model_selection as dcv
from sklearn.model_selection import GridSearchCV

# Generate the dataset
X, y = make_classification(n_classes=2, class_sep=1.25, weights=[0.3, 0.7],
                           n_informative=3, n_redundant=1, flip_y=0,
                           n_features=5, n_clusters_per_class=1,
                           n_samples=5000, random_state=10)

# Instantiate a PCA object for the sake of easy visualisation
pca = PCA(n_components=2)

# Create the samplers
enn = EditedNearestNeighbours()
renn = RepeatedEditedNearestNeighbours()

# Create the classifier
knn = KNN(n_neighbors=1)

# Make the splits
X_train, X_test, y_train, y_test = tts(X, y, random_state=42)

# Add one transformer and two samplers to the pipeline object
pipeline = make_pipeline(pca, enn, renn, knn)
param_grid = {"pca__n_components": [1, 2, 3]}

# grid = GridSearchCV(pipeline, param_grid=param_grid)
grid = dcv.GridSearchCV(pipeline, param_grid=param_grid)

grid.fit(X_train, y_train)

Any input would be appreciated.

@TomAugspurger
Member Author

Thanks. So the issue is with dask_ml.model_selection.GridSearchCV? I'm confused about how this would work with scikit-learn, since (AFAIK) fit_resample isn't part of their API.

@sephib

sephib commented Mar 9, 2020

That's the magic of imblearn.pipeline (if you comment out the dcv.GridSearchCV and un-comment the sklearn GridSearchCV, the code runs without any errors).

@TomAugspurger
Member Author

TomAugspurger commented Mar 10, 2020 via email

@glemaitre
Contributor

@TomAugspurger

I started a POC adapting our RandomUnderSampler to natively support dask arrays and dataframes (in/out): scikit-learn-contrib/imbalanced-learn#777

I think we can do something similar for both RandomOverSampler and ClusterCentroids. Neither relies on kNN, which makes it possible for them to work in a distributed setting. The other methods rely on kNN, and I am not sure it would be easy to do much for those.
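
For intuition, here is a rough sketch of why random under-sampling maps onto dask so naturally (a hypothetical helper, not the PR's code): it only needs per-class fractional sampling.

import dask.dataframe as dd

def random_under_sample(ddf, label_col, random_state=0):
    # the smallest class sets the target size; sample each class down to it
    counts = ddf[label_col].value_counts().compute()
    n_min = counts.min()
    parts = [
        ddf[ddf[label_col] == cls].sample(frac=n_min / n, random_state=random_state)
        for cls, n in counts.items()
    ]
    return dd.concat(parts)

Note that dask's sample(frac=...) works per partition, so the resulting counts are approximate - acceptable for random sampling, but not a basis for the kNN-based methods.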

Regarding the integration with the imbalanced-learn Pipeline: our implementation is essentially scikit-learn's, but we check whether a sampler is within the pipeline. This check looks for the fit_resample attribute, which is applied only during fit of the pipeline. Thus, I would say you can safely use imblearn.Pipeline as a replacement for sklearn.Pipeline.
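
Schematically, the fit path does something like this (a hypothetical simplification, not the library's code):

def fit_steps(steps, X, y):
    # steps: list of (name, estimator) pairs, as in a Pipeline
    # (final-estimator handling omitted for brevity)
    Xt, yt = X, y
    for name, step in steps:
        if hasattr(step, "fit_resample"):
            # sampler: applied only while fitting, skipped at predict time
            Xt, yt = step.fit_resample(Xt, yt)
        else:
            Xt = step.fit_transform(Xt, yt)
    return Xt, yt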

I was wondering if you would have a bit of time to check that, on the dask side, we are not doing something stupid in the above PR (I am not yet super familiar with distributed computation).

@sephib

sephib commented Nov 6, 2020

> Regarding the integration with the imbalanced-learn Pipeline: our implementation is essentially scikit-learn's, but we check whether a sampler is within the pipeline. This check looks for the fit_resample attribute, which is applied only during fit of the pipeline. Thus, I would say you can safely use imblearn.Pipeline as a replacement for sklearn.Pipeline.

@TomAugspurger is a PR still relevant? If so, I'll be happy to get some guidance.

@TomAugspurger
Member Author

TomAugspurger commented Nov 6, 2020 via email

@sephib

sephib commented Nov 7, 2020

I guess we can see how @glemaitre's PR goes and then check whether there is anything left to do on the dask-ml side.

@vishalvvs

Does imblearn support Dask natively?
I have been using joblib with parallel_backend="dask", but it does not seem to parallelize my tasks.
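
For reference, the joblib route I used looks roughly like this (a sketch assuming a local cluster and the grid object from the example above): it distributes scikit-learn's internal joblib calls (e.g. the CV fits), but each individual fit still runs on in-memory numpy arrays.

import joblib
from dask.distributed import Client

client = Client()  # local cluster; point at a real scheduler if available
with joblib.parallel_backend("dask"):
    grid.fit(X_train, y_train)  # grid / data as defined earlier in this thread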

@Jose-Bastos

Any updates on this? For example, could I use RandomOverSampler with @glemaitre's PR, perhaps with minor changes? Thank you in advance!
