Samplers / pipelines for imbalanced datasets #317
Comments
Hi, |
@sephib do you have any examples of fit_resample and fit_sample? I'm not familiar with them. |
The core fit_resample function is from within imblearn/base.py. |
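For readers unfamiliar with the method, here is a minimal sketch of what a sampler exposing `fit_resample(X, y)` does. The class `ToyRandomUnderSampler` is hypothetical and written in plain Python for illustration; it is not imblearn's implementation. It only assumes the convention discussed in this thread: the method returns a resampled `X` and `y` together, because the number of rows changes.

```python
import random
from collections import Counter, defaultdict


class ToyRandomUnderSampler:
    """Hypothetical stand-in for a sampler; not imblearn's implementation."""

    def __init__(self, random_state=0):
        self.random_state = random_state

    def fit_resample(self, X, y):
        # Group row indices by class label.
        by_class = defaultdict(list)
        for i, label in enumerate(y):
            by_class[label].append(i)
        # Cut every class down to the size of the rarest class.
        n_min = min(len(idx) for idx in by_class.values())
        rng = random.Random(self.random_state)
        keep = []
        for idx in by_class.values():
            keep.extend(rng.sample(idx, n_min))
        keep.sort()  # preserve the original row order
        return [X[i] for i in keep], [y[i] for i in keep]


# Imbalanced toy data: 8 rows of class 0, 2 rows of class 1.
X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2
X_res, y_res = ToyRandomUnderSampler().fit_resample(X, y)
print(Counter(y_res))  # both classes now have the same number of rows
```

Unlike a transformer's `transform(X)`, the sampler has to return `y` as well, which is why a pipeline needs to treat it differently from an ordinary step.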
Thanks. The standard `sklearn.pipeline.Pipeline` works well with dask
containers. Does the one in imblearn work with Dask objects? If not, what
breaks?
…On Tue, Mar 3, 2020 at 3:43 AM sephib ***@***.***> wrote:
The core fit_resample
<https://github.com/scikit-learn-contrib/imbalanced-learn/blob/6b3c5ae/imblearn/base.py#L54>
function is from within imblearn/base.py
<https://github.com/scikit-learn-contrib/imbalanced-learn/blob/6b3c5ae/imblearn/base.py>
.
It is incorporated throughout the imblearn library - for example here is
the implementation within imblearn pipeline
<https://github.com/scikit-learn-contrib/imbalanced-learn/blob/6b3c5aed61f2e5dc0e8af87d97ea92b95dcafdd0/imblearn/pipeline.py#L333>
|
Currently when I ran
This is from the dask_ml/model_selection/method.py fit_transform function which is looking for a |
It would help to have a minimal example: http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports |
Hi when amending the file with
instead of
and adding these lines into the fit_transform function after line 260
Any inputs would be appreciated |
Thanks. So the issue is with dask_ml.model_selection.GridSearchCV? I'm confused about how this would work with scikit-learn, since (AFAIK) fit_resample isn't part of their API. |
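That's the magic of imblearn.pipeline (if you un-comment the dvc.GridSearchCV and leave the sklearn GridSearchCV the code runs without any errors). |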
I don't really see how that would work. But feel free to propose changes in
a PR and we can discuss that there.
…On Mon, Mar 9, 2020 at 4:50 PM sephib ***@***.***> wrote:
That's the magic of imblearn.pipeline (if you un-comment the
dvc.GridSearchCV and leave the sklearn GridSearchCV the code runs without
any errors).
|
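I started a POC to adapt our
I think that we can do something similar for both
Regarding the integration with the imbalanced-learn Pipeline, our implementation is exactly the one of scikit-learn but we check if a sampler is within the pipeline. This check looks for the attribute fit_resample which would be applied only during fit of the pipeline. Thus, I would say that you can safely use imblearn.Pipeline in replacement of the sklearn.Pipeline.
I was wondering if you would have a bit of time to check that, on the dask part, we don't implement something stupid in the above PR (I am not super familiar yet with distributed computation). |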
@TomAugspurger is a PR still relevant? If so, I'll be happy to get some guidance. |
I'm not sure what's required, but perhaps imbalanced-learn's Pipeline will
just be able to accept Dask collections after that pull request? I don't
know what estimators like GridSearchCV need to do (if anything) to work
with imbalanced-learn pipelines.
…On Fri, Nov 6, 2020 at 7:37 AM sephib ***@***.***> wrote:
Regarding the integration with the imbalanced-learn Pipeline, our
implementation is exactly the one of scikit-learn but we check if a sampler
is within the pipeline. This check looks for the attribute fit_resample
which would be applied only during fit of the pipeline. Thus, I would say
that you can safely use imblearn.Pipeline in replacement of the
sklearn.Pipeline.
@TomAugspurger <https://github.com/TomAugspurger> is a PR still relevant?
if so i'll be happy to get some guidance
|
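The check described above (look for a `fit_resample` attribute, and apply it only during `fit`) can be sketched roughly as follows. All class names here (`ToyPipeline`, `KeepFirstPerClass`, `MajorityClassifier`) are hypothetical and exist only for illustration; this is not the imblearn or scikit-learn `Pipeline` code, just a plain-Python sketch of the duck-typing idea.

```python
class KeepFirstPerClass:
    """Toy sampler: keeps at most `n` rows of each class."""

    def __init__(self, n):
        self.n = n

    def fit_resample(self, X, y):
        seen = {}
        X_out, y_out = [], []
        for row, label in zip(X, y):
            if seen.get(label, 0) < self.n:
                seen[label] = seen.get(label, 0) + 1
                X_out.append(row)
                y_out.append(label)
        return X_out, y_out


class MajorityClassifier:
    """Toy estimator: always predicts the most frequent training label."""

    def fit(self, X, y):
        self.n_rows_seen_ = len(X)
        # sorted() makes the tie-break deterministic (smallest label wins).
        self.majority_ = max(sorted(set(y)), key=y.count)
        return self

    def predict(self, X):
        return [self.majority_ for _ in X]


class ToyPipeline:
    """Hypothetical sketch; not the imblearn or scikit-learn Pipeline."""

    def __init__(self, steps):
        self.steps = steps  # the last step is the final estimator

    def fit(self, X, y):
        for step in self.steps[:-1]:
            if hasattr(step, "fit_resample"):
                # Samplers change the number of rows, so X and y both change.
                X, y = step.fit_resample(X, y)
            else:
                X = step.fit(X, y).transform(X)
        self.steps[-1].fit(X, y)
        return self

    def predict(self, X):
        for step in self.steps[:-1]:
            if hasattr(step, "fit_resample"):
                continue  # samplers are skipped outside of fit
            X = step.transform(X)
        return self.steps[-1].predict(X)


X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2
pipe = ToyPipeline([KeepFirstPerClass(2), MajorityClassifier()]).fit(X, y)
preds = pipe.predict(X)
```

In this sketch the estimator is fitted on the 4 resampled rows, while `predict` still returns one prediction per input row, since the sampler is not applied at prediction time. Whether a search estimator such as GridSearchCV handles this correctly depends on whether it calls the pipeline's own `fit` or re-implements the step-by-step fitting itself.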
I guess we can see how @glemaitre's PR goes and then see if there is anything else to do on the dask-ml side. |
Does imblearn support Dask natively? |
Any updates on this? For example, could I use RandomOverSampler if I use @glemaitre's PR with minor changes? Thank you in advance! |
Imbalanced datasets, where the classes have very different occurrence rates, can show up in large data sets.
There are many strategies for dealing with imbalanced data. imbalanced-learn (http://contrib.scikit-learn.org/imbalanced-learn/stable/api.html) implements a set of them, some of which could be scaled to large datasets with dask.