Add support for imblearn's Pipeline and Samplers. #638


Closed
wants to merge 23 commits

Conversation


@xKHUNx xKHUNx commented Apr 10, 2020

I'm using dask-ml to parallelize my sklearn pipeline, which consists of imblearn pipeline components. I hacked the code to make it work.

This would be extremely useful for people that are using both libraries.

Member

@TomAugspurger TomAugspurger left a comment


Seems like there are some test failures. I haven't looked closely at them yet. Can you reproduce them locally?

cc @glemaitre, if you have any thoughts on how this should be supported.

@@ -326,7 +330,7 @@ def do_fit_and_score(
scorer,
return_train_score,
):
if not isinstance(est, Pipeline):
Member


Does imblearn.pipeline.Pipeline not subclass sklearn.pipeline.Pipeline?

Contributor


We do.

@glemaitre
Contributor

cc @glemaitre, if you have any thoughts on how this should be supported.

Basically, we just inherit from Pipeline and allow fit_resample to be called during fit of the Pipeline.
Actually, it would be interesting to have some supported resamplers in dask: random under- and over-sampling. The other samplers would not make much sense with large datasets, and I am not sure you can generate samples without a full view of the dataset. However, a random sampler could be handy for resampling on each node.
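The fit_resample contract described here can be illustrated with a toy random under-sampler. This is a hedged sketch, not imblearn or dask-ml code; ToyRandomUnderSampler is a made-up name:

```python
# A minimal sketch of the fit_resample contract: unlike fit_transform,
# a sampler returns a modified (X, y) pair, so y can change length.
import numpy as np

class ToyRandomUnderSampler:
    """Hypothetical sampler: keep only as many rows per class as the
    smallest class has, chosen at random."""
    def __init__(self, random_state=0):
        self.random_state = random_state

    def fit_resample(self, X, y):
        rng = np.random.RandomState(self.random_state)
        classes, counts = np.unique(y, return_counts=True)
        n_min = counts.min()
        # Randomly keep n_min indices from each class.
        keep = np.concatenate([
            rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
            for c in classes
        ])
        keep.sort()
        return X[keep], y[keep]

X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)  # imbalanced: 8 vs 2
Xr, yr = ToyRandomUnderSampler().fit_resample(X, y)
# yr now holds 2 samples of each class, so len(yr) == 4
```

Because such a sampler only needs row indices, a random version could in principle run per-partition on a dask array, which is the case glemaitre suggests is feasible.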

Member

@TomAugspurger TomAugspurger left a comment


@xKHUNx can you summarize the high-level changes? I'm a little concerned that this is affecting so many small places? For example, why does the return type of fit_transform need to be changed? Can that be limited to a new fit_reshape method that returns the transformed yt as well?

Comment on lines +34 to +37
try:
from imblearn.pipeline import Pipeline
except:
from sklearn.pipeline import Pipeline
Member


Why is this needed? Can we always use sklearn.pipeline.Pipeline and check for hasattr(x, "fit_resample")?
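The duck-typed check suggested here can be sketched with toy steps. This is an illustration, not dask-ml's actual code; fit_step and the two toy classes are made-up names:

```python
# A minimal sketch of dispatching on hasattr(step, "fit_resample")
# instead of importing imblearn's Pipeline class.
import numpy as np

class DoubleTransformer:
    """Ordinary transformer: returns a new X, leaves y untouched."""
    def fit_transform(self, X, y=None):
        return X * 2

class DropLastSampler:
    """Toy sampler: may change both X and y, including y's length."""
    def fit_resample(self, X, y):
        return X[:-1], y[:-1]

def fit_step(step, X, y):
    """Fit one pipeline step, resampling when the step supports it."""
    if hasattr(step, "fit_resample"):
        return step.fit_resample(X, y)
    # Ordinary transformers pass y through unchanged.
    return step.fit_transform(X, y), y

X = np.array([[1.0], [2.0], [3.0]])
y = np.array([0, 1, 1])
Xt, yt = fit_step(DoubleTransformer(), X, y)  # y unchanged
Xs, ys = fit_step(DropLastSampler(), X, y)    # one row dropped from both
```

With this kind of dispatch, only the base sklearn.pipeline.Pipeline would be needed, which is the point of the question above.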

Author

@xKHUNx xKHUNx Apr 13, 2020


@xKHUNx can you summarize the high-level changes? I'm a little concerned that this is affecting so many small places? For example, why does the return type of fit_transform need to be changed? Can that be limited to a new fit_reshape method that returns the transformed yt as well?

I assume you meant fit_resample; correct me if I am wrong. My high-level idea is that yt needs to be propagated throughout the pipeline, since a component that implements fit_resample might change the y values. The code works for my personal use case, where I run a RandomizedSearchCV with an imblearn component in my pipeline. I can't post my code here as it is related to my work.

Why is this needed? Can we always use sklearn.pipeline.Pipeline and check for hasattr(x, "fit_resample")?

This is needed when the refit option is selected. If I understand correctly, with refit selected, dask-ml fits the pipeline as a whole, without going through each component in the pipeline individually. In this scenario, a plain sklearn pipeline won't handle a pipeline component that implements fit_resample instead of fit_transform. If imblearn's pipeline is used instead, it behaves as it should.

@stsievert
Member

Would this PR close #317? It certainly seems related.

@TomAugspurger
Member

TomAugspurger commented May 1, 2020

@xKHUNx can you include a test for these changes (it will need to include imblearn in the environment file)? It'd be helpful to have an example to play with.

@sephib

sephib commented May 10, 2020

Would this PR close #317? It certainly seems related.

Yes

@TomAugspurger
Member

@xKHUNx are you able to provide a simple test / example using this? That'd help me with reviewing.

@TomAugspurger
Member

@xKHUNx are you able to add tests here?

@xKHUNx
Author

xKHUNx commented Jul 28, 2020

I have not tested it myself, but this should do:

```python
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from imblearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV
from imblearn.over_sampling import SMOTE
from scipy.stats import loguniform
from dask_ml.model_selection import RandomizedSearchCV
from dask.distributed import Client

# Initiate Dask client
client = Client('127.0.0.1:8786')

# Fix random seed
np.random.seed(0)

# Read data from Titanic dataset.
titanic_url = ('https://github.com/raw/amueller/'
               'scipy-2017-sklearn/091d371/notebooks/datasets/titanic3.csv')
data = pd.read_csv(titanic_url)

# We create the preprocessing pipelines for both numeric and categorical data.
numeric_features = ['age', 'fare']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_features = ['embarked', 'sex', 'pclass']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

# Append classifier to preprocessing pipeline.
# Now we have a full prediction pipeline.
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('smote', SMOTE(sampling_strategy='all')),
                      ('classifier', LogisticRegression(solver='lbfgs'))])

X = data.drop('survived', axis=1)
y = data['survived']

params = {
    'classifier__C': loguniform(0.001, 1)
}

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

final_clf = RandomizedSearchCV(clf, n_iter=10, param_distributions=params,
                               cv=3, n_jobs=-1, refit=True)
final_clf.fit(X_train, y_train)
print("model score: %.3f" % final_clf.score(X_test, y_test))
```

I'm not familiar with writing test scripts, let alone Dask's. So it would be great if someone could turn this into a test script.
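A condensed version of the example above could serve as the requested test. This is a rough sketch under stated assumptions: the function name is my own, imblearn availability is guarded by a try/except (a real suite would use a pytest skip), and sklearn's GridSearchCV stands in for dask_ml.model_selection.GridSearchCV, which is what the actual dask-ml test would use once this PR's changes apply:

```python
# Sketch of a test exercising an imblearn Pipeline inside a CV search.
try:
    from imblearn.over_sampling import RandomOverSampler
    from imblearn.pipeline import Pipeline
    HAVE_IMBLEARN = True
except ImportError:
    HAVE_IMBLEARN = False


def test_imblearn_pipeline_search():
    # In a real suite this guard would be pytest.importorskip("imblearn").
    if not HAVE_IMBLEARN:
        return
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    # Stand-in searcher; the dask-ml test would import
    # dask_ml.model_selection.GridSearchCV instead.
    from sklearn.model_selection import GridSearchCV

    # Small imbalanced problem so the sampler actually does something.
    X, y = make_classification(n_samples=200, weights=[0.8, 0.2],
                               random_state=0)
    clf = Pipeline(steps=[
        ("sample", RandomOverSampler(random_state=0)),
        ("classifier", LogisticRegression()),
    ])
    search = GridSearchCV(clf, {"classifier__C": [0.1, 1.0]},
                          cv=3, refit=True)
    search.fit(X, y)
    # refit=True must produce a fitted best_estimator_ that can score.
    assert hasattr(search, "best_estimator_")
    assert 0.0 <= search.score(X, y) <= 1.0
```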

@TomAugspurger
Member

I won't have time, unfortunately. Are you interested in working on this PR still?

@xKHUNx
Author

xKHUNx commented Jul 29, 2020

Are you interested in working on this PR still?

This PR is good enough for my personal use case; I don't see the need to improve it in the near future.

@TomAugspurger
Member

OK. We would need to add tests, clean up a few things, and clarify a couple of changes before this can be merged. Let me know if you want to pick it up again.

5 participants