Benchmark for dataset size before Memory Errors on SMOTENC resampled dataset creation #667

Closed
00krishna opened this issue Dec 18, 2019 · 2 comments

Comments


00krishna commented Dec 18, 2019

Description

I am getting a memory error when using the SMOTENC fit_resample() method on a large dataset. I have about 8 million rows and about 50,000 positive values. I have 5 categorical columns and 1 numeric column in the dataset.

I can try to "thin" my dataset to reduce its size, but I was wondering whether any benchmarking has been done to estimate workable dataset sizes?

Steps/Code to Reproduce

I can post the code here, but I think it will be the same as #300 or similar issues.
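
Roughly, the call looks like the sketch below. The column names, category counts, and row counts are made-up stand-ins for my data, not the actual code (scaled down from ~8M rows / ~50k positives):

import numpy as np
import pandas as pd
from imblearn.over_sampling import SMOTENC

rng = np.random.RandomState(42)
n_rows, n_pos = 1_000_000, 6_000  # stand-in sizes, not the real dataset

# 5 categorical columns (one high-cardinality) and 1 numeric column
df_features = pd.DataFrame({
    "cat_1": rng.randint(0, 5000, n_rows),
    "cat_2": rng.randint(0, 500, n_rows),
    "cat_3": rng.randint(0, 50, n_rows),
    "cat_4": rng.randint(0, 10, n_rows),
    "cat_5": rng.randint(0, 5, n_rows),
    "num_1": rng.randn(n_rows),
})
df_labels = pd.Series(rng.rand(n_rows) < n_pos / n_rows).astype(int)

# columns 0-4 are categorical, column 5 is numeric
smote_nc = SMOTENC(categorical_features=[0, 1, 2, 3, 4], random_state=42)
X_resampled, y_resampled = smote_nc.fit_resample(df_features, df_labels)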

Expected Results

No error should be thrown. I should get the resampled dataset as output.

Actual Results

---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
<ipython-input-11-9a6295703248> in <module>
----> 1 X_resampled, y_resampled = smote_nc.fit_resample(df_features, df_labels)

/media/hayagriva/anaconda3/envs/pNumerical/lib/python3.7/site-packages/imblearn/base.py in fit_resample(self, X, y)
     79         )
     80 
---> 81         output = self._fit_resample(X, y)
     82 
     83         if self._X_columns is not None or self._y_name is not None:

/media/hayagriva/anaconda3/envs/pNumerical/lib/python3.7/site-packages/imblearn/over_sampling/_smote.py in _fit_resample(self, X, y)
    980         X_encoded = sparse.hstack((X_continuous, X_ohe), format="csr")
    981 
--> 982         X_resampled, y_resampled = super()._fit_resample(X_encoded, y)
    983 
    984         # reverse the encoding of the categorical features

/media/hayagriva/anaconda3/envs/pNumerical/lib/python3.7/site-packages/imblearn/over_sampling/_smote.py in _fit_resample(self, X, y)
    727             nns = self.nn_k_.kneighbors(X_class, return_distance=False)[:, 1:]
    728             X_new, y_new = self._make_samples(
--> 729                 X_class, y.dtype, class_sample, X_class, nns, n_samples, 1.0
    730             )
    731             X_resampled.append(X_new)

/media/hayagriva/anaconda3/envs/pNumerical/lib/python3.7/site-packages/imblearn/over_sampling/_smote.py in _make_samples(self, X, y_dtype, y_type, nn_data, nn_num, n_samples, step_size)
    107         cols = np.mod(samples_indices, nn_num.shape[1])
    108 
--> 109         X_new = self._generate_samples(X, nn_data, nn_num, rows, cols, steps)
    110         y_new = np.full(n_samples, fill_value=y_type, dtype=y_dtype)
    111         return X_new, y_new

/media/hayagriva/anaconda3/envs/pNumerical/lib/python3.7/site-packages/imblearn/over_sampling/_smote.py in _generate_samples(self, X, nn_data, nn_num, rows, cols, steps)
   1035         # convert to dense array since scipy.sparse doesn't handle 3D
   1036         nn_data = (nn_data.toarray() if sparse.issparse(nn_data) else nn_data)
-> 1037         all_neighbors = nn_data[nn_num[rows]]
   1038 
   1039         categories_size = [self.continuous_features_.size] + [

MemoryError: Unable to allocate array with shape (8218042, 5, 19611) and data type float64
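
For scale: the dense array that fails to allocate has shape (synthetic samples) x (k neighbors) x (encoded feature columns) in float64. The 19,611 columns come from the one-hot encoding in the traceback above (1 continuous column plus roughly 19,610 category levels across the 5 categorical columns), so the request works out to roughly 6.4 TB:

# size of the dense array NumPy refuses to allocate above
n_samples, k_neighbors, n_features = 8_218_042, 5, 19_611
print(n_samples * k_neighbors * n_features * 8 / 1e12)  # ~6.4 TB of float64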

Versions

Linux-4.15.0-72-generic-x86_64-with-debian-buster-sid
Python 3.7.5 (default, Oct 25 2019, 15:51:11)
[GCC 7.3.0]
NumPy 1.17.4
SciPy 1.3.2
Scikit-Learn 0.22
Imbalanced-Learn 0.6.1

00krishna changed the title from "MemoryError: Unable to allocate array with shape (8218042, 5, 19611) and data type float64" to "Benchmark for dataset size before Memory Errors on SMOTENC resampled dataset creation" on Dec 18, 2019
glemaitre (Member) commented

I don't see what we can do here. If you have 8 million points, at some point we have to compute the nearest-neighbor distances for all of them, which is not tractable; but that is what SMOTE is based on.

00krishna (Author) commented

Yep, I understand what you mean. I found that the system works up to about 2.5 million rows.
I was just trying out different criteria for removing rows before running the resampler, so this works.
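
One way to do that thinning (a sketch, not something confirmed in this thread) is to randomly undersample the majority class first and then run SMOTENC on the reduced data. The 0.02 ratio below is an assumption chosen so that ~50,000 positives leave about 2.5 million majority rows, the ceiling mentioned above; df_features / df_labels are the stand-in frames from the sketch earlier in the thread:

from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTENC

# step 1: randomly drop majority rows until minority/majority == 0.02
under = RandomUnderSampler(sampling_strategy=0.02, random_state=42)
X_under, y_under = under.fit_resample(df_features, df_labels)

# step 2: SMOTENC now runs on the much smaller dataset
smote_nc = SMOTENC(categorical_features=[0, 1, 2, 3, 4], random_state=42)
X_resampled, y_resampled = smote_nc.fit_resample(X_under, y_under)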
