Benchmark for dataset size before Memory Errors on SMOTENC resampled dataset creation #667

Closed
00krishna opened this issue Dec 18, 2019 · 2 comments

Comments


00krishna commented Dec 18, 2019

Description

I am getting a memory error when using the SMOTENC fit_resample() method on a large dataset. I have about 8 million rows and about 50,000 positive values. I have 5 categorical columns and 1 numeric column in the dataset.

I can try to "thin" my dataset to reduce its size, but I was wondering whether any benchmarking has been done to estimate workable dataset sizes?

Steps/Code to Reproduce

I can post the code here, but I think it will be the same as #300 or similar issues.
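
Roughly, the call looks like the sketch below. The column names, category counts, and row counts are made-up stand-ins for my data, not the actual code (scaled down from ~8M rows / ~50k positives):

import numpy as np
import pandas as pd
from imblearn.over_sampling import SMOTENC

rng = np.random.RandomState(42)
n_rows, n_pos = 1_000_000, 6_000  # stand-in sizes, not the real dataset

# 5 categorical columns (one high-cardinality) and 1 numeric column
df_features = pd.DataFrame({
    "cat_1": rng.randint(0, 5000, n_rows),
    "cat_2": rng.randint(0, 500, n_rows),
    "cat_3": rng.randint(0, 50, n_rows),
    "cat_4": rng.randint(0, 10, n_rows),
    "cat_5": rng.randint(0, 5, n_rows),
    "num_1": rng.randn(n_rows),
})
df_labels = pd.Series(rng.rand(n_rows) < n_pos / n_rows).astype(int)

# columns 0-4 are categorical, column 5 is numeric
smote_nc = SMOTENC(categorical_features=[0, 1, 2, 3, 4], random_state=42)
X_resampled, y_resampled = smote_nc.fit_resample(df_features, df_labels)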

Expected Results

No error should be thrown. I should get the resampled dataset as output.

Actual Results

---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
<ipython-input-11-9a6295703248> in <module>
----> 1 X_resampled, y_resampled = smote_nc.fit_resample(df_features, df_labels)

/media/hayagriva/anaconda3/envs/pNumerical/lib/python3.7/site-packages/imblearn/base.py in fit_resample(self, X, y)
     79         )
     80 
---> 81         output = self._fit_resample(X, y)
     82 
     83         if self._X_columns is not None or self._y_name is not None:

/media/hayagriva/anaconda3/envs/pNumerical/lib/python3.7/site-packages/imblearn/over_sampling/_smote.py in _fit_resample(self, X, y)
    980         X_encoded = sparse.hstack((X_continuous, X_ohe), format="csr")
    981 
--> 982         X_resampled, y_resampled = super()._fit_resample(X_encoded, y)
    983 
    984         # reverse the encoding of the categorical features

/media/hayagriva/anaconda3/envs/pNumerical/lib/python3.7/site-packages/imblearn/over_sampling/_smote.py in _fit_resample(self, X, y)
    727             nns = self.nn_k_.kneighbors(X_class, return_distance=False)[:, 1:]
    728             X_new, y_new = self._make_samples(
--> 729                 X_class, y.dtype, class_sample, X_class, nns, n_samples, 1.0
    730             )
    731             X_resampled.append(X_new)

/media/hayagriva/anaconda3/envs/pNumerical/lib/python3.7/site-packages/imblearn/over_sampling/_smote.py in _make_samples(self, X, y_dtype, y_type, nn_data, nn_num, n_samples, step_size)
    107         cols = np.mod(samples_indices, nn_num.shape[1])
    108 
--> 109         X_new = self._generate_samples(X, nn_data, nn_num, rows, cols, steps)
    110         y_new = np.full(n_samples, fill_value=y_type, dtype=y_dtype)
    111         return X_new, y_new

/media/hayagriva/anaconda3/envs/pNumerical/lib/python3.7/site-packages/imblearn/over_sampling/_smote.py in _generate_samples(self, X, nn_data, nn_num, rows, cols, steps)
   1035         # convert to dense array since scipy.sparse doesn't handle 3D
   1036         nn_data = (nn_data.toarray() if sparse.issparse(nn_data) else nn_data)
-> 1037         all_neighbors = nn_data[nn_num[rows]]
   1038 
   1039         categories_size = [self.continuous_features_.size] + [

MemoryError: Unable to allocate array with shape (8218042, 5, 19611) and data type float64
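
For scale: the dense array that fails to allocate has shape (synthetic samples) x (k neighbors) x (encoded feature columns) in float64. The 19,611 columns come from the one-hot encoding in the traceback above (1 continuous column plus roughly 19,610 category levels across the 5 categorical columns), so the request works out to roughly 6.4 TB:

# size of the dense array NumPy refuses to allocate above
n_samples, k_neighbors, n_features = 8_218_042, 5, 19_611
print(n_samples * k_neighbors * n_features * 8 / 1e12)  # ~6.4 TB of float64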

Versions

Linux-4.15.0-72-generic-x86_64-with-debian-buster-sid
Python 3.7.5 (default, Oct 25 2019, 15:51:11)
[GCC 7.3.0]
NumPy 1.17.4
SciPy 1.3.2
Scikit-Learn 0.22
Imbalanced-Learn 0.6.1

00krishna changed the title from "MemoryError: Unable to allocate array with shape (8218042, 5, 19611) and data type float64" to "Benchmark for dataset size before Memory Errors on SMOTENC resampled dataset creation" on Dec 18, 2019
glemaitre (Member) commented

I don't see what we can do here. If you have 8 million points, at some point we have to compute the nearest-neighbor distances for all of them, which is not tractable; but that is what SMOTE is based on.

00krishna (Author) commented

Yep, I understand what you mean. I found that the system works up to about 2.5 million rows.
I was just trying out different criteria for removing rows before running the resampler, so this works.
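
One way to do that thinning (a sketch, not something confirmed in this thread) is to randomly undersample the majority class first and then run SMOTENC on the reduced data. The 0.02 ratio below is an assumption chosen so that ~50,000 positives leave about 2.5 million majority rows, the ceiling mentioned above; df_features / df_labels are the stand-in frames from the sketch earlier in the thread:

from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTENC

# step 1: randomly drop majority rows until minority/majority == 0.02
under = RandomUnderSampler(sampling_strategy=0.02, random_state=42)
X_under, y_under = under.fit_resample(df_features, df_labels)

# step 2: SMOTENC now runs on the much smaller dataset
smote_nc = SMOTENC(categorical_features=[0, 1, 2, 3, 4], random_state=42)
X_resampled, y_resampled = smote_nc.fit_resample(X_under, y_under)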
