Description
I am getting a memory error when using the SMOTENC fit_resample() method on a large dataset. I have about 8 million rows and about 50,000 positive values. There are 5 categorical columns and 1 numeric column in the dataset.
I can try to "thin" my dataset to reduce it, but I was wondering whether any benchmarking has been done to estimate workable dataset sizes?
Steps/Code to Reproduce
I can post the code here, but I think it will be the same as #300 or similar issues.
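Roughly, the call looks like the sketch below. This is a minimal reconstruction with synthetic placeholder data, not the exact code from this issue: the column names, label encoding, and row count are assumptions, and the real dataset has ~8 million rows.

```python
# Minimal reconstruction of the failing call (synthetic placeholder data;
# the real dataset has ~8 million rows, 5 categorical columns, 1 numeric one).
import numpy as np
import pandas as pd
from imblearn.over_sampling import SMOTENC

rng = np.random.RandomState(42)
n_rows = 100_000  # scaled down; the real frame is ~8,000,000 rows

df_features = pd.DataFrame({
    "cat_1": rng.randint(0, 10, n_rows),
    "cat_2": rng.randint(0, 50, n_rows),
    "cat_3": rng.randint(0, 5, n_rows),
    "cat_4": rng.randint(0, 100, n_rows),
    "cat_5": rng.randint(0, 3, n_rows),
    "num_1": rng.randn(n_rows),
})
# heavily imbalanced binary labels (~0.6% positives, similar to 50k / 8M)
df_labels = pd.Series(rng.binomial(1, 0.006, n_rows), name="label")

# the first five columns are categorical
smote_nc = SMOTENC(categorical_features=[0, 1, 2, 3, 4], random_state=42)
X_resampled, y_resampled = smote_nc.fit_resample(df_features, df_labels)
```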
Expected Results
No error should be thrown. I should get the resampled dataset as output.
Actual Results
---------------------------------------------------------------------------
MemoryError Traceback (most recent call last)
<ipython-input-11-9a6295703248> in <module>
----> 1 X_resampled, y_resampled = smote_nc.fit_resample(df_features, df_labels)
/media/hayagriva/anaconda3/envs/pNumerical/lib/python3.7/site-packages/imblearn/base.py in fit_resample(self, X, y)
79 )
80
---> 81 output = self._fit_resample(X, y)
82
83 if self._X_columns is not None or self._y_name is not None:
/media/hayagriva/anaconda3/envs/pNumerical/lib/python3.7/site-packages/imblearn/over_sampling/_smote.py in _fit_resample(self, X, y)
980 X_encoded = sparse.hstack((X_continuous, X_ohe), format="csr")
981
--> 982 X_resampled, y_resampled = super()._fit_resample(X_encoded, y)
983
984 # reverse the encoding of the categorical features
/media/hayagriva/anaconda3/envs/pNumerical/lib/python3.7/site-packages/imblearn/over_sampling/_smote.py in _fit_resample(self, X, y)
727 nns = self.nn_k_.kneighbors(X_class, return_distance=False)[:, 1:]
728 X_new, y_new = self._make_samples(
--> 729 X_class, y.dtype, class_sample, X_class, nns, n_samples, 1.0
730 )
731 X_resampled.append(X_new)
/media/hayagriva/anaconda3/envs/pNumerical/lib/python3.7/site-packages/imblearn/over_sampling/_smote.py in _make_samples(self, X, y_dtype, y_type, nn_data, nn_num, n_samples, step_size)
107 cols = np.mod(samples_indices, nn_num.shape[1])
108
--> 109 X_new = self._generate_samples(X, nn_data, nn_num, rows, cols, steps)
110 y_new = np.full(n_samples, fill_value=y_type, dtype=y_dtype)
111 return X_new, y_new
/media/hayagriva/anaconda3/envs/pNumerical/lib/python3.7/site-packages/imblearn/over_sampling/_smote.py in _generate_samples(self, X, nn_data, nn_num, rows, cols, steps)
1035 # convert to dense array since scipy.sparse doesn't handle 3D
1036 nn_data = (nn_data.toarray() if sparse.issparse(nn_data) else nn_data)
-> 1037 all_neighbors = nn_data[nn_num[rows]]
1038
1039 categories_size = [self.continuous_features_.size] + [
MemoryError: Unable to allocate array with shape (8218042, 5, 19611) and data type float64
00krishna changed the title from "MemoryError: Unable to allocate array with shape (8218042, 5, 19611) and data type float64" to "Benchmark for dataset size before Memory Errors on SMOTENC resampled dataset creation" on Dec 18, 2019
I don't see what we can do here. If you have 8 million points, at some point we have to compute the neighbor distances for them, which is not tractable, but this is what SMOTE is based on.
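As a rough back-of-the-envelope check (using only the shape and dtype reported in the MemoryError above), the dense neighbor array that SMOTE tries to build here would need on the order of 6.4 TB:

```python
# Size of the dense array the MemoryError reports
# (shape and dtype taken verbatim from the traceback above).
n_samples, k_neighbors, n_features = 8218042, 5, 19611
bytes_needed = n_samples * k_neighbors * n_features * 8  # 8 bytes per float64
print(f"{bytes_needed / 1e12:.1f} TB")  # ~6.4 TB
```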
Yep, I understand what you mean. I found that the system works up to about 2.5 million rows.
I was just trying out different criteria for removing rows and then running the resampler, and that works.
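For reference, a minimal sketch of one way to do that thinning: this is an assumed approach, not the exact code used in this issue. It under-samples the majority class with imbalanced-learn's RandomUnderSampler before running SMOTENC, assumes the labels are 0/1, and reuses the df_features / df_labels / smote_nc names from the sketch above.

```python
# Assumed workaround sketch: thin the majority class first, then oversample.
from imblearn.under_sampling import RandomUnderSampler

# keep ~2.5 million majority rows and all ~50,000 minority rows
under = RandomUnderSampler(
    sampling_strategy={0: 2_500_000, 1: 50_000},  # assumes labels are 0/1
    random_state=42,
)
X_thin, y_thin = under.fit_resample(df_features, df_labels)

# SMOTENC on the thinned data stays within a workable size
X_resampled, y_resampled = smote_nc.fit_resample(X_thin, y_thin)
```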
Versions
Linux-4.15.0-72-generic-x86_64-with-debian-buster-sid
Python 3.7.5 (default, Oct 25 2019, 15:51:11)
[GCC 7.3.0]
NumPy 1.17.4
SciPy 1.3.2
Scikit-Learn 0.22
Imbalanced-Learn 0.6.1