
What should I do to handle categorical variables? #33


Closed
delilyn opened this issue Mar 5, 2016 · 18 comments

Comments

@delilyn

delilyn commented Mar 5, 2016

First, thanks for sharing these tools with us.
I want to generate synthetic samples with the SMOTE algorithm, but some of my features are categorical, like region, gender and so on. I want to know how to handle these categorical variables so that the generated samples have the same types. I can't find any explanation in the documentation.
Thank you!

@fmfn
Collaborator

fmfn commented Mar 7, 2016

It is assumed that you will first vectorize your categorical features with your preferred method.

SMOTE and its variations work by calculating distances between examples from the majority and minority classes. In order to calculate such distances, your data has to be formatted as one feature vector per entry. That means categorical features must first be encoded as numerical values (e.g. with one-hot encoding) before being passed to the object.

At the end of the day, the SMOTE method (and all methods in this package, for that matter) takes as input a design matrix with all entries being numbers, in addition to the respective labels.

Does that help?

@fmfn fmfn closed this as completed Mar 7, 2016
@jacobmontiel

@fmfn @glemaitre
Related discussion
Oversampling with categorical variables

Weka gets a data set with categorical "C" and numerical "N" features and returns an over-sampled data set keeping the same data types:

Input
schema: [C | N | N | C | N]
samples = n

Output
schema: [C | N | N | C | N]
samples = n + ratio*minority_class_samples

Reference code
Weka - SMOTE.java

@nchen9191

I encoded my categorical variables to integers using pandas' factorize method. But it seems SMOTE still treated these variables as continuous and thus created new data where the entries for these categorical variables look like 0.954 or 0.145. Is that supposed to happen? I read somewhere that it may be safe just to round these numbers back to integers, but that seems a little unsafe to me. Please advise.

Thanks!

@glemaitre
Member

@nchen9191 The SMOTE implementation is only for continuous data. We have not yet implemented SMOTE-NC, which should deal with categorical features.

@nishkalavallabhi

Is it still only for continuous data?

@glemaitre
Member

glemaitre commented Jul 10, 2017 via email

@dbarrundiag

@glemaitre Hi, I was just wondering if certain algorithms like RandomUnderSampler, which do not calculate distances between examples from the majority and minority classes, could more easily be extended to handle categorical variables? Thank you very much!

@glemaitre
Member

Yep, we need to accept strings as input. Right now check_X_y accepts only numeric values.
So those algorithms could override the function to handle such data.

@parulsahi

But doesn't the SMOTE algorithm use a majority rule to find the value of a categorical variable from the neighbors being considered?
I am using SMOTE but it is converting a nominal variable with categories (0 and 1) into continuous values between 0 and 1.
Is there a solution, maybe a modification to apply to the categorical variables before feeding them into the SMOTE function?
Thank you.

@glemaitre
Member

Use SMOTENC for a mix of categorical and continuous variables.

@atendra12

atendra12 commented Oct 22, 2018

Hi,

Thanks for this wonderful package for handling class imbalance. I am trying to use SMOTENC but am getting stuck on a "memory error" during the fit_resample method. I have already converted the dtypes and made them as small as possible, yet the issue persists. In contrast, if I use SMOTE it works fine on the same data. I have 31 GB of RAM and the data shape is (98000, 48), around 6.5 MB on disk. I am using Python 3.5 and imblearn version 0.4.2. Can somebody suggest a way to deal with this issue? Thanks.

@lisiqi

lisiqi commented Nov 13, 2018

@glemaitre Hi, is it possible to use SMOTENC with only categorical features, some of which have many category values?

@glemaitre
Member

glemaitre commented Nov 13, 2018 via email

@melgazar9

I want to request an extra feature for SMOTENC that I think will help me in my application. The normal SMOTE class has a parameter called ratio. Would you be able to add that to SMOTENC?

@glemaitre
Member

The ratio parameter has been deprecated in favour of sampling_strategy. Use sampling_strategy the same way that ratio worked: https://imbalanced-learn.readthedocs.io/en/latest/generated/imblearn.over_sampling.SMOTENC.html

@melgazar9

Oh I see, thanks. I am having a bit of trouble getting SMOTENC to fit on a pandas dataframe. My test example doesn't seem to work, but the code on the website does. I can't figure out what I'm doing wrong. Do you see anything wrong with this?

s1 = pd.Series([1,2,3,4,5,6])
s2 = pd.Series([1,2,2,9,3,5])
s3 = pd.Series([9,8,3,5,2,3])
s4 = pd.Series([0,1,1,0,1,0])
s5 = pd.Series([0,1,0,0,0,1])
df = pd.concat([s1,s2,s3,s4,s5], axis=1).rename(columns={0:'col1',1:'col2',2:'col3',3:'col4', 4:'col5'})

sm = SMOTENC(categorical_features=['col4', 'col5'])
X,y = sm.fit_resample(df[['col1','col2','col4']], df['col3'])

ValueError                                Traceback (most recent call last)
<ipython-input> in <module>
----> 1 sm = SMOTENC(categorical_features=['col4', 'col5']).fit_resample(df2[['col1','col2','col4']], df2['col3'])

~/anaconda3/envs/lgbm-gpu/lib/python3.6/site-packages/imblearn/base.py in fit_resample(self, X, y)
     83             self.sampling_strategy, y, self._sampling_type)
     84
---> 85         output = self._fit_resample(X, y)
     86
     87         if binarize_y:

~/anaconda3/envs/lgbm-gpu/lib/python3.6/site-packages/imblearn/over_sampling/_smote.py in _fit_resample(self, X, y)
    938     def _fit_resample(self, X, y):
    939         self.n_features_ = X.shape[1]
--> 940         self._validate_estimator()
    941
    942         # compute the median of the standard deviation of the minority class

~/anaconda3/envs/lgbm-gpu/lib/python3.6/site-packages/imblearn/over_sampling/_smote.py in _validate_estimator(self)
    931             raise ValueError(
    932                 'Some of the categorical indices are out of range. Indices'
--> 933                 ' should be between 0 and {}'.format(self.n_features_))
    934         self.categorical_features_ = categorical_features
    935         self.continuous_features_ = np.setdiff1d(np.arange(self.n_features_),

ValueError: Some of the categorical indices are out of range. Indices should be between 0 and 3

@glemaitre
Member

Please open a new issue instead of commenting on a closed issue.

@scikit-learn-contrib scikit-learn-contrib locked as resolved and limited conversation to collaborators Jan 21, 2019
@glemaitre
Member

You should pass the numerical indices, not the column names, as indicated in the documentation.
