
What should I do to handle categorical variables? #33


Closed
delilyn opened this issue Mar 5, 2016 · 18 comments

Comments

@delilyn

delilyn commented Mar 5, 2016

First, thanks for sharing these tools with us.
I want to generate synthetic samples with the SMOTE algorithm, but some of my features are categorical, like region, gender and so on. I want to know how to handle these categorical variables so that the generated samples have the same types. I can't find any explanation in the documentation.
Thank you!

@fmfn
Collaborator

fmfn commented Mar 7, 2016

It is assumed that you will first vectorize your categorical features with your preferred method.

SMOTE and its variations work by calculating distances between examples from the majority and minority classes. In order to calculate such distances, your data has to be formatted as one feature vector per entry. That means categorical features must first be encoded as numerical values (e.g. with one-hot encoding) before being passed to the object.

At the end of the day, the SMOTE method (and all methods in this package, for that matter) takes as input a design matrix with all entries being numbers, in addition to the respective labels.

Does that help?

@fmfn fmfn closed this as completed Mar 7, 2016
@jacobmontiel

@fmfn @glemaitre
Related discussion
Oversampling with categorical variables

Weka gets a data set with categorical "C" and numerical "N" features and returns an over-sampled data set keeping the same data types:

Input
schema: [C | N | N | C | N]
samples = n

Output
schema: [C | N | N | C | N]
samples = n + ratio*minority_class_samples

Reference code
Weka - SMOTE.java

@nchen9191

I encoded my categorical variables to integers using pandas' factorize method. But it seems SMOTE still treated these variables as continuous and thus created new data where the entries for these categorical variables look like 0.954 or 0.145. Is that supposed to happen? I read somewhere that it may be safe just to round these numbers back to integers, but that seems a little unsafe to me. Please advise.

Thanks!

@glemaitre
Member

@nchen9191 The SMOTE implementation is only for continuous data. We have not yet implemented SMOTE-NC, which should deal with categorical features.

@nishkalavallabhi

Is it still only for continuous data?

@glemaitre
Member

glemaitre commented Jul 10, 2017 via email

@dbarrundiag

@glemaitre Hi, I was just wondering if certain algorithms like RandomUnderSampler, which do not calculate distances between examples from the majority and minority classes, could more easily be extended to handle categorical variables? Thank you very much!

@glemaitre
Member

Yep, we need to accept strings as input. Right now check_X_y accepts only numeric values.
So those algorithms could override the function to handle such data.

@parulsahi

But doesn't the SMOTE algorithm use a majority rule to find the value of a categorical variable from the neighbors being considered?
I am using SMOTE but it is converting a nominal variable with categories (0 and 1) into continuous values between 0 and 1.
Is there a solution, maybe a modification to apply to the categorical variables before feeding them into the SMOTE function?
Thank you.

@glemaitre
Member

Use SMOTENC for a mix of categorical and continuous variables.

@atendra12

atendra12 commented Oct 22, 2018

Hi,

Thanks for this wonderful package for handling class imbalance. I am trying to use SMOTENC but am getting stuck on a "memory error" during the fit_resample method. I have already converted the dtypes and made them as small as possible, yet the issue persists. In contrast, if I use SMOTE it works fine on the same data. I have 31 GB of RAM and the data shape is (98000, 48), around 6.5 MB on disk. I am using Python 3.5 and imblearn version 0.4.2. Can somebody suggest a way to deal with this issue? Thanks.

@lisiqi

lisiqi commented Nov 13, 2018

@glemaitre Hi, is it possible to use SMOTENC with only categorical features, some of which have many category values?

@glemaitre
Member

glemaitre commented Nov 13, 2018 via email

@melgazar9

I want to request an extra feature for SMOTENC that I think will help me in my application. The normal SMOTE class has a parameter called ratio. Would you be able to add that to SMOTENC?

@glemaitre
Member

The ratio parameter has been deprecated in favour of sampling_strategy. Use sampling_strategy the same way that ratio worked: https://imbalanced-learn.readthedocs.io/en/latest/generated/imblearn.over_sampling.SMOTENC.html

@melgazar9

Oh I see, thanks. I am having a bit of trouble getting SMOTENC to fit on a pandas dataframe. My test example doesn't seem to work, but the code on the website does. I can't figure out what I'm doing wrong. Do you see anything wrong with this?

s1 = pd.Series([1,2,3,4,5,6])
s2 = pd.Series([1,2,2,9,3,5])
s3 = pd.Series([9,8,3,5,2,3])
s4 = pd.Series([0,1,1,0,1,0])
s5 = pd.Series([0,1,0,0,0,1])
df = pd.concat([s1,s2,s3,s4,s5], axis=1).rename(columns={0:'col1',1:'col2',2:'col3',3:'col4', 4:'col5'})

sm = SMOTENC(categorical_features=['col4', 'col5'])
X,y = sm.fit_resample(df[['col1','col2','col4']], df['col3'])

ValueError                                Traceback (most recent call last)
<ipython-input> in <module>
----> 1 sm = SMOTENC(categorical_features=['col4', 'col5']).fit_resample(df2[['col1','col2','col4']], df2['col3'])

~/anaconda3/envs/lgbm-gpu/lib/python3.6/site-packages/imblearn/base.py in fit_resample(self, X, y)
     83             self.sampling_strategy, y, self._sampling_type)
     84
---> 85         output = self._fit_resample(X, y)
     86
     87         if binarize_y:

~/anaconda3/envs/lgbm-gpu/lib/python3.6/site-packages/imblearn/over_sampling/_smote.py in _fit_resample(self, X, y)
    938     def _fit_resample(self, X, y):
    939         self.n_features_ = X.shape[1]
--> 940         self._validate_estimator()
    941
    942         # compute the median of the standard deviation of the minority class

~/anaconda3/envs/lgbm-gpu/lib/python3.6/site-packages/imblearn/over_sampling/_smote.py in _validate_estimator(self)
    931             raise ValueError(
    932                 'Some of the categorical indices are out of range. Indices'
--> 933                 ' should be between 0 and {}'.format(self.n_features_))
    934         self.categorical_features_ = categorical_features
    935         self.continuous_features_ = np.setdiff1d(np.arange(self.n_features_),

ValueError: Some of the categorical indices are out of range. Indices should be between 0 and 3

@glemaitre
Member

Please open a new issue instead of commenting on a closed issue.

@scikit-learn-contrib scikit-learn-contrib locked as resolved and limited conversation to collaborators Jan 21, 2019
@glemaitre
Member

You should pass the numerical indices, not the column names, as indicated in the documentation.
