What should I do to handle categorical variables? #33
Comments
It is assumed that you will first vectorize your categorical features with your preferred method. SMOTE and its variations work by calculating distances between examples from the majority and minority classes. To calculate such distances, your data has to be formatted as one feature vector per entry. That means categorical features must first be encoded as numerical values (e.g. with one-hot encoding) before being passed to the object. At the end of the day, the SMOTE method (and every method in this package, for that matter) takes as input a design matrix whose entries are all numbers, plus the respective labels. Does that help? |
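A minimal sketch of the encoding step described above, assuming pandas is available. The column names and values are made up for illustration; `pd.get_dummies` is just one possible encoder.

```python
import pandas as pd

# Toy data; the column names and values are hypothetical.
df = pd.DataFrame({
    "age": [23, 45, 31, 52],
    "region": ["north", "south", "north", "east"],
    "gender": ["f", "m", "m", "f"],
})

# One-hot encode the categorical columns; numeric columns pass through unchanged.
X = pd.get_dummies(df, columns=["region", "gender"], dtype=float)
print(X.columns.tolist())
```

The resulting `X` is fully numeric, which is the kind of design matrix the comment above says the samplers expect.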
@fmfn @glemaitre Weka gets a data set with categorical "C" and numerical "N" features and returns an over-sampled data set keeping the same data types: Input Output Reference code |
I encoded my categorical variables as integers using pandas' factorize method. But it seems that SMOTE still treated these variables as continuous and created new data where the entries for these categorical variables look like 0.954 or 0.145. Is that supposed to happen? I read somewhere that it may be fine to just round these numbers back to integers, but that seems a little unsafe to me. Please advise. Thanks! |
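To see why fractional values appear, here is a simplified sketch of the SMOTE interpolation step, x_new = x + lam * (x_neighbor - x) with lam drawn from [0, 1). It is not imbalanced-learn's code; the helper name `smote_interpolate` and the sample values are hypothetical.

```python
import random

def smote_interpolate(x, neighbor, rng):
    """Return a point on the segment between x and neighbor."""
    lam = rng.random()  # interpolation weight in [0, 1)
    return [a + lam * (b - a) for a, b in zip(x, neighbor)]

rng = random.Random(0)
# The second feature is an integer category code (e.g. from pandas.factorize).
sample = [1.5, 0]
neighbor = [2.5, 1]
synthetic = smote_interpolate(sample, neighbor, rng)
print(synthetic)
```

Whenever the two neighbours carry different codes, the interpolated code lands strictly between them, which is how non-integer values end up in a column that was meant to be categorical.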
@nchen9191 The SMOTE implementation is only for continuous data. We have not yet implemented SMOTE-NC, which should deal with categorical features. |
Is it still only for continuous data? |
Yep, we still have not implemented categorical methods. PRs are welcome.
|
@glemaitre Hi, I was just wondering whether algorithms like RandomUnderSampler, which do not calculate distances between examples from the majority and minority classes, could more easily be extended to handle categorical variables? Thank you very much! |
Yep, we need to accept string as input. Right now |
But doesn't the SMOTE algo use majority rule to find the value of categorical variable from the neighbors being considered? |
Use SMOTENC for a mix of categorical and continuous variables. |
Hi, thanks for this wonderful package for handling class imbalance. I am trying to use SMOTENC but I get a "memory error" in the "fit_resample" method. I have already converted the dtypes and made them as small as possible, but the issue persists. By contrast, SMOTE works fine on the same data. I have 31 GB of RAM; the data shape is (98000, 48), around 6.5 MB on disk. I am using Python 3.5 and imblearn version '0.4.2'. Can somebody suggest a workaround for this issue? Thanks. |
@glemaitre Hi, is it possible to use SMOTENC for only categorical features, within which there are many categorical values? |
Nope. SMOTE-NC is for a mix of categorical and numerical features. I think
handling solely categorical features should be another SMOTE variant.
|
I want to request an extra feature for SMOTENC that I think would help in my application. The normal SMOTE class has a parameter called ratio. Would you be able to add that to SMOTENC? |
The ratio parameter has been deprecated in favour of sampling_strategy. Use sampling_strategy the same way that ratio was working: https://imbalanced-learn.readthedocs.io/en/latest/generated/imblearn.over_sampling.SMOTENC.html |
Oh I see - thanks. I am having a bit of trouble getting SMOTENC to fit on a pandas dataframe. A test example won't seem to work but the code on the website works. I can't seem to figure out what I'm doing wrong. Do you see anything wrong with this?
s1 = pd.Series([1,2,3,4,5,6])
sm = SMOTENC(categorical_features=['col4', 'col5'])
ValueError Traceback (most recent call last)
~/anaconda3/envs/lgbm-gpu/lib/python3.6/site-packages/imblearn/base.py in fit_resample(self, X, y)
~/anaconda3/envs/lgbm-gpu/lib/python3.6/site-packages/imblearn/over_sampling/_smote.py in _fit_resample(self, X, y)
~/anaconda3/envs/lgbm-gpu/lib/python3.6/site-packages/imblearn/over_sampling/smote.py in validate_estimator(self)
ValueError: Some of the categorical indices are out of range. Indices should be between 0 and 3 |
Please open a new issue instead of commenting on a closed issue. |
As indicated in the documentation, you should pass the numerical column indices, not the column names. |
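One way to get those indices from a DataFrame is sketched below, assuming pandas. The frame and its column names are hypothetical and are not the reporter's data.

```python
import pandas as pd

# Hypothetical frame; 'col3' and 'col4' stand in for the categorical columns.
df = pd.DataFrame({
    "col1": [0.1, 0.2, 0.3],
    "col2": [1.0, 2.0, 3.0],
    "col3": ["a", "b", "a"],
    "col4": ["x", "x", "y"],
})
categorical_names = ["col3", "col4"]

# Translate column names into positional indices.
categorical_idx = [df.columns.get_loc(name) for name in categorical_names]
print(categorical_idx)
```

The resulting list can then be passed as `categorical_features=categorical_idx` when constructing SMOTENC.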
First, thanks for sharing these tools with us.
I want to generate synthetic samples with the SMOTE algorithm, but some of my features are categorical, like region, gender, and so on. I want to know how to handle these categorical variables so that the generated samples keep the same types. I can't find any explanation in the documentation.
Thank you!