Dataframe Where Dataframe == False Returns Dataframe with Floats #10336

Closed
jaradc opened this issue Jun 12, 2015 · 6 comments · Fixed by #41389
Labels: good first issue, Needs Tests (unit test(s) needed to prevent regressions)

Comments

@jaradc

jaradc commented Jun 12, 2015

If I have a Dataframe with True/False values only like this:

df_mask = pd.DataFrame({'AAA': [True] * 4,
                        'BBB': [False]*4,
                        'CCC': [True, False, True, False]}); print(df_mask)
    AAA    BBB    CCC
0  True  False   True
1  True  False  False
2  True  False   True
3  True  False  False

Then I try to print the values in the dataframe that are equal to False, like so:

print(df_mask[df_mask == False])
print(df_mask.where(df_mask == False))

My question is about column CCC. Column BBB shows False (as I expect), but why are indexes 1 and 3 in column CCC equal to 0 instead of False?

   AAA    BBB  CCC
0  NaN  False  NaN
1  NaN  False    0
2  NaN  False  NaN
3  NaN  False    0
   AAA    BBB  CCC
0  NaN  False  NaN
1  NaN  False    0
2  NaN  False  NaN
3  NaN  False    0

Why doesn't it return a dataframe that looks like this?

   AAA    BBB   CCC
0  NaN  False   NaN
1  NaN  False False
2  NaN  False   NaN
3  NaN  False False

I was asked to post this question here. I originally posted this on StackOverflow here.

@jreback
Contributor

jreback commented Jun 12, 2015

When you insert NaN into a column, the column is checked for compatibility and then changed to an appropriate dtype if necessary. bool can be upcast to int, which can be upcast to float, and float supports NaN. You are suggesting that bool results go one step further, to object. I suppose this is possible, maybe even desirable. However, this would have to be checked for in the BoolBlock in internals. If you'd like to take a crack at this, that would be great.
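
The promotion chain described in this comment can be checked directly in NumPy. This is a minimal sketch, assuming only that NumPy is installed; the variable names are illustrative, and it does not reproduce the pandas internals code path.

```python
import numpy as np

# bool promotes to float64 when it has to share a dtype with a float such as NaN.
promoted = np.promote_types(np.bool_, np.float64)

# Mixing False with NaN in one array therefore turns False into 0.0.
as_float = np.array([False, np.nan])

# An object array keeps the original Python objects, so False stays False.
as_object = np.array([False, np.nan], dtype=object)

print(promoted)  # float64
print(as_float.dtype, as_float)
print(as_object.dtype, as_object)
```

The object array is what the issue asks for: it costs the fast numeric representation but keeps False distinguishable from 0.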

jreback added the Dtype Conversions (unexpected or buggy dtype conversions), Indexing (related to indexing on series/frames, not to indexes themselves), and Enhancement labels on Jun 12, 2015
jreback added this to the Next Major Release milestone on Jun 12, 2015
@jaradc
Author

jaradc commented Jun 12, 2015

I'm sorry but I'm just a lowly, humble pandas newbie and didn't even know about the hierarchy of classes you just mentioned. This question came about more from mere curiosity while I was following the pandas cookbook tutorials and learning about masks. The tutorials set values to integers so I was curious what bools would do and got a result I didn't expect (mixture of bool, NaN, and float).

@jreback
Contributor

jreback commented Jun 12, 2015

@jaradc no problem. If you do want to look, you could trace the call and see what happens. I just set up a test and step through until I see something interesting (e.g. this is called) that needs a definition in the BoolBlock to handle a bit more.

@kawochen
Contributor

The problem is actually with numpy. This line calls the numpy function.

>>> np.where(np.array([False,True,False,True]), np.array([True, False, True, False]), np.nan)
array([ nan,   0.,  nan,   0.])
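
For comparison, the same `np.where` call keeps `False` when the boolean values are first cast to object. This is a sketch for illustration only, using the arrays from the snippet above; the variable names are assumptions, and it is not what pandas does internally.

```python
import numpy as np

cond = np.array([False, True, False, True])
values = np.array([True, False, True, False])

# bool values combined with NaN are promoted to float64, so False becomes 0.0.
float_result = np.where(cond, values, np.nan)

# Casting to object first keeps the Python bools next to NaN.
object_result = np.where(cond, values.astype(object), np.nan)

print(float_result.dtype, float_result)
print(object_result.dtype, object_result)
```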

@jreback
Contributor

jreback commented Jun 14, 2015

@kawochen that line is ok; it returns the correct result. The result then needs to be coerced, as a BoolBlock cannot hold NA. It is then inferred to be a FloatBlock rather than an ObjectBlock. (Normally you DO want to go to a FloatBlock, but if it's coming from a BoolBlock and now has NA, then it's reasonable to change the type that it is coerced to.)
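
On a pandas version that still coerces to float, one possible user-side workaround is to cast the frame to object before masking, so the block that receives NaN is already an object block. This is a sketch under that assumption, not the fix proposed for the internals; `masked` is an illustrative name.

```python
import pandas as pd

df_mask = pd.DataFrame({'AAA': [True] * 4,
                        'BBB': [False] * 4,
                        'CCC': [True, False, True, False]})

# Cast to object first so masked-out cells become NaN without turning
# the remaining False values into 0.0.
masked = df_mask.astype(object).where(df_mask == False)

print(masked)
print(masked.dtypes)
```

The trade-off is that every column becomes object dtype, including BBB, which never needed to hold NaN.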

@mroeschke
Member

This looks to work on master now. I suppose it could use a test.

In [19]: df_mask = pd.DataFrame({'AAA': [True] * 4,
    ...:                         'BBB': [False]*4,
    ...:                         'CCC': [True, False, True, False]}); print(df_mask)
    AAA    BBB    CCC
0  True  False   True
1  True  False  False
2  True  False   True
3  True  False  False

In [20]: df_mask.where(df_mask == False)
Out[20]:
   AAA    BBB    CCC
0  NaN  False    NaN
1  NaN  False  False
2  NaN  False    NaN
3  NaN  False  False

In [21]: df_mask[df_mask == False]
Out[21]:
   AAA    BBB    CCC
0  NaN  False    NaN
1  NaN  False  False
2  NaN  False    NaN
3  NaN  False  False

In [22]: pd.__version__
Out[22]: '1.3.0.dev0+1351.g04f9a4b10d.dirty'
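
A regression test for this behaviour could look roughly like the sketch below. It assumes a pandas version with the behaviour shown above (1.3 or later); the function name is hypothetical, and this is not necessarily the test that was merged.

```python
import pandas as pd


def check_where_bool_keeps_false():
    # Hypothetical name; written pytest-style but callable directly.
    df_mask = pd.DataFrame({'AAA': [True] * 4,
                            'BBB': [False] * 4,
                            'CCC': [True, False, True, False]})

    result = df_mask.where(df_mask == False)  # noqa: E712

    # CCC must hold NaN and False side by side, not NaN and 0.0.
    assert result['CCC'].dtype == object
    assert result['CCC'].isna().tolist() == [True, False, True, False]
    assert result.loc[1, 'CCC'] == False  # noqa: E712
    assert not isinstance(result.loc[1, 'CCC'], float)

    # Boolean-frame indexing should give the same result as .where().
    pd.testing.assert_frame_equal(df_mask[df_mask == False], result)  # noqa: E712
    return result
```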

mroeschke added the good first issue and Needs Tests labels and removed the Dtype Conversions, Enhancement, and Indexing labels on Apr 18, 2021
simonjayhawkins modified the milestones: Contributions Welcome, 1.3 on May 9, 2021