Using boolean Series to mask array broken in 0.13 #5776

Closed
mwaskom opened this issue Dec 26, 2013 · 7 comments
Labels
API Design Indexing Related to indexing on series/frames, not to indexes themselves

Comments

@mwaskom
Contributor

mwaskom commented Dec 26, 2013

Hi, I'm testing out some code on the 0.13 release candidate and I've run into problems with a fairly common (for me) pattern. It's no longer possible to use a boolean Series to index a numpy array. E.g.:

import numpy as np
import pandas as pd
x = np.random.randn(30)
mask = pd.Series(np.random.rand(30) > .5)
x[mask].mean()

Raises:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-27-73e40e756b4e> in <module>()
      3 x = np.random.randn(30)
      4 mask = pd.Series(np.random.rand(30) > .5)
----> 5 x[mask].mean()

IndexError: unsupported iterator index

Perhaps this is not a good idiom, but this change breaks quite a bit of existing code.

@jreback
Contributor

jreback commented Dec 26, 2013

In [8]: np.random.seed(1234)

In [9]: x = np.random.randn(30)

In [10]: mask = pd.Series(np.random.rand(30) > .5)

In [11]: x
Out[11]: 
array([  4.71435164e-01,  -1.19097569e+00,   1.43270697e+00,
        -3.12651896e-01,  -7.20588733e-01,   8.87162940e-01,
         8.59588414e-01,  -6.36523504e-01,   1.56963721e-02,
        -2.24268495e+00,   1.15003572e+00,   9.91946022e-01,
         9.53324128e-01,  -2.02125482e+00,  -3.34077366e-01,
         2.11836468e-03,   4.05453412e-01,   2.89091941e-01,
         1.32115819e+00,  -1.54690555e+00,  -2.02646325e-01,
        -6.55969344e-01,   1.93421376e-01,   5.53438911e-01,
         1.31815155e+00,  -4.69305285e-01,   6.75554085e-01,
        -1.81702723e+00,  -1.83108540e-01,   1.05896919e+00])

In [12]: mask
Out[12]: 
0      True
1     False
2      True
3      True
4     False
5      True
6     False
7      True
8     False
9     False
10    False
11     True
12     True
13     True
14    False
15     True
16    False
17     True
18    False
19     True
20     True
21    False
22     True
23     True
24     True
25     True
26     True
27     True
28    False
29     True
dtype: bool

Here are 2 workarounds

In [13]: pd.Series(x)[mask].mean()
Out[13]: 0.095842422790904033

In [14]: x[mask.values].mean()
Out[14]: 0.095842422790904033
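As a self-contained sketch of the two workarounds above (plus `np.asarray`, which is an addition here and not something suggested in the thread), the following shows that all three routes select the same elements. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Reproducible data, mirroring the session above.
rng = np.random.RandomState(1234)
x = rng.randn(30)
mask = pd.Series(rng.rand(30) > 0.5)

# Workaround 1: index the ndarray with the Series' underlying boolean array.
mean_from_values = x[mask.values].mean()

# Workaround 2: wrap the ndarray in a Series and index with the boolean Series.
mean_from_series = pd.Series(x)[mask].mean()

# Assumed alternative: convert the mask explicitly with np.asarray.
mean_from_asarray = x[np.asarray(mask)].mean()

print(mean_from_values, mean_from_series, mean_from_asarray)
```

Converting the mask explicitly keeps the code independent of how a given numpy version treats non-ndarray index objects.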

This doesn't work because of how numpy treats 'foreign' arrays: it basically calls getitem on each element, which (a) is quite slow and (b) might or might not work depending on exactly which values are True.

This is actually a pretty odd thing to do; why is x not simply a Series as well? (I know it works in 0.12, but that is because Series was a direct subclass of ndarray there, so numpy treated it differently.)
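A quick way to see the subclass point, assuming pandas 0.13 or later: a Series is no longer an ndarray instance, while its `.values` attribute is.

```python
import numpy as np
import pandas as pd

mask = pd.Series([True, False, True])

# From pandas 0.13 onward, Series does not inherit from ndarray.
series_is_ndarray = isinstance(mask, np.ndarray)

# The underlying data is still a plain ndarray.
values_is_ndarray = isinstance(mask.values, np.ndarray)

print(series_is_ndarray, values_is_ndarray)
```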

@mwaskom
Contributor Author

mwaskom commented Dec 27, 2013

Because there are lots of ways of initializing some new data (np.zeros, random sampling, etc.) that don't return Series objects, and sometimes it's easier to start from there and then transform the values conditional on data that is in a DataFrame before adding them to it.
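A minimal sketch of the pattern described here, with made-up column names and data: start from a plain array, fill it conditionally on a DataFrame column, then attach it to the frame. It uses the `.values` workaround so it does not rely on numpy accepting a Series as an index.

```python
import numpy as np
import pandas as pd

# Hypothetical frame; the column names are illustrative only.
df = pd.DataFrame({"score": [1.0, -2.0, 3.0, -4.0]})

# Start from a plain ndarray rather than a Series.
new_values = np.zeros(len(df))

# Boolean Series computed from the DataFrame.
mask = df["score"] > 0

# Use the underlying boolean array to index the ndarray.
new_values[mask.values] = df.loc[mask, "score"].values * 10

# Add the finished array to the frame as a new column.
df["scaled"] = new_values
print(df)
```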

I realize the workaround is simple, I'm just annoyed because it's going to have to be applied on an ad hoc basis each time I run into this pattern and have broken code. But I understand if it's an unavoidable/a problem on the numpy side.

@jreback
Contributor

jreback commented Dec 27, 2013

I always use Series/DataFrames, as it makes things simpler, IMHO.

Sometimes it's tricky to know how numpy treats foreign arrays, as much of its access is C code, so it's not so easy to step through.

going to take a look at this some more as I think it should work (could be a bug on numpy side or possibly need some access method on a series)

@jtratner
Contributor

I think the full set we need to implement are:

* __array_interface__
* __array_struct__
* __array__

And I think most can just be delegated
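To illustrate what delegating could look like, here is a sketch of a hypothetical wrapper class (not pandas code) that implements only `__array__` and forwards to the ndarray it holds. The `copy` parameter is accepted for compatibility with newer numpy versions and is ignored in this sketch.

```python
import numpy as np


class MaskWrapper:
    """Hypothetical container that is not an ndarray subclass."""

    def __init__(self, data):
        self._data = np.asarray(data)

    def __array__(self, dtype=None, copy=None):
        # Delegate to the wrapped ndarray; honour an explicit dtype request.
        if dtype is not None:
            return self._data.astype(dtype)
        return self._data


wrapped = MaskWrapper([True, False, True])

# np.asarray finds __array__ and returns the wrapped boolean array.
as_array = np.asarray(wrapped)

x = np.array([10.0, 20.0, 30.0])
selected = x[as_array]
print(selected)
```

Whether indexing with the wrapper directly (`x[wrapped]`) works depends on the numpy version, which is the problem this thread is about; converting first with `np.asarray` avoids the question.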

@jreback
Contributor

jreback commented Dec 27, 2013

These are just tried in turn, and they are not what this particular issue is about.
You only need to define one.

@jreback
Contributor

jreback commented Dec 27, 2013

So numpy does PyArray_Check, which ultimately calls PyObject_TypeCheck, a Python C-API call, on the object to determine whether it is a subclass of the passed type (e.g. ndarray). I don't think this can be intercepted, as it pretty much ignores any attempt to override with __instancecheck__ and __subclasscheck__ via a Series metaclass (though that does intercept other base types, so I know it 'works' in theory). It must be directly checking a variable defined in the C-API.
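The difference between the Python-level `isinstance` hook and a real subtype check can be sketched in pure Python. The classes below are hypothetical and only illustrate the mechanism: a metaclass `__instancecheck__` changes what `isinstance` reports, but the object's actual type hierarchy (which is what a C-level type check inspects) is unchanged.

```python
class DuckArrayMeta(type):
    # Python-level isinstance() consults this hook on the class being tested against.
    def __instancecheck__(cls, obj):
        return hasattr(obj, "__array__")


class DuckArray(metaclass=DuckArrayMeta):
    """Hypothetical marker class; nothing actually inherits from it."""


class Wrapper:
    """Hypothetical array-like object that is unrelated to DuckArray."""

    def __array__(self, dtype=None, copy=None):
        raise NotImplementedError("not needed for this illustration")


w = Wrapper()

# The hook makes isinstance() report True.
hook_says = isinstance(w, DuckArray)

# The real inheritance chain does not contain DuckArray.
real_subtype = DuckArray in type(w).__mro__

print(hook_says, real_subtype)
```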

I would actually say this is an interface issue on the numpy side. It should just see if it's duck typed (after doing the current checks), because Series certainly emulates all aspects of the array (e.g. it should just check if __array__ is available). Not sure why it does not.
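The duck-typed check suggested here can be approximated on the caller's side. The helper below is a hypothetical sketch (`as_index_array` is not a numpy or pandas function): it keeps real ndarrays as they are and converts anything exposing `__array__` before indexing.

```python
import numpy as np
import pandas as pd


def as_index_array(obj):
    """Hypothetical helper: return an ndarray for array-like index objects."""
    # Current behaviour: real ndarrays pass straight through.
    if isinstance(obj, np.ndarray):
        return obj
    # Duck-typed fallback: anything exposing __array__ is converted.
    if hasattr(obj, "__array__"):
        return np.asarray(obj)
    # Everything else (ints, slices, lists) is left for numpy to handle.
    return obj


x = np.arange(5.0)
mask = pd.Series([True, False, True, False, True])

picked = x[as_index_array(mask)]
print(picked)
```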

@mwaskom
Contributor Author

mwaskom commented Dec 27, 2013

OK, sounds very reasonable. Thanks for looking into it, and feel free to close.

3 participants