DataFrame from hierarchical NumPy recarray with hierarchical MultiIndex results in all NaN values #13421

Closed
jzwinck opened this issue Jun 10, 2016 · 17 comments
Labels
Dtype Conversions Unexpected or buggy dtype conversions Duplicate Report Duplicate issue or pull request Reshaping Concat, Merge/Join, Stack/Unstack, Explode

Comments

@jzwinck
Contributor

jzwinck commented Jun 10, 2016

I filed #13415 in which it was said that DataFrame(recarray, columns=MultiIndex) does reindexing and so only selects matching columns to be in the resultant frame. I can see how this might be a backward compatibility constraint. However, I have discovered a similar but different case which still seems broken:

arr = np.zeros(3, [('q', [('x',float), ('y',int)])])
ind = pd.MultiIndex.from_tuples([('q','x'),('q','y')])
pd.DataFrame(arr, columns=ind)

This creates a 3x2 array of zeros, but results in a 3x2 DataFrame of NaNs. Note that the column names basically match: the NumPy array has a top-level q with subitems x and y, and so does the MultiIndex. If the top-level name in the MultiIndex is changed to something other than q, it results in an empty DataFrame, meaning that there is some recognized correspondence between the input data and the requested columns. But the data is lost nevertheless, putting NaNs where there should be zeros.

Either the columns are considered non-matching, in which case the result should be an empty DataFrame, or they do match, in which case the result should be a DataFrame with contents from the input array.
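For reference, a minimal workaround sketch (not part of the original report, and assuming the nested fields are flattened by hand): walking the compound dtype and extracting each (outer, inner) field yields the intended 3x2 frame, since a dict with tuple keys produces MultiIndex columns automatically.

```python
import numpy as np
import pandas as pd

# The hierarchical structured array from the report.
arr = np.zeros(3, [('q', [('x', float), ('y', int)])])

# Flatten the nested fields by hand: walk the two-level compound dtype
# and pull out each (outer, inner) field as its own column. A dict with
# tuple keys produces MultiIndex columns automatically. Deeper nesting
# would need a recursive walk.
cols = [(outer, inner)
        for outer in arr.dtype.names
        for inner in arr.dtype[outer].names]
df = pd.DataFrame({c: arr[c[0]][c[1]] for c in cols})
```

This also preserves the per-field dtypes (float64 for x, the platform integer for y), which a plain `np.column_stack` would not.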

@jzwinck jzwinck changed the title DataFrame from hierarchical NumPy recarray with hierarchical MultiIndex discards all data DataFrame from hierarchical NumPy recarray with hierarchical MultiIndex results in all NaN values Jun 10, 2016
@jorisvandenbossche
Member

jorisvandenbossche commented Jun 10, 2016

The issue is rather that pandas does not parse that hierarchical dtype as you expect:

In [74]: arr = np.zeros(3, [('q', [('x',float), ('y',int)])])

In [76]: pd.DataFrame(arr)
Out[76]:
          q
0  (0.0, 0)
1  (0.0, 0)
2  (0.0, 0)

Given the above result, the rest (the empty frame when providing columns) is logical again.
However, I am not sure what the correct way to convert such a recarray should be. The above also seems to make sense: since the records of the recarray consist of tuples, the resulting dataframe contains tuples as well.
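For this two-level case, one way to get separate columns (a sketch, not an endorsed conversion path) is to index into the nested field first: `arr['q']` is a flat structured array, and pandas does expand a flat compound dtype into one column per field.

```python
import numpy as np
import pandas as pd

arr = np.zeros(3, [('q', [('x', float), ('y', int)])])

# arr['q'] is a flat structured array with fields 'x' and 'y';
# pandas expands a flat compound dtype into one column per field.
df = pd.DataFrame(arr['q'])
```

The top-level name 'q' is lost this way; attaching it back as a MultiIndex level would have to be done manually.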

BTW, I closed the previous issue, but that does not mean it is prohibited to ask further questions on that topic over there :-)

@jreback
Contributor

jreback commented Jun 10, 2016

This sort of works with the only constructor that accepts rec-arrays.

In [4]: pd.DataFrame.from_records(arr, columns=ind)
Out[4]: 
          q
0  (0.0, 0)
1  (0.0, 0)
2  (0.0, 0)

@jreback
Contributor

jreback commented Jun 10, 2016

this is essentially another case of #7893

@jreback jreback closed this as completed Jun 10, 2016
@jreback jreback added Reshaping Concat, Merge/Join, Stack/Unstack, Explode Dtype Conversions Unexpected or buggy dtype conversions Duplicate Report Duplicate issue or pull request labels Jun 10, 2016
@jreback jreback added this to the No action milestone Jun 10, 2016
@jzwinck
Contributor Author

jzwinck commented Jun 10, 2016

I disagree that this is another case of #7893. As I tried to explain:

Either the columns are considered non-matching, in which case the result should be an empty DataFrame, or they do match, in which case the result should be a DataFrame with contents from the input array.

The current behavior is that an erroneous DataFrame is created, which does not contain data from the input array, but is also not empty. If Pandas recognizes that the column names match, it should use the input data; if it believes the names don't match then the result should be an empty DataFrame. The current behavior is half-and-half.

@jreback
Contributor

jreback commented Jun 10, 2016

and that's a bug I agree

we just don't need another issue that covers the same material as an existing one;
it will just get even more lost. If you would like to address that issue, you can include this as a test case.

@jzwinck
Contributor Author

jzwinck commented Jun 10, 2016

What I would like more than anything is to have a simple way to take a hierarchical recarray (as my example arr) and get it into a DataFrame with MultiIndex. I think you see what I am trying to do--can you offer a workaround?

@jorisvandenbossche
Member

@jreback the result you show from from_records is exactly the same as from DataFrame():

In [1]: arr = np.zeros(3, [('q', [('x',float), ('y',int)])])

In [2]: pd.DataFrame(arr)
Out[2]:
          q
0  (0.0, 0)
1  (0.0, 0)
2  (0.0, 0)

In [3]: pd.DataFrame.from_records(arr)
Out[3]:
          q
0  (0.0, 0)
1  (0.0, 0)
2  (0.0, 0)

In [4]: ind = pd.MultiIndex.from_tuples([('q','x'),('q','y')])

In [5]: pd.DataFrame.from_records(arr, columns=ind)
Out[5]:
          q
0  (0.0, 0)
1  (0.0, 0)
2  (0.0, 0)

So in the last line, columns=ind is actually ignored, which rather looks like a bug.

@jreback
Contributor

jreback commented Jun 10, 2016

assign the columns directly
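A sketch of what "assign the columns directly" could look like for a flat structured array (an assumption on my part; the nested dtype from the report still parses to tuples first, so it would need to be flattened before this applies):

```python
import numpy as np
import pandas as pd

# A flat structured array (assumption: nested fields already flattened).
arr = np.zeros(3, [('x', float), ('y', int)])

# Build the frame first, then overwrite the columns with the MultiIndex.
# This sidesteps the ignored/NaN behavior of passing columns= up front.
df = pd.DataFrame.from_records(arr)
df.columns = pd.MultiIndex.from_tuples([('q', 'x'), ('q', 'y')])
```

The per-field dtypes survive the round trip, since from_records copies each field as its own column.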

@jreback
Contributor

jreback commented Jun 10, 2016

not even sure why you would work with rec-arrays to begin with - they are not very friendly (not to mention they have an inefficient memory representation)

@shoyer
Member

shoyer commented Jun 10, 2016

@jreback I agree that rec arrays don't work very well, but I disagree that they are memory inefficient -- the data is all packed together in the dtype, so that seems perfectly reasonable to me.

@jzwinck
Contributor Author

jzwinck commented Jun 10, 2016

@jreback to use a non-hierarchical example, let's say I have received from another library a big list of tuples and I have a dtype list which corresponds to them, e.g.:

data = [(1.2, 'foo'), (3.4, 'bar')] # in reality wider and very long, comes from another library
dtype = [('value', float), ('name', 'S3')]

Now in NumPy I do this:

np.array(data, dtype)

And I get something useful:

array([(1.2, 'foo'), (3.4, 'bar')], 
    dtype=[('value', '<f8'), ('name', 'S3')])

I can then construct a DataFrame from that array. Is there a better way to construct a DataFrame with explicit, heterogeneous column types? I don't want Pandas to guess the column types.

@jreback
Contributor

jreback commented Jun 10, 2016

this is exactly what .from_records() does;
simply assign the columns afterward if they are MultiIndexes (that this is necessary is a bug)

they are memory inefficient as pandas has to convert them to a columnar layout

@jzwinck
Contributor Author

jzwinck commented Jun 10, 2016

This doesn't work--the dtype cannot be specified:

data = [(1.2, 5), (3.4, 6)]
dtype = [('value', float), ('name', 'i2')]
pd.DataFrame.from_records(data)._data

It gives:

Axis 1: RangeIndex(start=0, stop=2, step=1)
FloatBlock: slice(0, 1, 1), 1 x 2, dtype: float64
IntBlock: slice(1, 2, 1), 1 x 2, dtype: int64

Only by using NumPy do I get what I want:

pd.DataFrame.from_records(np.array(data, dtype))._data

Axis 1: RangeIndex(start=0, stop=2, step=1)
FloatBlock: slice(0, 1, 1), 1 x 2, dtype: float64
IntBlock: slice(1, 2, 1), 1 x 2, dtype: int16

Note we now see int16 rather than int64. You have said that using recarray is memory-inefficient, but I am struggling because in my use case, not using recarray causes inefficiency in Pandas.

Is there a way to construct a DataFrame with multiple columns of different types efficiently from a sequence of tuples? Obviously I don't have an efficient way to get one column at a time from the tuples, so I can't easily construct a bunch of Series etc.

@jorisvandenbossche
Member

jorisvandenbossche commented Jun 10, 2016

That pd.DataFrame.from_records(data, dtype) does not give the desired result is expected, as the second positional argument of from_records is index (so you are passing the dtype list as the index values).

There is no way (as far as I know) to pass directly a compound dtype without making a numpy array first.
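An alternative sketch that avoids the structured array, assuming a post-hoc cast is acceptable. Note this does not save the intermediate conversion: the integer column is still inferred as int64 first and then downcast.

```python
import pandas as pd

data = [(1.2, 5), (3.4, 6)]

# Let pandas infer the dtypes from the tuples, then cast individual
# columns afterwards with a dtype mapping.
df = (pd.DataFrame(data, columns=['value', 'name'])
        .astype({'name': 'int16'}))
```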

@shoyer
Member

shoyer commented Jun 10, 2016

@jzwinck This gives data in the form you want:

dtype = [('value', float), ('name', 'i2')]
data = np.array([(1.2, 5), (3.4, 6)], dtype)
pd.DataFrame.from_records(data).dtypes

You need to make the numpy array with the proper dtype before passing it to from_records

@jzwinck
Contributor Author

jzwinck commented Jun 10, 2016

@jorisvandenbossche and @shoyer Right, so what you and I are all saying is that constructing a NumPy recarray (structured array) is a prerequisite to constructing a Pandas DataFrame. Yet above I am being told that recarrays are bad and inefficient. So I don't really understand what to take away from all this.

@jorisvandenbossche
Member

jorisvandenbossche commented Jun 10, 2016

@jzwinck It is only a prerequisite when you want to specify a compound dtype. Otherwise, you can pass the list of tuples just to DataFrame() and it will work without making a recarray first.

Further, you only have to worry about this if memory/performance of constructing your frame is a bottleneck.
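A sketch of that direct path, with the dtypes left to pandas' inference:

```python
import pandas as pd

data = [(1.2, 'foo'), (3.4, 'bar')]

# No recarray needed when inferred dtypes are acceptable:
# 'value' is inferred as float64 and 'name' as object.
df = pd.DataFrame(data, columns=['value', 'name'])
```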


4 participants