DataFrame from hierarchical NumPy recarray with hierarchical MultiIndex results in all NaN values #13421
Comments
The issue is rather that pandas does not parse that hierarchical dtype as you expect:
Given the above result, the rest (empty frame when providing columns) is logical again. BTW, I closed the previous issue, but that does not mean it is prohibited to ask further questions on that topic over there :-) |
This sort of works with the only constructor that accepts rec-arrays.
|
this is essentially another case of #7893 |
I disagree that this is another case of #7893. As I tried to explain:
The current behavior is that an erroneous DataFrame is created, which does not contain data from the input array, but is also not empty. If Pandas recognizes that the column names match, it should use the input data; if it believes the names don't match then the result should be an empty DataFrame. The current behavior is half-and-half. |
and that's a bug, I agree. we just need another issue that covers the same material as another issue |
What I would like more than anything is to have a simple way to take a hierarchical recarray (as my example |
@jreback the result you show from
So in the last line, the |
assign the columns directly |
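One way to read the suggestion above is to pull each leaf field out of the nested array and then assign the column labels directly. The following is a minimal sketch, assuming a single level of nesting. `flatten_nested` is a hypothetical helper, not a pandas API, and the sample array is an assumption modeled on the description in this issue:

```python
import numpy as np
import pandas as pd

# Assumed example: top-level field "q" with subfields "x" and "y".
arr = np.zeros(3, dtype=[("q", [("x", np.float64), ("y", np.int64)])])


def flatten_nested(arr):
    """Hypothetical helper: one DataFrame column per leaf field."""
    keys, columns = [], []
    for top in arr.dtype.names:
        sub = arr[top]
        if sub.dtype.names is None:
            # Plain (non-nested) field: pad the second level with "".
            keys.append((top, ""))
            columns.append(sub)
        else:
            # Nested field: one column per subfield.
            for name in sub.dtype.names:
                keys.append((top, name))
                columns.append(sub[name])
    # Build with positional labels first, then assign the MultiIndex directly.
    df = pd.DataFrame({i: np.ascontiguousarray(c) for i, c in enumerate(columns)})
    df.columns = pd.MultiIndex.from_tuples(keys)
    return df


df = flatten_nested(arr)
print(df.dtypes)
```

Each column keeps the dtype of its subfield, and no reindexing against the requested columns takes place, so the zeros from the input survive.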
not even sure why you would work with rec arrays to begin with - they are not very friendly (not to mention they have an inefficient memory repr) |
@jreback I agree that rec arrays don't work very well, but I disagree that they are memory inefficient -- the data is all packed together in the dtype, so that seems perfectly reasonable to me. |
@jreback to use a non-hierarchical example, let's say I have received from another library a big list of tuples and I have a dtype list which corresponds to them, e.g.:
Now in NumPy I do this:
And I get something useful:
I can then construct a DataFrame from that array. Is there a better way to construct a DataFrame with explicit, heterogeneous column types? I don't want Pandas to guess the column types. |
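For illustration, the workflow described above might look like the following sketch. The tuples and field names (`id`, `price`, `qty`) are made up, since the original snippets were not preserved in this copy of the thread:

```python
import numpy as np
import pandas as pd

# Assumed sample data: a list of tuples plus a matching dtype list.
records = [(1, 2.5, 10), (2, 3.5, 20), (3, 4.5, 30)]
dtype = [("id", np.int16), ("price", np.float32), ("qty", np.uint8)]

# NumPy packs the tuples into a structured array with the explicit field types.
arr = np.array(records, dtype=dtype)

# pandas takes the column names and per-column dtypes from the structured array.
df = pd.DataFrame(arr)
print(df.dtypes)
```

The resulting columns keep the field types given in the dtype list instead of being widened to the default integer and float types.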
this is exactly what .from_records() does. they are memory inefficient as pandas has to convert them to a columnar layout
This doesn't work--the dtype cannot be specified:
It gives:
Only by using NumPy do I get what I want:
Note we now see int16 rather than int64. You have said that using recarray is memory-inefficient, but I am struggling because in my use case, not using recarray causes inefficiency in Pandas. Is there a way to construct a DataFrame with multiple columns of different types efficiently from a sequence of tuples? Obviously I don't have an efficient way to get one column at a time from the tuples, so I can't easily construct a bunch of Series etc. |
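The difference in inferred dtypes can be seen in a small sketch. The sample tuples and column names are assumptions. `astype` is shown only as an after-the-fact option, and it builds a converted copy rather than avoiding the wider intermediate frame:

```python
import pandas as pd

# Assumed sample data; the column names are illustrative only.
records = [(1, 2.5), (2, 3.5), (3, 4.5)]

# from_records infers the column types from the Python values.
inferred = pd.DataFrame.from_records(records, columns=["id", "price"])
print(inferred.dtypes)

# Narrowing afterwards works, but the wide frame already existed.
narrowed = inferred.astype({"id": "int16", "price": "float32"})
print(narrowed.dtypes)
```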
There is no way (as far as I know) to pass a compound dtype directly without making a numpy array first. |
@jzwinck This gives data in the form you want:
You need to make the numpy array with the proper dtype before passing it to |
@jorisvandenbossche and @shoyer Right, so what you and I are all saying is that constructing a NumPy recarray (structured array) is a prerequisite to constructing a Pandas DataFrame. Yet above I am being told that recarrays are bad and inefficient. So I don't really understand what to take away from all this. |
@jzwinck It is only a prerequisite when you want to specify a compound dtype. Otherwise, you can pass the list of tuples just to Further, you only have to worry about this if memory/performance of constructing your frame is a bottleneck. |
I filed #13415 in which it was said that `DataFrame(recarray, columns=MultiIndex)` does reindexing and so only selects matching columns to be in the resultant frame. I can see how this might be a backward compatibility constraint. However, I have discovered a similar but different case which still seems broken:

This creates a 3x2 array of zeros, but results in a 3x2 DataFrame of NaNs. Note that the column names basically match: the NumPy array has a top-level `q` with subitems `x` and `y`, and so does the `MultiIndex`. If the top-level name in the MultiIndex is changed to something other than `q` it results in an empty DataFrame, meaning that there is some recognized correspondence between the input data and the requested columns. But the data is lost nevertheless, putting NaNs where should be zeros.

Either the columns are considered non-matching, in which case the result should be an empty DataFrame, or they do match, in which case the result should be a DataFrame with contents from the input array.
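The snippet that built the array is missing from this copy of the issue. A plausible reconstruction of the NumPy side only, assuming one float and one integer subfield, shows the structure being described:

```python
import numpy as np

# Assumed reconstruction: three records, one top-level field "q"
# holding two subfields "x" and "y", all initialized to zero.
arr = np.zeros(3, dtype=[("q", [("x", np.float64), ("y", np.int64)])])

top_level = arr.dtype.names        # names visible at the outer level
sub_fields = arr["q"].dtype.names  # names nested under "q"

print(arr.shape, top_level, sub_fields)
```

Only `q` is visible at the outer level of the dtype, and `x` and `y` are reachable only through `arr["q"]`. In NumPy terms the array has shape `(3,)` with two leaf fields per record.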