
Inconsistent behavior of hierarchical indexes when indexes are of different data types #3521


Closed
kghose opened this issue May 3, 2013 · 11 comments


kghose commented May 3, 2013

import numpy as np
import pandas as pd

# int second level
col = pd.MultiIndex.from_tuples([('c1', 0), ('c1', 1), ('c2', 0)], names=['f', 's'])
dat = np.random.randn(2, len(col))
df1 = pd.DataFrame(dat, columns=col)

# str second level
col = pd.MultiIndex.from_tuples([('c3', 'x'), ('c3', 'y')])
dat = np.random.randn(2, len(col))
df2 = pd.DataFrame(dat, columns=col)

# This does not merge the column indexes (int vs str second level)
df_a = pd.concat([df1, df2])
# But this does
df_b = pd.concat([df1, df2], axis=1)

# Same frame as df1, but with the second level as str
col = pd.MultiIndex.from_tuples([('c1', '0'), ('c1', '1'), ('c2', '0')], names=['f', 's'])
dat = np.random.randn(2, len(col))
df4 = pd.DataFrame(dat, columns=col)
df_d = pd.concat([df4, df2], axis=0)
In [31]: df_a
Out[31]: 
   c1      c2        c3          
    0   1   0         x         y
0 NaN NaN NaN       NaN       NaN
1 NaN NaN NaN       NaN       NaN
0 NaN NaN NaN -0.694275 -1.357936
1 NaN NaN NaN -1.450523 -1.453957

This is unexpected behavior, especially since concatenating along axis=1 does not care about the mismatched level types:

In [32]: df_b
Out[32]: 
f        c1                  c2        c3          
s         0         1         0         x         y
0  0.381601 -0.730360  0.157936 -0.694275 -1.357936
1  0.344333 -1.308118  1.503335 -1.450523 -1.453957

This now works, because the second index level is the same type (str) in both frames:

In [33]: df_d
Out[33]: 
         c1                  c2        c3          
          0         1         0         x         y
0  0.162019 -0.325463 -0.200149       NaN       NaN
1 -0.142477 -0.089191 -0.439161       NaN       NaN
0       NaN       NaN       NaN -0.694275 -1.357936
1       NaN       NaN       NaN -1.450523 -1.453957
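
For anyone stuck on a released version, a sketch of a workaround (it just automates what df4 demonstrates: give both frames' second column level the same type before concatenating):

# cast the int second level of df1's columns to str, then concat
str_tuples = [(f, str(s)) for f, s in df1.columns]
df1_str = df1.copy()
df1_str.columns = pd.MultiIndex.from_tuples(str_tuples, names=df1.columns.names)

df_fixed = pd.concat([df1_str, df2], axis=0)  # aligns like df_d above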
cpcloud (Member) commented May 3, 2013

When you post code or output, please use GitHub-flavored markdown rather than images; it's slightly faster than copy-pasting, retyping, and editing after retyping, and much easier to read.

cpcloud (Member) commented May 3, 2013

@kghose What version are you using? I get a ValueError when I try to concat df1 and df2, and when I do the same with axis=1 I get what I think you expected:

[screenshot "pandas-error": session showing the ValueError raised by pd.concat([df1, df2]) and the expected axis=1 result]
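
If you want to check which behavior your installed version exhibits, here is a minimal sketch (only the two frames from the original report are assumed):

import numpy as np
import pandas as pd

print(pd.__version__)

# rebuild the two frames from the report: int vs str second level
col1 = pd.MultiIndex.from_tuples([('c1', 0), ('c1', 1), ('c2', 0)], names=['f', 's'])
col2 = pd.MultiIndex.from_tuples([('c3', 'x'), ('c3', 'y')])
df1 = pd.DataFrame(np.random.randn(2, len(col1)), columns=col1)
df2 = pd.DataFrame(np.random.randn(2, len(col2)), columns=col2)

# 0.11 silently produces the all-NaN frame shown above; git master raises
try:
    print(pd.concat([df1, df2]))
except ValueError as e:
    print('concat raised ValueError:', e)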

kghose (Author) commented May 3, 2013

Thanks for checking. I'm using 0.11.0; I had the same issue with 0.10.0.

cpcloud (Member) commented May 3, 2013

Okay, that's probably the issue. I'm using git master (which is version 0.12.0.dev-9b6b8fb), so if you can, you might want to upgrade. At the very least, you could clone the latest repo, diff pandas/tools/merge.py against your version, and grep for exceptions to see what changeset introduced this. You still have to upgrade, though, if you want the expected behavior.

kghose (Author) commented May 3, 2013

@cpcloud Thank you! That helps me out. I might stick with the released versions for a bit. How is your experience with master? How stable is it? I worry about bugs on the bleeding edge.

kghose (Author) commented May 3, 2013

Closing, as the issue seems to be resolved in the latest version (thanks, @cpcloud).

kghose closed this as completed May 3, 2013
cpcloud (Member) commented May 3, 2013

@kghose I'm not sure how to measure the stability of a code base. I find it pretty "stable", but I'm probably the worst person to ask about this: I work in neuroscience, where stability is usually not a concern because I can fix things myself most of the time. I also use the bleeding edge of every part of the Python science stack, and it has yet to fail me in a way that totally curbs my productivity. You should take what I say about this with a grain of salt, though, for the aforementioned reasons.

kghose (Author) commented May 3, 2013

Oh, I'm in the same business. But I worry about subtle bugs that get through and skew results; really, I should just write more thorough tests and run them periodically, as sketched below. (My usual practice is to test a method, then freeze it and forget about it. But newer versions of libraries may introduce subtle bugs, e.g. in NaN handling, that are nasty for data processing.)
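
Something like a small, periodically run regression check is what I have in mind; a minimal sketch (the specific NaN semantics pinned down here are just illustrative):

import numpy as np
import pandas as pd

def test_nan_handling_unchanged():
    # pin down the NaN semantics the analysis relies on, so a library
    # upgrade that silently changes them fails loudly
    s = pd.Series([1.0, np.nan, 3.0])
    assert s.sum() == 4.0              # NaN skipped by default
    assert s.mean() == 2.0             # mean over non-NaN values only
    assert int(s.isnull().sum()) == 1  # exactly one missing value detected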

cpcloud (Member) commented May 3, 2013

I know what you mean. For analyzing data I am crazy about assertions: I put them everywhere so that I can fail as soon as possible (see the sketch below). That helps cut down on writing tests for, say, a one-off plotting script specific to a particular paper or kind of analysis. I've found that pandas and the scientific Python community are very aware of this exact issue, and these 'subtle' bugs are few and far between. However, see #3513 for an example of what you're talking about.
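
For example, a minimal sketch of the assert-early style (the file path and column names are hypothetical, not from this thread):

import pandas as pd

def load_trials(path):
    # hypothetical loader for a one-off analysis script
    df = pd.read_csv(path)
    # fail as early as possible if upstream data drifts
    assert not df.empty, 'no trials loaded'
    assert df['reaction_time'].notnull().all(), 'unexpected NaN reaction times'
    assert (df['reaction_time'] > 0).all(), 'non-positive reaction time'
    return df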

cpcloud (Member) commented May 3, 2013

@kghose You also might be interested in this blog post.

kghose (Author) commented May 3, 2013

@cpcloud thanks for the links.
