Skip to content

Incorrect sequence handling with MultiIndex.from_tuples #14794

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
groutr opened this issue Dec 3, 2016 · 3 comments
Open

Incorrect sequence handling with MultiIndex.from_tuples #14794

groutr opened this issue Dec 3, 2016 · 3 comments

Comments

@groutr
Copy link
Contributor

groutr commented Dec 3, 2016

T = tuple(tuple(range(k)) for k in range(1, 8))
#Note: min(map(len, T)) == 1 and max(map(len, T)) == 7
bad = pd.MultiIndex.from_tuples(T)
bad.nlevels  # == 1 ???

good = pd.MultiIndex.from_tuples(list(T))
good.nlevels # == 7

Problem description

The docstring leaves some ambiguity as to whether a tuple or list can be passed in. It does say a list of tuples at the top, but in the argument specification, it hints that tuples can be a list/sequence of tuple-likes. A tuple of tuples is a sequence of tuple-likes. I find it strange that my tuple of tuples was truncated to a single level MultiIndex, while my list of tuples was not.
Changing https://github.com/groutr/pandas/blob/master/pandas/indexes/multi.py#L983
to call itertools.zip_longest would fix the truncation issue.

I also don't understand why sometimes the arrays are transposed and every other case we don't transpose.
https://github.com/groutr/pandas/blob/master/pandas/indexes/multi.py#L980-L983

Expected Output

I would expect lists and tuples and other sequences of tuple-likes (sets, for example) to be treated the same. Do lists have a special meaning here?

This is in the latest master (588e29d) and pandas 0.19.1.
I'd be happy to submit a PR if this is unintended behavior

@jreback
Copy link
Contributor

jreback commented Dec 4, 2016

In [4]: list(zip(*T))
Out[4]: [(0, 0, 0, 0, 0, 0, 0)]

is what we do with a list-of-tuples (though), first making a list of this before hand doesn't change the result.

I am not sure of what zip does with unbalanced tuples.

So in this case we should be raising (.from_arrays will raise with unbalanced input arrays, which was a recent change).

So if you can figure out a nice performant way to zip these (w/o losing things), all ears.

@jreback jreback added this to the Next Major Release milestone Dec 4, 2016
@groutr
Copy link
Contributor Author

groutr commented Dec 5, 2016

Zip terminates at the shortest iterable. As I mentioned in my first post, itertools.zip_longest (from the standard library) is the ways to solve that (https://docs.python.org/3/library/itertools.html#itertools.zip_longest). What my questions were trying to get at was if there were any underlying reasons for the behavior.

I see that list(zip(*tuples)) seems to do a transpose.

In []: list(T55)
Out[]:
[(0, 1, 2, 3, 4),
 (0, 1, 2, 3, 4),
 (0, 1, 2, 3, 4),
 (0, 1, 2, 3, 4),
 (0, 1, 2, 3, 4)]
In []: list(zip(*T55))
Out[]:
[(0, 0, 0, 0, 0),
 (1, 1, 1, 1, 1),
 (2, 2, 2, 2, 2),
 (3, 3, 3, 3, 3),
 (4, 4, 4, 4, 4)]

I'd propose the following the if statement.

        elif isinstance(tuples, Sequence):     # Sequence is an abstract class from collections.abc
            arrays = list(lib.to_object_array_tuples(list(tuples)).T)
        else:
            arrays = list(zip_longest(*tuples))

But, if the elif is checking that tuples is a Sequence, then I think the else clause is largely unnecessary, because if tuples is a sequence, then lib.to_object_array_tuples should be able to handle it. In fact, the call to to_object_array_tuples maybe wholly unneeded because we can pass a list of tuples directly to MultiIndex.from_arrays (zip_longest will ensure that each inner tuple is the same length)

Maybe the else clause should raise an error if tuples isn't a sequence. This avoids the need to zip anything, unless I'm misunderstanding how lib.to_object_array_tuples can fail.

        elif isinstance(tuples, Sequence):
            arrays = list(zip_longest(*tuples))
        else:
            raise ValueError("tuples must be a sequence")
        return MultiIndex.from_arrays(arrays, ...)

Upon further investigation, to_object_array_tuples appears to mostly reimplement zip_longest. It seems for small iterables, zip_longest is can be much faster. Though for longer iterables, to_object_array_tuples starts to get an edge (timings done with python 3.5). Should we optimize index creation between large and small indexes?

# d55 = list of tuples like T (from OP) k=55
zip_longest = 19.9us
to_object_array_tuples: 15.5us

#where k = 20
zip_longest = 3.74us
to_object_array_tuples = 4.84us

# k =10
zip_longest = 1.46us
to_object_array_tuples = 3.35us

If you prefer to discuss code details in a PR, I can open one tonight.

@jreback
Copy link
Contributor

jreback commented Dec 5, 2016

not sure of the original rationale for having all of the clauses in the if

so feel free to see what happens if you add your examples as a test (and make changes as u suggest)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants