add optional default value to GroupBy.get_group #5452

dwiel · 2013-11-06T14:38:00Z

I've used a try/except block as opposed to an if/else (below) to keep the non-default code path fast. What I didn't do:

if default is None :
    inds = self.indices[name]
else :
    inds = self.indices.get(name, default)
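For comparison, here is a minimal sketch of the two lookup strategies on a plain dict standing in for `self.indices`. The function names `lookup_try` and `lookup_if` are illustrative only and are not part of pandas; the point is that the try/except form only looks at `default` when the key is missing.

```python
def lookup_try(indices, name, default=None):
    """try/except variant: when the key exists this is a single dict lookup."""
    try:
        return indices[name]
    except KeyError:
        if default is None:
            raise  # no default supplied: keep the original KeyError
        return default


def lookup_if(indices, name, default=None):
    """if/else variant: branches on `default` before every lookup."""
    if default is None:
        return indices[name]
    return indices.get(name, default)


indices = {"a": [0, 1], "b": [2]}  # hypothetical stand-in for GroupBy.indices
print(lookup_try(indices, "a", default=[]))  # [0, 1]
print(lookup_try(indices, "z", default=[]))  # []
print(lookup_if(indices, "z", default=[]))   # []
```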

Here is an example use case:

def foo(df, keys) :
    g = df.groupby('key')
    for key in keys :
        for x in g.get_group(key, default=[]).x :
            yield key, x

vs

def foo(df, keys) :
    g = df.groupby('key')
    for key in keys :
        if key in g.indices :
            for x in g.get_group(key).x :
                yield key, x

Many of the simple cases are actually handled by other higher level functions
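Without the proposed `default` argument, the same use case can be written against released pandas with a small wrapper. This is only a sketch: `get_group_or_empty` is a hypothetical helper, not a pandas function, and it assumes that a missing group raises `KeyError`.

```python
import pandas as pd


def get_group_or_empty(df, grouped, key):
    """Return the rows for `key`, or an empty frame with the same columns."""
    try:
        return grouped.get_group(key)
    except KeyError:
        return df.iloc[0:0]  # zero rows, same columns


def foo(df, keys):
    g = df.groupby("key")
    for key in keys:
        for x in get_group_or_empty(df, g, key).x:
            yield key, x


df = pd.DataFrame({"key": ["a", "a", "b"], "x": [1, 2, 3]})
pairs = list(foo(df, ["a", "z"]))  # "z" is not a group, so it contributes nothing
print(pairs)
```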

jreback · 2013-11-06T14:44:35Z

pls setup travis (see contributing.md)
need tests (I don't think our implementation will work if default was say a string), you can't take with it

jreback · 2013-11-06T16:35:36Z

pandas/tests/test_groupby.py

use self.assertRaises(KeyError, grouped.get_group, -1, default='default')

jtratner · 2013-11-06T22:12:20Z

Can you add some examples of why this would be useful to your PR? Are you
thinking this would be like dict.get?

dwiel · 2013-11-07T00:33:36Z

yeah, just added an example to the top, but yeah, it would act similar to .get in that it would allow you to avoid an if key in g.indices or an extra try/catch block

dwiel · 2013-11-07T00:40:14Z

oops

jreback · 2013-11-07T12:56:25Z

needs to be rebased to master, see here: https://github.com/pydata/pandas/wiki/Using-Git

dwiel · 2013-11-22T00:46:38Z

alright, squashed commits and rebased to master.

dwiel · 2013-11-22T02:05:02Z

still failing build tests though ... will figure out what's going on there

jtratner · 2013-11-22T07:54:52Z

pandas/core/groupby.py

This needs to be self.indices[default]

jtratner · 2013-11-22T07:55:42Z

if you put in my fix I'm guessing (more of) the builds will pass. inds needs to be an array of integer-like, not a string key.
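To illustrate the point about integer-like arrays, the sketch below shows what `GroupBy.indices` holds and how those positions feed `take`. The frame and column names are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", "b", "a"], "x": [10, 20, 30]})
g = df.groupby("key")

inds = g.indices["a"]  # integer positions of the rows in group "a"
rows = df.take(inds)   # take() selects rows by position

print(list(inds))
print(rows["x"].tolist())
# A string such as "default" is a group key, not a set of row positions,
# so it cannot be handed to take() directly.
```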

dwiel · 2013-11-22T14:04:28Z

exactly. A string key should raise an exception and that was what I was testing for, I was just expecting the wrong exception type. I'm not sure why the other test was failing since it worked on my machine a while back. Anyway, I've changed that test as well, so now they all pass on my machine, will wait and see how it looks to Travis

jtratner · 2013-11-22T15:52:34Z

btw - if only "slow" passes, that really means all your tests are failing
:) that build just skips the majority of the test suite in favor of running
tests that take a while.

dwiel · 2013-11-22T16:09:16Z

gotcha, I was running python setup.py test, not sure what exactly that gets mapped to. I guess ./test.sh would probably be better to run. Anyway, it looks like Travis passed. Does it do the full large suite of tests?

jtratner · 2013-11-22T16:10:52Z

Yes it does. Easiest is to run this:

nosetests -A 'not network and not slow'

jtratner · 2013-11-22T16:11:42Z

(that runs non-slow tests, which is usually what you want)

jtratner · 2013-11-22T16:13:32Z

Can you add a docstring to this that describes what kind of default should be passed to get_group?

Also need to add release notes at some point.

dwiel · 2013-11-22T16:14:00Z

cool, thanks. Is there anything else I should do to help get this PR accepted?

jtratner · 2013-11-22T16:19:54Z

The part that concerns me is that if you call it with a bad default it only raises if you give a group that's not in the grouper. So you could have a silent bug if you were using this. Maybe test whether it's actually a perf hit if you validate every time?

(I might be reading it wrong - no access to a computer right now)
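The concern can be reproduced without pandas. In this sketch, `take_positions` is a hypothetical stand-in for `obj.take`, and `get_group_lazy` mirrors the try/except structure described earlier in the thread; neither is actual pandas code.

```python
def take_positions(rows, inds):
    """Stand-in for obj.take: positions must be integers."""
    return [rows[i] for i in inds]


def get_group_lazy(rows, indices, name, default=None):
    try:
        inds = indices[name]
    except KeyError:
        if default is None:
            raise
        inds = default  # only reached, and only checked, for a missing key
    return take_positions(rows, inds)


rows = ["r0", "r1", "r2"]
indices = {"a": [0, 2]}

# Works, because the bad default is never looked at:
found = get_group_lazy(rows, indices, "a", default="oops")
# get_group_lazy(rows, indices, "z", default="oops") would raise TypeError,
# so the mistake only surfaces once a key happens to be missing.
```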

dwiel · 2013-11-22T16:33:25Z

added the docstring.

I could do a type check on all calls; the problem is that there appears to be no explicit expected type. In pandas.core.indexing._maybe_convert_indices, it converts isinstance(indices, list) to np.array, and leaves everything else alone. It's possible that if a user has their own np.array-like object which implements all of the necessary operations, take will work fine. Is there a way to test that default has a type which provides the desired interface?

jtratner · 2013-11-22T17:24:24Z

Maybe common.is_list_like?

So,

if default is not None and not com.is_list_like(default):
    raise TypeError
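A runnable sketch of that up-front check follows. It uses the public `pandas.api.types.is_list_like` rather than the internal `com.is_list_like` named above, and `get_group_checked` is a hypothetical wrapper, not the code from this PR.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_list_like


def get_group_checked(df, grouped, name, default=None):
    # Validate up front so a bad default fails even when `name` exists.
    if default is not None and not is_list_like(default):
        raise TypeError("default must be list-like, got %r" % (default,))
    try:
        return grouped.get_group(name)
    except KeyError:
        if default is None:
            raise
        # Interpret the default as row positions, as the PR did.
        return df.take(np.asarray(list(default), dtype=np.intp))


df = pd.DataFrame({"name": ["other", "a", "a"], "count": [1, 3, 4]})
g = df.groupby("name")

present = get_group_checked(df, g, "a", default=[0])
missing = get_group_checked(df, g, "c", default=[0])
```

With this ordering, `get_group_checked(df, g, "a", default="default")` raises `TypeError` even though group `"a"` exists.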

dwiel · 2013-11-22T17:31:18Z

awesome :) I can try that. How do you recommend I test the performance impact of that?

dwiel · 2013-11-22T17:32:18Z

I could write my own tests, but I know there are already some performance tests in pandas, maybe I can use existing tests?

jtratner · 2013-11-22T17:35:13Z

Look at test_perf.sh. I'm assuming there's a groupby test (just run the whole thing and see what falls out). I wouldn't expect it to be that much of an issue though.

dwiel · 2013-11-22T17:46:26Z

alright, I'll check that out. Yeah I agree that it will in most cases be far outweighed by the time required to obj.take
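As a rough cross-check that does not rely on the pandas performance suite, the two costs can be timed directly with `timeit`. This is only a sketch: the frame size and repeat counts are arbitrary, and absolute numbers will vary by machine and pandas version.

```python
import timeit

import pandas as pd
from pandas.api.types import is_list_like

default = [0, 1, 2]
df = pd.DataFrame({"x": range(1000)})

n_check = 100_000
n_take = 1_000

# Cost of the validation expression alone.
check_seconds = timeit.timeit(
    lambda: default is not None and not is_list_like(default), number=n_check
)
# Cost of the take() that follows it.
take_seconds = timeit.timeit(lambda: df.take(default), number=n_take)

print("check per call: %.3g s" % (check_seconds / n_check))
print("take per call:  %.3g s" % (take_seconds / n_take))
```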

dwiel · 2013-11-27T01:13:46Z

Alright, I just added the test for is_list_like and it seems to have not impacted performance.

jreback · 2013-11-27T18:38:45Z

@jtratner ok?

jreback · 2013-11-27T18:39:20Z

@dwiel actually..can you add release notes? does this have a reference issue?

dwiel · 2013-11-27T18:53:53Z

there isn't a reference issue that I know of. Should I create one? Just added release notes

dwiel · 2013-11-28T00:08:59Z

Alright, added this example:

>>> data = DataFrame({'name' : ['a', 'a', 'b', 'd'], 'count' : [3,4,3,2]})
>>> g = data.groupby('name')
>>> g.get_group('a', default = [])['count'].sum()
7
>>> g.get_group('c', default = [])['count'].sum()
0

cancan101 · 2013-11-28T00:42:46Z

There is no test of default being a list of indices. The only tests right now are for the empty list and string (an invalid default).

jreback · 2013-11-28T00:55:33Z

is there any reason to ever have a value for default, or is maybe having it True/False enough?

cancan101 · 2013-11-28T00:57:11Z

Tbh I don't even understand the third option which is why I'm asking for test case.

jtratner · 2013-11-28T01:01:28Z

to build on that, can you offer a real-world (ish) case where this is helpful?

dwiel · 2013-11-28T01:13:48Z

I added a test case for a default value which wasn't None or []:

data = DataFrame({
    'name' : ['other', 'a', 'a', 'b', 'd'],
    'count' : [1,3,4,3,2],
})
g = data.groupby('name')
self.assertEqual(g.get_group('a', default = [0])['count'].sum(), 7)
self.assertEqual(g.get_group('c', default = [0])['count'].sum(), 1)

dwiel · 2013-11-28T01:16:21Z

default as boolean makes sense to me. The counter argument is that a non boolean default better matches dict.get, but a boolean value also side steps this confusion about how the various types should be handled. Any name ideas? default=True doesn't make intuitive sense.

cancan101 · 2013-11-28T01:22:13Z

Maybe: use_default or missing_as_empty.

cancan101 · 2013-11-28T01:22:55Z

The use case for the third option ie what you offered as the test case above is confusing to me.

dwiel · 2013-11-28T01:33:40Z

yeah, the test case is quite contrived. It might be slightly more clear to do this instead:

...
default_index = data.index[data['name'] == 'other']
self.assertEqual(g.get_group('c', default = default_index)['count'].sum(), 1)

The contrived idea was for get_group to return the same thing as get_group('other') in the event that name isn't available. Not necessarily useful, but someone wanted an example.

All of that said, at this point it might make more sense to keep it simple for now with a boolean default_as_empty parameter. If we wanted different semantics later we could add them: default_index = [0,1,2] or default = 'string', etc. Since we really only have a reasonable use case for True and False at the moment, that might be the best combination of utility and simplicity.

jtratner · 2013-11-28T02:05:27Z

Again, can you offer where you'd actually need this?

dwiel · 2013-11-28T02:12:35Z

@jtratner where you'd actually need a default at all, or where you'd actually need an index slice default?

jtratner · 2013-11-28T03:52:24Z

@dwiel my problem with this PR is that every example I've seen makes the code harder to follow, because it hides what's actually being returned when the default is used.

For example, it's not intuitive that

g.get_group('c', default = [0])

means 'get row 0 from the original DataFrame if you don't have anything'. In contrast, this is very clear:

try:
    group = g.get_group('c')
except KeyError:
    group = data.iloc[[0]]

At best, I could see emulating dict.get by allowing a second argument that is the default to be returned if the key's not there, so this would be much clearer to me:

g.get_group('c', default=data.iloc[[0]])

because it follows the flow of dict.get equivalently. It also works just as well with the example you gave earlier too.

jorisvandenbossche · 2013-11-28T11:59:06Z

I also think that following dict.get would be the most sensible, so just giving the value/object itself that you want to return as argument to default. As @jtratner said, this allows you to work with indices:

g.get_group('c', default=data.iloc[[0,1]])

and also to return another group:

g.get_group('c', default=g.get_group('a'))

The only problem with this is that returning an empty DataFrame by providing [] does not really fit this logic, while this is maybe the most common use case.
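The dict.get-style semantics discussed here can be sketched against released pandas as a wrapper. `get_group_or` is a hypothetical helper, not a pandas method; the empty-frame case is covered by passing an explicit empty slice rather than `[]`.

```python
import pandas as pd


def get_group_or(grouped, key, default=None):
    """dict.get-style lookup: return `default` unchanged when `key` is missing."""
    try:
        return grouped.get_group(key)
    except KeyError:
        return default


data = pd.DataFrame({"name": ["other", "a", "a", "b", "d"],
                     "count": [1, 3, 4, 3, 2]})
g = data.groupby("name")

fallback_rows = get_group_or(g, "c", default=data.iloc[[0, 1]])
fallback_group = get_group_or(g, "c", default=g.get_group("a"))
empty = get_group_or(g, "c", default=data.iloc[0:0])  # explicit empty frame
```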

dwiel · 2013-11-28T12:32:10Z

Alright, for sake of ease of conversation here are the proposed options so far:

no change to get_group
default takes a list of indices and raises TypeError in all other cases
default takes a list of indices or a more dict.get like pass through value. any option that is_list_like will be interpreted as indices
default takes a value to pass through like dict.get
default takes a value to pass through like dict.get; default_as_empty takes True to use default = self.take([])
no default; default_as_empty takes True to use default = self.take([])

I had no idea this minor change would be so involved!

jreback · 2013-12-03T11:37:22Z

@jtratner what do you think?

jreback · 2014-02-16T13:06:28Z

circling back to this...

@dwiel

ok I like @jorisvandenbossche / @jtratner examples, e.g. if you provide a default it will get returned

g.get_group('c', default=data.iloc[[0,1]])

g.get_group('c', default=g.get_group('a'))

but I will suggest an API change here that I think makes sense.

if the group is not found, instead of raising a KeyError

if there is a non-None default, return it
else return an empty object (based on the type of object being grouped, e.g. series/frame)

this is consistent with get; not sure what will break if you do this, but it makes it intuitive
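The behaviour suggested above could look roughly like the following wrapper. This is a sketch only: `get_group_lenient` is a hypothetical name, and using `obj.iloc[0:0]` as the "empty object of the same type" is an assumption, not something settled in this thread.

```python
import pandas as pd


def get_group_lenient(obj, grouped, key, default=None):
    try:
        return grouped.get_group(key)
    except KeyError:
        if default is not None:
            return default
        return obj.iloc[0:0]  # empty Series or DataFrame, matching `obj`


frame = pd.DataFrame({"name": ["a", "a", "b"], "count": [3, 4, 3]})
series = frame["count"]

empty_frame = get_group_lenient(frame, frame.groupby("name"), "c")
empty_series = get_group_lenient(series, series.groupby(frame["name"]), "c")
found = get_group_lenient(frame, frame.groupby("name"), "a")
```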

jreback · 2015-01-18T21:39:57Z

@dwiel closing, but if you'd like to update, pls do and we can reopen

dwiel closed this Nov 7, 2013

dwiel reopened this Nov 7, 2013


add optional default value to GroupBy.get_group

1e7b974

jreback added API Design labels Feb 16, 2014

jreback modified the milestones: 0.15.0, 0.14.0 Apr 9, 2014

jreback closed this Jan 18, 2015

jreback mentioned this pull request Jan 22, 2015

Exception in get_group() with invalid index #9299

Closed

kinow added a commit to kinow/pandas that referenced this pull request Sep 22, 2019

Issue pandas-dev#5452 add default option to get_group

cf377e7

kinow added a commit to kinow/pandas that referenced this pull request Sep 23, 2019

Issue pandas-dev#5452 add default option to get_group

2a82d4d

kinow mentioned this pull request Sep 23, 2019

Add default parameter for get_group #28574

Closed

5 tasks

kinow added a commit to kinow/pandas that referenced this pull request Sep 23, 2019

Issue pandas-dev#5452 add default option to get_group

55f5b78
