dwiel
Contributor

@dwiel dwiel commented Nov 6, 2013

I've used a try/except block as opposed to an if/else (below) to keep the non-default code path fast. What I didn't do:

if default is None :
    inds = self.indices[name]
else :
    inds = self.indices.get(name, default)

Here is an example use case:

def foo(df, keys) :
    g = df.groupby('key')
    for key in keys :
        for x in g.get_group(key, default=[]).x :
            yield key, x

vs

def foo(df, keys) :
    g = df.groupby('key')
    for key in keys :
        if key in g.indices :
            for x in g.get_group(key).x :
                yield key, x

Many of the simple cases are actually handled by other higher-level functions.
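
For readers who want to try the idea without patching pandas, the try/except approach described above can be sketched as a standalone helper. This is only an illustration, not the PR's code: `get_group_with_default` is a hypothetical name, and it assumes `default` is a list of integer row positions, as in the use case above.

```python
import numpy as np
import pandas as pd


def get_group_with_default(grouped, obj, name, default=None):
    """Hypothetical helper (not part of pandas) mimicking the proposed behaviour."""
    try:
        # Fast path: no extra branch when the group exists.
        inds = grouped.indices[name]
    except KeyError:
        if default is None:
            # No default given: keep the usual KeyError.
            raise
        # Assumption: default is a list of integer row positions.
        inds = default
    return obj.take(np.asarray(inds, dtype=np.intp))


df = pd.DataFrame({"key": ["a", "a", "b"], "x": [1, 2, 3]})
g = df.groupby("key")

present = get_group_with_default(g, df, "a", default=[])  # rows of group "a"
missing = get_group_with_default(g, df, "z", default=[])  # empty frame
```

With `default=[]` a missing key yields an empty frame, so a loop over `missing.x` simply does nothing.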

@jreback
Contributor

jreback commented Nov 6, 2013

  • pls set up Travis (see contributing.md)
  • needs tests (I don't think our implementation will work if default were, say, a string; you can't take with it)

Contributor

use self.assertRaises(KeyError, grouped.get_group, -1, default='default')

@jtratner
Contributor

jtratner commented Nov 6, 2013

Can you add some examples of why this would be useful to your PR? Are you thinking this would be like dict.get?

@dwiel
Contributor Author

dwiel commented Nov 7, 2013

Yeah, I just added an example to the top. It would act similarly to dict.get in that it would allow you to avoid an `if key in g.indices` check or an extra try/except block.
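
For comparison, here is the plain-dict pattern being referred to. It is a minimal illustration with made-up data; nothing in it is pandas-specific.

```python
counts = {"a": 7, "b": 3}

# Without a default: guard the lookup explicitly ...
if "c" in counts:
    value_if = counts["c"]
else:
    value_if = 0

# ... or catch the KeyError.
try:
    value_try = counts["c"]
except KeyError:
    value_try = 0

# With dict.get the fallback is part of the call itself.
value_get = counts.get("c", 0)
```

All three variants produce the same fallback value; `dict.get` just expresses it in one call.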

@dwiel dwiel closed this Nov 7, 2013
@dwiel
Contributor Author

dwiel commented Nov 7, 2013

oops

@dwiel dwiel reopened this Nov 7, 2013
@jreback
Contributor

jreback commented Nov 7, 2013

needs to be rebased to master, see here: https://github.com/pydata/pandas/wiki/Using-Git

@dwiel
Contributor Author

dwiel commented Nov 22, 2013

alright, squashed commits and rebased to master.

@dwiel
Contributor Author

dwiel commented Nov 22, 2013

Still failing the build tests, though ... will figure out what's going on there.

Contributor

This needs to be self.indices[default]

@jtratner
Contributor

If you put in my fix, I'm guessing (more of) the builds will pass. inds needs to be an array of integer-like values, not a string key.

@dwiel
Contributor Author

dwiel commented Nov 22, 2013

exactly. A string key should raise an exception and that was what I was testing for, I was just expecting the wrong exception type. I'm not sure why the other test was failing since it worked on my machine a while back. Anyway, I've changed that test as well, so now they all pass on my machine, will wait and see how it looks to Travis

@jtratner
Contributor

btw - if only "slow" passes, that really means all your tests are failing :) that build just skips the majority of the test suite in favor of running tests that take a while.

@dwiel
Contributor Author

dwiel commented Nov 22, 2013

gotcha, I was running python setup.py test, not sure what exactly that gets mapped to. I guess ./test.sh would probably be better to run. Anyway, it looks like Travis passed. Does it do the full large suite of tests?

@jtratner
Contributor

Yes it does. Easiest is to run this:

nosetests -A 'not network and not slow'

@jtratner
Contributor

(that runs the non-slow tests, which is usually what you want)

@jtratner
Contributor

Can you add a docstring to this that describes what kind of default should be passed to get_group?

Also need to add release notes at some point.

@dwiel
Contributor Author

dwiel commented Nov 22, 2013

cool, thanks. Is there anything else I should do to help get this PR accepted?

@jtratner
Contributor

The part that concerns me is that if you call it with a bad default it only raises if you give a group that's not in the grouper. So you could have a silent bug if you were using this. Maybe test whether it's actually a perf hit if you validate every time?

(I might be reading it wrong - no access to a computer right now)

@dwiel
Contributor Author

dwiel commented Nov 22, 2013

added the docstring.

I could do a type check on all calls; the problem is that there appears to be no explicit expected type. pandas.core.indexing._maybe_convert_indices converts indices to an np.array if isinstance(indices, list), and leaves everything else alone. It's possible that if a user has their own np.array-like object which implements all of the necessary operations, take will work fine. Is there a way to test that default has a type which provides the desired interface?

@jtratner
Contributor

Maybe common.is_list_like?

So,

if default is not None and not com.is_list_like(default):
    raise TypeError
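
To make the concern and the suggested check concrete, here is a hedged sketch of validating `default` on every call, so that a bad default is rejected even when the requested group exists. `get_group_checked` is a hypothetical helper, not the PR's code, and it imports `is_list_like` from `pandas.api.types` (the public location in current pandas) rather than from the `com` alias used in the snippet above.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_list_like


def get_group_checked(grouped, obj, name, default=None):
    """Hypothetical helper: validate default up front, then look up the group."""
    if default is not None and not is_list_like(default):
        raise TypeError("default must be list-like or None")
    try:
        inds = grouped.indices[name]
    except KeyError:
        if default is None:
            raise
        # Assumption: default is a list of integer row positions.
        inds = default
    return obj.take(np.asarray(inds, dtype=np.intp))


df = pd.DataFrame({"key": ["a", "a", "b"], "x": [1, 2, 3]})
g = df.groupby("key")

try:
    # Group "a" exists, but the string default is still rejected.
    get_group_checked(g, df, "a", default="oops")
    rejected = False
except TypeError:
    rejected = True
```

Without the up-front check, the call with `default="oops"` would succeed for `"a"` and fail only later, for a key that happens to be missing.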

@dwiel
Contributor Author

dwiel commented Nov 22, 2013

awesome :) I can try that. How do you recommend I test the performance impact of that?

@dwiel
Contributor Author

dwiel commented Nov 22, 2013

I could write my own tests, but I know there are already some performance tests in pandas, maybe I can use existing tests?

@jtratner
Contributor

Look at test_perf.sh. I'm assuming there's a groupby test (just run the whole thing and see what falls out). I wouldn't expect it to be that much of an issue though.

@dwiel
Contributor Author

dwiel commented Nov 22, 2013

Alright, I'll check that out. Yeah, I agree that in most cases it will be far outweighed by the time spent in obj.take.

@dwiel
Contributor Author

dwiel commented Nov 27, 2013

Alright, I just added the test for is_list_like and it seems to have not impacted performance.
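
A quick way to sanity-check that claim outside the pandas performance suite is a `timeit` comparison of the check against a `take` call. This is a rough sketch on made-up data, not the benchmark used in the PR, and the numbers will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd
from pandas.api.types import is_list_like

# Made-up data: 10,000 rows spread over 100 integer keys.
df = pd.DataFrame({"key": np.arange(10_000) % 100, "x": np.arange(10_000)})
inds = df.groupby("key").indices[0]  # integer positions of group 0

n = 200
check_time = timeit.timeit(lambda: is_list_like(inds), number=n)
take_time = timeit.timeit(lambda: df.take(inds), number=n)

print(f"is_list_like:   {check_time / n * 1e6:.2f} us per call")
print(f"DataFrame.take: {take_time / n * 1e6:.2f} us per call")
```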

@jreback
Contributor

jreback commented Nov 27, 2013

@jtratner ok?

@jreback
Contributor

jreback commented Nov 27, 2013

@dwiel actually..can you add release notes? does this have a reference issue?

@dwiel
Contributor Author

dwiel commented Nov 27, 2013

there isn't a reference issue that I know of. Should I create one? Just added release notes

@dwiel
Contributor Author

dwiel commented Nov 28, 2013

Alright, added this example:

>>> data = DataFrame({'name' : ['a', 'a', 'b', 'd'], 'count' : [3,4,3,2]})
>>> g = data.groupby('name')
>>> g.get_group('a', default = [])['count'].sum()
7
>>> g.get_group('c', default = [])['count'].sum()
0

@cancan101
Contributor

There is no test of default being a list of indices. The only tests right now are for the empty list and string (an invalid default).

@jreback
Contributor

jreback commented Nov 28, 2013

Is there any reason to ever have a value for default, or is having it True/False maybe enough?

@cancan101
Contributor

Tbh I don't even understand the third option, which is why I'm asking for a test case.

@jtratner
Contributor

to build on that, can you offer a real-world (ish) case where this is helpful?

@dwiel
Contributor Author

dwiel commented Nov 28, 2013

I added a test case for a default value which wasn't None or []:

data = DataFrame({
    'name' : ['other', 'a', 'a', 'b', 'd'],
    'count' : [1,3,4,3,2],
})
g = data.groupby('name')
self.assertEqual(g.get_group('a', default = [0])['count'].sum(), 7)
self.assertEqual(g.get_group('c', default = [0])['count'].sum(), 1)

@dwiel
Contributor Author

dwiel commented Nov 28, 2013

default as a boolean makes sense to me. The counterargument is that a non-boolean default better matches dict.get, but a boolean value also sidesteps this confusion about how the various types should be handled. Any name ideas? default=True doesn't make intuitive sense.

@cancan101
Contributor

Maybe: use_default or missing_as_empty.

@cancan101
Contributor

The use case for the third option, i.e. what you offered as the test case above, is confusing to me.

@dwiel
Contributor Author

dwiel commented Nov 28, 2013

yeah, the test case is quite contrived. It might be slightly more clear to do this instead:

...
default_index = data.index[data['name'] == 'other']
self.assertEqual(g.get_group('c', default = default_index)['count'].sum(), 1)

The contrived idea was for get_group to return the same thing as get_group('other') in the event that name isn't available. Not necessarily useful, but someone wanted an example.

All of that said, at this point it might make more sense to keep it simple for now with a boolean default_as_empty parameter. If we wanted different semantics later we could add them: default_index = [0,1,2] or default = 'string', etc. Since we really only have a reasonable use case for True and False at the moment, that might be the best combination of utility and simplicity.
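
One caveat on the `default_index` snippet above, offered as a sketch rather than a statement about the PR's code: `data.index[...]` yields index labels, whereas `take` expects integer positions. The two coincide only for the default 0..n-1 index. Assuming positions are what is wanted, `np.flatnonzero` on the boolean mask gives them regardless of the index. The non-default index below is chosen purely for illustration.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    {"name": ["other", "a", "a", "b", "d"], "count": [1, 3, 4, 3, 2]},
    index=[10, 11, 12, 13, 14],  # non-default labels, for illustration only
)

mask = (data["name"] == "other").to_numpy()
labels = data.index[mask]          # index labels of the matching rows
positions = np.flatnonzero(mask)   # integer positions of the matching rows

# take works on positions, so this selects the 'other' row.
fallback = data.take(positions)
```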

@jtratner
Contributor

Again, can you offer where you'd actually need this?

@dwiel
Contributor Author

dwiel commented Nov 28, 2013

@jtratner where you'd actually need a default at all, or where you'd actually need an index slice default?

@jtratner
Contributor

@dwiel my problem with this PR is that every example I've seen makes the code harder to follow, because it hides what's actually being returned when the default is used.

For example, it's not intuitive that

g.get_group('c', default = [0])

means 'get row 0 from the original DataFrame if you don't have anything'. In contrast, this is very clear:

try:
    group = g.get_group('c')
except KeyError:
    group = data.iloc[[0]]

At best, I could see emulating dict.get by allowing a second argument that is the default to be returned if the key's not there, so this would be much clearer to me:

g.get_group('c', default=data.iloc[[0]])

because it follows the flow of dict.get equivalently. It also works just as well with the example you gave earlier too.

@jorisvandenbossche
Member

I also think that following dict.get would be the most sensible, so just giving the value/object itself that you want to return as argument to default. As @jtratner said, this allows you to work with indices:

g.get_group('c', default=data.iloc[[0,1]])

and also to return another group:

g.get_group('c', default=g.get_group('a'))

The only problem with this is that returning an empty DataFrame by providing [] does not really fit this logic, even though it is maybe the most common use case.
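
The dict.get-style pass-through can be tried today with a small wrapper around the existing `get_group`. This is a minimal sketch under the assumption that the default is returned unchanged; `get_group_or` is a hypothetical name, not a pandas API.

```python
import pandas as pd


def get_group_or(grouped, name, default):
    """Hypothetical helper: return default unchanged if the group is missing."""
    try:
        return grouped.get_group(name)
    except KeyError:
        return default


data = pd.DataFrame({"name": ["a", "a", "b", "d"], "count": [3, 4, 3, 2]})
g = data.groupby("name")

first_rows = get_group_or(g, "c", default=data.iloc[[0, 1]])   # explicit rows
other_group = get_group_or(g, "c", default=g.get_group("a"))   # another group
empty = get_group_or(g, "c", default=data.iloc[:0])            # explicit empty frame
```

Under this reading, the empty case needs an explicit empty object such as `data.iloc[:0]` rather than `[]`. As with `dict.get`, the default expression is evaluated even when the key is present.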

@dwiel
Contributor Author

dwiel commented Nov 28, 2013

Alright, for sake of ease of conversation here are the proposed options so far:

  1. no change to get_group
  2. default takes a list of indices and raises TypeError in all other cases
  3. default takes a list of indices or a more dict.get like pass through value. any option that is_list_like will be interpreted as indices
  4. default takes a value to pass through like dict.get
  5. default takes a value to pass through like dict.get; default_as_empty takes True to use default = self.take([])
  6. no default; default_as_empty takes True to use default = self.take([])

I had no idea this minor change would be so involved!

@jreback
Contributor

jreback commented Dec 3, 2013

@jtratner what do you think?

@jreback
Contributor

jreback commented Feb 16, 2014

circling back to this...

@dwiel

ok I like @jorisvandenbossche / @jtratner examples, e.g. if you provide a default it will get returned

g.get_group('c', default=data.iloc[[0,1]])

g.get_group('c', default=g.get_group('a'))

but I will suggest an API change here that I think makes sense.

if the group is not found, instead of raising a KeyError

  • if there is a non-None default, return it
  • else return an empty object (based on the type of object being grouped, e.g. series/frame)

this is consistent with get; not sure what will break if you do this, but it makes it intuitive
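
The two rules above can be sketched as follows. This is an illustration of the proposal, not an implementation inside pandas: `get_group_or_empty` is a hypothetical helper, and the grouped object is passed in explicitly so that an empty result of the matching type can be built from it.

```python
import pandas as pd


def get_group_or_empty(grouped, obj, name, default=None):
    """Hypothetical helper sketching the proposed lookup rules."""
    try:
        return grouped.get_group(name)
    except KeyError:
        if default is not None:
            return default
        # Empty object of the same type (Series/DataFrame) as the grouped data.
        return obj.iloc[:0]


frame = pd.DataFrame({"name": ["a", "a", "b"], "count": [3, 4, 3]})
series = frame["count"]

empty_frame = get_group_or_empty(frame.groupby("name"), frame, "c")
empty_series = get_group_or_empty(series.groupby(frame["name"]), series, "c")
with_default = get_group_or_empty(
    frame.groupby("name"), frame, "c", default=frame.iloc[[0]]
)
```

Note that under these rules a missing group never raises, which is the part most likely to affect existing callers that rely on the KeyError.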

@jreback jreback modified the milestones: 0.15.0, 0.14.0 Apr 9, 2014
@jreback
Contributor

jreback commented Jan 18, 2015

@dwiel closing, but if you'd like to update, pls do and we can reopen

@jreback jreback closed this Jan 18, 2015
kinow added a commit to kinow/pandas that referenced this pull request Sep 22, 2019
kinow added a commit to kinow/pandas that referenced this pull request Sep 23, 2019
@kinow kinow mentioned this pull request Sep 23, 2019
kinow added a commit to kinow/pandas that referenced this pull request Sep 23, 2019
Labels
API Design, Groupby, Indexing (related to indexing on series/frames, not to indexes themselves)
5 participants