-
-
Notifications
You must be signed in to change notification settings - Fork 19k
add optional default value to GroupBy.get_group #5452
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
pandas/tests/test_groupby.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use self.assertRaises(KeyError, grouped.get_group, -1, default='default')
Can you add some examples of why this would be useful to your PR? Are you |
yeah, just added an example to the top, but yeah, it would act similar to .get in that it would allow you to avoid an if key in g.indices or an extra try/catch block |
oops |
needs to be rebased to master, see here: https://github.com/pydata/pandas/wiki/Using-Git |
alright, squashed commits and rebased to master. |
still, failing build tests though ... will figure out whats going on there |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to be self.indices[default]
if you put in my fix I'm guessing (more of) the builds will pass. inds needs to be an array of integer-like, not a string key. |
exactly. A string key should raise an exception and that was what I was testing for, I was just expecting the wrong exception type. I'm not sure why the other test was failing since it worked on my machine a while back. Anyway, I've changed that test as well, so now they all pass on my machine, will wait and see how it looks to Travis |
btw - if only "slow" passes, that really means all your tests are failing |
gotcha, I was running python setup.py test, not sure what exactly that gets mapped to. I guess ./test.sh would probably be better to run. Anyway, it looks like Travis passed. Does it do the full large suite of tests? |
Yes it does. Easiest is to run this: nosetests -A 'not network and not slow' |
(that runs non slow tests which is usually want you want) |
Can you add a docstring to this that describes what kind of default should be passed to get_group? Also need to add release notes at some point. |
cool, thanks. Is there anything else I should do to help get this PR accepted? |
The part that concerns me is that if you call it with a bad default it only raises if you give a group that's not in the grouper. So you could have a silent bug if you were using this. Maybe test whether it's actually a perf hit if you validate every time? (I might be reading it wrong - no access to a computer right now) |
added the docstring. I could do a type check on all calls, the problem is that it appears that there is no explicit expected type. In pandas.core.indexing._maybe_convert_indices, it converts isinstance(indices, list) to np.array, and leaves everything else alone. Its possible that if a user has their own np.array like object which implements all of the necessary operations take will work fine. Is there a way to test that default has a type which provides the desired interface? |
Maybe common.is_list_like? So, if default is not None and not com.is_list_like(default): |
awesome :) I can try that. How do you recommend I test the performance impact of that? |
I could write my own tests, but I know there are already some performance tests in pandas, maybe I can use existing tests? |
Look at test_perf.sh. I'm assuming there's a groupby test (just run the whole thing and see what falls out) . I wouldn't expect it to be that much of an issue though. |
alright, I'll check that out. Yeah I agree that it will in most cases be far outweighed by the time required to obj.take |
Alright, I just added the test for is_list_like and it seems to have not impacted performance. |
@jtratner ok? |
@dwiel actually..can you add release notes? does this have a reference issue? |
there isn't a reference issue that I know of. Should I create one? Just added release notes |
Alright, added this example:
|
There is no test of default being a list of indices. The only tests right now are for the empty list and string (an invalid default). |
is their any reason to ever have a value for default or is maybe having it True/False enough? |
Tbh I don't even understand the third option which is why I'm asking for test case. |
to build on that, can you offer a real-world (ish) case where this is helpful? |
I added a test case for a default value which wasn't None or []:
|
default as boolean makes sense to me. The counter argument is that a non boolean default better matches dict.get, but a boolean value also side steps this confusion about how the various types should be handled. Any name ideas? default=True doesn't make intuitive sense. |
Maybe: use_default or missing_as_empty. |
The use case for the third option ie what you offered as the test case above is confusing to me. |
yeah, the test case is quite contrived. It might be slightly more clear to do this instead:
The contrived idea was to for get_group to return the same thing as get_group('other') in the event that name isn't available. Not necessarily useful, but someone wanted an example. All of that said, at this point it might make more sense to keep it simple for now with a boolean default_as_empty parameter. If we wanted different semantics later we could add them: default_index = [0,1,2] or default = 'string', etc. Since we really only have a reasonable use case for True and False at the moment, that might be the best combination of utility and simplicity. |
Again, can you offer where you'd actually need this? |
@jtratner where you'd actually need a default at all, or where you'd actually need an index slice default? |
@dwiel my problem with this PR is that every example I've seen makes the code harder to follow, because it hides what's actually being returned when the default is used. For example, it's not intuitive that g.get_group('c', default = [0]) means 'get row 0 from the original DataFrame if you don't have anything'. In contrast, this is very clear: try:
group = g.get_group('c')
except KeyError:
group = data.iloc[[0]] At best, I could see emulating g.get_group('c', default=data.iloc[[0]]) because it follows the flow of dict.get equivalently. It also works just as well with the example you gave earlier too. |
I also think that following dict.get would be the most sensible, so just giving the value/object itself that you want to return as argument to
and also to return another group:
The only problem with this is that returing an empty DataFrame with providing |
Alright, for sake of ease of conversation here are the proposed options so far:
I had no idea this minor change would be so involved! |
@jtratner what do you think? |
circling back to this... ok I like @jorisvandenbossche / @jtratner examples, e.g. if you provide a default it will get returned
but I will suggest an API change here that I think makes sense. if the group is not found, instead of raising a
this is consistent with get; not sure what will break if you do this, but it makes it intuitive |
@dwiel closing, but if you'd like to update, pls do and we can reopen |
I've used a try except block as opposed to and if else (below) to keep the non-default code path fast. What I didn't do:
Here is an example use case:
vs
Many of the simple cases are actually handled by other higher level functions