
ENH: Feature Request for Ungroup Method for Grouped Data Frames #43902

Open
mdancho84 opened this issue Oct 6, 2021 · 16 comments
Labels
Enhancement Groupby Needs Discussion Requires discussion from core team before further action

Comments

@mdancho84

Hi, thanks for your work developing pandas. I'd like to request a feature: an ungroup() method for grouped data frames. It's related to this StackOverflow question, where I've developed a hack that uses the .obj attribute to pull the original data frame back out of the grouped data frame.

However, it would be helpful to have a dedicated method that does this extraction, so that users don't have to depend on my hack.

>>> dat_1 = df.groupby("category_2")
>>> dat_1
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fce78b3dd00>
>>> dat_1.obj
    order_date          category_2     value
1   2011-02-01  Cross Country Race  324400.0
2   2011-03-01  Cross Country Race  142000.0
3   2011-04-01  Cross Country Race  498580.0
4   2011-05-01  Cross Country Race  220310.0
5   2011-06-01  Cross Country Race  364420.0
..         ...                 ...       ...
535 2015-08-01          Triathalon   39200.0
536 2015-09-01          Triathalon   75600.0
537 2015-10-01          Triathalon   58600.0
538 2015-11-01          Triathalon   70050.0
539 2015-12-01          Triathalon   38600.0

[531 rows x 3 columns]
@mdancho84 mdancho84 added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Oct 6, 2021
@jreback
Contributor

jreback commented Oct 6, 2021

-1, as pd.concat([grp for g, grp in df.groupby...()]) is idiomatic. This is not worth a method.
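(For reference, a complete, runnable version of that idiomatic pattern might look like the sketch below; the data and column name are only illustrative.)

import pandas as pd

df = pd.DataFrame({"category_2": ["a", "b", "a", "b"], "value": [1, 2, 3, 4]})

# Re-assemble the rows from the groups; note this gathers rows group by group.
ungrouped = pd.concat([grp for _, grp in df.groupby("category_2")])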

@jreback jreback added Groupby and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Oct 6, 2021
@hnagaty

hnagaty commented Oct 7, 2021

That would be a nice feature and may come in handy at times.

@mdancho84
Author

It would definitely be a handy tool that helps beginners extract groups. It also parallels R's tidyverse, where dplyr has ungroup(), so it might make it easier for R users to transition to pandas.

@mroeschke mroeschke added the Needs Discussion Requires discussion from core team before further action label Oct 13, 2021
@s-pike

s-pike commented Jan 28, 2022

For me, this feature could be useful.

For reference, pd.concat([grp for g, grp in df.groupby...()]) doesn't produce quite the same output as df.groupby().obj. The former sorts the data frame into groups, whereas the latter maintains the original row order. The .obj hack is also an order of magnitude faster in my tests (even if you first df.sort_values(group), though again the results aren't identical).
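(A small sketch of the ordering difference described above; the data is illustrative.)

import pandas as pd

df = pd.DataFrame({"group": ["b", "a", "b", "a"], "value": [1, 2, 3, 4]})

via_concat = pd.concat([grp for _, grp in df.groupby("group")])
via_obj = df.groupby("group").obj

print(via_concat.index.tolist())  # [1, 3, 0, 2] -- rows gathered group by group
print(via_obj.index.tolist())     # [0, 1, 2, 3] -- original row order preserved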

It would really come into its own if the DataFrameGroupBy object had its own assign method, equivalent to dplyr's group_by %>% mutate functionality (see this stackoverflow question). If you like method chaining, the best current approach is x.groupby('group').transform('fn')['value'], but that gets awkward if you want to use the group for multiple assignments, e.g.:

(df.assign(normalised_value = lambda x: x['value'] / x.groupby('group').transform('sum')['value'],
           normalising_value = lambda x: x.groupby('group').transform('sum')['value'])
  .more_methods...()
)

It'd be nice to have something like:

(df.groupby('group')
  .assign(normalised_value = lambda x: x['value']/x['value'].sum(),
          normalising_value = lambda x: x['value'].sum())
  .ungroup()
  .more_methods...()
)

The R dplyr equivalent being:

df %>%
  group_by(group) %>%
  mutate(normalised_value = value / sum(value),
         normalising_value = sum(value)) %>%
  ungroup() %>%
  more_methods...()

@pwwang

pwwang commented Mar 17, 2022

Looks like some of you are leaning toward R/dplyr styles.

Check out datar, which reimagines pandas APIs to align with R/dplyr's.

An example based on @s-pike's R code:

>>> from datar.all import f, tibble, group_by, mutate, ungroup, row_number, sum
[2022-03-17 11:25:33][datar][WARNING] Builtin name "sum" has been overriden by datar.
>>> df = tibble(group=[1,1,2,2], value=[1,2,3,4])
>>> (
...     df
...     >> group_by(f.group)
...     >> mutate(normalised_value=f.value/sum(f.value), normalising_value=sum(f.value))
...     >> ungroup()
...     >> mutate(n=row_number())
... )
    group   value  normalised_value  normalising_value         n
  <int64> <int64>         <float64>            <int64> <float64>
0       1       1          0.333333                  3       1.0
1       1       2          0.666667                  3       2.0
2       2       3          0.428571                  7       3.0
3       2       4          0.571429                  7       4.0

@M-Harrington

An ungroup() as a simple wrapper seems like a no-brainer, especially for people new to Python who came from R. In general, why would you write pd.concat([grp for g, grp in df.groupby...()]) when a method as simple as df.ungroup() could exist, even if ungroup() just called that same code? It seems like a simple change that would clear up the multiple ways "ungrouping" can be done and reduce choice fatigue.
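(One possible shape for such a wrapper, sketched here as a free function; pandas has no DataFrameGroupBy.ungroup(), so the name and behaviour are hypothetical and simply delegate to the existing .obj attribute.)

import pandas as pd

def ungroup(gb):
    """Hypothetical helper: return the original frame backing a DataFrameGroupBy."""
    return gb.obj

df = pd.DataFrame({"A": ["x", "y", "x"], "B": [1, 2, 3]})
assert ungroup(df.groupby("A")).equals(df)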

@jreback
Contributor

jreback commented May 29, 2022

if you think this is useful, then show a complete example

the above is not very compelling

@M-Harrington

I'm not sure I understand what you're looking for, @jreback, especially if you're referring to @s-pike's example. Do you want an example of why it might be useful to have a wrapper for ungrouping a data frame? If you need to recover the original row order, as is common when matching against unlabeled numpy data for machine learning, having a data frame reordered by group makes matching the two datasets difficult.

This is a task that happens to me frequently.
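(A sketch of the alignment problem described above, assuming predictions were produced in the data frame's original row order; the data is illustrative.)

import numpy as np
import pandas as pd

df = pd.DataFrame({"group": ["b", "a", "b", "a"], "value": [10, 20, 30, 40]})
predictions = np.array([0.1, 0.2, 0.3, 0.4])  # unlabeled, aligned with df's original row order

# Rebuilding the frame via concat reorders rows by group, breaking positional alignment:
regrouped = pd.concat([grp for _, grp in df.groupby("group")])

# The .obj attribute keeps the original order, so positional assignment stays safe:
df_back = df.groupby("group").obj
df_back["prediction"] = predictions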

@jreback
Contributor

jreback commented May 29, 2022

a compelling example in code, not words

@M-Harrington

Can you answer my question by any chance? That'll make it easier for me to know what you're looking for.

@jreback
Contributor

jreback commented May 29, 2022

yes, if you have something that could be a useful API, I need a compelling example
the one above is not

@M-Harrington

# use groupby to create intermediate results (e.g. for data science)
import pandas as pd

df = pd.DataFrame({'A': ['one', 'one', 'two', 'three', 'three', 'one'], 'B': range(6)})
df = df.groupby('A')
means = df.mean()

# return to methods not defined for groupby (df.ungroup() is the proposed method)
df = df.ungroup()
print(df)
df['constant'] = 3
df.iloc[0, 2]

The point isn't that the above can't be done before the groupby; it's more a matter of workflow. Especially when interactively exploring data, an ungroup option is super useful, because new users won't yet have learned methods such as transform that are defined for grouped objects.

@M-Harrington

M-Harrington commented May 30, 2022

PS, because you are being kind of rude in how you're responding to me and the other users: more people in this thread think this would be useful than do not, so it would be great if you could explain why you think this isn't useful, beyond an appeal to tradition that there's a more "idiomatic" way of doing it.

@jreback
Contributor

jreback commented Jun 1, 2022

@M-Harrington it's amazing how these comments just hurt open source maintainers - whoa, if I actually criticized something.

that said - your example still doesn't explain how ungroup actually adds anything to syntax, clarity or understanding of the code

i was expecting a lot more from someone who teaches

@stephenjfox

stephenjfox commented Jun 21, 2022

@jreback
I may have a decent code example from something I wrote just the other day: I wanted to combine multiple DataFrameGroupBy instances, which happen to be fields of a container object of mine (called Dataset), in a sound way that wouldn't lose any information.

Here's a slimmed down version of the code:

from operator import attrgetter
from typing import List

import pandas as pd

def combine_multiple_datasets(backing_dses: List[Dataset]) -> Dataset:
    """A simple wiring together of multiple Datasets into one Dataset that is effectively the children, combined."""
    assert len(backing_dses), "Should have at least one backing dataset"
    ds_instance = Dataset.__new__(Dataset)
    
    # elided: copy fields (grouping_key, feature_columns, etc.) from children.
    
    # Combining Groupby's manually
    all_data = [
        (group_name, df)
        for groupby in map(attrgetter('grouped'), backing_dses)
        for group_name, df in groupby
    ]

    ds_instance.grouped = pd.concat([df for _, df in all_data]).groupby(ds_instance.grouping_key)
    return ds_instance

Whereas an ungroup() could facilitate the following:

def combine_multiple_datasets_PREFERRED(backing_dses: List[Dataset]) -> Dataset:
    """A simple wiring together of multiple Datasets into one Dataset that is effectively the children, combined."""
    assert len(backing_dses), "Should have at least one backing dataset"
    ds_instance = Dataset.__new__(Dataset)
    
    # elided: copy fields (grouping_key, feature_columns, etc.) from children.
    
    # Combining Groupby's with DataFrameGroupby.ungroup()
    ds_instance.grouped = pd.concat([ds.grouped.ungroup() for ds in backing_dses]).groupby(ds_instance.grouping_key)
    return ds_instance

Also, I don't have an R background. Just do OOP occasionally and want to leverage convenient lower-level abstractions in an elegant way.

@M-Harrington

jreback, nobody is forcing you to resort to ad hominem. If that's what being part of the open source community means to you, by all means, please stop. No, seriously, just don't respond to this issue or this comment; somebody else will pick it up, or not, and then whatever. When you treat the people who use your package poorly, you're not doing anyone a service, neither the package nor the people who are trying to use and learn about it.

As @stephenjfox said, we're just asking for something that "leverage[s] convenient lower-level abstractions in an elegant way". Other benefits include the chance to implement it more efficiently than allocating new memory for a data frame that already exists inside the groupby object as .obj.
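(A quick way to see the allocation difference alluded to above; the data is illustrative, and this only shows object identity, not a full benchmark.)

import pandas as pd

df = pd.DataFrame({"g": [1, 1, 2], "v": [1.0, 2.0, 3.0]})
gb = df.groupby("g")

print(gb.obj is df)                              # True: same object, nothing new allocated
print(pd.concat([grp for _, grp in gb]) is df)   # False: concat builds a brand-new frame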
