API for splitting pandas objects #4059

Closed
wesm opened this issue Jun 27, 2013 · 6 comments
Labels
Closing Candidate May be closeable, needs more eyeballs Enhancement Reshaping Concat, Merge/Join, Stack/Unstack, Explode

Comments

@wesm
Member

wesm commented Jun 27, 2013

http://stackoverflow.com/questions/17315737/split-a-large-pandas-dataframe

related #414

@jreback
Contributor

jreback commented Jun 27, 2013

related #3066

@cpcloud
Member

cpcloud commented Jun 27, 2013

groupby in the backend?

In [5]: df = DataFrame(randn(10,10))

In [6]: gb = df.groupby(lambda x: x < 5, axis=0)

In [7]: [v for _, v in gb]
Out[7]:
[       0      1      2      3      4      5      6      7      8      9
5 -0.047  0.813 -0.253 -1.443 -0.683  0.116 -0.155  0.159  0.359  0.497
6 -1.626  0.496  1.572 -1.056  0.579  0.312 -1.139  1.367 -0.158  1.679
7 -0.029  0.541  1.299  0.513 -0.562  0.489  0.408 -0.305  0.824 -0.200
8  0.318 -0.764  1.497 -1.704 -0.540  1.045  0.143 -0.457 -2.026 -0.795
9 -0.082 -1.585  0.623  0.251 -0.528 -0.270  0.874 -1.674 -0.711 -0.110,
        0      1      2      3      4      5      6      7      8      9
0 -0.736  0.413  0.837 -1.141 -0.112  1.974 -0.861 -0.795  0.487  1.169
1 -1.150  0.914 -0.847 -0.009  1.028 -1.988 -1.140 -0.515  0.080  0.094
2 -1.013  0.546 -0.603  0.874  1.123  0.950  0.710 -2.143 -1.726 -1.555
3 -0.824 -0.051 -1.438 -0.821 -0.541 -0.851 -0.135 -0.331 -1.607 -0.250
4 -1.309 -0.197 -0.042  0.909  0.695  0.364  0.364  0.860 -1.074  1.805]
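For the Stack Overflow use case (cutting a large frame into pieces), the same groupby idea can be keyed on row position instead of a label predicate. A minimal sketch, assuming fixed-size chunks are what is wanted; `split_every` is a hypothetical helper name, not a pandas API:

```python
import numpy as np
import pandas as pd


def split_every(df, n):
    """Split df into consecutive chunks of at most n rows (hypothetical helper)."""
    # Integer-divide each row position by n: rows 0..n-1 get key 0, and so on.
    keys = np.arange(len(df)) // n
    # groupby sorts the integer keys, so chunks come back in positional order.
    return [chunk for _, chunk in df.groupby(keys)]


df = pd.DataFrame(np.random.randn(10, 3))
chunks = split_every(df, 4)
print([len(c) for c in chunks])
```

Because the keys are positions rather than labels, this does not depend on the index being a clean integer range.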

@ghost ghost mentioned this issue Nov 22, 2013
@ghost

ghost commented Nov 22, 2013

In retrospect, #3066 actually points out two missing operations from the API: split_by and partition.

>>> [1 1 2 2 1 1].groupby(identity)
[(1,1,1,1) (2,2)]
>>> [1 1 2 2 1 1].partition(identity)
[(1,1) (2,2) (1,1)]
>>> [1 1 2 2 1 1].split_by(is_2)
[(1 1 2) (2) (1 1)]

partition and split_by can be thought of as the same operation with edge
exclusive/inclusive semantics respectively.

Should probably return a groupby-like object, rather than a collection of frames directly
like the SO question wanted. It's easy to recover the frames from that. A map
won't do, though, since keys may not be unique. Just the per-group operations provided by
the container class.
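To make the two proposed semantics concrete, here is a rough sketch on plain Python lists rather than pandas objects. The function names mirror the proposal but are hypothetical, and the list-of-tuples return type is only an assumption standing in for the groupby-like container:

```python
from itertools import groupby


def partition(seq, keyfunc):
    """Group consecutive items sharing a key; a key may appear more than once."""
    # itertools.groupby only merges adjacent equal keys, which is the point here.
    return [(key, tuple(run)) for key, run in groupby(seq, key=keyfunc)]


def split_by(seq, pred):
    """Close the current group right after every item for which pred is true."""
    groups, current = [], []
    for item in seq:
        current.append(item)
        if pred(item):  # inclusive: the matching item ends its group
            groups.append(tuple(current))
            current = []
    if current:  # leftover items after the last match
        groups.append(tuple(current))
    return groups


data = [1, 1, 2, 2, 1, 1]
print(partition(data, lambda x: x))
print(split_by(data, lambda x: x == 2))
```

`partition` returns `(key, group)` pairs in order instead of a dict precisely because the key `1` shows up twice.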

Update:

>>> [1 2 3 4 5].partition(3,2)
[(1,2,3) (3,4,5)]
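Reading the two arguments as window size and step (an assumption based on the output shown), the overlapping variant might look like the sketch below. `windows` is a hypothetical name, and dropping a trailing incomplete window is a design choice, not something the proposal specifies:

```python
def windows(seq, size, step):
    """Return windows of `size` items, advancing `step` items each time."""
    # Stop once a full window no longer fits; partial tails are dropped.
    last_start = len(seq) - size
    return [tuple(seq[i:i + size]) for i in range(0, last_start + 1, step)]


print(windows([1, 2, 3, 4, 5], 3, 2))
```

For a DataFrame the same loop could slice with `df.iloc[i:i + size]` instead of building tuples.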

related #5494, #936

@TomAugspurger
Contributor

Another example where y-p's split_at could be useful. In that case, something like df.split_at(pd.isnull) would partition into the contiguous groups of valid points. From there it would be .apply(lambda x: [x.head(1)['high'], x.tail(10)['low']]) or something like that.
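Since `split_at` does not exist in pandas, something close can be approximated today with a cumulative count of nulls as the group key. A minimal sketch on a Series; `split_at_nulls` is a hypothetical helper, and discarding the null rows themselves is an assumption:

```python
import numpy as np
import pandas as pd


def split_at_nulls(s):
    """Return the contiguous runs of non-null values in s (hypothetical helper)."""
    is_null = s.isna()
    # Every null bumps the counter, so valid points on either side of a null
    # end up with different run ids.
    run_id = is_null.cumsum()
    valid = ~is_null
    return [run for _, run in s[valid].groupby(run_id[valid])]


s = pd.Series([1.0, 2.0, np.nan, 3.0, np.nan, np.nan, 4.0, 5.0])
runs = split_at_nulls(s)
print([r.tolist() for r in runs])
```

Each run keeps its original index labels, so per-run results can be lined up with the source data afterwards.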

@ghost

ghost commented Jan 27, 2014

I think the groupby idiom can be usefully generalized to support different types of
partitioning/splitting semantics.

One kink is that, in general, group keys may not be distinct (result keys may look like [1 2 1]).
That's not a problem for the apply step, which iterates over all the groups anyway.
But we'll have to break away from groupby's dict mechanism in favor of
an ordered list of groups and a multiset mapping keys to positions in the group list.
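The container described here could be sketched roughly as follows. The class and method names are hypothetical and plain lists stand in for frames; this only illustrates the ordered-list-plus-index idea, not a proposed pandas implementation:

```python
from collections import defaultdict


class OrderedGroups:
    """Ordered list of (key, group) pairs plus a key -> positions index."""

    def __init__(self, pairs):
        self.groups = list(pairs)
        # A key may map to several positions because keys need not be unique.
        self.positions = defaultdict(list)
        for pos, (key, _) in enumerate(self.groups):
            self.positions[key].append(pos)

    def get(self, key):
        """All groups stored under key, in their original order."""
        return [self.groups[pos][1] for pos in self.positions.get(key, [])]

    def apply(self, func):
        """Run func over every group in order, ignoring the keys."""
        return [func(group) for _, group in self.groups]


og = OrderedGroups([(1, [1, 1]), (2, [2, 2]), (1, [1, 1, 1])])
print(dict(og.positions))
print(og.get(1))
print(og.apply(sum))
```

Lookup by key returns a list of groups rather than a single one, which is the main departure from the dict-based groupby.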

The different kinds of split/partition/group semantics possible, such as
inclusive/exclusive splitting, may require a keyfunc that consumes a pair of (or n) rows
(examples: split when delta_foo > 0.3, or split on delta_moving_average(nwin) > 0.2),
and I haven't come up with a good way to do that without getting baroque.
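For the two-row case, one possible shape is a predicate that receives the previous and current item. A small sketch on a plain list; `split_when` is a hypothetical name, and the n-row and moving-average variants are not covered:

```python
def split_when(seq, should_split):
    """Start a new group whenever should_split(previous, current) is true."""
    groups = []
    for item in seq:
        if groups and not should_split(groups[-1][-1], item):
            groups[-1].append(item)  # continue the current group
        else:
            groups.append([item])  # first item, or the predicate fired
    return groups


values = [1.0, 1.1, 1.2, 2.0, 2.1, 3.0]
# Split wherever the step from one value to the next exceeds 0.3.
print(split_when(values, lambda prev, cur: cur - prev > 0.3))
```

Here the item that triggers the split opens the new group; the inclusive alternative would close the old group after it instead.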

Allowing overlapping groups is another twist.

Should trim fluff features before attempting implementation.

@jreback jreback modified the milestones: 0.15.0, 0.14.0 Feb 18, 2014
@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015
@datapythonista datapythonista modified the milestones: Contributions Welcome, Someday Jul 8, 2018
@mroeschke mroeschke added Reshaping Concat, Merge/Join, Stack/Unstack, Explode and removed Groupby API Design labels Apr 11, 2021
@mroeschke mroeschke removed this from the Someday milestone Oct 13, 2022
@MarcoGorelli MarcoGorelli added the Closing Candidate May be closeable, needs more eyeballs label Mar 27, 2023
@MarcoGorelli
Member

Closing as there's been no activity in about a decade. If there's a need for this feature, I presume someone will comment or open a new issue (though at this point, in 2023, I doubt it would be accepted).

7 participants