ENH: stratified sampling as a DataFrame method #33777

Closed
flaboss opened this issue Apr 24, 2020 · 4 comments
Labels: Duplicate Report, Enhancement, Needs Discussion

flaboss commented Apr 24, 2020

Is your feature request related to a problem?

Often, when conducting hypothesis tests, we need to prepare the data and draw a stratified sample (i.e., a sample that mimics the population distribution across the strata defined by one or more variables). pandas has a sample method, but it does not currently take strata into account.

Describe the solution you'd like

I would like to propose a solution (in fact, I have already pulled the pandas repo and developed it).
I wrote a stratified_sample method that does exactly that: given one or more DataFrame columns to use as strata, it performs a stratified sample. It is a method of the DataFrame object, just like the existing "sample" method.

It returns a sampled DataFrame using proportionate stratification.
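To make "proportionate stratification" concrete, here is a minimal sketch built only from existing pandas calls. It is not the implementation proposed in this issue; the toy data, the column names, and the 20% sampling fraction are all assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical toy data: 60% of rows in stratum "a", 30% in "b", 10% in "c".
df = pd.DataFrame({
    "group": ["a"] * 60 + ["b"] * 30 + ["c"] * 10,
    "value": range(100),
})

frac = 0.2  # assumed overall sampling fraction, applied within every stratum

# Sample each stratum separately with the same fraction, then recombine.
parts = [
    rows.sample(frac=frac, random_state=0)
    for _, rows in df.groupby("group")
]
sample = pd.concat(parts)

# Compare the stratum shares in the full frame and in the sample.
population_share = df["group"].value_counts(normalize=True).sort_index()
sample_share = sample["group"].value_counts(normalize=True).sort_index()
print(pd.DataFrame({"population": population_share, "sample": sample_share}))
```

Because every stratum is sampled at the same fraction, the share of each stratum in the sample tracks its share in the full frame, up to rounding of the per-stratum row counts.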

API breaking implications

I think this simple method will not break the API, since it just samples a DataFrame object.

Describe alternatives you've considered

The method I developed returns a sampled DataFrame. The parameters it takes are:

  • n: sample size. If not provided, it is estimated using the adjusted Cochran sample-size formula.
  • strata: a list of the columns to be used as strata
  • random_state: seed for reproducibility
  • reset_index: True or False (whether to reset the index of the returned DataFrame)
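The parameters above could fit together roughly as follows. This is a hedged sketch, not the code from the author's branch, which is not shown in this thread. The helper name `cochran_sample_size`, its defaults (95% confidence, p = 0.5, 5% margin of error), and the reading of "adjusted" as Cochran's formula with the finite population correction are all assumptions.

```python
import math

import pandas as pd


def cochran_sample_size(population_size, z=1.96, p=0.5, margin=0.05):
    """Cochran's sample size with the finite population correction.

    Hypothetical helper; the defaults are assumptions, not taken from the issue.
    """
    n0 = (z ** 2) * p * (1 - p) / margin ** 2  # size for an infinite population
    return math.ceil(n0 / (1 + (n0 - 1) / population_size))


def stratified_sample(df, strata, n=None, random_state=None, reset_index=False):
    """Proportionate stratified sample of df (illustrative sketch only)."""
    if n is None:
        n = cochran_sample_size(len(df))
    # One common fraction for all strata keeps the sample proportionate.
    frac = min(n / len(df), 1.0)
    parts = [
        rows.sample(frac=frac, random_state=random_state)
        for _, rows in df.groupby(strata)
    ]
    out = pd.concat(parts)
    # Per-stratum rounding means len(out) can differ slightly from n.
    return out.reset_index(drop=True) if reset_index else out


# Hypothetical usage on toy data with three strata of unequal size.
df = pd.DataFrame({
    "group": ["a"] * 600 + ["b"] * 300 + ["c"] * 100,
    "value": range(1000),
})
picked = stratified_sample(df, ["group"], n=100, random_state=1, reset_index=True)
print(picked["group"].value_counts())
```

Row counts are rounded within each stratum, so the size of the result can be off from the requested n by a few rows; a production version would need to decide how to allocate that remainder.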

Additional context

https://en.wikipedia.org/wiki/Stratified_sampling

flaboss added the Enhancement and Needs Triage labels Apr 24, 2020
flaboss mentioned this issue Apr 24, 2020
Member

dsaxton commented Apr 24, 2020

While not very nice-looking, it is already possible to get this behavior using groupby:

df.groupby(some_grouper).apply(lambda g: g.sample(frac=some_frac))  # or use n if you want the same count from every group

I wonder if instead it could make sense to implement sample directly as a groupby method (in a way that would be more performant than the above)?
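For readers on a newer pandas: versions 1.1 and later expose `sample` directly on groupby objects, which is the shape of the idea raised here. A small sketch follows; the toy data and column names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data with two groups of different sizes.
df = pd.DataFrame({
    "group": ["a"] * 6 + ["b"] * 4,
    "value": range(10),
})

# Same fraction from every group: a proportionate stratified sample.
by_frac = df.groupby("group").sample(frac=0.5, random_state=0)

# Same number of rows from every group: equal counts per stratum.
by_n = df.groupby("group").sample(n=2, random_state=0)

print(by_frac["group"].value_counts())
print(by_n["group"].value_counts())
```

Using `frac` preserves each group's share of the data, while `n` draws equal-sized groups regardless of how large each group was originally.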

dsaxton added the Needs Discussion label and removed the Needs Triage label Apr 24, 2020
Author

flaboss commented Apr 24, 2020

Hi @dsaxton, it is true we can get something like that using groupby. I thought of adding a more robust method to pandas, since stratified sampling is an important part of Design of Experiments. Since pandas is a go-to Python library for data analysis, having this functionality would be great.

@mroeschke
Member

I agree with @dsaxton. Generally, I think it's more idiomatic, and less of a maintenance burden, to compose this functionality from existing methods than to add a separate API.

Contributor

jreback commented Apr 25, 2020

yeah this is essentially a duplicate of #31775

jreback added the Duplicate Report label Apr 25, 2020
jreback added this to the No action milestone Apr 25, 2020
jreback closed this as completed Apr 25, 2020