ENH: stratified sampling as a DataFrame method #33777

Closed
@flaboss

Description

Is your feature request related to a problem?

Often, when conducting hypothesis tests, we need to prepare the data and draw a stratified sample (i.e., a sample that mimics the population distribution across one or more stratification variables). Pandas has a sample method, but it does not currently take strata into account.

Describe the solution you'd like

I would like to propose a solution (in fact, I have already pulled the pandas repo and implemented it).
I wrote a stratified_sample method that does exactly that: given a list of DataFrame columns, it performs a stratified sample over them. It is a method of the DataFrame object, just like the existing "sample" method.

It returns a sampled DataFrame using proportionate stratification.
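To illustrate what proportionate stratification means here, the following is a minimal sketch, not the implementation proposed in this issue. It assumes pandas 1.1 or later, where `DataFrameGroupBy.sample` is available; the standalone function name `stratified_sample` and the toy data are hypothetical.

```python
import pandas as pd


def stratified_sample(df, strata, n, random_state=None, reset_index=False):
    """Hypothetical helper: draw roughly n rows, keeping each stratum's share."""
    # Same sampling fraction in every stratum -> proportionate allocation.
    frac = n / len(df)
    sampled = df.groupby(strata, group_keys=False).sample(
        frac=frac, random_state=random_state
    )
    return sampled.reset_index(drop=True) if reset_index else sampled


# Toy population: 80% of rows in stratum "a", 20% in stratum "b".
df = pd.DataFrame({"group": ["a"] * 80 + ["b"] * 20, "value": range(100)})
sample = stratified_sample(df, ["group"], n=10, random_state=0)
print(sample["group"].value_counts())
```

Because each stratum's size is rounded separately, the total number of rows returned can differ slightly from `n` when the strata sizes do not divide evenly.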

API breaking implications

I think that this simple method will not break the API, since it just samples a DataFrame object.

Describe alternatives you've considered

The method I developed returns a sampled DataFrame. The parameters it takes are:

  • n: sample size. If not provided, the method estimates a sample size using the adjusted Cochran sampling formula.
  • strata: a list of the columns to be used as strata
  • random_state: seed for reproducibility
  • reset_index: True or False; whether to reset the index of the returned DataFrame
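The issue does not spell out the "adjusted Cochran" formula. One common reading is Cochran's sample-size formula with a finite population correction, sketched below. The function name and the defaults (95% confidence via z = 1.96, p = 0.5, 5% margin of error) are assumptions for illustration, not values taken from the proposed method.

```python
import math


def cochran_sample_size(population_size, z=1.96, p=0.5, margin_of_error=0.05):
    """Hypothetical helper: Cochran's formula with finite population correction."""
    # Cochran's estimate for an effectively infinite population.
    n0 = (z ** 2) * p * (1 - p) / margin_of_error ** 2
    # Finite population correction shrinks n0 when the population is small.
    n = n0 / (1 + (n0 - 1) / population_size)
    return math.ceil(n)


print(cochran_sample_size(10_000))
```

With these defaults the uncorrected estimate is about 384.16, and the correction lowers the required sample size as the population gets smaller.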

Additional context

https://en.wikipedia.org/wiki/Stratified_sampling

Metadata

Assignees

No one assigned

Labels

  • Duplicate Report: Duplicate issue or pull request
  • Enhancement
  • Needs Discussion: Requires discussion from core team before further action
