Description
Is your feature request related to a problem?
When conducting hypothesis tests, we often need to prepare the data by drawing a stratified sample (i.e., a sample that mimics the population's distribution across one or more stratification variables). Pandas has a sample method, but it currently does not take strata into account.
Describe the solution you'd like
I would like to propose a solution (in fact I have already pulled the pandas repo and developed it).
I wrote a stratified_sample method that does exactly that. Given one or more DataFrame columns to use as strata, it draws a stratified sample. It is a method of the DataFrame object, just like the existing "sample" method.
It returns a sampled DataFrame using proportionate stratification.
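To make "proportionate stratification" concrete, here is a minimal sketch of the idea, not the implementation from my branch. The helper name stratified_sample_sketch and the toy column names are hypothetical, and it assumes n is given explicitly. Each stratum contributes rows in proportion to its share of the population, so because of rounding the total can differ slightly from n.

```python
import pandas as pd


def stratified_sample_sketch(df, strata, n, random_state=None, reset_index=False):
    """Hypothetical helper, not the proposed pandas method itself."""
    total = len(df)
    parts = []
    for _, group in df.groupby(strata):
        # Proportionate allocation: (stratum size / population size) * n,
        # capped at the number of rows actually available in the stratum.
        n_h = min(round(n * len(group) / total), len(group))
        parts.append(group.sample(n=n_h, random_state=random_state))
    sample = pd.concat(parts)
    return sample.reset_index(drop=True) if reset_index else sample


# Toy population: 60% "A", 30% "B", 10% "C".
population = pd.DataFrame(
    {"segment": ["A"] * 60 + ["B"] * 30 + ["C"] * 10, "value": range(100)}
)
sample = stratified_sample_sketch(population, ["segment"], n=20, random_state=0)
print(sample["segment"].value_counts().sort_index())
```

With this toy data the sample keeps the 60/30/10 split of the population, which is the behaviour the plain sample method does not guarantee.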
API breaking implications
I think this method will not break the API, since it is a new method that only samples a DataFrame object.
Describe alternatives you've considered
The method I developed returns a sampled DataFrame. The parameters it takes are:
- n: sample size. If not provided, a sample size is estimated using the adjusted Cochran sampling formula.
- strata: a list of the columns to be used as strata
- random_state: for reproducibility
- reset_index: True or False, whether to reset the index of the returned DataFrame
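For reference, the default for n could be estimated roughly as in the sketch below. This assumes that "adjusted Cochran" means Cochran's sample size formula with the finite population correction. The function name and the defaults (z = 1.96 for 95% confidence, p = 0.5, 5% margin of error) are illustrative assumptions, not necessarily what the method uses.

```python
import math


def cochran_sample_size(population_size, z=1.96, p=0.5, margin_of_error=0.05):
    """Hypothetical helper: Cochran's formula with finite population correction."""
    # Cochran's formula for an effectively infinite population.
    n0 = (z ** 2) * p * (1 - p) / margin_of_error ** 2
    # The finite population correction shrinks n0 when the population is small.
    n = n0 / (1 + (n0 - 1) / population_size)
    # Round up so the requested precision is not undershot.
    return math.ceil(n)


print(cochran_sample_size(1000))
```

The estimate depends only on the population size and the chosen confidence level, proportion and margin of error. The strata then decide how those rows are split across groups.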