ENH: stratified sampling as a DataFrame method #33777
Comments
While not very nice-looking, it is already possible to get this behavior using groupby: `df.groupby(some_grouper).apply(lambda g: g.sample(frac=some_frac))` (or use `n` instead of `frac` if you want the same counts). I wonder if instead it could make sense to implement sample directly as a groupby method (in a way that would be more performant than the above)?
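For readers following along, the groupby composition above can be sketched as follows. This is only an illustration: the column names `stratum` and `value`, the group sizes, and the helper `sample_group` are made up for the example, and `random_state` is fixed only to make it reproducible.

```python
import pandas as pd

# Hypothetical data: three strata of unequal size (600 / 300 / 100 rows).
df = pd.DataFrame({
    "stratum": ["a"] * 600 + ["b"] * 300 + ["c"] * 100,
    "value": range(1000),
})

def sample_group(g):
    # Draw 10% of the rows of one group; use n=... instead of frac=...
    # to draw the same number of rows from every group.
    return g.sample(frac=0.1, random_state=0)

# group_keys=False keeps the original row labels as the result index.
sampled = df.groupby("stratum", group_keys=False).apply(sample_group)

# Look the strata up by row label, so the check does not depend on whether
# the grouping column is passed through apply in a given pandas version.
counts = df.loc[sampled.index, "stratum"].value_counts().to_dict()
print(counts)
```

Because every group is sampled with the same fraction, the sample keeps the 6:3:1 ratio of the strata. Recent pandas versions may emit a deprecation warning about grouping columns in `apply`; the sketch does not rely on that behavior.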
Hi @dsaxton, it is true that we can get something like that using groupby. I thought of adding a more robust method to pandas because stratified sampling is an important part of Design of Experiments. Since pandas is a go-to Python library for data analysis, having this functionality built in would be great.
I agree with @dsaxton. Generally I think it's more idiomatic, and less of a maintenance burden, to compose this functionality from existing methods than to add a separate API.
Yeah, this is essentially a duplicate of #31775.
Is your feature request related to a problem?
Often, when conducting hypothesis tests, we need to prepare the data and sample it using stratified sampling (i.e., a sample that mimics the population distribution with respect to a variable, or stratum). Pandas has a sample method, but it does not currently take strata into account.
Describe the solution you'd like
I would like to propose a solution (in fact, I have already cloned the pandas repo and developed it).
I wrote a stratified_sample method that does exactly that. Given one or more DataFrame columns, it performs a stratified sample. It is a method of the DataFrame object, just like the "sample" method.
It returns a sampled DataFrame using proportionate stratification.
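To make "proportionate stratification" concrete, here is a minimal sketch in plain Python. It is not the proposed stratified_sample implementation; the function name `proportionate_sample`, its parameters, and the example data are all hypothetical. Each stratum contributes a share of the sample equal to its share of the population.

```python
import random
from collections import defaultdict

def proportionate_sample(rows, key, n, seed=None):
    """Draw about n rows so each stratum keeps its population share.

    Hypothetical helper for illustration only. Per-stratum sizes are
    rounded, so the total can differ slightly from n when the shares
    do not divide evenly.
    """
    rng = random.Random(seed)

    # Group the rows by the value of the stratification key.
    strata = defaultdict(list)
    for row in rows:
        strata[row[key]].append(row)

    total = len(rows)
    sample = []
    for members in strata.values():
        # Stratum sample size: n * (stratum size / population size).
        k = round(n * len(members) / total)
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical population with strata of 600, 300 and 100 rows.
labels = ["a"] * 600 + ["b"] * 300 + ["c"] * 100
rows = [{"stratum": s, "value": i} for i, s in enumerate(labels)]

picked = proportionate_sample(rows, "stratum", n=50, seed=0)
```

With these sizes the shares divide evenly, so the 50 sampled rows split 30 / 15 / 5 across the three strata, matching the 60% / 30% / 10% population shares.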
API breaking implications
I think that this simple method will not break the API, since it only adds a new method that samples a DataFrame object.
Describe alternatives you've considered
The method I developed returns a sampled DataFrame. The parameters it takes are:
Additional context
https://en.wikipedia.org/wiki/Stratified_sampling