-
-
Notifications
You must be signed in to change notification settings - Fork 25.9k
[MRG] Support pd.NA in StringDtype columns for SimpleImputer #21114
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the PR!
You can find the failing errors in the CI here: https://dev.azure.com/scikit-learn/scikit-learn/_build/results?buildId=32878&view=results
For the linting errors, running black .
should resolve the issue.
I left comments on some of my concerns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to add a test to check that SimpleImputer
works with pd.NA
and string extension arrays.
This PR treats pd.NA
explicitly when the dataframe is converted into an object dtype. This works well when the extension array is naturally convert into a object dtype, i.e. strings.
- Make private API: is_pd_na ==> _is_pandas_na. - Add comments about suppressing `AttributeError`. - Move the `_more_tags` function from `_BaseImputer` to its children. - Add comments about skip validation in `_check_inputs_dtype`.
- Add unit test for floating point array - Fix linter errors on "line too long" - Add notes to doc/whats_new/v1.1.rst
- Move `_more_tags` back to the base class - Revert doc/whats_new/_contributors.rst - Update the docstring of `SimpleImputer` and add another test for using `np.nan` as missing value
Thanks again for the code review, @thomasjpfan and @ogrisel . Please take another look. |
Friendly ping @thomasjpfan @ogrisel . Please let me know if you have other comments. If things look good, what is the right procedure to merge this into the main branch? |
- Add unit test for 'median' strategy on integer-type arrays - Add xfailing test for 'median' strategy on float-typed arrays - Update code style to only suppress `ImportError`, not `AttributeError`
Updated and all tests passed. Please take another look. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor comment, otherwise LGTM!
Thanks again for the review @thomasjpfan ! @ogrisel please take another look and let me know if you have other comments. |
Friendly ping on this PR @thomasjpfan @ogrisel . Could you kindly advice what's the right procedure to get this merged into the main branch? Is there any other actions needed from my side at the moment? Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the slow feedback @yxiong.
Overall this looks good but there are things to improve with the dtype handling I think. See below.
@thomasjpfan let me know if you agree.
@ogrisel Please take another look. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the new dtype checks in the tests.
- Add test case for `strategy="mean"` - Use `assert_allclose` for float arrays
Friendly ping @thomasjpfan @ogrisel . Please take another look. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thank you very much!
…learn#21114) Co-authored-by: Olivier Grisel <[email protected]>
…learn#21114) Co-authored-by: Olivier Grisel <[email protected]>
Reference Issues/PRs
Fixes #21112 .
What does this implement/fix? Explain your changes.
This is a starting point for discussing potential fixes for #21112 , containing two parts:
sklearn.utils.is_scalar_nan(x)
return true whenx
ispd.NA
. This is necessary forimputer._validate_input
to successfully validatepd.StringDtype
data withpd.NA
.pd.NA
insklearn.utils._mask._get_dense_mask
.With these changes, the code snippet in #21112 will run successfully and imputes
pd.NA
to empty strings.Any other comments?
I am new in contributing to sklearn and unfamiliar with the custom and norm (e.g. what's the proper way to import pandas). This PR is just a proof-of-concept to initiate some discussion. If the direction looks promising, I can update the code to adhere to package's convention, add documentation and unit tests, etc. Please kindly advice. Thanks!