
[MRG] Support pd.NA in StringDtype columns for SimpleImputer #21114

Merged
merged 29 commits into from
Nov 5, 2021

Conversation

Contributor

@yxiong yxiong commented Sep 23, 2021

Reference Issues/PRs

Fixes #21112 .

What does this implement/fix? Explain your changes.

This is a starting point for discussing potential fixes for #21112 , containing two parts:

  1. Make sklearn.utils.is_scalar_nan(x) return True when x is pd.NA. This is necessary for imputer._validate_input to successfully validate pd.StringDtype data containing pd.NA.
  2. Support pd.NA in sklearn.utils._mask._get_dense_mask.

With these changes, the code snippet in #21112 runs successfully and imputes pd.NA as empty strings.
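
The two parts above can be illustrated with a minimal sketch. This is not the code from this PR: `is_missing_scalar`, `dense_missing_mask`, and `PANDAS_NA` are hypothetical names, the data is a plain nested list rather than an array, and treating NaN and pd.NA as the same kind of missing marker is an assumption of the sketch.

```python
import math

try:  # pandas is optional in this sketch
    from pandas import NA as PANDAS_NA
except ImportError:
    PANDAS_NA = object()  # placeholder that never matches real data


def is_missing_scalar(x):
    """Return True for a float NaN or for pd.NA (hypothetical helper)."""
    # Identity check first: `pd.NA == x` yields pd.NA, whose truth value is
    # ambiguous, so equality cannot be used to detect it.
    if x is PANDAS_NA:
        return True
    return isinstance(x, float) and math.isnan(x)


def dense_missing_mask(rows, missing_value):
    """Return nested lists of booleans marking entries that match missing_value."""
    if is_missing_scalar(missing_value):
        # NaN-like marker: flag every NaN or pd.NA entry (assumption of this sketch).
        return [[is_missing_scalar(v) for v in row] for row in rows]
    # Ordinary marker: rule out NaN and pd.NA before comparing with ==.
    return [
        [(not is_missing_scalar(v)) and v == missing_value for v in row]
        for row in rows
    ]
```

With pandas installed, `dense_missing_mask([["a", pd.NA]], pd.NA)` would flag only the second entry. The identity check is the important design choice: comparing against pd.NA with `==` does not produce a plain boolean.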

Any other comments?

I am new to contributing to sklearn and unfamiliar with its customs and norms (e.g. the proper way to import pandas). This PR is just a proof of concept to initiate some discussion. If the direction looks promising, I can update the code to adhere to the package's conventions, add documentation and unit tests, etc. Please kindly advise. Thanks!

Member

@thomasjpfan thomasjpfan left a comment

Thank you for the PR!

You can find the failing errors in the CI here: https://dev.azure.com/scikit-learn/scikit-learn/_build/results?buildId=32878&view=results

For the linting errors, running `black .` should resolve the issue.

I left comments on some of my concerns.

Member

@thomasjpfan thomasjpfan left a comment

We need to add a test to check that SimpleImputer works with pd.NA and string extension arrays.

This PR treats pd.NA explicitly when the dataframe is converted into an object dtype. This works well when the extension array naturally converts into an object dtype, i.e. strings.

- Make private API: is_pd_na ==> _is_pandas_na.
- Add comments about suppressing `AttributeError`.
- Move the `_more_tags` function from `_BaseImputer` to its children.
- Add comments about skip validation in `_check_inputs_dtype`.
- Add unit test for floating point array
- Fix linter errors on "line too long"
- Add notes to doc/whats_new/v1.1.rst
- Move `_more_tags` back to the base class
- Revert doc/whats_new/_contributors.rst
- Update the docstring of `SimpleImputer` and add another test for using `np.nan` as missing value
Contributor Author

yxiong commented Sep 28, 2021

Thanks again for the code review, @thomasjpfan and @ogrisel . Please take another look.

Contributor Author

yxiong commented Oct 5, 2021

Friendly ping @thomasjpfan @ogrisel . Please let me know if you have other comments. If things look good, what is the right procedure to merge this into the main branch?

@yxiong yxiong changed the title Support pd.NA in StringDtype columns for SimpleImputer [MRG] Support pd.NA in StringDtype columns for SimpleImputer Oct 7, 2021
@yxiong yxiong requested review from thomasjpfan and ogrisel October 7, 2021 23:15
- Add unit test for 'median' strategy on integer-type arrays
- Add xfailing test for 'median' strategy on float-typed arrays
- Update code style to only suppress `ImportError`, not `AttributeError`
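
The last commit above narrows the suppressed exception to `ImportError`. A minimal sketch of that pattern follows. The thread names the helper `_is_pandas_na`, but the body below is an assumption and not necessarily the merged code; `is_pandas_na` is a hypothetical stand-in.

```python
from contextlib import suppress


def is_pandas_na(x):
    """Return True only when x is the pandas.NA singleton (sketch)."""
    # Only a missing pandas installation is tolerated here. Any other error,
    # such as an AttributeError caused by a typo, surfaces instead of being hidden.
    with suppress(ImportError):
        from pandas import NA

        return x is NA
    # Reached only when pandas cannot be imported.
    return False
```

Keeping the suppressed exception as narrow as possible means the helper degrades gracefully when pandas is absent without masking unrelated bugs.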
@yxiong yxiong requested a review from thomasjpfan October 8, 2021 23:17
Contributor Author

yxiong commented Oct 13, 2021

Updated and all tests passed. Please take another look.

Member

@thomasjpfan thomasjpfan left a comment

Minor comment, otherwise LGTM!

@yxiong yxiong requested a review from thomasjpfan October 20, 2021 04:51
Contributor Author

yxiong commented Oct 20, 2021

Thanks again for the review @thomasjpfan !

@ogrisel please take another look and let me know if you have other comments.

Contributor Author

yxiong commented Oct 27, 2021

Friendly ping on this PR @thomasjpfan @ogrisel . Could you kindly advise on the right procedure to get this merged into the main branch? Are there any other actions needed from my side at the moment? Thanks!

Member

@ogrisel ogrisel left a comment

Sorry for the slow feedback @yxiong.

Overall this looks good but there are things to improve with the dtype handling I think. See below.

@thomasjpfan let me know if you agree.

@yxiong yxiong requested a review from ogrisel October 28, 2021 19:59
Contributor Author

yxiong commented Oct 29, 2021

@ogrisel Please take another look.

Member

@thomasjpfan thomasjpfan left a comment

I like the new dtype checks in the tests.

- Add test case for `strategy="mean"`
- Use `assert_allclose` for float arrays
@yxiong yxiong requested a review from thomasjpfan November 4, 2021 23:49
Contributor Author

yxiong commented Nov 4, 2021

Friendly ping @thomasjpfan @ogrisel . Please take another look.

Member

@ogrisel ogrisel left a comment

LGTM, thank you very much!

@ogrisel ogrisel merged commit 5256fb3 into scikit-learn:main Nov 5, 2021
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Nov 29, 2021
samronsin pushed a commit to samronsin/scikit-learn that referenced this pull request Nov 30, 2021
Development

Successfully merging this pull request may close these issues.

SimpleImputer cannot impute pd.DataFrame of StringDtype
3 participants