[MRG] Support pd.NA in StringDtype columns for SimpleImputer #21114

yxiong · 2021-09-23T05:20:58Z

Reference Issues/PRs

What does this implement/fix? Explain your changes.

This is a starting point for discussing potential fixes for #21112, containing two parts:

- Make sklearn.utils.is_scalar_nan(x) return True when x is pd.NA. This is necessary for imputer._validate_input to successfully validate pd.StringDtype data with pd.NA.
- Support pd.NA in sklearn.utils._mask._get_dense_mask.

With these changes, the code snippet in #21112 runs successfully and imputes pd.NA as empty strings.

Any other comments?

I am new to contributing to sklearn and unfamiliar with its customs and norms (e.g. the proper way to import pandas). This PR is just a proof of concept to initiate some discussion. If the direction looks promising, I can update the code to adhere to the package's conventions, add documentation and unit tests, etc. Please kindly advise. Thanks!

thomasjpfan

Thank you for the PR!

You can find the failing errors in the CI here: https://dev.azure.com/scikit-learn/scikit-learn/_build/results?buildId=32878&view=results

For the linting errors, running `black .` should resolve the issue.

I left comments on some of my concerns.

sklearn/utils/fixes.py

thomasjpfan

We need to add a test to check that SimpleImputer works with pd.NA and string extension arrays.

This PR treats pd.NA explicitly when the dataframe is converted into an object dtype. This works well when the extension array naturally converts into an object dtype, i.e. strings.

sklearn/utils/__init__.py

sklearn/utils/_mask.py

sklearn/impute/_base.py

sklearn/impute/_iterative.py

sklearn/impute/_knn.py

sklearn/utils/__init__.py

doc/whats_new/_contributors.rst

yxiong · 2021-09-28T19:06:17Z

Thanks again for the code review, @thomasjpfan and @ogrisel . Please take another look.

yxiong · 2021-10-05T05:12:37Z

Friendly ping @thomasjpfan @ogrisel . Please let me know if you have other comments. If things look good, what is the right procedure to merge this into the main branch?

sklearn/impute/tests/test_impute.py

sklearn/utils/__init__.py

sklearn/impute/tests/test_impute.py

yxiong · 2021-10-13T02:27:51Z

Updated and all tests passed. Please take another look.

thomasjpfan

Minor comment, otherwise LGTM!

sklearn/utils/__init__.py

sklearn/impute/tests/test_impute.py

yxiong · 2021-10-20T06:24:08Z

Thanks again for the review @thomasjpfan !

@ogrisel please take another look and let me know if you have other comments.

yxiong · 2021-10-27T06:12:21Z

Friendly ping on this PR @thomasjpfan @ogrisel. Could you kindly advise on the right procedure to get this merged into the main branch? Are there any other actions needed from my side at the moment? Thanks!

doc/whats_new/v1.1.rst

sklearn/impute/tests/test_impute.py

ogrisel

Sorry for the slow feedback @yxiong.

Overall this looks good but there are things to improve with the dtype handling I think. See below.

@thomasjpfan let me know if you agree.

sklearn/impute/tests/test_impute.py

yxiong · 2021-10-29T21:45:14Z

@ogrisel Please take another look.

thomasjpfan

I like the new dtype checks in the tests.

sklearn/impute/tests/test_impute.py

yxiong · 2021-11-04T23:50:05Z

Friendly ping @thomasjpfan @ogrisel . Please take another look.

ogrisel

LGTM, thank you very much!

Support pd.NA in StringDtype columns for SimpleImputer

6617944

github-actions bot added the module:utils label Sep 23, 2021

yxiong added 2 commits September 23, 2021 09:27

Add suppress(ImportError) when importing pandas

c80f0b2

Support pd.NA in _object_dtype_isnan

1e984b1
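The thread does not show how the missing-value mask is computed once pd.NA is supported. One way to build such a mask over an object array is `pandas.isna`, sketched below under the assumption that NumPy and pandas >= 1.0 are installed. This illustrates the idea and is not the merged implementation.

```python
import numpy as np
import pandas as pd

# Object array of the kind produced when a StringDtype column is converted to NumPy.
X = np.array([["abc"], [pd.NA], ["de"]], dtype=object)

# pandas.isna works elementwise on object arrays and returns a boolean mask.
# Note that it also flags None and float NaN, not only pd.NA.
mask = pd.isna(X)

print(mask)
```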

yxiong mentioned this pull request Sep 23, 2021

SimpleImputer cannot impute pd.DataFrame of StringDtype #21112

Closed

thomasjpfan reviewed Sep 25, 2021

sklearn/utils/fixes.py (outdated, resolved)

yxiong added 5 commits September 24, 2021 21:34

Fix linter errors by running

cef59b8

Move the pd.NA check to call site and have the imputer to work for in…

237399d

…teger as well

Add documentation for is_pd_na

599258a

Suppress AttributeError to support earlier pandas versions

89044b5

Suppress AttributeError to fix another unit test

dde89b4

thomasjpfan reviewed Sep 25, 2021

sklearn/utils/__init__.py (outdated, resolved)

sklearn/utils/__init__.py (outdated, resolved)

sklearn/utils/_mask.py (resolved)

sklearn/impute/_base.py (outdated, resolved)

sklearn/impute/_base.py (resolved)

yxiong added 4 commits September 26, 2021 14:15

Add unit test for imputing pd.NA

60b7fe4

Address reviewer's comments

57accc1

- Make private API: is_pd_na ==> _is_pandas_na.
- Add comments about suppressing `AttributeError`.
- Move the `_more_tags` function from `_BaseImputer` to its children.
- Add comments about skip validation in `_check_inputs_dtype`.

Fix CI tests

a2d8a49

- Add unit test for floating point array
- Fix linter errors on "line too long"
- Add notes to doc/whats_new/v1.1.rst

Fix newly added unit test: require pandas minversion 1.0

a97b63f

ogrisel reviewed Sep 28, 2021

sklearn/impute/_iterative.py (outdated, resolved)

sklearn/impute/_knn.py (outdated, resolved)

sklearn/utils/__init__.py (resolved)

thomasjpfan reviewed Sep 28, 2021

doc/whats_new/_contributors.rst (outdated, resolved)

Address reviewer's feedback

eb034e7

- Move `_more_tags` back to the base class
- Revert doc/whats_new/_contributors.rst
- Update the docstring of `SimpleImputer` and add another test for using `np.nan` as missing value

Add comment explaining is_scalar_nan(pd.NA) == False

bc69ca8

Merge branch 'main' into impute-pd-na

c7fc913

yxiong changed the title ~~Support pd.NA in StringDtype columns for SimpleImputer~~ [MRG] Support pd.NA in StringDtype columns for SimpleImputer Oct 7, 2021

yxiong requested review from thomasjpfan and ogrisel October 7, 2021 23:15

thomasjpfan mentioned this pull request Oct 8, 2021

ENH Adds float extension array support to check_array #21278

Merged

thomasjpfan reviewed Oct 8, 2021

sklearn/impute/tests/test_impute.py (outdated, resolved)

sklearn/impute/tests/test_impute.py (outdated, resolved)

sklearn/utils/__init__.py (outdated, resolved)

sklearn/impute/tests/test_impute.py (outdated, resolved)

Address reviewer's comments:

a414148

- Add unit test for 'median' strategy on integer-type arrays
- Add xfailing test for 'median' strategy on float-typed arrays
- Update code style to only suppress `ImportError`, not `AttributeError`
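The change to suppress only `ImportError` can be sketched as follows. This illustrates the pattern named in the commit messages and is not a copy of the merged helper. The function name is hypothetical.

```python
from contextlib import suppress


def is_pandas_na_sketch(x):
    """Return True only if x is the pandas.NA singleton (illustrative helper)."""
    with suppress(ImportError):
        # `from pandas import NA` raises ImportError both when pandas is not
        # installed and when the installed pandas has no `NA` name, so there
        # is no need to suppress AttributeError as well.
        from pandas import NA

        return x is NA
    # Reached only when the import failed: nothing can be pandas.NA then.
    return False


print(is_pandas_na_sketch(None))          # None is not pandas.NA
print(is_pandas_na_sketch(float("nan")))  # a float NaN is not pandas.NA either
```

Using `from pandas import NA` rather than `import pandas; pandas.NA` is what makes a single exception type sufficient, since attribute access on the module would raise `AttributeError` instead.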

yxiong requested a review from thomasjpfan October 8, 2021 23:17

yxiong added 2 commits October 12, 2021 11:10

Merge remote-tracking branch 'origin/main' into impute-pd-na

4ccb6b2

Merge branch 'main' into impute-pd-na

25e7e38

thomasjpfan approved these changes Oct 19, 2021

sklearn/utils/__init__.py (outdated, resolved)

sklearn/impute/tests/test_impute.py (outdated, resolved)

yxiong requested a review from thomasjpfan October 20, 2021 04:51

Move comments to _is_pandas_na

ea7e4a7

Merge branch 'main' into impute-pd-na

8e6d41e

Merge branch 'main' into impute-pd-na

ea2358d

ogrisel reviewed Oct 28, 2021

doc/whats_new/v1.1.rst (resolved)

ogrisel reviewed Oct 28, 2021

sklearn/impute/tests/test_impute.py (outdated, resolved)

ogrisel reviewed Oct 28, 2021

sklearn/impute/tests/test_impute.py (outdated, resolved)

sklearn/impute/tests/test_impute.py (outdated, resolved)

sklearn/impute/tests/test_impute.py (outdated, resolved)

sklearn/impute/tests/test_impute.py (outdated, resolved)

ogrisel and others added 2 commits October 28, 2021 19:30

cosmetics

c0b9a26

Add a test case with no missing value

3ed5671

yxiong requested a review from ogrisel October 28, 2021 19:59

yxiong added 3 commits October 28, 2021 15:20

Assert array equal and same dtype

e8d071a

Merge branch 'main' into impute-pd-na

410cfdf

Remove xfail since scikit-learn#21278 is merged

3e89b93

thomasjpfan reviewed Nov 1, 2021

sklearn/impute/tests/test_impute.py (resolved)

sklearn/impute/tests/test_impute.py (resolved)

Address reviewer's feedback

afd20d6

- Add test case for `strategy="mean"`
- Use `assert_allclose` for float arrays
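The dtype-aware assertions discussed in the review can be sketched roughly as below, assuming NumPy is available. The helper names are illustrative and the test file's actual helpers are not shown in this thread.

```python
import numpy as np
from numpy.testing import assert_allclose, assert_array_equal


def assert_array_equal_and_same_dtype(actual, expected):
    # Exact comparison: suitable for object (string) and integer arrays.
    assert_array_equal(actual, expected)
    assert actual.dtype == expected.dtype


def assert_allclose_and_same_dtype(actual, expected):
    # Tolerant comparison: suitable for float arrays, where rounding
    # error makes exact equality too strict.
    assert_allclose(actual, expected)
    assert actual.dtype == expected.dtype


# Object arrays are compared exactly.
assert_array_equal_and_same_dtype(
    np.array([["abc"], ["na"]], dtype=object),
    np.array([["abc"], ["na"]], dtype=object),
)

# Float arrays are compared within a tolerance.
assert_allclose_and_same_dtype(np.array([0.1 + 0.2]), np.array([0.3]))
```

Checking the dtype separately matters because `assert_array_equal` alone accepts an integer array and a float array holding the same values.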

yxiong requested a review from thomasjpfan November 4, 2021 23:49

ogrisel approved these changes Nov 5, 2021

ogrisel merged commit 5256fb3 into scikit-learn:main Nov 5, 2021

yxiong mentioned this pull request Nov 8, 2021

[BUG] pyfunc cannot predict dataframe with None value properly mlflow/mlflow#4827

Closed

23 tasks

glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Nov 29, 2021

[MRG] Support pd.NA in StringDtype columns for SimpleImputer (scikit-…

844a207

…learn#21114) Co-authored-by: Olivier Grisel <[email protected]>

samronsin pushed a commit to samronsin/scikit-learn that referenced this pull request Nov 30, 2021

[MRG] Support pd.NA in StringDtype columns for SimpleImputer (scikit-…

972805f

…learn#21114) Co-authored-by: Olivier Grisel <[email protected]>

eddiebergman mentioned this pull request Nov 15, 2022

Update scikit learn 1.2 automl/auto-sklearn#1611

Closed

54 tasks
