
replace method doesn't work with string type Series #31644


Closed
GYHHAHA opened this issue Feb 4, 2020 · 18 comments · Fixed by #44940
Labels: good first issue; NA - MaskedArrays (Related to pd.NA and nullable extension arrays); Needs Tests (Unit test(s) needed to prevent regressions); replace (replace method); Strings (String extension data type and string data)

Comments

@GYHHAHA
Contributor

GYHHAHA commented Feb 4, 2020

Code Sample, a copy-pastable example if possible

>>> pd.Series(['A','B']).replace(r'.','C',regex=True)
0    C
1    C
dtype: object
>>> pd.Series(['A','B']).astype('string').replace(r'.','C',regex=True)
0    A
1    B
dtype: string

Problem description

It seems that replace doesn't work with a string type Series.
Why do these two snippets return different results?

@charlesdong1991
Member

charlesdong1991 commented Feb 4, 2020

looks buggy

investigation is welcome

@GYHHAHA
Contributor Author

GYHHAHA commented Feb 4, 2020

>>> pd.Series(['A','B']).astype('string').replace('.','C',regex=True)
0    A
1    B
dtype: string

Thanks for the answer!
But it still doesn't work.
I also want to ask a related question.
Since str.replace on a string Series does not accept pd.NA for the 'repl' parameter, how can I change the strings that match a certain regex to pd.NA? Thanks!
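
One possible way to do this, offered only as a sketch and not as an approach confirmed anywhere in this thread, is to build a boolean mask with str.contains and pass it to Series.mask. The sample values and the pattern r"^A" are assumptions chosen for illustration.

```python
import pandas as pd

# Illustrative data and pattern (assumptions, not taken from the thread).
s = pd.Series(["A1", "B2", "A3"], dtype="string")

# na=False makes missing entries count as "no match", so the mask has no NA.
matches = s.str.contains(r"^A", regex=True, na=False)

# Replace every matching element with pd.NA; other elements are kept.
result = s.mask(matches, pd.NA)
print(result)
```

Because the mask is computed separately, this does not depend on str.replace accepting pd.NA as 'repl'.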

@charlesdong1991
Member

charlesdong1991 commented Feb 4, 2020

Do you mean something like pd.Series(['A','B']).astype('string').replace('A', pd.NA)?
That works for me, at least on master:

>>> pd.Series(['A','B']).astype('string').replace('A', pd.NA)
0    <NA>
1       B
dtype: string

@GYHHAHA
Contributor Author

GYHHAHA commented Feb 4, 2020

Oh? I get an error.

>>> pd.Series(['A','B']).astype('string').replace('A', pd.NA)
IndexError: arrays used as indices must be of integer (or boolean) type

@charlesdong1991
Member

Are you running it on the master branch?

@GYHHAHA
Contributor Author

GYHHAHA commented Feb 4, 2020

I have already updated to the latest version 1.0.0.

@charlesdong1991
Member

Yeah, there are some fixes made after 1.0.0 that have not been released yet, so they can only be tested on master. Please let me know if you still have this issue on the pandas master branch.

@GYHHAHA
Contributor Author

GYHHAHA commented Feb 4, 2020

Oh, I haven't done that.
I will run it on the master branch and check the issue again.
Thanks!

@GYHHAHA
Contributor Author

GYHHAHA commented Feb 4, 2020

On the master branch the pd.NA issue is fixed, but the original issue still occurs. @charlesdong1991

@charlesdong1991
Member

charlesdong1991 commented Feb 4, 2020

thanks for confirming the issue on master, are you interested in investigating it? @GYHHAHA

@GYHHAHA
Contributor Author

GYHHAHA commented Feb 4, 2020

Sorry, I'm not familiar enough with the pandas source code, but I will keep a close eye on this when the next version is released.
Also, it seems that when pd.NA appears in a string type Series, the replace method raises an error.

>>> pd.Series(['A',np.nan],dtype='O').replace('A','B')
0      B
1    NaN
dtype: object
>>> pd.Series(['A',np.nan],dtype='string').replace('A','B')
AssertionError: B

The error message is not very clear.

@charlesdong1991
Member

thanks for the report, your finding is very helpful!! @GYHHAHA

i will look into it a bit

@charlesdong1991
Member

take

@GYHHAHA
Contributor Author

GYHHAHA commented Feb 4, 2020

My rough guess at the reason: unlike np.nan, pd.NA doesn't stand for a constant value, so when matching against 'A' it's not clear whether pd.NA equals 'A', and so the error is raised.
Just my guess. : )
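
The semantic difference this guess refers to can be seen directly with a small sketch. It only illustrates how the two missing-value markers behave under ==. It is not a confirmed diagnosis of the AssertionError above.

```python
import numpy as np
import pandas as pd

# np.nan is an ordinary float, so comparing it with a string is simply False.
nan_cmp = np.nan == "A"

# pd.NA propagates through comparisons: the result is pd.NA, not True or False.
na_cmp = pd.NA == "A"

print(nan_cmp)  # False
print(na_cmp)   # <NA>
```

Code that expects a plain boolean from such a comparison therefore has to handle the pd.NA case explicitly, for example with pd.isna.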

@GYHHAHA
Contributor Author

GYHHAHA commented Feb 4, 2020

Would it be possible to accept pd.NA as a valid value for the 'repl' parameter of the str.replace method in a later version? That seems more natural.

@simonjayhawkins simonjayhawkins added Bug NA - MaskedArrays Related to pd.NA and nullable extension arrays Strings String extension data type and string data labels Apr 5, 2020
@mroeschke mroeschke added the replace replace method label Apr 28, 2020
@mroeschke
Member

Looks to work on master now. Could use a test

In [20]: pd.Series(['A','B']).astype('string').replace(r'.','C',regex=True)
Out[20]:
0    C
1    C
dtype: string
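
A regression test along these lines might look like the sketch below. The function name is hypothetical and this is not the test that was eventually merged; it only pins down the output shown above, assuming a pandas version where the behaviour is fixed.

```python
import pandas as pd
import pandas.testing as tm


def test_replace_regex_string_dtype():
    # Hypothetical test name. Checks that regex replace works on the
    # "string" extension dtype the same way it does on object dtype.
    ser = pd.Series(["A", "B"], dtype="string")
    result = ser.replace(r".", "C", regex=True)
    expected = pd.Series(["C", "C"], dtype="string")
    tm.assert_series_equal(result, expected)
```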

@mroeschke mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed Bug labels Jul 27, 2021
@SidharthArya

take

@klimpt

klimpt commented Feb 25, 2022

I think there is another bug:

>>> pd.Series(["a", pd.NA, "a"]).astype("string").replace(["a"], "b", regex=True) #strange behaviour
0       b
1    <NA>
2       a
dtype: string
>>> pd.Series(["a", pd.NA, "a"]).astype("string").replace("a", "b", regex=True) #replace works fine 
0       b
1    <NA>
2       b
dtype: string

Problem description
If the Series contains pd.NA, the dtype is string, regex=True, and to_replace is a list or a dict, then replace does not work for elements after the pd.NA.
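
Where the list or dict form misbehaves in a given pandas version, one possible workaround is to apply each pattern through the .str accessor, which leaves missing values untouched. This is a sketch only: the `replacements` mapping is an illustrative assumption, and the thread does not say this is the recommended approach.

```python
import pandas as pd

s = pd.Series(["a", pd.NA, "a"], dtype="string")

# Hypothetical pattern -> replacement mapping, standing in for a list/dict
# passed as to_replace.
replacements = {"a": "b"}

result = s
for pattern, repl in replacements.items():
    # str.replace skips missing values, so elements after pd.NA are
    # still processed.
    result = result.str.replace(pattern, repl, regex=True)

print(result)
```

Note that str.replace substitutes within each string, so patterns may need anchoring (for example r"^a$") if only whole-value matches should change.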
