Skip to content

[ArrowStringArray] implement ArrowStringArray._str_contains #41025

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Apr 26, 2021

Conversation

simonjayhawkins
Copy link
Member

not yet dealt with na. no tests are failing so we need tests for this.

we can either use the fallback when na is specified or handle na in the array method, which may be more performant.

should we replicate StringArray:

>>> s = pd.Series(["Mouse", "dog", "house and parrot", "23", np.NaN], dtype="string")
>>> 
>>> s.str.contains("og", na=3, regex=False)
0    False
1     True
2    False
3    False
4     True
dtype: boolean
>>> 
>>> s.str.contains("og", na=np.nan, regex=False)
0    False
1     True
2    False
3    False
4     <NA>
dtype: boolean
>>> 

or return an object array to preserve na if not pd.NA, True or False instead of coercing to bool/null?

@simonjayhawkins simonjayhawkins added Performance Memory or execution speed performance Strings String extension data type and string data labels Apr 18, 2021
@jorisvandenbossche
Copy link
Member

should we replicate StringArray ... or return an object array to preserve na if not pd.NA, True or False instead of coercing to bool/null?

I think replicating StringArray is more useful. At least, I think the contains method should always return a boolean array. So I would either coerce any value passed to na, or raise an error if it is not pd.NA, True or False.

I think the main goal of this na keyword was to be able to specify eg False to get a fully boolean series instead of object type:

In [22]: s = pd.Series(["Mouse", "dog", "house and parrot", "23", np.NaN], dtype=object)

In [23]: s.str.contains("og", regex=False)
Out[23]: 
0    False
1     True
2    False
3    False
4      NaN
dtype: object

In [24]: s.str.contains("og", na=False, regex=False)
Out[24]: 
0    False
1     True
2    False
3    False
4    False
dtype: bool

@simonjayhawkins
Copy link
Member Author

I think the main goal of this na keyword was to be able to specify eg False to get a fully boolean series instead of object type:

It maybe that, since that would not be relevant for the nullable string arrays, we simply raise if na is passed for both StringArray and ArrowStringArray.

@jorisvandenbossche
Copy link
Member

I think I have a slight preference to just keep the na keyword working as is. It's indeed less useful now with the nullable boolean dtype (.contains(..., na=False) could be replicated with .contains(...).fillna(False)), but this keeps existing workflows working.

@simonjayhawkins
Copy link
Member Author

sure. also since the default for regex is True, should we issue a performance warning when regex is not False/not specified... although i've not yet merged #41051 into this branch to know what gains we get.


I'm looking at str.split atm and for a simple whitespace split, I'm getting much worse performance... I need a better way to convert the pyarrow list array to a numpy object array of lists.

also looking at pyarrow.compute.is_in, but that is not a str accessor method so may need to special case in the first instance and work out dispatch logic thereafter.

@jorisvandenbossche
Copy link
Member

since the default for regex is True, should we issue a performance warning when regex is not False/not specified...

There is actually also a pc.match_substring_regex in addition to match_substring, that can be used in that case (but that might only be available in the latest release)

I'm looking at str.split atm and for a simple whitespace split, I'm getting much worse performance... I need a better way to convert the pyarrow list array to a numpy object array of lists.

Yeah, since we don't have a proper list type, this might not necessarily give an advantage (although benchmarking will need to tell).
How are you currently converting the pyarrow list array? Using pyarrow_array.to_pandas() uses an array of arrays and not an array of lists, is that the reason to not use it? For the expand=True case, that might not matter?

also looking at pyarrow.compute.is_in, but that is not a str accessor method so may need to special case in the first instance and work out dispatch logic thereafter.

The ExtensionArray has an isin method that can be overridden in StringArray

@simonjayhawkins
Copy link
Member Author

There is actually also a pc.match_substring_regex in addition to match_substring, that can be used in that case (but that might only be available in the latest release)

not in pyarrow 3.0.0

@simonjayhawkins
Copy link
Member Author

[  0.00%] ·· Benchmarking existing-py_home_simon_miniconda3_envs_pandas-dev_bin_python
[ 50.00%] ··· strings.Contains.time_contains                                                                                                              ok
[ 50.00%] ··· ============== ========== ==========
              --                     regex        
              -------------- ---------------------
                  dtype         True      False   
              ============== ========== ==========
                   str        23.1±0ms   15.4±0ms 
                  string      18.6±0ms   11.5±0ms 
               arrow_string   21.4±0ms   2.42±0ms 
              ============== ========== ==========

@simonjayhawkins simonjayhawkins marked this pull request as ready for review April 23, 2021 14:05
@simonjayhawkins simonjayhawkins added this to the 1.3 milestone Apr 23, 2021
@jorisvandenbossche
Copy link
Member

I would not raise an error in general: it's a shortcut for fillna(False)

There is actually also a pc.match_substring_regex in addition to match_substring, that can be used in that case (but that might only be available in the latest release)

not in pyarrow 3.0.0

It's in master, and in the 4.0.0 which will probably be released tomorrow. BTW, if we are going to use pyarrow more extensively as we are doing for StringArray, we should probably add the nightly version to one of the CI builds (not for this PR, just thinking about it). There are nightly builds available from conda-forge (https://arrow.apache.org/docs/python/install.html#installing-nightly-packages)

@simonjayhawkins
Copy link
Member Author

I would not raise an error in general: it's a shortcut for fillna(False)

yep. have added test_contains_na_kwarg_for_nullable_string_dtype

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

@jorisvandenbossche jorisvandenbossche merged commit 0fba740 into pandas-dev:master Apr 26, 2021
@jorisvandenbossche
Copy link
Member

BTW, if we are going to use pyarrow more extensively as we are doing for StringArray, we should probably add the nightly version to one of the CI builds (not for this PR, just thinking about it). There are nightly builds available from conda-forge (https://arrow.apache.org/docs/python/install.html#installing-nightly-packages)

@simonjayhawkins ^ might be a useful follow-up

@simonjayhawkins
Copy link
Member Author

@simonjayhawkins ^ might be a useful follow-up

sure. would be better on ci than needing to regularly update a local environment.

@simonjayhawkins
Copy link
Member Author

@jorisvandenbossche can't seem to get the nightlies to install

using conda install -c arrow-nightlies pyarrow get pyarrow 3.0.0

using pip install -U --extra-index-url https://pypi.fury.io/arrow-nightlies --prefer-binary --pre pyarrow get pyarrow 4.0.0

this is using ci/deps/actions-38-numpydev.yaml as the environment (and on wsl)

yeshsurya pushed a commit to yeshsurya/pandas that referenced this pull request May 6, 2021
JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Jul 3, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Performance Memory or execution speed performance Strings String extension data type and string data
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants