[ArrowStringArray] implement ArrowStringArray._str_contains #41025

simonjayhawkins · 2021-04-18T19:41:59Z

Not yet dealt with na. No tests are failing, so we need tests for this.

We can either use the fallback when na is specified, or handle na in the array method, which may be more performant.

should we replicate StringArray:

>>> s = pd.Series(["Mouse", "dog", "house and parrot", "23", np.NaN], dtype="string")
>>> 
>>> s.str.contains("og", na=3, regex=False)
0    False
1     True
2    False
3    False
4     True
dtype: boolean
>>> 
>>> s.str.contains("og", na=np.nan, regex=False)
0    False
1     True
2    False
3    False
4     <NA>
dtype: boolean
>>>

or return an object array to preserve na when it is not pd.NA, True or False, instead of coercing it to bool/null?

jorisvandenbossche · 2021-04-20T07:28:45Z

> should we replicate StringArray ... or return an object array to preserve na if not pd.NA, True or False instead of coercing to bool/null?

I think replicating StringArray is more useful. At least, I think the contains method should always return a boolean array. So I would either coerce any value passed to na, or raise an error if it is not pd.NA, True or False.
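A minimal sketch of the two options mentioned above (coerce whatever is passed to na, or raise). The helper names `coerce_na` and `validate_na` are hypothetical and are not part of pandas; this only illustrates the behaviour under discussion, not what the PR implements.

```python
import numpy as np
import pandas as pd


def coerce_na(na):
    """Hypothetical helper: reduce `na` to True, False or pd.NA."""
    if pd.isna(na):
        # None, np.nan and pd.NA all mean "leave missing values missing".
        return pd.NA
    # Anything else is coerced to a plain bool, e.g. na=3 becomes True.
    return bool(na)


def validate_na(na):
    """Hypothetical stricter alternative: accept only None, pd.NA, True or False."""
    if na is None or na is pd.NA or isinstance(na, (bool, np.bool_)):
        return na
    raise ValueError(f"na must be pd.NA, True or False, got {na!r}")
```

With the coercing variant, `na=3` fills missing values with True, which matches the StringArray output shown earlier in the thread.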

I think the main goal of this na keyword was to be able to specify eg False to get a fully boolean series instead of object type:

In [22]: s = pd.Series(["Mouse", "dog", "house and parrot", "23", np.NaN], dtype=object)

In [23]: s.str.contains("og", regex=False)
Out[23]: 
0    False
1     True
2    False
3    False
4      NaN
dtype: object

In [24]: s.str.contains("og", na=False, regex=False)
Out[24]: 
0    False
1     True
2    False
3    False
4    False
dtype: bool

pandas/core/arrays/string_arrow.py

simonjayhawkins · 2021-04-20T13:44:43Z

> I think the main goal of this na keyword was to be able to specify eg False to get a fully boolean series instead of object type:

It may be that, since that would not be relevant for the nullable string arrays, we simply raise if na is passed for both StringArray and ArrowStringArray.

jorisvandenbossche · 2021-04-21T13:30:15Z

I think I have a slight preference to just keep the na keyword working as is. It's indeed less useful now with the nullable boolean dtype (.contains(..., na=False) could be replicated with .contains(...).fillna(False)), but this keeps existing workflows working.
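For reference, a small sketch of the equivalence mentioned above, assuming the nullable "string" dtype (so that the result is a nullable boolean Series). The variable names are illustrative only.

```python
import pandas as pd

s = pd.Series(["Mouse", "dog", "house and parrot", "23", None], dtype="string")

# Passing na=False fills the missing entry directly ...
with_kwarg = s.str.contains("og", na=False, regex=False)

# ... which should give the same values as filling afterwards.
with_fillna = s.str.contains("og", regex=False).fillna(False)
```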

simonjayhawkins · 2021-04-21T14:14:50Z

Sure. Also, since the default for regex is True, should we issue a performance warning when regex is not False or not specified? That said, I've not yet merged #41051 into this branch, so I don't know what gains we get.

I'm looking at str.split at the moment, and for a simple whitespace split I'm getting much worse performance. I need a better way to convert the pyarrow list array to a numpy object array of lists.

I'm also looking at pyarrow.compute.is_in, but that is not a str accessor method, so we may need to special-case it in the first instance and work out the dispatch logic thereafter.

jorisvandenbossche · 2021-04-21T19:39:57Z

> since the default for regex is True, should we issue a performance warning when regex is not False/not specified...

There is actually also a pc.match_substring_regex in addition to match_substring, that can be used in that case (but that might only be available in the latest release)

> I'm looking at str.split atm and for a simple whitespace split, I'm getting much worse performance... I need a better way to convert the pyarrow list array to a numpy object array of lists.

Yeah, since we don't have a proper list type, this might not necessarily give an advantage (although benchmarking will need to tell).
How are you currently converting the pyarrow list array? Using pyarrow_array.to_pandas() uses an array of arrays and not an array of lists, is that the reason to not use it? For the expand=True case, that might not matter?

> also looking at pyarrow.compute.is_in, but that is not a str accessor method so may need to special case in the first instance and work out dispatch logic thereafter.

The ExtensionArray has an isin method that can be overridden in StringArray

simonjayhawkins · 2021-04-23T13:59:41Z

> There is actually also a pc.match_substring_regex in addition to match_substring, that can be used in that case (but that might only be available in the latest release)

not in pyarrow 3.0.0

simonjayhawkins · 2021-04-23T14:05:35Z

[  0.00%] ·· Benchmarking existing-py_home_simon_miniconda3_envs_pandas-dev_bin_python
[ 50.00%] ··· strings.Contains.time_contains                                                                                                              ok
[ 50.00%] ··· ============== ========== ==========
              --                     regex        
              -------------- ---------------------
                  dtype         True      False   
              ============== ========== ==========
                   str        23.1±0ms   15.4±0ms 
                  string      18.6±0ms   11.5±0ms 
               arrow_string   21.4±0ms   2.42±0ms 
              ============== ========== ==========
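For anyone wanting to reproduce the shape of this comparison outside asv, a rough sketch using timeit. The data, size and repeat count are assumptions and do not match the actual asv benchmark; only the object and "string" dtypes are timed here, since the name of the Arrow-backed dtype depends on the pandas version.

```python
import timeit

import pandas as pd

N = 10_000  # assumption: far smaller than a real benchmark run
data = ["Mouse", "dog", "house and parrot", "23"] * (N // 4)

results = {}
for dtype in ("object", "string"):
    s = pd.Series(data, dtype=dtype)
    for regex in (True, False):
        total = timeit.timeit(lambda: s.str.contains("og", regex=regex), number=5)
        results[(dtype, regex)] = total / 5  # mean seconds per call

for (dtype, regex), seconds in results.items():
    print(f"{dtype:>8} regex={regex!s:<5} {seconds * 1e3:.2f} ms")
```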

jorisvandenbossche · 2021-04-25T13:10:22Z

I would not raise an error in general: it's a shortcut for fillna(False)

> > There is actually also a pc.match_substring_regex in addition to match_substring, that can be used in that case (but that might only be available in the latest release)
>
> not in pyarrow 3.0.0

It's in master, and in 4.0.0, which will probably be released tomorrow. BTW, if we are going to use pyarrow more extensively as we are doing for StringArray, we should probably add the nightly version to one of the CI builds (not for this PR, just thinking about it). There are nightly builds available from conda-forge (https://arrow.apache.org/docs/python/install.html#installing-nightly-packages)

simonjayhawkins · 2021-04-25T15:23:14Z

> I would not raise an error in general: it's a shortcut for fillna(False)

yep. have added test_contains_na_kwarg_for_nullable_string_dtype

jorisvandenbossche

Looks good!

jorisvandenbossche · 2021-04-26T12:17:27Z

> BTW, if we are going to use pyarrow more extensively as we are doing for StringArray, we should probably add the nightly version to one of the CI builds (not for this PR, just thinking about it). There are nightly builds available from conda-forge (https://arrow.apache.org/docs/python/install.html#installing-nightly-packages)

@simonjayhawkins ^ might be a useful follow-up

simonjayhawkins · 2021-04-26T12:19:37Z

> @simonjayhawkins ^ might be a useful follow-up

sure. would be better on ci than needing to regularly update a local environment.

simonjayhawkins · 2021-05-01T15:51:55Z

@jorisvandenbossche can't seem to get the nightlies to install

Using `conda install -c arrow-nightlies pyarrow`, I get pyarrow 3.0.0.

Using `pip install -U --extra-index-url https://pypi.fury.io/arrow-nightlies --prefer-binary --pre pyarrow`, I get pyarrow 4.0.0.

this is using ci/deps/actions-38-numpydev.yaml as the environment (and on wsl)

[ArrowStringArray] implement ArrowStringArray._str_contains

5b8aca3

simonjayhawkins added the Performance (memory or execution speed performance) and Strings (string extension data type and string data) labels Apr 18, 2021

Merge remote-tracking branch 'upstream/master' into _str_contains

6901807

jorisvandenbossche reviewed Apr 20, 2021

pandas/core/arrays/string_arrow.py

simonjayhawkins added 4 commits April 22, 2021 18:55

Merge remote-tracking branch 'upstream/master' into _str_contains

419f82e

add benchmark

26719a1

Merge remote-tracking branch 'upstream/master' into _str_contains

607c8ca

add tests

9b8c404

handle na kwarg

5f68797

simonjayhawkins marked this pull request as ready for review April 23, 2021 14:05

simonjayhawkins added this to the 1.3 milestone Apr 23, 2021

Merge remote-tracking branch 'upstream/master' into _str_contains

66251de

jorisvandenbossche approved these changes Apr 26, 2021

jorisvandenbossche merged commit 0fba740 into pandas-dev:master Apr 26, 2021

simonjayhawkins deleted the _str_contains branch April 26, 2021 12:20

simonjayhawkins mentioned this pull request Apr 29, 2021

[ArrowStringArray] use pyarrow.compute.match_substring_regex if available #41217

Merged

yeshsurya pushed a commit to yeshsurya/pandas that referenced this pull request May 6, 2021

[ArrowStringArray] implement ArrowStringArray._str_contains (pandas-d…

cd2c598

…ev#41025)

JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Jul 3, 2021

[ArrowStringArray] implement ArrowStringArray._str_contains (pandas-d…

e884668

…ev#41025)
