[ArrowStringArray] implement ArrowStringArray._str_contains #41025
I think replicating StringArray is more useful. At least, I think the I think the main goal of this
It may be that, since that would not be relevant for the nullable string arrays, we simply raise if
I think I have a slight preference to just keep the
sure. also, since the default for regex is True, should we issue a performance warning when regex is not False/not specified? although I've not yet merged #41051 into this branch to know what gains we get. I'm looking at str.split atm and for a simple whitespace split, I'm getting much worse performance... I need a better way to convert the pyarrow list array to a numpy object array of lists. also looking at pyarrow.compute.is_in, but that is not a str accessor method so may need to special case in the first instance and work out dispatch logic thereafter.
There is actually also a
Yeah, since we don't have a proper list type, this might not necessarily give an advantage (although benchmarking will need to tell).
The ExtensionArray has an
not in pyarrow 3.0.0
I would not raise an error in general: it's a shortcut for
It's in master, and in the 4.0.0 which will probably be released tomorrow. BTW, if we are going to use pyarrow more extensively as we are doing for StringArray, we should probably add the nightly version to one of the CI builds (not for this PR, just thinking about it). There are nightly builds available from conda-forge (https://arrow.apache.org/docs/python/install.html#installing-nightly-packages)
yep. have added test_contains_na_kwarg_for_nullable_string_dtype
Looks good!
@simonjayhawkins ^ might be a useful follow-up
sure. would be better on ci than needing to regularly update a local environment.
@jorisvandenbossche can't seem to get the nightlies to install using using this is using ci/deps/actions-38-numpydev.yaml as the environment (and on wsl)
not yet dealt with `na`. no tests are failing so we need tests for this.
we can either use the fallback when `na` is specified or handle `na` in the array method, which may be more performant.
should we replicate StringArray:
or return an object array to preserve `na` if not pd.NA, True or False instead of coercing to bool/null?
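As a point of comparison for the `na` question, here is a small sketch of what the existing nullable string dtype does with `str.contains`. It assumes a pandas version with the `"string"` dtype and uses made-up sample data; it only illustrates the current StringArray-style behaviour and says nothing about what this PR ends up doing.

```python
import pandas as pd

# Hypothetical sample data with one missing value.
s = pd.Series(["apple", None, "banana"], dtype="string")

# Without na, the missing value propagates as <NA> in a nullable boolean result.
result_default = s.str.contains("an")

# With na=False, the missing value is filled with False instead.
result_filled = s.str.contains("an", na=False)

print(result_default.dtype)
print(result_filled.tolist())
```

Both results come back as nullable booleans rather than an object array, which is the "coercing to bool/null" side of the trade-off.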