ASCII fast path for some String scalar functions

### Is your feature request related to a problem or challenge?

String operations on UTF-8 data are relatively expensive because UTF-8 is a variable-length encoding: each character is encoded with 1 to 4 bytes.

For example, the in-memory representation of the UTF-8 string "Hello🌏世界" is (each `x` is 1 byte):
```
[x][x][x][x][x][xxxx][xxx][xxx]
```
Some seemingly cheap operations like `substr(utf8_col, i, j)` and `character_length(utf8_col)` actually decode the whole string instead of doing an O(1) operation. If we can assume that a batch of a string column is ASCII-only, then those operations really are cheap.
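To make the cost difference concrete, here is a minimal Rust sketch using only the standard library. The function names are hypothetical and are not DataFusion's actual implementation; the substring helpers use 0-based character positions for simplicity, unlike SQL's 1-based `substr`.

```rust
/// General path: counts characters by decoding every UTF-8 sequence,
/// so the cost grows with the byte length of the string.
fn char_len_general(s: &str) -> usize {
    s.chars().count()
}

/// ASCII-only path: every character is exactly one byte, so the
/// character count is the byte length, which `str` already stores.
fn char_len_ascii(s: &str) -> usize {
    debug_assert!(s.is_ascii());
    s.len()
}

/// General path: walks the string to translate character positions
/// into byte offsets before slicing.
fn substr_general(s: &str, start: usize, count: usize) -> &str {
    let begin = s.char_indices().nth(start).map_or(s.len(), |(i, _)| i);
    let rest = &s[begin..];
    let end = rest.char_indices().nth(count).map_or(rest.len(), |(i, _)| i);
    &rest[..end]
}

/// ASCII-only path: character positions are byte offsets, so the
/// slice bounds can be computed directly without scanning.
fn substr_ascii(s: &str, start: usize, count: usize) -> &str {
    debug_assert!(s.is_ascii());
    let begin = start.min(s.len());
    let end = begin.saturating_add(count).min(s.len());
    &s[begin..end]
}
```

For "Hello🌏世界" the general path has to step over the 4-byte and 3-byte sequences one by one, while for an ASCII string such as "Hello, world" the ASCII variants only do index arithmetic.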

However:
- A lot of data is ASCII-only (the 1-byte subset of UTF-8), which covers the most common English characters, digits, etc.
- Validating whether a string array is ASCII-only is fast
    - The validation implementation is compiler/CPU friendly and can run at close to memory bandwidth
    - The check can be done in batch, once per string array

So it's possible to do this check first inside those functions. If the string array is ASCII-only, run the specialized path. The performance gain should outweigh the ASCII validation overhead in the general case.
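The validate-then-dispatch idea could look roughly like the following sketch. It assumes a plain `&[&str]` slice as a stand-in for a string array batch, and `batch_is_ascii` and `character_length_batch` are hypothetical names used only for illustration.

```rust
/// Returns true when every string in the batch is ASCII-only.
/// `str::is_ascii` is a simple byte scan that the compiler can optimize well.
fn batch_is_ascii(batch: &[&str]) -> bool {
    batch.iter().all(|s| s.is_ascii())
}

/// Computes the character length of each string, validating the batch
/// once and then choosing the path for the whole batch.
fn character_length_batch(batch: &[&str]) -> Vec<usize> {
    if batch_is_ascii(batch) {
        // Fast path: one byte per character, so byte length is the answer.
        batch.iter().map(|s| s.len()).collect()
    } else {
        // General path: decode each string to count its characters.
        batch.iter().map(|s| s.chars().count()).collect()
    }
}
```

The check is paid once per batch rather than once per operation on each value, which is why its overhead can stay small relative to the work it saves.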

This should be a common trick; it has been implemented in [Velox](https://vldb.org/pvldb/vol15/p3372-pedreira.pdf) and [Photon](https://people.eecs.berkeley.edu/~matei/papers/2022/sigmod_photon.pdf), as their papers mention.
Below are the numbers from Velox:
![image](https://github.com/user-attachments/assets/080be523-bd5e-432f-a2d4-1fdd61ef5810)

I did a quick experiment on the `character_length()` and `substr()` scalar functions and got some speedup for ASCII cases; the validation overhead is very small.
`substr()` can get another 80% faster on top of https://github.com/apache/datafusion/pull/12044 for some microbenchmarks with a string length of 128 bytes.

#### Update
This has been done for the `char_length()` string function, with some performance improvement: https://github.com/apache/datafusion/pull/12356

The remaining tasks:
- [ ] Make `is_ascii()` faster, as suggested in https://github.com/apache/datafusion/issues/12306#issuecomment-2334771997
- [ ] https://github.com/apache/datafusion/issues/12365
- [x] https://github.com/apache/datafusion/issues/12366
- [x] https://github.com/apache/datafusion/issues/12367
- [x] https://github.com/apache/datafusion/issues/12445
- [ ] Investigate whether there are other string operations that can also be optimized with an ASCII fast path

### Describe the solution you'd like

For scalar functions that can be specialized for ASCII, first validate inside the function implementation whether the string array is ASCII-only, and if so, take the fast path.
Functions that could be sped up: `character_length()`, `substr()`, `lower()`, `upper()`
(And maybe more, such as the regex functions; this needs further investigation.)
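For the case-conversion functions, a sketch of the same pattern might look like this. It uses only standard-library methods; the wrapper names `lower_fast` and `upper_fast` are hypothetical, and the check is shown per string here for brevity, whereas the proposal above is to validate once per array.

```rust
/// Lowercases a string, using byte-wise ASCII conversion when possible.
fn lower_fast(s: &str) -> String {
    if s.is_ascii() {
        // ASCII path: each byte is converted independently.
        s.to_ascii_lowercase()
    } else {
        // General path: full Unicode case mapping.
        s.to_lowercase()
    }
}

/// Uppercases a string, using byte-wise ASCII conversion when possible.
fn upper_fast(s: &str) -> String {
    if s.is_ascii() {
        s.to_ascii_uppercase()
    } else {
        s.to_uppercase()
    }
}
```

Both paths give the same result for ASCII input, so the fast path only changes how the work is done, not what is returned.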

### Describe alternatives you've considered

Add an option that lets users specify whether a column is fully ASCII.
Since the always-validate approach is easier to use and not very expensive, we can leave this for the future.

### Additional context

_No response_
