ASCII fast path for some String scalar functions #12306

Open
@2010YOUY01

Description

Is your feature request related to a problem or challenge?

String operations on UTF-8 data are relatively expensive because UTF-8 is a variable-length encoding: each character is encoded with 1 to 4 bytes.

For example, the in-memory representation of the UTF-8 string "Hello🌏世界" is (each x is 1 byte):

[x][x][x][x][x][xxxx][xxx][xxx]
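The layout above can be checked with a few lines of Rust. This is only an illustrative sketch; the helper name `utf8_byte_widths` is made up for this example and is not part of DataFusion.

```rust
/// Returns how many bytes each character of `s` occupies in UTF-8.
/// (Hypothetical helper, for illustration only.)
fn utf8_byte_widths(s: &str) -> Vec<usize> {
    s.chars().map(|c| c.len_utf8()).collect()
}
```

For "Hello🌏世界" this yields five 1-byte characters, one 4-byte character and two 3-byte characters, so the string has 8 characters but 15 bytes.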

Some seemingly cheap operations like substr(utf8_col, i, j) and character_length(utf8_col) actually have to decode the string to find character boundaries, instead of doing an O(1) operation. If we can assume that a batch of a string column is ASCII-only, those operations really are cheap.
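To make the cost difference concrete, here is a minimal sketch of both paths on a single `&str`. The function names are hypothetical, and `start` is a 0-based character index for simplicity (SQL `substr` is 1-based). The `_ascii` variants assume the caller has already verified that the input is ASCII; on non-ASCII input `substr_ascii` could split a character and panic.

```rust
/// General path: count characters by decoding the whole string, O(n).
fn char_len_general(s: &str) -> usize {
    s.chars().count()
}

/// ASCII path: one byte per character, so the byte length is the answer, O(1).
/// Precondition (not checked here): `s` is ASCII-only.
fn char_len_ascii(s: &str) -> usize {
    s.len()
}

/// General path: walk the string to translate character positions to byte offsets.
fn substr_general(s: &str, start: usize, count: usize) -> &str {
    // Byte offset of the n-th character, or the end of the string if out of range.
    let byte_offset = |n: usize| s.char_indices().nth(n).map_or(s.len(), |(i, _)| i);
    let begin = byte_offset(start);
    let end = byte_offset(start.saturating_add(count));
    &s[begin..end]
}

/// ASCII path: character positions are byte offsets, so slice directly.
/// Precondition (not checked here): `s` is ASCII-only.
fn substr_ascii(s: &str, start: usize, count: usize) -> &str {
    let begin = start.min(s.len());
    let end = start.saturating_add(count).min(s.len());
    &s[begin..end]
}
```

On ASCII input both variants return the same result; the ASCII variants just skip the scan for character boundaries.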

However:

  • Much real-world data is ASCII-only (ASCII is the 1-byte subset of UTF-8), covering the most common English characters, digits, etc.
  • Validating whether a string array is ASCII-only is fast
    • The validation loop is compiler/CPU friendly and can run at close to memory bandwidth
    • The check can be done per batch, once for each string array

So these functions can run the check first. If the string array is ASCII-only, they take the specialized path. In most cases the ASCII validation overhead should be worth paying for the performance gain.
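The validate-then-dispatch idea can be sketched as follows. `StringBatch` is a hypothetical, simplified stand-in for an Arrow string array (one contiguous value buffer plus offsets, with nulls ignored); it is not the actual Arrow or DataFusion API. The check uses the standard library's `str::is_ascii`, run once over the whole value buffer.

```rust
/// Simplified stand-in for a string array: all rows stored back to back in
/// `values`, with `offsets[i]..offsets[i + 1]` delimiting row `i`.
/// (Hypothetical type for illustration; nulls are not modeled.)
struct StringBatch {
    values: String,
    offsets: Vec<usize>,
}

impl StringBatch {
    fn from_rows(rows: &[&str]) -> Self {
        let mut values = String::new();
        let mut offsets = vec![0];
        for row in rows {
            values.push_str(row);
            offsets.push(values.len());
        }
        StringBatch { values, offsets }
    }

    fn num_rows(&self) -> usize {
        self.offsets.len() - 1
    }

    fn row(&self, i: usize) -> &str {
        &self.values[self.offsets[i]..self.offsets[i + 1]]
    }
}

/// character_length() with an ASCII fast path chosen once per batch.
fn character_length(batch: &StringBatch) -> Vec<usize> {
    if batch.values.is_ascii() {
        // Fast path: byte length equals character count, so the result
        // is just the difference between consecutive offsets.
        batch.offsets.windows(2).map(|w| w[1] - w[0]).collect()
    } else {
        // General path: decode every row to count its characters.
        (0..batch.num_rows())
            .map(|i| batch.row(i).chars().count())
            .collect()
    }
}
```

Validating the whole buffer in one pass keeps the check to a single linear scan per batch, and the per-row work in the fast path never touches the string bytes at all.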

This appears to be a common technique: it has been implemented in Velox and Photon, as their papers mention.
Below are the numbers from Velox:
(image: benchmark numbers from Velox)

I ran a quick experiment on the character_length() and substr() scalar functions and saw some speedup in the ASCII case, with very little validation overhead.
On top of #12044, substr() can get another 80% faster in some microbenchmarks with string length 128B.

Update

The fast path has been implemented for the char_length() string function, with some performance improvement: #12356

The remaining tasks:

Describe the solution you'd like

For scalar functions that can benefit from ASCII specialization, the function implementation should first validate whether the string array is ASCII-only and, if so, take the fast path.
Functions that could be sped up: character_length(), substr(), lower(), upper()
(And maybe more, such as the regex functions; this needs further investigation.)
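For lower() and upper(), a fast path could look roughly like the sketch below. The function names are hypothetical and the check is done per string here for brevity; a real implementation would more likely validate once per batch. It relies only on the standard library's `to_ascii_lowercase`/`to_ascii_uppercase`, which change bytes without any Unicode case mapping.

```rust
/// lower() with an ASCII fast path (hypothetical sketch).
fn lower(s: &str) -> String {
    if s.is_ascii() {
        // ASCII-only: simple per-byte conversion, no Unicode tables needed.
        s.to_ascii_lowercase()
    } else {
        // General path: full Unicode case mapping.
        s.to_lowercase()
    }
}

/// upper() with an ASCII fast path (hypothetical sketch).
fn upper(s: &str) -> String {
    if s.is_ascii() {
        s.to_ascii_uppercase()
    } else {
        s.to_uppercase()
    }
}
```

For ASCII input the two branches produce identical results, so the fast path changes only the cost, not the output.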

Describe alternatives you've considered

Add an option that lets users specify whether a column is fully ASCII.
Since the always-validate approach is easier to use and not very expensive, this can be left for the future.

Additional context

No response
