cmd/compile: tight code optimization opportunities #47120
Here is another, separate opportunity for GOAMD64=v3 compilation. The SHRXQ instruction takes an explicit shift-count register, has separate source and destination operands, and can read its source from memory. That allows reducing the loop to:

That change runs at 3400 MB/s (!). (The DFA tables were carefully constructed exactly to enable this implementation.)
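A rough Go rendering of the loop shape this enables (a minimal sketch; the table contents and names are illustrative, not taken from the comment above):

```go
package dfa

// dfa is a shift-based DFA transition table as described above;
// its construction is omitted in this sketch.
var dfa [256]uint64

// run executes one DFA step per input byte. Under GOAMD64=v3 the
// table load and the shift can fuse into a single SHRXQ, since SHRXQ
// takes its count in any register and can read the source operand
// directly from memory.
func run(p []byte) uint64 {
	var state uint64
	for _, b := range p {
		state = dfa[b] >> (state & 63)
	}
	return state
}
```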
@rsc sorry for hijacking, but what does … mean?
I see this hasn't had attention for a while, but this is a problem I've noticed in ppc64 code too: invariant values are not moved out of loops. I thought at one time there was work to do this, but it must have been abandoned. Here is one example:
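The original snippet isn't reproduced above, but a hypothetical Go loop with the same shape (all names illustrative) would be:

```go
package sample

var table [256]uint32 // package-level lookup table

// lookupSum: the base address of table is loop-invariant and could be
// materialized once before the loop; the reported problem is that it
// is instead recomputed inside the loop body on each iteration.
func lookupSum(p []byte) uint32 {
	var s uint32
	for _, b := range p {
		s += table[b]
	}
	return s
}
```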
Change https://go.dev/cl/385174 mentions this issue.
The SHRX/SHLX instructions can take any general register as the shift count operand, and can read the source from memory. This CL introduces some operators to combine the load and shift into one instruction.

For #47120

Change-Id: I13b48f53c7d30067a72eb2c8382242045dead36a
Reviewed-on: https://go-review.googlesource.com/c/go/+/385174
Reviewed-by: Keith Randall <[email protected]>
Trust: Cherry Mui <[email protected]>
Rewrite UTF-8 validation in shift-based DFA for 53%~133% performance increase on non-ASCII strings

Take 2 of rust-lang#107760 (cc @thomcc)

### Background

About the technique: https://gist.github.com/pervognsen/218ea17743e1442e59bb60d29b1aa725

As stated in rust-lang#107760,

> For prior art: shift-DFAs are now used for UTF-8 validation in [PostgreSQL](https://github.com/postgres/postgres/blob/aa6954104644334c53838f181053b9f7aa13f58c/src/common/wchar.c#L1753), and seems to be in progress or under consideration for use in JuliaLang/julia#47880 and perhaps golang/go#47120. Of these, PG's impl is the most similar to this one, at least at a high level[1](rust-lang#107760 (comment)).

### Rationales

1. Performance: This algorithm gives a large performance increase when validating strings with many non-ASCII codepoints, which is the normal case for almost all non-English content.
2. Generality: It does not use SIMD instructions and does not rely on the branch predictor for good performance, so it works well as a general, default, architecture-agnostic implementation. There is still a bypass for ASCII-only strings to benefit from auto-vectorization, if the target supports it.

### Implementation details

I use the ordinary UTF-8 language definition from [RFC 3629](https://datatracker.ietf.org/doc/html/rfc3629#section-4) and directly translate it into a 9-state DFA. The compressed state is 64-bit, resulting in a table of `[u64; 256]`, or 2KiB of rodata.

The main algorithm consists of the following parts:
1. Main loop: take a chunk of `MAIN_CHUNK_SIZE = 16` bytes on each iteration, execute the DFA on the chunk, and check whether the state is ERROR once per chunk.
2. ASCII bypass: in each chunk iteration, if the current state is ACCEPT, we know we are not in the middle of an encoded sequence, so we can skip a large block of trivial ASCII and stop at the first chunk containing any non-ASCII bytes. I choose `ASCII_CHUNK_SIZE = 16` to align with the current implementation: checking 16 bytes at a time for non-ASCII, to encourage LLVM to auto-vectorize it.
3. Trailing chunk and error reporting: execute the DFA step by step, stop on error as soon as possible, and calculate the error/valid location. To keep things simple, if any error is encountered in the main loop, it discards the erroneous chunk and `break`s into this path to find the precise error location. That is, the erroneous chunk, if it exists, is traversed twice, in exchange for a tighter and more efficient hot loop.

There are also some small tricks being used:
1. Since we have i686-linux in Tier 1 support, and its 64-bit shift (SHRD) is quite slow in our latency-sensitive hot loop, I arrange the state storage so that the state transition can be done with a 32-bit shift and a conditional move. It shows a 200%+ speedup compared to the 64-bit-shift version.
2. We still need to get the UTF-8 encoded length from the first byte in `utf8_char_width`. I merge the previous lookup table into the unused high bits of the DFA transition table, so we don't need two tables. This did introduce an extra 32-bit shift. I believe it's almost free, but I have not benchmarked it yet.

### Benchmarks

I made an [out-of-tree implementation repository](https://github.com/oxalica/shift-dfa-utf8) for easier testing and benching. It also tested various `MAIN_CHUNK_SIZE` (m) and `ASCII_CHUNK_SIZE` (a) configurations.

Bench data are taken from the first 4KiB (from the first paragraph, plain text not HTML, cut at a char boundary) of the Wikipedia article [William Shakespeare in en](https://en.wikipedia.org/wiki/William_Shakespeare), [es](https://es.wikipedia.org/wiki/William_Shakespeare) and [zh](https://zh.wikipedia.org/wiki/%E5%A8%81%E5%BB%89%C2%B7%E8%8E%8E%E5%A3%AB%E6%AF%94%E4%BA%9A).

In short: with m=16, a=16, shift-DFA gives -43% on en, +53% on es, +133% on zh; with m=16, a=32, it gives -9% on en, +26% on es, +33% on zh. This is expected: the larger the ASCII bypass chunk is, the better it performs on ASCII, but the worse on mixed content like es, because the taken branch keeps flipping. To me, the difference between 27GB/s and 47GB/s on en is minimal in absolute time (144.61ns - 79.86ns = 64.75ns), compared to 476.05ns - 392.44ns = 83.61ns on es. So I currently chose m=16, a=16 in the PR.

On x86_64-linux, Ryzen 7 5700G @ 3.775GHz:

| Algorithm         | Input language | Throughput / (GiB/s) |
|-------------------|----------------|----------------------|
| std               | en             | 47.768 +-0.301       |
| shift-dfa-m16-a16 | en             | 27.337 +-0.002       |
| shift-dfa-m16-a32 | en             | 43.627 +-0.006       |
| std               | es             | 6.339 +-0.010        |
| shift-dfa-m16-a16 | es             | 9.721 +-0.014        |
| shift-dfa-m16-a32 | es             | 8.013 +-0.009        |
| std               | zh             | 1.463 +-0.000        |
| shift-dfa-m16-a16 | zh             | 3.401 +-0.002        |
| shift-dfa-m16-a32 | zh             | 3.407 +-0.001        |

### Unresolved

- [ ] Benchmark on aarch64-darwin, another Tier 1 target. I don't have a machine to play with.
- [ ] Decide the chunk size parameters. I'm currently picking m=16, a=16.
- [ ] Should we also replace the implementation of [lossy conversion](https://github.com/oxalica/rust/blob/c0639b8cad126d886ddd88964f729dd33fb90e67/library/core/src/str/lossy.rs#L194) with a call to the new validation function? It has very similar code doing almost the same thing.
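A Go rendering of the chunked main loop and ASCII bypass described above (a minimal sketch, not the PR's Rust code; the state encodings, constants, and table construction are illustrative or omitted):

```go
package sketch

const (
	mainChunkSize  = 16 // MAIN_CHUNK_SIZE in the PR
	asciiChunkSize = 16 // ASCII_CHUNK_SIZE in the PR
)

// dfa is the shift-DFA transition table; construction omitted.
var dfa [256]uint64

const (
	acceptShift = 0 // shift amount encoding the ACCEPT state
	errorShift  = 6 // shift amount encoding the ERROR state (illustrative)
)

// valid reports whether p is well-formed UTF-8, given a correctly
// constructed dfa table.
func valid(p []byte) bool {
	state := uint64(acceptShift)
	for len(p) >= mainChunkSize {
		// ASCII bypass: only safe when not mid-sequence (ACCEPT).
		if state&63 == acceptShift {
			for len(p) >= asciiChunkSize && allASCII(p[:asciiChunkSize]) {
				p = p[asciiChunkSize:]
			}
			if len(p) < mainChunkSize {
				break
			}
		}
		// Execute the DFA over one chunk, checking ERROR once per chunk.
		for _, b := range p[:mainChunkSize] {
			state = dfa[b] >> (state & 63)
		}
		if state&63 == errorShift {
			// The PR re-walks the chunk byte by byte to locate the
			// error precisely; this sketch just reports invalid.
			return false
		}
		p = p[mainChunkSize:]
	}
	// Trailing bytes: step one at a time, stopping early on error.
	for _, b := range p {
		state = dfa[b] >> (state & 63)
		if state&63 == errorShift {
			return false
		}
	}
	return state&63 == acceptShift
}

// allASCII reports whether every byte has its high bit clear; the
// OR-reduction is friendly to auto-vectorization.
func allASCII(p []byte) bool {
	var or byte
	for _, b := range p {
		or |= b
	}
	return or < 0x80
}
```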
The generated x86 code can be improved in some fairly simple ways - hoisting computed constants out of loop bodies, and avoiding unnecessary register moves - that have a significant performance impact on tight loops. In the following example those improvements produce a 35% speedup.
Here is an alternate, DFA-based implementation of `utf8.Valid` that I have been playing with.

There are no big benchmarks of Valid in the package, but here are some that could be added:
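Benchmarks of roughly this shape, for instance (the corpora here are illustrative stand-ins, not the ones from the original issue):

```go
package utf8_test

import (
	"strings"
	"testing"
	"unicode/utf8"
)

// benchValid reports throughput via b.SetBytes.
func benchValid(b *testing.B, s string) {
	b.SetBytes(int64(len(s)))
	for i := 0; i < b.N; i++ {
		if !utf8.ValidString(s) {
			b.Fatal("unexpectedly invalid")
		}
	}
}

func BenchmarkValidStringASCII(b *testing.B) {
	benchValid(b, strings.Repeat("the quick brown fox jumps over the lazy dog. ", 1024))
}

func BenchmarkValidStringMultiByte(b *testing.B) {
	benchValid(b, strings.Repeat("committee meetings 会議 κοινότητα ", 1024))
}
```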
The old Valid implementation runs at around 1450 MB/s. The implementation above runs at around 1600 MB/s. Better, but not what I had hoped.
It compiles as follows:
Translating this to proper non-regabi assembly, I get:

This also runs at about 1600 MB/s.
First optimization: the `LEAQ ·dfa(SB), R8` should be hoisted out of the loop body. (I tried to do this in the Go version with `dfa := &dfa`, but it got constant-propagated away!)

That change brings it up to 1750 MB/s.
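For reference, the shape of that source-level attempt (a sketch; `dfa` stands in for the package-level table):

```go
package sketch

var dfa [256]uint64 // package-level DFA table

// run tries to hoist the table address by taking it once before the
// loop. As noted above, the compiler constant-propagates &dfa back
// into the loop body, so the LEAQ stays inside the loop.
func run(p []byte) uint64 {
	d := &dfa // intended hoist; folded away by constant propagation
	var state uint64
	for _, b := range p {
		state = d[b] >> (state & 63)
	}
	return state
}
```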
Second optimization: use DI for `i` instead of CX, to avoid the pressure on CX. This lets the `LEAQ 1(CX), DI` and the later `MOVQ DI, CX` collapse to just `LEAQ 1(DI), DI`.

That change brings it up to 1900 MB/s.
The body is now:
Third optimization: since `DX` is moving into `CX`, do that one instruction earlier, allowing the use of `SI` to be optimized into `DX` to eliminate the final `MOVQ`:

I think this ends up being just "compute the shift amount before the shifted value".

That change brings it up to 2150 MB/s.
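One way to read that rule at the source level (a sketch, interpreting the quoted ordering; a single DFA step pulled out as a function, names illustrative):

```go
// step performs one DFA transition. Computing the shift amount (from
// the old state) before loading the value to be shifted frees the
// registers in the order the loop needs them, which is what eliminates
// the final MOVQ in the assembly above.
func step(dfa *[256]uint64, state uint64, b byte) uint64 {
	shift := state & 63 // shift amount first...
	next := dfa[b]      // ...then the shifted value
	return next >> shift
}
```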
This is still a direct translation of the Go code: there are no tricks the compiler couldn't do. For this particular loop, the optimizations make the code run 35% faster.
Final assembly: