Process five symbols per stream per iteration on AArch64. #3299

DivineOb · 2022-10-21T19:56:30Z

Decode five symbols per stream per iteration in X1 Huffman decompression on AArch64, rather than the default four. The x86 assembly version already does this. The change gives a modest decompression speedup on Neoverse N1. Because Huffman decoding accounts for only a small portion of total runtime, this represents a significant speedup to those functions.

gcc: 11.2.0
clang: 14.0.6-2
Tests: silesia.tar
Platform: Neoverse N1

Decompression speedup for different compression levels.

Level | Clang | gcc
----- | ----- | -----
2     | 0.44% | 0.32%
3     | 0.80% | 0.37%
9     | 0.64% | 0.28%
10    | 0.59% | 0.38%
11    | 0.71% | 0.38%
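
For readers unfamiliar with the idea, here is a toy single-stream sketch of what "several symbols per iteration" means: a batch of table lookups run back to back with no bitstream reload in between. This is not zstd's decoder. The names (`ToyEntry`, `toy_table`, `toy_decode_one`, `toy_decode_batch`), the three-symbol code, and the MSB-first 64-bit container are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Toy table entry: the decoded symbol and the number of bits its code uses. */
typedef struct { unsigned char symbol; unsigned char nbBits; } ToyEntry;

/* Toy prefix code with a table log of 2: a = 0, b = 10, c = 11.
 * The table is indexed by the next 2 bits of the stream. */
enum { TOY_TABLE_LOG = 2 };
static const ToyEntry toy_table[1 << TOY_TABLE_LOG] = {
    { 'a', 1 }, { 'a', 1 }, { 'b', 2 }, { 'c', 2 }
};

/* Decode one symbol: look up the top TOY_TABLE_LOG bits, then shift out
 * only the bits that the matched code actually used. */
static unsigned char toy_decode_one(uint64_t *bits)
{
    ToyEntry e = toy_table[*bits >> (64 - TOY_TABLE_LOG)];
    *bits <<= e.nbBits;
    return e.symbol;
}

/* Decode n symbols with no reload in between. In the worst case each symbol
 * consumes TOY_TABLE_LOG bits, so the batch must fit in the container. */
static void toy_decode_batch(uint64_t *bits, unsigned char *out, unsigned n)
{
    unsigned i;
    assert(n * TOY_TABLE_LOG <= 64);
    for (i = 0; i < n; i++)
        out[i] = toy_decode_one(bits);
}
```

Feeding it the bit pattern `0 10 11 0 10` (that is, `(uint64_t)0x5A << 56`) with `n = 5` yields the symbols `a b c a b`. A real decoder runs one such batch per stream and reloads each bitstream between batches, so a larger batch means fewer reloads per decoded symbol.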

terrelln

Sorry for the delay!

We have to retain support for a Huffman table log of 12, which means that we can't blindly decode 5 symbols per loop: 5 * 12 = 60 bits, but we are only guaranteed to have 57 bits in our bitstream. The ASM implementation is only called when tableLog == 11, so it is allowed to make that assumption.
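
The bit budget can be checked with a few lines of C. This is only a sketch of the arithmetic, not code from zstd: the 57-bit guarantee is taken from the comment above, and the helper names `max_symbols_per_reload` and `fits_without_reload` are hypothetical.

```c
#include <assert.h>

/* Bits guaranteed to be available after a bitstream reload (figure quoted
 * in the comment above; treated here as a given). */
#define GUARANTEED_BITS 57u

/* Hypothetical helper: the most symbols that can be decoded between reloads
 * when every symbol may consume up to tableLog bits. */
static unsigned max_symbols_per_reload(unsigned tableLog)
{
    assert(tableLog > 0);
    return GUARANTEED_BITS / tableLog;
}

/* Hypothetical helper: nonzero if nbSymbols worst-case symbols fit in the
 * guaranteed bits without an intermediate reload. */
static int fits_without_reload(unsigned nbSymbols, unsigned tableLog)
{
    return nbSymbols * tableLog <= GUARANTEED_BITS;
}
```

With tableLog = 11, five symbols need at most 55 bits and fit. With tableLog = 12, five symbols could need 60 bits, so only four (48 bits) are safe.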

The Zstandard format doesn't actually allow tableLog=12, but we had a bug in our dictionary builder in an early version that could potentially emit tableLog=12. So we want to retain that support.

Rather than just adding a 5th symbol to the loops, I'd likely re-write the decoding loop to use a similar approach to the assembly, somewhat like #3155.

I am going to close this PR in favor of writing an optimized C version of the Huffman decoder in Issue #3425.

Thanks for the PR!

terrelln · 2023-01-25T22:01:03Z

This is handled in PR #3449. I'd be happy to accept any patches to the fast C decoder that improve aarch64 performance.

Process five symbols per stream per iteration on AArch64.

3c873b6

facebook-github-bot added the CLA Signed label Oct 21, 2022

terrelln self-requested a review December 8, 2022 18:31

terrelln self-assigned this Dec 8, 2022

terrelln mentioned this pull request Jan 13, 2023

Add faster Huffman decoding in generic C #3425

Closed

terrelln reviewed Jan 13, 2023

terrelln closed this Jan 13, 2023
