DivineOb

Decode five symbols per stream per iteration in X1 Huffman decompression on AArch64, rather than the default four. The x86 assembly version already does this. The change gives a modest overall decompression speedup on Neoverse N1. Because Huffman decoding accounts for only a small portion of total runtime, this represents a significant speedup to those functions.

gcc: 11.2.0
clang: 14.0.6-2
Tests: silesia.tar
Platform: Neoverse N1

Decompression speedup for different compression levels.

| Level | Clang | gcc   |
|-------|-------|-------|
| 2     | 0.44% | 0.32% |
| 3     | 0.80% | 0.37% |
| 9     | 0.64% | 0.28% |
| 10    | 0.59% | 0.38% |
| 11    | 0.71% | 0.38% |


@terrelln terrelln left a comment

Sorry for the delay!

We have to retain support for Huffman tableLog = 12, which means that we can't blindly decode 5 symbols per loop iteration (5 * 12 = 60 bits, but we are only guaranteed to have 57 bits in our bitstream). The ASM implementation is only called when tableLog == 11, so it is allowed to make that assumption.
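
To make the bit-budget arithmetic concrete, here is a minimal sketch. It is not code from zstd: `GUARANTEED_BITS`, `max_symbols_per_refill`, and `fits_without_refill` are hypothetical names used only for illustration, and the figure of 57 is taken from the comment above. It assumes the worst case, where every symbol consumes the full tableLog bits.

```c
#include <assert.h>

/* Assumption: bits guaranteed to be available after a bitstream refill,
 * using the figure quoted in this review comment. */
#define GUARANTEED_BITS 57u

/* Hypothetical helper: worst-case number of symbols that can be decoded
 * between refills, since one symbol consumes at most tableLog bits. */
static inline unsigned max_symbols_per_refill(unsigned tableLog)
{
    assert(tableLog > 0);
    return GUARANTEED_BITS / tableLog;
}

/* Hypothetical helper: returns 1 if nbSymbols worst-case symbols fit in
 * the guaranteed bit budget, 0 otherwise. */
static inline int fits_without_refill(unsigned tableLog, unsigned nbSymbols)
{
    return nbSymbols * tableLog <= GUARANTEED_BITS;
}
```

Under these assumptions, tableLog == 11 leaves room for five symbols per refill (55 bits), while tableLog == 12 only leaves room for four (48 bits), which is why a C loop that must handle both cannot unconditionally decode five.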

The Zstandard format doesn't actually allow tableLog=12, but we had a bug in our dictionary builder in an early version that could potentially emit tableLog=12. So we want to retain that support.

Rather than just adding a 5th symbol to the loops, I'd likely re-write the decoding loop to use a similar approach to the assembly, somewhat like #3155.

I am going to close this PR in favor of writing an optimized C version of the Huffman decoder in Issue #3425.

Thanks for the PR!

@terrelln terrelln closed this Jan 13, 2023
@terrelln

This is handled in PR #3449. I'd be happy to accept any patches to the fast C decoder that improve aarch64 performance.
