compress/flate: improve decompression speed #38324
Improve decompression speed, mainly through three optimizations:

1) Take advantage of the fact that we can read further ahead when we know the current block isn't the last one. The reader guarantees that it will not read beyond the end of the stream, which limits how far the decoder may read ahead; that limit is set to the size of an end-of-block marker in `f.h1.min = f.bits[endBlockMarker]`. However, each block header states whether the block is the final one in the stream, so while we are not reading the final block we can safely add the size of the smallest possible following block: one with a predefined table and nothing but a single EOB marker. Since we know the size of the block header (3 bits) and the fixed-table encoding of the EOB symbol (7 bits), this totals 10 additional bits. Adding those 10 bits reduces the number of stream reads significantly. Approximately 5% throughput increase.

2) Manually inline the `f.huffSym` call. This change by itself gives about a 13% throughput increase.

3) Generate decoders for stdlib `io.ByteReader` types. We generate decoders for the known implementations of `io.ByteReader`, namely `*bytes.Buffer`, `*bytes.Reader`, `*bufio.Reader` and `*strings.Reader`. This change by itself gives about a 20-25% throughput increase, including when a plain `io.Reader` is passed. Of these, only `*strings.Reader` is probably uncommon.

Minor changes:

* Reuse `h.chunks` and `h.links`.
* Trade some bounds checks for AND operations.
* Change chunks from uint32 to uint16.
* Avoid padding of decompressor struct members.

Per-loop allocations have been removed from the benchmarks; the 'old' numbers below already include that change.
```
name                              old time/op    new time/op    delta
Decode/Digits/Huffman/1e4-32      78.0µs ± 0%    50.5µs ± 1%    -35.26%  (p=0.008 n=5+5)
Decode/Digits/Huffman/1e5-32      779µs ± 2%     487µs ± 0%     -37.48%  (p=0.008 n=5+5)
Decode/Digits/Huffman/1e6-32      7.68ms ± 0%    4.88ms ± 1%    -36.44%  (p=0.008 n=5+5)
Decode/Digits/Speed/1e4-32        88.5µs ± 1%    59.9µs ± 1%    -32.33%  (p=0.008 n=5+5)
Decode/Digits/Speed/1e5-32        963µs ± 1%     678µs ± 1%     -29.58%  (p=0.008 n=5+5)
Decode/Digits/Speed/1e6-32        9.75ms ± 1%    6.90ms ± 0%    -29.21%  (p=0.008 n=5+5)
Decode/Digits/Default/1e4-32      91.2µs ± 1%    61.4µs ± 0%    -32.72%  (p=0.008 n=5+5)
Decode/Digits/Default/1e5-32      954µs ± 0%     675µs ± 0%     -29.25%  (p=0.008 n=5+5)
Decode/Digits/Default/1e6-32      9.67ms ± 0%    6.79ms ± 1%    -29.76%  (p=0.008 n=5+5)
Decode/Digits/Compression/1e4-32  90.7µs ± 1%    61.5µs ± 1%    -32.21%  (p=0.008 n=5+5)
Decode/Digits/Compression/1e5-32  953µs ± 1%     672µs ± 0%     -29.46%  (p=0.016 n=4+5)
Decode/Digits/Compression/1e6-32  9.76ms ± 4%    6.78ms ± 0%    -30.54%  (p=0.008 n=5+5)
Decode/Newton/Huffman/1e4-32      90.4µs ± 0%    54.7µs ± 0%    -39.52%  (p=0.008 n=5+5)
Decode/Newton/Huffman/1e5-32      885µs ± 0%     538µs ± 0%     -39.19%  (p=0.008 n=5+5)
Decode/Newton/Huffman/1e6-32      8.84ms ± 0%    5.44ms ± 0%    -38.46%  (p=0.016 n=4+5)
Decode/Newton/Speed/1e4-32        81.5µs ± 0%    55.1µs ± 1%    -32.42%  (p=0.016 n=4+5)
Decode/Newton/Speed/1e5-32        751µs ± 4%     528µs ± 0%     -29.70%  (p=0.008 n=5+5)
Decode/Newton/Speed/1e6-32        7.49ms ± 2%    5.32ms ± 0%    -28.92%  (p=0.008 n=5+5)
Decode/Newton/Default/1e4-32      73.3µs ± 1%    48.9µs ± 1%    -33.36%  (p=0.008 n=5+5)
Decode/Newton/Default/1e5-32      601µs ± 2%     418µs ± 0%     -30.40%  (p=0.008 n=5+5)
Decode/Newton/Default/1e6-32      5.92ms ± 0%    4.17ms ± 0%    -29.60%  (p=0.008 n=5+5)
Decode/Newton/Compression/1e4-32  72.7µs ± 0%    48.5µs ± 0%    -33.21%  (p=0.008 n=5+5)
Decode/Newton/Compression/1e5-32  597µs ± 0%     418µs ± 0%     -29.90%  (p=0.008 n=5+5)
Decode/Newton/Compression/1e6-32  5.90ms ± 0%    4.15ms ± 0%    -29.63%  (p=0.016 n=4+5)

name                              old speed      new speed      delta
Decode/Digits/Huffman/1e4-32      128MB/s ± 0%   198MB/s ± 1%   +54.46%  (p=0.008 n=5+5)
Decode/Digits/Huffman/1e5-32      128MB/s ± 2%   205MB/s ± 0%   +59.92%  (p=0.008 n=5+5)
Decode/Digits/Huffman/1e6-32      130MB/s ± 0%   205MB/s ± 1%   +57.33%  (p=0.008 n=5+5)
Decode/Digits/Speed/1e4-32        113MB/s ± 1%   167MB/s ± 1%   +47.79%  (p=0.008 n=5+5)
Decode/Digits/Speed/1e5-32        104MB/s ± 1%   147MB/s ± 1%   +42.01%  (p=0.008 n=5+5)
Decode/Digits/Speed/1e6-32        103MB/s ± 1%   145MB/s ± 0%   +41.26%  (p=0.008 n=5+5)
Decode/Digits/Default/1e4-32      110MB/s ± 1%   163MB/s ± 0%   +48.63%  (p=0.008 n=5+5)
Decode/Digits/Default/1e5-32      105MB/s ± 0%   148MB/s ± 0%   +41.34%  (p=0.008 n=5+5)
Decode/Digits/Default/1e6-32      103MB/s ± 0%   147MB/s ± 1%   +42.37%  (p=0.008 n=5+5)
Decode/Digits/Compression/1e4-32  110MB/s ± 1%   163MB/s ± 1%   +47.51%  (p=0.008 n=5+5)
Decode/Digits/Compression/1e5-32  105MB/s ± 1%   149MB/s ± 0%   +41.77%  (p=0.016 n=4+5)
Decode/Digits/Compression/1e6-32  102MB/s ± 4%   147MB/s ± 0%   +43.91%  (p=0.008 n=5+5)
Decode/Newton/Huffman/1e4-32      111MB/s ± 0%   183MB/s ± 0%   +65.35%  (p=0.008 n=5+5)
Decode/Newton/Huffman/1e5-32      113MB/s ± 0%   186MB/s ± 0%   +64.44%  (p=0.008 n=5+5)
Decode/Newton/Huffman/1e6-32      113MB/s ± 0%   184MB/s ± 0%   +62.50%  (p=0.016 n=4+5)
Decode/Newton/Speed/1e4-32        123MB/s ± 0%   182MB/s ± 1%   +47.98%  (p=0.016 n=4+5)
Decode/Newton/Speed/1e5-32        133MB/s ± 4%   189MB/s ± 0%   +42.20%  (p=0.008 n=5+5)
Decode/Newton/Speed/1e6-32        134MB/s ± 2%   188MB/s ± 0%   +40.67%  (p=0.008 n=5+5)
Decode/Newton/Default/1e4-32      136MB/s ± 1%   205MB/s ± 1%   +50.05%  (p=0.008 n=5+5)
Decode/Newton/Default/1e5-32      166MB/s ± 2%   239MB/s ± 0%   +43.67%  (p=0.008 n=5+5)
Decode/Newton/Default/1e6-32      169MB/s ± 0%   240MB/s ± 0%   +42.04%  (p=0.008 n=5+5)
Decode/Newton/Compression/1e4-32  138MB/s ± 0%   206MB/s ± 0%   +49.73%  (p=0.008 n=5+5)
Decode/Newton/Compression/1e5-32  168MB/s ± 0%   239MB/s ± 0%   +42.66%  (p=0.008 n=5+5)
Decode/Newton/Compression/1e6-32  170MB/s ± 0%   241MB/s ± 0%   +42.11%  (p=0.016 n=4+5)

name                              old alloc/op   new alloc/op   delta
Decode/Digits/Huffman/1e4-32      0.00B ±NaN%    16.00B ± 0%    +Inf%    (p=0.008 n=5+5)
Decode/Digits/Huffman/1e5-32      7.60B ± 8%     32.00B ± 0%    +321.05% (p=0.008 n=5+5)
Decode/Digits/Huffman/1e6-32      79.6B ± 1%     264.0B ± 0%    +231.66% (p=0.008 n=5+5)
Decode/Digits/Speed/1e4-32        80.0B ± 0%     16.0B ± 0%     -80.00%  (p=0.008 n=5+5)
Decode/Digits/Speed/1e5-32        297B ± 0%      33B ± 0%       ~        (p=0.079 n=4+5)
Decode/Digits/Speed/1e6-32        3.86kB ± 0%    0.27kB ± 0%    -92.98%  (p=0.008 n=5+5)
Decode/Digits/Default/1e4-32      48.0B ± 0%     16.0B ± 0%     -66.67%  (p=0.008 n=5+5)
Decode/Digits/Default/1e5-32      297B ± 0%      49B ± 0%       -83.50%  (p=0.008 n=5+5)
Decode/Digits/Default/1e6-32      4.28kB ± 0%    0.38kB ± 0%    ~        (p=0.079 n=4+5)
Decode/Digits/Compression/1e4-32  48.0B ± 0%     16.0B ± 0%     -66.67%  (p=0.008 n=5+5)
Decode/Digits/Compression/1e5-32  297B ± 0%      49B ± 0%       ~        (p=0.079 n=4+5)
Decode/Digits/Compression/1e6-32  4.28kB ± 0%    0.38kB ± 0%    -91.09%  (p=0.000 n=4+5)
Decode/Newton/Huffman/1e4-32      705B ± 0%      16B ± 0%       -97.73%  (p=0.008 n=5+5)
Decode/Newton/Huffman/1e5-32      4.50kB ± 0%    0.03kB ± 0%    -99.27%  (p=0.008 n=5+5)
Decode/Newton/Huffman/1e6-32      39.4kB ± 0%    0.3kB ± 0%     -99.29%  (p=0.008 n=5+5)
Decode/Newton/Speed/1e4-32        625B ± 0%      16B ± 0%       -97.44%  (p=0.008 n=5+5)
Decode/Newton/Speed/1e5-32        3.21kB ± 0%    0.03kB ± 0%    -98.97%  (p=0.008 n=5+5)
Decode/Newton/Speed/1e6-32        40.6kB ± 0%    0.3kB ± 0%     -99.25%  (p=0.008 n=5+5)
Decode/Newton/Default/1e4-32      513B ± 0%      16B ± 0%       -96.88%  (p=0.008 n=5+5)
Decode/Newton/Default/1e5-32      2.37kB ± 0%    0.03kB ± 0%    -98.61%  (p=0.008 n=5+5)
Decode/Newton/Default/1e6-32      21.2kB ± 0%    0.2kB ± 0%     -98.97%  (p=0.008 n=5+5)
Decode/Newton/Compression/1e4-32  513B ± 0%      16B ± 0%       -96.88%  (p=0.008 n=5+5)
Decode/Newton/Compression/1e5-32  2.37kB ± 0%    0.03kB ± 0%    -98.61%  (p=0.008 n=5+5)
Decode/Newton/Compression/1e6-32  23.0kB ± 0%    0.2kB ± 0%     -99.07%  (p=0.008 n=5+5)

name                              old allocs/op  new allocs/op  delta
Decode/Digits/Huffman/1e4-32      0.00 ±NaN%     1.00 ± 0%      +Inf%    (p=0.008 n=5+5)
Decode/Digits/Huffman/1e5-32      0.00 ±NaN%     2.00 ± 0%      +Inf%    (p=0.008 n=5+5)
Decode/Digits/Huffman/1e6-32      0.00 ±NaN%     16.00 ± 0%     +Inf%    (p=0.008 n=5+5)
Decode/Digits/Speed/1e4-32        3.00 ± 0%      1.00 ± 0%      -66.67%  (p=0.008 n=5+5)
Decode/Digits/Speed/1e5-32        6.00 ± 0%      2.00 ± 0%      -66.67%  (p=0.008 n=5+5)
Decode/Digits/Speed/1e6-32        68.0 ± 0%      16.0 ± 0%      -76.47%  (p=0.008 n=5+5)
Decode/Digits/Default/1e4-32      2.00 ± 0%      1.00 ± 0%      -50.00%  (p=0.008 n=5+5)
Decode/Digits/Default/1e5-32      8.00 ± 0%      3.00 ± 0%      -62.50%  (p=0.008 n=5+5)
Decode/Digits/Default/1e6-32      74.0 ± 0%      23.0 ± 0%      -68.92%  (p=0.008 n=5+5)
Decode/Digits/Compression/1e4-32  2.00 ± 0%      1.00 ± 0%      -50.00%  (p=0.008 n=5+5)
Decode/Digits/Compression/1e5-32  8.00 ± 0%      3.00 ± 0%      -62.50%  (p=0.008 n=5+5)
Decode/Digits/Compression/1e6-32  74.0 ± 0%      23.0 ± 0%      -68.92%  (p=0.008 n=5+5)
Decode/Newton/Huffman/1e4-32      9.00 ± 0%      1.00 ± 0%      -88.89%  (p=0.008 n=5+5)
Decode/Newton/Huffman/1e5-32      18.0 ± 0%      2.0 ± 0%       -88.89%  (p=0.008 n=5+5)
Decode/Newton/Huffman/1e6-32      156 ± 0%       16 ± 0%        -89.74%  (p=0.008 n=5+5)
Decode/Newton/Speed/1e4-32        13.0 ± 0%      1.0 ± 0%       -92.31%  (p=0.008 n=5+5)
Decode/Newton/Speed/1e5-32        26.0 ± 0%      2.0 ± 0%       -92.31%  (p=0.008 n=5+5)
Decode/Newton/Speed/1e6-32        223 ± 0%       16 ± 0%        -92.83%  (p=0.008 n=5+5)
Decode/Newton/Default/1e4-32      10.0 ± 0%      1.0 ± 0%       -90.00%  (p=0.008 n=5+5)
Decode/Newton/Default/1e5-32      27.0 ± 0%      2.0 ± 0%       -92.59%  (p=0.008 n=5+5)
Decode/Newton/Default/1e6-32      153 ± 0%       12 ± 0%        -92.16%  (p=0.008 n=5+5)
Decode/Newton/Compression/1e4-32  10.0 ± 0%      1.0 ± 0%       -90.00%  (p=0.008 n=5+5)
Decode/Newton/Compression/1e5-32  27.0 ± 0%      2.0 ± 0%       -92.59%  (p=0.008 n=5+5)
Decode/Newton/Compression/1e6-32  145 ± 0%       12 ± 0%        -91.72%  (p=0.008 n=5+5)
```

These changes have been included in https://github.com/klauspost/compress for a little more than a month now, which includes fuzz testing.

Change-Id: I7e346330512116baa27e448aa606a2f4e551054c
This PR (HEAD: 6180f3c) has been imported to Gerrit for code review. Please visit https://go-review.googlesource.com/c/go/+/227737 to see it. Tip: You can toggle comments from me using the
Why is this still not merged?
@volknanebo We don't use GitHub for code review. If you want to make a comment, please make it at https://golang.org/cl/227737. Thanks.
@heschi What happened here?
I closed old PRs to reduce load on the Gerrit importer (#50197), sorry for the trouble. I'll reopen the CL and PR.
# Conflicts:
#	src/compress/flate/reader_test.go
Message from Ian Lance Taylor: Patch Set 2: (1 comment) Please don’t reply on this GitHub thread. Visit golang.org/cl/227737.
Message from Klaus Post: Patch Set 2: (1 comment) Please don’t reply on this GitHub thread. Visit golang.org/cl/227737.
This PR (HEAD: c00babd) has been imported to Gerrit for code review. Please visit https://go-review.googlesource.com/c/go/+/227737 to see it.
* Inline moreBits.
* Put values on stack.
* Also generate the fallback.

Change-Id: I64d03424438ebc5dbacd4f364e3e6d3c4936a008
This PR (HEAD: ae9b62a) has been imported to Gerrit for code review. Please visit https://go-review.googlesource.com/c/go/+/227737 to see it.
Message from Klaus Post: Patch Set 5: (2 comments) Please don’t reply on this GitHub thread. Visit golang.org/cl/227737.
Change-Id: If11b81d2de23a2588f3d4c7baa088ed5d614de70
This PR (HEAD: 161f021) has been imported to Gerrit for code review. Please visit https://go-review.googlesource.com/c/go/+/227737 to see it.
Syncthing uses that. It keeps compressed web assets in strings to ensure they're in the RODATA section and can decompress them for HTTP clients without gzip support.
Ping @ianlancetaylor - if there is interest in this for 1.20 it would be good to get started on CR.
Message from Ian Lance Taylor: Patch Set 6: (1 comment) Please don’t reply on this GitHub thread. Visit golang.org/cl/227737.
Message from Ian Lance Taylor: Patch Set 6: (1 comment) Please don’t reply on this GitHub thread. Visit golang.org/cl/227737.
Message from Joseph Tsai: Patch Set 6: (1 comment) Please don’t reply on this GitHub thread. Visit golang.org/cl/227737.
Message from Joseph Tsai: Patch Set 6: (1 comment) Please don’t reply on this GitHub thread. Visit golang.org/cl/227737.
Message from Joseph Tsai: Patch Set 6: (1 comment) Please don’t reply on this GitHub thread. Visit golang.org/cl/227737.