Description
At the moment we don't compress the documentation uploaded to S3, wasting a lot of space. While money for S3 isn't an issue at the moment, avoiding compression could hurt docs.rs's sustainability in the future.
I ran some very rough benchmarks on reqwest 0.9.3, compressing each .html file separately:
Bad benchmark
| Algorithm | Size | Compression time | Decompression time | Options |
|---|---|---|---|---|
| Plaintext | 33.9 MB | - | - | - |
| Gzip | 12.0 MB | 3.2s | 3.4s | -9 (best) |
| Gzip | 12.8 MB | 2.5s | 3.4s | -1 (fast) |
| Zstd | 11.7 MB | 7.8s | 2.4s | -19 (best) |
| Zstd | 12.5 MB | 2.5s | 2.4s | -1 (fast) |
| Brotli | 11.5 MB | 5.5s | 2.3s | -9 (best) |
| Brotli | 13.0 MB | 2.3s | 2.3s | -0 (fast) |
Looking at the numbers, compressing the uploaded docs would save roughly 63% of storage space on average (with gzip, for example, the docs go from 33.9 MB down to 12.0–12.8 MB), which is great from a sustainability point of view. I think we should compress all newly uploaded docs going forward, and try to compress (part of) the initial import as well.
For the algorithm choice, I'd say we can go with gzip: there isn't much difference between the resulting sizes, and the compression time delta between gzip's fast and best modes is the smallest. We can compress the initial import with -1 to speed it up, and all the new crates with -9.
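To make this more concrete, here's a rough sketch of what the per-file compression could look like on the upload side, assuming we use the flate2 crate (this is just an illustration, not actual docs.rs code, and the function name is made up):

```rust
use std::io::Write;

use flate2::write::GzEncoder;
use flate2::Compression;

/// Gzip-compress a single rendered HTML file before uploading it to S3.
///
/// `level` would be `Compression::fast()` (-1) for the initial import and
/// `Compression::best()` (-9) for newly built crates.
fn compress_html(html: &[u8], level: Compression) -> std::io::Result<Vec<u8>> {
    let mut encoder = GzEncoder::new(Vec::new(), level);
    encoder.write_all(html)?;
    encoder.finish()
}
```

On the serving side the stored bytes could either be decompressed with flate2's GzDecoder before being returned, or served as-is with a `Content-Encoding: gzip` header.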
cc @Mark-Simulacrum @QuietMisdreavus
Benchmark method
Installed compression tools on Ubuntu 18.04 LTS:
$ sudo apt install gzip brotli zstd
Downloaded the reqwest documentation locally:
$ aws s3 cp --recursive s3://rust-docs-rs/rustdoc/reqwest/0.9.3/ .
Compressed every .html file with find:
$ time find <dir> -name "*.html" -exec <command> {} \;