New zlib decompressor may read more data than necessary #18967


Closed
ianprime0509 opened this issue Feb 17, 2024 · 12 comments
Labels
bug: Observed behavior contradicts documented or intended behavior.
regression: It worked in a previous version of Zig, but stopped working.
standard library: This issue involves writing Zig code for the standard library.
use case: Describes a real use case that is difficult or impossible, but does not propose a solution.

Comments

@ianprime0509
Contributor

Zig Version

0.12.0-dev.2790+fc7dd3e28

Steps to Reproduce and Observed Behavior

Clone https://github.com/ianprime0509/zig-zlib-issue at commit 076f340acd113b73162800b43cc0add3a0141bd0 and run zig test test-new.zig. The test fails:

Test [1/1] test.decompress... FAIL (TestUnexpectedResult)
/var/home/ian/src/zig/lib/std/testing.zig:546:14: 0x103c6cf in expect (test)
    if (!ok) return error.TestUnexpectedResult;
             ^
/var/home/ian/src/zig/lib/test-new.zig:25:9: 0x103c961 in test.decompress (test)
0 passed; 0 skipped; 1 failed.

If the expect call on line 25 is commented out, the test will fail with the error BadZlibHeader.

Expected Behavior

The test should pass. The input data to the test is the concatenation of two zlib data streams: when using std.compress.zlib.decompressor at the beginning of the concatenated stream and reading the entire decompressed content, the underlying reader should not be advanced past the end of the first zlib data stream, so that the second data stream can be decompressed starting at the end of the first one.

A successful outcome can be reproduced by checking out 7204ecc (the commit just before the new gzip changes were merged) and running zig test test-old.zig, which is identical to test-new.zig except it uses the old std.compress.zlib API.
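
To make the expectation concrete, here is a minimal sketch of the pattern the tests exercise (the fixture names are placeholders, not the actual files in the linked repository):

const std = @import("std");

test "decompress two concatenated zlib streams" {
    // Placeholder fixtures: two independent zlib data streams, concatenated.
    const first_z = @embedFile("first.zlib");
    const second_z = @embedFile("second.zlib");

    var fbs = std.io.fixedBufferStream(first_z ++ second_z);
    var out = std.ArrayList(u8).init(std.testing.allocator);
    defer out.deinit();

    // Read the first stream to its end...
    var first = std.compress.zlib.decompressor(fbs.reader());
    try first.reader().readAllArrayList(&out, std.math.maxInt(usize));

    // ...after which fbs should be positioned exactly at the start of the
    // second stream, so a fresh decompressor can pick it up from there.
    var second = std.compress.zlib.decompressor(fbs.reader());
    try second.reader().readAllArrayList(&out, std.math.maxInt(usize));
}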


For context on where this is necessary, an entry in a Git packfile consists of a header followed by zlib-compressed data, and the header only contains the uncompressed length of the data, so it is impossible to know where the second entry in a packfile begins without reading the first entry's compressed data precisely to the end and no further. Unfortunately, this means the Git package fetching support is currently broken.

I haven't delved too deeply into the new zlib code to find out how large of a change this would be, but I think this is a reasonable constraint to make on the decompress reader API, given formats such as Git packfiles which rely on knowing exactly where the decompressed data ends.

@ianprime0509 ianprime0509 added the bug label Feb 17, 2024
@jacobly0 jacobly0 added the standard library and regression labels Feb 17, 2024
@jacobly0 jacobly0 added this to the 0.13.0 milestone Feb 17, 2024
@jacobly0 jacobly0 added the use case label Feb 17, 2024
@squeek502
Collaborator

cc @ianic

@ianic
Contributor

ianic commented Feb 17, 2024

Thanks @ianprime0509 for bringing this up.

Can we solve this problem by implementing a reset method on the decompressor? reset would instruct the decompressor to reset its internal state to the initial one so that it can parse another zlib data stream, if one is available in the input reader.

Here is an example:

test "flate bug 18967" {
    const first = @embedFile("testdata/fuzz/first.input");
    const second = @embedFile("testdata/fuzz/second.input");

    var in = std.io.fixedBufferStream(first ++ second);
    var out = std.ArrayList(u8).init(testing.allocator);
    defer out.deinit();

    var dcp = decompressor(.zlib, in.reader());
    try dcp.decompress(out.writer());
    std.debug.print("{s}", .{out.items});

    try dcp.reset();
    try dcp.decompress(out.writer());
    std.debug.print("{s}", .{out.items});
}

outputs:

tree 254b029bac4da1ef2c779044adc3c0fad85f8068
parent 28ddaf81d9cd0a71b139d09b88ae5957c5a91f1d
author Ian Johnson <[email protected]> 1695580224 -0400
committer Ian Johnson <[email protected]> 1695580224 -0400

commit 20
tree 254b029bac4da1ef2c779044adc3c0fad85f8068
parent 28ddaf81d9cd0a71b139d09b88ae5957c5a91f1d
author Ian Johnson <[email protected]> 1695580224 -0400
committer Ian Johnson <[email protected]> 1695580224 -0400

commit 20
tree d71cfb798acac6507d07c311bc6dca2e1c1f3fca
parent 1da9fdcf31b0342d1792cbfd8d6a8300eb21a1dc
author Ian Johnson <[email protected]> 1695580186 -0400
committer Ian Johnson <[email protected]> 1695580186 -0400

commit 19

ianic added a commit to ianic/zig that referenced this issue Feb 17, 2024
@ianprime0509
Contributor Author

ianprime0509 commented Feb 17, 2024

Thanks @ianic for your response! That would indeed solve this particular example I created to demonstrate the issue, but it wouldn't solve all potential use-cases, including the Git packfile use-case, since there can be other data between the zlib data streams that needs to be read using the underlying reader. The structure of a packfile looks like this:

+------------------+
|  Packfile header |
+------------------+
|   Entry header   |
+------------------+
| Zlib object data |
+------------------+
|   Entry header   |
+------------------+
| Zlib object data |
+------------------+
        ...
+------------------+
|     Checksum     |
+------------------+

So if the zlib decompressor reads too much data, some or all of the next entry header, or even the next entry, may already have been consumed from the underlying reader and be available only in the decompressor's internal buffer.

If there were a way to tell how many bytes were actually used by the decompressor, and a known upper bound on the amount of lookahead that could possibly be used, then it would be possible to work around this by replacing the buffered reader currently used in the packfile reader with an implementation that is guaranteed to always retain that known amount of lookahead in its buffer. Then, after reading the zlib object data for an entry, we could see how much data was actually used and rewind the buffered reader to the end of that data.

For example, if we knew that the decompressor can only read ahead up to 128 bytes, then we could do something like

var pack_buffered_reader = std.io.rewindableBufferedReader(pack.reader(), 128); // New API to guarantee rewinding in a buffered reader
// ... loop and read entry header ...
var entry_decompress_stream = std.compress.zlib.decompressor(entry_crc32_reader.reader());
// ... read entry data ...
pack_buffered_reader.rewind(entry_decompress_stream.unusedBytes()); // New API to tell how many extra bytes were read by the decompressor for its buffer but not used

I think the new API you added in #18979 will be helpful for other use-cases, though, and it seems like a good addition to make.

Edit: actually, if I understand the code correctly, this may already be possible by reading decompressor.bits.nbits, right? I'll give that a try and see if I'm able to achieve something like what I described above with the information in that field.

Edit 2: hmm, yes, something like that does indeed seem to work: https://github.com/ianprime0509/zig-zlib-issue/blob/e843d77608810afbd0273cfb9fc933dfc509fcf6/test-new-workaround.zig I'll have to think a little more about the API of the rewindable buffered reader, though; this one was just thrown together for this experiment using std.io.BufferedReader as a base. It might be helpful to have the unused_bytes calculation available as a method on the decompressor to avoid reaching into its internals to obtain it.
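
The core of that workaround is roughly the following (a sketch that reaches into the decompressor's internals; pack_buffered_reader.rewind is the hypothetical API from my previous comment, and the byte-alignment assumption is mine):

// After fully decompressing one entry's zlib data, the decompressor's bit
// reader should be byte-aligned, so its unconsumed bits correspond to whole
// bytes that were read from the packfile but never used.
const unused_bytes = entry_decompress_stream.bits.nbits / 8;
// Rewind the (hypothetical) rewindable buffered reader so the next entry
// header starts at the correct offset.
pack_buffered_reader.rewind(unused_bytes);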

ianprime0509 added a commit to ianprime0509/zig that referenced this issue Feb 18, 2024
Works around ziglang#18967. Currently non-functional for reasons which will be
explained in the comments of that issue.
@ianprime0509
Contributor Author

I tried proceeding with a workaround to fix the Git package fetching logic based on my comment above. However, I ran into another issue which I'm not sure how to solve with the "rewinding buffer" approach: the buffered reader is wrapped in two other readers, a CountingReader (so that we know the current byte position in the packfile) and a HashedReader (to confirm the integrity of the packfile using the checksum at the end of the file). If I rewind the buffered reader, these wrapper readers will end up with incorrect results.
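
For reference, the composition looks roughly like this (a sketch based on the description above, not the exact git.zig source; hashedReader stands in for git.zig's own hashing wrapper):

var pack_buffered_reader = std.io.bufferedReader(pack.reader());
var pack_counting_reader = std.io.countingReader(pack_buffered_reader.reader());
var pack_hashed_reader = hashedReader(pack_counting_reader.reader(), std.crypto.hash.Sha1.init(.{}));
const pack_reader = pack_hashed_reader.reader();

// If pack_buffered_reader is rewound after decompressing an entry, the
// counting and hashing wrappers have already seen the rewound bytes, so
// their byte count and checksum become wrong.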

I will need to think about this more and see if there is any other way this can be fixed aside from imposing the constraint described in this issue (that the decompressor should never read any more bytes than necessary from the underlying reader).

@andrewrk andrewrk modified the milestones: 0.13.0, 0.12.0 Feb 18, 2024
@daurnimator
Contributor

I'll have to think a little more about the API of the rewindable buffered reader,

I hope you'd be able to use std.LinearFifo and only call discard when the data is consumed?

@ianic
Contributor

ianic commented Feb 18, 2024

Edit: actually, if I understand the code correctly, this may already be possible by reading decompressor.bits.nbits, right?

Right.

Sorry for breaking this.

The bit reader in the decompressor internally has an 8-byte buffer. The maximum overshoot we can get when decompressing a zlib stream is 4 bytes: zlib has a 4-byte checksum at the end, and if we refill with 8 bytes just before reading the checksum, that leaves 4 bytes in the bit reader.

One obvious solution is to use a 4-byte bit buffer. That would ensure that nothing is left in the bit buffer after reading the checksum. But it hurts performance by about 20% according to my initial measurements, so I'm uneasy about implementing that for all use cases.

I think I now pretty much understand the code in indexPackFirstPass. I see entry_crc32_reader as the biggest problem and don't have a solution for it right now.

We could do some kind of two-pass parsing: in the first pass, decompress and get the compressed and decompressed sizes and the decompressed sha1; then use the compressed size in the second pass to navigate over the data and calculate the checksum along the way. I understand that is a lot of work on your side and will probably also hurt performance.

Or we could use some kind of seekable stream to seekBy a few bytes back. But that again is a complete redesign.

Sorry, I don't have a solution right now. But if we don't find something workable I will provide a precise (in the number of bytes read) decompressor, somehow. Don't worry about that, although I will need a few days because I can't be at a computer full time for the next 2-3 days.

@ianprime0509
Contributor Author

I hope you'd be able to use std.LinearFifo and only call discard when the data is consumed?

Unfortunately, there are two reasons why LinearFifo doesn't completely help solve this, although conceptually it is very close:

  1. Its read function calls discard internally, so if the decompressor reads from it, there is no way to recover that read data.
  2. It does not store the underlying reader internally, so data must be written to it explicitly before the decompressor reads from it, and if the decompressor needs more data than what we wrote, it will incorrectly interpret this as EOF. (This could be avoided if we had a guaranteed upper bound on the amount of data read by the decompressor in a single step.)

Sorry for breaking this.

No worries at all, I understand the performance reasons for wanting to buffer more data, and what I'm doing with it is probably a niche use-case (I suspect the vast majority of users will only have a single data stream to decompress).

I have found a way to rework indexPackFirstPass to continue working correctly in a single pass over the data, although it does require a little more allocation and copying. I hope to make a PR with the workaround soon, once I finish testing it and making sure I didn't overlook any obvious simplifications. If you're able to find a good way to support a precise reader at some point (no rush), then the workaround could be avoided/reverted.

ianprime0509 added a commit to ianprime0509/zig that referenced this issue Feb 18, 2024
This commit works around ziglang#18967 by adding an `AccumulatingReader`, which
accumulates data read from the underlying packfile, and by keeping track
of the position in the packfile and hash/checksum information separately
rather than using reader composition. That is, the packfile position and
hashes/checksums are updated with the accumulated read history data only
after we can determine what data has actually been used by the
decompressor rather than merely being buffered.

The only addition to the standard library APIs to support this change is
the `unreadBytes` function in `std.compress.flate.Inflate`, which allows
the user to determine how many bytes have been read only for buffering
and not used as part of compressed data.

These changes can be reverted if ziglang#18967 is resolved with a decompressor
that reads precisely only the number of bytes needed for decompression.
ianprime0509 added a commit to ianprime0509/zig that referenced this issue Feb 18, 2024
ianprime0509 added a commit to ianprime0509/zig that referenced this issue Feb 19, 2024
@ianprime0509
Contributor Author

I've opened #18992 with a workaround for this in the Git package fetching code.

ibokuri added a commit to getty-zig/json that referenced this issue Feb 19, 2024
The zlib decompressor is broken in Zig, which causes git URLs to not work.
The issue is tracked here: ziglang/zig#18967.
andrewrk pushed a commit that referenced this issue Feb 19, 2024
@andrewrk andrewrk modified the milestones: 0.12.0, 0.13.0 Feb 19, 2024
leecannon added a commit to CascadeOS/CascadeOS that referenced this issue Feb 19, 2024
This reverts commit 747176c.

Due to an upstream zig issue: ziglang/zig#18967
@ianic
Contributor

ianic commented Feb 20, 2024

In the std lib we have a peek reader, which allows us to return some bytes to be read again. That, and this issue, made me think about what kind of reader would fit this case.
I came up with the idea of a reader which keeps the other chained readers a lookahead's worth of bytes behind the consumer, while letting the consumer use it like any other reader. If we hold the others behind, we can put the lookahead bytes back to be read again. We can't really hold other readers behind, since they just pass data through from the initial stream; but we can turn those pass-through readers into writers, and then write to those writers only the data that is at least the lookahead distance behind the consumer.
I implemented such a reader and used it in indexPackFirstPass. The test passes.
While holding the writers behind, we also need at some point to flush everything that has been consumed to them. That is the flush function.

What do you think about such an approach?
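
For anyone following along, here is a minimal sketch of that idea (my own illustration based on the description above, not the implementation linked from this thread): a reader that hands data to the consumer as usual but forwards it to an output writer only once it is more than lookahead bytes behind the consumer, plus a flush to release the rest.

const std = @import("std");

fn LookaheadTee(comptime lookahead: usize, comptime ReaderType: type, comptime WriterType: type) type {
    return struct {
        inner: ReaderType,
        out: WriterType,
        // Bytes already handed to the consumer but not yet forwarded to `out`.
        held: [lookahead]u8 = undefined,
        held_len: usize = 0,

        const Self = @This();
        pub const Error = ReaderType.Error || WriterType.Error;
        pub const Reader = std.io.Reader(*Self, Error, read);

        pub fn read(self: *Self, dest: []u8) Error!usize {
            const n = try self.inner.read(dest);
            if (n == 0) return 0;
            if (self.held_len + n > lookahead) {
                // Forward the oldest bytes so that exactly `lookahead` bytes
                // stay held back behind the consumer.
                const overflow = self.held_len + n - lookahead;
                const from_held = @min(overflow, self.held_len);
                try self.out.writeAll(self.held[0..from_held]);
                std.mem.copyForwards(u8, &self.held, self.held[from_held..self.held_len]);
                self.held_len -= from_held;
                const from_dest = overflow - from_held;
                try self.out.writeAll(dest[0..from_dest]);
                @memcpy(self.held[self.held_len..][0 .. n - from_dest], dest[from_dest..n]);
                self.held_len += n - from_dest;
            } else {
                @memcpy(self.held[self.held_len..][0..n], dest[0..n]);
                self.held_len += n;
            }
            return n;
        }

        // Forward everything still held back, once the consumer is known to
        // have genuinely used it (e.g. after an entry is fully decompressed).
        pub fn flush(self: *Self) Error!void {
            try self.out.writeAll(self.held[0..self.held_len]);
            self.held_len = 0;
        }

        pub fn reader(self: *Self) Reader {
            return .{ .context = self };
        }
    };
}

In indexPackFirstPass the out writer would feed the position/hash bookkeeping, and flush would run after each entry's zlib data has been fully decompressed.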

@ianprime0509
Contributor Author

@ianic what you came up with is a really nice approach, thanks for taking the time to implement it! I much prefer it to the approach I came up with, since it avoids the need for any additional allocations while still having an API and usage that make sense. I also built it and tested it out on my other projects as an additional check (all of them are working fine).

Personally, I think the following would be a fine solution to this issue, but I'd be interested in what you and others think as well:

  1. Add LookaheadBufferedReader to the standard library
  2. Add the constant value 8 somewhere in std.compress documented as the maximum number of additional bytes the decompressor may read past the end of its input (just to make it clear what it represents, and as a central reference point for others to use)
  3. With those changes available, the git.zig implementation could be updated to use them just as you did already (roughly as sketched below)
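
Concretely, the usage in git.zig might look something along these lines (all of the names here are assumptions about the not-yet-finalized API, purely to illustrate the shape):

// Hypothetical shapes only; the real API is being designed in #19032.
var pack_tee = bufferedTee(pack.reader(), pack_hasher.writer(), .{
    .lookahead = std.compress.flate.max_lookahead, // the proposed constant 8
});
var entry_decompressor = std.compress.zlib.decompressor(pack_tee.reader());
// ... read and index the entry's decompressed data ...
try pack_tee.flush(); // forward exactly the bytes the decompressor actually consumed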

I also have a couple questions/suggestions about the high-level design of LookaheadBufferedReader:

  1. Given the behavior of the reader is not just to read, but also to pump the read data to a writer, how about the name BufferedTee (by analogy with the tee command)? Or perhaps something else that indicates that its purpose also includes writing.
  2. What is the reason for writer being nullable? Is it to allow this implementation to potentially subsume the functionality of BufferedReader by using a null writer? If so, maybe it would be best to make WriterType itself nullable so that the existence of a writer is known at compile time, and the compiler can eliminate all the writer/flush logic for that case.

@ianic
Contributor

ianic commented Feb 21, 2024

Great that you like it, and thank you for the suggestions.

I like the 'tee' name very much and have renamed it to BufferedTee. There are two characteristics of this tee: one is that it is buffered, and the other is that it holds its output a few lookahead bytes behind the consumer; the BufferedTee name expresses only the first. Even so, I like the name a lot.
The writer was nullable just to make it easy to run the existing BufferedReader tests against it; it was only a draft implementation detail. Removed.

Why did I first start solving this problem outside of flate?
In practice, all inflate implementations use some kind of lookahead. It is possible to search for a symbol in the Huffman tree bit by bit, but that is pretty inefficient, so all implementations use some kind of lookup table. The maximum deflate code length is 15 bits, so the algorithm reads 15 bits and then uses the lookup table to find the symbol. Only after the symbol is found does it know how many of those 15 bits were actually used. That kind of deflate implementation can overshoot by 2 bytes at the end of the deflate stream.
For example, if the reader has 6 unused bits, then to look up with 15 bits it needs to read 2 more bytes, giving 22 unused bits. If the last code plus the byte padding fits within the original 6 bits, it is left with 2 overshoot bytes.
Zlib has a 4-byte checksum in its footer which compensates for this 2-byte overshoot, but using raw deflate leaves us with a possible overshoot.

I also think there is a place for something like this in the standard library. I'll prepare a PR. The current implementation is here; all comments are highly welcome.

@ianprime0509
Contributor Author

Thanks for the detailed explanation. That makes sense; given your explanation of the efficiency improvements gained by buffering some extra input in the decompressor, and the existence of a better tool to handle such needs in #19032, I'm going to close this issue as "not planned" because the behavior it asks for would have a negative effect overall.

As mentioned in my comments on #19032, I think your git.zig changes to use BufferedTee would be an improvement over my earlier workaround, so I hope you'll consider adding them to #19032 or making a separate PR for them later. Thanks again for all your help with this issue!

@ianprime0509 ianprime0509 closed this as not planned Feb 22, 2024
@andrewrk andrewrk modified the milestones: 0.13.0, 0.12.0 Feb 27, 2024
ianic added a commit to ianic/zig that referenced this issue Mar 11, 2024
My first zlib implementation broke git fetch because it introduced
[lookahead](ziglang#18967). That resulted
in workarounds [1](ziglang@80f3ef6)
[2](ziglang@d00faa2).

After [fixing](ziglang#19163) the lookahead in the
zlib decompressor, these fixes are no longer necessary.
andrewrk pushed a commit that referenced this issue Mar 11, 2024
RossComputerGuy pushed a commit to ExpidusOS-archive/zig that referenced this issue Mar 20, 2024
ianic added a commit to ianic/zig that referenced this issue Mar 20, 2024
Introduced in ziglang#19032 as a fix for ziglang#18967.
Not needed any more after ziglang#19253.
andrewrk pushed a commit that referenced this issue Mar 21, 2024