New zlib decompressor may read more data than necessary #18967
cc @ianic
Thanks @ianprime0509 for bringing this up. Can we solve the problem by implementing a reset method on the decompressor? Reset would instruct the decompressor to reset its internal state to the initial state, so that it could parse another zlib data stream if one is available in the input reader. Here is an example of the idea:
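A minimal sketch, assuming a `reset()` method that clears decoder state while continuing from the same underlying reader (the method is the proposal here, not an existing std API; the input is built in place as two concatenated zlib streams):

```zig
const std = @import("std");

test "reset and decompress the next concatenated stream" {
    const allocator = std.testing.allocator;

    // Build two zlib streams back to back in one buffer.
    var buf = std.ArrayList(u8).init(allocator);
    defer buf.deinit();
    inline for (.{ "stream one", "stream two" }) |msg| {
        var comp = try std.compress.zlib.compressor(buf.writer(), .{});
        try comp.writer().writeAll(msg);
        try comp.finish();
    }

    var fbs = std.io.fixedBufferStream(buf.items);
    var decomp = std.compress.zlib.decompressor(fbs.reader());

    const first = try decomp.reader().readAllAlloc(allocator, 1 << 16);
    defer allocator.free(first);
    try std.testing.expectEqualStrings("stream one", first);

    decomp.reset(); // proposed API: clear state, keep the same underlying reader

    const second = try decomp.reader().readAllAlloc(allocator, 1 << 16);
    defer allocator.free(second);
    try std.testing.expectEqualStrings("stream two", second);
}
```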
Hopefully fixes: ziglang#18967
Thanks @ianic for your response! That would indeed solve this particular example I created to demonstrate the issue, but it wouldn't solve all potential use-cases, including the Git packfile use-case, since there can be other data between the zlib data streams that needs to be read using the underlying reader. The structure of a packfile looks like this:
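```
PACK header ("PACK", 4-byte version, 4-byte object count)
entry 1: header (object type + uncompressed size, variable-length)
         zlib-compressed object data
entry 2: header
         zlib-compressed object data
...
trailer: SHA-1 checksum of all of the above
```

(Delta entries also carry a base-object reference between the header and the compressed data, omitted here.)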
So if the zlib decompressor reads too much data, some or all of the entry header or the next entry might be consumed from the underlying reader and be available only in the decompressor's internal buffer. If there were a way to tell how many bytes were actually used by the decompressor, and a known upper bound on the amount of lookahead that could possibly be used, then it would be possible to work around this by replacing the buffered reader currently used in the packfile reader with an implementation that is guaranteed to always retain that known amount of lookahead in the buffer. Then, after reading the zlib object data for an entry, we could see how much data was actually used and rewind the buffered reader to the end of that data. For example, if we knew that the decompressor can only read ahead up to 128 bytes, then we could do something like:

```zig
var pack_buffered_reader = std.io.rewindableBufferedReader(pack.reader(), 128); // New API to guarantee rewinding in a buffered reader
// ... loop and read entry header ...
var entry_decompress_stream = std.compress.zlib.decompressor(entry_crc32_reader.reader());
// ... read entry data ...
pack_buffered_reader.rewind(entry_decompress_stream.unusedBytes()); // New API to tell how many extra bytes were read by the decompressor for its buffer but not used
```

I think the new API you added in #18979 will be helpful for other use-cases, though, and it seems like a good addition to make.

Edit: actually, if I understand the code correctly, this may already be possible by reading

Edit 2: hmm, yes, something like that does indeed seem to work: https://github.com/ianprime0509/zig-zlib-issue/blob/e843d77608810afbd0273cfb9fc933dfc509fcf6/test-new-workaround.zig I'll have to think a little more about the API of the rewindable buffered reader, though; this one was just thrown together for this experiment.
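For the record, a rough sketch of what such a rewindable buffered reader could look like (everything below is hypothetical, not a std API; the `history` bound plays the role of the 128 bytes in the example above):

```zig
const std = @import("std");

/// Hypothetical sketch: a buffered reader that keeps the last `history`
/// bytes it handed out, so the caller can rewind by up to that many bytes
/// after an inner consumer (such as a decompressor) overshoots.
pub fn RewindableBufferedReader(comptime InnerReader: type, comptime history: usize) type {
    return struct {
        inner: InnerReader,
        buf: [2 * history]u8 = undefined,
        start: usize = 0, // first unread byte in buf
        end: usize = 0, // one past the last valid byte in buf

        const Self = @This();
        pub const Error = InnerReader.Error;
        pub const Reader = std.io.Reader(*Self, Error, read);

        pub fn reader(self: *Self) Reader {
            return .{ .context = self };
        }

        fn read(self: *Self, dest: []u8) Error!usize {
            if (self.start == self.end) {
                // Slide the most recent `history` bytes to the front so a
                // later rewind can still reach them, then refill the rest.
                const keep = @min(self.end, history);
                std.mem.copyForwards(u8, self.buf[0..keep], self.buf[self.end - keep .. self.end]);
                self.start = keep;
                self.end = keep + try self.inner.read(self.buf[keep..]);
                if (self.start == self.end) return 0; // EOF
            }
            const n = @min(dest.len, self.end - self.start);
            @memcpy(dest[0..n], self.buf[self.start..][0..n]);
            self.start += n;
            return n;
        }

        /// Step back over the last `n` bytes handed out (n <= history).
        pub fn rewind(self: *Self, n: usize) void {
            std.debug.assert(n <= self.start);
            self.start -= n;
        }
    };
}
```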
Works around ziglang#18967. Currently non-functional for reasons which will be explained in the comments of that issue.
I tried proceeding with a workaround to fix the Git package fetching logic based on my comment above. However, I ran into another issue which I'm not sure how to solve with the "rewinding buffer" approach: the buffered reader is wrapped in two other readers.

I will need to think about this more and see if there is any other way this can be fixed aside from imposing the constraint described in this issue (that the decompressor should never read any more bytes than necessary from the underlying reader).
I hope you'd be able to use
Right, sorry for breaking this. The bit reader inside the decompressor has an 8-byte internal buffer. The maximum overshoot we can get when decompressing a zlib stream is 4 bytes: zlib has a 4-byte checksum at the end, and if we refill with 8 bytes just before reading the checksum, that leaves 4 bytes in the bit reader.

One obvious solution is to use a 4-byte bit buffer. That would ensure that nothing is left in the bit buffer after reading the checksum, but it hurts performance by about 20% according to my initial measurements, so I'm uneasy about implementing that for all use cases.

I think I now pretty much understand the code in indexPackFirstPass. I see entry_crc32_reader as the biggest problem and don't have a solution for it right now. We could do some kind of two-pass parsing: in the first pass, decompress and record the compressed and decompressed sizes and the decompressed SHA-1; then use the compressed size in the second pass to navigate over the entry and calculate the checksum along the way. I understand that is a lot of work on your side and will probably also hurt performance. Or we could use some kind of seekable stream to seekBy a few bytes back, but that again is a complete redesign.

Sorry, I don't have a solution right now. But if we don't find something workable, I will provide a precise (in the number of bytes read) decompressor, somehow. Don't worry about that, although I will need a few days because I can't be at a computer full time for the next 2-3 days.
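Concretely, the worst case looks like this (byte labels are illustrative):

```
... deflate bits | c1 c2 c3 c4 | x1 x2 x3 x4 ...
                   Adler-32      whatever follows the stream

the bit reader refills 8 bytes right before the checksum:
  bit buffer <- [c1 c2 c3 c4 x1 x2 x3 x4]
after the 4 checksum bytes are consumed, x1..x4 remain in the bit
buffer: 4 bytes taken from the underlying reader but never used.
```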
Unfortunately, there are two reasons why
No worries at all, I understand the performance reasons for wanting to buffer more data, and what I'm doing with it is probably a niche use-case (I suspect the vast majority of users will only have a single data stream to decompress). I have found a way to rework
This commit works around ziglang#18967 by adding an `AccumulatingReader`, which accumulates data read from the underlying packfile, and by keeping track of the position in the packfile and hash/checksum information separately rather than using reader composition. That is, the packfile position and hashes/checksums are updated with the accumulated read history data only after we can determine what data has actually been used by the decompressor rather than merely being buffered. The only addition to the standard library APIs to support this change is the `unreadBytes` function in `std.compress.flate.Inflate`, which allows the user to determine how many bytes have been read only for buffering and not used as part of compressed data. These changes can be reverted if ziglang#18967 is resolved with a decompressor that reads precisely only the number of bytes needed for decompression.
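A sketch of how the accounting described above can be exercised, assuming `unreadBytes` behaves as the commit message states (the compress-then-decompress round trip is illustrative):

```zig
const std = @import("std");

test "unreadBytes measures decompressor overshoot" {
    const allocator = std.testing.allocator;

    // One zlib stream followed by three trailing bytes standing in for
    // the next packfile entry.
    var buf = std.ArrayList(u8).init(allocator);
    defer buf.deinit();
    var comp = try std.compress.zlib.compressor(buf.writer(), .{});
    try comp.writer().writeAll("hello, packfile entry");
    try comp.finish();
    const stream_len = buf.items.len;
    try buf.appendSlice("XYZ");

    var fbs = std.io.fixedBufferStream(buf.items);
    var decomp = std.compress.zlib.decompressor(fbs.reader());
    const out = try decomp.reader().readAllAlloc(allocator, 1 << 16);
    defer allocator.free(out);

    // fbs.pos may sit past the end of the zlib stream; subtracting the
    // bytes that were only buffered recovers the true stream boundary.
    try std.testing.expectEqual(stream_len, fbs.pos - decomp.unreadBytes());
}
```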
I've opened #18992 with a workaround for this in the Git package fetching code.
The zlib decompressor is broken in Zig, which causes git URLs to not work. The issue is tracked here: ziglang/zig#18967.
This reverts commit 747176c due to an upstream Zig issue: ziglang/zig#18967.
In the std lib we have a peek reader which allows us to return some bytes to be read again. That, and this, made me think about what kind of reader would fit this case. What do you think about such an approach?
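For reference, the peek reader mentioned is presumably `std.io.peekStream`; a sketch of how a known, bounded overshoot could be pushed back for the next consumer (`overshoot` is a hypothetical slice of bytes the decompressor buffered but never consumed):

```zig
const std = @import("std");

fn rewindOvershoot(underlying: anytype, overshoot: []const u8) !void {
    // A lookahead of 4 matches the maximum zlib overshoot discussed above.
    var ps = std.io.peekStream(4, underlying);
    // ... decompress from ps.reader() ...
    try ps.putBack(overshoot); // the next read sees these bytes again
    // ... parse the next packfile entry header from ps.reader() ...
}
```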
@ianic what you came up with is a really nice approach, thanks for taking the time to implement it! I much prefer it to the approach I came up with, since it avoids the need for any additional allocations while still having an API and usage that make sense. I also built it and tested it out on my other projects as an additional check (all of them are working fine). Personally, I think the following would be a fine solution to this issue, but I'd be interested in what you and others think as well:
I also have a couple questions/suggestions about the high-level design of
Great that you like it, and thank you for the suggestions. I like the 'tee' name very much, so I renamed it to BufferedTee, although there are two characteristics of this tee: one is that it is buffered, and the other is that it holds output lookahead bytes behind the consumer, and the BufferedTee name expresses just one of them. Besides that, I like the BufferedTee name very much.

Why did I first start solving this problem outside of flate? I also think that there is a place for something like this in the standard library. I'll prepare a PR. The current implementation is here; all comments are highly welcome.
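Roughly, the data flow described above (a sketch based on that description, not taken from the PR itself):

```
underlying reader --> BufferedTee --> consumer (flate decompressor)
                          |
                          +--> output writer, delayed by `lookahead`
                               bytes behind the consumer, so overshoot
                               never reaches the hasher/CRC
```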
Thanks for the detailed explanation. That makes sense; given your explanation of the efficiency improvements gained by buffering some extra input in the decompressor, and the existence of a better tool to handle such needs in #19032, I'm going to close this issue as "not planned" because the behavior it asks for would have a negative effect overall. As mentioned in my comments on #19032, I think your
My first zlib implementation broke git fetch because it introduced [lookahead](ziglang#18967). That resulted in workarounds [1](ziglang@80f3ef6) and [2](ziglang@d00faa2). After [fixing](ziglang#19163) the lookahead in the zlib decompressor, these fixes are no longer necessary.
Introduced in ziglang#19032 as a fix for ziglang#18967. No longer needed after ziglang#19253.
Zig Version
0.12.0-dev.2790+fc7dd3e28
Steps to Reproduce and Observed Behavior
Clone https://github.com/ianprime0509/zig-zlib-issue at commit 076f340acd113b73162800b43cc0add3a0141bd0 and run `zig test test-new.zig`. The test fails. If the `expect` call on line 25 is commented out, the test will fail with the error `BadZlibHeader`.
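The failing scenario boils down to something like the following sketch (in the spirit of `test-new.zig`, not the exact file):

```zig
const std = @import("std");

test "two concatenated zlib streams" {
    const allocator = std.testing.allocator;

    // Compress two messages into one contiguous buffer.
    var buf = std.ArrayList(u8).init(allocator);
    defer buf.deinit();
    inline for (.{ "first stream", "second stream" }) |msg| {
        var comp = try std.compress.zlib.compressor(buf.writer(), .{});
        try comp.writer().writeAll(msg);
        try comp.finish();
    }

    // Decompress both streams in sequence from a single reader. With the
    // 0.12.0-dev decompressor, the first iteration overshoots into the
    // second stream, so the second iteration fails with BadZlibHeader.
    var fbs = std.io.fixedBufferStream(buf.items);
    inline for (.{ "first stream", "second stream" }) |expected| {
        var decomp = std.compress.zlib.decompressor(fbs.reader());
        const out = try decomp.reader().readAllAlloc(allocator, 1 << 16);
        defer allocator.free(out);
        try std.testing.expectEqualStrings(expected, out);
    }
}
```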
Expected Behavior
The test should pass. The input data to the test is the concatenation of two zlib data streams: when using `std.compress.zlib.decompressor` at the beginning of the concatenated stream and reading the entire decompressed content, the underlying reader should not be advanced past the end of the first zlib data stream, so that the second data stream can be decompressed starting at the end of the first one.

A successful outcome can be reproduced by checking out 7204ecc (the commit just before the new gzip changes were merged) and running `zig test test-old.zig`, which is identical to `test-new.zig` except that it uses the old `std.compress.zlib` API.

For context on where this is necessary: an entry in a Git packfile consists of a header followed by zlib-compressed data, and the header only contains the uncompressed length of the data, so it is impossible to know where the second entry in a packfile begins without reading the first entry's compressed data precisely to the end and no further. Unfortunately, this means the Git package fetching support is currently broken.
I haven't delved too deeply into the new zlib code to find out how large a change this would be, but I think this is a reasonable constraint to place on the decompress reader API, given formats such as Git packfiles which rely on knowing exactly where the compressed data ends.