ZSTD_GENERIC_ERROR when compressing large binary files #7

Closed
ghost opened this issue Jan 27, 2015 · 9 comments

@ghost

ghost commented Jan 27, 2015

When compressing a large (6 GB) binary file, a compression error occurs. After some investigation, it appears to happen when there are more symbols than the max symbol limit. The line is here:

https://github.com/Cyan4973/zstd/blob/master/lib/fse.c#L1458

A simple temporary fix is to delete this line, but my guess is that this isn't a good solution. Increasing the max symbol limit didn't seem to work, but I'm not that familiar with the code base, so I may well have missed something.

@Cyan4973
Contributor

If the test on this line triggers the error code,
that is pretty bad: it means the distribution was not correctly normalized.
In that case, the algorithm is right to stop processing there.

Now, if that is the right scenario, it means the next step is to understand why the normalization would fail. It could be a very specific corner case that the fuzzer is unable to produce.

It may be a long shot, but would it be possible to access the faulty file for debugging?

If the problem is the one described above, it means it's not related to the size of the file.
It might be possible to capture just the place where the problem occurs.

@ghost
Author

ghost commented Jan 28, 2015

Sorry, I can't give out the faulty file since it contains some sensitive information. One thing to note is that after removing the line and compressing, then decompressing the file, the file is a perfect match for the original.

@Cyan4973
Contributor

OK.
It's a pity not to be able to investigate the problem, but I'm glad it works correctly for you.
To be fair, I'm very surprised: this test was supposed to be an important sanity check, and I took for granted that compression would necessarily misbehave if it failed.
Apparently that's not always the case.

Also, a secondary question:
Do you have any idea why "there are more symbols than the max symbol limit"?
This situation is not supposed to happen, so it would be interesting to understand why it does.

@ghost
Author

ghost commented Jan 28, 2015

The comment "there are more symbols than the max symbol limit" was purely a guess from what I saw of the code. It's very likely I misread the code and the real issue is something else entirely.

If it helps, I can give you some gdb output. For example, here are some variables from gdb when it stops at that line:

Breakpoint 1, FSE_buildCTable (CTable=0x7fffffff1d10, normalizedCounter=0x7fffffff3920, maxSymbolValue=252, tableLog=8) at ../lib/fse.c:1460
1460 return (size_t)-FSE_ERROR_GENERIC; /* Must have gone through all positions */
(gdb) p position
$1 = 49
(gdb) p maxSymbolValue
$2 = 252
(gdb) p symbol
$3 = 253

@Cyan4973
Contributor

OK.
I see that the compression level is quite affected, since it tries to fit up to 253 symbols into a table of 256 elements. It can work, but the compression ratio will suffer considerably.

"symbol" necessarily exits the loop at maxSymbolValue+1, so this part is correct.

What is not correct is "position", which is supposed to end at "0".
Here it ends at "49".
It should be possible to know how many symbols are missing from this value. Not sure if it is very useful though.

I've made a small update of FSE within the "dev" branch of FSE.
https://github.com/Cyan4973/FiniteStateEntropy/tree/dev
Maybe it can help to solve this situation.

@ghost
Author

ghost commented Jan 28, 2015

When using the dev branch of FSE, compression works.

@Cyan4973
Contributor

Thanks for the feedback

@Cyan4973
Contributor

Fix integrated into zstd "dev" branch

@Cyan4973
Contributor

merged into master
