Description
I want to use zstd dictionaries with user-specified Dictionary_IDs (as per RFC 8478 Section 5, "If the frame is going to be distributed in a private environment, any Dictionary_ID can be used"), but I am not using the `zstd --train` command line tool. (See the tangential footnote below.)
I am finding the `zstd.h` API and the zstd manual difficult to understand.
I think I can get `ZSTD_compress_usingDict` to work... but if I understand the frame format correctly, the resulting encoded bytes don't contain a Dictionary_ID. That's not so surprising, as I've never passed a Dictionary_ID to any `ZSTD_etc` function. But... how do I provide one? What function should I call?
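To check whether a given frame carries a Dictionary_ID, the frame header can be inspected by hand. The following is a minimal sketch based on my reading of the RFC 8478 frame header layout (magic number, Frame_Header_Descriptor, optional Window_Descriptor, optional Dictionary_ID). `frame_dict_id` and `rd32` are hypothetical helper names, not part of `zstd.h`; the library's own `ZSTD_getDictID_fromFrame` should be preferred where available.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: load a 32-bit little-endian value. */
static uint32_t rd32(const uint8_t *p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Hypothetical helper (not a zstd API): returns the Dictionary_ID stored in
 * a Zstandard frame header, or 0 if the field is absent, the buffer is too
 * short, or the buffer does not start with the frame magic number. */
static uint32_t frame_dict_id(const uint8_t *frame, size_t n) {
    static const size_t did_sizes[4] = {0, 1, 2, 4};
    if (n < 5 || rd32(frame) != 0xFD2FB528u) {
        return 0;
    }
    uint8_t fhd = frame[4];                  /* Frame_Header_Descriptor */
    size_t did_size = did_sizes[fhd & 3];    /* Dictionary_ID_flag, bits 1-0 */
    int single_segment = (fhd >> 5) & 1;     /* Single_Segment_flag, bit 5 */
    /* Window_Descriptor (1 byte) is present only without Single_Segment. */
    size_t pos = 5 + (single_segment ? 0 : 1);
    if (n < pos + did_size) {
        return 0;
    }
    uint32_t id = 0;
    for (size_t i = 0; i < did_size; i++) {
        id |= (uint32_t)frame[pos + i] << (8 * i);  /* little-endian field */
    }
    return id;
}
```

A result of 0 here is consistent with what I observe for frames produced with a raw dictionary: the Dictionary_ID_flag is 0 and the field is simply omitted.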
Perhaps I should use `ZSTD_createCDict` and then manipulate the `ZSTD_CDict` somehow. I can see `ZSTD_getDictID_etc` getter functions but no `ZSTD_setDictID_etc` setter functions.
Or perhaps I should somehow convert my raw dictionary into `zstd --train`'s format. But Section 5 says that this format contains Entropy_Tables, so it's not as simple as prepending a fixed-size prefix to my raw dictionary bytes. In any case, it sounds like an expensive computation for what I was hoping would be a trivial setter function or extra function argument.
What would you recommend? How do I pass a user-specified Dictionary_ID to the `ZSTD_CCtx`?
As a general dictionary API issue, I often found it unclear what format the dictionary (pointer, length) arguments were expecting.
For example, `ZSTD_compress_usingDict` takes `const void* dict, size_t dictSize` arguments. Does this (pointer, length) pair hold
- (A) the raw dictionary bytes (IIUC, a 'content-only dictionary'),
- (B) the wrapped `zstd --train` format of Section 5, or
- (C) either?
The comment's "Note 2 : When `dict == NULL || dictSize < 8` no dictionary is used" suggests that `ZSTD_compress_usingDict` is (A) or (C). On the other hand, the comment also says "A dictionary can be any arbitrary data segment (also called a prefix), or a buffer with specified information", which suggests (C).
But for (C), how does it distinguish raw dictionary bytes that happen to start with e.g. the 0xEC30A437 magic number?
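To make the ambiguity concrete, here is a minimal sketch of the only check that the Section 5 wire format itself offers: a 4-byte little-endian magic number followed by a 4-byte little-endian Dictionary_ID. `looks_like_section5_dict` and `load_le32` are hypothetical names of mine, and I am not claiming this is how the library actually decides between (A) and (B).

```c
#include <stddef.h>
#include <stdint.h>

#define SECTION5_DICT_MAGIC 0xEC30A437u

/* Hypothetical helper: load a 32-bit little-endian value. */
static uint32_t load_le32(const uint8_t *p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Hypothetical helper (not a zstd API): returns 1 and stores the
 * Dictionary_ID if the buffer starts with the Section 5 magic number,
 * otherwise returns 0 and leaves *dict_id untouched. A raw dictionary whose
 * first bytes happen to be 37 A4 30 EC is indistinguishable from a
 * formatted one at this level, which is the "false positive" case. */
static int looks_like_section5_dict(const uint8_t *dict, size_t n,
                                    uint32_t *dict_id) {
    if (n < 8 || load_le32(dict) != SECTION5_DICT_MAGIC) {
        return 0;
    }
    if (dict_id != NULL) {
        *dict_id = load_le32(dict + 4);
    }
    return 1;
}
```

Any raw dictionary that passes this check would presumably then have its following bytes interpreted as Entropy_Tables, which is why I'd like the API to let the caller state which format is intended.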
Similarly, `ZSTD_getDictID_fromDict` also takes `const void* dict, size_t dictSize`, and its comment says "Provides the dictID stored within dictionary. if return == 0, the dictionary is not conformant with Zstandard specification", which suggests either (B) or (C).
Coming back to `ZSTD_createCDict`, it also takes `const void* dictBuffer, size_t dictSize`. But what are the semantics? Is it (A), (B) or (C)? How do I tell without diving further into the source code?
Perhaps the various `const void* dict, size_t dictSize` function arguments could be renamed so that it's clear whether each function expects (A) or (B). Or does everything take (C), and if so, what am I supposed to do in the "false positive" case where my raw dictionary bytes coincidentally also look like the Section 5 format?
By the way, `ZSTD_compress_usingDict`'s comment says "A dictionary can be any arbitrary data segment (also called a prefix), or a buffer with specified information (see dictBuilder/zdict.h)."
I looked at "dictBuilder/zdict.h", but didn't find it helpful. In any case, I'm not looking to train a dictionary. I have a dictionary, and I have a Dictionary_ID that I'd like to associate with it, but I can't see how to make that association.
Tangential footnote:
I am not using `zstd --train`. I am using the https://github.com/google/brotli/blob/master/research/dictionary_generator.cc program, with its `--chunk_len` option, creating dictionaries for chunks within a single file. Even if `zstd --train` gained an option to do that, I'm working with multiple compression codecs that could all use the same dictionary, so I'd rather not introduce a zstd-specific format.
Further links:
- `ractool` command line program
- RAC (Random Access Compression) spec, where I have tried to base "RAC + Zstandard" dictionaries on my understanding of "RAC + Zlib" dictionaries (e.g. RFC 1950's DICTID field being a hash of the dictionary's contents), but perhaps my mental model doesn't match Zstd perfectly.
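For reference, the zlib side of that mental model can be sketched as follows: RFC 1950 defines DICTID as the Adler-32 checksum of the dictionary bytes. `adler32_of` is a hypothetical name for a straightforward, unoptimized implementation. Whether an ID derived this way is appropriate as a Zstandard Dictionary_ID is exactly the part I am unsure about.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: plain Adler-32 as specified in RFC 1950, which is
 * the value zlib stores in a stream's DICTID field for a preset dictionary.
 * The modulo is applied on every byte for clarity rather than speed. */
static uint32_t adler32_of(const uint8_t *data, size_t n) {
    uint32_t a = 1;
    uint32_t b = 0;
    for (size_t i = 0; i < n; i++) {
        a = (a + data[i]) % 65521u;  /* 65521 is the largest prime below 2^16 */
        b = (b + a) % 65521u;
    }
    return (b << 16) | a;
}
```

The appeal of this model is that the ID is a pure function of the dictionary's contents, so no separate registry or setter call is needed.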