API for user-specified Dictionary_ID's confusing (or missing?)

I want to use zstd dictionaries with user-specified Dictionary_ID's (as per RFC 8478 Section 5, "If the frame is going to be distributed in a private environment, any Dictionary_ID can be used"), but **I am not using the `zstd --train` command line tool**. (See tangential footnote below.)

I am finding the `zstd.h` API and the [zstd manual](https://facebook.github.io/zstd/zstd_manual.html) difficult to understand.

I think I can get `ZSTD_compress_usingDict` to work... but if I understand the frame format, the resultant encoded bytes don't contain a Dictionary_ID. That's not so surprising, as I've never passed the Dictionary_ID to any `ZSTD_etc` functions. But... how do I provide one? What function should I call?
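To make "the encoded bytes don't contain a Dictionary_ID" concrete, here is a minimal sketch of how I read the frame header layout in RFC 8478: a 4-byte little-endian magic number, a 1-byte Frame_Header_Descriptor whose low two bits give the size of the Dictionary_ID field, and an optional Window_Descriptor byte before the Dictionary_ID. The helper name `frame_dict_id` is hypothetical (it is not part of `zstd.h`), and this reflects my understanding of the spec rather than anything the library guarantees.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Magic_Number of a Zstandard frame, stored little-endian on the wire. */
#define FRAME_MAGIC 0xFD2FB528u

/* Hypothetical helper, NOT a zstd.h function.
 * Returns the frame's Dictionary_ID, 0 if the frame header carries none,
 * or -1 if src is not a complete Zstandard frame header. */
static int64_t frame_dict_id(const uint8_t *src, size_t n) {
    assert(src != NULL);
    if (n < 5) return -1;
    uint32_t magic = (uint32_t)src[0] | ((uint32_t)src[1] << 8) |
                     ((uint32_t)src[2] << 16) | ((uint32_t)src[3] << 24);
    if (magic != FRAME_MAGIC) return -1;

    uint8_t fhd = src[4];                 /* Frame_Header_Descriptor */
    int single_segment = (fhd >> 5) & 1;  /* bit 5: Single_Segment_flag */
    unsigned did_flag = fhd & 3u;         /* bits 1-0: Dictionary_ID_flag */
    static const size_t did_sizes[4] = {0, 1, 2, 4};
    size_t did_size = did_sizes[did_flag];

    /* Window_Descriptor (1 byte) is present unless Single_Segment_flag is set. */
    size_t pos = 5 + (single_segment ? 0 : 1);
    if (n < pos + did_size) return -1;

    uint32_t id = 0;
    for (size_t i = 0; i < did_size; i++) {
        id |= (uint32_t)src[pos + i] << (8 * i);  /* little-endian field */
    }
    return (int64_t)id;
}
```

Running something like this over the output of `ZSTD_compress_usingDict` with my raw dictionary is how I would expect to see a Dictionary_ID of 0, i.e. no Dictionary_ID field in the frame header.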

Perhaps I should use `ZSTD_createCDict` and then manipulate the `ZSTD_CDict` somehow. I can see `ZSTD_getDictID_etc` getter methods but no `ZSTD_setDictID_etc` setter methods.

Or, perhaps I should somehow convert my raw dictionary into `zstd --train`'s format. But Section 5 says that this format contains Entropy_Tables, so it's not just as simple as prepending a fixed sized prefix to my raw dictionary bytes. In any case, it sounds like an expensive computation, for what I was hoping to be a trivial setter function or extra function argument.

What would you recommend? How do I pass a user-specified Dictionary_ID to the `ZSTD_CCtx`?

---

As a general dictionary API issue, I often found it unclear what format the dictionary (pointer, length) arguments were expecting.

For example, `ZSTD_compress_usingDict` takes `const void* dict, size_t dictSize` arguments. Does this (pointer, length) pair hold
- (A) the raw dictionary bytes (IIUC, a 'content-only dictionary'),
- (B) the wrapped `zstd --train` format of Section 5, or
- (C) either?

The comment's "Note 2 : When `dict == NULL || dictSize < 8` no dictionary is used" suggests that `ZSTD_compress_usingDict` is (A) or (C). On the other hand, the comment also says "A dictionary can be any arbitrary data segment (also called a prefix), or a buffer with specified information", which suggests (C).

But for (C), how does it distinguish raw dictionary bytes that happen to start with e.g. the 0xEC30A437 magic number?

Similarly, `ZSTD_getDictID_fromDict` also takes `const void* dict, size_t dictSize`, and its comment says "Provides the dictID stored within dictionary. if return == 0, the dictionary is not conformant with Zstandard specification", which suggests either (B) or (C).
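For what it's worth, here is a sketch of the check I would guess an "either" (C) implementation performs, based only on the Section 5 layout (a 4-byte little-endian magic number followed by a 4-byte little-endian Dictionary_ID) and on the "`dictSize < 8`" and "return == 0" remarks in the comments. The name `guess_dict_id` is hypothetical, and whether the library actually does this is an assumption on my part. It also shows the false positive I'm worried about: any raw dictionary that happens to start with those four bytes would be classified as (B).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Magic number that starts a Section 5 formatted dictionary. */
#define DICT_MAGIC 0xEC30A437u

/* Reads a 32-bit little-endian value from p. */
static uint32_t read_le32(const uint8_t *p) {
    assert(p != NULL);
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Hypothetical helper, NOT a zstd.h function.
 * Returns the Dictionary_ID if the buffer looks like a Section 5
 * dictionary, or 0 if it is treated as raw content (or is too short). */
static uint32_t guess_dict_id(const uint8_t *dict, size_t n) {
    if (dict == NULL || n < 8) return 0;             /* too short for a header */
    if (read_le32(dict) != DICT_MAGIC) return 0;     /* no magic: raw content */
    return read_le32(dict + 4);                      /* Dictionary_ID field */
}
```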

Coming back to `ZSTD_createCDict`, it also takes `const void* dictBuffer, size_t dictSize`. But what are the semantics? Is it (A), (B) or (C)? How do I tell without diving further into the source code?

Perhaps the various `const void* dict, size_t dictSize` function arguments could be renamed so that it's clear when it's expecting (A) or when it's expecting (B). Or does everything take (C), and if so, what am I supposed to do in the "false positive" case where my raw dictionary bytes coincidentally also look like the Section 5 format?

---

By the way, `ZSTD_compress_usingDict`'s comment says "A dictionary can be any arbitrary data segment (also called a prefix), or a buffer with specified information (see dictBuilder/zdict.h)."

I looked at "dictBuilder/zdict.h", but didn't find it helpful. In any case, I'm not looking to *train* a dictionary. I *have* a dictionary, and I have a Dictionary_ID that I'd like to associate with it, but I can't see how to make that association.

---

Tangential footnote:

I am not using `zstd --train`. I am using the https://github.com/google/brotli/blob/master/research/dictionary_generator.cc program, with its `--chunk_len` option, creating dictionaries for chunks *within* a single file. Even if `zstd --train` gained an option to do that, I'm working with multiple compression codecs that could all use the same dictionary, so I'd rather not introduce a zstd-specific format.

Further links:
- [`ractool` command line program](https://godoc.org/github.com/google/wuffs/cmd/ractool)
- [RAC (Random Access Compression) spec](https://github.com/google/wuffs/blob/master/doc/spec/rac-spec.md), where I have tried to base "RAC + Zstandard" dictionaries on my understanding of "RAC + Zlib" dictionaries (e.g. RFC 1950's DICTID field being a hash of the dictionary's contents), but perhaps my mental model doesn't match Zstd perfectly.
