API for user-specified Dictionary_ID's confusing (or missing?) #1776

@nigeltao

I want to use zstd dictionaries with user-specified Dictionary_ID's (as per RFC 8478 Section 5, "If the frame is going to be distributed in a private environment, any Dictionary_ID can be used"), but I am not using the zstd --train command line tool. (See tangential footnote below.)

I am finding the zstd.h API and the zstd manual difficult to understand.

I think I can get ZSTD_compress_usingDict to work... but if I understand the frame format, the resultant encoded bytes don't contain a Dictionary_ID. That's not so surprising, as I've never passed the Dictionary_ID to any ZSTD_etc functions. But... how do I provide one? What function should I call?
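For reference, here is a minimal sketch of how I am checking whether an encoded frame carries a Dictionary_ID. It does not call into libzstd at all. It only follows my reading of the RFC 8478 frame header layout (Magic_Number, Frame_Header_Descriptor, optional Window_Descriptor, optional Dictionary_ID), and the function name frame_dictionary_id is my own, not part of zstd.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reads an n-byte little-endian unsigned integer (n <= 4). */
static uint32_t read_le(const uint8_t *p, size_t n)
{
    uint32_t v = 0;
    for (size_t i = 0; i < n; i++) {
        v |= (uint32_t)p[i] << (8 * i);
    }
    return v;
}

/*
 * Hypothetical helper, not a zstd.h function.
 * Returns  1 and stores the ID in *id if the frame header has a Dictionary_ID,
 *          0 if the header has no Dictionary_ID field,
 *         -1 if the input is truncated or does not start with the frame magic.
 */
int frame_dictionary_id(const void *src, size_t len, uint32_t *id)
{
    /* Dictionary_ID_flag (low 2 bits of the descriptor) -> field size in bytes. */
    static const size_t did_sizes[4] = {0, 1, 2, 4};
    const uint8_t *p = (const uint8_t *)src;

    assert(id != NULL);
    if (p == NULL || len < 5 || read_le(p, 4) != 0xFD2FB528u) {
        return -1;
    }
    uint8_t descriptor = p[4];
    size_t did_size = did_sizes[descriptor & 0x03];
    /* Window_Descriptor (1 byte) is absent when Single_Segment_flag (bit 5) is set. */
    size_t offset = 5 + ((descriptor & 0x20) ? 0 : 1);

    if (did_size == 0) {
        return 0;
    }
    if (len < offset + did_size) {
        return -1;
    }
    *id = read_le(p + offset, did_size);
    return 1;
}
```

Running this over the output of ZSTD_compress_usingDict with my raw dictionary is how I would confirm that no Dictionary_ID field is present, assuming I have read the header layout correctly.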

Perhaps I should use ZSTD_createCDict and then manipulate the ZSTD_CDict somehow. I can see ZSTD_getDictID_etc getter methods but no ZSTD_setDictID_etc setter methods.

Or, perhaps I should somehow convert my raw dictionary into zstd --train's format. But Section 5 says that this format contains Entropy_Tables, so it's not just as simple as prepending a fixed sized prefix to my raw dictionary bytes. In any case, it sounds like an expensive computation, for what I was hoping to be a trivial setter function or extra function argument.

What would you recommend? How do I pass a user-specified Dictionary_ID to the ZSTD_CCtx?


As a general dictionary API issue, I often found it unclear what format the dictionary (pointer, length) arguments were expecting.

For example, ZSTD_compress_usingDict takes const void* dict, size_t dictSize arguments. Does this (pointer, length) pair hold

  • (A) the raw dictionary bytes (IIUC, a 'content-only dictionary'),
  • (B) the wrapped zstd --train format of Section 5, or
  • (C) either?

The comment "Note 2 : When dict == NULL || dictSize < 8 no dictionary is used" suggests that ZSTD_compress_usingDict is (A) or (C). On the other hand, the comment also says "A dictionary can be any arbitrary data segment (also called a prefix), or a buffer with specified information", which suggests (C).

But for (C), how does it distinguish raw dictionary bytes that happen to start with e.g. the 0xEC30A437 magic number?
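To make the ambiguity concrete, here is a small sketch of the only header check I can derive from Section 5: a formatted dictionary starts with a 4-byte little-endian Magic_Number followed by a 4-byte little-endian Dictionary_ID. The helper name read_dictionary_id is hypothetical, and I am not claiming this is how the library itself decides. The point is that any raw buffer beginning with these four bytes would pass the same check.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DICT_MAGIC_NUMBER 0xEC30A437u

/* Reads a 4-byte little-endian unsigned integer. */
static uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Hypothetical helper, not a zstd.h function.
 * Returns 1 and stores the Dictionary_ID in *id if the buffer starts with
 * the Section 5 magic number, otherwise returns 0 (the buffer could only be
 * interpreted as raw content). It does not validate the Entropy_Tables
 * that follow the 8-byte header.
 */
int read_dictionary_id(const void *dict, size_t dictSize, uint32_t *id)
{
    const uint8_t *p = (const uint8_t *)dict;

    assert(id != NULL);
    if (p == NULL || dictSize < 8) {
        return 0;
    }
    if (read_le32(p) != DICT_MAGIC_NUMBER) {
        return 0;
    }
    *id = read_le32(p + 4);
    return 1;
}
```

With this check alone, a raw dictionary whose first bytes happen to be 37 A4 30 EC is indistinguishable from a formatted one, which is the false positive case I am asking about.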

Similarly, ZSTD_getDictID_fromDict also takes const void* dict, size_t dictSize, and its comment says "Provides the dictID stored within dictionary. if return == 0, the dictionary is not conformant with Zstandard specification", which suggests either (B) or (C).

Coming back to ZSTD_createCDict, it also takes const void* dictBuffer, size_t dictSize. But what are the semantics? Is it (A), (B) or (C)? How do I tell without diving further into the source code?

Perhaps the various const void* dict, size_t dictSize function arguments could be renamed so that it's clear when a function expects (A) and when it expects (B). Or does everything take (C), and if so, what am I supposed to do in the "false positive" case where my raw dictionary bytes coincidentally also look like the Section 5 format?


By the way, ZSTD_compress_usingDict's comment says "A dictionary can be any arbitrary data segment (also called a prefix), or a buffer with specified information (see dictBuilder/zdict.h)."

I looked at "dictBuilder/zdict.h", but didn't find it helpful. In any case, I'm not looking to train a dictionary. I have a dictionary, and I have a Dictionary_ID that I'd like to associate with it, but I can't see how to make that association.


Tangential footnote:

I am not using zstd --train. I am using the https://github.com/google/brotli/blob/master/research/dictionary_generator.cc program, with its --chunk_len option, creating dictionaries for chunks within a single file. Even if zstd --train gained an option to do that, I'm working with multiple compression codecs that could all use the same dictionary, so I'd rather not introduce a zstd-specific format.

Further links:
