Migration of first IDataView docs #173 (Merged)
# IDV File Format

This document describes ML.NET's binary dataview file format, version 1.1.1.5,
written by the `BinarySaver` and `BinaryLoader` classes, commonly known as the
`.idv` format.

## Goal of the Format

A dataview is a collection of columns, over some number of rows. (Do not
confuse columns with features. Columns can be and often are vector valued, and
it is expected, though not required, that commonly all features will be together
in one vector valued column.)

The actual values are stored in blocks. A block holds values for a single
column across multiple rows. Block format is dictated by a codec. There is a
table of contents and lookup table to facilitate quasi-random access to
particular blocks. (Quasi in the sense that you can only seek to a block, not
to a particular value within a block.)

## General Data Format

Before we discuss the format itself we will establish some conventions on how
individual scalar values, strings, and other data are serialized. All basic
pieces of data (e.g., a single number, or a single string) are encoded in ways
reflecting the semantics of the .NET `BinaryWriter` class, those semantics
being:

* All numbers are stored as little-endian, using their natural fixed-length
  binary encoding.

* Strings are stored using an unsigned
  [LEB128](https://en.wikipedia.org/wiki/LEB128) number describing the number
  of bytes, followed by that many bytes containing the UTF-8 encoded string.

A note about this: LEB128 is a simple encoding for arbitrarily large
integers. Each byte of 8 bits follows this convention: the most significant
bit is 0 if and only if this is the end of the LEB128 encoding, and the remaining
7 bits are a part of the number being encoded. The bytes are stored
little-endian, that is, the first byte holds the 7 least significant bits, the
second byte (if applicable) holds the next 7 least significant bits, etc., and
the last byte holds the 7 most significant bits. LEB128 is used in one or two
places in this format. (I might tend to prefer use of LEB128 in places where
we are writing values that, on balance, we expect to be relatively small, and
only in cases where there is no potential benefit from random access to the
associated stream, since LEB128 is incompatible with random access. However,
this is not formulated into anything approaching a definite policy.)
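The LEB128 convention above can be sketched in a few lines. This is an illustrative implementation of the encoding as described, not ML.NET's own code:

```python
# Minimal sketch of unsigned LEB128: 7 payload bits per byte, little-endian,
# with the high bit set on every byte except the last.
def leb128_encode(value: int) -> bytes:
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow: set the high bit
        else:
            out.append(byte)         # final byte: high bit clear
            return bytes(out)

def leb128_decode(data: bytes) -> int:
    result = 0
    for shift, byte in enumerate(data):
        result |= (byte & 0x7F) << (7 * shift)
        if not (byte & 0x80):        # high bit clear marks the end
            break
    return result
```

For example, the canonical value 624485 encodes to the three bytes `0xE5 0x8E 0x26`.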
## Header

Every binary instances stream has a header composed of 256 bytes, at the start
of the stream. Not all bytes are used. Those bytes that are not explicitly
used have undefined content, and can have anything in them. We strongly
encourage writers of this format to insert obscene messages in this dead
space. The content is defined as follows (the offsets being the start of that
column).

Offsets | Type | Name and Description
--------|-------|---------------------
0 | ulong | **Signature**: The magic number of this file.
8 | ulong | **Version**: Indicates the version of the data file.
16 | ulong | **CompatibleVersion**: Indicates the minimum reader version that can interpret this file, possibly with some data loss.
24 | long | **TableOfContentsOffset**: The offset to the column table of contents structure.
32 | long | **TailOffset**: The eight-byte tail signature starts at this offset. So, the entire dataset stream should be considered to have byte length of eight plus this value.
40 | long | **RowCount**: The number of rows in this data file.
48 | int | **ColumnCount**: The number of columns in this data file.

Notes on these:

* The signature of this file is `0x00425644004C4D43`, which is, when written
  little-endian to a file, `CML DVB ` with null characters in the place of
  spaces. These letters are intended to suggest "CloudML DataView Binary."

* The tail signature is the byte-reversed version of this, that is,
  `0x434D4C0044564200`.

* Versions are encoded as four 16-bit unsigned numbers packed into a single
  ulong, with higher order bits being a more major version. The first
  supported version of the format is 1.1.1.4, that is, `0x0001000100010004`.
  (Versions prior to 1.1.1.4 did exist, but were not released, so we do not
  support them, though we do describe them in this document for the sake of
  completeness.)
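A hypothetical reader for the header fields above might look like the following sketch. The field layout follows the table; the function name and return shape are illustrative, not ML.NET's API:

```python
import struct

IDV_SIGNATURE = 0x00425644004C4D43  # little-endian bytes spell "CML\0DVB\0"

def read_header(header: bytes) -> dict:
    # The header occupies 256 bytes; only the first 52 carry defined fields.
    assert len(header) >= 52
    sig, version, compat, toc_off, tail_off, rows, cols = \
        struct.unpack_from("<QQQqqqi", header, 0)
    if sig != IDV_SIGNATURE:
        raise ValueError("not an IDV file")
    # Unpack the version quad: four 16-bit fields, most significant first.
    quad = tuple((version >> shift) & 0xFFFF for shift in (48, 32, 16, 0))
    return {"version": quad, "toc_offset": toc_off,
            "tail_offset": tail_off, "rows": rows, "cols": cols}
```

For instance, a `Version` field of `0x0001000100010004` unpacks to the quad `(1, 1, 1, 4)`.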
## Table of Contents Format

The table of contents consists of packed entries, with as many entries as
there are columns. The version field here indicates the versions where that
entry is written: ≥ indicates the field occurs in that version and all later
versions, = indicates the field occurs only in that version.

Description | Entry Type | Version
------------|------------|--------
Column name | string | ≥1.1.1.1
Codec loadname | string | ≥1.1.1.1
Codec parameterization length | LEB128 integer | ≥1.1.1.1
Codec parameterization, which must have precisely the length indicated above | arbitrary, but with specified length | ≥1.1.1.1
Compression kind | CompressionKind (byte) | ≥1.1.1.1
Rows per block in this column | LEB128 integer | ≥1.1.1.1
Lookup table offset | long | ≥1.1.1.1
Slot names offset, or 0 if this column has no slot names (for version 1.1.1.2, behave as if there are no slot names, with this having value 0) | long | =1.1.1.3
Slot names byte size (present only if slot names offset is greater than 0) | long | =1.1.1.3
Slot names count (present only if slot names offset is greater than 0) | int | =1.1.1.3
Metadata table of contents offset, or 0 if there is no metadata | long | ≥1.1.1.4

For those working in the ML.NET codebase: The three `Codec` fields are handled
by the `CodecFactory.WriteCodec`/`TryReadCodec` methods, with the definition
stream positioned at the start of the codec loadname beforehand, and at the end
of the codec parameterization afterwards, in both the success and failure cases.

CompressionKind enums are described below, and describe the compression
algorithm used to compress blocks.

### Compression Kind

The enum for compression kind is one byte, and follows this scheme:

Compression Kind | Code
----------------------------------------------------------------|-----
None | 0
DEFLATE (i.e., [RFC1951](http://www.ietf.org/rfc/rfc1951.txt)) | 1
zlib (i.e., [RFC1950](http://www.ietf.org/rfc/rfc1950.txt)) | 2

None means no compression. DEFLATE is the default scheme. There is a tendency
to conflate zlib and DEFLATE, so to be clear: zlib can be (somewhat inexactly)
considered a wrapped version of DEFLATE, but it is still a distinct (though
closely related) format. However, both are implemented by the zlib library,
which is probably the source of the confusion.
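The zlib/DEFLATE distinction matters in practice when decompressing a block: raw DEFLATE data has no zlib header, so it must be inflated differently. A sketch using Python's `zlib` module, with the compression-kind codes from the table above:

```python
import zlib

# Decompress one block's bytes according to its compression kind:
# 0 = none, 1 = raw DEFLATE (RFC 1951), 2 = zlib-wrapped DEFLATE (RFC 1950).
def decompress_block(data: bytes, kind: int) -> bytes:
    if kind == 0:
        return data
    if kind == 1:
        # Negative wbits tells zlib to expect raw DEFLATE with no header.
        return zlib.decompress(data, wbits=-15)
    if kind == 2:
        return zlib.decompress(data)  # default wbits expects the zlib wrapper
    raise ValueError(f"unknown compression kind {kind}")
```

Passing kind-1 data to a plain `zlib.decompress` call would fail on the missing header, which is exactly the conflation the paragraph above warns about.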
## Metadata Table of Contents Format

The metadata table of contents begins with a LEB128 integer describing the
number of entries. (This should be a positive value, since if a column has no
metadata the expectation is that the offset for the metadata TOC will be
stored as 0.) What follows are that many packed entries. Each entry is
somewhat akin to the column table of contents entry, with some simplifications
considering that there will be exactly one "block" with one item.

Description | Entry Type
-------------------------------------------------------|------------
Metadata kind | string
Codec loadname | string
Codec parameterization length | LEB128 integer
Codec parameterization, which must have precisely the length indicated above | arbitrary, but with specified length
Compression kind | CompressionKind (byte)
Offset of the block where the metadata item is written | long
Byte length of the block | LEB128 integer

The "block" is written in exactly the same format as the main content
blocks. This will be very slightly inefficient, as that scheme is written
to accommodate many entries, but I don't expect that to be much of a
burden.

## Lookup Table Format

Each table of contents entry is associated with a lookup table starting at the
indicated lookup table offset. It is written as packed binary, with each
lookup entry consisting of 16 bytes. So in all, the lookup table takes 16
bytes times the total number of blocks for this column.

Description | Entry Type
-----------------------------------------------------------|-----------
Block offset, position in the file where the block starts | long
Block length, its size in bytes in the file | int
Uncompressed block length, its size in bytes if the block bytes were decompressed according to the column's compression codec | int
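Since each lookup entry is a fixed 16 bytes (one long plus two ints, little-endian per the conventions above), the whole table can be decoded with a single unpack per entry. An illustrative sketch, not ML.NET's reader:

```python
import struct

# Decode the packed lookup table: each 16-byte entry is
# (block offset: long, compressed length: int, uncompressed length: int).
def read_lookup_table(data: bytes) -> list:
    assert len(data) % 16 == 0, "lookup table must be a whole number of entries"
    return [struct.unpack_from("<qii", data, off)
            for off in range(0, len(data), 16)]
```

The fixed entry size is what makes the quasi-random access mentioned earlier cheap: block *k*'s entry sits at byte offset `16 * k` within the table.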
## Slot Names

If slot names are stored, they are stored as pairs of integer index/string
pairs. As many pairs are stored as the count of slot names present in the
table of contents entry. Note that this only appeared in version 1.1.1.3. With
1.1.1.4 and later, slot names were just considered yet another piece of
metadata.

Description | Entry Type
------------------|-----------
Index of the slot | int
The slot name | string

## Block Format

Columns are ordered into blocks, with each block holding the binary encoded
values for one particular column across a range of rows. So for example, if
the column's table of contents describes it as having 1000 rows per block, the
first block will contain the values for the column for rows 0 through 999, the
second block rows 1000 through 1999, etc., with all blocks containing the same
number of values, except the last block, which will contain fewer (unless
the number of rows just so happens to be a multiple of the block size).

Each block is a possibly compressed sequence of bytes, compressed according
to the compression kind field in the table of contents. It begins and ends at
the offsets indicated in the corresponding lookup table entry (or, for metadata
blocks, the metadata entry). The uncompressed bytes will be stored in the
format described by the codec.
---
# Key Values

Most commonly, key values are used to encode items where it is convenient or
efficient to represent values using numbers, but you want to maintain the
logical "idea" that these numbers are keys indexing some underlying, implicit
set of values, in a way more explicit than simply mapping to a number would
allow you to do.

A more formal description of key values and types is
[here](IDataViewTypeSystem.md#key-types). *This* document's motivation is less
to describe what key types and values are, and more to describe why
key types are necessary and helpful things to have. Necessarily, this document
is more anecdotal in its descriptions, to motivate its content.

Let's take a few examples of transforms that produce keys:

* The `TermTransform` forms a dictionary of unique observed values to a key.
  The key type's count indicates the number of items in the set, and through
  the `KeyValues` metadata "remembers" what each key is representing.

* The `HashTransform` performs a hash of input values, and produces a key
  value with count equal to the range of the hash function, which, if a b-bit
  hash was used, will be 2ᵇ.

* The `CharTokenizeTransform` will take input strings and produce key values
  representing the characters observed in the string.
## Keys as Intermediate Values

Explicitly invoking transforms that produce key values, and using those key
values, is sometimes helpful. However, given that most trainers expect the
feature vector to be a vector of floating point values and *not* keys, in
typical usage keys mostly serve as some sort of intermediate
value on the way to that final feature vector. (Unless, say, doing something
like preparing labels for a multiclass learner.)

So why not go directly to the feature vector, and forget this key stuff?
Actually, to take text as the canonical example, we used to. However, by
structuring the transforms from, say, text to key to vector, rather than text
to vector *directly*, we are able to simplify a lot of code on the
implementation side, which is both less for us to maintain, and also gives
users consistency in behavior.

So for example, the `CharTokenize` above might appear to be a strange choice:
*why* represent characters as keys? The reason is that the ngram transform is
written to ingest keys, not text, and so we can use the same transform for
both the n-gram featurization of words, as well as n-char grams.

Now, much of this complexity is hidden from the user: most users will just use
the `text` transform, select some options for n-grams and chargrams, and not
be aware of these internal invisible keys. Similarly, users of the categorical or
categorical hash transforms may not know that internally each is just the
term or hash transform followed by a `KeyToVector` transform. But keys are
still there, and it would be impossible to really understand ML.NET's
featurization pipeline without understanding keys. Any user that wants to
understand how, say, the text transform resulted in a particular featurization
will have to inspect the key values to get that understanding.
## Keys are not Numbers

As an actual CLR data type, key values are stored as some form of unsigned
integer (most commonly `uint`). The most common confusion that arises from
this is to ascribe too much importance to the fact that it is a `uint`, and
think these are somehow just numbers. This is incorrect.

For keys, the concepts of order and difference have no inherent, real meaning as
they do for numbers, or at least, the meaning is different and highly domain
dependent. Consider a numeric `U4` type, with values `0`, `1`, and `2`. The
difference between `0` and `1` is `1`, and the difference between `1` and `2`
is `1`, because they're numbers. Very well: now consider that you train a term
transform over the input tokens `apple`, `pear`, and `orange`: this will also
map to the keys logically represented as the numbers `0`, `1`, and `2`
respectively. Yet for a key, is the difference between keys `0` and `1`, `1`?
No: the difference is that `0` maps to `apple` and `1` to `pear`. Also, order
doesn't mean one key is somehow "larger," it just means we saw one before
another -- or something else, if sorting by value happened to be selected.

Also: ML.NET's vectors can be sparse. Implicit entries in a sparse vector are
assumed to have the `default` value for that type -- that is, implicit values
for numeric types will be zero. But what would the implicit default value
for a key value be? Take the `apple`, `pear`, and `orange` example above -- it
would be inappropriate for the default value to be `0`, because that would mean
the implicit value is `apple`, which makes no sense. The only really appropriate
"default" choice is that the value is unknown, that is, missing.

An implication of this is that there is a distinction between the logical
value of a key value, and the actual physical value of the value in the
underlying type. This will be covered more later.
## As an Enumeration of a Set: `KeyValues` Metadata

While keys can be used for many purposes, they are often used to enumerate
items from some underlying set. In order to map keys back to this original
set, many transforms producing key values will also produce `KeyValues`
metadata associated with that output column.

Valid `KeyValues` metadata is a vector of length equal to the count of the
type of the column. This can be of varying types: it is often text, but does
not need to be. For example, a `term` transform applied to a column would have
`KeyValues` metadata of item type equal to the item type of the input data.

How this metadata is used downstream depends on the purposes of whoever is
consuming it, but common uses are: in multiclass classification, for
determining the human readable class names, or, if used in featurization,
determining the names of the features.

Note that `KeyValues` data is optional, and sometimes is not even sensible.
For example, consider a clustering algorithm, where the prediction for an
example is the cluster it is assigned to. If there were five clusters, then
the prediction would indicate the cluster by a `U4<0-4>` key. Yet these clusters
were found by the algorithm itself, and they have no natural descriptions.
## Actual Implementation

This may be of use only to writers or extenders of ML.NET, or users of our
API. How key values are presented *logically* to users of ML.NET is distinct
from how they are actually stored *physically* in actual memory, both in
ML.NET source and through the API. For key values:

* All key values are stored in unsigned integers.
* The missing key value is always stored as `0`. See the note above about the
  default value to see why this must be so.
* Valid non-missing key values are stored from `1` onwards, irrespective of
  whatever we claim in the key type that minimum value is.

So when, in the prior example, the term transform would map `apple`, `pear`,
and `orange` seemingly to `0`, `1`, and `2`, values of `U4<0-2>`, in reality,
if you were to fire up the debugger you would see that they were stored as
`1`, `2`, and `3`, with unrecognized values being mapped to the "default"
missing value of `0`.
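The logical-to-physical mapping described above can be sketched as follows. The function names and the `dict`-based dictionary are hypothetical illustrations, not ML.NET's implementation:

```python
# Physical 0 means "missing"; valid values are stored from physical 1 onwards.
def build_key_map(terms):
    # e.g. ["apple", "pear", "orange"] -> {"apple": 1, "pear": 2, "orange": 3}
    return {term: i + 1 for i, term in enumerate(terms)}

def to_physical(value, key_map):
    # Unrecognized values map to the "default" missing value, 0.
    return key_map.get(value, 0)

def to_logical(physical, min_value=0):
    # The logical value users see is offset by the key type's claimed minimum.
    return None if physical == 0 else min_value + physical - 1
```

Under this scheme, `pear` is physically `2` but logically `1` in `U4<0-2>`, and an unseen token like `kiwi` becomes the missing value `0`.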
Nevertheless, we almost never talk about this, any more than we would talk
about our "strings" really being implemented as string slices: this is purely
an implementation detail, relevant only to people working with key values at
the source level. To a regular non-API user of ML.NET, key values appear
*externally* to be simply values, just as strings appear to be simply strings,
and so forth.

There is another implication: a hypothetical type `U1<4000-4002>` is actually
a sensible type in this scheme. The `U1` indicates that it is stored in one byte,
which would on first glance seem to conflict with values like `4000`, but
remember that the first valid key value is stored as `1`, and we've identified
the valid range as spanning the three values 4000 through 4002. That is,
`4000` would be represented physically as `1`.

The reality cannot be seen by any conventional means I am aware of, save for
viewing ML.NET's workings in the debugger or using the API and inspecting
these raw values yourself: that `4000` you would see is really stored as the
`byte` `1`, `4001` as `2`, `4002` as `3`, and a missing value stored as `0`.
We have hyphenated n-grams here (3x w/ n-char). My liking is "ngram" and "chargram". #Pending
So, could you explain this? Because in this library in C# its identifier can't be
`n-gram`, so we call it `NGram`, but everywhere I look actual prose usage of the term is "n-gram", including back when I was a wee little grad student. I see paper titles from ICML as "N-gram" or "n-gram"; I don't see an "ngram." Unless the "cool kids" are doing something different nowadays, with their long hair and rock music?

In reply to: 188887783 [](ancestors = 188887783)
I started an email poll for terminology. No conclusion currently, but there was feedback that we'll need to also (besides for documentation) define the terms when used in code:
Hi Justin, leaving as n-gram. As near as I can see, when written in prose, it's nearly universally referred to this way. Maybe in less formal writing someone might omit the hyphen, and I see that Google actually has a piece of software branded "NGram", but otherwise, I do not see that your preferred usage is used at all. Thanks though!

In reply to: 189188391 [](ancestors = 189188391)