Commit b962da4

Rename DataView Metadata to Annotations.
Fix dotnet#1843 Fix dotnet#2297
1 parent 6077e18 commit b962da4

160 files changed, +1252 -1255 lines changed
docs/code/IDataViewDesignPrinciples.md

Lines changed: 5 additions & 5 deletions
@@ -47,7 +47,7 @@ only when needed to satisfy a local request for information.
 The IDataView design fulfills the following design requirements:

 * **General schema**: Each view carries schema information, which specifies
-the names and types of the view's columns, together with metadata associated
+the names and types of the view's columns, together with annotations associated
 with the columns. The system is optimized for a reasonably small number of
 columns (hundreds). See [here](#basics).

@@ -112,14 +112,14 @@ The IDataView system design does *not* include the following:
 * **Multi-view schema information**: There is no direct support for specifying
 cross-view schema information, for example, that certain columns are primary
 keys, and that there are foreign key relationships among tables. However,
-the column metadata support, together with conventions, may be used to
+the column annotation support, together with conventions, may be used to
 represent such information.

 * **Standard ML schema**: The IDataView system does not define, nor prescribe,
 standard ML schema representation. For example, it does not dictate
 representation of nor distinction between different semantic interpretations
 of columns, such as label, feature, score, weight, etc. However, the column
-metadata support, together with conventions, may be used to represent such
+annotation support, together with conventions, may be used to represent such
 interpretations.

 * **Row count**: A view is not required to provide its row count. The
@@ -149,7 +149,7 @@ The IDataView system design does *not* include the following:

 IDataView has general schema support, in that a view can have an arbitrary
 number of columns, each having an associated name, index, data type, and
-optional metadata.
+optional annotation.

 Column names are case sensitive. Multiple columns can share the same name, in
 which case, one of the columns hides the others, in the sense that the name
@@ -177,7 +177,7 @@ The set of standard types will likely be expanded over time.
 The IDataView type system is specified in a separate document, *IDataView Type
 System Specification*.

-IDataView provides a general mechanism for associating semantic metadata with
+IDataView provides a general mechanism for associating semantic annotations with
 columns, such as designating sets of score columns, names associated with the
 individual slots of a vector-valued column, values associated with a key type
 column, whether a column's data is normalized, etc.

docs/code/IDataViewImplementation.md

Lines changed: 8 additions & 8 deletions
@@ -313,10 +313,10 @@ are initialized using the pseudo-random number generator in an `IHost` that
 changes from one to another. But, that's a bit nit-picky.

 Note also: when we say functionally identical we include everything about it:
-not just the data, but the schema, its metadata, the implementation of
+not just the data, but the schema, its annotations, the implementation of
 shuffling, etc. For this reason, while serializing the data *model* has
 guarantees of consistency, serializing the *data* has no such guarantee: if
-you serialize data using the text saver, practically all metadata (except slot
+you serialize data using the text saver, practically all annotations (except slot
 names) will be completely lost, which can have implications on how some
 transforms and downstream processes work. Or: if you serialize data using the
 binary saver, suddenly it may become shufflable whereas it may not have been
@@ -475,7 +475,7 @@ helpful).

 The schema contains information about the columns. As we see in [the design
 principles](IDataViewDesignPrinciples.md), it has index, data type, and
-optional metadata.
+optional annotations.

 While *programmatically* accesses to an `IDataView` are by index, from a
 user's perspective the indices are by name; most training algorithms
@@ -498,20 +498,20 @@ things like key-types and vector-types, when returned, should not be created
 in the function itself (thereby creating a new object every time), but rather
 stored somewhere and returned.

-## Metadata
+## Annotations

-Since metadata is *optional*, one is not obligated to necessarily produce it,
+Since annotations are *optional*, one is not obligated to necessarily produce it,
 or conform to any particular schemas for any particular kinds (beyond, say,
 the obvious things like making sure that the types and values are consistent).
 However, the flip side of that freedom given to *producers*, is that
 *consumers* are obligated, when processing a data view input, to react
-gracefully when metadata of a certain kind is absent, or not in a form that
-one expects. One should *never* fail when input metadata is in a form one does
+gracefully when an annotation of a certain kind is absent, or not in a form that
+one expects. One should *never* fail when input annotations are in a form one does
 not expect.

 To give a practical example of this: many transforms, learners, or other
 components that process `IDataView`s will do something with the slot names,
-but when the `SlotNames` metadata kind for a given column is either absent,
+but when the `SlotNames` annotation kind for a given column is either absent,
 *or* not of the right type (vectors of strings), *or* not of the right size
 (same length vectors as the input), the behavior is not to throw or yield
 errors or do anything of the kind, but to simply say, "oh, I don't really have
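
Reviewer note: the consumer-side rule in the hunk above (treat a `SlotNames` annotation that is absent, of the wrong type, or of the wrong length as "no slot names" rather than as an error) can be sketched as follows. This is a minimal illustration, not code from this commit. It assumes the type names of the released ML.NET 1.x API (`DataViewSchema`, `VectorDataViewType`, `TextDataViewType`), which may differ from the names in the tree at this commit. `SlotNamesExample` and `TryGetSlotNames` are hypothetical names introduced only for this sketch.

```csharp
using System;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;

public static class SlotNamesExample
{
    // Hypothetical helper: returns the slot names of a vector column, or null
    // when the SlotNames annotation is absent, has the wrong type, or has the
    // wrong length. It never throws for a malformed annotation.
    public static string[] TryGetSlotNames(DataViewSchema.Column column)
    {
        // Slot names only apply to vector columns of known size.
        if (!(column.Type is VectorDataViewType vectorType) || !vectorType.IsKnownSize)
            return null;

        // Absent annotation: nothing to report, not an error.
        DataViewSchema.Column? kind = column.Annotations.Schema.GetColumnOrNull("SlotNames");
        if (kind == null)
            return null;

        // Wrong item type or wrong length: handled the same way as "absent".
        if (!(kind.Value.Type is VectorDataViewType namesType)
            || !(namesType.ItemType is TextDataViewType)
            || namesType.Size != vectorType.Size)
            return null;

        VBuffer<ReadOnlyMemory<char>> names = default;
        column.Annotations.GetValue("SlotNames", ref names);
        return names.DenseValues().Select(n => n.ToString()).ToArray();
    }

    public static void Main()
    {
        // Producer side: attach slot names to one column and nothing to the other.
        var annotations = new DataViewSchema.Annotations.Builder();
        annotations.AddSlotNames(3, (ref VBuffer<ReadOnlyMemory<char>> dst) =>
            dst = new VBuffer<ReadOnlyMemory<char>>(
                3, new[] { "a".AsMemory(), "b".AsMemory(), "c".AsMemory() }));

        var builder = new DataViewSchema.Builder();
        var vectorOfThree = new VectorDataViewType(NumberDataViewType.Single, 3);
        builder.AddColumn("Named", vectorOfThree, annotations.ToAnnotations());
        builder.AddColumn("Unnamed", vectorOfThree);
        DataViewSchema schema = builder.ToSchema();

        // Consumer side: both columns are handled without exceptions.
        foreach (var name in new[] { "Named", "Unnamed" })
        {
            string[] slots = TryGetSlotNames(schema[name]);
            Console.WriteLine(slots == null
                ? $"{name}: no usable slot names"
                : $"{name}: {string.Join(", ", slots)}");
        }
    }
}
```

The helper returns null instead of throwing, so a caller can fall back to positional slot indices when no usable names are available.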

docs/code/IDataViewTypeSystem.md

Lines changed: 9 additions & 9 deletions
@@ -63,7 +63,7 @@ components. At a high level, it is analogous to the .Net interface
 While `IEnumerable<T>` is a sequence of objects of type `T`, `IDataView` is a
 sequence of rows. An `IDataView` object has an associated `ISchema` object
 that defines the `IDataView`'s columns, including their names, types, indices,
-and associated metadata. Each row of the `IDataView` has a value for each
+and associated annotations. Each row of the `IDataView` has a value for each
 column defined by the schema.

 Just as `IEnumerable<T>` has an associated enumerator interface, namely
@@ -224,29 +224,29 @@ to a dense representation having the suppressed entries filled in with the
 entries are emphatically *not* the missing/`NA` value of the item type, unless
 the missing and default values are identical, as they are for key types.

-### Metadata
+### Annotations

-A column in an `ISchema` can have additional column-wide information, known as
-metadata. For each string value, known as a metadata kind, a column may have a
-value associated with that metadata kind. The value also has an associated
+A column in an `DataViewSchema` can have additional column-wide information, known as
+annotations. For each string value, known as an annotation kind, a column may have a
+value associated with that annotation kind. The value also has an associated
 type, which is a compatible column type.

 For example:

 * A column may indicate that it is normalized, by providing a `BL` valued
-piece of metadata named `IsNormalized`.
+annotation named `IsNormalized`.

 * A column whose type is `V<R4,17>`, meaning a vector of length 17 whose items
-are single-precision floating-point values, might have `SlotNames` metadata
+are single-precision floating-point values, might have `SlotNames` annotation
 of type `V<TX,17>`, meaning a vector of length 17 whose items are text.

 * A column produced by a scorer may have several pieces of associated
-metadata, indicating the "scoring column group id" that it belongs to, what
+annotations, indicating the "scoring column group id" that it belongs to, what
 kind of scorer produced the column (for example, binary classification), and the
 precise semantics of the column (for example, predicted label, raw score,
 probability).

-The `ISchema` interface, including the metadata API, is fully specified in
+The `DataViewSchema` class, including the annotations API, is fully specified in
 another document.

 ## Text Type

docs/code/MlNetHighLevelConcepts.md

Lines changed: 2 additions & 2 deletions
@@ -29,9 +29,9 @@ This document is going to cover the following ML.NET concepts:
 In ML.NET, data is very similar to a SQL view: it's a lazily-evaluated, cursorable, heterogenous, schematized dataset.

 - It has *Schema* (an instance of an `ISchema` interface), that contains the information about the data view's columns.
-- Each column has a *Name*, a *Type*, and an arbitrary set of *metadata* associated with it.
+- Each column has a *Name*, a *Type*, and an arbitrary set of *annotations* associated with it.
 - It is important to note that one of the types is the `vector<T, N>` type, which means that the column's values are *vectors of items of type T, with the size of N*. This is a recommended way to represent multi-dimensional data associated with every row, like pixels in an image, or tokens in a text.
-- The column's *metadata* contains information like 'slot names' of a vector column and suchlike. The metadata itself is actually represented as another one-row *data*, that is unique to each column.
+- The column's *annotations* contains information like 'slot names' of a vector column and suchlike. The annotations itself are actually represented as another one-row *data*, that is unique to each column.
 - The data view is a source of *cursors*. Think SQL cursors: a cursor is an object that iterates through the data, one row at a time, and presents the available data.
 - Naturally, data can have as many active cursors over it as needed: since data itself is immutable, cursors are truly independent.
 - Note that cursors typically access only a subset of columns: for efficiency, we do not compute the values of columns that are not 'needed' by the cursor.

docs/code/SchemaComprehension.md

Lines changed: 7 additions & 7 deletions
@@ -8,17 +8,17 @@ For a better understanding of `IDataView` principles and type system please refe

 ## Introduction

-Every dataset in ML.NET is represented as an `IDataView`, which is, for the purposes of this document, a collection of rows that share the same columns. The set of columns, their names, types and other metadata is known as the *schema* of the `IDataView`, and it's represented as an `ISchema` object.
+Every dataset in ML.NET is represented as an `IDataView`, which is, for the purposes of this document, a collection of rows that share the same columns. The set of columns, their names, types and other annotations is known as the *schema* of the `IDataView`, and it's represented as an `ISchema` object.

 In this document, we will be using the terms *data view* and `IDataView` interchangeably, same for *schema* and `ISchema`.

 Before any new data enters ML.NET, the user needs to somehow define how the schema of the data will look like.
 To do this, the following questions need to be answered:
 - What are the column names?
 - What are their types?
-- What other metadata is associated with the columns?
+- What other annotations are associated with the columns?

-These items above are very similar to the definition of fields in a C# class: names and types of columns correspond to names and types of fields, and metadata can correspond to field attributes.
+These items above are very similar to the definition of fields in a C# class: names and types of columns correspond to names and types of fields, and annotations can correspond to field attributes.
 Because of this similarity, ML.NET offers a common convenient mechanism for creating a schema: it is done via defining a C# class.

 For example, the below class definition can be used to define a data view with 5 float columns:
@@ -201,10 +201,10 @@ var dataView = env.CreateDataView<IrisVectorData>(arr, schemaDef);
 var predictionEngine = env.CreatePredictionEngine<IrisData, IrisVectorData>(dv, outputSchemaDefinition: schemaDef);
 ```

-In addition to the above, you can use `SchemaDefinition` to add per-column metadata:
+In addition to the above, you can use `SchemaDefinition` to add per-column annotations:
 ```C#
-// Add column metadata.
-schemaDef["Label"].AddMetadata(MetadataUtils.Kinds.HasMissingValues, false);
+// Add column annotation.
+schemaDef["Label"].AddAnnotation(MetadataUtils.Kinds.HasMissingValues, false);
 ```

 ## Limitations
@@ -216,7 +216,7 @@ Here is the list of things that are only possible via the low-level interface:
 * Creating or reading a data view, where even column *types* are not known at compile time (so you cannot create a C# class to define the schema)
 * This can happen if you write a general-purpose machine learning tool that can ingest different kinds of datasets.
 * Reading a subset of columns that differs from one row to another: the cursor always populates the entire row object.
-* Reading column metadata from the data view.
+* Reading column annotations from the data view.
 * Accessing the 'hidden' data view columns by index.
 * Hidden columns are those that have the same name as other columns and a smaller index. They are not accessible by name.
 * Creating 'cursor sets': this is a feature that lets you iterate over data in multiple parallel threads by splitting the data between multiple 'sibling' cursors.

docs/samples/Microsoft.ML.Samples/Dynamic/Trainers/MulticlassClassification/LightGBMMulticlassClassification.cs

Lines changed: 2 additions & 2 deletions
@@ -56,9 +56,9 @@ public static void Example()
 // IDataView with predictions, to an IEnumerable<DatasetUtils.MulticlassClassificationExample>.
 var nativePredictions = mlContext.CreateEnumerable<DatasetUtils.MulticlassClassificationExample>(dataWithPredictions, false).ToList();

-// Get schema object out of the prediction. It contains metadata such as the mapping from predicted label index
+// Get schema object out of the prediction. It contains annotations such as the mapping from predicted label index
 // (e.g., 1) to its actual label (e.g., "AA").
-// The metadata can be used to get all the unique labels used during training.
+// The annotations can be used to get all the unique labels used during training.
 var labelBuffer = new VBuffer<ReadOnlyMemory<char>>();
 dataWithPredictions.Schema["PredictedLabelIndex"].GetKeyValues(ref labelBuffer);
 // nativeLabels is { "AA" , "BB", "CC", "DD" }

docs/samples/Microsoft.ML.Samples/Static/LightGBMMulticlassWithInMemoryData.cs

Lines changed: 3 additions & 3 deletions
@@ -70,15 +70,15 @@ public void MultiClassLightGbmStaticPipelineWithInMemoryData()
 // Convert prediction in ML.NET format to native C# class.
 var nativePredictions = mlContext.CreateEnumerable<DatasetUtils.MulticlassClassificationExample>(prediction.AsDynamic, false).ToList();

-// Get schema object out of the prediction. It contains metadata such as the mapping from predicted label index
+// Get schema object out of the prediction. It contains annotations such as the mapping from predicted label index
 // (e.g., 1) to its actual label (e.g., "AA"). The call to "AsDynamic" converts our statically-typed pipeline into
-// a dynamically-typed one only for extracting metadata. In the future, metadata in statically-typed pipeline should
+// a dynamically-typed one only for extracting annotations. In the future, annotations in statically-typed pipeline should
 // be accessible without dynamically-typed things.
 var schema = prediction.AsDynamic.Schema;

 // Retrieve the mapping from labels to label indexes.
 var labelBuffer = new VBuffer<ReadOnlyMemory<char>>();
-schema[nameof(DatasetUtils.MulticlassClassificationExample.PredictedLabelIndex)].Metadata.GetValue("KeyValues", ref labelBuffer);
+schema[nameof(DatasetUtils.MulticlassClassificationExample.PredictedLabelIndex)].Annotations.GetValue("KeyValues", ref labelBuffer);
 // nativeLabels is { "AA" , "BB", "CC", "DD" }
 var nativeLabels = labelBuffer.DenseValues().ToArray(); // nativeLabels[nativePrediction.PredictedLabelIndex - 1] is the original label indexed by nativePrediction.PredictedLabelIndex.
8484
