Merged
188 changes: 188 additions & 0 deletions docs/code/EntryPoints.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,188 @@
# Overview

An 'entry point', is a representation of a ML.Net type in json format and it is used to serialize and deserialize an ML.Net type in JSON.
@TomFinley TomFinley Jun 4, 2018

ML.Net [](start = 43, length = 6)

I think the branding is not the lower case ML.Net but ML.NET. #Closed

@TomFinley TomFinley Jun 4, 2018

json [](start = 58, length = 4)

JSON is typically capitalized. #Closed

@TomFinley TomFinley Jun 4, 2018

it [](start = 74, length = 2)

Ambiguous, when we say "it" what are we referring to? Entry-points? JSON? An ML.NET type? #Closed

It is also one of the ways ML.Net uses to deserialize experiments, and the recommended way to interface with other languages.
In terms defining experiments w.r.t entry points, experiments are entry points DAGs, and respectively, entry points are experiment graph nodes.
@TomFinley TomFinley Jun 4, 2018

w.r.t [](start = 30, length = 5)

If we want to use this initialism it would be "w.r.t." not "w.r.t". #Closed

That's why, throughout the documentation, we also refer to them as 'entry point nodes'.
The graph 'variables', the various values of the experiment graph json properties, serve to describe the relationship between the entry point nodes.
The 'variables' are therefore the edges of the DAG.
@GalOshri GalOshri Jun 18, 2018

Introduce the acronym "directed acyclic graph" #Resolved


All ML.Net entry points are described by their manifests. The manifest is another json object that documents and describes the structure of an entry point.
Manifests are referenced to understand what an entry point does, and how it should be constructed, in a graph.

This document briefly describes the structure of the entry points, the structure of an entry point manifest, and mentions the ML.Net classes that help construct an entry point
graph.

## `EntryPoint manifest - the definition of an entry point`
@TomFinley TomFinley Jun 4, 2018

EntryPoint manifest - the definition of an entry point [](start = 4, length = 54)

This was put in code formatting. Was that intentional? #Closed


An example of an entry point manifest object, specifically for the MissingValueIndicator transform, is:
@TomFinley TomFinley Jun 12, 2018

MissingValueIndicator [](start = 67, length = 21)

Consider using code formatting for class names. #Resolved


```javascript
@TomFinley TomFinley Jun 8, 2018

This is how it is actually written out, but I wonder if we could just format it a bit to make it a bit more tolerable. The document is dominated by this ~180 line monstrosity. I think it could be improved significantly by just deleting a bunch of whitespace... so for example if the stuff from lines 40 through 65, we could make it look more like this to save a bunch of lines.

"Values": ["I1", "U1", "I2", "U2", "I4", "U4", "I8", "U8",
    "R4", "Num", "R8", "TX", "Text", "TXT", "BL", "Bool",
    "TimeSpan", "TS", "DT", "DateTime", "DZ", "DateTimeZone",
    "UG", "U16"]

Basically I suppose I'd say if it looked more like someone actually wrote it vs. code-generated it would be a lot easier to appreciate and comprehend. I think we can get it to all fit on one page. Sometimes more lengthy cannot be helped, but in general and especially for the first example, I think it's important that it fit on one page. #Closed

{
    "Name": "Transforms.MissingValueIndicator",
    "Desc": "Create a boolean output column with the same number of slots as the input column, where the output value is true if the value in the input column is missing.",
    "FriendlyName": "NA Indicator Transform",
    "ShortName": "NAInd",
    "Inputs": [
        {
            "Name": "Column",
            "Type": {
                "Kind": "Array",
                "ItemType": {
                    "Kind": "Struct",
                    "Fields": [
                        {
                            "Name": "Name",
                            "Type": "String",
                            "Desc": "Name of the new column",
                            "Aliases": [
                                "name"
                            ],
                            "Required": false,
                            "SortOrder": 150.0,
                            "IsNullable": false,
                            "Default": null
                        },
                        {
                            "Name": "Source",
                            "Type": "String",
                            "Desc": "Name of the source column",
                            "Aliases": [
                                "src"
                            ],
                            "Required": false,
                            "SortOrder": 150.0,
@TomFinley TomFinley Jun 4, 2018

"SortOrder": 150.0, [](start = 18, length = 19)

These are kind of poor examples... SortOrder is identical between the two properties here, and in the enclosing scope they are also identical with sort order of 1. :) #Resolved

Member Author

leaving the 150 sort order intact, since it seems to be the de facto (not sure if intentional, though) default for advanced properties.

Updating the transform used for the example to a better one.


In reply to: 192868923 [](ancestors = 192868923)

                            "IsNullable": false,
                            "Default": null
                        }
                    ]
                }
            },
            "Desc": "New column definition(s) (optional form: name:src)",
            "Aliases": [
                "col"
            ],
            "Required": true,
            "SortOrder": 1.0,
            "IsNullable": false
        },
        {
            "Name": "Data",
            "Type": "DataView",
            "Desc": "Input dataset",
            "Required": true,
            "SortOrder": 1.0,
            "IsNullable": false
        }
    ],
    "Outputs": [
        {
            "Name": "OutputData",
            "Type": "DataView",
            "Desc": "Transformed dataset"
        },
        {
            "Name": "Model",
            "Type": "TransformModel",
            "Desc": "Transform model"
        }
    ],
    "InputKind": [
        "ITransformInput"
    ],
    "OutputKind": [
        "ITransformOutput"
    ]
}
```

The respective entry point, constructed based on this manifest, would be:

```javascript
{
    "Name": "Transforms.MissingValueIndicator",
    "Inputs": {
        "Column": [
            {
                "Name": "Features",
                "Source": "Features"
            }
        ],
        "Data": "$data0"
    },
    "Outputs": {
        "OutputData": "$Output_1528136517433",
        "Model": "$TransformModel_1528136517433"
    }
}
@TomFinley TomFinley Jun 4, 2018

Be consistent with usage of spaces vs. tabs above. Prefer spaces. #Closed

```

## `EntryPointGraph`

This class encapsulates the list of nodes (`EntryPointNode`) and edges
(`EntryPointVariable` inside a `RunContext`) of the graph.

## `EntryPointNode`

This class represents a node in the graph, and wraps an entry point call. It
has methods for creating and running entry points. It also has a reference to
the `RunContext` to allow it to get and set values from `EntryPointVariable`s.

To express the inputs that are set through variables, a set of dictionaries
is used. The `InputBindingMap` maps an input parameter name to a list of
`ParameterBinding`s. The `InputMap` maps a `ParameterBinding` to a
`VariableBinding`. For example, if the JSON looks like this:

```javascript
'foo': '$bar'
```

the `InputBindingMap` will have one entry that maps the string "foo" to a list
that has only one element, a `SimpleParameterBinding` with the name "foo" and
the `InputMap` will map the `SimpleParameterBinding` to a
`SimpleVariableBinding` with the name "bar". For a more complicated example,
let's say we have this JSON:

```javascript
'foo': [ '$bar[3]', '$baz']
```

the `InputBindingMap` will have one entry that maps the string "foo" to a list
that has two elements, an `ArrayIndexParameterBinding` with the name "foo" and
index 0 and another one with index 1. The `InputMap` will map the first
`ArrayIndexParameterBinding` to an `ArrayIndexVariableBinding` with name "bar"
and index 3 and the second `ArrayIndexParameterBinding` to a
`SimpleVariableBinding` with the name "baz".
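
The two examples above can be modeled with plain data structures. The following Python sketch is only an illustration of the mapping described in this section, not the ML.NET implementation. The dataclass names mirror the class names in the text; `parse_variable` and `build_bindings` are hypothetical helpers, and the sketch assumes every input value is a variable reference of the form `$name` or `$name[index]`.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class SimpleParameterBinding:
    name: str


@dataclass(frozen=True)
class ArrayIndexParameterBinding:
    name: str
    index: int


@dataclass(frozen=True)
class SimpleVariableBinding:
    name: str


@dataclass(frozen=True)
class ArrayIndexVariableBinding:
    name: str
    index: int


# '$bar' or '$bar[3]'; the accepted syntax is an assumption made for this sketch.
_VAR = re.compile(r"^\$(\w+)(?:\[(\d+)\])?$")


def parse_variable(text):
    """Turn '$bar' or '$bar[3]' into a variable binding (hypothetical helper)."""
    match = _VAR.match(text)
    if match is None:
        raise ValueError(f"not a variable reference: {text!r}")
    name, index = match.groups()
    if index is None:
        return SimpleVariableBinding(name)
    return ArrayIndexVariableBinding(name, int(index))


def build_bindings(inputs):
    """Build (input_binding_map, input_map) from a node's JSON inputs."""
    input_binding_map = {}  # parameter name -> list of parameter bindings
    input_map = {}          # parameter binding -> variable binding
    for param, value in inputs.items():
        if isinstance(value, list):
            # One ArrayIndexParameterBinding per element of the JSON array.
            bindings = [ArrayIndexParameterBinding(param, i) for i in range(len(value))]
            for binding, item in zip(bindings, value):
                input_map[binding] = parse_variable(item)
        else:
            bindings = [SimpleParameterBinding(param)]
            input_map[bindings[0]] = parse_variable(value)
        input_binding_map[param] = bindings
    return input_binding_map, input_map
```

Calling `build_bindings({"foo": ["$bar[3]", "$baz"]})` yields two `ArrayIndexParameterBinding`s for "foo", mapped to an `ArrayIndexVariableBinding("bar", 3)` and a `SimpleVariableBinding("baz")` respectively.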

For outputs, a node assumes that an output is mapped to a variable, so the
`OutputMap` is a simple dictionary from string to string.

## `EntryPointVariable`

This class represents an edge in the entry point graph. It has a name, a type
and a value. Variables can be simple, arrays and/or dictionaries. Currently,
only data views, file handles, predictor models and transform models are
allowed as element types for a variable.

## `RunContext`

This class is just a container for all the variables in a graph.

## VariableBinding and Derived Classes

The abstract base class represents a "pointer to a (part of a) variable". It
is used in conjunction with `ParameterBinding`s to specify inputs to an entry
point node. The `SimpleVariableBinding` is a pointer to an entire variable,
the `ArrayIndexVariableBinding` is a pointer to a specific index in an array
variable, and the `DictionaryKeyVariableBinding` is a pointer to a specific
key in a dictionary variable.

## ParameterBinding and Derived Classes

The abstract base class represents a "pointer to a (part of a) parameter". It
parallels the `VariableBinding` hierarchy and it is used to specify the inputs
to an entry point node. The `SimpleParameterBinding` is a pointer to a
non-array, non-dictionary parameter, the `ArrayIndexParameterBinding` is a
pointer to a specific index of an array parameter and the
`DictionaryKeyParameterBinding` is a pointer to a specific key of a dictionary
parameter.
123 changes: 123 additions & 0 deletions docs/code/GraphRunner.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,123 @@
# JSON Graph format

The entry point graph in TLC is an array of _nodes_. Each node is an object with the following fields:
@TomFinley TomFinley Jun 4, 2018

entry point [](start = 4, length = 11)

This might be a good place to have a link to EntryPoints.md. #Closed


- _name_: string. Required. Name of the entry point.
- _inputs_: object. Optional. Specifies non-default inputs to the entry point.
Note that if the entry point has required inputs (which is very common), the _inputs_ field is required.
- _outputs_: object. Optional. Specifies the variables that will hold the node's outputs.

## Input and output types
The following types are supported in JSON graphs:

- _string_. Represented as a JSON string, maps to a C# string.
@TomFinley TomFinley Jun 4, 2018

string [](start = 55, length = 6)

Should these types when listed here be listed as string vs. plain old string, since we are using C# keywords to describe them? (E.g.: string, float, double, bool, enum, int, long, etc.) This comment would not apply to things that are actually meant to be interpreted as prose descriptions of the type, e.g., "array." #WontFix

- _float_. Represented as a JSON float, maps to a C# float or double.
- _bool_. Represented as a JSON bool, maps to a C# bool.
- _enum_. Represented as a JSON string, maps to a C# enum. The allowed values are those of the C# enum (they are also listed in the manifest).
- _int_. Currently not implemented. Represented as a JSON integer, maps to a C# int or long.
- _array_ of the above. Represented as a JSON array, maps to a C# array.
- _dictionary_. Currently not implemented. Represented as a JSON object, maps to a C# `Dictionary<string,T>`.
- _component_. Currently not implemented. Represented as a JSON object with 2 fields: _name_:string and _settings_:object.

@TomFinley TomFinley Jun 4, 2018

Is this information current? I thought I saw some support for these. Certainly components are supported (not as SubComponent type specifically, but we can use dependency injection through the component factories). #Resolved

@sfilipi sfilipi Jun 4, 2018

Ah, my edits to this file are not reflected. Fixing that. #Closed

Member Author

corrected the component part. Double-checking on the dictionaries and indexing in arrays. I don't think we do that yet.


In reply to: 192866131 [](ancestors = 192866131)

## Variables
The following input/output types cannot be represented as a JSON value:
- _DataView_
@TomFinley TomFinley Jun 4, 2018

DataView [](start = 2, length = 10)

Is this usage intentional? There is no DataView, but there is an IDataView. Similar for file handles, the models, etc. #Closed

@TomFinley TomFinley Jun 4, 2018

Also if these are meant to be actual types, should they not be in ` backticks, since they're meant to be interpreted as code?


In reply to: 192868598 [](ancestors = 192868598)

- _FileHandle_
- _TransformModel_
- _PredictorModel_

These must be passed as _variables_. The variable is represented as a JSON string that begins with "$".
@TomFinley TomFinley Jun 4, 2018

"$" [](start = 99, length = 3)

For code like this I might prefer `$` to "$". #Closed

Note the following rules:

- A variable can appear in the _outputs_ only once per graph. That is, the variable can be 'assigned' only once.
- If the variable is present in _inputs_ of one node and in the _outputs_ of another node, this signifies the graph 'edge'.
The same variable can participate in many edges.
- If the variable is present only in _inputs_, but never in _outputs_, it is a _graph input_. All graph inputs must be provided before
a graph can be run.
- The variable has a type, which is the type of inputs (and, optionally, output) that it appears in. If the type of the variable is
ambiguous, TLC throws an exception.
@GalOshri GalOshri Jun 18, 2018

Change to ML.NET? #Resolved

- Circular references. The experiment graph is expected to be a DAG. If a circular dependency is detected, TLC throws an exception.
_Currently, this is done lazily: if we couldn't ever run a node because it's waiting for inputs, we throw._
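
The rules above can be checked mechanically. The following Python sketch is a hedged illustration, not the actual graph runner: `variables_in` and `check_graph` are hypothetical names, every JSON string starting with `$` is assumed to be a variable, and an index suffix such as `[3]` is ignored when extracting the variable name. Cycle detection follows the lazy approach described in the last rule: nodes are "run" as soon as their inputs are available, and an error is raised if some node can never run.

```python
def variables_in(value):
    """Yield variable names ('$x' -> 'x') found in a JSON input/output value."""
    if isinstance(value, str):
        if value.startswith("$"):
            # Drop an optional '[...]' suffix so '$v[3]' refers to variable 'v'.
            yield value[1:].split("[", 1)[0]
    elif isinstance(value, list):
        for item in value:
            yield from variables_in(item)
    elif isinstance(value, dict):
        for item in value.values():
            yield from variables_in(item)


def check_graph(nodes, provided):
    """Return a valid run order (node indices) or raise ValueError."""
    # Rule: a variable may appear in outputs only once per graph.
    producer = {}
    for i, node in enumerate(nodes):
        for var in variables_in(node.get("outputs", {})):
            if var in producer:
                raise ValueError(f"variable ${var} is assigned more than once")
            producer[var] = i

    # Rule: variables that are consumed but never produced are graph inputs.
    consumed = {v for node in nodes for v in variables_in(node.get("inputs", {}))}
    missing = (consumed - set(producer)) - set(provided)
    if missing:
        raise ValueError(f"graph inputs not provided: {sorted(missing)}")

    # Rule: lazy cycle detection. Run whatever is ready; fail if nothing is.
    available = set(provided)
    pending = list(range(len(nodes)))
    order = []
    while pending:
        ready = [i for i in pending
                 if all(v in available for v in variables_in(nodes[i].get("inputs", {})))]
        if not ready:
            raise ValueError("circular dependency: some nodes can never run")
        for i in ready:
            order.append(i)
            available.update(variables_in(nodes[i].get("outputs", {})))
        pending = [i for i in pending if i not in ready]
    return order
```

For example, two nodes where the second produces `$mid` from `$data0` and the first consumes `$mid` are ordered so that the producer runs first, regardless of their position in the array.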

### Variables for arrays and dictionaries
Variables can be defined for arrays and dictionaries, as long as the item types are valid variable types (the four types listed above).
They are treated the same way as regular 'scalar' variables.

If we want to reference an item of the collection, we can use the `[]` syntax:
- `$var[5]` denotes the element at index 5 of an array variable.
- `$var[foo]` and `$var['foo']` both denote the element with key 'foo' of a dictionary variable.
_This is not yet implemented._

Conversely, if we want to build a collection (array or dictionary) of variables, we can do it using JSON arrays and objects:
- `["$v1", "$v2", "$v3"]` denotes an array containing 3 variables.
- `{"foo": "$v1", "bar": "$v2"}` denotes a collection containing 2 key-value pairs.
_This is also not yet implemented._
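
Since the text marks parts of this syntax as not yet implemented, the following Python sketch only illustrates the intended semantics. The `resolve` function and the `values` dictionary (variable name to value) are hypothetical. The sketch assumes that a purely numeric subscript is a zero-based array index and that any other subscript, quoted or not, is a dictionary key.

```python
import re

# '$var', '$var[5]', '$var[foo]' or "$var['foo']"
_REF = re.compile(r"^\$(\w+)(?:\[(?:(\d+)|'([^']*)'|([^\]']+))\])?$")


def resolve(ref, values):
    """Resolve a JSON value made of variable references against `values`."""
    if isinstance(ref, list):
        # A JSON array of references builds an array of variables.
        return [resolve(item, values) for item in ref]
    if isinstance(ref, dict):
        # A JSON object of references builds a dictionary of variables.
        return {key: resolve(item, values) for key, item in ref.items()}
    match = _REF.match(ref)
    if match is None:
        raise ValueError(f"not a variable reference: {ref!r}")
    name, index, quoted_key, bare_key = match.groups()
    value = values[name]
    if index is not None:
        return value[int(index)]
    if quoted_key is not None:
        return value[quoted_key]
    if bare_key is not None:
        return value[bare_key]
    return value
```

With `values = {"d": {"foo": 1}}`, both `resolve("$d[foo]", values)` and `resolve("$d['foo']", values)` return the same element, matching the equivalence stated above.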

## Example of a JSON entry point manifest object, and the respective entry point graph node
Let's consider the following manifest snippet, describing an entry point _'CVSplit.Split'_:
```
@TomFinley TomFinley Jun 4, 2018

You have in the other file been using the javascript type on these code blocks. This is probably a good practice to carry over to this file. #Closed

{
    "name": "CVSplit.Split",
    "desc": "Split the dataset into the specified number of cross-validation folds (train and test sets)",
    "inputs": [
        {
            "name": "Data",
            "type": "DataView",
            "desc": "Input dataset",
            "required": true
        },
        {
            "name": "NumFolds",
            "type": "Int",
            "desc": "Number of folds to split into",
            "required": false,
            "default": 2
        },
        {
            "name": "StratificationColumn",
            "type": "String",
            "desc": "Stratification column",
            "aliases": [
                "strat"
            ],
            "required": false,
            "default": null
        }
    ],
    "outputs": [
        {
            "name": "TrainData",
            "type": {
                "kind": "Array",
                "itemType": "DataView"
            },
            "desc": "Training data (one dataset per fold)"
        },
        {
            "name": "TestData",
            "type": {
                "kind": "Array",
                "itemType": "DataView"
            },
            "desc": "Testing data (one dataset per fold)"
        }
    ]
}
```

As we can see, the entry point has 3 inputs (one of them required), and 2 outputs.
The following is a correct graph containing a call to this entry point:
```
{
    "nodes": [
        {
            "name": "CVSplit.Split",
            "inputs": {
                "Data": "$data1"
            },
            "outputs": {
                "TrainData": "$cv"
            }
        }
    ]
}
```
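
Why this graph is correct can be sketched in a few lines of Python. This is an illustration under stated assumptions, not ML.NET code: `MANIFEST` is an abbreviated copy of the `CVSplit.Split` manifest, and `effective_inputs` is a hypothetical helper that checks a node against a manifest and fills in the defaults of the optional inputs.

```python
# Abbreviated copy of the CVSplit.Split manifest (only the fields used here).
MANIFEST = {
    "name": "CVSplit.Split",
    "inputs": [
        {"name": "Data", "type": "DataView", "required": True},
        {"name": "NumFolds", "type": "Int", "required": False, "default": 2},
        {"name": "StratificationColumn", "type": "String",
         "required": False, "default": None},
    ],
    "outputs": [
        {"name": "TrainData", "type": {"kind": "Array", "itemType": "DataView"}},
        {"name": "TestData", "type": {"kind": "Array", "itemType": "DataView"}},
    ],
}

GRAPH = {
    "nodes": [
        {
            "name": "CVSplit.Split",
            "inputs": {"Data": "$data1"},
            "outputs": {"TrainData": "$cv"},
        }
    ]
}


def effective_inputs(node, manifest):
    """Validate a node against a manifest and return its inputs with defaults applied."""
    if node["name"] != manifest["name"]:
        raise ValueError("manifest describes a different entry point")
    declared_inputs = {p["name"]: p for p in manifest["inputs"]}
    declared_outputs = {p["name"] for p in manifest["outputs"]}
    given = node.get("inputs", {})

    unknown = set(given) - set(declared_inputs)
    if unknown:
        raise ValueError(f"unknown inputs: {sorted(unknown)}")
    missing = [n for n, p in declared_inputs.items() if p["required"] and n not in given]
    if missing:
        raise ValueError(f"missing required inputs: {missing}")
    unknown_outputs = set(node.get("outputs", {})) - declared_outputs
    if unknown_outputs:
        raise ValueError(f"unknown outputs: {sorted(unknown_outputs)}")

    # Optional inputs fall back to the manifest defaults unless the node overrides them.
    result = {n: p.get("default") for n, p in declared_inputs.items() if not p["required"]}
    result.update(given)
    return result
```

The node supplies the only required input (`Data`), so it passes the check. `NumFolds` and `StratificationColumn` take their manifest defaults, and the unused `TestData` output is simply not bound to a variable.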