Direct API: Static Typing of Data Pipelines #632

Closed
@TomFinley

Description


Currently in all iterations of the pipeline concept, whether they be based on the v0.1 idiom of LearningPipeline, or the #371 proposal where IDataView is directly created, or the refinement of that in #581, or the convenience constructors, or whatever, there is always this idea of a pipeline being a runtime-checked thing, where each stage has some output schema with typed columns indexed by a string name, and all of this is known only at runtime -- at compile time, all the compiler knows is that you have some estimator, or some data view, or something like that, but it has no idea what is in it.

This makes sense from a practical perspective, since there are many applications where you cannot know the schema until runtime. E.g.: loading a model from a file, or loading a Parquet file, you aren't going to know anything until the code actually runs. So we want the underlying system to remain dynamically typed to serve those scenarios, and I do not propose changing that. That said, there are some definite usability costs:

  • Typos on those column names are found at runtime, which is unfortunate.
  • Application of the wrong transform or learner is found at runtime.
  • Discoverability is an issue. You just sort of have to know what transforms are applicable to your case, which is somewhat difficult since, if we were to collect all the things you could apply to data (transform it, train a learner), there are probably about 100 of them. Intellisense will be of no help to you here, because at compile time, the only thing the language knows is that you have some estimator or data view.

It's sort of like working with Dictionary<string, object> as your central data structure, and an API that just takes Dictionary<string, object> everywhere. In a way that's arbitrarily powerful, but the language itself can give you no help at all about what you should do with it, which is kind of a pity since we have this nice statically typed language we're working in.

So: a statically typed helper API on top of this that was sufficiently powerful would help increase the confidence that if someone compiles it might run, and also give you some help in the form of proper intellisense of what you can do, while you are typing before you've run anything. Properly structured, if you had strong typing at the columnar level, nearly everything you can do can be automatically discoverable through intellisense. The documentation would correspondingly become a lot more focused.

The desire to have something like this is very old, but all prior attempts I recall ran into some serious problems sooner or later. In this issue I discuss such an API that I've been kicking around for a little bit, and so far it doesn't seem to have any show-stopping problems, at least none that I've discovered in my initial implementations.

The following proposal is built on top of #581. (For those seeking actual code, the current exploratory work in progress is based out of this branch, which in turn is a branch based off @Zruty0's branch here.)

Simple Example

It may be that the easiest way to explain the proposal is to show a simple example, then explain it. Here we train a sentiment classifier, though I've simplified the text featurizer's settings to just the diacritics option.

// We load two columns, the boolean "label" and the textual "sentimentText".
var text = TextLoader.Create(
    c => (label: c.LoadBool(0), sentimentText: c.LoadText(1)),
    sep: '\t', header: true);

// We apply the text featurizer transform to "sentimentText" producing the column "features".
var transformation = text.CreateTransform(r =>
    (r.label, features: r.sentimentText.TextFeaturizer(keepDiacritics: true)));

// We apply a learner to learn "label" given "features", which will in turn produce
// float "score", float "probability", and boolean "predictedLabel".
var training = transformation.CreateTransform(r =>
    r.label.TrainLinearClassification(r.features));

Alternatively, we might write this in a continuous, non-segmented form (where all the stages are merged into a single expression):

var pipeline = TextLoader.Create(
    c => (label: c.LoadBool(0), sentimentText: c.LoadText(1)),
    sep: '\t', header: true)
    .ExtendWithTransform(r => (r.label, features: r.sentimentText.TextFeaturizer(keepDiacritics: true)))
    .ExtendWithTransform(r => r.label.TrainLinearClassification(r.features));

or even the following:

var pipeline = TextLoader.Create(c =>
    c.LoadBool(0).TrainLinearClassification(c.LoadText(1).TextFeaturizer(keepDiacritics: true)));

Developer Story

Here's how I imagine this playing out for someone, maybe someone like me. So: first we have this TextLoader.Create method. (Feel free to suggest better names.)

  • The developer knows they have some data file, a TSV with two fields: a label, and some sentiment text. They write TextLoader.Create. The first argument is a delegate that takes a text-loader context as input and is responsible for producing a tuple out of things composed from that context. (Both the method signature and the XML doc commentary can explain this.) When they write c => c., intellisense hits them with what they can do... c.LoadBool, c.LoadDouble, c.LoadFloat, etc. These methods produce things like Scalar<bool>, Vector<bool>, Scalar<double>, etc., depending on which is called, what overload is used, and so on. The developer ultimately creates a value-tuple out of all this stuff, with the bool and text loading values.

  • Then they have this object, here called text. So they type text., and Intellisense pops up again. They know they want to do something with the data they've told the framework to load, and they see CreateTransform or ExtendWithTransform. Even if they haven't read a bit of documentation, that has enough of a name that they think, maybe, "heck, maybe this is where I belong." So they choose that.

  • Again they are hit with a delegate. But this time, the input type is the value-tuple created in the loader. They might type text.CreateTransform(r => r.label.TrainLinearClassification and try to feed in their sentimentText, but the compiler complains at them, saying: I want a Vector<float>, not a Scalar<string>. So they now try r.sentimentText.TextFeaturizer, since it has a promising-sounding name and also returns the Vector<float> that the classifier claims to want. In VS it looks something like this (click the image to see a zoomed-in version; sorry the screenshot is so wide):

[Screenshot: Visual Studio intellisense showing the extension methods available on the column tuple]

Given that setup, there is I think only one thing here that cannot plausibly be discovered via intellisense, or the XML docs that pop up with intellisense, and that is the fact that you would want to start the pipeline with something like TextLoader.Create. But I figure this will be so ubiquitous even in "example 1" material that we can get away with it. There's also the detail about training happening through a "label": unless their label happens to have the right type (Scalar<bool>), the training method simply won't show up for them. But someone reading documentation on the linear classifier would surely see that extension method and figure out what to do with it.

More Details

Now we drill a little bit more into the shape and design of this API.

PipelineColumn and its subclasses

As we saw in the example, many transformations are indicated by the type of data. For this we have the abstract class PipelineColumn, which is manifested to the user through the following abstract subclasses.

  • Scalar<> represents a scalar column of a particular type.
  • Vector<> represents a vector-valued column of a particular type, where the vector size is fixed and known (though we might not know the actual size at compile time).
  • VarVector<> is similar, but of unknown size.
  • Key<>, indicating a key type, which is essentially an enumeration into a set.
  • Key<,>, indicating a key type with known values, which is essentially an enumeration into a set.
  • VarKey<>, which is a key type of unknown cardinality. (These are fairly rare.)
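
To make the mapping concrete, here is a hedged sketch reusing the loader idiom from the examples in this issue; the comments note which subclass each call would yield (the exact set of Load* overloads is, of course, still up for discussion):

// Illustrative only: how loader calls surface the column types above.
var data = TextLoader.Create(c => (
    label: c.LoadBool(0),                 // Scalar<bool>
    sentimentText: c.LoadText(1),         // Scalar<string>
    numericFeatures: c.LoadFloat(2, 9))); // Vector<float> with a fixed, known number of slots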

ValueTuples of PipelineColumns

The pipeline columns are the smallest-granularity structures. Above that you have collections of these representing the values present at any given time, upon which you can apply more transformations. That value, as mentioned earlier, is a potentially nested value-tuple. By potentially nested, I mean that you can nest ValueTuples as deeply as you want. So all of the following are fine, if we imagine that a, b, and c are each some sort of PipelineColumn:

(a, b)
(a, x: (b, c))
a

In the first case the actual underlying data-view, when produced, would have two columns named a and b. In the second, there would be three columns: a, x.b, and x.c. In the last, since as near as I can tell there is no way to have a named single-element ValueTuple<>, I just picked the name Data for now. (Note that where value-tuples are present, the names of the items become the names of the output columns in the data-view schema.)

The reason for supporting nesting is, some estimators produce multiple columns (notably, in the example, the binary classification trainer produces three columns), and as far as I can tell there is no way to "unpack" a returned value-tuple into another value-tuple. Also it provides a convenient way to just bring along all the inputs, if we wanted to do so, by just assigning the input tuple itself as an item in the output tuple.
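
As a hedged sketch of that "bring the inputs along" pattern, building on the first example (the item name input is just illustrative):

// Illustrative only: nesting the entire input tuple as one item of the output tuple.
var withInputs = text.CreateTransform(r => (
    input: r, // produces columns named "input.label" and "input.sentimentText"
    features: r.sentimentText.TextFeaturizer(keepDiacritics: true)));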

The Pipeline Components

At a higher level than the columns and the (nested) tuples of columns, you have the objects that represent the pipeline components, which describe each step of what you are actually doing with these things. That is, those objects are mappings into those value-tuples, or between them. To return to the example with text, transformation, and training, these have the following types, in the sense that all the following statements in code would be true:

text is DataReaderEstimator<IMultiStreamSource,
    (Scalar<bool> label, Scalar<string> sentimentText)>;

transformation is Estimator<
    (Scalar<bool> label, Scalar<string> sentimentText),
    (Scalar<bool> label, Scalar<float> features)>;

training is Estimator<
    (Scalar<bool> label, Scalar<float> features),
    (Scalar<float> score, Scalar<float> probability, Scalar<bool> predictedLabel)>;

and also in those "omnibus" equivalents:

pipeline is DataReaderEstimator<IMultiStreamSource,
    (Scalar<float> score, Scalar<float> probability, Scalar<bool> predictedLabel)>;

One may note that the statically-typed API is strongly parallel to the structures proposed in #581. That is, for every core structure following the IEstimator idiom laid out in #581, I envision a strongly typed variant of each type. In the current working code, in fact, the objects actually implement those interfaces, but I might go to having them actually wrap them.

Like the underlying dynamically typed objects, they can be combined in the usual way to form cohesive pipelines. So, for example, one could take a DataReaderEstimator<TIn, TA> and an Estimator<TA, TB> and produce a DataReaderEstimator<TIn, TB>. (That is what was happening in the earlier example when I used ExtendWithTransform instead of CreateTransform: the new estimator is composed directly onto the reader-estimator.)
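
To make that type algebra concrete, here is a hedged sketch using the components from the first example; Append is a hypothetical name for the composition operation, and only the input and output shapes matter here:

// Illustrative only: composing a reader-estimator with an estimator.
DataReaderEstimator<IMultiStreamSource, (Scalar<bool> label, Scalar<string> sentimentText)> reader = text;
Estimator<(Scalar<bool> label, Scalar<string> sentimentText),
          (Scalar<bool> label, Scalar<float> features)> estimator = transformation;
// The result keeps the reader's input type and takes on the estimator's output shape:
// DataReaderEstimator<IMultiStreamSource, (Scalar<bool> label, Scalar<float> features)>
var composed = reader.Append(estimator);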

This duality is deliberate. While the usage of the static estimators will necessarily not resemble the dynamically typed estimators, based as it is on actual .NET types and identifiers, the structure that is being built up is an estimator based pipeline, and so will resemble it structurally. This duality enables one to use static-typing for as long as is convenient, then when done drop back down to the dynamically typed one. But you could also go in reverse, start with something dynamically typed -- perhaps a model loaded from a file -- essentially assert that this dynamically typed thing has a certain shape (which of course could only be checked at runtime), and then from then on continue with the statically-typed pipe. So as soon as the static typing stops being useful, there's no cliff -- you can just stop using it at that point, and continue dynamically.
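
As a hedged sketch of that "start dynamic, then assert the shape" direction, where AssertStatic and the Assert* helpers are hypothetical names and the runtime check is exactly where a mismatch would surface:

// Illustrative only: asserting a static shape over a dynamically typed data view.
IDataView dynamicData = LoadModelOutputFromSomewhere(); // hypothetical dynamically typed source
var typedData = dynamicData.AssertStatic(c => (
    label: c.AssertBool("label"),                // fails at runtime if "label" is not a boolean scalar
    features: c.AssertVectorFloat("features"))); // fails at runtime if "features" is not a float vector
// From here on, further transformations are statically checked again.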

However if you can stay in the statically typed world, that's fine. You can fit a strongly typed Estimator to get a strongly typed Transformer. You can then get a strongly typed DataView out of a strongly typed Transformer. In the end this is still just a veneer, kind of like the PredictionEngine stuff, but it's a veneer that has a strong likelihood of working.
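
A hedged sketch of staying statically typed end to end, using the components from the first example; the Fit/Read/Transform member names follow the IEstimator idiom of #581, the file names are placeholders, and whether the static wrappers expose exactly these signatures is an assumption:

// Illustrative only: fit and apply the strongly typed components.
var reader = text.Fit(new MultiFileSource("sentiment-train.tsv"));       // strongly typed data reader
var trainData = reader.Read(new MultiFileSource("sentiment-train.tsv")); // strongly typed data view
var featurizer = transformation.Fit(trainData);                          // strongly typed transformer
var model = training.Fit(featurizer.Transform(trainData));               // strongly typed transformer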

One or Two Implementation Details

The following is not something that most users will need to concern themselves with, and we won't go into too many details. However at least a loose idea of how the system works might help clear up some of the mystery.

The Scalar<>, Vector<>, etc. classes are abstract. The PipelineColumns created from the helper extension methods have actual concrete implementations, intended to be nested private classes in whatever estimator they're associated with. A user never sees those implementations. The component author is responsible for calling the protected constructor on those objects, so as to feed it the list of dependencies (what PipelineColumns need to exist before it can chain its own estimator), as well as a little factory object, for now called a "reconciler," that the analyzer can call once it has satisfied those dependencies.
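
Here is a very rough, hedged sketch of what the component author's side might look like; the nested class, the base-constructor signature (reconciler plus dependencies), and the reconciler base type are all illustrative assumptions, not the actual code:

// Illustrative only: how a text-featurizer extension method could surface a column.
public static class TextFeaturizerStaticExtensions
{
    // The concrete column; the user only ever sees it typed as Vector<float>.
    private sealed class OutColumn : Vector<float>
    {
        // Assumed protected base constructor: (reconciler, dependencies...).
        public OutColumn(Scalar<string> input, bool keepDiacritics)
            : base(new Reconciler(keepDiacritics), input) { }
    }

    // The factory the analyzer invokes once the input column is available; this is
    // what actually constructs the dynamically typed text-featurizer IEstimator.
    private sealed class Reconciler /* : whatever the reconciler base type ends up being */
    {
        private readonly bool _keepDiacritics;
        public Reconciler(bool keepDiacritics) { _keepDiacritics = keepDiacritics; }
        // ... builds and returns the underlying IEstimator when asked ...
    }

    public static Vector<float> TextFeaturizer(this Scalar<string> input, bool keepDiacritics = false)
        => new OutColumn(input, keepDiacritics);
}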

The analyzer itself takes the delegate. It constructs the input object, then pipes it through the delegate. In the case of an estimator, these constructed inputs are not the ones returned from any prior delegate (indeed we have no requirement that there be a prior delegate -- estimators can function as independent building blocks), but special instances made for that analysis task. The resulting output will be a value-tuple of PipelineColumns, and by tracing back the dependencies of each column we get the graph of dependencies.

The actual constructed inputs have no dependencies, and are assumed to just be there already. We then iteratively "resolve" dependencies: we take all columns that have their dependencies resolved, and take some subset of them that all share the same "reconciler." That reconciler is responsible for returning the actual IEstimator. Then anything that depends on those columns can be resolved. And so on.

In this way these delegates are declarative structures. Each extension method provides these PipelineColumn implementations as objects, but it is the analyzer that goes ahead and figures out in what sequence the factory methods will be called, with what names, and so on.

It might be clearer if we look at the actual engine:

https://github.com/TomFinley/machinelearning/blob/8e0298f64f0a9f439bb83426b09e54967065793b/src/Microsoft.ML.Core/StrongPipe/BlockMaker.cs#L13

The system mostly has fake objects everywhere as stand-ins right now, just to validate the approach, so if I were to actually run the code in the first example, I would get the following diagnostic output. (It should be relatively easy to trace the diagnostic output back to the code.)

Called CreateTransform !!!
Using input with name label
Using input with name sentimentText
Constructing TextTransform estimator!
    Will make 'features' out of 'sentimentText'
Exiting CreateTransform !!!

Called CreateTransform !!!
Using input with name label
Using input with name features
Constructing LinearBinaryClassification estimator!
    Will make 'score' out of 'label', 'features'
    Will make 'probability' out of 'label', 'features'
    Will make 'predictedLabel' out of 'label', 'features'
Exiting CreateTransform !!!

If I had another example, like this:

var text = TextLoader.Create(
    ctx => (
        label: ctx.LoadBool(0),
        text: ctx.LoadText(1),
        numericFeatures: ctx.LoadFloat(2, 9)
    ));

var transform = text.CreateTransform(r => (
    r.label,
    features: r.numericFeatures.ConcatWith(r.text.Tokenize().Dictionarize().BagVectorize())
    ));

var train = transform.CreateTransform(r =>
    r.label.TrainLinearClassification(r.features));

then the output looks a little something like this:

Called CreateTransform !!!
Using input with name label
Using input with name numericFeatures
Using input with name text
Constructing WordTokenize estimator!
    Will make '#Temp_0' out of 'text'
Constructing Term estimator!
    Will make '#Temp_1' out of '#Temp_0'
Constructing KeyToVector estimator!
    Will make '#Temp_2' out of '#Temp_1'
Constructing Concat estimator!
    Will make 'features' out of 'numericFeatures', '#Temp_2'
Exiting CreateTransform !!!

Called CreateTransform !!!
Using input with name label
Using input with name features
Constructing LinearBinaryClassification estimator!
    Will make 'score' out of 'label', 'features'
    Will make 'probability' out of 'label', 'features'
    Will make 'predictedLabel' out of 'label', 'features'
Exiting CreateTransform !!!

You can sort of trace through what the analyzer is doing as it resolves dependencies, constructs IEstimators, etc. (Obviously the real version won't have all those little console writelines everywhere.)

Stuff Not Covered

There's a lot of stuff I haven't yet talked about. We create these blocks; how do we mix and match them? What does the strongly typed Transformer or DataView look like? We talked about the text loader; what about sources that come from actual .NET objects? These we might cover in future editions of this, or in subsequent comments. But I think perhaps this writing has gone on long enough...

/cc @Zruty0 , @ericstj , @eerhardt , @terrajobst , @motus

Labels: API (Issues pertaining the friendly API), enhancement (New feature or request)
