Direct API: Static Typing of Data Pipelines #632

Closed
@TomFinley

Description


Currently in all iterations of the pipeline concept, whether they be based on the v0.1 idiom of LearningPipeline, or the #371 proposal where IDataView is directly created, or the refinement of that in #581, or the convenience constructors, or whatever, there is always this idea of a pipeline being a runtime-checked thing, where each stage has some output schema with typed columns indexed by a string name, and all of this is known only at runtime -- at compile time, all the compiler knows is that you have some estimator, or some data view, or something like that, but it has no idea what is in it.

This makes sense from a practical perspective, since there are many applications where you cannot know the schema until runtime. E.g.: loading a model from a file, or loading a Parquet file, you aren't going to know anything until the code actually runs. So we want the underlying system to remain dynamically typed to serve those scenarios, and I do not propose changing that. That said, there are some definite usability costs:

  • Typos on those column names are found at runtime, which is unfortunate.
  • Application of the wrong transform or learner is found at runtime.
  • Discoverability is an issue. You just sort of have to know what transforms are applicable to your case, which is somewhat difficult since, if we were to collect all the things you could apply to data (transform it, train a learner), there are probably about 100 of them. Intellisense will be of no help to you here, because at compile time, the only thing the language knows is that you have some estimator or data view.

It's sort of like working with Dictionary<string, object> as your central data structure, and an API that just takes Dictionary<string, object> everywhere. In a way that's arbitrarily powerful, but the language itself can give you no help at all about what you should do with it, which is kind of a pity since we have this nice statically typed language we're working in.

So: a statically typed helper API on top of this that was sufficiently powerful would help increase the confidence that if someone compiles it might run, and also give you some help in the form of proper intellisense of what you can do, while you are typing before you've run anything. Properly structured, if you had strong typing at the columnar level, nearly everything you can do can be automatically discoverable through intellisense. The documentation would correspondingly become a lot more focused.

The desire to have something like this is very old, but all prior attempts I recall ran into some serious problems sooner or later. In this issue I discuss such an API that I've been kicking around for a little bit, and so far it doesn't seem to have any show-stopping problems, at least none that I've discovered in my initial implementations.

The following proposal is built on top of #581. (For those seeking actual code, the current exploratory work in progress is based out of this branch, which in turn is a branch based off @Zruty0's branch here.)

Simple Example

It may be that the easiest way to explain the proposal is to show a simple example, then explain it. Here we train a sentiment classifier, though I've simplified the text featurizer's settings to just the diacritics option.

// We load two columns, the boolean "label" and the textual "sentimentText".
var text = TextLoader.Create(
    c => (label: c.LoadBool(0), sentimentText: c.LoadText(1)),
    sep: '\t', header: true);

// We apply the text featurizer transform to "sentimentText" producing the column "features".
var transformation = text.CreateTransform(r =>
    (r.label, features: r.sentimentText.TextFeaturizer(keepDiacritics: true)));

// We apply a learner to learn "label" given "features", which will in turn produce
// float "score", float "probability", and boolean "predictedLabel".
var training = transformation.CreateTransform(r =>
    r.label.TrainLinearClassification(r.features));

Alternatively, we might write this in a continuous, non-segmented form (where all the stages are merged into a single expression):

var pipeline = TextLoader.Create(
    c => (label: c.LoadBool(0), sentimentText: c.LoadText(1)),
    sep: '\t', header: true)
    .ExtendWithTransform(r => (r.label, features: r.sentimentText.TextFeaturizer(keepDiacritics: true)))
    .ExtendWithTransform(r => r.label.TrainLinearClassification(r.features));

or even the following:

var pipeline = TextLoader.Create(c =>
    c.LoadBool(0).TrainLinearClassification(c.LoadText(1).TextFeaturizer(keepDiacritics: true)));

Developer Story

Here's how I imagine this playing out for someone, maybe someone like me. So: first we have this TextLoader.Create method. (Feel free to suggest better names.)

  • The developer knows they have some data file, a TSV with two fields: a label, and some sentiment text. They write TextLoader.Create. The first argument is a delegate that takes a text-loader context as input and is responsible for producing a tuple out of things composed from that context. (Both the method signature and the XML doc commentary can explain this.) When they write c => c., intellisense hits them with what they can do... c.LoadBool, c.LoadDouble, c.LoadFloat, etc. These methods produce things like Scalar<bool>, Vector<bool>, Scalar<double>, etc., depending on which is called, what overload is used, and so on. The developer ultimately creates a value-tuple out of all this stuff, with the bool and text loading values.

  • Then they have this object, here called text. So they type text., and Intellisense pops up again. They know they want to do something with the data they've told the framework to load, and they see CreateTransform or ExtendWithTransform. Even if they haven't read a bit of documentation, that has enough of a name that they think, maybe, "heck, maybe this is where I belong." So they choose that.

  • Again they are hit with a delegate. But this time, the input type is the value-tuple created in the loader. They might type text.CreateTransform(r => r.label.TrainLinearClassification and try to feed in their sentimentText, but the compiler complains at them, saying: I want a Vector<float>, not a Scalar<string>. So they now try r.sentimentText.TextFeaturizer, since it has a promising-sounding name and also returns the Vector<float> that the classifier claims to want. In VS it looks something like this (click the image to see a zoomed-in version; sorry the screenshot is so wide):

[Screenshot: Visual Studio intellisense showing the extension methods available on the column tuple]

Given that setup, there is I think only one thing here that cannot plausibly be discovered via intellisense, or the XML docs that pop up with intellisense, and that is the fact that you would want to start the pipeline with something like TextLoader.Create. But I figure this will be so ubiquitous even in "example 1" material that we can get away with it. There's also the detail about training happening through a "label": unless their label happens to have the right type (Scalar<bool>), the training method simply won't show up for them. But someone reading documentation on the linear classifier would surely see that extension method and figure out what to do with it.

More Details

Now we drill a little bit more into the shape and design of this API.

PipelineColumn and its subclasses

As we saw in the example, many transformations are indicated by the type of data. For this we have the abstract class PipelineColumn, which is manifested to the user through the following abstract subclasses.

  • Scalar<> represents a scalar column of a particular type.
  • Vector<> represents a vector-valued column of a particular type, where the vector size is fixed and known (though we might not know the actual size at compile time).
  • VarVector<> is similar, but of unknown size.
  • Key<>, indicating a key type, which is essentially an enumeration into a set.
  • Key<,>, indicating a key type with known values, which is essentially an enumeration into a set.
  • VarKey<>, which is a key type of unknown cardinality. (These are fairly rare.)
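
To make the mapping concrete, here is a hedged sketch reusing the loader idiom from the examples in this issue; the comments note which subclass each call would yield (the exact set of Load* overloads is, of course, still up for discussion):

// Illustrative only: how loader calls surface the column types above.
var data = TextLoader.Create(c => (
    label: c.LoadBool(0),                 // Scalar<bool>
    sentimentText: c.LoadText(1),         // Scalar<string>
    numericFeatures: c.LoadFloat(2, 9))); // Vector<float> with a fixed, known number of slots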

ValueTuples of PipelineColumns

The pipeline columns are the smallest-granularity structures. Above that you have collections of these representing the values present at any given time, upon which you can apply more transformations. That value, as mentioned earlier, is a potentially nested value-tuple. By potentially nested, I mean that you can nest ValueTuples as deeply as you want. So all of the following are fine, if we imagine that a, b, and c are each some sort of PipelineColumn:

(a, b)
(a, x: (b, c))
a

In the first case the actual underlying data-view, when produced, would have two columns named a and b. In the second, there would be three columns: a, x.b, and x.c. In the last, since as near as I can tell there is no way to have a named single-element ValueTuple<>, I just picked the name Data for now. (Note that where value-tuples are present, the names of the items become the names of the output columns in the data-view schema.)

The reason for supporting nesting is, some estimators produce multiple columns (notably, in the example, the binary classification trainer produces three columns), and as far as I can tell there is no way to "unpack" a returned value-tuple into another value-tuple. Also it provides a convenient way to just bring along all the inputs, if we wanted to do so, by just assigning the input tuple itself as an item in the output tuple.
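
As a hedged sketch of that "bring the inputs along" pattern, building on the first example (the item name input is just illustrative):

// Illustrative only: nesting the entire input tuple as one item of the output tuple.
var withInputs = text.CreateTransform(r => (
    input: r, // produces columns named "input.label" and "input.sentimentText"
    features: r.sentimentText.TextFeaturizer(keepDiacritics: true)));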

The Pipeline Components

At a higher level than the columns and the (nested) tuples of columns, you have the objects that represent the pipeline components, which describe each step of what you are actually doing with these things. That is, those objects are mappings into those value-tuples, or between them. To return to the example with text, transformation, and training, these have the following types, in the sense that all the following statements in code would be true:

text is DataReaderEstimator<IMultiStreamSource,
    (Scalar<bool> label, Scalar<string> sentimentText)>;

transformation is Estimator<
    (Scalar<bool> label, Scalar<string> sentimentText),
    (Scalar<bool> label, Scalar<float> features)>;

training is Estimator<
    (Scalar<bool> label, Scalar<float> features),
    (Scalar<float> score, Scalar<float> probability, Scalar<bool> predictedLabel)>;

and also in those "omnibus" equivalents:

pipeline is DataReaderEstimator<IMultiStreamSource,
    (Scalar<float> score, Scalar<float> probability, Scalar<bool> predictedLabel)>;

One may note that the statically-typed API is strongly parallel to the structures proposed in #581. That is, for every core structure following the IEstimator idiom laid out in #581, I envision a strongly typed variant of each type. In the current working code, in fact, the objects actually implement those interfaces, but I might go to having them actually wrap them.

Like the underlying dynamically typed objects, they can be combined in the usual way to form cohesive pipelines. So, for example, one could take a DataReaderEstimator<TIn, TA> and an Estimator<TA, TB> and produce a DataReaderEstimator<TIn, TB>. (That is what was happening in the earlier example when I used ExtendWithTransform instead of CreateTransform: the new estimator is composed directly onto the reader-estimator.)
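
To make that type algebra concrete, here is a hedged sketch using the components from the first example; Append is a hypothetical name for the composition operation, and only the input and output shapes matter here:

// Illustrative only: composing a reader-estimator with an estimator.
DataReaderEstimator<IMultiStreamSource, (Scalar<bool> label, Scalar<string> sentimentText)> reader = text;
Estimator<(Scalar<bool> label, Scalar<string> sentimentText),
          (Scalar<bool> label, Scalar<float> features)> estimator = transformation;
// The result keeps the reader's input type and takes on the estimator's output shape:
// DataReaderEstimator<IMultiStreamSource, (Scalar<bool> label, Scalar<float> features)>
var composed = reader.Append(estimator);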

This duality is deliberate. While the usage of the static estimators will necessarily not resemble the dynamically typed estimators, based as it is on actual .NET types and identifiers, the structure that is being built up is an estimator based pipeline, and so will resemble it structurally. This duality enables one to use static-typing for as long as is convenient, then when done drop back down to the dynamically typed one. But you could also go in reverse, start with something dynamically typed -- perhaps a model loaded from a file -- essentially assert that this dynamically typed thing has a certain shape (which of course could only be checked at runtime), and then from then on continue with the statically-typed pipe. So as soon as the static typing stops being useful, there's no cliff -- you can just stop using it at that point, and continue dynamically.
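
As a hedged sketch of that "start dynamic, then assert the shape" direction, where AssertStatic and the Assert* helpers are hypothetical names and the runtime check is exactly where a mismatch would surface:

// Illustrative only: asserting a static shape over a dynamically typed data view.
IDataView dynamicData = LoadModelOutputFromSomewhere(); // hypothetical dynamically typed source
var typedData = dynamicData.AssertStatic(c => (
    label: c.AssertBool("label"),                // fails at runtime if "label" is not a boolean scalar
    features: c.AssertVectorFloat("features"))); // fails at runtime if "features" is not a float vector
// From here on, further transformations are statically checked again.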

However if you can stay in the statically typed world, that's fine. You can fit a strongly typed Estimator to get a strongly typed Transformer. You can then get a strongly typed DataView out of a strongly typed Transformer. In the end this is still just a veneer, kind of like the PredictionEngine stuff, but it's a veneer that has a strong likelihood of working.
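
A hedged sketch of staying statically typed end to end, using the components from the first example; the Fit/Read/Transform member names follow the IEstimator idiom of #581, the file names are placeholders, and whether the static wrappers expose exactly these signatures is an assumption:

// Illustrative only: fit and apply the strongly typed components.
var reader = text.Fit(new MultiFileSource("sentiment-train.tsv"));       // strongly typed data reader
var trainData = reader.Read(new MultiFileSource("sentiment-train.tsv")); // strongly typed data view
var featurizer = transformation.Fit(trainData);                          // strongly typed transformer
var model = training.Fit(featurizer.Transform(trainData));               // strongly typed transformer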

One or Two Implementation Details

The following is not something that most users will need to concern themselves with, and we won't go into too many details. However at least a loose idea of how the system works might help clear up some of the mystery.

The Scalar<>, Vector<>, etc. classes are abstract. The PipelineColumns created from the helper extension methods have actual concrete implementations, intended to be nested private classes in whatever estimator they're associated with. A user never sees those implementations. The component author is responsible for calling the protected constructor on those objects, so as to feed it the list of dependencies (what PipelineColumns need to exist before it can chain its own estimator), as well as a little factory object, for now called a "reconciler," that the analyzer can call once it has satisfied those dependencies.
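
Here is a very rough, hedged sketch of what the component author's side might look like; the nested class, the base-constructor signature (reconciler plus dependencies), and the reconciler base type are all illustrative assumptions, not the actual code:

// Illustrative only: how a text-featurizer extension method could surface a column.
public static class TextFeaturizerStaticExtensions
{
    // The concrete column; the user only ever sees it typed as Vector<float>.
    private sealed class OutColumn : Vector<float>
    {
        // Assumed protected base constructor: (reconciler, dependencies...).
        public OutColumn(Scalar<string> input, bool keepDiacritics)
            : base(new Reconciler(keepDiacritics), input) { }
    }

    // The factory the analyzer invokes once the input column is available; this is
    // what actually constructs the dynamically typed text-featurizer IEstimator.
    private sealed class Reconciler /* : whatever the reconciler base type ends up being */
    {
        private readonly bool _keepDiacritics;
        public Reconciler(bool keepDiacritics) { _keepDiacritics = keepDiacritics; }
        // ... builds and returns the underlying IEstimator when asked ...
    }

    public static Vector<float> TextFeaturizer(this Scalar<string> input, bool keepDiacritics = false)
        => new OutColumn(input, keepDiacritics);
}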

The analyzer itself takes the delegate. It constructs the input object, then pipes it through the delegate. In the case of an estimator, these constructed inputs are not the ones returned from any prior delegate (indeed we have no requirement that there be a prior delegate -- estimators can function as independent building blocks), but special instances made for that analysis task. The resulting output will be a value-tuple of PipelineColumns, and by tracing back the dependencies of each column we get the graph of dependencies.

The actual constructed inputs have no dependencies, and are assumed to just be there already. We then iteratively "resolve" dependencies: we take all columns that have their dependencies resolved, and take some subset of them that all share the same "reconciler." That reconciler is responsible for returning the actual IEstimator. Then anything that depends on those columns can be resolved. And so on.

In this way these delegates are declarative structures. Each extension method provides these PipelineColumn implementations as objects, but it is the analyzer that goes ahead and figures out in what sequence the factory methods will be called, with what names, and so on.

It might be clearer if we look at the actual engine:

https://github.com/TomFinley/machinelearning/blob/8e0298f64f0a9f439bb83426b09e54967065793b/src/Microsoft.ML.Core/StrongPipe/BlockMaker.cs#L13

The system mostly has fake objects everywhere as stand-ins right now, just to validate the approach, so if I were to actually run the code in the first example, I would get the following diagnostic output. (It should be relatively easy to trace the diagnostic output back to the code.)

Called CreateTransform !!!
Using input with name label
Using input with name sentimentText
Constructing TextTransform estimator!
    Will make 'features' out of 'sentimentText'
Exiting CreateTransform !!!

Called CreateTransform !!!
Using input with name label
Using input with name features
Constructing LinearBinaryClassification estimator!
    Will make 'score' out of 'label', 'features'
    Will make 'probability' out of 'label', 'features'
    Will make 'predictedLabel' out of 'label', 'features'
Exiting CreateTransform !!!

If I had another example, like this:

var text = TextLoader.Create(
    ctx => (
        label: ctx.LoadBool(0),
        text: ctx.LoadText(1),
        numericFeatures: ctx.LoadFloat(2, 9)
    ));

var transform = text.CreateTransform(r => (
    r.label,
    features: r.numericFeatures.ConcatWith(r.text.Tokenize().Dictionarize().BagVectorize())
    ));

var train = transform.CreateTransform(r =>
    r.label.TrainLinearClassification(r.features));

then the output looks a little something like this:

Called CreateTransform !!!
Using input with name label
Using input with name numericFeatures
Using input with name text
Constructing WordTokenize estimator!
    Will make '#Temp_0' out of 'text'
Constructing Term estimator!
    Will make '#Temp_1' out of '#Temp_0'
Constructing KeyToVector estimator!
    Will make '#Temp_2' out of '#Temp_1'
Constructing Concat estimator!
    Will make 'features' out of 'numericFeatures', '#Temp_2'
Exiting CreateTransform !!!

Called CreateTransform !!!
Using input with name label
Using input with name features
Constructing LinearBinaryClassification estimator!
    Will make 'score' out of 'label', 'features'
    Will make 'probability' out of 'label', 'features'
    Will make 'predictedLabel' out of 'label', 'features'
Exiting CreateTransform !!!

You can sort of trace through what the analyzer is doing as it resolves dependencies, constructs IEstimators, etc. (Obviously the real version won't have all those little console writelines everywhere.)

Stuff Not Covered

There's a lot of stuff I haven't yet talked about. We create these blocks; how do we mix and match them? What does the strongly typed Transformer or DataView look like? We talked about the text loader; what about sources that come from actual .NET objects? These we might cover in future editions of this, or in subsequent comments. But I think perhaps this writing has gone on long enough...

/cc @Zruty0 , @ericstj , @eerhardt , @terrajobst , @motus

Labels: API (Issues pertaining the friendly API), enhancement (New feature or request)
