API: Binary Classification Training Context #949

Closed
TomFinley opened this issue Sep 19, 2018 · 3 comments

@TomFinley
Contributor

TomFinley commented Sep 19, 2018

There seems to be something appealing about a convenience object whose purpose is to help "guide" people on the path to a successful experiment. So for example, someone might have a pipeline where they featurize, then learn, then evaluate on a test set. Each of these is of course naturally implemented in separate classes, which is good. But it also means that the ingredients necessary to compose a successful experiment are naturally spread hither and yon.

You might imagine that in addition to the components, there might be some sort of "task context" object, like for example, a BinaryClassifierContext. This might have common facilities: for example, a common way to "browse" binary classifier trainers, and to evaluate binary classification outputs.

There is something appealing about doing this:

var data = ...
var ctx = new BinaryClassificationContext();
var prediction = ctx.Trainers.FastTree(data, ...);
var metrics = ctx.Evaluate(prediction, ...);

vs. this

var data = ...
var prediction = new FastTreeBinaryClassifierEstimator(data, ...);
var eval = new BinaryClassifierEvaluator(...);
var metrics = eval.Evaluate(prediction, ...);

The latter case is certainly no less powerful, but if I imagine someone tooling around in IntelliSense, the sheer number of things you'll see after including the key namespaces and typing new is absolutely dizzying, whereas this context can be very, very focused.

In the case of static pipelines the story is a little better: we provide extension methods on Scalar<bool>. That is fine if you already know it, but if you don't, I see no reasonable way you could discover it without reading documentation and samples. But requiring knowledge only at the level of "if you want to do something related to binary classifiers, say new BinaryClassifierContext()" seems kind of reasonable to me.
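
As a rough illustration of why that is hard to discover (the names and signatures below are purely made up, not the actual static API), a trainer exposed only as an extension on the typed column looks something like this:

// Purely illustrative stand-ins for the static-pipeline column types.
public sealed class Scalar<T> { }
public sealed class Vector<T> { }

// The trainer is only reachable as an extension method on Scalar<bool>,
// so you will not stumble onto it unless you already know to look there.
public static class HypotheticalStaticTrainerExtensions
{
    public static Scalar<float> TrainFastTree(this Scalar<bool> label, Vector<float> features)
    {
        // Would wire up a FastTree binary classification trainer over (label, features).
        throw new System.NotImplementedException();
    }
}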

This hypothetical context object would contain at least two things. The first is a Trainers property; it must be an actual instance, because the only way external assemblies could "add" their learners to it is via extension methods. The second is one or more Evaluate methods to produce metrics.
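
A minimal sketch of the shape this could take -- every name here (the catalog type, the metrics type, the signatures) is illustrative only, not a committed API:

// Illustrative stand-ins so the sketch is self-contained.
public interface IHostEnvironment { }
public interface IDataView { }
public sealed class BinaryClassificationMetrics { /* AUC, accuracy, F1, ... */ }

public sealed class BinaryClassificationContext
{
    public BinaryClassificationContext(IHostEnvironment env)
    {
        Trainers = new TrainerCatalog(env);
    }

    // An instance property, so external assemblies can attach their own
    // learners to it via extension methods.
    public TrainerCatalog Trainers { get; }

    // One or more Evaluate methods producing task-specific metrics;
    // this one would simply delegate to BinaryClassifierEvaluator.
    public BinaryClassificationMetrics Evaluate(IDataView predictions,
        string labelColumn = "Label", string scoreColumn = "Score")
    {
        throw new System.NotImplementedException();
    }

    public sealed class TrainerCatalog
    {
        internal TrainerCatalog(IHostEnvironment env) { Environment = env; }

        // Exposed so extension methods can retrieve the environment they need
        // when constructing components. (In a real design this would likely be
        // internal, with friend-assembly access.)
        public IHostEnvironment Environment { get; }
    }
}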

These "objects" do have state in the sense that they must have an IHostEnvironment, but aside from this are more or less like "namespaces," with the important difference possibly that you can't have a top level function as a namespace. (Though perhaps we don't care about doing functions.) There was some thought that if we also defined pipelines through them we could avoid having environments in the dynamic pipelines altogether (as we already do for static pipelines), but how this would be accomplished is not clear to me.

Also, because the only reasonable way for components to add themselves is via extension methods, this Trainers object has to be an actual instance. Now, it needn't actually be instantiable -- one can call extension methods on a null reference as well as anything, so long as we don't want to get any information out of it -- but that is a little awkward. If we could just put extension methods on, say, a static class, that would be nice, but we can't.
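
Concretely, a trainer assembly might light up on that catalog roughly like this, building on the hypothetical catalog sketched above (the estimator class here is just a stand-in):

// Stand-in for whatever the real FastTree binary estimator ends up being called.
public sealed class FastTreeBinaryEstimator
{
    public FastTreeBinaryEstimator(IHostEnvironment env, string labelColumn, string featureColumn)
    {
        // ... construct the actual trainer here ...
    }
}

public static class FastTreeCatalogExtensions
{
    // An extension method needs an instance to hang off of, which is why
    // Trainers cannot simply be a static class.
    public static FastTreeBinaryEstimator FastTree(
        this BinaryClassificationContext.TrainerCatalog catalog,
        string labelColumn = "Label", string featureColumn = "Features")
    {
        // The IHostEnvironment "smuggled" through the catalog is what the
        // component actually needs in order to construct itself.
        return new FastTreeBinaryEstimator(catalog.Environment, labelColumn, featureColumn);
    }
}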

Work Item

The first thing I will do is create a binary classification training context object, as an exploration of the idea. If we like the idea, we can extend it to the other tasks as well.

@TomFinley
Contributor Author

Having the thing be an actual instance is pretty appealing, as it turns out, since you can then "smuggle" an IHostEnvironment to the object instantiating the component.

It may be that contexts can also carry things like estimator catalogs, e.g., a Transforms object. This might improve the discoverability of estimators.
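
For instance (names illustrative only), the context could carry a second catalog for estimators, populated the same way the trainer catalog is:

// Hypothetical: a Transforms catalog hanging off the context, so estimators
// become discoverable exactly the way trainers are.
public sealed class TransformsCatalog
{
    internal TransformsCatalog(IHostEnvironment env) { Environment = env; }
    public IHostEnvironment Environment { get; }
}

// Stand-in for a real normalizing estimator.
public sealed class NormalizerEstimator
{
    public NormalizerEstimator(IHostEnvironment env, string columnName) { /* ... */ }
}

// A transform assembly attaches its estimator via an extension method,
// just as trainer assemblies do on the Trainers catalog.
public static class NormalizerCatalogExtensions
{
    public static NormalizerEstimator Normalize(this TransformsCatalog catalog, string columnName)
        => new NormalizerEstimator(catalog.Environment, columnName);
}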

@Zruty0
Contributor

Zruty0 commented Sep 20, 2018

The idea that a 'task-specific context' serves as a 'catalog of things that are reasonable to do if you are solving this type of problem' appears powerful at first glance.
For instance, the dynamically typed API could have

pipeline.Add(context.Transforms.ImageOperations.GlobalContrastNormalizer("ImagePixels")); // env smuggled via context

and the statically typed API could have

pipeline.Append(row => (Image: context.Transforms.ImageOperations.GlobalContrastNormalizer(row.ImagePixels), ...));

and the same context can serve as a start of any (currently static) operation chain:

var reader = context.DataReaders.TextLoader(ctx => (label: ctx.LoadBool(0), features: ctx.LoadFloat(1, 10)));
// rather than static TextLoader.CreateReader

var estimator = context.LearningPipeline.StartWith(reader)
    .Append(row => (features: row.features.Normalize(), row.label));
// rather than extension reader.MakeNewPipeline


var trainedModel = estimator.Fit(reader.Read("data.tsv"));
trainedModel.SaveTo("model.zip");
var scoringModel = context.LoadModel("model.zip");
// rather than static TransformerChain.LoadFrom

I agree that this is very promising; let's take some baby steps in this direction to validate it.

@TomFinley
Contributor Author

TomFinley commented Sep 20, 2018

The one thing that is kind of annoying is that we now have three ways of doing practically everything. For the sake of statically typed pipelines we already had "two" ways of doing things, and now we have "three": one for the static pipelines and two for the dynamic ones.

So let's take evaluation as one example. We have the following ways:

  1. the BinaryClassifierEvaluator.Evaluate direct method,
  2. the BinaryClassificationContext.Evaluate helper method on the context that calls 1.
  3. the BinaryClassificationContext.Evaluate extension method on the context for static pipelines that also calls 1.

SDCA is another example:

  1. The actual constructor new LinearClassificationEtc.
  2. The extension method Sdca on BinaryClassificationContext.Trainers for dynamic structures that calls 1. (I actually did not write this yet.)
  3. The extension method Sdca on BinaryClassificationContext.Trainers for static structures that calls 1.

This improves the discoverability of components a lot, but it entails some degree of duplication, especially in documentation, which is really the most annoying part of this so far.
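
To make the duplication concrete, here is a sketch of the three SDCA surfaces, with names abbreviated/hypothetical and the static-pipeline variant elided since it is the same delegation over different column types:

// 1. The one real implementation: the trainer's constructor.
public sealed class LinearClassificationTrainer
{
    public LinearClassificationTrainer(IHostEnvironment env, string labelColumn, string featureColumn)
    {
        // ... the actual SDCA implementation lives behind this constructor ...
    }
}

// 2. The dynamic-pipeline surface: an extension on the context's trainer
//    catalog that simply forwards to the constructor above.
public static class SdcaDynamicExtensions
{
    public static LinearClassificationTrainer Sdca(
        this BinaryClassificationContext.TrainerCatalog catalog,
        string labelColumn = "Label", string featureColumn = "Features")
        => new LinearClassificationTrainer(catalog.Environment, labelColumn, featureColumn);
}

// 3. The static-pipeline Sdca would be a third method, also forwarding to the
//    constructor above, expressed over Scalar<bool>/Vector<float> columns
//    instead of column names.

All three entry points are thin, but each needs its own documentation, which is where the duplication mainly shows up.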
