API: Binary Classification Training Context #949

Closed
TomFinley opened this issue Sep 19, 2018 · 3 comments

@TomFinley
Contributor

TomFinley commented Sep 19, 2018

There seems to be something appealing about a convenience object whose purpose is to help "guide" people on the path to a successful experiment. So for example, someone might have a pipeline where they featurize, then learn, then evaluate on a test set. Each of these is of course naturally implemented in separate classes, which is good. But it also means that the ingredients necessary to compose a successful experiment are naturally spread hither and yon.

You might imagine that in addition to the components, there might be some sort of "task context" object, like for example, a BinaryClassifierContext. This might have common facilities: for example, a common way to "browse" binary classifier trainers, and to evaluate binary classification outputs.

There is something appealing about doing this:

var data = ...
var ctx = new BinaryClassificationContext();
var prediction = ctx.Trainers.FastTree(data, ...);
var metrics = ctx.Evaluate(prediction, ...);

vs. this

var data = ...
var prediction = new FastTreeBinaryClassifierEstimator(data, ...);
var eval = new BinaryClassifierEvaluator(...);
var metrics = eval.Evaluate(prediction, ...);

The latter case is certainly no less powerful, but if I imagine someone tooling around in IntelliSense, the sheer number of things you'll see after including the key namespaces and typing new is absolutely dizzying, whereas this context can be very, very focused.

In the case of static pipelines the story is a little better: we provide extension methods on Scalar<bool>. That is fine if you already know it, but if you don't, I see no reasonable way you could discover it without reading documentation and samples. But requiring knowledge only at the level of "if you want to do something related to binary classifiers, say new BinaryClassifierContext()" seems kind of reasonable to me.
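
As a rough illustration of why that is hard to discover (the names and signatures below are purely made up, not the actual static API), a trainer exposed only as an extension on the typed column looks something like this:

// Purely illustrative stand-ins for the static-pipeline column types.
public sealed class Scalar<T> { }
public sealed class Vector<T> { }

// The trainer is only reachable as an extension method on Scalar<bool>,
// so you will not stumble onto it unless you already know to look there.
public static class HypotheticalStaticTrainerExtensions
{
    public static Scalar<float> TrainFastTree(this Scalar<bool> label, Vector<float> features)
    {
        // Would wire up a FastTree binary classification trainer over (label, features).
        throw new System.NotImplementedException();
    }
}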

This hypothetical context object would contain at least two things. The first is a Trainers property; it must be an actual instance, because the only way external assemblies could "add" their learners to it is via extension methods. The second is one or more Evaluate methods to produce metrics.
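
A minimal sketch of the shape this could take -- every name here (the catalog type, the metrics type, the signatures) is illustrative only, not a committed API:

// Illustrative stand-ins so the sketch is self-contained.
public interface IHostEnvironment { }
public interface IDataView { }
public sealed class BinaryClassificationMetrics { /* AUC, accuracy, F1, ... */ }

public sealed class BinaryClassificationContext
{
    public BinaryClassificationContext(IHostEnvironment env)
    {
        Trainers = new TrainerCatalog(env);
    }

    // An instance property, so external assemblies can attach their own
    // learners to it via extension methods.
    public TrainerCatalog Trainers { get; }

    // One or more Evaluate methods producing task-specific metrics;
    // this one would simply delegate to BinaryClassifierEvaluator.
    public BinaryClassificationMetrics Evaluate(IDataView predictions,
        string labelColumn = "Label", string scoreColumn = "Score")
    {
        throw new System.NotImplementedException();
    }

    public sealed class TrainerCatalog
    {
        internal TrainerCatalog(IHostEnvironment env) { Environment = env; }

        // Exposed so extension methods can retrieve the environment they need
        // when constructing components. (In a real design this would likely be
        // internal, with friend-assembly access.)
        public IHostEnvironment Environment { get; }
    }
}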

These "objects" do have state in the sense that they must have an IHostEnvironment, but aside from this are more or less like "namespaces," with the important difference possibly that you can't have a top level function as a namespace. (Though perhaps we don't care about doing functions.) There was some thought that if we also defined pipelines through them we could avoid having environments in the dynamic pipelines altogether (as we already do for static pipelines), but how this would be accomplished is not clear to me.

Also, because the only reasonable way for components to add themselves is via extension methods, this Trainers object has to be an actual instance. Now, it needn't actually be instantiable -- one can call extension methods on a null reference as well as anything, so long as we don't want to get any information out of it -- but that is a little awkward. If we could just put extension methods on, say, a static class, that would be nice, but we can't.
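
Concretely, a trainer assembly might light up on that catalog roughly like this, building on the hypothetical catalog sketched above (the estimator class here is just a stand-in):

// Stand-in for whatever the real FastTree binary estimator ends up being called.
public sealed class FastTreeBinaryEstimator
{
    public FastTreeBinaryEstimator(IHostEnvironment env, string labelColumn, string featureColumn)
    {
        // ... construct the actual trainer here ...
    }
}

public static class FastTreeCatalogExtensions
{
    // An extension method needs an instance to hang off of, which is why
    // Trainers cannot simply be a static class.
    public static FastTreeBinaryEstimator FastTree(
        this BinaryClassificationContext.TrainerCatalog catalog,
        string labelColumn = "Label", string featureColumn = "Features")
    {
        // The IHostEnvironment "smuggled" through the catalog is what the
        // component actually needs in order to construct itself.
        return new FastTreeBinaryEstimator(catalog.Environment, labelColumn, featureColumn);
    }
}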

Work Item

The first thing I will do is create a binary classification training context object, as an exploration of the idea. If we like the idea, we can extend it to the other tasks as well.

@TomFinley
Contributor Author

Having the thing be an actual instance is pretty appealing, as it turns out, since you can then "smuggle" an IHostEnvironment to the object instantiating the component.

It may be that contexts can also carry things like estimator catalogs, e.g., a Transforms object. This might improve the discoverability of estimators.
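
For instance (names illustrative only), the context could carry a second catalog for estimators, populated the same way the trainer catalog is:

// Hypothetical: a Transforms catalog hanging off the context, so estimators
// become discoverable exactly the way trainers are.
public sealed class TransformsCatalog
{
    internal TransformsCatalog(IHostEnvironment env) { Environment = env; }
    public IHostEnvironment Environment { get; }
}

// Stand-in for a real normalizing estimator.
public sealed class NormalizerEstimator
{
    public NormalizerEstimator(IHostEnvironment env, string columnName) { /* ... */ }
}

// A transform assembly attaches its estimator via an extension method,
// just as trainer assemblies do on the Trainers catalog.
public static class NormalizerCatalogExtensions
{
    public static NormalizerEstimator Normalize(this TransformsCatalog catalog, string columnName)
        => new NormalizerEstimator(catalog.Environment, columnName);
}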

@Zruty0
Contributor

Zruty0 commented Sep 20, 2018

The idea that a 'task-specific context' serves as a 'catalog of things that are reasonable to do if you are solving this type of problem' appears powerful at first glance.
For instance, the dynamically typed API could have

pipeline.Add(context.Transforms.ImageOperations.GlobalContrastNormalizer("ImagePixels")); // env smuggled via context

and the statically typed API could have

pipeline.Append(row => (Image: context.Transforms.ImageOperations.GlobalContrastNormalizer(row.ImagePixels), ...));

and the same context can serve as a start of any (currently static) operation chain:

var reader = context.DataReaders.TextLoader(ctx => (label: ctx.LoadBool(0), features: ctx.LoadFloat(1, 10)));
// rather than static TextLoader.CreateReader

var estimator = context.LearningPipeline.StartWith(reader)
    .Append(row => (features: row.features.Normalize(), row.label));
// rather than extension reader.MakeNewPipeline


var trainedModel = estimator.Fit(reader.Read("data.tsv"));
trainedModel.SaveTo("model.zip");
var scoringModel = context.LoadModel("model.zip");
// rather than static TransformerChain.LoadFrom

I agree that this is very promising; let's take some baby steps in this direction to validate it.

@TomFinley
Contributor Author

TomFinley commented Sep 20, 2018

The one thing that is kind of annoying is that we now have three ways of doing practically everything. For the sake of statically typed pipelines we already had "two" ways of doing things, and now we have "three": one for the static pipelines and two for the dynamic ones.

So let's take evaluation as one example. We have the following ways:

  1. the BinaryClassifierEvaluator.Evaluate direct method,
  2. the BinaryClassificationContext.Evaluate helper method on the context that calls 1.
  3. the BinaryClassificationContext.Evaluate extension method on the context for static pipelines that also calls 1.

SDCA is another example:

  1. The actual constructor new LinearClassificationEtc.
  2. The extension method Sdca on BinaryClassificationContext.Trainers for dynamic structures that calls 1. (I actually did not write this yet.)
  3. The extension method Sdca on BinaryClassificationContext.Trainers for static structures that calls 1.

This improves the discoverability of components a lot, but it entails some degree of duplication, especially in documentation, which is really the most annoying part of this so far.
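
To make the duplication concrete, here is a sketch of the three SDCA surfaces, with names abbreviated/hypothetical and the static-pipeline variant elided since it is the same delegation over different column types:

// 1. The one real implementation: the trainer's constructor.
public sealed class LinearClassificationTrainer
{
    public LinearClassificationTrainer(IHostEnvironment env, string labelColumn, string featureColumn)
    {
        // ... the actual SDCA implementation lives behind this constructor ...
    }
}

// 2. The dynamic-pipeline surface: an extension on the context's trainer
//    catalog that simply forwards to the constructor above.
public static class SdcaDynamicExtensions
{
    public static LinearClassificationTrainer Sdca(
        this BinaryClassificationContext.TrainerCatalog catalog,
        string labelColumn = "Label", string featureColumn = "Features")
        => new LinearClassificationTrainer(catalog.Environment, labelColumn, featureColumn);
}

// 3. The static-pipeline Sdca would be a third method, also forwarding to the
//    constructor above, expressed over Scalar<bool>/Vector<float> columns
//    instead of column names.

All three entry points are thin, but each needs its own documentation, which is where the duplication mainly shows up.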
