Simple API to go from a trainer to something that can make predictions #560

Closed
eerhardt opened this issue Jul 19, 2018 · 5 comments
Labels
API Issues pertaining the friendly API

Comments

@eerhardt
Member

With the API proposal change in #371, the current proposed API looks something like:

... // load data and make transforms

// Train.
var trainer = new SdcaRegressionTrainer(env, new SdcaRegressionTrainer.Arguments());
var cached = new CacheDataView(env, trans, prefetch: null);
var trainRoles = TrainUtils.CreateExamples(cached, label: "Label", feature: "Features");
var pred = trainer.Train(trainRoles);

// Score.
IDataView scoredData = ScoreUtils.GetScorer(pred, trainRoles, env, trainRoles.Schema);

// Do a simple prediction.
var engine = env.CreatePredictionEngine<HousePriceData, HousePricePrediction>(scoredData);

HousePricePrediction prediction = engine.Predict(new HousePriceData()
....

Compare that with the equivalent code in the LearningPipeline API:

... // load data and make transforms

pipeline.Add(new StochasticDualCoordinateAscentRegressor());

PredictionModel<HousePriceData, HousePricePrediction> model = pipeline.Train<HousePriceData, HousePricePrediction>();

HousePricePrediction prediction = model.Predict(new HousePriceData()
....

You can see that the proposed API has what feels like boilerplate code (create a cache data view, create examples, call train, get a scorer, create an engine), whereas the LearningPipeline API simplifies this into roughly one call: call train and get something that can make predictions.

I don't think our simplest API example should have so many concepts in it. In my mind, the main concepts a new user needs to know about are:

  • Load data
  • Do transforms
  • Pick a learning algorithm
  • Train
  • Predict

However, in the current proposed API, they also need to think/learn about:

  • Whether or not they need a cached data view
  • Creating roles/examples
    • I'm not sure which it is. The type is RoleMappedData, but the method is named CreateExamples.
  • An IPredictor object
    • which doesn't make predictions
  • Calling GetScorer, which returns an IDataView that we call scoredData.
    • Is this object really data, or is it something that does scoring as implied by the method name: GetScorer?

In my opinion, this API is too complex and unintuitive for first-time users. We should investigate ways to make it simpler and see if we can come up with a design that has fewer concepts to learn when first interacting with ML.NET.

/cc @ericstj @TomFinley @Zruty0 @terrajobst

@Zruty0
Contributor

Zruty0 commented Jul 19, 2018

Let's try to break down what's good about the LearningPipeline example:

  • No need for a cryptic RoleMappedData.CreateExamples
    • Then again, I suppose you don't get to choose what features to train on, the column must be called Features, and the label column must be Label, which is a limitation.
  • No need to cache the data manually, the pipeline will itself be smart enough to cache the data.
  • The output of Train is directly capable of predicting

In this example we see the caching, but there are other similar 'smarts' that happen behind the scenes: auto-normalization and auto-calibration. Without a 'smart pipeline' component, users must know to do these things themselves.

Also, you are right about scoredData: this is a really interesting object. In fact, this is both the scorer and the scored training data.

  • If you call scoredData.AsEnumerable<HousePricePrediction>, you will actually get back the entire training dataset scored by the model.
  • If you feed scoredData to the PredictionEngine, you turn it into the conventional 'predictor' black box: it takes examples in and outputs predictions.

The LearningPipeline is a poor abstraction because it really hinders even some very 'basic' scenarios, like cross-validation and stacking.

But maybe we can get rid of the LearningPipeline and still retain the simplicity and the 'smarts'?

For example, make predictors more 'rigid', less 'flexible', by essentially auto-generating PredictionEngine out of them.

There is still the question of what we do with the 'smarts'. Maybe some form of TrainingContext object could help us, along the lines of:

var predictor = trainer.Train(new TrainingContext {
  TrainingData = trans,
  FeatureColumn = "Features",
  LabelColumn = "Label",
  CachingPolicy = CachingPolicy.IfNeeded,
  NormalizeData = NormalizePolicy.Always,
  CalibratePredictor = CalibratePolicy.Never
});

HousePricePrediction prediction = predictor.Predict(new HousePriceData());

Of course, we could keep a simpler extension method Train(IDataView trainingData), which would create a proper default TrainingContext, populate its TrainingData and call the real Train.
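
A minimal sketch of how that convenience overload might look. Every type here (the placeholder IDataView, TrainingContext, the policy enums, ITrainer, EchoTrainer) is a hypothetical stand-in defined only for this example, not the real ML.NET API, and the default values are assumptions, since nothing above says what the defaults should be:

```csharp
using System;

// All types below are hypothetical stand-ins for this sketch, not the real ML.NET API.
public interface IDataView { }
public sealed class EmptyData : IDataView { }

public enum CachingPolicy { Never, IfNeeded, Always }
public enum NormalizePolicy { Never, IfNeeded, Always }
public enum CalibratePolicy { Never, IfNeeded, Always }

public sealed class TrainingContext
{
    public IDataView TrainingData { get; set; }
    // Assumed defaults, chosen only for illustration.
    public string FeatureColumn { get; set; } = "Features";
    public string LabelColumn { get; set; } = "Label";
    public CachingPolicy CachingPolicy { get; set; } = CachingPolicy.IfNeeded;
    public NormalizePolicy NormalizeData { get; set; } = NormalizePolicy.IfNeeded;
    public CalibratePolicy CalibratePredictor { get; set; } = CalibratePolicy.IfNeeded;
}

public interface ITrainer<TPredictor>
{
    // The "real" Train: everything it needs arrives through the context.
    TPredictor Train(TrainingContext context);
}

public static class TrainerExtensions
{
    // The simple overload: build a default context, set its data, call the real Train.
    public static TPredictor Train<TPredictor>(this ITrainer<TPredictor> trainer, IDataView trainingData)
    {
        if (trainer == null) throw new ArgumentNullException(nameof(trainer));
        if (trainingData == null) throw new ArgumentNullException(nameof(trainingData));
        return trainer.Train(new TrainingContext { TrainingData = trainingData });
    }
}

// Toy trainer that returns the context it was given, so the defaults can be inspected.
public sealed class EchoTrainer : ITrainer<TrainingContext>
{
    public TrainingContext Train(TrainingContext context) => context;
}
```

With this shape, `new EchoTrainer().Train(new EmptyData())` goes through the extension method, and a caller who needs control can still pass a fully specified TrainingContext to the same trainer.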

Anyway, to recap, I see two separate issues:

  • We need to get rid of extra concepts that obscure the main use case: RoleMappedData, Scorer, and PredictionEngine seem to be the main offenders.
  • We need a way to invoke the same 'smarts' as LearningPipeline did.

@Zruty0
Contributor

Zruty0 commented Jul 20, 2018

We talked about this some more. We agree on many things and still disagree on some.
What seems to be universally accepted is:

The Train method will produce an object capable of making predictions.

A simple way to do this is to compose together the IPredictor (produced by the old Train), the Scorer (created by GetScorer) and the prediction engine, and make the composition known as the 'prediction model'.

It will be capable of predictions, but it will also allow the user to inspect the individual pieces and, as necessary, manufacture a new 'prediction model' with some tweaks.
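
To make the composition idea concrete, here is a rough sketch. LinearPredictor, LinearScorer and PredictionModel are hypothetical types invented for this example (a toy linear model, not the real IPredictor or scorer), and the two delegates stand in for whatever the prediction engine does to map typed objects to and from columns:

```csharp
using System;

// Hypothetical stand-ins for illustration only; not the real ML.NET types.

// Plays the role of the predictor: it holds learned parameters and nothing else.
public sealed class LinearPredictor
{
    public float[] Weights { get; }
    public float Bias { get; }
    public LinearPredictor(float[] weights, float bias) { Weights = weights; Bias = bias; }
}

// Plays the role of the scorer: it applies a predictor to a feature vector.
public sealed class LinearScorer
{
    private readonly LinearPredictor _predictor;
    public LinearScorer(LinearPredictor predictor) { _predictor = predictor; }

    public float Score(float[] features)
    {
        float sum = _predictor.Bias;
        for (int i = 0; i < features.Length; i++)
            sum += _predictor.Weights[i] * features[i];
        return sum;
    }
}

// The composition: capable of predictions, but the pieces stay inspectable.
public sealed class PredictionModel<TIn, TOut>
{
    private readonly Func<TIn, float[]> _toFeatures; // typed input -> feature vector
    private readonly Func<float, TOut> _toOutput;    // raw score -> typed output

    public LinearPredictor Predictor { get; }
    public LinearScorer Scorer { get; }

    public PredictionModel(LinearPredictor predictor, Func<TIn, float[]> toFeatures, Func<float, TOut> toOutput)
    {
        Predictor = predictor;
        Scorer = new LinearScorer(predictor);
        _toFeatures = toFeatures;
        _toOutput = toOutput;
    }

    public TOut Predict(TIn example) => _toOutput(Scorer.Score(_toFeatures(example)));

    // Manufacture a new model with one piece swapped out; the original is unchanged.
    public PredictionModel<TIn, TOut> WithPredictor(LinearPredictor predictor) =>
        new PredictionModel<TIn, TOut>(predictor, _toFeatures, _toOutput);
}
```

The point of the sketch is only the shape: Predict is the one call a new user needs, while Predictor, Scorer and WithPredictor remain available for anyone who wants to look inside or tweak a piece.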

The still unresolved questions are:

  • Should we somehow fold the RoleMappedData into Train? I think we may get away with it, if we try to make trainers into 'estimators' that check their input schema before seeing the data.
  • Can we get away with merging the IPredictor and IScorer into one object, opaque to the API? @TomFinley believes this is a mistake, but we agreed to try and sketch a prototype of how this could work, and then look at how terrible the consequences would be.
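
For the first question, a schema-checking 'estimator' could look roughly like the sketch below. Schema, RegressionEstimator and FindSchemaErrors are hypothetical names made up for this example, and the expected column types (a float vector for features, a float for the label) are assumptions:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-ins for illustration only; not the real ML.NET types.

// A schema is just column names and types; no data is needed to build one.
public sealed class Schema
{
    private readonly Dictionary<string, Type> _columns;
    public Schema(IDictionary<string, Type> columns) { _columns = new Dictionary<string, Type>(columns); }
    public bool TryGetColumnType(string name, out Type type) => _columns.TryGetValue(name, out type);
}

public sealed class RegressionEstimator
{
    private readonly string _featureColumn;
    private readonly string _labelColumn;

    // The column roles are fixed at construction, so no separate role mapping is needed.
    public RegressionEstimator(string featureColumn = "Features", string labelColumn = "Label")
    {
        _featureColumn = featureColumn;
        _labelColumn = labelColumn;
    }

    // Validates the input schema before any data is read. An empty list means compatible.
    public IReadOnlyList<string> FindSchemaErrors(Schema schema)
    {
        var errors = new List<string>();
        Check(schema, _featureColumn, typeof(float[]), errors);
        Check(schema, _labelColumn, typeof(float), errors);
        return errors;
    }

    private static void Check(Schema schema, string name, Type expected, List<string> errors)
    {
        if (!schema.TryGetColumnType(name, out Type actual))
            errors.Add($"Column '{name}' is missing.");
        else if (actual != expected)
            errors.Add($"Column '{name}' is {actual.Name}, expected {expected.Name}.");
    }
}
```

Because the estimator already knows its feature and label column names, a training call could take plain data and fail fast on a bad schema, which is one way the RoleMappedData step might disappear from the user's code.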

@Zruty0
Contributor

Zruty0 commented Jul 20, 2018

What obviously can NOT happen is making PredictionEngine and Predictor the same object: the PredictionEngine is not thread-safe, whereas the predictor and scorer are immutable and therefore thread-safe.

@shauheen added the API (Issues pertaining the friendly API) label on Jul 23, 2018
@Zruty0
Contributor

Zruty0 commented Aug 25, 2018

After the changes to the API, the example now looks akin to:

var trainer = new LinearClassificationTrainer(env, new LinearClassificationTrainer.Arguments { }, "Features", "Label");
var model = trainer.Fit(trainData);
var predictor = model.MakePredictionFunction<ExampleClass, PredictionClass>();
PredictionClass prediction = predictor.Predict(new ExampleClass(...));

This eliminates the intermediate concepts of CacheDataView, RoleMappedData, scorer and predictor.

We still have the distinction between model (a trained thing that can be saved/loaded) and prediction function (a model-derived thread-unsafe object that delivers predictions).
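
One way to live with that distinction in multi-threaded code is to share the model and give each thread its own prediction function. The sketch below uses hypothetical stand-ins (TrainedModel, PredictionFunction, BatchPredictor, and a non-generic MakePredictionFunction that only mirrors the name used above); the scratch buffer is an assumed reason for the thread-unsafety, chosen to make the example concrete:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-ins for illustration only; not the real ML.NET types.

// Immutable after construction, so one instance can be shared across threads.
public sealed class TrainedModel
{
    private readonly float[] _weights;
    public TrainedModel(float[] weights) { _weights = (float[])weights.Clone(); }
    public int FeatureCount => _weights.Length;

    public float Score(float[] features)
    {
        float sum = 0f;
        for (int i = 0; i < _weights.Length; i++) sum += _weights[i] * features[i];
        return sum;
    }

    public PredictionFunction MakePredictionFunction() => new PredictionFunction(this);
}

// Reuses a scratch buffer between calls, which is what makes it unsafe to share.
public sealed class PredictionFunction
{
    private readonly TrainedModel _model;
    private readonly float[] _buffer;

    public PredictionFunction(TrainedModel model)
    {
        _model = model;
        _buffer = new float[model.FeatureCount];
    }

    public float Predict(float x)
    {
        // Featurize into the shared buffer: x, x^2, x^3, ...
        float power = 1f;
        for (int i = 0; i < _buffer.Length; i++) { power *= x; _buffer[i] = power; }
        return _model.Score(_buffer);
    }
}

public static class BatchPredictor
{
    // One shared model, one prediction function per worker thread.
    public static float[] PredictAll(TrainedModel model, float[] inputs)
    {
        var results = new float[inputs.Length];
        using (var perThread = new ThreadLocal<PredictionFunction>(model.MakePredictionFunction))
        {
            Parallel.For(0, inputs.Length, i => results[i] = perThread.Value.Predict(inputs[i]));
        }
        return results;
    }
}
```

Creating a prediction function is cheap in this toy version; if it were expensive in practice, a pool of prediction functions would be the more natural design than one per thread.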

I think at this point we should close this issue. @eerhardt, what are your thoughts?

@eerhardt
Member Author

(Somehow this slipped under my radar.)

Yes, I believe the current API sufficiently solves this issue. Closing.

@ghost locked as resolved and limited conversation to collaborators on Mar 29, 2022
3 participants