Add Ranking AutoML Sample #852


Status: Merged (17 commits into dotnet:master, Jan 1, 2021)

Conversation

jwood803 (Contributor):

Add Ranking sample for AutoML.

Update for this issue

jwood803 (Contributor, Author):

@justinormont Feel free to let me know if anything is missing or should be changed in this sample.


ConsoleHelper.PrintRankingMetrics(bestRun.TrainerName, metrics);

// STEP 6: Save/persist the trained model to a .ZIP file
justinormont (Contributor):

If you're up for it, you could demonstrate thrice training.

The general idea is to progressively refit the best-found pipeline on merged data from all of the datasets, in a three-step process:

Step | Fitting data      | Scoring data | Notes
-----|-------------------|--------------|----------------------------------------------------------
1    | train             | valid        | Done within AutoML to find the best pipeline
2    | train+valid       | test         | Gives final metrics, which estimate the model's performance in production
3    | train+valid+test  | N/A          | Gives the final model to launch to production

I have an example created for another bug. Full example: https://dotnetfiddle.net/nWpCkP
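For orientation, step 1 of the table is the AutoML experiment itself. A minimal sketch of that step follows; the time budget and the trainDataView/validationDataView variable names are illustrative, and the fiddle's actual setup may differ:

// Step 1: AutoML fits candidate pipelines on the train split and scores
// them on the validation split to pick the best pipeline.
var experimentResult = mlContext.Auto()
    .CreateRegressionExperiment(maxExperimentTimeInSeconds: 600)
    .Execute(trainDataView, validationDataView);

The excerpt from the fiddle then continues with steps 2 and 3: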

// Re-fit best pipeline on train and validation data, to produce 
// a model that is trained on as much data as is available while
// still having test data for the final estimate of how well the
// model will do in production.
Console.WriteLine("\n===== Refitting on train+valid and evaluating model's rsquared with test data =====");
var TrainPlusValidationDataView = textLoader.Load(new MultiFileSource(TrainDataPath, ValidationDataPath));
var refitModel1 = experimentResult.BestRun.Estimator.Fit(TrainPlusValidationDataView);
IDataView predictionsRefitOnTrainPlusValidation = refitModel1.Transform(TestDataView);
var metricsRefitOnTrainPlusValidation = mlContext.Regression.Evaluate(predictionsRefitOnTrainPlusValidation, labelColumnName: "Label", scoreColumnName: "Score");
Console.WriteLine("|" + $"{"-",-4} {experimentResult.BestRun.TrainerName,-35} {metricsRefitOnTrainPlusValidation?.RSquared ?? double.NaN,8:F4} {metricsRefitOnTrainPlusValidation?.MeanAbsoluteError ?? double.NaN,13:F2} {metricsRefitOnTrainPlusValidation?.MeanSquaredError ?? double.NaN,12:F2} {metricsRefitOnTrainPlusValidation?.RootMeanSquaredError ?? double.NaN,8:F2} {"-",9}".PadRight(112) + "|");

// Re-fit best pipeline on train, validation, and test data, to 
// produce a model that is trained on as much data as is available.
// This is the final model that can be deployed to production.
Console.WriteLine("\n===== Refitting on train+valid+test to get the final model to launch to production =====");
var TrainPlusValidationPlusTestDataView = textLoader.Load(new MultiFileSource(TrainDataPath, ValidationDataPath, TestDataPath));
var refitModel2 = experimentResult.BestRun.Estimator.Fit(TrainPlusValidationPlusTestDataView);
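The excerpt ends with the step-3 refit; the natural last step is persisting the final model, roughly like this sketch (the output path here is illustrative):

// Persist the final production model to a .ZIP file for deployment.
mlContext.Model.Save(refitModel2, TrainPlusValidationPlusTestDataView.Schema, "RankingModel.zip");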

jwood803 (Contributor, Author):

This is an interesting concept! I'll definitely give this a go.

jwood803 (Contributor, Author):

Thanks for letting me know about this! I think a video on this would be good to make. 😄

justinormont (Contributor):

This would make a great video. This process is most beneficial when the training set is small, or the dataset is split by time.

Ranking datasets are often split by time, with the oldest data in the training split, newer data in the validation split, and the newest in the test split. There are two main gains from using time splits: (1) removing leakage (see "time leakage" in https://en.wikipedia.org/wiki/Leakage_(machine_learning)), and (2) the most valuable data is the most up-to-date data, since user trends shift over time.

The newest data is the most representative of the live production traffic the model will see, hence it is a better measure of the model's performance once launched (which is why we use it for the final metrics), and using it when refitting produces a model that performs better on production traffic.
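As an aside, a minimal sketch of what a time-based split can look like in ML.NET; the TimedRankingData class, column layout, file path, and cut-off values below are hypothetical (the sample's RankingData class has no timestamp column), while FilterRowsByColumn is the standard API for this kind of range filter:

using Microsoft.ML;
using Microsoft.ML.Data;

var mlContext = new MLContext();
IDataView allData = mlContext.Data.LoadFromTextFile<TimedRankingData>(
    "rankingData.tsv", separatorChar: '\t', hasHeader: true);

// Oldest rows train the model, newer rows validate it, the newest rows test it.
IDataView trainData = mlContext.Data.FilterRowsByColumn(allData, "Timestamp", upperBound: 18000);
IDataView validData = mlContext.Data.FilterRowsByColumn(allData, "Timestamp", lowerBound: 18000, upperBound: 18200);
IDataView testData  = mlContext.Data.FilterRowsByColumn(allData, "Timestamp", lowerBound: 18200);

// Hypothetical row class; the sample's real RankingData has no Timestamp field.
public class TimedRankingData
{
    [LoadColumn(0)] public float Label { get; set; }
    [LoadColumn(1)] public float GroupId { get; set; }
    [LoadColumn(2)] public float Timestamp { get; set; } // e.g., days since some epoch
}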

justinormont (Contributor) left a review:

LGTM; thanks so much for adding a sample.

Two minor/optional items are noted inline.

jwood803 (Contributor, Author):

The change to add truncation to the AutoML API may be in soon, so we may be able to wait for that and update this sample accordingly.

jwood803 (Contributor, Author) commented Dec 17, 2020:

@justinormont The latest version was pushed with the DcgTruncation change, so I updated the sample to use it. Just let me know if I missed anything.
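For context, a rough sketch of wiring up the truncation level. This assumes the updated Microsoft.ML.AutoML package, where RankingExperimentSettings exposes OptimizationMetricTruncationLevel (the property this thread refers to informally as the DcgTruncation change); the time budget and values are illustrative:

using Microsoft.ML;
using Microsoft.ML.AutoML;

var mlContext = new MLContext();

// Configure the ranking experiment; OptimizationMetricTruncationLevel controls
// the truncation level used when computing NDCG/DCG during the sweep.
var experimentSettings = new RankingExperimentSettings
{
    MaxExperimentTimeInSeconds = 600,
    OptimizingMetric = RankingMetric.Ndcg,
    OptimizationMetricTruncationLevel = 10 // optimize NDCG@10
};

var experiment = mlContext.Auto().CreateRankingExperiment(experimentSettings);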

justinormont (Contributor) left a review:

LGTM.

Minor items: #852 (comment) & #852 (comment).

// The scores are used to determine the ranking: a higher score indicates
// a higher ranking relative to the other candidate results.
foreach (var prediction in firstGroupPredictions)
{
    Console.WriteLine($"GroupId: {prediction.GroupId}, Score: {prediction.Score}");
}
justinormont (Contributor):

This prints:

=============== Testing prediction engine ===============
#########################################################
=============== Loaded Model OK  ===============
GroupId: 94629043, Score: 10.5748415
GroupId: 94629043, Score: 8.747048
GroupId: 94629043, Score: 8.339484
GroupId: 94629043, Score: 8.295864
GroupId: 94629043, Score: 7.7126718
GroupId: 94629043, Score: 7.094361
GroupId: 94629043, Score: 6.403935
GroupId: 94629043, Score: 5.6126056
GroupId: 94629043, Score: 5.4343157
GroupId: 94629043, Score: 5.341767
GroupId: 94629043, Score: 3.4955919
GroupId: 94629043, Score: 3.0068336
GroupId: 94629043, Score: 2.925596
...

Is there more information we can print? Currently the user has no way to compare the ranked results against the original input data.

Could we print the correct/input Label in a column? The user could then verify that the input Label value decreases down the list.

jwood803 (Contributor, Author):

I can add the original label to the output. Would that help?
[screenshot: proposed output including the original Label]

justinormont (Contributor):

Is there a way to do a late merge, without adding Label to the RankingPrediction class?

Do you think that if we add the Label to the output class, some users will misconstrue it as the predicted label (Score is the actual prediction)? Seeing no mispredictions, users could assume their model is perfect and launch it to production, only to find the Label field empty there.

Even without the added label output, this PR is very good. I'm ok merging as is, or with adding the correct/input label.

jwood803 (Contributor, Author):

My only thought is to do something like this:

var predictionsPreview = _predictions.Preview();

for (int i = 0; i < firstGroupPredictions.Count; i++)
{
    // Pull this row's values out of the preview and look up the original
    // Label column (requires using System.Linq for First).
    var currentPredictionPreview = predictionsPreview.RowView[i].Values;
    var label = currentPredictionPreview.First(kv => kv.Key == "Label").Value;

    Console.WriteLine($"GroupId: {firstGroupPredictions[i].GroupId}, Score: {firstGroupPredictions[i].Score}, Label: {label}");
}

This uses the predictions from the test set and gets the label from their Preview.

I'm not sure it's the best approach, though. If it isn't good, perhaps we can merge this and do another update once we find a better solution.

justinormont (Contributor) commented Dec 29, 2020:

How does this look:

// Label values from the test dataset (not the predicted scores/labels)
IEnumerator<float> labelEnumerator = mlContext.Data
    .CreateEnumerable<RankingData>(testDataView, true)
    .Select(a => a.Label)
    .GetEnumerator();

foreach (var prediction in firstGroupPredictions)
{
    labelEnumerator.MoveNext();
    Console.WriteLine($"GroupId: {prediction.GroupId}, Score: {prediction.Score}, Correct Label: {labelEnumerator.Current}");
}

It would require passing in testDataView. Technically, the same enumerator can be made from firstGroupPredictions, though reading from the test dataset reinforces that the printed label is not part of the prediction, but instead a given label.

jwood803 (Contributor, Author):

That works great! Pushed the latest changes with this in it. Thanks!

justinormont (Contributor):

LGTM. Good to merge?

jwood803 (Contributor, Author):

I think so! Thank you for helping me with this sample!

justinormont (Contributor):

@jwood803: I checked in various fixes.

Most fixes address my own review feedback. I also modified the thrice training to use the correct scoring dataset and to return the refit model.

Only remaining feedback is #852 (review).

justinormont merged commit 1c804f5 into dotnet:master on Jan 1, 2021.
Elizabethhanson pushed a commit to Elizabethhanson/machinelearning-samples that referenced this pull request on Sep 10, 2021:
* Initial add of project

* Update ranking sample

* Get sample working

* Updates based on feedback

* Add refitting on validation and test data sets

* Update console headers

* Iteration print improvements

* Correct validationData

* Printing NDCG@1,3,10 & DCG@10

* Printing NDCG@1,3,10 & DCG@10

* Add readme

* Update based on feedback

* Use new DcgTruncation property

* Update to latest AutoML package

* Review feedback

* Wording for 1st refit step

* Update to include original label in output

Co-authored-by: Justin Ormont <[email protected]>