Skip to content

One type label policy in trainers #2804

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

Ivanidzo4ka
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka commented Feb 28, 2019

This PR fixes #2628 and fixes #2750. fixes #2810

@codecov
Copy link

codecov bot commented Mar 2, 2019

Codecov Report

Merging #2804 into master will increase coverage by 0.02%.
The diff coverage is 82.27%.

@@            Coverage Diff             @@
##           master    #2804      +/-   ##
==========================================
+ Coverage   71.81%   71.83%   +0.02%     
==========================================
  Files         812      812              
  Lines      142658   142708      +50     
  Branches    16095    16092       -3     
==========================================
+ Hits       102445   102512      +67     
+ Misses      35830    35818      -12     
+ Partials     4383     4378       -5
Flag Coverage Δ
#Debug 71.83% <82.27%> (+0.02%) ⬆️
#production 67.97% <65.3%> (+0.01%) ⬆️
#test 86.25% <89.9%> (ø) ⬆️
Impacted Files Coverage Δ
...oft.ML.StandardTrainers/Standard/SdcaMultiClass.cs 93.51% <ø> (+1.26%) ⬆️
src/Microsoft.ML.FastTree/FastTreeRanking.cs 48.19% <ø> (ø) ⬆️
...crosoft.ML.StandardTrainers/Standard/SdcaBinary.cs 72.68% <ø> (+0.13%) ⬆️
...dardTrainers/Standard/Online/AveragedPerceptron.cs 89.7% <ø> (+3.99%) ⬆️
...st/Microsoft.ML.Benchmarks.Tests/BenchmarksTest.cs 0% <ø> (ø) ⬆️
...s/StochasticDualCoordinateAscentClassifierBench.cs 0% <0%> (ø) ⬆️
src/Microsoft.ML.Core/Data/AnnotationUtils.cs 81.77% <0%> (-0.53%) ⬇️
...t/Microsoft.ML.Benchmarks/PredictionEngineBench.cs 0% <0%> (ø) ⬆️
...soft.ML.Data/Scorers/MultiClassClassifierScorer.cs 59.64% <0%> (ø) ⬆️
test/Microsoft.ML.TestFramework/Datasets.cs 100% <100%> (ø) ⬆️
... and 21 more

@Ivanidzo4ka Ivanidzo4ka changed the title [WIP] One type label policy in trainers One type label policy in trainers Mar 4, 2019
@rogancarr
Copy link
Contributor

namespace Microsoft.ML.Trainers.FastTree

FastTree and GAM binary classification convert from R4 to boolean with a custom function. Do we want to drop that functionality?


Refers to: src/Microsoft.ML.FastTree/FastTreeRanking.cs:40 in 5707e17. [](commit_id = 5707e17, deletion_comment = False)

@@ -20,21 +20,23 @@ public void SdcaWorkout()
var data = TextLoaderStatic.CreateLoader(Env, ctx => (Label: ctx.LoadFloat(0), Features: ctx.LoadFloat(1, 10)))
.Load(dataPath).Cache();

var binaryData = ML.Transforms.Conversion.ConvertType("Label", outputKind: DataKind.Boolean).Fit(data.AsDynamic).Transform(data.AsDynamic);
Copy link
Contributor

@rogancarr rogancarr Mar 4, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

var binaryData [](start = 11, length = 15)

Maybe break lines? This is a bit of a mouthful. #Resolved

// Data
var data = mlContext.Data.Cache(reader.Load(GetDataPath(dataPath)));
var textData = reader.Load(GetDataPath(dataPath));
var data = mlContext.Data.Cache(mlContext.Transforms.Conversion.MapValueToKey("Label").Fit(textData).Transform(textData));
Copy link
Contributor

@rogancarr rogancarr Mar 4, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

var data [](start = 12, length = 9)

Maybe for lines like this in this file, break the steps out a bit for easier reading? #Resolved

var pipeline = mlContext.MulticlassClassification.Trainers.OneVersusAll(ap, useProbabilities: false);

var model = pipeline.Fit(data);
var predictions = model.Transform(data);

// Metrics
var metrics = mlContext.MulticlassClassification.Evaluate(predictions);
Assert.True(metrics.MicroAccuracy > 0.71);
Assert.True(metrics.MicroAccuracy > 0.66);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assert.True(metrics.MicroAccuracy > 0.66) [](start = 12, length = 41)

Did changing the concurrency give lower MicroAccuracy?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No.
I still need to debug it properly.
We have ova with 3 labels which gives me 0 vs 1,2; 1 vs 0,2 and 2 vs 0,1.
For 1 vs 0,2 AP with 1 iteration (which is default) treats everything as 0,2. I've even modify file and relabel it accordingly to test that. For other cases AP perfectly overfits and give you correct predictions on training data. Which leads to 2/3 accuracy.
For some reason what wasn't the case before with float labels, as I said, I need to look on that.
Concurency =1 set for ease of debugging, I can remove it.

@@ -307,11 +307,11 @@ public static void GetSlotNames(RoleMappedSchema schema, RoleMappedSchema.Column
schema.Schema[list[0].Index].Annotations.GetValue(Kinds.SlotNames, ref slotNames);
}

public static bool HasKeyValues(this SchemaShape.Column col)
public static bool ShouldAddSlotNames(this SchemaShape.Column col)
Copy link
Contributor

@rogancarr rogancarr Mar 4, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ShouldAddSlotNames [](start = 27, length = 18)

NeedsSlotNames? Having a modal verb here feels odd to me. #Resolved

{
return col.Annotations.TryFindColumn(Kinds.KeyValues, out var metaCol)
&& metaCol.Kind == SchemaShape.Column.VectorKind.Vector
&& metaCol.ItemType is TextDataViewType;
&& metaCol.ItemType is TextDataViewType; ;
Copy link
Member

@eerhardt eerhardt Mar 4, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

double ; #Resolved

if (task == TaskType.BinaryClassification)
return pipeline.Append(ML.Transforms.Conversion.ConvertType("Label", outputKind: DataKind.Boolean))
.Fit(srcDV).Transform(srcDV);
else
Copy link
Member

@eerhardt eerhardt Mar 4, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

else if should be on one line. #Resolved

@Ivanidzo4ka
Copy link
Contributor Author

namespace Microsoft.ML.Trainers.FastTree

Good question.
Main reason for this PR is to have consistency and support only one type of label.
Boolean feels natural for binary classification, but I can see benefits of floats for binary classification.
How useful that conversion is?
For Gam we just do this:
boolArray[doc] = targets[doc] > 0;
Which is same thing as ConvertFromR4ToBool.
Not sure I know where to look for FastTree.


In reply to: 469402531 [](ancestors = 469402531)


Refers to: src/Microsoft.ML.FastTree/FastTreeRanking.cs:40 in 5707e17. [](commit_id = 5707e17, deletion_comment = False)

@rogancarr
Copy link
Contributor

rogancarr commented Mar 4, 2019

namespace Microsoft.ML.Trainers.FastTree

In FastTree, we do this crazy thing:

private IEnumerable<bool> GetClassificationLabelsFromRatings(Dataset set)
{
    // REVIEW: Historically FastTree has this test as >= 1. TLC however   
    // generally uses > 0. Consider changing FastTree to be consistent.
    return set.Ratings.Select(x => x >= 1);
}

I would support dropping support for ints & floats & doubles here because we don't use any of the information and we do confusing casts.


In reply to: 469443560 [](ancestors = 469443560,469402531)


Refers to: src/Microsoft.ML.FastTree/FastTreeRanking.cs:40 in 5707e17. [](commit_id = 5707e17, deletion_comment = False)

@rogancarr
Copy link
Contributor

rogancarr commented Mar 4, 2019

namespace Microsoft.ML.Trainers.FastTree

Note that one thing we don't do is allow probability-as-labels for calibrated trainers. That is, there are times when I may want to specify that I believe that the correct probability out is 0.7 and not 1.0 for an example.

This is the only case where we'd want to allow floats as inputs.


In reply to: 469456048 [](ancestors = 469456048,469443560,469402531)


Refers to: src/Microsoft.ML.FastTree/FastTreeRanking.cs:40 in 5707e17. [](commit_id = 5707e17, deletion_comment = False)

@@ -40,14 +40,14 @@ public void TestPfiRegressionOnDenseFeatures()
// X4Rand: 3

// For the following metrics lower is better, so maximum delta means more important feature, and vice versa
Assert.Equal(3, MinDeltaIndex(pfi, m => m.MeanAbsoluteError.Mean));
Copy link
Member

@sfilipi sfilipi Mar 6, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  [](start = 0, length = 6)

one too many #Resolved

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i mean new lines


In reply to: 263174998 [](ancestors = 263174998)

@@ -294,7 +294,7 @@ private List<BreastCancerExample> ReadBreastCancerExamples()
public void TestTrainTestSplit()
{
var mlContext = new MLContext(0);

var dataPath = GetDataPath("adult.tiny.with-schema.txt");
Copy link
Member

@sfilipi sfilipi Mar 6, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one what?


In reply to: 263175595 [](ancestors = 263175595)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one empty line, or did you mean to have two?


In reply to: 263510301 [](ancestors = 263510301,263175595)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's actually codeflow specific. it show two new lines if you remove leading spaces from it


In reply to: 263685676 [](ancestors = 263685676,263510301,263175595)

#@ col=FeatureContributions:R4:25-30
#@ col=FeatureContributions:R4:31-36
#@ col=FeatureContributions:R4:37-42
#@ col=Label:BL:7
Copy link
Member

@wschin wschin Mar 6, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need two labels columns? #ByDesign

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We restrict binary classification to work only with boolean columns.
It's easier to add convert than change shared code which generate R4 labels for different classes.


In reply to: 263181449 [](ancestors = 263181449)

@@ -1523,20 +1523,6 @@ private protected SdcaBinaryTrainerBase(IHostEnvironment env, BinaryOptionsBase

private protected abstract SchemaShape.Column[] ComputeSdcaBinaryClassifierSchemaShape();

private protected override void CheckLabelCompatible(SchemaShape.Column labelCol)
Copy link
Member

@wschin wschin Mar 6, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CheckLabelCompatible [](start = 40, length = 20)

Can label be text? If no, we still a check. #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you can see this is override method. By removing this one, I would force system to fallback to default behavior. Which is - to check it's a boolean column.


In reply to: 263182006 [](ancestors = 263182006)

@@ -899,6 +909,18 @@ public void Convert(in BL src, ref SB dst)
public void Convert(in DT src, ref SB dst) { ClearDst(ref dst); dst.AppendFormat("{0:o}", src); }
public void Convert(in DZ src, ref SB dst) { ClearDst(ref dst); dst.AppendFormat("{0:o}", src); }
#endregion ToStringBuilder
#region ToBL
public void Convert(in R8 src, ref BL dst) => dst = src > 0 ? true : false;
Copy link
Member

@wschin wschin Mar 7, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
public void Convert(in R8 src, ref BL dst) => dst = src > 0 ? true : false;
public void Convert(in R8 src, ref BL dst) => dst = src > 0.5 ? true : false;

Rounding to the nearest number looks more reasonable. #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about that.


In reply to: 263190893 [](ancestors = 263190893)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think 0.00000001 looks more closer to 0?


In reply to: 263196830 [](ancestors = 263196830,263190893)

@@ -899,6 +909,18 @@ public void Convert(in BL src, ref SB dst)
public void Convert(in DT src, ref SB dst) { ClearDst(ref dst); dst.AppendFormat("{0:o}", src); }
public void Convert(in DZ src, ref SB dst) { ClearDst(ref dst); dst.AppendFormat("{0:o}", src); }
#endregion ToStringBuilder
#region ToBL
public void Convert(in R8 src, ref BL dst) => dst = src > 0 ? true : false;
public void Convert(in R4 src, ref BL dst) => dst = src > 0 ? true : false;
Copy link
Member

@wschin wschin Mar 7, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
public void Convert(in R4 src, ref BL dst) => dst = src > 0 ? true : false;
public void Convert(in R4 src, ref BL dst) => dst = src > 0.5 ? true : false;
``` #Closed

#@ col=FeatureContributions:R4:25-30
#@ col=FeatureContributions:R4:31-36
#@ col=FeatureContributions:R4:37-42
#@ col=Label:BL:7
Copy link
Member

@wschin wschin Mar 7, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason to have two label columns? #ByDesign

Copy link
Contributor

@rogancarr rogancarr Mar 8, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ByDesign — response on a different thread.


In reply to: 263191537 [](ancestors = 263191537)

TestEntryPointRoutine("iris.txt", "Trainers.EnsembleBinaryClassifier", xfNames:
new[] {
"Transforms.ColumnTypeConverter",

Copy link
Member

@wschin wschin Mar 7, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extra line? #Resolved

{{
{string.Format(xfTemplate, xfNames[i], i + 1, xfArgs[i], i + 2)}
}},";

Copy link
Member

@wschin wschin Mar 7, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extra line? #Resolved

fileSeparator = '\t'
fileSeparator = '\t',
mamlExtraSettings = new[] { "xf=Term{col=Label}" }

Copy link
Member

@wschin wschin Mar 7, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extra line? Can auto-formatting handle it? #Resolved

@@ -1,344 +0,0 @@
// Licensed to the .NET Foundation under one or more agreements.
Copy link
Member

@wschin wschin Mar 7, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the removal of this file intended? #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no idea why this is happening....


In reply to: 263192471 [](ancestors = 263192471)

@wschin
Copy link
Member

wschin commented Mar 7, 2019

Do we have any test to ensure that wrong label type incurs exception?

/// 2nd slot of xBuff has the least importance: Evaluation metrics do not change a lot when this slot is permuted.
/// x3 has the biggest importance.
/// </summary>
private IDataView GetSparseDataset(TaskType task = TaskType.Regression, int numberOfInstances = 1000)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we put these into a common static library where it can be accessed from all the places that use it, rather than copy it here?

Copy link
Member

@sfilipi sfilipi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@rogancarr
Copy link
Contributor

Side note: This description is pretty vague ! :)

@@ -145,7 +145,14 @@ private protected virtual void CheckLabelCompatible(SchemaShape.Column labelCol)
IDataView validationSet = null, IPredictor initPredictor = null)
{
var trainRoleMapped = MakeRoles(trainSet);
var validRoleMapped = validationSet == null ? null : MakeRoles(validationSet);
CheckInputSchema(SchemaShape.Create(trainSet.Schema));
Copy link
Contributor

@rogancarr rogancarr Mar 8, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CheckInputSchema [](start = 12, length = 16)

Nit: Ordering for train and valid in Make and Check are different. I'd prefer to have the same sequence. #Resolved

@@ -99,19 +98,14 @@ private protected IDataView MapLabelsCore<T>(DataViewType type, InPredicate<T> e
Host.Assert(data.Schema.Label.HasValue);

var lab = data.Schema.Label.Value;
Copy link
Contributor

@rogancarr rogancarr Mar 8, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lab [](start = 16, length = 3)

Nit: This hurts my eyes. #Resolved

Copy link
Contributor

@rogancarr rogancarr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A key fix!

Just some nits but you are ready to go.

👨‍🚀 🚀 🌕 🌎

@Ivanidzo4ka Ivanidzo4ka merged commit acc4ac0 into dotnet:master Mar 12, 2019
@ghost ghost locked as resolved and limited conversation to collaborators Mar 24, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
5 participants