One type label policy in trainers #2804
Codecov Report
@@            Coverage Diff             @@
##           master    #2804      +/-   ##
==========================================
+ Coverage   71.81%   71.83%   +0.02%
==========================================
  Files         812      812
  Lines      142658   142708      +50
  Branches    16095    16092       -3
==========================================
+ Hits       102445   102512      +67
+ Misses      35830    35818      -12
+ Partials     4383     4378       -5
@@ -20,21 +20,23 @@ public void SdcaWorkout()
var data = TextLoaderStatic.CreateLoader(Env, ctx => (Label: ctx.LoadFloat(0), Features: ctx.LoadFloat(1, 10)))
.Load(dataPath).Cache();

var binaryData = ML.Transforms.Conversion.ConvertType("Label", outputKind: DataKind.Boolean).Fit(data.AsDynamic).Transform(data.AsDynamic);
var binaryData: Maybe break lines? This is a bit of a mouthful. #Resolved
// Data
-var data = mlContext.Data.Cache(reader.Load(GetDataPath(dataPath)));
+var textData = reader.Load(GetDataPath(dataPath));
+var data = mlContext.Data.Cache(mlContext.Transforms.Conversion.MapValueToKey("Label").Fit(textData).Transform(textData));
var data: Maybe for lines like this in this file, break the steps out a bit for easier reading? #Resolved
var pipeline = mlContext.MulticlassClassification.Trainers.OneVersusAll(ap, useProbabilities: false);

var model = pipeline.Fit(data);
var predictions = model.Transform(data);

// Metrics
var metrics = mlContext.MulticlassClassification.Evaluate(predictions);
-Assert.True(metrics.MicroAccuracy > 0.71);
+Assert.True(metrics.MicroAccuracy > 0.66);
Assert.True(metrics.MicroAccuracy > 0.66): Did changing the concurrency give lower MicroAccuracy?
No. I still need to debug it properly.
We have OVA with 3 labels, which gives three binary problems: 0 vs 1,2; 1 vs 0,2; and 2 vs 0,1.
For 1 vs 0,2, AP with 1 iteration (which is the default) treats everything as 0,2. I even modified the file and relabeled it accordingly to test that. For the other cases AP perfectly overfits and gives correct predictions on the training data, which leads to 2/3 accuracy.
For some reason that wasn't the case before with float labels; as I said, I need to look into that.
Concurrency = 1 is set for ease of debugging; I can remove it.
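The 2/3 figure in the reply above can be reproduced with a small standalone sketch. It does not call ML.NET; the balanced 150-example data set, the fixed ±1 raw scores, and the helper names `Scores` and `Predict` are all assumptions made for illustration. Two of the three one-versus-all classifiers are modeled as fitting the training data perfectly, and the "1 vs 0,2" classifier is modeled as always answering "not class 1".

```csharp
using System;
using System.Linq;

// Hypothetical balanced training set: 50 examples for each of the labels 0, 1, 2.
int[] labels = Enumerable.Range(0, 150).Select(i => i / 50).ToArray();

// Raw scores of the three one-versus-all classifiers for an example with the given label.
// Classifiers 0 and 2 are assumed to fit the training data perfectly;
// classifier 1 is assumed to score every example as "not class 1".
double[] Scores(int label) => new[]
{
    label == 0 ? 1.0 : -1.0,   // 0 vs 1,2
    -1.0,                      // 1 vs 0,2: always negative
    label == 2 ? 1.0 : -1.0,   // 2 vs 0,1
};

// OVA picks the class whose classifier reports the highest raw score
// (ties go to the lowest class index in this sketch).
int Predict(int label)
{
    double[] s = Scores(label);
    return Array.IndexOf(s, s.Max());
}

// Fraction of examples whose predicted class equals the true label.
double microAccuracy = labels.Count(l => Predict(l) == l) / (double)labels.Length;
Console.WriteLine($"Micro-accuracy: {microAccuracy:F4}");
```

Under these assumptions every class-1 example is misclassified and the other 100 examples are correct, so the micro-accuracy is 100/150, which is just above the new 0.66 threshold and below the old 0.71 one.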
@@ -307,11 +307,11 @@ public static void GetSlotNames(RoleMappedSchema schema, RoleMappedSchema.Column
schema.Schema[list[0].Index].Annotations.GetValue(Kinds.SlotNames, ref slotNames);
}

-public static bool HasKeyValues(this SchemaShape.Column col)
+public static bool ShouldAddSlotNames(this SchemaShape.Column col)
ShouldAddSlotNames: NeedsSlotNames? Having a modal verb here feels odd to me. #Resolved
{
return col.Annotations.TryFindColumn(Kinds.KeyValues, out var metaCol)
&& metaCol.Kind == SchemaShape.Column.VectorKind.Vector
-&& metaCol.ItemType is TextDataViewType;
+&& metaCol.ItemType is TextDataViewType; ;
Double ; #Resolved
if (task == TaskType.BinaryClassification)
return pipeline.Append(ML.Transforms.Conversion.ConvertType("Label", outputKind: DataKind.Boolean))
.Fit(srcDV).Transform(srcDV);
else
else if should be on one line. #Resolved
Good question.
In reply to: 469402531. Refers to: src/Microsoft.ML.FastTree/FastTreeRanking.cs:40 in 5707e17.
In FastTree, we do this crazy thing:
private IEnumerable<bool> GetClassificationLabelsFromRatings(Dataset set)
{
// REVIEW: Historically FastTree has this test as >= 1. TLC however
// generally uses > 0. Consider changing FastTree to be consistent.
return set.Ratings.Select(x => x >= 1);
}
I would support dropping support for ints & floats & doubles here because we don't use any of the information and we do confusing casts.
In reply to: 469443560.
Note that one thing we don't do is allow probability-as-labels for calibrated trainers. That is, there are times when I may want to specify that I believe that the correct probability out is 0.7 and not 1.0 for an example. This is the only case where we'd want to allow floats as inputs.
In reply to: 469456048.
@@ -40,14 +40,14 @@ public void TestPfiRegressionOnDenseFeatures()
// X4Rand: 3

// For the following metrics lower is better, so maximum delta means more important feature, and vice versa
Assert.Equal(3, MinDeltaIndex(pfi, m => m.MeanAbsoluteError.Mean));
one too many #Resolved
@@ -294,7 +294,7 @@ private List<BreastCancerExample> ReadBreastCancerExamples()
public void TestTrainTestSplit()
{
var mlContext = new MLContext(0);

var dataPath = GetDataPath("adult.tiny.with-schema.txt");
one #Resolved
One empty line, or did you mean to have two?
In reply to: 263510301
It's actually CodeFlow-specific: it shows two new lines if you remove the leading spaces from it.
In reply to: 263685676
#@ col=FeatureContributions:R4:25-30
#@ col=FeatureContributions:R4:31-36
#@ col=FeatureContributions:R4:37-42
#@ col=Label:BL:7
Why do we need two label columns? #ByDesign
We restrict binary classification to work only with Boolean columns.
It's easier to add a conversion than to change the shared code that generates R4 labels for different classes.
In reply to: 263181449
@@ -1523,20 +1523,6 @@ private protected SdcaBinaryTrainerBase(IHostEnvironment env, BinaryOptionsBase

private protected abstract SchemaShape.Column[] ComputeSdcaBinaryClassifierSchemaShape();

-private protected override void CheckLabelCompatible(SchemaShape.Column labelCol)
CheckLabelCompatible: Can the label be text? If not, we still need a check. #Resolved
As you can see, this is an override method. By removing it, I force the system to fall back to the default behavior, which is to check that the label is a Boolean column.
In reply to: 263182006
@@ -899,6 +909,18 @@ public void Convert(in BL src, ref SB dst)
public void Convert(in DT src, ref SB dst) { ClearDst(ref dst); dst.AppendFormat("{0:o}", src); }
public void Convert(in DZ src, ref SB dst) { ClearDst(ref dst); dst.AppendFormat("{0:o}", src); }
#endregion ToStringBuilder
#region ToBL
public void Convert(in R8 src, ref BL dst) => dst = src > 0 ? true : false;
Suggested change:
-public void Convert(in R8 src, ref BL dst) => dst = src > 0 ? true : false;
+public void Convert(in R8 src, ref BL dst) => dst = src > 0.5 ? true : false;
Rounding to the nearest number looks more reasonable. #Closed
I think 0.00000001 looks closer to 0?
In reply to: 263196830
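To make the disagreement in this thread concrete, here is a minimal standalone sketch comparing the two candidate rules. The helper names `GreaterThanZero` and `GreaterThanHalf` and the sample values are hypothetical; neither helper is the ML.NET converter itself.

```csharp
using System;

// The rule in the PR: any strictly positive value maps to true.
bool GreaterThanZero(double src) => src > 0;

// The rule suggested in review: values round to the nearer of 0 and 1.
bool GreaterThanHalf(double src) => src > 0.5;

// Sample labels, including the 0.00000001 case raised in the thread.
double[] samples = { -1.0, 0.0, 0.00000001, 0.3, 0.5, 0.7, 1.0 };

foreach (double x in samples)
{
    // The two rules only disagree for values in the interval (0, 0.5].
    Console.WriteLine($"{x}: > 0 gives {GreaterThanZero(x)}, > 0.5 gives {GreaterThanHalf(x)}");
}
```

Both rules agree on clean 0/1 labels; the choice only matters for fractional values such as 0.00000001, which `> 0` treats as true and `> 0.5` treats as false.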
@@ -899,6 +909,18 @@ public void Convert(in BL src, ref SB dst)
public void Convert(in DT src, ref SB dst) { ClearDst(ref dst); dst.AppendFormat("{0:o}", src); }
public void Convert(in DZ src, ref SB dst) { ClearDst(ref dst); dst.AppendFormat("{0:o}", src); }
#endregion ToStringBuilder
#region ToBL
public void Convert(in R8 src, ref BL dst) => dst = src > 0 ? true : false;
public void Convert(in R4 src, ref BL dst) => dst = src > 0 ? true : false;
Suggested change:
-public void Convert(in R4 src, ref BL dst) => dst = src > 0 ? true : false;
+public void Convert(in R4 src, ref BL dst) => dst = src > 0.5 ? true : false;
#Closed
#@ col=FeatureContributions:R4:25-30
#@ col=FeatureContributions:R4:31-36
#@ col=FeatureContributions:R4:37-42
#@ col=Label:BL:7
Any reason to have two label columns? #ByDesign
TestEntryPointRoutine("iris.txt", "Trainers.EnsembleBinaryClassifier", xfNames:
new[] {
"Transforms.ColumnTypeConverter",

Extra line? #Resolved
{{
{string.Format(xfTemplate, xfNames[i], i + 1, xfArgs[i], i + 2)}
}},";

Extra line? #Resolved
-fileSeparator = '\t'
+fileSeparator = '\t',
+mamlExtraSettings = new[] { "xf=Term{col=Label}" }

Extra line? Can auto-formatting handle it? #Resolved
@@ -1,344 +0,0 @@
-// Licensed to the .NET Foundation under one or more agreements.
Is the removal of this file intended? #Resolved
Do we have any test to ensure that a wrong label type throws an exception?
/// 2nd slot of xBuff has the least importance: Evaluation metrics do not change a lot when this slot is permuted.
/// x3 has the biggest importance.
/// </summary>
private IDataView GetSparseDataset(TaskType task = TaskType.Regression, int numberOfInstances = 1000)
Can we put these into a common static library where they can be accessed from all the places that use them, rather than copying them here?
Side note: This description is pretty vague! :)
@@ -145,7 +145,14 @@ private protected virtual void CheckLabelCompatible(SchemaShape.Column labelCol)
IDataView validationSet = null, IPredictor initPredictor = null)
{
var trainRoleMapped = MakeRoles(trainSet);
var validRoleMapped = validationSet == null ? null : MakeRoles(validationSet);
CheckInputSchema(SchemaShape.Create(trainSet.Schema));
CheckInputSchema: Nit: The ordering of train and valid in Make and Check is different. I'd prefer to have the same sequence. #Resolved
@@ -99,19 +98,14 @@ private protected IDataView MapLabelsCore<T>(DataViewType type, InPredicate<T> e
Host.Assert(data.Schema.Label.HasValue);

var lab = data.Schema.Label.Value;
lab: Nit: This hurts my eyes. #Resolved
A key fix!
Just some nits but you are ready to go.
👨🚀 🚀 🌕 🌎
This PR fixes #2628, fixes #2750, and fixes #2810.