-
Notifications
You must be signed in to change notification settings - Fork 1.9k
PipelineSweeperMacro for Multi-Class Classification #539
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
"sep=, col=age:R4:0 col=workclass:TX:1 col=fnlwgt:R4:2 col=education:TX:3 col=education_num:R4:4 col=marital_status:TX:5 col=occupation:TX:6 " + | ||
"col=relationship:TX:7 col=ethnicity:TX:8 col=sex:TX:9 col=Features:R4:10-12 col=native_country:TX:13 col=IsOver50K_:R4:14 header=+"; | ||
var inputFileTrain = new SimpleFileHandle(Env, pathData, false, false); | ||
#pragma warning disable 0618 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this necessary? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
disables warning generated because of using ImportText (obsolete) instead of Textloader.
Perhaps in a future PR, we can update all the unit tests to use TextLoader. I can create a separate bug to track that.
In reply to: 202746784 [](ancestors = 202746784)
var runner = new GraphRunner(Env, catalog, graphJson[FieldNames.Nodes] as JArray); | ||
runner.SetInput("TrainingData", datasetTrain); | ||
runner.SetInput("TestingData", datasetTest); | ||
runner.RunAll(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
runner.RunAll(); [](start = 12, length = 16)
do you think the test should have a check on the results, make sure they are the same all the time, at least. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
… MultiClassClassification; refactored out unit tests for the PipelineSweeper
AutoInference.AutoMlMlState amlsOut = (AutoInference.AutoMlMlState)exp.GetOutput(output.State); | ||
Assert.NotNull(amlsOut); | ||
Assert.Equal(amlsOut.GetAllEvaluatedPipelines().Length, numIterations); | ||
Assert.True(amlsOut.GetBestPipeline().PerformanceSummary.MetricValue > 0.1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assert.True(amlsOut.GetBestPipeline().PerformanceSummary.MetricValue > 0.1); [](start = 15, length = 77)
does it make more sense to compare against a fixed number? that might be a stronger test.. #Resolved
ThreshAtNumPos = 25, | ||
Nmi = 26, | ||
AvgMinScore = 27, | ||
Dbi = 28 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
= 28 [](start = 16, length = 4)
the numbers are implied, if they are just sequential. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like they are auto generated by the CSharp API generator. We did not specify the numbers in the Enum.
In reply to: 203217022 [](ancestors = 203217022)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we serialize the enum, I think we'd want a fixed number. This could be serialized as a state object for the sweep so that the user can resume the sweep w/ the history of the sweep stored on disk. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if this is a general comment , or something that needs to be addressed ?
Btw, this is is auto generated file . The Enum you see here is defined in PipelineSweeperSupportedMetrics.cs
In reply to: 203267499 [](ancestors = 203267499)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a topic to consider for the future.
When we serialize the enum, we'll either have to serialize the numeric form or the text form. A numeric form will be more stable, and for that we'd have to assign the numbers in the primary definition of the enum.
/// <summary> | ||
/// Mapp Enum Metrics to a SupportedMetric | ||
/// </summary> | ||
private static readonly Dictionary<string, SupportedMetric> _map = new Dictionary<string, SupportedMetric> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_map [](start = 67, length = 5)
if the only usage for this is to contain a list of known metrics, the enum might be enough.
Take a look at Enum.Parse. You can move the construction of new SupportedMetric in the GetSupportedMetric, if the conversion succeeds. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To specify the metric argument in the Macro we are using an Enum. That's the list of metrics you see in he CSharpApi.cs file. However, for parsing the results from the train test macro we need to use the metric names used by the ML.NET evaluators.
e.g. AcuracyMicro (Enum Argument) v/s 'Accuracy(micro-avg)' (Evaluator metric). This dictionary maps the two. Does that clarify why we need this mapping?
In reply to: 203217600 [](ancestors = 203217600)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think Senja meant that the conversion from the enum to a string is not needed (see my comment below on the GetSupportedMetric method).
In reply to: 203220227 [](ancestors = 203220227,203217600)
|
||
using (var ch = _host.Start("Suggested Pipeline")) | ||
{ | ||
ch.Info($"PipelineSweeper Pipeline Id : {pipeline.UniqueId}"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we also print an increasing pipeline number? This lets the user see the progress of the sweeping. #Resolved
} | ||
ch.Info($"PipelineSweeper Learner : {pipeline.Learner}"); | ||
ch.Info($"PipelineSweeper Train Metric Value: {pipeline.PerformanceSummary.TrainingMetricValue}"); | ||
ch.Info($"PipelineSweeper Test Metric Value: {pipeline.PerformanceSummary.MetricValue}"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're printing the primary sweeping metric here. How much work would it be to print all of the pipeline's metrics?
Generally when hand sweeping over a dataset, I'm watching multiple metrics. This is important as one metric often increases at the detriment of others.
For instance, micro & macro accuracy often trade-off as some models end-up better at smaller classes or the majority classes. The best models improve the metrics in unison. #Pending
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea. I need to explore how we can expose this functionality for the PipelineSweeper. One thing to note is that the other metrics would depend on the TrainerKind.
I have created a separate task (269949) for this. Would not want to block this PR on this feature ask.
In reply to: 203264334 [](ancestors = 203264334)
@@ -15626,36 +15626,37 @@ public sealed class UniformRandomAutoMlEngine : AutoMlEngine | |||
|
|||
public abstract class AutoMlStateBase : ComponentKind {} | |||
|
|||
public enum AutoInferenceAutoMlMlStateArgumentsMetrics | |||
public enum PipelineSweeperSupportedMetricsMetrics |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't MetricsMetrics a bit odd sounding? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the auto generated file (the CSharpAPI translation layer) .
The name here results from concatenating the following from PipelineSweeperSupportedMetrics.cs
public sealed class PipelineSweeperSupportedMetrics { public enum Metrics {The name used for the class / enum look fine to me. Any alternate suggestions ?
In reply to: 203265071 [](ancestors = 203265071)
{ Metrics.Dbi.ToString(), new SupportedMetric(FieldNames.PipelineSweeperSupportedMetrics.Dbi, false)} | ||
}; | ||
|
||
public static SupportedMetric GetSupportedMetric(IHostEnvironment env, string metricName) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
env [](start = 74, length = 3)
Is this argument needed? If it's just used for the one check below, we can either have an IExceptionContext instead, or simply use Contracts.CheckNonEmpty(). #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed the argument. Is there a suitable Contracts.Check* API that is used for an Enum argument?
For now, I just removed the Checks since the default for the switch/case will throw a NotSupportedException anyway
In reply to: 203461762 [](ancestors = 203461762)
{ Metrics.Dbi.ToString(), new SupportedMetric(FieldNames.PipelineSweeperSupportedMetrics.Dbi, false)} | ||
}; | ||
|
||
public static SupportedMetric GetSupportedMetric(IHostEnvironment env, string metricName) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
string [](start = 79, length = 6)
You can pass a Metric here instead of a string, this way you won't need the dictionary, and instead just do a switch statement on the parameter. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the clarification. Modified to using switch case as suggested.
In general, regarding usage of switch-case v.s. a dictionary mapping - is there some concrete benefit of using switch-case here or is it mostly a matter of personal preferance ?
In reply to: 203462492 [](ancestors = 203462492)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe to avoid the dictionary & switch statement, you can just keep a whitelist of the maximizing metrics. In the GetSupportedMetric, you can retrieve the boolean from looking up the whitelist.
In reply to: 203543398 [](ancestors = 203543398,203462492)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Senja. I intend to re-factor this code (task 267452).
I have added the metrics issue as part of the same task.
I want to unblock the benchmarking efforts . Is it fine if i keep it as switch-case for now?
In reply to: 203780367 [](ancestors = 203780367,203543398,203462492)
@@ -271,7 +176,9 @@ public enum Metrics | |||
} | |||
|
|||
public AutoMlMlState(IHostEnvironment env, Arguments args) | |||
: this(env, SupportedMetric.ByName(Enum.GetName(typeof(Arguments.Metrics), args.Metric)), args.Engine.CreateComponent(env), | |||
: this(env, | |||
PipelineSweeperSupportedMetrics.GetSupportedMetric(env, Enum.GetName(typeof(PipelineSweeperSupportedMetrics.Metrics), args.Metric)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Enum.GetName [](start = 78, length = 12)
If you change the GetSupportedMetric method to take the Metrics value, then you won't need this, you can just pass args.Metric. #Resolved
Thanks . I didn't realize the description had quotes till you pointed it out In reply to: 405928784 [](ancestors = 405928784) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for the review comments. This PR is now merged. |
* failing test case for multiclass * Refactored PipelineSweeperSupportedMetrics Class; added unit test for MultiClassClassification; refactored out unit tests for the PipelineSweeper * take care of review comments; display transforms/learners + metrics in pipeline * taking care of PR comments + refactor PipelineSweeperRunSummary * taking care of review comments
* failing test case for multiclass * Refactored PipelineSweeperSupportedMetrics Class; added unit test for MultiClassClassification; refactored out unit tests for the PipelineSweeper * take care of review comments; display transforms/learners + metrics in pipeline * taking care of PR comments + refactor PipelineSweeperRunSummary * taking care of review comments
Fixes #538