Skip to content

PipelineSweeperMacro for Multi-Class Classification #539

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Jul 19, 2018
Merged

PipelineSweeperMacro for Multi-Class Classification #539

merged 5 commits into from
Jul 19, 2018

Conversation

abgoswam
Copy link
Member

@abgoswam abgoswam commented Jul 16, 2018

Fixes #538

  • The PipelineSweeper currently only supports AUC as the optimization metric. Trying to optimize on any other metric throws an exception.
  • Need to fix the way metrics are handled by the PipelineSweeper Macro.
  • Added a test case for MultiClass classification using the PipelineSweeper.

"sep=, col=age:R4:0 col=workclass:TX:1 col=fnlwgt:R4:2 col=education:TX:3 col=education_num:R4:4 col=marital_status:TX:5 col=occupation:TX:6 " +
"col=relationship:TX:7 col=ethnicity:TX:8 col=sex:TX:9 col=Features:R4:10-12 col=native_country:TX:13 col=IsOver50K_:R4:14 header=+";
var inputFileTrain = new SimpleFileHandle(Env, pathData, false, false);
#pragma warning disable 0618
Copy link
Member

@sfilipi sfilipi Jul 16, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this necessary? #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

disables warning generated because of using ImportText (obsolete) instead of Textloader.

Perhaps in a future PR, we can update all the unit tests to use TextLoader. I can create a separate bug to track that.


In reply to: 202746784 [](ancestors = 202746784)

var runner = new GraphRunner(Env, catalog, graphJson[FieldNames.Nodes] as JArray);
runner.SetInput("TrainingData", datasetTrain);
runner.SetInput("TestingData", datasetTest);
runner.RunAll();
Copy link
Member

@sfilipi sfilipi Jul 16, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

runner.RunAll(); [](start = 12, length = 16)

do you think the test should have a check on the results, make sure they are the same all the time, at least. #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the unit test.


In reply to: 202747140 [](ancestors = 202747140)

… MultiClassClassification; refactored out unit tests for the PipelineSweeper
@dnfclas
Copy link

dnfclas commented Jul 17, 2018

CLA assistant check
All CLA requirements met. #Resolved

@abgoswam abgoswam changed the title WIP . Fixes PipelineSweeperMacro for Multi-Class Classification Fixes PipelineSweeperMacro for Multi-Class Classification Jul 17, 2018
@abgoswam abgoswam changed the title Fixes PipelineSweeperMacro for Multi-Class Classification PipelineSweeperMacro for Multi-Class Classification Jul 17, 2018
AutoInference.AutoMlMlState amlsOut = (AutoInference.AutoMlMlState)exp.GetOutput(output.State);
Assert.NotNull(amlsOut);
Assert.Equal(amlsOut.GetAllEvaluatedPipelines().Length, numIterations);
Assert.True(amlsOut.GetBestPipeline().PerformanceSummary.MetricValue > 0.1);
Copy link
Member

@sfilipi sfilipi Jul 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assert.True(amlsOut.GetBestPipeline().PerformanceSummary.MetricValue > 0.1); [](start = 15, length = 77)

does it make more sense to compare against a fixed number? that might be a stronger test.. #Resolved

ThreshAtNumPos = 25,
Nmi = 26,
AvgMinScore = 27,
Dbi = 28
Copy link
Member

@sfilipi sfilipi Jul 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

= 28 [](start = 16, length = 4)

the numbers are implied, if they are just sequential. #Resolved

Copy link
Member Author

@abgoswam abgoswam Jul 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like they are auto generated by the CSharp API generator. We did not specify the numbers in the Enum.


In reply to: 203217022 [](ancestors = 203217022)

Copy link
Contributor

@justinormont justinormont Jul 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we serialize the enum, I think we'd want a fixed number. This could be serialized as a state object for the sweep so that the user can resume the sweep w/ the history of the sweep stored on disk. #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if this is a general comment , or something that needs to be addressed ?

Btw, this is is auto generated file . The Enum you see here is defined in PipelineSweeperSupportedMetrics.cs


In reply to: 203267499 [](ancestors = 203267499)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a topic to consider for the future.
When we serialize the enum, we'll either have to serialize the numeric form or the text form. A numeric form will be more stable, and for that we'd have to assign the numbers in the primary definition of the enum.

/// <summary>
/// Mapp Enum Metrics to a SupportedMetric
/// </summary>
private static readonly Dictionary<string, SupportedMetric> _map = new Dictionary<string, SupportedMetric>
Copy link
Member

@sfilipi sfilipi Jul 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_map [](start = 67, length = 5)

if the only usage for this is to contain a list of known metrics, the enum might be enough.
Take a look at Enum.Parse. You can move the construction of new SupportedMetric in the GetSupportedMetric, if the conversion succeeds. #Resolved

Copy link
Member Author

@abgoswam abgoswam Jul 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To specify the metric argument in the Macro we are using an Enum. That's the list of metrics you see in he CSharpApi.cs file. However, for parsing the results from the train test macro we need to use the metric names used by the ML.NET evaluators.

e.g. AcuracyMicro (Enum Argument) v/s 'Accuracy(micro-avg)' (Evaluator metric). This dictionary maps the two. Does that clarify why we need this mapping?


In reply to: 203217600 [](ancestors = 203217600)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Senja meant that the conversion from the enum to a string is not needed (see my comment below on the GetSupportedMetric method).


In reply to: 203220227 [](ancestors = 203220227,203217600)


using (var ch = _host.Start("Suggested Pipeline"))
{
ch.Info($"PipelineSweeper Pipeline Id : {pipeline.UniqueId}");
Copy link
Contributor

@justinormont justinormont Jul 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also print an increasing pipeline number? This lets the user see the progress of the sweeping. #Resolved

}
ch.Info($"PipelineSweeper Learner : {pipeline.Learner}");
ch.Info($"PipelineSweeper Train Metric Value: {pipeline.PerformanceSummary.TrainingMetricValue}");
ch.Info($"PipelineSweeper Test Metric Value: {pipeline.PerformanceSummary.MetricValue}");
Copy link
Contributor

@justinormont justinormont Jul 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're printing the primary sweeping metric here. How much work would it be to print all of the pipeline's metrics?

Generally when hand sweeping over a dataset, I'm watching multiple metrics. This is important as one metric often increases at the detriment of others.

For instance, micro & macro accuracy often trade-off as some models end-up better at smaller classes or the majority classes. The best models improve the metrics in unison. #Pending

Copy link
Member Author

@abgoswam abgoswam Jul 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea. I need to explore how we can expose this functionality for the PipelineSweeper. One thing to note is that the other metrics would depend on the TrainerKind.

I have created a separate task (269949) for this. Would not want to block this PR on this feature ask.


In reply to: 203264334 [](ancestors = 203264334)

@@ -15626,36 +15626,37 @@ public sealed class UniformRandomAutoMlEngine : AutoMlEngine

public abstract class AutoMlStateBase : ComponentKind {}

public enum AutoInferenceAutoMlMlStateArgumentsMetrics
public enum PipelineSweeperSupportedMetricsMetrics
Copy link
Contributor

@justinormont justinormont Jul 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't MetricsMetrics a bit odd sounding? #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the auto generated file (the CSharpAPI translation layer) .

The name here results from concatenating the following from PipelineSweeperSupportedMetrics.cs

public sealed class PipelineSweeperSupportedMetrics { public enum Metrics {

The name used for the class / enum look fine to me. Any alternate suggestions ?


In reply to: 203265071 [](ancestors = 203265071)

@shauheen
Copy link
Contributor

shauheen commented Jul 18, 2018

@abgoswam why do you have fixes issue number in quotes? That fails to make it actually associate this with #538 #Resolved

{ Metrics.Dbi.ToString(), new SupportedMetric(FieldNames.PipelineSweeperSupportedMetrics.Dbi, false)}
};

public static SupportedMetric GetSupportedMetric(IHostEnvironment env, string metricName)
Copy link

@yaeldekel yaeldekel Jul 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

env [](start = 74, length = 3)

Is this argument needed? If it's just used for the one check below, we can either have an IExceptionContext instead, or simply use Contracts.CheckNonEmpty(). #Resolved

Copy link
Member Author

@abgoswam abgoswam Jul 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the argument. Is there a suitable Contracts.Check* API that is used for an Enum argument?

For now, I just removed the Checks since the default for the switch/case will throw a NotSupportedException anyway


In reply to: 203461762 [](ancestors = 203461762)

{ Metrics.Dbi.ToString(), new SupportedMetric(FieldNames.PipelineSweeperSupportedMetrics.Dbi, false)}
};

public static SupportedMetric GetSupportedMetric(IHostEnvironment env, string metricName)
Copy link

@yaeldekel yaeldekel Jul 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

string [](start = 79, length = 6)

You can pass a Metric here instead of a string, this way you won't need the dictionary, and instead just do a switch statement on the parameter. #Resolved

Copy link
Member Author

@abgoswam abgoswam Jul 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the clarification. Modified to using switch case as suggested.

In general, regarding usage of switch-case v.s. a dictionary mapping - is there some concrete benefit of using switch-case here or is it mostly a matter of personal preferance ?


In reply to: 203462492 [](ancestors = 203462492)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe to avoid the dictionary & switch statement, you can just keep a whitelist of the maximizing metrics. In the GetSupportedMetric, you can retrieve the boolean from looking up the whitelist.


In reply to: 203543398 [](ancestors = 203543398,203462492)

Copy link
Member Author

@abgoswam abgoswam Jul 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Senja. I intend to re-factor this code (task 267452).
I have added the metrics issue as part of the same task.

I want to unblock the benchmarking efforts . Is it fine if i keep it as switch-case for now?


In reply to: 203780367 [](ancestors = 203780367,203543398,203462492)

@@ -271,7 +176,9 @@ public enum Metrics
}

public AutoMlMlState(IHostEnvironment env, Arguments args)
: this(env, SupportedMetric.ByName(Enum.GetName(typeof(Arguments.Metrics), args.Metric)), args.Engine.CreateComponent(env),
: this(env,
PipelineSweeperSupportedMetrics.GetSupportedMetric(env, Enum.GetName(typeof(PipelineSweeperSupportedMetrics.Metrics), args.Metric)),
Copy link

@yaeldekel yaeldekel Jul 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Enum.GetName [](start = 78, length = 12)

If you change the GetSupportedMetric method to take the Metrics value, then you won't need this, you can just pass args.Metric. #Resolved

@abgoswam
Copy link
Member Author

Thanks . I didn't realize the description had quotes till you pointed it out


In reply to: 405928784 [](ancestors = 405928784)

Copy link
Member

@sfilipi sfilipi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

Copy link

@yaeldekel yaeldekel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@abgoswam abgoswam merged commit 8ce2a23 into dotnet:master Jul 19, 2018
@abgoswam
Copy link
Member Author

thanks for the review comments. This PR is now merged.

eerhardt pushed a commit to eerhardt/machinelearning that referenced this pull request Jul 27, 2018
* failing test case for multiclass

* Refactored PipelineSweeperSupportedMetrics Class; added unit test for MultiClassClassification; refactored out unit tests for the PipelineSweeper

* take care of review comments; display transforms/learners + metrics in pipeline

* taking care of PR comments + refactor PipelineSweeperRunSummary

* taking care of review comments
codemzs pushed a commit to codemzs/machinelearning that referenced this pull request Aug 1, 2018
* failing test case for multiclass

* Refactored PipelineSweeperSupportedMetrics Class; added unit test for MultiClassClassification; refactored out unit tests for the PipelineSweeper

* take care of review comments; display transforms/learners + metrics in pipeline

* taking care of PR comments + refactor PipelineSweeperRunSummary

* taking care of review comments
@ghost ghost locked as resolved and limited conversation to collaborators Mar 29, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PipelineSweeping fails for MultiClass classification
6 participants