Added Benchmark performance tests for wikidetoxData #820

Merged: 7 commits merged into dotnet:master from Anipik:BenchMark on Sep 11, 2018
Conversation

@Anipik (Contributor, Author) commented Sep 5, 2018

This PR adds benchmark tests for the AveragedPerceptron and LightGBM classifiers on the WikiDetox dataset.

cc @eerhardt @danmosemsft @sfilipi

@Anipik (Contributor, Author) commented Sep 5, 2018

@dotnet-bot test MachineLearning-CI

@@ -29,7 +29,7 @@ static void Main(string[] args)

 private static IConfig CreateCustomConfig()
     => DefaultConfig.Instance
-        .With(Job.Default
+        .With(Job.VeryLongRun
Member: Do we really want to change this? /cc @adamsitnik

Member: +1

(In reply to: 215314542)

Member: I think that we should not change it.

The current setting is: 1 warmup iteration, up to 20 workload iterations, in 1 process.
LongRun is: up to 30 warmup iterations, up to 500 workload iterations, repeated in 4 dedicated processes.

Contributor: What's the effect of having warm-up iterations?

Are we running in the same process, or another? The normal user has a single run, so I think only the first warm-up iteration is representative. For instance, the WordEmbedding transform only spends time loading the model the first time; this time is significant (seconds to minutes). I'm not sure if the static loading of the word embedding models will be torn down in the current tests.

Static dictionary which keeps the word embedding models:

private static Dictionary<string, WeakReference<Model>> _vocab = new Dictionary<string, WeakReference<Model>>();

If we run benchmarks concurrently, we'll hit the loader lock, which will allow only one model to load at a time per process:

private static object _embeddingsLock = new object();

Can we check, for the WordEmbedding transforms benchmark, whether the model is reloaded from disk for each iteration (it should be)?

Contributor (Author): We should be able to verify this from the time of the warmup iterations; there shouldn't be a large difference between the two cases.

Member: @justinormont every benchmark is executed in a dedicated process. We have 1 warmup iteration to exclude the cost of jitting from the final result.

However, if in real-life scenarios this code is executed only once, it makes sense to have a custom config for it: one that spawns a new process n times, with every process running the benchmark exactly once. This is possible with BenchmarkDotNet.

@justinormont could you say something more about typical scenarios? (how many times train and predict are executed)

@davidwrighton this could also be interesting from the Runtime perspective, especially with tiered jitting enabled (cc @kouvel)
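For reference, a minimal sketch of such a one-run-per-process config (assuming BenchmarkDotNet v0.11.x APIs; the method name and counts are illustrative, not from the PR):

using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Engines;
using BenchmarkDotNet.Jobs;

private static IConfig CreateColdStartConfig()
    => DefaultConfig.Instance
        .With(Job.Default
            .With(RunStrategy.ColdStart) // measure first-run cost, JIT included
            .WithLaunchCount(10)         // spawn 10 dedicated processes...
            .WithIterationCount(1)       // ...each running the workload exactly once
            .WithWarmupCount(0));        // no warmup, mimicking a single real-world run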

Contributor: There isn't just one type of user. What's being tested here is a batch scoring scenario: the user has a pre-trained model and wants to score a dataset against that model. This benchmark is meant to cover the speed of this scoring process (it doesn't return a useful accuracy metric, as there is data leakage from running Train-Train: mostly the same dataset is used for both training and the final eval). The user in this benchmark is likely expected to have a cold start.

Todo: see if the subsequent iterations also load the Word Embedding file (or we won't be including its load-time in the benchmark).

Future pull requests: We should expand the user scenarios tested, and increase the number of datasets so we don't overfit our perf improvements to only the few datasets represented.

Maml.MainAll(cmd);
}

[GlobalSetup]
Member: (nit) Typically "setup" methods are towards the top of the class file. It makes for easier reading to see the stuff that will run first at the top.

string outDir = Path.Combine(currentAssemblyLocation.Directory.FullName, "TestOutput");

s_output_Wiki = Path.Combine(outDir, @"BenchmarkDefForMLNET\WikiDetox\00-baseline,_Bigram+Trichar\");
Directory.CreateDirectory(s_output_Wiki);
Member: Do we really need to save an output model? Is it required? What are we going to do with it?

Member: Oh, I see we use it in WikiDox_Test. It seems like that test is dependent on Preceptron_CV running first. I'm not sure that is a good idea: having one benchmark depend on another benchmark.

(In reply to: 215315428)

Member: @eerhardt is right; a benchmark should not rely on the side effects of any other benchmark. What if the user provides a filter to run only one of them?

@Anipik you can use the Target property of [GlobalSetup]. An example:

[GlobalSetup(Targets = new string[] { nameof(Preceptron_CV), nameof(LightGBM_CV) })]
public void Setup_Preceptron_LightGBM() { }

[GlobalSetup(Target = nameof(WikiDox_Test))]
public void Setup_WikiDox()
{
    Setup_Preceptron_LightGBM();
    Preceptron_CV();
}

Contributor (Author): "It seems like that test is dependent on Preceptron_CV running first."

We could put one of the trained models on Azure and pull that for this benchmark?

Contributor: We can provide a trained model, but this assumes the model will be static. It's likely that the model itself will change over time. For instance, the default tokenizer may be changed, or we may change the workings of the TextTransform.

What we could do is create the model once (not repeatedly) before the test runs for the first time. Then we use it.

Contributor (Author): "What we could do is create the model once (not repeatedly) before the test runs for the first time. Then we use it."

We can do that in the global setup for that test.

[GlobalSetup]
public void Setup()
{
    s_dataPath_Wiki = Program.GetInvariantCultureDataPath("wikiDetoxAnnotated160kRows.tsv");
Member, regarding "wikiDetoxAnnotated160kRows.tsv": There is a way to define datasets in the tests; look at machinelearning/test/Microsoft.ML.TestFramework/Datasets.cs.

Might make sense to add this one to the list.
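For illustration, such an entry might look like the following sketch (field names taken from the TestDataset definition shown later in this PR; the path casing follows build.proj):

public static TestDataset WikiDetox = new TestDataset
{
    name = "WikiDetox",
    trainFilename = "external/wikiDetoxAnnotated160kRows.tsv",
    testFilename = "external/wikiDetoxAnnotated160kRows.tsv"
};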

{
    string modelPath = Path.Combine(s_output_Wiki, @"0.model.fold000.zip");
    string cmd = @"Test data=" + s_dataPath_Wiki + " in=" + modelPath;
    Maml.MainAll(cmd);
Member (@sfilipi, Sep 5, 2018), regarding "Maml.MainAll(cmd);": I don't know much about benchmark tests; do we need to track the time the run starts and ends, or does the framework do it for us? #Resolved

Contributor (Author): The [Benchmark] attribute automatically does that for you. At the end of the run it gives you all the statistics, like mean, median, and memory used.
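A minimal sketch of the mechanism (class and method names here are illustrative, not from the PR):

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class WikiDetoxBenchmarks
{
    [Benchmark] // BenchmarkDotNet times each invocation of this method
    public void Score()
    {
        // ... the Maml.MainAll(...) call being benchmarked ...
    }
}

// Running it prints the summary table (mean, median; allocations with a MemoryDiagnoser):
// BenchmarkRunner.Run<WikiDetoxBenchmarks>();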

}

[Benchmark]
public void WikiDox_Test()
Member: Nit: remove "_Test" from the benchmark name. The names should be short and meaningful.

Contributor (Author): Actually, "test" here refers to the "Test" and "CV" commands given to the Maml.MainAll API.

Member: @Anipik I am sorry, I did not know. Very often people call benchmarks "Test" and later on it's hard to identify them.

Contributor (Author, @Anipik, Sep 5, 2018): Np :). I will try to rename it more appropriately and avoid the "test" keyword.

build.proj (outdated)
@@ -79,6 +79,10 @@
     <TestFile Include="$(MSBuildThisFileDirectory)/test/data/external/winequality-white.csv"
               Url="https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv"
               DestinationFile="$(MSBuildThisFileDirectory)test/data/external/winequality-white.csv" />

+    <TestFile Include="$(MSBuildThisFileDirectory)/test/data/external/wikiDetoxAnnotated160kRows.tsv"
Member: We shouldn't be downloading this file unless we need to run perf tests. It is a rather large file to download, so we should probably put it behind a Condition.
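A sketch of the suggested guard (the IncludeBenchmarkData property name comes from the error message added later in this PR; the Url and DestinationFile values are elided here):

<TestFile Condition="'$(IncludeBenchmarkData)' == 'true'"
          Include="$(MSBuildThisFileDirectory)/test/data/external/wikiDetoxAnnotated160kRows.tsv"
          Url="..."
          DestinationFile="..." />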

{
    internal class EmptyWriter : TextWriter
    {
        private static EmptyWriter _instance = null;
Member: Nit: this could be simplified to internal static readonly EmptyWriter Instance = new EmptyWriter().

  • It would be nice to add a comment explaining why we need this.
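A sketch of the suggested simplification, with such a comment (the stated purpose is my assumption, not confirmed in the thread):

using System.IO;
using System.Text;

// Discards all output; presumably used to silence Maml's console logging
// so writing to the console doesn't pollute the benchmark measurements.
internal sealed class EmptyWriter : TextWriter
{
    internal static readonly EmptyWriter Instance = new EmptyWriter();
    private EmptyWriter() { }
    public override Encoding Encoding => Encoding.UTF8; // required override; value is irrelevant here
}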

@sfilipi previously approved these changes Sep 6, 2018

Member (@sfilipi) left a comment: :shipit:

@sfilipi dismissed their stale review September 6, 2018 20:17: revoking review

}

[GlobalSetup(Target = nameof(wikiDetox))]
public void Setup_wikiDetox()
Member: "Wiki" should always have a capital W in type and method names, ideally.

<ProjectReference Include="..\..\src\Microsoft.ML.KMeansClustering\Microsoft.ML.KMeansClustering.csproj" />
<ProjectReference Include="..\..\src\Microsoft.ML.StandardLearners\Microsoft.ML.StandardLearners.csproj" />
<ProjectReference Include="..\..\src\Microsoft.ML.LightGBM\Microsoft.ML.LightGBM.csproj" />
Member: Good to keep these sorted.

Member: (A good habit; it leads to fewer merge conflicts.)

build.proj (outdated)
@@ -33,8 +33,8 @@
RestoreProjects;
BuildRedist;
BuildNative;
$(TraversalBuildDependsOn);
Member: Is this change necessary?

Contributor (Author): If we don't do this, then we have to build the Microsoft.ML.Benchmark project again before running.

Contributor (Author): dotnet run will build the project again, so it's safe to revert it.

public static TestDataset WikiDetox = new TestDataset
{
    name = "WikiDetox",
    trainFilename = "Input/WikiDetoxAnnotated160kRows.tsv",
Member (@eerhardt, Sep 6, 2018): Should this be "external/WikiDetoxAnnotated160kRows.tsv"?

Member: The difference here is that these Datasets are used throughout the tests, not just in the benchmark tests.

I see you are mapping ..\data\external to Input\ in the Benchmarks.csproj. However, other tests try to read these files from underneath $RepoRoot/test/data, and there is no "Input" folder in that directory. Since this class holds common datasets used throughout the tests, I don't think it makes sense to have this "Input" directory in the file name in the common code. If some other test tried using this dataset, it wouldn't work.

@Anipik (Contributor, Author) commented Sep 7, 2018

Benchmark results:

BenchmarkDotNet=v0.11.1, OS=Windows 10.0.17134.228 (1803/April2018Update/Redstone4)
Intel Xeon CPU E5-1650 v4 3.60GHz, 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=2.1.400
  [Host]     : .NET Core 2.1.2 (CoreCLR 4.6.26628.05, CoreFX 4.6.26629.01), 64bit RyuJIT
  Job-HIDSQF : .NET Core 2.1.2 (CoreCLR 4.6.26628.05, CoreFX 4.6.26629.01), 64bit RyuJIT

Toolchain=netcoreapp2.1  MaxIterationCount=20  WarmupCount=1  
Method        | Mean      | Error     | StdDev    | Extra Metric | Gen 0        | Gen 1        | Gen 2       | Allocated
------------- | --------- | --------- | --------- | ------------ | ------------ | ------------ | ----------- | ------------
Preceptron_CV | 73.816 s  | 0.7231 s  | 0.6764 s  | -            | 4496000.0000 | 1374000.0000 | 114000.0000 | 794.59 KB
LightGBM_CV   | 795.648 s | 11.7280 s | 10.9704 s | -            | 7767000.0000 | 2064000.0000 | 112000.0000 | 780.41 KB
WikiDetox     | 5.313 s   | 0.0253 s  | 0.0224 s  | -            | 175000.0000  | 34000.0000   | 8000.0000   | 240651.7 KB

@danmoseley (Member) commented:

@sfilipi the running time of LightGBM_CV is much too large (with ~15 iterations it will take well over an hour). Is it possible to reduce the input size or otherwise improve the running time while still having a representative scenario?

Similar question for Preceptron_CV: although 73 sec is doable, it would certainly be easier to get accurate numbers if it were smaller so we could do more iterations. Would, say, a smaller input make it run faster while remaining representative?

WikiDetox runtime looks great.

@Anipik (Contributor, Author) commented Sep 7, 2018

BigramAndTrigramBenchmark.Preceptron_CV
Mean = 73.8162 s, StdErr = 0.1746 s (0.24%); N = 15, StdDev = 0.6764 s
Min = 72.2857 s, Q1 = 73.5126 s, Median = 73.7712 s, Q3 = 74.4294 s, Max = 74.7860 s
IQR = 0.9168 s, LowerFence = 72.1373 s, UpperFence = 75.8047 s
ConfidenceInterval = [73.0931 s; 74.5393 s] (CI 99.9%), Margin = 0.7231 s (0.98% of Mean)
Skewness = -0.56, Kurtosis = 2.55, MValue = 2
-------------------- Histogram --------------------
[72.046 s ; 75.026 s) | @@@@@@@@@@@@@@@
---------------------------------------------------

BigramAndTrigramBenchmark.LightGBM_CV:

Mean = 795.6485 s, StdErr = 2.8325 s (0.36%); N = 15, StdDev = 10.9704 s
Min = 780.4932 s, Q1 = 785.8133 s, Median = 793.5253 s, Q3 = 800.2203 s, Max = 818.2505 s
IQR = 14.4070 s, LowerFence = 764.2029 s, UpperFence = 821.8307 s
ConfidenceInterval = [783.9205 s; 807.3764 s] (CI 99.9%), Margin = 11.7280 s (1.47% of Mean)
Skewness = 0.47, Kurtosis = 2.12, MValue = 2
-------------------- Histogram --------------------
[776.601 s ; 804.842 s) | @@@@@@@@@@@@
[804.842 s ; 822.143 s) | @@@
---------------------------------------------------

BigramAndTrigramBenchmark.WikiDetox: 
Mean = 5.3131 s, StdErr = 0.0060 s (0.11%); N = 14, StdDev = 0.0224 s
Min = 5.2615 s, Q1 = 5.3019 s, Median = 5.3078 s, Q3 = 5.3292 s, Max = 5.3533 s
IQR = 0.0272 s, LowerFence = 5.2611 s, UpperFence = 5.3700 s
ConfidenceInterval = [5.2879 s; 5.3384 s] (CI 99.9%), Margin = 0.0253 s (0.48% of Mean)
Skewness = -0.31, Kurtosis = 2.98, MValue = 2
-------------------- Histogram --------------------
[5.253 s ; 5.361 s) | @@@@@@@@@@@@@@
--------------------------------------------------


Here each @ in the histogram refers to one iteration.

@@ -32,7 +32,7 @@ public void Setup_Preceptron_LightGBM()

 if (!File.Exists(s_dataPath_Wiki))
 {
-    throw new FileNotFoundException(s_dataPath_Wiki);
+    throw new FileNotFoundException($"Could not find {s_dataPath_Wiki} Please ensure you have run 'build.cmd -- /t:DownloadExternalTestFiles /p:IncludeBenchmarkData' from the root");
Member: Needs to be /p:IncludeBenchmarkData=true, right?

@@ -163,8 +163,8 @@ public static class TestDatasets
 public static TestDataset WikiDetox = new TestDataset
 {
     name = "WikiDetox",
-    trainFilename = "Input/WikiDetoxAnnotated160kRows.tsv",
-    testFilename = "Input/WikiDetoxAnnotated160kRows.tsv"
+    trainFilename = "external/WikiDetoxAnnotated160kRows.tsv",
Member: I believe the casing here is different from what we use in build.proj.

Here it is:
WikiDetoxAnnotated160kRows.tsv
In build.proj it is:
wikiDetoxAnnotated160kRows.tsv

When running on a case-sensitive file system (i.e. Linux), they need to be the same.

private static string s_dataPath_Wiki;
private static string s_modelPath_Wiki;

[GlobalSetup(Targets = new string[] { nameof(Preceptron_CV), nameof(LightGBM_CV) })]
Member: This setup needs to run for all the tests, right? It initializes the s_dataPath_Wiki variable, which appears to be needed even in the WikiDetox benchmark.

Contributor (Author): @eerhardt only one [GlobalSetup] can be applied to one target.

Member: Side note: please do remember that BenchmarkDotNet runs all these benchmarks in dedicated processes. So if BenchmarkA has SetupMethodA that initializes a static field, another benchmark from the same class won't have that static field initialized, and its own setup needs to initialize it as well.

This is why, when I was doing a cleanup, I made all of the static fields instance fields. So good practice is to avoid static fields, to avoid confusion.
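A sketch of what that implies for this class (field and method names are illustrative; the Targets usage mirrors the attribute shown earlier in the thread):

[GlobalSetup(Targets = new[] { nameof(Preceptron_CV), nameof(LightGBM_CV) })]
public void SetupTraining()
{
    // Runs in the processes that host the training benchmarks.
    _dataPathWiki = Program.GetInvariantCultureDataPath("wikiDetoxAnnotated160kRows.tsv");
}

[GlobalSetup(Target = nameof(WikiDetox))]
public void SetupScoring()
{
    // The scoring benchmark runs in a different process, so it must initialize
    // the same field itself; SetupTraining's side effects are not visible here.
    _dataPathWiki = Program.GetInvariantCultureDataPath("wikiDetoxAnnotated160kRows.tsv");
}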

Member (@eerhardt) left a comment: :shipit:

}

[Benchmark]
public void Preceptron_CV()
Contributor: Can we name these benchmarks so the output shows a more useful name? Or is there another way to get a more useful name [instead of nameof()]?

Currently, "Perceptron_CV" doesn't describe this benchmark well.
Perhaps: "Wiki Detox using CV Bigrams+Trichargram with AveragedPerceptron"
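One possible mechanism, as a sketch (BenchmarkDotNet's [Benchmark] attribute has a Description property that the results table displays; the string is the suggestion above):

[Benchmark(Description = "Wiki Detox using CV Bigrams+Trichargram with AveragedPerceptron")]
public void Preceptron_CV()
{
    // ... unchanged benchmark body ...
}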

Contributor: How does this format sound: [mode]_[task]_[dataset]_[featurization]_[learner]?

Mode is one of: CV / TrainTest / Test / etc.
Task is one of: MulticlassClassification / Regression / BinaryClassification / Ranking / AnomalyDetection / Clustering / MultiOutputRegression / etc.

So our benchmark names would then be:

  • CV_Multiclass_WikiDetox_BigramsAndTrichar_OVAAveragedPerceptron
  • CV_Multiclass_WikiDetox_BigramsAndTrichar_LightGBMMulticlass
  • CV_Multiclass_WikiDetox_WordEmbeddings_OVAAveragedPerceptron
  • CV_Multiclass_WikiDetox_WordEmbeddings_SDCAMC
  • Test_Multiclass_WikiDetox_BigramsAndTrichar_OVAAveragedPerceptron (this could better convey that we are specifically benchmarking the bulk scoring speed, whereas the above benchmarks mainly training speeds)

We'll need a naming convention as we expand to additional datasets and pipelines.

@justinormont (Contributor) commented Sep 10, 2018

@danmosemsft for your questions about speed: #820 (comment)

> @sfilipi the running time of LightGBM_CV is much too large (with ~15 iterations it will take well over an hour). Is it possible to reduce the input size or otherwise improve the running time while still having a representative scenario?
>
> Similar question for Preceptron_CV: although 73 sec is doable, it would certainly be easier to get accurate numbers if it were smaller so we could do more iterations. Would, say, a smaller input make it run faster while remaining representative?
>
> WikiDetox runtime looks great.

These benchmarks are created to be representative of customer tasks. The names are currently a bit odd: Preceptron_CV and LightGBM_CV are model-training benchmarks, while WikiDetox is a scoring-speed benchmark. Since the last one (currently named WikiDetox) only does scoring, it's fast. Specifically, I would NOT truncate the datasets, or they would no longer be representative.

The longer term solution, I think, is greatly increasing the number of datasets/pipelines represented. Statistical significance should be obtained across many datasets; currently we are getting statistical significance of the runtimes by re-running the same test many times. While this creates a representative number for this dataset and pipeline, it over-focuses on this dataset/pipeline and will overfit our perf improvements to improve a small number of datasets/pipelines.

We should strive to make overall improvements across many datasets (and tasks). Some datasets will get slower, but on aggregate-metrics we will see gains.

Contributor (@justinormont) left a comment: Looks good, but I'd fix these (see the review comments above) before pushing into the repo.

@Anipik (Contributor, Author) commented Sep 10, 2018

> Better error message if the dataset is missing: #820 (comment) (nice to have)

@justinormont what other information do you want me to add here? Currently we throw this:

throw new FileNotFoundException($"Could not find {_dataPath_Wiki} Please ensure you have run 'build.cmd -- /t:DownloadExternalTestFiles /p:IncludeBenchmarkData=true' from the root");

Contributor (@justinormont) left a comment: LGTM

@justinormont justinormont merged commit f0f04ef into dotnet:master Sep 11, 2018
@Anipik Anipik deleted the BenchMark branch October 10, 2018 18:23
@ghost ghost locked as resolved and limited conversation to collaborators Mar 29, 2022