Added Benchmark performance tests for wikidetoxData #820
Conversation
@dotnet-bot test MachineLearning-CI
@@ -29,7 +29,7 @@ static void Main(string[] args)

 private static IConfig CreateCustomConfig()
     => DefaultConfig.Instance
-        .With(Job.Default
+        .With(Job.VeryLongRun
Do we really want to change this? /cc @adamsitnik
I think that we should not change it.
What the current setting is: 1 warmup iteration, up to 20 workload iterations, in 1 process.
What LongRun is: up to 30 warmup iterations, up to 500 workload iterations, repeated in 4 dedicated processes.
What's the effect of having warm-up iterations?
Are we running in the same process, or another? The normal user has a single run, so I think only the first warm-up iteration is representative. For instance, the WordEmbedding transform only spends time loading the model the first time; this load time is significant (seconds to minutes). I'm not sure whether the static loading of the word embedding models will be torn down in the current tests.
Static dictionary which keeps the word embedding models:
private static Dictionary<string, WeakReference<Model>> _vocab = new Dictionary<string, WeakReference<Model>>();
If we run benchmarks concurrently, we'll hit the loader lock, which limits loading to one model per process at a time:
private static object _embeddingsLock = new object();
Can we check, for the WordEmbedding transforms benchmark, whether the model is reloaded from disk for each iteration (it should be)?
We should be able to verify this from the time of the warmup iterations; there shouldn't be a large difference between the two cases.
@justinormont every benchmark is executed in a dedicated process. We have 1 warmup iteration to exclude the cost of Jitting from the final result.
However, if in real-life scenarios this code is executed only once, it makes sense to have a custom config for it, which is going to spawn a new process n times, and every process is going to run the benchmark exactly once. This is possible with BenchmarkDotNet.
@justinormont could you say something more about typical scenarios? (how many times train and predict are executed)
@davidwrighton this could be also interesting from the Runtime perspective, especially with tiered jitting enabled (cc @kouvel)
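Such a cold-start config could look roughly like this (a sketch only; the method names assume the BenchmarkDotNet 0.11.x API used in this PR, and the counts are illustrative):

```csharp
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Engines;
using BenchmarkDotNet.Jobs;

// Sketch: run each benchmark exactly once per process, in several freshly
// launched processes, so jitting and one-time static initialization are
// included in every measurement instead of being excluded by warmup.
private static IConfig CreateColdStartConfig()
    => DefaultConfig.Instance
        .With(Job.Default
            .With(RunStrategy.ColdStart) // no warmup; measure the first run
            .WithLaunchCount(10)         // 10 dedicated processes
            .WithIterationCount(1));     // one workload iteration each
```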
There isn't just one type of user. What's being tested here is a batch scoring scenario: the user has a pre-trained model and wants to score a dataset against it. This benchmark is meant to cover the speed of that scoring process (it doesn't return a useful accuracy metric, as there is data leakage from running Train-Train, i.e. mostly the same dataset is used for both training and the final eval). The user in this benchmark is expected to have a cold start.
Todo: see if the subsequent iterations also load the Word Embedding file (otherwise we won't be including its load time in the benchmark).
Future pull requests: we should expand the user scenarios tested, and increase the number of datasets so we don't overfit our perf improvements to only the few represented datasets.
Maml.MainAll(cmd);
}

[GlobalSetup]
(nit) typically "setup" methods are towards the top of the class file. It makes for easier reading to see the stuff that will run first at the top.
string outDir = Path.Combine(currentAssemblyLocation.Directory.FullName, "TestOutput");

s_output_Wiki = Path.Combine(outDir, @"BenchmarkDefForMLNET\WikiDetox\00-baseline,_Bigram+Trichar\");
Directory.CreateDirectory(s_output_Wiki);
Do we really need to save an output model? Is it required? What are we going to do with it?
Oh, I see we use it in WikiDox_Test. It seems like that test is dependent on Preceptron_CV running first. I'm not sure that is a good idea - to have one benchmark depend on another benchmark.
@eerhardt is right, a benchmark should not rely on the side effects of any other benchmark. What if the user provides a filter to run only one of them?
@Anipik you can use the Target property of [GlobalSetup]. An example:

[GlobalSetup(Targets = new string[] { nameof(Preceptron_CV), nameof(LightGBM_CV) })]
public void Setup_Preceptron_LightGBM() { }

[GlobalSetup(Target = nameof(WikiDox_Test))]
public void Setup_WikiDox()
{
    Setup_Preceptron_LightGBM();
    Preceptron_CV();
}
It seems like that test is dependent on Preceptron_CV running first.
We could put one of the trained models on Azure and pull it down for this benchmark?
We can provide a trained model, but this assumes the model will be static. It's likely that the model itself will change over time. For instance, the default tokenizer might be changed, or we may change the workings of the TextTransform.
What we could do is create the model once (not repeatedly) before the test runs for the first time. Then we use it.
What we could do is create the model once (not repeatedly) before the test runs for the first time. Then we use it.
We can do that in the global setup for that test.
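For instance (a sketch only; the Maml command line is illustrative, not the exact one used in this PR):

```csharp
// Sketch: make the scoring benchmark self-contained by training the model
// once in its own [GlobalSetup], instead of depending on the Preceptron_CV
// benchmark having run first.
[GlobalSetup(Target = nameof(WikiDox_Test))]
public void Setup_WikiDox_Test()
{
    Setup(); // the existing shared setup: resolves s_dataPath_Wiki, output dirs, etc.
    string modelPath = Path.Combine(s_output_Wiki, "0.model.fold000.zip");
    if (!File.Exists(modelPath))
    {
        // Train and save the model a single time; the benchmark iterations
        // then only load and score it.
        Maml.MainAll(@"CV data=" + s_dataPath_Wiki + " out=" + modelPath);
    }
}
```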
[GlobalSetup]
public void Setup()
{
    s_dataPath_Wiki = Program.GetInvariantCultureDataPath("wikiDetoxAnnotated160kRows.tsv");
wikiDetoxAnnotated160kRows.tsv
There is a way to define datasets in the tests: look at machinelearning/test/Microsoft.ML.TestFramework/Datasets.cs.
Might make sense to add this one to the list.
{
    string modelPath = Path.Combine(s_output_Wiki, @"0.model.fold000.zip");
    string cmd = @"Test data=" + s_dataPath_Wiki + " in=" + modelPath;
    Maml.MainAll(cmd);
Maml.MainAll(cmd);
I don't know much about benchmark tests: do we need to track the time the run starts and ends, or does the framework do it for us? #Resolved
The [Benchmark] attribute automatically does that for you. At the end of the run it gives you all the statistics, like mean, median, and memory used.
}

[Benchmark]
public void WikiDox_Test()
nit: remove "_Test" from the benchmark name. The names should be short and meaningful.
Actually, "test" here refers to the "Test" and "CV" commands given to the Maml.MainAll API.
@Anipik I am sorry, I did not know. Very often people call benchmarks "Test" and later on it's hard to identify them.
Np :). I will try to rename it more appropriately and avoid the "test" keyword.
build.proj
@@ -79,6 +79,10 @@
 <TestFile Include="$(MSBuildThisFileDirectory)/test/data/external/winequality-white.csv"
           Url="https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv"
           DestinationFile="$(MSBuildThisFileDirectory)test/data/external/winequality-white.csv" />

+<TestFile Include="$(MSBuildThisFileDirectory)/test/data/external/wikiDetoxAnnotated160kRows.tsv"
We shouldn't be downloading this file unless we need to run perf tests. It is a rather large file to download, so we should probably put this behind a Condition.
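For example, the download entry could be gated on an opt-in property (a sketch; the property name IncludeBenchmarkData is borrowed from the error message discussed later in this thread, and the Url attribute is omitted since it isn't shown in the diff above):

```xml
<!-- Only download the large benchmark dataset when explicitly requested, e.g.
     build.cmd -- /t:DownloadExternalTestFiles /p:IncludeBenchmarkData=true -->
<TestFile Condition="'$(IncludeBenchmarkData)' == 'true'"
          Include="$(MSBuildThisFileDirectory)/test/data/external/wikiDetoxAnnotated160kRows.tsv"
          DestinationFile="$(MSBuildThisFileDirectory)test/data/external/wikiDetoxAnnotated160kRows.tsv" />
```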
{
    internal class EmptyWriter : TextWriter
    {
        private static EmptyWriter _instance = null;
nit: this could be simplified to internal static readonly EmptyWriter Instance = new EmptyWriter(); it would also be nice to add a comment explaining why we need it.
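A full sketch of the suggested shape (the stated purpose - discarding console output during benchmark runs - is an assumption):

```csharp
using System.IO;
using System.Text;

// A TextWriter that discards everything written to it. Assumed purpose:
// suppress console output during benchmark runs so console I/O does not
// distort the measured times.
internal sealed class EmptyWriter : TextWriter
{
    internal static readonly EmptyWriter Instance = new EmptyWriter();
    private EmptyWriter() { }
    public override Encoding Encoding => Encoding.UTF8;
    public override void Write(char value) { } // drop all output
}
```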
}

[GlobalSetup(Target = nameof(wikiDetox))]
public void Setup_wikiDetox()
Wiki should always have a capital W in type and method names, ideally
<ProjectReference Include="..\..\src\Microsoft.ML.KMeansClustering\Microsoft.ML.KMeansClustering.csproj" />
<ProjectReference Include="..\..\src\Microsoft.ML.StandardLearners\Microsoft.ML.StandardLearners.csproj" />
<ProjectReference Include="..\..\src\Microsoft.ML.LightGBM\Microsoft.ML.LightGBM.csproj" />
Good to keep sorted
(A good habit -- it leads to fewer merge conflicts.)
build.proj
@@ -33,8 +33,8 @@
 RestoreProjects;
 BuildRedist;
 BuildNative;
 $(TraversalBuildDependsOn);
Is this change necessary?
If we don't do this, then we have to build the Microsoft.ML.Benchmark project again before running.
dotnet run will build the project again, so it's safe to revert this.
public static TestDataset WikiDetox = new TestDataset
{
    name = "WikiDetox",
    trainFilename = "Input/WikiDetoxAnnotated160kRows.tsv",
Should this be "external/WikiDetoxAnnotated160kRows.tsv"?
We are doing a similar thing for all the other files in the benchmark project https://github.com/dotnet/machinelearning/pull/820/files/d2fe71419e148aebd916626889caccb7fd25d845#diff-ea979ea573e3bccc8a3837b5e09743ffL28
The difference here is that these Datasets are used throughout the tests, not just in the benchmark tests.
I see you are mapping ..\data\external to Input\ in the Benchmarks.csproj. However, other tests try to read these files from underneath $RepoRoot/test/data, and there is no "Input" folder in that directory. Since this class holds common datasets used throughout the tests, I don't think it makes sense to have this "Input" directory in the file name in the common code. If some other test tried using this dataset, it wouldn't work.
Benchmarks Result

BenchmarkDotNet=v0.11.1, OS=Windows 10.0.17134.228 (1803/April2018Update/Redstone4)
Intel Xeon CPU E5-1650 v4 3.60GHz, 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=2.1.400
[Host] : .NET Core 2.1.2 (CoreCLR 4.6.26628.05, CoreFX 4.6.26629.01), 64bit RyuJIT
Job-HIDSQF : .NET Core 2.1.2 (CoreCLR 4.6.26628.05, CoreFX 4.6.26629.01), 64bit RyuJIT
Toolchain=netcoreapp2.1 MaxIterationCount=20 WarmupCount=1
@sfilipi the running time of LightGBM_CV is much too large (with ~15 iterations it will take well over an hour). Is it possible to reduce the input size or otherwise improve the running time while still having a representative scenario? Similar question for Preceptron_CV: although 73 sec is doable, it would certainly be easier to get accurate numbers if it were smaller, so we could do more iterations. Would, say, a smaller input make it run faster while remaining representative? WikiDetox runtime looks great.
@@ -32,7 +32,7 @@ public void Setup_Preceptron_LightGBM()

 if (!File.Exists(s_dataPath_Wiki))
 {
-    throw new FileNotFoundException(s_dataPath_Wiki);
+    throw new FileNotFoundException($"Could not find {s_dataPath_Wiki} Please ensure you have run 'build.cmd -- /t:DownloadExternalTestFiles /p:IncludeBenchmarkData' from the root");
Needs to be /p:IncludeBenchmarkData=true, right?
@@ -163,8 +163,8 @@ public static class TestDatasets
 public static TestDataset WikiDetox = new TestDataset
 {
     name = "WikiDetox",
-    trainFilename = "Input/WikiDetoxAnnotated160kRows.tsv",
-    testFilename = "Input/WikiDetoxAnnotated160kRows.tsv"
+    trainFilename = "external/WikiDetoxAnnotated160kRows.tsv",
I believe the casing here is different than what we use in build.proj.
Here it is: WikiDetoxAnnotated160kRows.tsv
In build.proj it is: wikiDetoxAnnotated160kRows.tsv
When running on a case-sensitive file system (i.e. Linux), they need to be the same.
private static string s_dataPath_Wiki;
private static string s_modelPath_Wiki;

[GlobalSetup(Targets = new string[] { nameof(Preceptron_CV), nameof(LightGBM_CV) })]
This setup needs to run for all the tests, right? It initializes the s_dataPath_Wiki variable, which appears to be needed even in the WikiDetox benchmark.
@eerhardt only one [GlobalSetup] can be applied to a given target.
Side note: please remember that BenchmarkDotNet runs all these benchmarks in dedicated processes, so if BenchmarkA has SetupMethodA and it initializes a static field, another benchmark from the same class won't have that static field initialized, and its own setup needs to initialize it as well.
This is why, when I was doing a cleanup, I made all of the static fields instance fields. A good practice is to avoid static fields, to avoid this confusion.
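A sketch of that practice (hypothetical class name and command; mirrors the benchmarks in this PR):

```csharp
// Sketch: prefer instance fields in benchmark classes. Every benchmark runs
// in a dedicated process, so each process constructs its own instance and
// runs its own [GlobalSetup]; instance fields make it obvious that no state
// carries over between benchmarks.
public class WikiDetoxBench
{
    private string _dataPath; // set per process, never shared

    [GlobalSetup]
    public void Setup()
        => _dataPath = Program.GetInvariantCultureDataPath("wikiDetoxAnnotated160kRows.tsv");

    [Benchmark]
    public void Score() => Maml.MainAll(@"Test data=" + _dataPath);
}
```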
}

[Benchmark]
public void Preceptron_CV()
Can we name these benchmarks so the output shows a more useful name? Or is there another way to have a more useful name [instead of nameof()]?
Currently, "Perceptron_CV" doesn't describe this benchmark well.
Perhaps: "Wiki Detox using CV Bigrams+Trichargram with AveragedPerceptron"
How does this format sound: [mode]_[task]_[dataset]_[featurization]_[learner]
Mode is one of: CV / TrainTest / Test / etc.
Task is one of: MulticlassClassification / Regression / BinaryClassification / Ranking / AnomalyDetection / Clustering / MultiOutputRegression / etc.
So our benchmark names would then be:
- CV_Multiclass_WikiDetox_BigramsAndTrichar_OVAAveragedPerceptron
- CV_Multiclass_WikiDetox_BigramsAndTrichar_LightGBMMulticlass
- CV_Multiclass_WikiDetox_WordEmbeddings_OVAAveragedPerceptron
- CV_Multiclass_WikiDetox_WordEmbeddings_SDCAMC
- Test_Multiclass_WikiDetox_BigramsAndTrichar_OVAAveragedPerceptron (this could better convey that we are specifically benchmarking the bulk scoring speed, whereas the above benchmarks mainly training speeds)
We'll need a naming convention as we expand to additional datasets and pipelines.
@danmosemsft for your questions about speed: #820 (comment)
These benchmarks are created to be representative of customer tasks. The names are currently a bit odd: the Preceptron_CV & LightGBM_CV are model training benchmarks, while WikiDetox is a scoring speed benchmark. Since the last one (currently named WikiDetox) is only scoring, it's fast. Specifically, I would NOT truncate the datasets, or they would no longer be representative. The longer term solution, I think, is greatly increasing the number of datasets/pipelines represented. Statistical significance should be obtained across many datasets; currently we are getting statistical significance of the runtimes by re-running the same test many times. While this creates a representative number for this dataset and pipeline, it over-focuses on this dataset/pipeline and will overfit our perf improvements to improve a small number of datasets/pipelines. We should strive to make overall improvements across many datasets (and tasks). Some datasets will get slower, but on aggregate metrics we will see gains.
Looks good, but I'd fix these before pushing into the repo:
- Re-enable the required Wiki Detox dataset file in the CDN: #820 (comment)
- Less confusing names for the benchmarks: #820 (comment)
- Better error message if the dataset is missing: #820 (comment) (nice to have)
@justinormont what other information do you want me to add here? Currently we throw this:
LGTM
This PR adds benchmark tests for the AveragedPerceptron and LightGBM classifiers on the WikiDetox dataset.
cc @eerhardt @danmosemsft @sfilipi