
Add NaiveBayes sample & docs #3246


Merged
merged 7 commits on Apr 16, 2019

Conversation

@ganik (Member) commented Apr 8, 2019

repros #3226

@ganik ganik requested a review from codemzs April 8, 2019 21:49
codecov bot commented Apr 8, 2019

Codecov Report

Merging #3246 into master will increase coverage by <.01%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #3246      +/-   ##
==========================================
+ Coverage   72.61%   72.62%   +<.01%     
==========================================
  Files         804      807       +3     
  Lines      145025   145080      +55     
  Branches    16213    16213              
==========================================
+ Hits       105314   105366      +52     
- Misses      35294    35297       +3     
  Partials     4417     4417
Flag Coverage Δ
#Debug 72.62% <ø> (ø) ⬆️
#production 68.17% <ø> (+0.01%) ⬆️
#test 88.93% <ø> (ø) ⬆️
Impacted Files Coverage Δ
src/Microsoft.ML.Maml/MAML.cs 24.75% <0%> (-1.46%) ⬇️
src/Microsoft.ML.Transforms/Text/LdaTransform.cs 89.26% <0%> (-0.63%) ⬇️
...soft.ML.Data/DataLoadSave/Text/TextLoaderCursor.cs 84.7% <0%> (-0.21%) ⬇️
...OnnxTransformer.StaticPipe/OnnxStaticExtensions.cs 100% <0%> (ø)
...L.DnnImageFeaturizer.ResNet18/ResNet18Extension.cs 100% <0%> (ø)
...r.StaticPipe/DnnImageFeaturizerStaticExtensions.cs 100% <0%> (ø)
...ML.Transforms/Text/StopWordsRemovingTransformer.cs 86.26% <0%> (+0.15%) ⬆️
...StandardTrainers/Standard/LinearModelParameters.cs 60.31% <0%> (+0.26%) ⬆️
...soft.ML.TestFramework/DataPipe/TestDataPipeBase.cs 74.03% <0%> (+0.33%) ⬆️

codecov bot commented Apr 8, 2019

Codecov Report

Merging #3246 into master will increase coverage by 0.07%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #3246      +/-   ##
==========================================
+ Coverage   72.61%   72.69%   +0.07%     
==========================================
  Files         804      807       +3     
  Lines      145025   145172     +147     
  Branches    16213    16225      +12     
==========================================
+ Hits       105314   105529     +215     
+ Misses      35294    35227      -67     
+ Partials     4417     4416       -1
Flag Coverage Δ
#Debug 72.69% <ø> (+0.07%) ⬆️
#production 68.22% <ø> (+0.06%) ⬆️
#test 88.97% <ø> (+0.05%) ⬆️
Impacted Files Coverage Δ
...classClassification/MulticlassNaiveBayesTrainer.cs 87.17% <ø> (ø) ⬆️
...oft.ML.StandardTrainers/StandardTrainersCatalog.cs 92.34% <ø> (ø) ⬆️
...c/Microsoft.ML.FastTree/Utils/ThreadTaskManager.cs 79.48% <0%> (-20.52%) ⬇️
src/Microsoft.ML.Maml/MAML.cs 24.75% <0%> (-1.46%) ⬇️
...oft.ML.Transforms/Text/TextFeaturizingEstimator.cs 90.57% <0%> (-1.41%) ⬇️
...soft.ML.Data/DataLoadSave/Text/TextLoaderCursor.cs 84.7% <0%> (-0.21%) ⬇️
...osoft.ML.Tests/Transformers/TextFeaturizerTests.cs 99.58% <0%> (-0.2%) ⬇️
...StandardTrainers/Standard/Simple/SimpleTrainers.cs 77.61% <0%> (-0.17%) ⬇️
src/Microsoft.ML.Recommender/RecommenderCatalog.cs 70.83% <0%> (ø) ⬆️
...dardTrainers/Standard/Online/AveragedPerceptron.cs 89.7% <0%> (ø) ⬆️
... and 29 more

@ganik ganik changed the title Add NaiveBayes sample [WIP] Add NaiveBayes sample Apr 9, 2019
Console.WriteLine($"Label: {p.Label}, Prediction: {p.PredictedLabel}");

// Expected output:
// Label: 1, Prediction: 2
@Ivanidzo4ka (Contributor) commented Apr 9, 2019


It basically assigns one class to every prediction, which looks like a bug to me.
Have you created an issue about that?
I'm not sure it's worth having a sample that shows a broken learner. #Resolved

@ganik (Member, Author) commented Apr 9, 2019

Yes, this is a repro for issue #3226, and @codemzs is looking into it. #Resolved

@codemzs (Member) commented Apr 9, 2019

This is not a bug. In our implementation, Naive Bayes treats features as binary; that is how features are binned. In this sample pipeline all of your feature values are greater than or equal to zero, which means the feature histogram for each class will be the same size, hence the behavior you are seeing. Please modify your code so that feature values take either negative or positive values.

When we implemented Naive Bayes we considered the case of features taking continuous values; handling that would require implementing a Gaussian distribution to bin the features, but it wasn't a requirement at the time.

CC: @glebuk @TomFinley @justinormont #Resolved
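The binning rule described above can be sketched outside ML.NET. Below is a hypothetical Python toy (not the ML.NET implementation): a Bernoulli-style Naive Bayes that binarizes each feature as `value > 0`. When every feature value is non-negative, every binarized vector is identical, the per-class histograms carry no signal, and every input collapses to a single predicted class — the behavior reported in #3226.

```python
import math

def binarize(x):
    # Mirror of the rule described above: a feature counts as "true" iff > 0
    return [1 if v > 0 else 0 for v in x]

def train(data):
    """data: list of (features, label). Count binarized feature hits per class."""
    counts, totals = {}, {}
    for x, y in data:
        totals[y] = totals.get(y, 0) + 1
        c = counts.setdefault(y, [0] * len(x))
        for i, v in enumerate(binarize(x)):
            c[i] += v
    return counts, totals

def predict(counts, totals, x):
    n = sum(totals.values())
    best, best_lp = None, float("-inf")
    for y, c in counts.items():
        lp = math.log(totals[y] / n)  # log class prior
        for i, v in enumerate(binarize(x)):
            p1 = (c[i] + 1) / (totals[y] + 2)  # Laplace-smoothed P(f_i = 1 | y)
            lp += math.log(p1 if v else 1.0 - p1)
        if lp > best_lp:
            best, best_lp = y, lp
    return best
```

With all-positive features every class ends up with the same smoothed likelihoods, so the argmax degenerates to one class for every input; once features can also be negative, the per-class histograms differ and the predictions separate.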

@justinormont (Contributor) commented Apr 10, 2019

Do we have samples where Naive Bayes works well? #Resolved

@ganik (Member, Author) replied

This is such a sample.


In reply to: 274157820

@ganik ganik changed the title [WIP] Add NaiveBayes sample Add NaiveBayes sample Apr 10, 2019
@ganik ganik changed the title Add NaiveBayes sample Add NaiveBayes sample & docs Apr 10, 2019
// Label: 2, Prediction: 2
// Label: 3, Prediction: 3
// Label: 2, Prediction: 2
// Label: 3, Prediction: 3
@codemzs (Member) commented Apr 11, 2019

NICE! #Resolved

@@ -75,7 +75,7 @@ namespace Samples.Dynamic.Trainers.MulticlassClassification
 private static IEnumerable<DataPoint> GenerateRandomDataPoints(int count, int seed=0)
 {
     var random = new Random(seed);
-    float randomFloat() => (float)random.NextDouble();
+    float randomFloat() => (float)(random.NextDouble() - 0.5);
@codemzs (Member) commented Apr 11, 2019

float randomFloat() => (float)(random.NextDouble() - 0.5);

This is great, but did you regenerate all the .tt-generated files by running the custom tool, to make sure the samples are not broken? #Resolved

@ganik (Member, Author) replied

Yes, only 3 .tt files depend on this one; they are regenerated.


In reply to: 274642625

@codemzs (Member) left a comment

:shipit:

@natke (Contributor) left a comment

Were there some extra assumptions that we were going to explicitly document for this trainer?

@@ -75,7 +75,7 @@ namespace Samples.Dynamic.Trainers.MulticlassClassification
 private static IEnumerable<DataPoint> GenerateRandomDataPoints(int count, int seed=0)
 {
     var random = new Random(seed);
-    float randomFloat() => (float)random.NextDouble();
+    float randomFloat() => (float)(random.NextDouble() - 0.5);
@natke (Contributor) commented Apr 11, 2019

Why do we do this? #Resolved

@codemzs (Member) commented Apr 11, 2019

@natke It's to make sure feature values are evenly distributed between -0.5 and +0.5, which gives an even mix of positive and negative feature values. Naive Bayes considers two types of feature values: 1) greater than zero, and 2) less than or equal to zero, and you want a sample with both kinds of values to get a sensible prediction. I believe @ganik has talked about it briefly in the doc that he has attached here. #Resolved
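The effect of the `- 0.5` shift is easy to check numerically. A Python sketch (mirroring the idea of the C# helper, not the helper itself): shifting uniform values from [0, 1) to [-0.5, 0.5) turns an all-"true" feature stream into a roughly even mix of positive and non-positive values.

```python
import random

random.seed(0)  # deterministic for the checks below

# Unshifted helper: every value lands in [0, 1), so every feature
# binarizes to "true" under the "greater than zero" rule.
unshifted = [random.random() for _ in range(1000)]

# Shifted helper (the fix in this PR's sample): values land in [-0.5, 0.5),
# so roughly half the features binarize to "true" and half to "false".
shifted = [random.random() - 0.5 for _ in range(1000)]

print(all(v > 0 for v in unshifted))       # no negative feature values at all
print(sum(v > 0 for v in shifted) / 1000)  # roughly 0.5: an even split of signs
```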

@natke (Contributor) commented Apr 11, 2019

Ok, great. Is it worth adding a comment to the code? Also, which doc? #ByDesign

@codemzs (Member) commented Apr 11, 2019

@natke I believe he already has, here, and this should show up in the docs, right? #Resolved

@natke (Contributor) commented Apr 11, 2019

Ok, so yes, it is spelled out in the trainer code comments. I wonder if we should add a comment to this sample code too, to be absolutely clear. #Resolved

@ganik (Member, Author) replied

I can't add it to the sample code, since this code is shared (generated from a shared .tt file) by 3 other trainers that don't have this NaiveBayes problem.


In reply to: 274708597

@ganik (Member, Author) replied

So we have random values in the -0.5 to 0.5 range; some trainers like NaiveBayes need that, and others like OVA don't, but will be fine with it.


In reply to: 274651105

@ganik (Member, Author) replied

Actually, I think I know how to do it; I'll send the next iteration.


In reply to: 275193160

using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;
using Microsoft.ML.SamplesUtils;
@shmoradims commented Apr 12, 2019

Let's remove this as per the checklist. #Resolved

@shmoradims commented Apr 12, 2019

using Microsoft.ML.SamplesUtils;

ditto #Resolved


Refers to: docs/samples/Microsoft.ML.Samples/Dynamic/Trainers/MulticlassClassification/OneVersusAll.cs:6 in 8831b0f.

@shmoradims commented Apr 12, 2019

public static class OneVersusAll

this one doesn't have a .tt file? #Resolved


Refers to: docs/samples/Microsoft.ML.Samples/Dynamic/Trainers/MulticlassClassification/OneVersusAll.cs:10 in 8831b0f.

/// in a class even though they may be dependent on each other. It is a multi-class trainer that accepts
/// binary feature values of type float, i.e., feature values are either true or false, specifically a
/// feature value greater than zero is treated as true.
/// </summary>
@shmoradims commented Apr 12, 2019

Info: this is good for the first pass of the docs. Please leave the second pass empty, so that we can improve it next week. #Resolved

@ganik (Member, Author) commented Apr 15, 2019

public static class OneVersusAll

It does.


In reply to: 482643079


Refers to: docs/samples/Microsoft.ML.Samples/Dynamic/Trainers/MulticlassClassification/OneVersusAll.cs:10 in 8831b0f.

@@ -3,7 +3,7 @@
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;
using Microsoft.ML.SamplesUtils;


namespace Samples.Dynamic.Trainers.MulticlassClassification
@shmoradims commented Apr 15, 2019

extra line? #Resolved

/// <example>
/// <format type="text/markdown">
/// <![CDATA[
/// [!code-csharp[SDCA](~/../docs/samples/docs/samples/Microsoft.ML.Samples/Dynamic/Trainers/MulticlassClassification/NaiveBayes.cs)]
@shmoradims commented Apr 15, 2019

SDCA

rename #Resolved

@shmoradims left a comment

:shipit:

@ganik ganik merged commit 66ff419 into dotnet:master Apr 16, 2019
@ghost ghost locked as resolved and limited conversation to collaborators Mar 22, 2022