-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Add NaiveBayes sample & docs #3246
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## master #3246 +/- ##
==========================================
+ Coverage 72.61% 72.62% +<.01%
==========================================
Files 804 807 +3
Lines 145025 145080 +55
Branches 16213 16213
==========================================
+ Hits 105314 105366 +52
- Misses 35294 35297 +3
Partials 4417 4417
|
Codecov Report
@@ Coverage Diff @@
## master #3246 +/- ##
==========================================
+ Coverage 72.61% 72.69% +0.07%
==========================================
Files 804 807 +3
Lines 145025 145172 +147
Branches 16213 16225 +12
==========================================
+ Hits 105314 105529 +215
+ Misses 35294 35227 -67
+ Partials 4417 4416 -1
|
Console.WriteLine($"Label: {p.Label}, Prediction: {p.PredictedLabel}"); | ||
|
||
// Expected output: | ||
// Label: 1, Prediction: 2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2 [](start = 39, length = 1)
it basically assigns one class for prediction which looks like bug for me.
Have you create issue about that?
Not sure it's worth having sample which showing broken learner. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not a bug. Naive Bayes considers features to be binary in our implementation, that is how features are binned. In this sample pipeline all your features are greater than equal to zero that means the feature histogram for each class will be of the same size hence you are seeing this behavior. Please modify your code to have feature values take either negative or positive values.
When we were implementing Naive Bayes we thought about this case of features taking continuous values and for that we would need to implement Gaussian distribution to bin the features. However it wasn't a requirement at the time.
CC: @glebuk @TomFinley @justinormont #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have samples where Naive Bayes works well? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Label: 2, Prediction: 2 | ||
// Label: 3, Prediction: 3 | ||
// Label: 2, Prediction: 2 | ||
// Label: 3, Prediction: 3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NICE! #Resolved
@@ -75,7 +75,7 @@ namespace Samples.Dynamic.Trainers.MulticlassClassification | |||
private static IEnumerable<DataPoint> GenerateRandomDataPoints(int count, int seed=0) | |||
{ | |||
var random = new Random(seed); | |||
float randomFloat() => (float)random.NextDouble(); | |||
float randomFloat() => (float)(random.NextDouble() - 0.5); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
float randomFloat() => (float)(random.NextDouble() - 0.5); [](start = 12, length = 58)
This is great, but did you re-generate all the TT by running custom tool to make sure the samples are not broken? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, only 3 tt depend on this one, they are regenerated
In reply to: 274642625 [](ancestors = 274642625)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Were there some extra assumptions that we were going to explicitly document for this trainer?
@@ -75,7 +75,7 @@ namespace Samples.Dynamic.Trainers.MulticlassClassification | |||
private static IEnumerable<DataPoint> GenerateRandomDataPoints(int count, int seed=0) | |||
{ | |||
var random = new Random(seed); | |||
float randomFloat() => (float)random.NextDouble(); | |||
float randomFloat() => (float)(random.NextDouble() - 0.5); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we do this? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@natke Its to make sure feature values are evenly distributed between -0.5 and +0.5. This gives us even number of positive and negative examples. Naive Bayes considers two types of feature values 1) greater than zero and 2) less than equal to zero and you want to have a sample with both those feature values to have sensible prediction. I believe @ganik has talked about it briefly in the doc that he has attached here. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, great. Is it worth adding a comment to the code? Also, which doc? #ByDesign
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, so yes it is spelt out in the trainer code comments. I wonder if we should add a comment to this sample code too, to be absolutely clear. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I cant add it to the sample code since this code is shared (generated from .tt which is shared) by 3 other trainers that don't have this NaiveBayes problem
In reply to: 274708597 [](ancestors = 274708597)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so we have random values from -.5 to .5 range, some trainers like NB need that, others like OVA dont but will be ok with that
In reply to: 274651105 [](ancestors = 274651105)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually I think I know how to do it, I ll send next iteration
In reply to: 275193160 [](ancestors = 275193160,274708597)
using System.Linq; | ||
using Microsoft.ML; | ||
using Microsoft.ML.Data; | ||
using Microsoft.ML.SamplesUtils; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's remove this as per the checklist #Resolved
/// in a class even though they may be dependent on each other. It is a multi-class trainer that accepts | ||
/// binary feature values of type float, i.e., feature values are either true or false, specifically a | ||
/// feature value greater than zero is treated as true. | ||
/// </summary> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
info: this is good for the 1st pass for docs. please leave the 2nd pass empty, so that we improve this next week. #Resolved
@@ -3,7 +3,7 @@ | |||
using System.Linq; | |||
using Microsoft.ML; | |||
using Microsoft.ML.Data; | |||
using Microsoft.ML.SamplesUtils; | |||
|
|||
|
|||
namespace Samples.Dynamic.Trainers.MulticlassClassification |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
extra line? #Resolved
/// <example> | ||
/// <format type="text/markdown"> | ||
/// <] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SDCA [](start = 27, length = 4)
rename #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
repros #3226