-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Docs 2nd pass for NaiveBayes, KMeans, OVA, Pairwise and OnnxTransformer #3387
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@@ -26,7 +26,18 @@ | |||
|
|||
namespace Microsoft.ML.Trainers | |||
{ | |||
/// <include file='./doc.xml' path='doc/members/member[@name="KMeans++"]/*' /> | |||
/// <summary> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think scoring function to the trained modle should be explained? #WontFix
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you mean Predict function? the one that predicts which cluster data point goes to?
In reply to: 276525272 [](ancestors = 276525272)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or do u mean how distance is calculated when algo converges to centroids?
In reply to: 277107516 [](ancestors = 277107516,276525272)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In either case I dont think so. For former its not specific to KMeans++. For latter first how am I supposed to do that? Should I deduce this from the code? Secondly, I dont think this information is important to know for .NET devs. All they need to know that this is an improved version of KMeans and high level details of improvements that are described in wiki links
In reply to: 277107690 [](ancestors = 277107690,277107516,276525272)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's what I did. Debugging ML model is debugging math equation. Without them, how can user improve their models? We keep saying that .NET devs can ignore details. However, this doesn't look super true in ML.NET's gitter channel.
Anyway, I am not blocking your PR. Just want to pass some of my observations. :) #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, once we have more time we could prioritize such tasks to the top, currently this is lower pri to the rest of things that need to be done.
In reply to: 277148107 [](ancestors = 277148107)
/// <a href ='https://en.wikipedia.org/wiki/Naive_Bayes_classifier'>Naive Bayes</a> | ||
/// is a probabilistic classifier that can be used for multiclass problems. | ||
/// Using Bayes' theorem, the conditional probability for a sample belonging to a class | ||
/// can be calculated based on the sample count for each feature combination groups. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd recommend to add math equation to describe the scoring funciton. It's not easy to figure out how this modle is doing prediction given only text description. #ByDesign
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ll need to read a K-Means++ paper to figure out that. I dont think users do care about exact formula. What they would care is that it is an improved version of K-Means and can be used as such (where K-Means is used)
In reply to: 276525896 [](ancestors = 276525896)
Do we not need a table in documentation like other xml PRs? Plus you also need to specify uid. Best to check mine or Wei-Sheng’s PRs for reference. #Resolved |
/// <summary> | ||
/// K-means is a popular clustering algorithm. With K-means, the data is clustered into a specified | ||
/// number of clusters in order to minimize the within-cluster sum of squares. | ||
/// K-means++ improves upon K-means by using the <a href='https://research.microsoft.com/apps/pubs/default.aspx?id=252149'>Yinyang K-Means</a> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we say "this algorithm" or "this implementation" instead of K-means++? Is K-means++ meaningful to a user? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think K-means++ brings good message about the algo: that its a improved variation of KMeans. I would rather leave it as it is
In reply to: 276843341 [](ancestors = 276843341)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, thats's fine, as long as it clear that this trainer is the K-means++ trainer :-) #Resolved
/// number of clusters in order to minimize the within-cluster sum of squares. | ||
/// K-means++ improves upon K-means by using the <a href='https://research.microsoft.com/apps/pubs/default.aspx?id=252149'>Yinyang K-Means</a> | ||
/// method for choosing the initial cluster centers. | ||
/// YYK-Means accelerates K-Means up to an order of magnitude while producing exactly the same clustering results(modulo floating point precision issues). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same question: is YYK-Means meaningful to a user? Do they configure it here, or it is a fixed property of this trainer? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is abbreviation from "Yinyang K-Means". I ll change it to K-Means++
In reply to: 276843606 [](ancestors = 276843606)
/// they may be dependent on each other. | ||
/// This multi-class trainer accepts binary feature values of type float, i.e., | ||
/// feature values are either true or false. | ||
/// Specifically a feature value greater than zero is treated as true. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After much discussion I'm happy to leave this as is for now and work on some good examples either for the samples or the conceptual docs. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// can be different from LightGbmClassifier, which develops a multi-class classifier directly. | ||
/// Note that even if the classifier indicates that it does not need caching, OneVersusAll will always | ||
/// request caching, as it will be performing multiple passes over the data set. | ||
/// This learner will request normalization from the data pipeline if the classifier indicates it would benefit from it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
learner -> trainer or algorithm? #Resolved
/// they may be dependent on each other. | ||
/// This multi-class trainer accepts binary feature values of type float, i.e., | ||
/// feature values are either true or false. | ||
/// Specifically a feature value greater than zero is treated as true. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After much discussion I'm happy to leave this as is for now and work on some good examples either for the samples or the conceptual docs. #Resolved
/// <summary> | ||
/// K-means is a popular clustering algorithm. With K-means, the data is clustered into a specified | ||
/// number of clusters in order to minimize the within-cluster sum of squares. | ||
/// K-means++ improves upon K-means by using the <a href='https://research.microsoft.com/apps/pubs/default.aspx?id=252149'>Yinyang K-Means</a> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, thats's fine, as long as it clear that this trainer is the K-means++ trainer :-) #Resolved
Codecov Report
@@ Coverage Diff @@
## master #3387 +/- ##
==========================================
+ Coverage 72.69% 72.76% +0.06%
==========================================
Files 807 808 +1
Lines 145172 145452 +280
Branches 16225 16244 +19
==========================================
+ Hits 105533 105835 +302
+ Misses 35223 35196 -27
- Partials 4416 4421 +5
|
/// | | | | ||
/// | -- | -- | | ||
/// | Machine learning task | Clustering | | ||
/// | Is normalization required? | Yes | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes [](start = 39, length = 3)
Are you sure? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep. By default it uses default TrainerInfo() which ahs this default settings
In reply to: 277113743 [](ancestors = 277113743)
/// | Machine learning task | Clustering | | ||
/// | Is normalization required? | Yes | | ||
/// | Is caching required? | Yes | | ||
/// | Required NuGet in addition to Microsoft.ML | None | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
None [](start = 55, length = 4)
please check this. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// K-Means++ observes that there is a lot of redundancy across iterations in the KMeans algorithms and most points do not change their clusters during an iteration. | ||
/// It uses various bounding techniques to identify this redundancy and eliminate many distance computations and optimize centroid computations. | ||
/// For more information on K-means, and K-means++ see: | ||
/// <a href='https://en.wikipedia.org/wiki/K-means_clustering'>K-means</a> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// It assumes independence among the presence of features in a class even though | ||
/// they may be dependent on each other. | ||
/// This multi-class trainer accepts binary feature values of type float, i.e., | ||
/// feature values are either true or false. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dude this is not true, it can accept whatever. It just treats features as greater than zero and less than zero...you should explain this more. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// <remarks> | ||
/// <format type="text/markdown"><. | ||
/// ### Trainer Characteristics |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// [](start = 3, length = 5)
input/output table is missing #Resolved
part of #2522