Docs 2nd pass for NaiveBayes, KMeans, OVA, Pairwise and OnnxTransformer #3387


Merged

ganik merged 13 commits into dotnet:master on Apr 21, 2019

Conversation

ganik
Member

@ganik ganik commented Apr 17, 2019

part of #2522

@ganik ganik requested review from shmoradims, artidoro and natke April 17, 2019 23:18
@@ -26,7 +26,18 @@

namespace Microsoft.ML.Trainers
{
/// <include file='./doc.xml' path='doc/members/member[@name="KMeans++"]/*' />
/// <summary>
Member

@wschin wschin Apr 18, 2019

Do you think the scoring function of the trained model should be explained? #WontFix

Member Author

You mean the Predict function? The one that predicts which cluster a data point goes to?


In reply to: 276525272 [](ancestors = 276525272)

Member Author

Or do you mean how the distance is calculated when the algorithm converges to the centroids?


In reply to: 277107516 [](ancestors = 277107516,276525272)

Member Author

@ganik ganik Apr 19, 2019

In either case I don't think so. For the former, it's not specific to KMeans++. For the latter, first, how am I supposed to do that? Should I deduce it from the code? Secondly, I don't think this information is important for .NET devs to know. All they need to know is that this is an improved version of KMeans, along with the high-level details of the improvements described in the wiki links.


In reply to: 277107690 [](ancestors = 277107690,277107516,276525272)

Member

@wschin wschin Apr 20, 2019

That's what I did. Debugging an ML model is debugging math equations. Without them, how can users improve their models? We keep saying that .NET devs can ignore the details; however, that doesn't look entirely true in ML.NET's Gitter channel.

Anyway, I am not blocking your PR. Just wanted to pass along some of my observations. :) #Resolved

Member Author

@ganik ganik Apr 21, 2019

Sure, once we have more time we could prioritize such tasks to the top; currently this is lower priority than the rest of the things that need to be done.


In reply to: 277148107 [](ancestors = 277148107)

/// <a href ='https://en.wikipedia.org/wiki/Naive_Bayes_classifier'>Naive Bayes</a>
/// is a probabilistic classifier that can be used for multiclass problems.
/// Using Bayes' theorem, the conditional probability for a sample belonging to a class
/// can be calculated based on the sample counts for each feature combination group.
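For readers who want the scoring rule that the review below asks about, the textbook naive Bayes decision rule is the following (a general sketch; it is not quoted from the ML.NET source):

$$\hat{y} = \arg\max_{c} \; P(c)\,\prod_{i=1}^{n} P(x_i \mid c)$$

where the class prior $P(c)$ and the per-feature likelihoods $P(x_i \mid c)$ are estimated from training-set counts, and the feature-independence assumption is what allows the product form.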
Member

@wschin wschin Apr 18, 2019

I'd recommend adding a math equation to describe the scoring function. It's not easy to figure out how this model makes predictions given only a text description. #ByDesign

Member Author

I'll need to read a K-Means++ paper to figure that out. I don't think users care about the exact formula. What they would care about is that it is an improved version of K-Means and can be used as such (wherever K-Means is used).


In reply to: 276525896 [](ancestors = 276525896)

@codemzs
Member

codemzs commented Apr 18, 2019

Do we not need a table in the documentation like the other XML PRs? Plus you also need to specify the uid. Best to check mine or Wei-Sheng's PRs for reference. #Resolved

@ganik ganik added the documentation Related to documentation of ML.NET label Apr 18, 2019
/// <summary>
/// K-means is a popular clustering algorithm. With K-means, the data is clustered into a specified
/// number of clusters in order to minimize the within-cluster sum of squares.
/// K-means++ improves upon K-means by using the <a href='https://research.microsoft.com/apps/pubs/default.aspx?id=252149'>Yinyang K-Means</a>
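As a usage illustration only (not part of this PR's diff), the snippet below sketches how the K-means++ trainer might be invoked from ML.NET; the Point row type and the sample data are hypothetical, made up for this example.

using System.Collections.Generic;
using Microsoft.ML;
using Microsoft.ML.Data;

public static class KMeansSketch
{
    // Hypothetical input row type: a 3-dimensional point.
    public class Point
    {
        [VectorType(3)]
        public float[] Features { get; set; }
    }

    public static void Run()
    {
        var mlContext = new MLContext(seed: 0);
        var samples = new List<Point>
        {
            new Point { Features = new float[] { 0f, 0f, 0f } },
            new Point { Features = new float[] { 10f, 10f, 10f } },
            new Point { Features = new float[] { 0.1f, 0.2f, 0.1f } },
        };
        IDataView data = mlContext.Data.LoadFromEnumerable(samples);

        // Train the K-means++ trainer on the Features column, asking for 2 clusters.
        var trainer = mlContext.Clustering.Trainers.KMeans(featureColumnName: "Features", numberOfClusters: 2);
        var model = trainer.Fit(data);

        // Scoring adds PredictedLabel (the assigned cluster) and Score (distances to each centroid).
        IDataView clustered = model.Transform(data);
    }
}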
Contributor

@natke natke Apr 18, 2019

Should we say "this algorithm" or "this implementation" instead of K-means++? Is K-means++ meaningful to a user? #Resolved

Member Author

I think K-means++ conveys a good message about the algorithm: that it's an improved variation of KMeans. I would rather leave it as it is.


In reply to: 276843341 [](ancestors = 276843341)

Contributor

@natke natke Apr 19, 2019

Sure, that's fine, as long as it is clear that this trainer is the K-means++ trainer :-) #Resolved

/// number of clusters in order to minimize the within-cluster sum of squares.
/// K-means++ improves upon K-means by using the <a href='https://research.microsoft.com/apps/pubs/default.aspx?id=252149'>Yinyang K-Means</a>
/// method for choosing the initial cluster centers.
/// YYK-Means accelerates K-Means by up to an order of magnitude while producing exactly the same clustering results (modulo floating-point precision issues).
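For reference, the within-cluster sum of squares mentioned above is the standard K-means objective, minimized over the cluster assignments $C_1, \dots, C_k$ with centroids $\mu_j$:

$$\min_{C_1,\dots,C_k} \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - \mu_j \rVert^2$$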
Contributor

@natke natke Apr 18, 2019

Same question: is YYK-Means meaningful to a user? Do they configure it here, or it is a fixed property of this trainer? #Resolved

Member Author

This is an abbreviation of "Yinyang K-Means". I'll change it to K-Means++.


In reply to: 276843606 [](ancestors = 276843606)

/// they may be dependent on each other.
/// This multi-class trainer accepts binary feature values of type float, i.e.,
/// feature values are either true or false.
/// Specifically a feature value greater than zero is treated as true.
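As an illustrative sketch only (not taken from the PR), the following shows how this trainer might be wired up; the Row type and data are hypothetical, and the point to note is that the trainer only checks whether each feature value is greater than zero.

using System.Collections.Generic;
using Microsoft.ML;
using Microsoft.ML.Data;

public static class NaiveBayesSketch
{
    // Hypothetical input row: a text label plus a float feature vector.
    // The trainer effectively binarizes each feature value (greater than zero => true).
    public class Row
    {
        public string Label { get; set; }

        [VectorType(3)]
        public float[] Features { get; set; }
    }

    public static void Run()
    {
        var mlContext = new MLContext(seed: 0);
        var rows = new List<Row>
        {
            new Row { Label = "A", Features = new float[] { 1f, 0f, 2.5f } },
            new Row { Label = "B", Features = new float[] { 0f, 3f, 0f } },
        };
        IDataView data = mlContext.Data.LoadFromEnumerable(rows);

        // Multiclass trainers expect a key-typed label, so map the string label first.
        var pipeline = mlContext.Transforms.Conversion.MapValueToKey("Label")
            .Append(mlContext.MulticlassClassification.Trainers.NaiveBayes(
                labelColumnName: "Label", featureColumnName: "Features"));

        var model = pipeline.Fit(data);
    }
}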
Contributor

@natke natke Apr 18, 2019

I think this needs more clarification. How and when would someone use this in practice? (let's work on some words here: @ganik and @codemzs) #Resolved

Contributor

@natke natke Apr 19, 2019

After much discussion I'm happy to leave this as is for now and work on some good examples either for the samples or the conceptual docs. #Resolved

Member Author

sounds good thx


In reply to: 277108028 [](ancestors = 277108028)

/// can be different from LightGbmClassifier, which develops a multi-class classifier directly.
/// Note that even if the classifier indicates that it does not need caching, OneVersusAll will always
/// request caching, as it will be performing multiple passes over the data set.
/// This learner will request normalization from the data pipeline if the classifier indicates it would benefit from it.
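As a hedged sketch (not part of the diff), OneVersusAll wraps any binary trainer; AveragedPerceptron below is chosen only for illustration, and the Label/Features column names are the ML.NET defaults.

using Microsoft.ML;

public static class OneVersusAllSketch
{
    // Build a multiclass estimator by training one binary classifier per class.
    public static IEstimator<ITransformer> Build(MLContext mlContext)
    {
        var binary = mlContext.BinaryClassification.Trainers.AveragedPerceptron();
        return mlContext.MulticlassClassification.Trainers.OneVersusAll(binary, labelColumnName: "Label");
    }
}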
Contributor

@natke natke Apr 18, 2019

learner -> trainer or algorithm? #Resolved



@codecov

codecov bot commented Apr 19, 2019

Codecov Report

Merging #3387 into master will increase coverage by 0.06%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #3387      +/-   ##
==========================================
+ Coverage   72.69%   72.76%   +0.06%     
==========================================
  Files         807      808       +1     
  Lines      145172   145452     +280     
  Branches    16225    16244      +19     
==========================================
+ Hits       105533   105835     +302     
+ Misses      35223    35196      -27     
- Partials     4416     4421       +5
Flag Coverage Δ
#Debug 72.76% <ø> (+0.06%) ⬆️
#production 68.27% <ø> (+0.04%) ⬆️
#test 89.04% <ø> (+0.06%) ⬆️
Impacted Files Coverage Δ
...osoft.ML.KMeansClustering/KMeansPlusPlusTrainer.cs 84.64% <ø> (ø) ⬆️
...classClassification/MulticlassNaiveBayesTrainer.cs 87.17% <ø> (ø) ⬆️
...ulticlassClassification/PairwiseCouplingTrainer.cs 90.07% <ø> (ø) ⬆️
src/Microsoft.ML.OnnxTransformer/OnnxTransform.cs 86.24% <ø> (ø) ⬆️
...rd/MulticlassClassification/OneVersusAllTrainer.cs 74.87% <ø> (ø) ⬆️
src/Microsoft.ML.OnnxTransformer/OnnxCatalog.cs 100% <ø> (ø) ⬆️
...c/Microsoft.ML.FastTree/Utils/ThreadTaskManager.cs 79.48% <0%> (-20.52%) ⬇️
src/Microsoft.ML.Data/Transforms/KeyToValue.cs 79.16% <0%> (-0.65%) ⬇️
...ML.Transforms/MutualInformationFeatureSelection.cs 78.58% <0%> (-0.33%) ⬇️
...ft.ML.Data/Evaluators/BinaryClassifierEvaluator.cs 77.18% <0%> (-0.01%) ⬇️
... and 75 more

/// | | |
/// | -- | -- |
/// | Machine learning task | Clustering |
/// | Is normalization required? | Yes |
Member

@codemzs codemzs Apr 19, 2019

Yes [](start = 39, length = 3)

Are you sure? #Resolved

Member Author

Yep. By default it uses the default TrainerInfo(), which has these default settings.


In reply to: 277113743 [](ancestors = 277113743)

/// | Machine learning task | Clustering |
/// | Is normalization required? | Yes |
/// | Is caching required? | Yes |
/// | Required NuGet in addition to Microsoft.ML | None |
Member

@codemzs codemzs Apr 19, 2019

None [](start = 55, length = 4)

please check this. #Resolved

Member Author

Yep, verified, it's true.


In reply to: 277113763 [](ancestors = 277113763)

/// K-Means++ observes that there is a lot of redundancy across iterations in the KMeans algorithm and most points do not change their clusters during an iteration.
/// It uses various bounding techniques to identify this redundancy and eliminate many distance computations and optimize centroid computations.
/// For more information on K-means, and K-means++ see:
/// <a href='https://en.wikipedia.org/wiki/K-means_clustering'>K-means</a>
Member

@codemzs codemzs Apr 19, 2019

K-means [](start = 8, length = 70)

We have been using text format for md stuff... #Resolved

Member Author

thx


In reply to: 277113813 [](ancestors = 277113813)

/// It assumes independence among the presence of features in a class even though
/// they may be dependent on each other.
/// This multi-class trainer accepts binary feature values of type float, i.e.,
/// feature values are either true or false.
Member

@codemzs codemzs Apr 19, 2019

Dude, this is not true, it can accept whatever. It just treats features as greater than zero or not greater than zero... you should explain this more. #Resolved

Member Author

thx done.


In reply to: 277114008 [](ancestors = 277114008)

Member

@codemzs codemzs left a comment

:shipit:

/// <remarks>
/// <format type="text/markdown"><![CDATA[
/// To create this trainer, use [KMeansTrainer](xref:Microsoft.ML.Trainers.KMeansTrainer).
/// ### Trainer Characteristics
@shmoradims shmoradims Apr 20, 2019

/// [](start = 3, length = 5)

input/output table is missing #Resolved

@ganik ganik changed the title Docs 2nd pass for NaiveBayes, KMeans, OVA, Pairwise [WIP] Docs 2nd pass for NaiveBayes, KMeans, OVA, Pairwise Apr 21, 2019
@ganik ganik changed the title [WIP] Docs 2nd pass for NaiveBayes, KMeans, OVA, Pairwise [WIP] Docs 2nd pass for NaiveBayes, KMeans, OVA, Pairwise and OnnxTransformer Apr 21, 2019
@ganik ganik changed the title [WIP] Docs 2nd pass for NaiveBayes, KMeans, OVA, Pairwise and OnnxTransformer Docs 2nd pass for NaiveBayes, KMeans, OVA, Pairwise and OnnxTransformer Apr 21, 2019
@ganik ganik merged commit 3389f1d into dotnet:master Apr 21, 2019
@ghost ghost locked as resolved and limited conversation to collaborators Mar 22, 2022