Skip to content

Docs & samples for SDCA-based trainers #2771

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
Mar 4, 2019

Conversation

shmoradims
Copy link

Docs & samples for SDCA binary, multi-class, and regression.
Related to #2522

var mlContext = new MLContext(seed: 0);

// Create in-memory examples as C# native class.
var examples = DatasetUtils.GenerateRandomMulticlassClassificationExamples(1000);
Copy link
Member

@wschin wschin Feb 28, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, GenerateRandomMulticlassClassificationExamples is not searchable on the doc site, so the only way to fully learn this pipeline is to clone ML.NET. Because SDCA can work with very tiny data set, we could add things like this

        private class DataPoint
        {
            [VectorType(3)]
            public float[] Features;
        }
        var samples = new List<DataPoint>() 
        { 
             new DataPoint(){ Features= new float[3] {1, 0, 0} }, 
             new DataPoint(){ Features= new float[3] {0, 2, 1} }, 
             new DataPoint(){ Features= new float[3] {1, 2, 3} }, 
             new DataPoint(){ Features= new float[3] {0, 1, 0} }, 
             new DataPoint(){ Features= new float[3] {0, 2, 1} },
             new DataPoint(){ Features= new float[3] {-100, 50, -100} } 
         };

into this file and use them. #Resolved

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the type is visible if they use the example, and they can inspect the values with the debugger, BUT moving Featurization into the Samples Utils is a real problem..


In reply to: 261013871 [](ancestors = 261013871)

Copy link
Member

@wschin wschin Feb 28, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, we can't expect user will have visual studio, on for example, Linux. I'd say the best case is that user knows everything they need after reading this example. Please take a look at an scikit-learn example.

X = [[0], [1], [2], [3]]
Y = [0, 1, 2, 3]
clf = svm.SVC(gamma='scale', decision_function_shape='ovo')
clf.fit(X, Y)

Does scikit-learn ask user to go outside the text above to understand that example? In addition, those functions are not searchable on ML.NET doc site, which means a big hole to new users. Honestly, I am not sure if SamplesUtils should be used because it hides some vital information and therefore pushes our examples away from those scikit-learn ones (in terms of readibility). #Pending

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see my latest comment on #2627. Removing SamplesUtils is a design decision that needs to be made first.

Please see also see my response in another comment where I have the doc links. All of DatasetUtils are searchable:
https://docs.microsoft.com/en-us/dotnet/api/?view=ml-dotnet&term=Microsoft.ML.SamplesUtils.DatasetUtils


In reply to: 261068842 [](ancestors = 261068842)

Copy link
Member

@wschin wschin Mar 1, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My bad. Even if it's searchable, it still has no meaningful document at this page. #Resolved

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our documentation coverage is low but we're actively working on it, hence this PR. So that page will become meaningful eventually.


In reply to: 261718177 [](ancestors = 261718177)

Copy link
Member

@wschin wschin Mar 1, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please do not put things separated if they are considered a whole example. The organization of the entire documentation will never be organized and learned in a structured way--- this is how computer science world works. Let me give you another example. How would user learn the definition of a vector column by adding VectorType attribute? Assume that he already finds the doc of GenerateRandomMulticlass. He still need to click on the returned type of GenerateRandomMulticlass, which is List<DatasetUtils.MulticlassClassificationExample>. Then, another page will be opened. Where is the vector attribute of my Features? User needs to click on Fields again to open the 3rd page which contains

[Microsoft.ML.Data.VectorType(new System.Int32[] { 10 })]
public float[] Features;

Hiding things in this hierarchical way is definitely a learning barrier. #WontFix

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As pointed by Shaauheen, let's keep the samples as is for V1. Post V1, we can address this by proper discussions that were canceled in favor of API work. For now, some sample for V1, is better than no sample.


In reply to: 261763964 [](ancestors = 261763964)

@@ -5,7 +5,7 @@

namespace Microsoft.ML.Samples.Dynamic.Trainers.BinaryClassification
Copy link
Member

@sfilipi sfilipi Feb 28, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Microsoft.ML.Samples.Dynamic [](start = 10, length = 28)

let's keep the namespace for all samples Microsoft.ML.Samples.Dynamic. The nesting is unecessary #Resolved

Copy link
Author

@shmoradims shmoradims Feb 28, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've explained the reasoning here:
#2729 (comment)

btw, we don't need namespace nesting for transforms. only for trainers, because of naming conflicts.


In reply to: 261050465 [](ancestors = 261050465)

@@ -17,7 +17,7 @@ public static void Example()
// Create in-memory examples as C# native class.
var examples = DatasetUtils.GenerateRandomMulticlassClassificationExamples(1000);

// Convert native C# class to IDataView, a consumble format to ML.NET functions.
// Convert native C# class to IDataView, a consumable format to ML.NET functions.
Copy link
Member

@sfilipi sfilipi Feb 28, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

native C# class [](start = 23, length = 15)

please spell out the full type: List #Resolved

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added more specificity, without spelling out the full type.

If we want full-types, we should spell them out instead of using var. Because the compiler will make sure it's always correct and catch the changes. Having it in the comments will go stale in no time.

So far we've been using var in the samples, so I'll keep it for consistency.


In reply to: 261050690 [](ancestors = 261050690)

@sfilipi
Copy link
Member

sfilipi commented Feb 28, 2019

        // Create in-memory examples as C# native class.

if you can fix this on this file too.. #Resolved


Refers to: docs/samples/Microsoft.ML.Samples/Dynamic/Trainers/MulticlassClassification/LightGbmWithOptions.cs:19 in c5e034b. [](commit_id = c5e034b, deletion_comment = False)

The following text describes the SDCA algorithm details.
It's used for the remarks section of all SDCA-based trainers (binary, multiclass, regression)
-->
<member name="SDCA_remarks">
Copy link
Member

@sfilipi sfilipi Feb 28, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[](start = 3, length = 29)

so shall we keep the docs.xml for reuse, than? #Resolved

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we can reduce the usage to 1 by using cref links, we don't need doc.xml. We should use inline documentation.

I this case we have 4 SDCA trainers that all need to explain what SDCA is and we cannot cref them to each other. So I'm keeping doc.xml for it.


In reply to: 261051276 [](ancestors = 261051276)

@codecov
Copy link

codecov bot commented Feb 28, 2019

Codecov Report

Merging #2771 into master will decrease coverage by <.01%.
The diff coverage is 0%.

@@            Coverage Diff             @@
##           master    #2771      +/-   ##
==========================================
- Coverage   71.66%   71.65%   -0.01%     
==========================================
  Files         809      809              
  Lines      142378   142383       +5     
  Branches    16119    16119              
==========================================
- Hits       102030   102027       -3     
- Misses      35915    35922       +7     
- Partials     4433     4434       +1
Flag Coverage Δ
#Debug 71.65% <0%> (-0.01%) ⬇️
#production 67.9% <0%> (-0.01%) ⬇️
#test 85.84% <ø> (ø) ⬆️
Impacted Files Coverage Δ
...oft.ML.StandardLearners/StandardLearnersCatalog.cs 88.57% <ø> (ø) ⬆️
...crosoft.ML.StandardLearners/Standard/SdcaBinary.cs 72.11% <ø> (ø) ⬆️
...oft.ML.StandardLearners/Standard/SdcaRegression.cs 96.05% <ø> (ø) ⬆️
...oft.ML.StandardLearners/Standard/SdcaMultiClass.cs 92.22% <ø> (ø) ⬆️
src/Microsoft.ML.SamplesUtils/ConsoleUtils.cs 0% <0%> (ø) ⬆️
src/Microsoft.ML.Maml/MAML.cs 24.75% <0%> (-1.46%) ⬇️

@shmoradims
Copy link
Author

        // Create in-memory examples as C# native class.

please see my response on the other comments.


In reply to: 468143030 [](ancestors = 468143030)


Refers to: docs/samples/Microsoft.ML.Samples/Dynamic/Trainers/MulticlassClassification/LightGbmWithOptions.cs:19 in c5e034b. [](commit_id = c5e034b, deletion_comment = False)

Copy link
Member

@sfilipi sfilipi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@shmoradims shmoradims merged commit 09f40d0 into dotnet:master Mar 4, 2019
@shauheen shauheen added this to the 0319 milestone Mar 4, 2019
@shmoradims shmoradims deleted the sdca_trainers_docs2 branch March 11, 2019 20:51
@ghost ghost locked as resolved and limited conversation to collaborators Mar 24, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants