Docs & samples for SDCA-based trainers #2771
...L.Samples/Dynamic/Trainers/BinaryClassification/StochasticDualCoordinateAscentWithOptions.cs
var mlContext = new MLContext(seed: 0);

// Create in-memory examples as C# native class.
var examples = DatasetUtils.GenerateRandomMulticlassClassificationExamples(1000);
Unfortunately, GenerateRandomMulticlassClassificationExamples is not searchable on the doc site, so the only way to fully learn this pipeline is to clone ML.NET. Because SDCA can work with a very tiny data set, we could add something like this
private class DataPoint
{
[VectorType(3)]
public float[] Features;
}
var samples = new List<DataPoint>()
{
new DataPoint(){ Features= new float[3] {1, 0, 0} },
new DataPoint(){ Features= new float[3] {0, 2, 1} },
new DataPoint(){ Features= new float[3] {1, 2, 3} },
new DataPoint(){ Features= new float[3] {0, 1, 0} },
new DataPoint(){ Features= new float[3] {0, 2, 1} },
new DataPoint(){ Features= new float[3] {-100, 50, -100} }
};
into this file and use them. #Resolved
The type is visible if they use the example, and they can inspect the values with the debugger, but moving featurization into the SamplesUtils is a real problem.
In reply to: 261013871
No, we can't expect users to have Visual Studio, for example on Linux. I'd say the best case is that the user knows everything they need after reading this example. Please take a look at a scikit-learn example.
X = [[0], [1], [2], [3]]
Y = [0, 1, 2, 3]
clf = svm.SVC(gamma='scale', decision_function_shape='ovo')
clf.fit(X, Y)
Does scikit-learn ask the user to go outside the text above to understand that example? In addition, those functions are not searchable on the ML.NET doc site, which leaves a big hole for new users. Honestly, I am not sure SamplesUtils should be used, because it hides some vital information and therefore pushes our examples away from the scikit-learn ones in terms of readability. #Pending
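For reference, a fully self-contained version of the scikit-learn snippet quoted in this comment might look like the sketch below. The import line, the decision_function call, and the predict call are additions for illustration and are not part of the quoted text; the sketch assumes a scikit-learn version that accepts gamma='scale'.

```python
from sklearn import svm

# Four one-dimensional training points, one per class
# (copied from the snippet quoted in the comment).
X = [[0], [1], [2], [3]]
Y = [0, 1, 2, 3]

clf = svm.SVC(gamma='scale', decision_function_shape='ovo')
clf.fit(X, Y)

# One-vs-one builds one classifier per pair of classes,
# so 4 classes give 4 * 3 / 2 = 6 decision values per input row.
dec = clf.decision_function([[1]])
print(dec.shape)

# Added for illustration: predict returns one label per input row.
pred = clf.predict(X)
print(len(pred))
```

Everything a reader needs (the data, the model options, and the calls made on the model) is visible in one place, which is the property being asked of the ML.NET samples.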
The function and type are both searchable:
- https://docs.microsoft.com/en-us/dotnet/api/microsoft.ml.samplesutils.datasetutils.generaterandommulticlassclassificationexamples?view=ml-dotnet#Microsoft_ML_SamplesUtils_DatasetUtils_GenerateRandomMulticlassClassificationExamples_System_Int32_
- https://docs.microsoft.com/en-us/dotnet/api/microsoft.ml.samplesutils.datasetutils.multiclassclassificationexample?view=ml-dotnet
There is no featurization here, because we're using randomly generated data. The featurization issue arises when loading data from a file.
In reply to: 261051041
Please see my latest comment on #2627. Removing SamplesUtils is a design decision that needs to be made first.
Please also see my response in another comment where I have the doc links. Everything in DatasetUtils is searchable:
https://docs.microsoft.com/en-us/dotnet/api/?view=ml-dotnet&term=Microsoft.ML.SamplesUtils.DatasetUtils
In reply to: 261068842
My bad. Even if it's searchable, that page still has no meaningful documentation. #Resolved
Our documentation coverage is low but we're actively working on it, hence this PR. So that page will become meaningful eventually.
In reply to: 261718177
Please do not split things apart if they are meant to form a single example. The documentation as a whole will never be organized, or learned, in a fully structured way; this is how the computer science world works. Let me give you another example. How would a user learn that a vector column is defined by adding the VectorType attribute? Assume that they have already found the doc of GenerateRandomMulticlass. They still need to click on the return type of GenerateRandomMulticlass, which is List<DatasetUtils.MulticlassClassificationExample>. Then another page opens. Where is the vector attribute of my Features? The user needs to click on Fields again to open a third page, which contains
[Microsoft.ML.Data.VectorType(new System.Int32[] { 10 })]
public float[] Features;
Hiding things in this hierarchical way is definitely a learning barrier. #WontFix
As pointed out by Shaauheen, let's keep the samples as they are for V1. Post V1, we can address this through the proper discussions that were canceled in favor of API work. For now, some sample for V1 is better than no sample.
In reply to: 261763964
@@ -5,7 +5,7 @@

namespace Microsoft.ML.Samples.Dynamic.Trainers.BinaryClassification
Microsoft.ML.Samples.Dynamic
Let's keep the namespace Microsoft.ML.Samples.Dynamic for all samples. The nesting is unnecessary. #Resolved
I've explained the reasoning here:
#2729 (comment)
By the way, we don't need namespace nesting for transforms, only for trainers, because of naming conflicts.
In reply to: 261050465
@@ -17,7 +17,7 @@ public static void Example()
// Create in-memory examples as C# native class.
var examples = DatasetUtils.GenerateRandomMulticlassClassificationExamples(1000);

- // Convert native C# class to IDataView, a consumble format to ML.NET functions.
+ // Convert native C# class to IDataView, a consumable format to ML.NET functions.
native C# class
Please spell out the full type: List #Resolved
I added more specificity, without spelling out the full type.
If we want full types, we should spell them out in the code instead of using var, because the compiler will make sure they are always correct and catch any changes. A type written in a comment will go stale in no time.
So far we've been using var in the samples, so I'll keep it for consistency.
In reply to: 261050690
The following text describes the SDCA algorithm details.
It's used for the remarks section of all SDCA-based trainers (binary, multiclass, regression)
-->
<member name="SDCA_remarks">
So shall we keep the docs.xml for reuse, then? #Resolved
If we can reduce the usage to 1 by using cref links, we don't need doc.xml; we should use inline documentation.
In this case we have 4 SDCA trainers that all need to explain what SDCA is, and we cannot cref them to each other. So I'm keeping doc.xml for it.
In reply to: 261051276
Codecov Report
@@ Coverage Diff @@
## master #2771 +/- ##
==========================================
- Coverage 71.66% 71.65% -0.01%
==========================================
Files 809 809
Lines 142378 142383 +5
Branches 16119 16119
==========================================
- Hits 102030 102027 -3
- Misses 35915 35922 +7
- Partials 4433 4434 +1
Please see my response on the other comments.
In reply to: 468143030
Refers to: docs/samples/Microsoft.ML.Samples/Dynamic/Trainers/MulticlassClassification/LightGbmWithOptions.cs:19 in c5e034b.
Docs & samples for SDCA binary, multi-class, and regression.
Related to #2522