In-memory & self-contained sample template. #2979
🚢 🚀 🥇 🏃♀️
    public float Label { get; set; }
    // Predicted score from the trainer.
    public float Score { get; set; }
}
I like extending the DataPoint class, because that is effectively what happens: columns get added. Not a big deal, though. #Resolved
There are pros and cons here. I think keeping input and output separate is easier for users to understand.
In reply to: 266150723
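For readers following this thread, the two shapes under discussion can be put side by side. This is only an illustrative sketch: `Prediction` and `DataPointWithScore` are hypothetical names, the 50-element feature vector is borrowed from the sample, and ML.NET attributes such as `[VectorType]` are left out so the snippet compiles without any package reference.

```csharp
using System;

public static class SchemaShapes
{
    // Input example: a label plus a feature vector.
    public class DataPoint
    {
        public float Label { get; set; }
        public float[] Features { get; set; }
    }

    // Option A (hypothetical name): a separate output class, so input and
    // output schemas are read independently.
    public class Prediction
    {
        public float Label { get; set; }
        public float Score { get; set; }
    }

    // Option B (hypothetical name): extend the input class, mirroring the
    // idea that scoring adds columns to the existing ones.
    public class DataPointWithScore : DataPoint
    {
        public float Score { get; set; }
    }

    public static void Main()
    {
        var scored = new DataPointWithScore { Label = 1f, Features = new float[50], Score = 0.9f };
        DataPoint asInput = scored; // Option B: an output row is still a DataPoint.
        var prediction = new Prediction { Label = 1f, Score = 0.9f };

        Console.WriteLine(asInput.Features.Length);
        Console.WriteLine(prediction.Score);
    }
}
```

Option A keeps each class small and self-explanatory, while Option B carries the input columns along with the score; which reads better likely depends on the sample.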
}

- private static IEnumerable<DataPoint> GenerateRandomDataPoints(int count)
+ private static IEnumerable<DataPoint> GenerateRandomDataPoints(int count, int seed=0)
private static IEnumerable<DataPoint> GenerateRandomDataPoints(int count, int seed=0)
I like this, but do you think it will be quick to create data artificially for each task? Ranking and time series come to mind. #Resolved
Sometimes it's not easy, ranking being an example. For those, I'll keep the text-loader style.
So this template is mostly suitable for regression and binary classification.
In reply to: 266151216
Text data is also not easy to randomly generate.
In reply to: 266157156
I believe Zeeshan A meant meaningful text data, but I don't think we need meaningful data to demonstrate the functionality of a module. The amount of data might be a problem for trainers, but to my knowledge no trainer consumes strings directly.
This sample is just a template. If in some scenarios it doesn't make sense, we can try text-loader instead.
In reply to: 266160094
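To make the seeded-generator idea from this thread concrete, here is a minimal sketch of what a `GenerateRandomDataPoints(int count, int seed = 0)` helper could look like. It is not necessarily the code in this PR: the way features are derived from the label (label plus uniform noise) is an assumption made so that a regressor has some signal, and the ML.NET `[VectorType(50)]` attribute is omitted to keep the snippet dependency-free.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RandomDataSketch
{
    // Same shape as the DataPoint discussed in this PR, minus ML.NET attributes.
    public class DataPoint
    {
        public float Label { get; set; }
        public float[] Features { get; set; }
    }

    // A fixed seed makes every run of the sample produce the same data.
    public static IEnumerable<DataPoint> GenerateRandomDataPoints(int count, int seed = 0)
    {
        var random = new Random(seed);
        for (int i = 0; i < count; i++)
        {
            float label = (float)random.NextDouble();

            // Assumption: each feature is the label plus uniform noise in [0, 1).
            var features = new float[50];
            for (int j = 0; j < features.Length; j++)
                features[j] = label + (float)random.NextDouble();

            yield return new DataPoint { Label = label, Features = features };
        }
    }

    public static void Main()
    {
        var first = GenerateRandomDataPoints(5, seed: 1).ToList();
        var second = GenerateRandomDataPoints(5, seed: 1).ToList();

        // Same seed, same sequence: labels and features match element by element.
        bool same = first
            .Zip(second, (a, b) => a.Label == b.Label && a.Features.SequenceEqual(b.Features))
            .All(match => match);
        Console.WriteLine(same);
    }
}
```

Using a different seed for training and test data gives two reproducible but distinct sets. Tasks such as ranking or text processing would need more structure than this, which is the concern raised above.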
Codecov Report
@@            Coverage Diff             @@
##           master    #2979      +/-   ##
==========================================
- Coverage    72.3%   72.29%   -0.02%
==========================================
  Files         796      796
  Lines      142349   142349
  Branches    16051    16051
==========================================
- Hits       102923   102908      -15
- Misses      35041    35060      +19
+ Partials     4385     4381       -4
@@ -45,11 +75,21 @@ private static IEnumerable<DataPoint> GenerateRandomDataPoints(int count)
        }
    }

    // Example with label and 50 feature values. A data set is a collection of such examples.
    private class DataPoint
    {
        public float Label { get; set; }
        [VectorType(50)]
        public float[] Features { get; set; }
    }
I assume all the samples are going to use the same DataPoint class (to be consistent), right? #Resolved
I guess so. We now have "samples", "examples", and "instances", all of which are less precise than "data point". #Resolved
The label type and number of features might change depending on the scenario, but the outline is the same.
In reply to: 266160360
Related to #2726. I created this in-memory, self-contained sample for FastTree. I'll use the final version from this PR as a template for the following samples.