In-memory & self-contained sample template. #2979

Merged
shmoradims merged 1 commit into dotnet:master from samples_template
Mar 18, 2019

Conversation

shmoradims

Related to #2726. I created this in-memory, self-contained sample for FastTree. I'll use the final version from this PR as the template for the samples that follow.

Member

@wschin wschin left a comment

🚢 🚀 🥇 🏃‍♀️

    public float Label { get; set; }
    // Predicted score from the trainer.
    public float Score { get; set; }
}
Member

@sfilipi sfilipi Mar 15, 2019

I like extending the DataPoint class because that is effectively what happens: columns get added. Not a biggie, though. #Resolved

Author

There are pros and cons here. I think keeping input and output separate is easier for users to understand.
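
To make the trade-off concrete, here is a minimal sketch of both layouts. The `Label` and `Score` properties come from the snippet under review; the class names `Prediction` and `DataPointWithScore` are assumptions made for illustration, and the `[VectorType(50)]` attribute is left out so the sketch compiles without the ML.NET package.

```csharp
using System;

// Input schema: one labeled example with its feature vector.
public class DataPoint
{
    public float Label { get; set; }
    public float[] Features { get; set; }
}

// Layout kept in this PR: a separate output type that lists only the
// columns read back after scoring. The name "Prediction" is assumed.
public class Prediction
{
    public float Label { get; set; }
    // Predicted score from the trainer.
    public float Score { get; set; }
}

// Reviewer's alternative: extend the input type, mirroring how scoring
// appends columns to the input data. The name is hypothetical.
public class DataPointWithScore : DataPoint
{
    public float Score { get; set; }
}

public static class SchemaSketch
{
    public static void Main()
    {
        var separate = new Prediction { Label = 0.5f, Score = 0.4f };
        var extended = new DataPointWithScore
        {
            Label = 0.5f,
            Features = new float[50],
            Score = 0.4f
        };
        Console.WriteLine($"separate: label={separate.Label}, score={separate.Score}");
        Console.WriteLine($"extended: features={extended.Features.Length}, score={extended.Score}");
    }
}
```

The separate type shows readers exactly which columns the sample reads back, while the derived type avoids repeating `Label` at the cost of coupling the output schema to the input schema.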


In reply to: 266150723 [](ancestors = 266150723)

}

- private static IEnumerable<DataPoint> GenerateRandomDataPoints(int count)
+ private static IEnumerable<DataPoint> GenerateRandomDataPoints(int count, int seed=0)
Member

@sfilipi sfilipi Mar 15, 2019

private static IEnumerable<DataPoint> GenerateRandomDataPoints(int count, int seed=0)

I like this, but do you think it will be quick to create them artificially for each task? Ranking and time series come to mind. #Resolved

Author

Sometimes it's not easy, ranking being an example. For those, I'll keep the text-loader style.

So this template is mostly suitable for regression and binary classification.


In reply to: 266151216 [](ancestors = 266151216)

Contributor

@zeahmed zeahmed Mar 15, 2019

Text data is also not easy to randomly generate.


In reply to: 266157156 [](ancestors = 266157156,266151216)

Member

I believe Zeeshan A meant meaningful text data, but I don't think we need meaningful data to demonstrate the functionality of a module. The amount of data might be a problem for trainers, but to my knowledge, there is no trainer that directly consumes strings.

Author

This sample is just a template. If it doesn't make sense in some scenarios, we can use the text-loader style instead.


In reply to: 266160094 [](ancestors = 266160094,266157156,266151216)

codecov bot commented Mar 15, 2019

Codecov Report

Merging #2979 into master will decrease coverage by 0.01%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #2979      +/-   ##
==========================================
- Coverage    72.3%   72.29%   -0.02%     
==========================================
  Files         796      796              
  Lines      142349   142349              
  Branches    16051    16051              
==========================================
- Hits       102923   102908      -15     
- Misses      35041    35060      +19     
+ Partials     4385     4381       -4
| Flag | Coverage Δ |
|---|---|
| #Debug | 72.29% <ø> (-0.02%) ⬇️ |
| #production | 68% <ø> (-0.02%) ⬇️ |
| #test | 88.49% <ø> (+0.01%) ⬆️ |

| Impacted Files | Coverage Δ |
|---|---|
| src/Microsoft.ML.Core/Data/ProgressReporter.cs | 70.95% <0%> (-6.99%) ⬇️ |
| src/Microsoft.ML.Maml/MAML.cs | 24.75% <0%> (-1.46%) ⬇️ |
| ...soft.ML.TestFramework/DataPipe/TestDataPipeBase.cs | 74% <0%> (+0.33%) ⬆️ |
| src/Microsoft.ML.Transforms/Text/LdaTransform.cs | 89.89% <0%> (+0.62%) ⬆️ |

Member

@sfilipi sfilipi left a comment

:shipit:

@@ -45,11 +75,21 @@ private static IEnumerable<DataPoint> GenerateRandomDataPoints(int count)
}
}

// Example with label and 50 feature values. A data set is a collection of such examples.
private class DataPoint
{
    public float Label { get; set; }
    [VectorType(50)]
    public float[] Features { get; set; }
}
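
For readers who want to see how a seeded generator could fill this class, the following is a minimal sketch rather than the PR's actual implementation. It keeps the `GenerateRandomDataPoints(int count, int seed = 0)` signature from the diff; the body, the `GeneratorSketch` class name, and the choice to correlate each feature with the label are assumptions. The `[VectorType(50)]` attribute is omitted so the code compiles without the ML.NET package.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Example with a label and 50 feature values.
public class DataPoint
{
    public float Label { get; set; }
    // The PR's class also carries [VectorType(50)] on this property.
    public float[] Features { get; set; }
}

public static class GeneratorSketch
{
    public const int FeatureCount = 50;

    // A fixed seed makes the in-memory data set reproducible across runs.
    public static IEnumerable<DataPoint> GenerateRandomDataPoints(int count, int seed = 0)
    {
        var random = new Random(seed);
        for (int i = 0; i < count; i++)
        {
            float label = (float)random.NextDouble();
            var features = new float[FeatureCount];
            for (int j = 0; j < FeatureCount; j++)
            {
                // Assumption: features are the label plus noise, so a
                // regression trainer has some signal to learn from.
                features[j] = label + (float)random.NextDouble();
            }
            yield return new DataPoint { Label = label, Features = features };
        }
    }

    public static void Main()
    {
        var first = GenerateRandomDataPoints(3).ToList();
        var second = GenerateRandomDataPoints(3).ToList();
        Console.WriteLine(first[0].Features.Length);          // prints 50
        Console.WriteLine(first[0].Label == second[0].Label); // prints True
    }
}
```

Because the method is an iterator, no data points are created until the sequence is enumerated, and each enumeration restarts from the same seed.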
Contributor

@zeahmed zeahmed Mar 15, 2019

I assume all the samples are going to use the same DataPoints (to be consistent), right? #Resolved

Member

@wschin wschin Mar 16, 2019

I guess so. We now have samples, examples, instances, which are kind of less precise than data point. #Resolved

Author

The label type and number of features might change depending on the scenario, but the outline is the same.
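
As an illustration of "same outline, different label type", here is a hedged sketch of how a binary-classification variant might look. Everything in it is hypothetical: the `BinaryDataPoint` and `BinaryGeneratorSketch` names, the feature count of 10, and the way the features are shifted for positive examples are assumptions, not code from this PR.

```csharp
using System;
using System.Collections.Generic;

// Binary-classification variant: the label becomes a bool.
public class BinaryDataPoint
{
    public bool Label { get; set; }
    public float[] Features { get; set; }
}

public static class BinaryGeneratorSketch
{
    // Hypothetical feature count, chosen only for this illustration.
    public const int FeatureCount = 10;

    public static IEnumerable<BinaryDataPoint> GenerateRandomDataPoints(int count, int seed = 0)
    {
        var random = new Random(seed);
        for (int i = 0; i < count; i++)
        {
            bool label = random.NextDouble() > 0.5;
            var features = new float[FeatureCount];
            for (int j = 0; j < FeatureCount; j++)
            {
                // Shift positive examples so the features carry some signal.
                float noise = (float)random.NextDouble();
                features[j] = label ? noise + 0.1f : noise;
            }
            yield return new BinaryDataPoint { Label = label, Features = features };
        }
    }

    public static void Main()
    {
        foreach (var point in GenerateRandomDataPoints(5))
            Console.WriteLine($"{point.Label}: {point.Features.Length} features");
    }
}
```

Only the label type, the feature count, and the rule tying features to the label differ; the class-plus-seeded-generator outline stays the same.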


In reply to: 266160360 [](ancestors = 266160360)

@shmoradims shmoradims merged commit 32017a3 into dotnet:master Mar 18, 2019
@shmoradims shmoradims deleted the samples_template branch April 2, 2019 23:41
@ghost ghost locked as resolved and limited conversation to collaborators Mar 23, 2022

4 participants