-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Transforms components docs #1321
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…ntData and sample for the text transform.
Adding samples for the ConcatEstimator, and KeyToValue, Term that will need to get referenced from the respective catalogs when they happen. re-organized the samples based on static-dynamic. Renamed.
var default_pipeline = new WordTokenizer(ml, "ReviewReverse", "ReviewReverse") | ||
.Append(new TermEstimator(ml, "ReviewReverse" , defaultColumnName)); | ||
|
||
// Another pipeline, that customizes the advanced settings of the FeaturizeText transformer. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Another pip [](start = 11, length = 15)
@justinormont, @TomFinley you guys think I should add more explanations about SortOrder here in the comments? #Resolved
// DefaultColumnName column obtained post-transformation. | ||
// radiation galaxy universe duck | ||
// space galaxy universe radiation | ||
// bus pickup universe radiation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
universe radiation [](start = 26, length = 18)
why is this here, log an issue post merge.
var trainData = ml.CreateStreamingDataView(data); | ||
|
||
// Preview of the data. | ||
// Review ReviewReverse, Label |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Review ReviewReverse, Label | |
// Review, ReviewReverse, Label | |
``` #Resolved |
// "animals birds cats dogs fish horse", "radiation galaxy universe duck", 1 | ||
// "horse birds house fish duck cats", "space galaxy universe radiation", 0 | ||
// "car truck driver bus pickup", "bus pickup", 1 | ||
// "car truck driver bus pickup horse", "car truck", 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May want to say the goal of the dataset. Eg: "The goal of the dataset to classify if the review matches ..."
I ask this, mainly as I'm reading the example, I have no idea what the labels represent vs. the data. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
var default_pipeline = new WordTokenizer(ml, "ReviewReverse", "ReviewReverse") | ||
.Append(new TermEstimator(ml, "ReviewReverse" , defaultColumnName)); | ||
|
||
// Another pipeline, that customizes the advanced settings of the TermEstimator. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May want to say why changing the hyperparameters of the TermEstimator is useful. Why keep only the first 10 ASCIIbetically ordered terms? #Pending
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// making use of default settings. | ||
string defaultColumnName = "DefaultKeys"; | ||
// REVIEW create through the catalog extension | ||
var default_pipeline = new WordTokenizer(ml, "ReviewReverse", "ReviewReverse") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why use WordTokenizer+Term instead of TextTransform? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The two added datasets could use an entry in the dataset README.md: https://github.com/dotnet/machinelearning/blob/586533cb80d7e0ea18c02f2e510e1a5cc0e27f41/test/data/README.md |
var transformedData = pipeline.Fit(trainData).Transform(trainData); | ||
|
||
// Getting the data of the newly created column as an IEnumerable of SampleInfertDataWithFeatures. | ||
var featuresColumn = transformedData.AsEnumerable<SampleInfertDataWithFeatures>(ml, reuseRowObject: false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does reuseRowObject
do? Would defaults be fine for these samples? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for the enumerable, it determines whether to return the same object on every row, or allocate a new one per row. It is a required param; doesn't have a default.
For the settings of the transforms, i am using both defaults and non-defaults; since the purpose of this snippet is to educate about usage.
In reply to: 227591855 [](ancestors = 227591855)
printHelper("Induced", normalizedColumn); | ||
|
||
// Preview of the data. | ||
// Induced |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Induced [](start = 15, length = 7)
nit: this doesn't match $"{colName} column obtained post-transformation." #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant if these comment are supposed to be the output from printHelper. If they are, then the headers don't match:
// Preview of the data.
// Induced
vs
Console.WriteLine($"{colName} column obtained post-transformation.");
Same thing applies to other comments showing data preview.
In reply to: 227982369 [](ancestors = 227982369,227974775)
separator: '\t', hasHeader: true); | ||
|
||
// Read the data, and leave 10% out, so we can use them for testing | ||
var data = reader.Read(new MultiFileSource(dataFile)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
new MultiFileSource(dataFile) [](start = 35, length = 29)
I think you can use dataFile directly for a single file. #Closed
Console.WriteLine($"L2 - {averagedMetrics.L2}"); | ||
Console.WriteLine($"LossFunction - {averagedMetrics.LossFn}"); | ||
Console.WriteLine($"RMS - {averagedMetrics.Rms}"); | ||
Console.WriteLine($"RSquared - {averagedMetrics.RSquared}"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please add preview #Closed
var trainData = ml.CreateStreamingDataView(data); | ||
|
||
// Preview of the topics data; a dataset that contains one column with two set of keys assigned to a body of text | ||
// Review and ReviewReverse. The dataset will be used to classify how accurately the keys are assigned to the text. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ReviewReverse [](start = 26, length = 13)
It's not clear to me what this column means. #Resolved
// making use of default settings. | ||
string defaultColumnName = "DefaultKeys"; | ||
// REVIEW create through the catalog extension | ||
var default_pipeline = new WordTokenizeEstimator(ml, "ReviewReverse", "ReviewReverse") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ReviewReverse [](start = 83, length = 13)
I think it's better to leave this out, if you want to do the transformation in-place. #Resolved
// radiation galaxy universe duck | ||
// space galaxy universe radiation | ||
// bus pickup universe radiation | ||
// car truck universe radiation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
universe radiation [](start = 25, length = 18)
this seems to be a bug too. #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// and condition the order in which they get evaluated by changing sort from the default Occurence (order in which they get encountered) | ||
// to value/alphabetically. | ||
string customizedColumnName = "CustomizedKeys"; | ||
var customized_pipeline = new WordTokenizeEstimator(ml, "ReviewReverse", "ReviewReverse") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ReviewReverse [](start = 86, length = 13)
plz remove for in-place transformation like above #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Trainers.SdcaRegression(); | ||
Transformers.ConcatEstimator(); | ||
TransformSamples.MinMaxNormalizer(); | ||
// TrainersSamples.LightGbmRegression(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// TrainersSamples.LightGbmRegression(); [](start = 12, length = 40)
looks weird. #Resolved
string dataFile = SamplesUtils.DatasetUtils.DownloadHousingRegressionDataset(); | ||
|
||
// Creating the ML.Net IHostEnvironment object, needed for the pipeline | ||
var env = new LocalEnvironment(seed: 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
var env = new LocalEnvironment(seed: 0); [](start = 11, length = 41)
why for transforms examples we use mlContext, and here we create LocalEnvironment? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"9",8,19,20.1,61,5,9 | ||
"10",NA,194,8.6,69,5,10 | ||
"11",7,NA,6.9,74,5,11 | ||
"12",16,256,9.7,69,5,12 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see any usage of this file in PR, not even mentions of air quality.
Is it necessary?
And is it regression dataset? because we have #938 #Pending
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
planning to keep the samples consistent with our other repo. They use this dataset, so it will get used in the next set of samples. Just added the two we need in one shot .
In reply to: 228309756 [](ancestors = 228309756)
re-arranging lines.
{ | ||
public partial class TransformSamples | ||
{ | ||
public static void KeyToValue_Term() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
KeyToValue_Term [](start = 27, length = 15)
This is standing out, what this "_" mean, and why it cannot be KeyToValueAndTerm or KeyToValueThenTerm?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addresses P1 non-Image transforms for #1209.
Adds samples for Nomalizers, Text, Cocat, Term, KeyToVal