Skip to content

Transforms components docs #1321

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 22 commits into from
Oct 30, 2018
Merged

Conversation

sfilipi
Copy link
Member

@sfilipi sfilipi commented Oct 19, 2018

Addresses P1 non-Image transforms for #1209.

Adds samples for Nomalizers, Text, Cocat, Term, KeyToVal

Adding samples for the ConcatEstimator, and KeyToValue, Term that will need to get referenced from the respective catalogs when they happen.

re-organized the samples based on static-dynamic. Renamed.
@sfilipi sfilipi self-assigned this Oct 19, 2018
@sfilipi sfilipi added API Issues pertaining the friendly API documentation Related to documentation of ML.NET labels Oct 19, 2018
var default_pipeline = new WordTokenizer(ml, "ReviewReverse", "ReviewReverse")
.Append(new TermEstimator(ml, "ReviewReverse" , defaultColumnName));

// Another pipeline, that customizes the advanced settings of the FeaturizeText transformer.
Copy link
Member Author

@sfilipi sfilipi Oct 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// Another pip [](start = 11, length = 15)

@justinormont, @TomFinley you guys think I should add more explanations about SortOrder here in the comments? #Resolved

// DefaultColumnName column obtained post-transformation.
// radiation galaxy universe duck
// space galaxy universe radiation
// bus pickup universe radiation
Copy link
Member Author

@sfilipi sfilipi Oct 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

universe radiation [](start = 26, length = 18)

why is this here, log an issue post merge.

var trainData = ml.CreateStreamingDataView(data);

// Preview of the data.
// Review ReviewReverse, Label
Copy link
Contributor

@justinormont justinormont Oct 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Review ReviewReverse, Label
// Review, ReviewReverse, Label
``` #Resolved

// "animals birds cats dogs fish horse", "radiation galaxy universe duck", 1
// "horse birds house fish duck cats", "space galaxy universe radiation", 0
// "car truck driver bus pickup", "bus pickup", 1
// "car truck driver bus pickup horse", "car truck", 0
Copy link
Contributor

@justinormont justinormont Oct 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May want to say the goal of the dataset. Eg: "The goal of the dataset to classify if the review matches ..."

I ask this, mainly as I'm reading the example, I have no idea what the labels represent vs. the data. #Resolved

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. It's not obvious what the labels mean.


In reply to: 227236398 [](ancestors = 227236398)

var default_pipeline = new WordTokenizer(ml, "ReviewReverse", "ReviewReverse")
.Append(new TermEstimator(ml, "ReviewReverse" , defaultColumnName));

// Another pipeline, that customizes the advanced settings of the TermEstimator.
Copy link
Contributor

@justinormont justinormont Oct 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May want to say why changing the hyperparameters of the TermEstimator is useful. Why keep only the first 10 ASCIIbetically ordered terms? #Pending

Copy link
Member Author

@sfilipi sfilipi Oct 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see if you'd phrase it differently?


In reply to: 227236905 [](ancestors = 227236905)

// making use of default settings.
string defaultColumnName = "DefaultKeys";
// REVIEW create through the catalog extension
var default_pipeline = new WordTokenizer(ml, "ReviewReverse", "ReviewReverse")
Copy link
Contributor

@justinormont justinormont Oct 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why use WordTokenizer+Term instead of TextTransform? #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sample is about TermEstimator.


In reply to: 227237000 [](ancestors = 227237000)

@justinormont
Copy link
Contributor

justinormont commented Oct 23, 2018

The two added datasets could use an entry in the dataset README.md: https://github.com/dotnet/machinelearning/blob/586533cb80d7e0ea18c02f2e510e1a5cc0e27f41/test/data/README.md
#Resolved

var transformedData = pipeline.Fit(trainData).Transform(trainData);

// Getting the data of the newly created column as an IEnumerable of SampleInfertDataWithFeatures.
var featuresColumn = transformedData.AsEnumerable<SampleInfertDataWithFeatures>(ml, reuseRowObject: false);
Copy link
Contributor

@GalOshri GalOshri Oct 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does reuseRowObject do? Would defaults be fine for these samples? #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for the enumerable, it determines whether to return the same object on every row, or allocate a new one per row. It is a required param; doesn't have a default.

For the settings of the transforms, i am using both defaults and non-defaults; since the purpose of this snippet is to educate about usage.


In reply to: 227591855 [](ancestors = 227591855)

printHelper("Induced", normalizedColumn);

// Preview of the data.
// Induced
Copy link

@shmoradims shmoradims Oct 24, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Induced [](start = 15, length = 7)

nit: this doesn't match $"{colName} column obtained post-transformation." #Closed

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The col name is "Induced", see line 39.


In reply to: 227974775 [](ancestors = 227974775)

Copy link

@shmoradims shmoradims Oct 25, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant if these comment are supposed to be the output from printHelper. If they are, then the headers don't match:
// Preview of the data.
// Induced

vs

Console.WriteLine($"{colName} column obtained post-transformation.");

Same thing applies to other comments showing data preview.


In reply to: 227982369 [](ancestors = 227982369,227974775)

separator: '\t', hasHeader: true);

// Read the data, and leave 10% out, so we can use them for testing
var data = reader.Read(new MultiFileSource(dataFile));
Copy link

@shmoradims shmoradims Oct 24, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

new MultiFileSource(dataFile) [](start = 35, length = 29)

I think you can use dataFile directly for a single file. #Closed

Console.WriteLine($"L2 - {averagedMetrics.L2}");
Console.WriteLine($"LossFunction - {averagedMetrics.LossFn}");
Console.WriteLine($"RMS - {averagedMetrics.Rms}");
Console.WriteLine($"RSquared - {averagedMetrics.RSquared}");
Copy link

@shmoradims shmoradims Oct 24, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add preview #Closed

var trainData = ml.CreateStreamingDataView(data);

// Preview of the topics data; a dataset that contains one column with two set of keys assigned to a body of text
// Review and ReviewReverse. The dataset will be used to classify how accurately the keys are assigned to the text.
Copy link

@shmoradims shmoradims Oct 25, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ReviewReverse [](start = 26, length = 13)

It's not clear to me what this column means. #Resolved

// making use of default settings.
string defaultColumnName = "DefaultKeys";
// REVIEW create through the catalog extension
var default_pipeline = new WordTokenizeEstimator(ml, "ReviewReverse", "ReviewReverse")
Copy link

@shmoradims shmoradims Oct 25, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ReviewReverse [](start = 83, length = 13)

I think it's better to leave this out, if you want to do the transformation in-place. #Resolved

// radiation galaxy universe duck
// space galaxy universe radiation
// bus pickup universe radiation
// car truck universe radiation
Copy link

@shmoradims shmoradims Oct 25, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

universe radiation [](start = 25, length = 18)

this seems to be a bug too. #Closed

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, will investigate post PR.


In reply to: 228249405 [](ancestors = 228249405)

// and condition the order in which they get evaluated by changing sort from the default Occurence (order in which they get encountered)
// to value/alphabetically.
string customizedColumnName = "CustomizedKeys";
var customized_pipeline = new WordTokenizeEstimator(ml, "ReviewReverse", "ReviewReverse")
Copy link

@shmoradims shmoradims Oct 25, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ReviewReverse [](start = 86, length = 13)

plz remove for in-place transformation like above #Resolved

@shmoradims shmoradims self-requested a review October 25, 2018 17:51
Copy link

@shmoradims shmoradims left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

Trainers.SdcaRegression();
Transformers.ConcatEstimator();
TransformSamples.MinMaxNormalizer();
// TrainersSamples.LightGbmRegression();
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Oct 25, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// TrainersSamples.LightGbmRegression(); [](start = 12, length = 40)

looks weird. #Resolved

string dataFile = SamplesUtils.DatasetUtils.DownloadHousingRegressionDataset();

// Creating the ML.Net IHostEnvironment object, needed for the pipeline
var env = new LocalEnvironment(seed: 0);
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Oct 25, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

var env = new LocalEnvironment(seed: 0); [](start = 11, length = 41)

why for transforms examples we use mlContext, and here we create LocalEnvironment? #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for spotting it. I updated them.


In reply to: 228308134 [](ancestors = 228308134)

"9",8,19,20.1,61,5,9
"10",NA,194,8.6,69,5,10
"11",7,NA,6.9,74,5,11
"12",16,256,9.7,69,5,12
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Oct 25, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see any usage of this file in PR, not even mentions of air quality.
Is it necessary?
And is it regression dataset? because we have #938 #Pending

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

planning to keep the samples consistent with our other repo. They use this dataset, so it will get used in the next set of samples. Just added the two we need in one shot .


In reply to: 228309756 [](ancestors = 228309756)

{
public partial class TransformSamples
{
public static void KeyToValue_Term()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

KeyToValue_Term [](start = 27, length = 15)

This is standing out, what this "_" mean, and why it cannot be KeyToValueAndTerm or KeyToValueThenTerm?

Copy link
Contributor

@Zruty0 Zruty0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@sfilipi sfilipi merged commit 1bcb79d into dotnet:master Oct 30, 2018
@sfilipi sfilipi deleted the transformsComponentsDocs branch October 30, 2018 20:06
@ghost ghost locked as resolved and limited conversation to collaborators Mar 27, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
API Issues pertaining the friendly API documentation Related to documentation of ML.NET
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants