Skip to content

Creation of components through MLContext and cleanup (KeyToValue, ValueToKey, OneHotEncoding) #2340

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 13 commits into from
Feb 2, 2019

Conversation

artidoro
Copy link
Contributor

@artidoro artidoro commented Jan 31, 2019

This PR is part of the work outlined in #1798, and focuses on the KeyToValue, ValueToKey, OneHotEncoding transformers/estimators:

  1. Internalize constructors of estimators and transformers
  2. Keep ColumnInfo
  3. Rename Arguments -> Options
  4. Internalize Options only when they are not used by public constructor. For all other cases, retain Options as public
  5. Move ColumnInfo to the estimators

This is meant to be a test PR. Once we agree on the format, I will be more confident in doing the remaining.

@codecov
Copy link

codecov bot commented Jan 31, 2019

Codecov Report

❗ No coverage uploaded for pull request base (master@578c188). Click here to learn what that means.
The diff coverage is 92.02%.

@@            Coverage Diff            @@
##             master    #2340   +/-   ##
=========================================
  Coverage          ?   71.24%           
=========================================
  Files             ?      783           
  Lines             ?   140733           
  Branches          ?    16086           
=========================================
  Hits              ?   100261           
  Misses            ?    36017           
  Partials          ?     4455
Flag Coverage Δ
#Debug 71.24% <92.02%> (?)
#production 67.6% <85.89%> (?)
#test 85.28% <100%> (?)

@sfilipi
Copy link
Member

sfilipi commented Jan 31, 2019

    public override SchemaShape GetOutputSchema(SchemaShape inputSchema)

xml summary #Resolved


Refers to: src/Microsoft.ML.Data/Transforms/KeyToValue.cs:519 in 9f1871f. [](commit_id = 9f1871f, deletion_comment = False)

@@ -295,12 +248,12 @@ private ColInfo[] CreateInfos(Schema inputSchema)
}

internal ValueToKeyMappingTransformer(IHostEnvironment env, IDataView input,
Copy link
Member

@sfilipi sfilipi Jan 31, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ValueToKeyMappingTransformer [](start = 17, length = 28)

do we want to keep the ctors of the transformers public, symmetrically with the model parameters (former predictors)? #Resolved

Copy link
Contributor

@TomFinley TomFinley Jan 31, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be a bad example, since we've determined after some time that having public constructors for model parameters is pointless (see e.g., #1866 and leadup discussion to this, also cc @eerhardt, @glebuk). That said, in favor of your point, direct construction of transformers is potentially useful -- unlike direct creation of model parameters, with which by itself literally nothing can be done.

However, unless we have a solid immediate scenario that literally cannot be accommodated using the usual MLContext pathway (not the case here), we should not do it. I sort of like APIs with one way for a user to do things, not multiple, at least for v1. (And even if we did have such a scenario, considering the timeframes, we probably wouldn't want to do it anyway.) #Resolved

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As per #1098, we do want to make them internal. For V1, MLContext would be the way to create them all.

Regarding model parameters, there is a separate issue for that #1866 . We can do those as a separate effort / PR perhaps ?


In reply to: 252551812 [](ancestors = 252551812)

Copy link
Contributor

@glebuk glebuk Jan 31, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should couple public-ness of transformers with model params. Those are different things. As Rogan suggested in #1886, it can be valuable to have public model ctors for a variety of scenarios. But that is a set of different reason vs transformers. However, in case of transformers, I agree, if there is a way to create them with MLContext, there is no burning reason to have a public ctor. #Resolved

Copy link
Contributor Author

@artidoro artidoro Jan 31, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Background:

  • Trainable transformers can only be instantiated through estimators (using .Fit() )
  • Non trainable transforms can be instantiated directly (through their constructor)
  • We only create estimators through MLContext, not transformers

I think we should keep the constructors of the non trainable transformers public, while internalizing the constructors of the estimators. #Resolved

Copy link
Member

@abgoswam abgoswam Jan 31, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@artidoro . Perhaps we should also make the constructors for non-trainable transforms internal.

I believe @TomFinley and @glebuk also mention the same in their comments.


In reply to: 252782663 [](ancestors = 252782663)

Copy link
Contributor Author

@artidoro artidoro Jan 31, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After discussing with Tom and Gleb, I think the decision is to internalize constructors for all transformers. They will only be instantiated through their respective estimators.

The question remains on how to make it look less silly when you have a non trainable transformer that needs to be instantiated using .Fit() on some data. #Resolved

Copy link
Member

@abgoswam abgoswam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@@ -102,7 +103,7 @@ public KeyToValueMappingTransformer(IHostEnvironment env, params (string outputC
/// Factory method for SignatureDataTransform.
/// </summary>
[BestFriend]
internal static IDataTransform Create(IHostEnvironment env, Arguments args, IDataView input)
internal static IDataTransform Create(IHostEnvironment env, Options args, IDataView input)
Copy link
Contributor Author

@artidoro artidoro Jan 31, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

args [](start = 76, length = 4)

rename to options. #Resolved

@@ -317,14 +270,14 @@ private ColInfo[] CreateInfos(Schema inputSchema)

[BestFriend]
// Factory method for SignatureDataTransform.
internal static IDataTransform Create(IHostEnvironment env, Arguments args, IDataView input)
internal static IDataTransform Create(IHostEnvironment env, Options args, IDataView input)
Copy link
Contributor Author

@artidoro artidoro Jan 31, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

args [](start = 76, length = 4)

rename to options. #Resolved

@@ -119,7 +119,7 @@ public Arguments()

internal const string UserName = "Categorical Transform";

internal static IDataTransform Create(IHostEnvironment env, Arguments args, IDataView input)
internal static IDataTransform Create(IHostEnvironment env, Options args, IDataView input)
Copy link
Contributor Author

@artidoro artidoro Jan 31, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

args [](start = 76, length = 4)

rename to options. #Resolved

@@ -32,15 +32,15 @@ public static void KeyToValue_Term()
string defaultColumnName = "DefaultKeys";
// REVIEW create through the catalog extension
var default_pipeline = new WordTokenizingEstimator(ml, "Review")
.Append(new ValueToKeyMappingEstimator(ml, defaultColumnName, "Review"));
.Append(ml.Transforms.Conversion.MapValueToKey(defaultColumnName, "Review"));

// Another pipeline, that customizes the advanced settings of the TermEstimator.
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Jan 31, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TermEstimator [](start = 78, length = 13)

it's no longer TermEstimator #Resolved

@Ivanidzo4ka
Copy link
Contributor

Ivanidzo4ka commented Jan 31, 2019

public class KeyToValue_TermExample

hmm. #Resolved


Refers to: docs/samples/Microsoft.ML.Samples/Dynamic/KeyToValue_Term.cs:9 in 0c415ef. [](commit_id = 0c415ef, deletion_comment = False)

@Ivanidzo4ka
Copy link
Contributor

public class KeyToValue_TermExample

Can we call it ValueToKeyAndKeyToValue?
Or The Value on There and Back Again


In reply to: 459552730 [](ancestors = 459552730)


Refers to: docs/samples/Microsoft.ML.Samples/Dynamic/KeyToValue_Term.cs:9 in 0c415ef. [](commit_id = 0c415ef, deletion_comment = False)

@Ivanidzo4ka
Copy link
Contributor

Ivanidzo4ka commented Feb 1, 2019

    public static void KeyToValue_Term()

we tend to put

        /// <example>
        /// <format type="text/markdown">
        /// <![CDATA[
        /// [!code-csharp[FeaturizeText](~/../docs/samples/docs/samples/Microsoft.ML.Samples/Dynamic/WordEmbeddingTransform.cs)]
        /// ]]>
        /// </format>
        /// </example>

on top of estimators in catalog. Can you do it as well?
#Resolved


Refers to: docs/samples/Microsoft.ML.Samples/Dynamic/KeyToValue_Term.cs:11 in 0c415ef. [](commit_id = 0c415ef, deletion_comment = False)

@Ivanidzo4ka
Copy link
Contributor

public class KeyToValue_TermExample

Or maybe break them into two samples files?


In reply to: 459554338 [](ancestors = 459554338,459552730)


Refers to: docs/samples/Microsoft.ML.Samples/Dynamic/KeyToValue_Term.cs:9 in 0c415ef. [](commit_id = 0c415ef, deletion_comment = False)

@@ -102,13 +102,13 @@ public static KeyToValueMappingEstimator MapKeyToValue(this TransformsCatalog.Co
/// <param name="outputColumnName">Name of the column resulting from the transformation of <paramref name="inputColumnName"/>.</param>
/// <param name="inputColumnName">Name of the column to transform. If set to <see langword="null"/>, the value of the <paramref name="outputColumnName"/> will be used as source.</param>
/// <param name="maxNumTerms">Maximum number of keys to keep per column when auto-training.</param>
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Feb 1, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maxNumTerms [](start = 25, length = 11)

I know it's unrelated to your changes, but we no longer call this transform as TermTransform, can we rename parameter to maxNumberOfKeys #Resolved

public class ColumnInfo
{
public readonly string Name;
public readonly string InputColumnName;
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Feb 1, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sfilipi, why it's Name and not OutputColumnName? #Resolved

/// <param name="maxNumTerms">Maximum number of terms to keep per column when auto-training.</param>
/// <param name="sort">How items should be ordered when vectorized. If <see cref="SortOrder.Occurrence"/> choosen they will be in the order encountered.
/// If <see cref="SortOrder.Value"/>, items are sorted according to their default comparison, for example, text sorting will be case sensitive (for example, 'A' then 'Z' then 'a').</param>
/// <param name="term">List of terms.</param>
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Feb 1, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/// List of terms. [](start = 12, length = 45)

Just curious in intent of your PR to hide constructors, or you also want to make it more clear to use. If later, I would put better description. Or maybe break this one into two constructors. sort and max num should be useless if you provide list of terms (can we use keys instead of terms?) #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can file an issue on improving the API for ValueToKey estimator, but the general intent of this PR is mainly to hide and rename.
I agree it needs to be improved though especially the ColumnInfo object does not look right.


In reply to: 252895393 [](ancestors = 252895393)

@@ -120,7 +120,7 @@ public static KeyToValueMappingEstimator MapKeyToValue(this TransformsCatalog.Co
/// view, and the key-values will be taken from taht column. If unspecified, the key-values will be determined
/// from the input data upon fitting.</param>
public static ValueToKeyMappingEstimator MapValueToKey(this TransformsCatalog.ConversionTransforms catalog,
ValueToKeyMappingTransformer.ColumnInfo[] columns, IDataView keyData = null)
ValueToKeyMappingEstimator.ColumnInfo[] columns, IDataView keyData = null)
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Feb 1, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ValueToKeyMappingEstimator.ColumnInfo[] columns, IDataView keyData = null [](start = 12, length = 73)

I don't understand why we have this constructor.
I can have list of terms in each columnInfo, and I can specify keyData, what will happen? Is keyData get affected by maxNum and sort?
It brings so many questions. #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, we should clean this one out in a separate PR. Do you want to create an issue? or should I do it?


In reply to: 252895984 [](ancestors = 252895984)

@@ -120,7 +120,7 @@ public static KeyToValueMappingEstimator MapKeyToValue(this TransformsCatalog.Co
/// view, and the key-values will be taken from taht column. If unspecified, the key-values will be determined
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Feb 1, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

taht [](start = 56, length = 4)

that #Resolved

/// <param name="catalog">The transform catalog</param>
/// <param name="columns">The column settings.</param>
/// <param name="keyData">Specifies an ordering for the encoding. If specified, this should be a single column data view,
/// and the key-values will be taken from taht column. If unspecified, the ordering will be determined from the input data upon fitting.</param>
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Feb 1, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

taht [](start = 50, length = 4)

that #Resolved

@Ivanidzo4ka
Copy link
Contributor

Ivanidzo4ka commented Feb 1, 2019

    public static KeyToVectorMappingEstimator MapKeyToVector(this TransformsCatalog.ConversionTransforms catalog,

you use it in your sample, why not link it? #Resolved


Refers to: src/Microsoft.ML.Data/Transforms/ConversionsExtensionsCatalog.cs:97 in 5cc95e0. [](commit_id = 5cc95e0, deletion_comment = False)

@artidoro
Copy link
Contributor Author

artidoro commented Feb 1, 2019

    public static KeyToVectorMappingEstimator MapKeyToVector(this TransformsCatalog.ConversionTransforms catalog,

We don't use it in the KeyToValueValueToKey sample. We only use MapKeyToValue and MapValueToKey.


In reply to: 459595321 [](ancestors = 459595321)


Refers to: src/Microsoft.ML.Data/Transforms/ConversionsExtensionsCatalog.cs:97 in 5cc95e0. [](commit_id = 5cc95e0, deletion_comment = False)

@artidoro artidoro self-assigned this Feb 1, 2019
@artidoro artidoro added the API Issues pertaining the friendly API label Feb 1, 2019
Copy link
Member

@sfilipi sfilipi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@artidoro artidoro merged commit 0e80d6c into dotnet:master Feb 2, 2019
@artidoro artidoro deleted the k2vv2k branch March 13, 2019 17:56
@ghost ghost locked as resolved and limited conversation to collaborators Mar 25, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
API Issues pertaining the friendly API
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants