-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Creation of components through MLContext and cleanup (KeyToValue, ValueToKey, OneHotEncoding) #2340
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## master #2340 +/- ##
=========================================
Coverage ? 71.24%
=========================================
Files ? 783
Lines ? 140733
Branches ? 16086
=========================================
Hits ? 100261
Misses ? 36017
Partials ? 4455
|
@@ -295,12 +248,12 @@ private ColInfo[] CreateInfos(Schema inputSchema) | |||
} | |||
|
|||
internal ValueToKeyMappingTransformer(IHostEnvironment env, IDataView input, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ValueToKeyMappingTransformer [](start = 17, length = 28)
do we want to keep the ctors of the transformers public, symmetrically with the model parameters (former predictors)? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might be a bad example, since we've determined after some time that having public constructors for model parameters is pointless (see e.g., #1866 and leadup discussion to this, also cc @eerhardt, @glebuk). That said, in favor of your point, direct construction of transformers is potentially useful -- unlike direct creation of model parameters, with which by itself literally nothing can be done.
However, unless we have a solid immediate scenario that literally cannot be accommodated using the usual MLContext
pathway (not the case here), we should not do it. I sort of like APIs with one way for a user to do things, not multiple, at least for v1. (And even if we did have such a scenario, considering the timeframes, we probably wouldn't want to do it anyway.) #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should couple public-ness of transformers with model params. Those are different things. As Rogan suggested in #1886, it can be valuable to have public model ctors for a variety of scenarios. But that is a set of different reason vs transformers. However, in case of transformers, I agree, if there is a way to create them with MLContext, there is no burning reason to have a public ctor. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Background:
- Trainable transformers can only be instantiated through estimators (using .Fit() )
- Non trainable transforms can be instantiated directly (through their constructor)
- We only create estimators through MLContext, not transformers
I think we should keep the constructors of the non trainable transformers public, while internalizing the constructors of the estimators. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@artidoro . Perhaps we should also make the constructors for non-trainable transforms internal
.
I believe @TomFinley and @glebuk also mention the same in their comments.
In reply to: 252782663 [](ancestors = 252782663)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After discussing with Tom and Gleb, I think the decision is to internalize constructors for all transformers. They will only be instantiated through their respective estimators.
The question remains on how to make it look less silly when you have a non trainable transformer that needs to be instantiated using .Fit() on some data. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -102,7 +103,7 @@ public KeyToValueMappingTransformer(IHostEnvironment env, params (string outputC | |||
/// Factory method for SignatureDataTransform. | |||
/// </summary> | |||
[BestFriend] | |||
internal static IDataTransform Create(IHostEnvironment env, Arguments args, IDataView input) | |||
internal static IDataTransform Create(IHostEnvironment env, Options args, IDataView input) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
args [](start = 76, length = 4)
rename to options. #Resolved
@@ -317,14 +270,14 @@ private ColInfo[] CreateInfos(Schema inputSchema) | |||
|
|||
[BestFriend] | |||
// Factory method for SignatureDataTransform. | |||
internal static IDataTransform Create(IHostEnvironment env, Arguments args, IDataView input) | |||
internal static IDataTransform Create(IHostEnvironment env, Options args, IDataView input) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
args [](start = 76, length = 4)
rename to options. #Resolved
@@ -119,7 +119,7 @@ public Arguments() | |||
|
|||
internal const string UserName = "Categorical Transform"; | |||
|
|||
internal static IDataTransform Create(IHostEnvironment env, Arguments args, IDataView input) | |||
internal static IDataTransform Create(IHostEnvironment env, Options args, IDataView input) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
args [](start = 76, length = 4)
rename to options. #Resolved
@@ -32,15 +32,15 @@ public static void KeyToValue_Term() | |||
string defaultColumnName = "DefaultKeys"; | |||
// REVIEW create through the catalog extension | |||
var default_pipeline = new WordTokenizingEstimator(ml, "Review") | |||
.Append(new ValueToKeyMappingEstimator(ml, defaultColumnName, "Review")); | |||
.Append(ml.Transforms.Conversion.MapValueToKey(defaultColumnName, "Review")); | |||
|
|||
// Another pipeline, that customizes the advanced settings of the TermEstimator. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TermEstimator [](start = 78, length = 13)
it's no longer TermEstimator #Resolved
we tend to put
on top of estimators in catalog. Can you do it as well? Refers to: docs/samples/Microsoft.ML.Samples/Dynamic/KeyToValue_Term.cs:11 in 0c415ef. [](commit_id = 0c415ef, deletion_comment = False) |
@@ -102,13 +102,13 @@ public static KeyToValueMappingEstimator MapKeyToValue(this TransformsCatalog.Co | |||
/// <param name="outputColumnName">Name of the column resulting from the transformation of <paramref name="inputColumnName"/>.</param> | |||
/// <param name="inputColumnName">Name of the column to transform. If set to <see langword="null"/>, the value of the <paramref name="outputColumnName"/> will be used as source.</param> | |||
/// <param name="maxNumTerms">Maximum number of keys to keep per column when auto-training.</param> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maxNumTerms [](start = 25, length = 11)
I know it's unrelated to your changes, but we no longer call this transform as TermTransform, can we rename parameter to maxNumberOfKeys
#Resolved
public class ColumnInfo | ||
{ | ||
public readonly string Name; | ||
public readonly string InputColumnName; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sfilipi, why it's Name and not OutputColumnName? #Resolved
/// <param name="maxNumTerms">Maximum number of terms to keep per column when auto-training.</param> | ||
/// <param name="sort">How items should be ordered when vectorized. If <see cref="SortOrder.Occurrence"/> choosen they will be in the order encountered. | ||
/// If <see cref="SortOrder.Value"/>, items are sorted according to their default comparison, for example, text sorting will be case sensitive (for example, 'A' then 'Z' then 'a').</param> | ||
/// <param name="term">List of terms.</param> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// List of terms. [](start = 12, length = 45)
Just curious in intent of your PR to hide constructors, or you also want to make it more clear to use. If later, I would put better description. Or maybe break this one into two constructors. sort and max num should be useless if you provide list of terms (can we use keys instead of terms?) #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can file an issue on improving the API for ValueToKey estimator, but the general intent of this PR is mainly to hide and rename.
I agree it needs to be improved though especially the ColumnInfo object does not look right.
In reply to: 252895393 [](ancestors = 252895393)
@@ -120,7 +120,7 @@ public static KeyToValueMappingEstimator MapKeyToValue(this TransformsCatalog.Co | |||
/// view, and the key-values will be taken from taht column. If unspecified, the key-values will be determined | |||
/// from the input data upon fitting.</param> | |||
public static ValueToKeyMappingEstimator MapValueToKey(this TransformsCatalog.ConversionTransforms catalog, | |||
ValueToKeyMappingTransformer.ColumnInfo[] columns, IDataView keyData = null) | |||
ValueToKeyMappingEstimator.ColumnInfo[] columns, IDataView keyData = null) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ValueToKeyMappingEstimator.ColumnInfo[] columns, IDataView keyData = null [](start = 12, length = 73)
I don't understand why we have this constructor.
I can have list of terms in each columnInfo, and I can specify keyData, what will happen? Is keyData get affected by maxNum and sort?
It brings so many questions. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, we should clean this one out in a separate PR. Do you want to create an issue? or should I do it?
In reply to: 252895984 [](ancestors = 252895984)
@@ -120,7 +120,7 @@ public static KeyToValueMappingEstimator MapKeyToValue(this TransformsCatalog.Co | |||
/// view, and the key-values will be taken from taht column. If unspecified, the key-values will be determined |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
taht [](start = 56, length = 4)
that #Resolved
/// <param name="catalog">The transform catalog</param> | ||
/// <param name="columns">The column settings.</param> | ||
/// <param name="keyData">Specifies an ordering for the encoding. If specified, this should be a single column data view, | ||
/// and the key-values will be taken from taht column. If unspecified, the ordering will be determined from the input data upon fitting.</param> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
taht [](start = 50, length = 4)
that #Resolved
you use it in your sample, why not link it? #Resolved Refers to: src/Microsoft.ML.Data/Transforms/ConversionsExtensionsCatalog.cs:97 in 5cc95e0. [](commit_id = 5cc95e0, deletion_comment = False) |
We don't use it in the KeyToValueValueToKey sample. We only use MapKeyToValue and MapValueToKey. In reply to: 459595321 [](ancestors = 459595321) Refers to: src/Microsoft.ML.Data/Transforms/ConversionsExtensionsCatalog.cs:97 in 5cc95e0. [](commit_id = 5cc95e0, deletion_comment = False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR is part of the work outlined in #1798, and focuses on the KeyToValue, ValueToKey, OneHotEncoding transformers/estimators:
ColumnInfo
Arguments
->Options
Options
only when they are not used by public constructor. For all other cases, retain Options as publicColumnInfo
to the estimatorsThis is meant to be a test PR. Once we agree on the format, I will be more confident in doing the remaining.