-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Converted WordEmbeddingsTransform into Estimator. #928
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@@ -21,10 +23,16 @@ | |||
[assembly: LoadableClass(typeof(WordEmbeddingsTransform), null, typeof(SignatureLoadDataTransform), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typeof(WordEmbeddingsTransform) [](start = 25, length = 31)
should be typeof(IDataView), typeof(WordEmbeddingTransform) #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not just IDataVIew, but also WordEmbeddingTransform
In reply to: 217929112 [](ancestors = 217929112)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for all this confusion, but I was confused as well. First type should be return type of your static Create method. And since you use IDataTransform as return type, you need to use IDataTransform
In reply to: 218249508 [](ancestors = 218249508,217929112)
|
||
return | ||
(ref VBuffer<float> dst) => | ||
if (!_parent.InputType.IsVector) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
InputType [](start = 29, length = 9)
If you use InputType only here, probably it should be part of Mapper. #Resolved
return h.Apply("Loading Model", | ||
ch => new WordEmbeddingsTransform(h, ctx, input)); | ||
ch => new WordEmbeddingsTransform(h, ctx)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ch => new WordEmbeddingsTransform(h, ctx)) [](start = 16, length = 42)
Is this recursive call? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like your comment is addressed, right?
In reply to: 218154567 [](ancestors = 218154567,217929300)
{ | ||
Contracts.CheckValue(env, nameof(env)); | ||
IHost h = env.Register(RegistrationName); | ||
h.CheckValue(ctx, nameof(ctx)); | ||
h.CheckValue(input, nameof(input)); | ||
return h.Apply("Loading Model", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return h.Apply("Loading Model", [](start = 12, length = 31)
you don't have to do this Apply magic #Resolved
: base(env, RegistrationName, Contracts.CheckRef(args, nameof(args)).Column, | ||
input, TestIsTextVector) | ||
public WordEmbeddingsTransform(IHostEnvironment env, string inputColumn, string outputColumn, | ||
PretrainedModelKind? modelKind) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PretrainedModelKind? [](start = 11, length = 20)
Nullable type looks weird here.
In argument/columns it's used to setup default value to columns where it's optional, but tend to flatten our structure with ColumnInfo class in each Transformer.
Can you also use ColumnInfo class instead of reusing Column class (Column is mostly for arguments and command line) #Resolved
} | ||
|
||
// Factory method for SignatureLoadDataTransform. | ||
public static IDataTransform Create(IHostEnvironment env, ModelLoadContext ctx, IDataView input) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
public [](start = 8, length = 6)
if you merge with master you can make them private #Resolved
_currentVocab = GetVocabularyDictionary(); | ||
_outputType = new VectorType(NumberType.R4, 3 * _currentVocab.Dimension); | ||
Metadata.Seal(); | ||
_currentVocab = GetVocabularyDictionary(linesToSkip); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GetVocabularyDictionary [](start = 28, length = 23)
Can we move file reading to mapper?
Since you use trivial estimator you basically create transformer in estimator constructor.
Which would force heavy IO processing, and I think purpose of Estimator to be something light weight. #Resolved
label: ctx.Bool.Scalar)); | ||
|
||
var est = data2.MakeNewEstimator() | ||
.Append(row => row.SentimentText_Features_TransformedText.WordEmbeddings()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SentimentText_Features_TransformedText.WordEmbeddings() [](start = 35, length = 55)
I think we need to postpone Pigsty extension.
Pigsty suppose to be something convenient, and the whole idea, hey user, run text transform, pull special settings for text transform, take special column, and then apply WordEmbeddings is something I wouldn't call convenient.
I would rather prefer to apply WordEmbeddings for just text column and do all magic inside that extension. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As we discussed today, lets keep it as it is currently implemented.
In reply to: 217930149 [](ancestors = 217930149)
for (int word = 0; word < src.Count; word++) | ||
throw Host.ExceptParam(nameof(input), | ||
"Text input given, expects a text vector"); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks wrong.
You validate parent in mapper, but throw nameof(input)
#Resolved
|
||
var info = Infos[iinfo]; | ||
if (!info.TypeSrc.IsVector) | ||
public Mapper(WordEmbeddingsTransform parent, ISchema inputSchema) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
inputSchema [](start = 66, length = 11)
I think you need to validate somewhere this schema.
You can obtain transformer, but no one guarantee you what it's gonna be applied to same data, or even data with same schema.
We do have some in our base class, but we only check columns with specific names are present, but no one checks types. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think GetOutputSchema
method in Estimator class does that. Is it not sufficient? or need to do it here has well?
In reply to: 217930424 [](ancestors = 217930424)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -442,4 +498,95 @@ private Model GetVocabularyDictionary() | |||
} | |||
} | |||
} | |||
|
|||
public sealed class WordEmbeddingsEstimator : TrivialEstimator<WordEmbeddingsTransform> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WordEmbeddingsEstimator [](start = 24, length = 23)
I'm not certain that this is a trivial estimator, since loading the model file is a fairly elaborate operation. Potentially involving a transfer of some sort of file from Azure, and the actual schema shape does not seem to depend upon this. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I got the understanding that any transform that does not require training falls into trivial estimator
category. That's where WordEmbedding
transform falls into. Why do you think its not trivial
? what other option are there for this?
In reply to: 218166505 [](ancestors = 218166505)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can always push loading of data to mapper.
In reply to: 218247817 [](ancestors = 218247817,218166505)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that pretty much whats happening right now in this transform.
In reply to: 218250223 [](ancestors = 218250223,218247817,218166505)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed it to non-trivial estimator as there is a data loading involved during training.
In reply to: 218251100 [](ancestors = 218251100,218250223,218247817,218166505)
{ | ||
if (!inputSchema.TryFindColumn(colInfo.input, out var col)) | ||
throw Host.ExceptSchemaMismatch(nameof(inputSchema), "input", colInfo.input); | ||
if (!(col.ItemType is TextType) || col.Kind != SchemaShape.Column.VectorKind.VariableVector) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
VariableVector [](start = 93, length = 14)
This is a rather stronger test than the existing TestIsTextVector
test. Variable size input vectors can be used of course, but also fixed sized vectors can be used. Unless there is a very strong argument for it, I'd rather not degrade our capability, since there's a fairly straightforward way to apply it to fixed size columns as well. #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense. I relaxed the checking to either be vector or variable length vector.
In reply to: 218167090 [](ancestors = 218167090)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
public static class WordEmbeddingsStaticExtensions | ||
{ | ||
public static Vector<float> WordEmbeddings(this Vector<string> input, WordEmbeddingsTransform.PretrainedModelKind modelKind = WordEmbeddingsTransform.PretrainedModelKind.Sswe, | ||
string customLookupTable = null) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we explain the relation between the pretrained model, and an explicitly specified model as via a custom lookup table? My expectation, which may be wrong, is that the two options are mutually exclusive, is it not so? If that is so, then we ought to have two overloads. One for the pre-trained model, the other for the explicitly specified model. #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, through the code it seems like both are mutually exclusive. I have separated out the code for both in different constructors.
In reply to: 218169157 [](ancestors = 218169157)
public static class WordEmbeddingsStaticExtensions | ||
{ | ||
public static Vector<float> WordEmbeddings(this Vector<string> input, WordEmbeddingsTransform.PretrainedModelKind modelKind = WordEmbeddingsTransform.PretrainedModelKind.Sswe, | ||
string customLookupTable = null) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As the proposed most conspicuous part of the public API, we have been tending to write clear documentation for our Pigsty extension methods. #Closed
|
||
public static class WordEmbeddingsStaticExtensions | ||
{ | ||
public static Vector<float> WordEmbeddings(this Vector<string> input, WordEmbeddingsTransform.PretrainedModelKind modelKind = WordEmbeddingsTransform.PretrainedModelKind.Sswe, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Vector [](start = 56, length = 6)
I believe this must be over VarVector
, not Vector
, is it not so? Indeed given the structure of your test, where you do call this as far as I see, I'm not sure how that test is succeeding, since I imagine it would have failed? #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It because the output tokens vector (SentimentText_Features_TransformedText
) produced by the TextFeatrizer is of type Vector
currently. I set it VarVector
and it fails there actually.
In reply to: 218171746 [](ancestors = 218171746)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah. That sounds like a mistake. The transformed tokenized text will definitely be VarVector<string>
not Vector<string>
. We ought to fix that there, and in any event not propagate the error there.
In reply to: 218183133 [](ancestors = 218183133,218171746)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed it by getting a VarVector in AssertStatic.
var data2 = dynamicData.AssertStatic(Env, ctx => (
SentimentText_Features_TransformedText: ctx.Text.VarVector,
SentimentText: ctx.Text.Scalar,
label: ctx.Bool.Scalar));
Is this is the correct behavior for this method that it does not complain about getting Vector type when underlying type is VarVector?
In reply to: 218223351 [](ancestors = 218223351,218219076,218219010,218183133,218171746)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Zeeshan, I think SentimentText_Features_TransformedText
that you're asserting from is a variable sized vector, so asserting that it is a VarVector
as you have done here seems correct to me. Is it not so? Perhaps I miss the point.
(On the other hand, we ought to have some mechanism to convert a Vector
to VarVector
, since that seems like something that might be useful sometimes.)
In reply to: 218239184 [](ancestors = 218239184,218223351,218219076,218219010,218183133,218171746)
|
||
public static class WordEmbeddingsStaticExtensions | ||
{ | ||
/// <include file='doc.xml' path='doc/members/member[@name="WordEmbeddings"]/*' /> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[](start = 12, length = 78)
While linking this is probably nice, we should probably also have parameter descriptors, should we not? #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Talking to our local expert @sfilipi is never wasted, but we could also try it out... maybe from the test assembly, hit the dots, see what shows up in intellisense. Based on my (almost certainly imperfect) understanding, I think any defined elements are just smooshed together in the output, so even while we can get the import to get the summary and remarks, any params would be rendered without complaint.
I do wonder however what stylecop will make of all that though...
In reply to: 218236334 [](ancestors = 218236334,218234747)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
{ | ||
Host.AssertValue(ctx); | ||
Host.AssertNonEmpty(Infos); | ||
Host.AssertNonEmpty(ColumnPairs); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure it make sense. #Resolved
/// <param name="env">Host Environment.</param> | ||
/// <param name="modelKind">The pretrained word embedding model.</param> | ||
/// <param name="columns">Input/Output columns.</param> | ||
public WordEmbeddingsTransform(IHostEnvironment env, PretrainedModelKind? modelKind, params ColumnInfo[] columns) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PretrainedModelKind? modelKind [](start = 61, length = 30)
can you push modelKind and customLookup table to ColumnInfo class? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not I don't think so because modelKind and customLookupTable are same for all the columns and only needed once.
In reply to: 218250633 [](ancestors = 218250633)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
} | ||
|
||
public static WordEmbeddingsTransform Create(IHostEnvironment env, ModelLoadContext ctx, IDataView input) | ||
public static WordEmbeddingsTransform Create(IHostEnvironment env, ModelLoadContext ctx) | ||
{ | ||
Contracts.CheckValue(env, nameof(env)); | ||
IHost h = env.Register(RegistrationName); | ||
h.CheckValue(ctx, nameof(ctx)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
h.CheckValue(ctx, nameof(ctx)); [](start = 12, length = 31)
I think we have tendency to validate version info for model as well
ctx.CheckAtModel(GetVersionInfo());
Dunno why it's not present here. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, model checking is not there before I started converting into estimator. I will add.
In reply to: 218251360 [](ancestors = 218251360)
/// Instantiates <see cref="WordEmbeddingsTransform"/> using the custom word embedding model by loading it from the file specified by the <paramref name="customLookupTable"/>. | ||
/// </summary> | ||
/// <param name="env">Host Environment.</param> | ||
/// <param name="customLookupTable">Filename for custom word embedding model.</param> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// Filename for custom word embedding model. [](start = 7, length = 86)
This is lookup table, but it's actually a file.
Can we give it proper name? We should leave argument class untouched, but here I would prefer to have something meaningful. #Resolved
|
||
_modelKind = modelKind; | ||
_modelFileNameWithPath = EnsureModelFile(env, out _linesToSkip, (PretrainedModelKind)_modelKind); | ||
Host.CheckNonWhiteSpace(_modelFileNameWithPath, nameof(_modelFileNameWithPath)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Host.CheckNonWhiteSpace(_modelFileNameWithPath, nameof(_modelFileNameWithPath)); [](start = 12, length = 80)
why this is not responsibility of EnsureModelFile? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looked at the code inside EnsureModelFile and found out that if the call is successful it will always have correct FileName otherwise throw error. I am going to remove this check.
In reply to: 218253058 [](ancestors = 218253058)
|
||
_modelKind = null; | ||
_modelFileNameWithPath = customLookupTable; | ||
Host.CheckNonWhiteSpace(_modelFileNameWithPath, nameof(_modelFileNameWithPath)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nameof(_modelFileNameWithPath) [](start = 60, length = 30)
nameof(customLookupTable), user don't know about your internal variables. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be a check on customLookupTable
rather than _modelFileNameWithPath. I will fix.
In reply to: 218253183 [](ancestors = 218253183)
=> Create(env, ctx).MakeDataTransform(input); | ||
|
||
// Factory method for SignatureLoadRowMapper. | ||
public static IRowMapper Create(IHostEnvironment env, ModelLoadContext ctx, ISchema inputSchema) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
public [](start = 8, length = 6)
also can be private #Resolved
} | ||
return GetGetterVec(ch, input, iinfo); | ||
} | ||
Contracts.CheckValue(inputSchema, nameof(inputSchema)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Contracts [](start = 16, length = 9)
Host which you create in base constuctor #Resolved
{ | ||
_inputTypes[i] = inputSchema.GetColumnType(ColMapNewToOld[i]); | ||
if (!_inputTypes[i].IsVector || !_inputTypes[i].ItemType.IsText) | ||
throw Host.ExceptSchemaMismatch(nameof(inputSchema), "input", _parent.ColumnPairs[i].input, new VectorType(TextType.Instance).ToString(), _inputTypes[i].ToString()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is basically _parent.CheckInputColumn #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
at least it should be, and you have slightly different implementations, which I would suggest to move into one.
In reply to: 218255515 [](ancestors = 218255515)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually this is the same check. I am wondering if we need to do it here as well because _parent.CheckInputColumn does it already?
In reply to: 218255515 [](ancestors = 218255515)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CheckInputColumn got called in ITransformer.GetOutputSchema, and this got called during ITransformer.Transform
In reply to: 218257338 [](ancestors = 218257338,218255515)
for (int word = 0; word < src.Count; word++) | ||
throw Host.Except(_parent.ColumnPairs[iinfo].input, | ||
"Text input given, expects a text vector"); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why you run this check here? why not in constructor? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think its not need here anymore as constructor is checking it already.
In reply to: 218256497 [](ancestors = 218256497)
{ | ||
} | ||
|
||
public WordEmbeddingsEstimator(IHostEnvironment env, WordEmbeddingsTransform transformer) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
public [](start = 8, length = 6)
this can be private. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR fixes #927