word embedding transform #545
Conversation
Thanks @Ivanidzo4ka, can you please create an issue and explain what is missing from ML.NET? 👍
if (string.IsNullOrWhiteSpace(_modelFileNameWithPath))
{
    throw Host.Except("Model file for Word Embedding transform could not be found! " +
        @"Please copy the model file '{0}' from '\\cloudmltlc\TLC\releases\Resources\3.8\TextAnalytics\WordVectors\' " +
\cloudmltlc\TLC\releases\Resources\3.8\TextAnalytics\WordVectors'
what should this look like now?
if (_modelFileNameWithPath == null)
{
    throw Host.Except("Model file for Word Embedding transform could not be found! " +
        @"Please copy the model file '{0}' from '\\cloudmltlc\TLC\releases\Resources\3.8\TextAnalytics\WordVectors\' " +
\cloudmltlc\TLC\releases\Resources\3.8\TextAnalytics\WordVect
and here
private static Dictionary<PretrainedModelKind, string> _modelsMetaData = new Dictionary<PretrainedModelKind, string>()
{
    { PretrainedModelKind.GloVe50D, "glove.6B.50d.txt" },
GloVe50D
do we have public locations for all of this?
Do we want to be the ones storing them?
WordEmbeddings wrap different embedding models, such as GloVe. Users can specify which embedding to use.
The available options are various versions of <a href="https://nlp.stanford.edu/projects/glove/">GloVe Models</a>, <a href="https://en.wikipedia.org/wiki/FastText">FastText</a>, and <a href="http://anthology.aclweb.org/P/P14/P14-1146.pdf">Sswe</a>.
<para>
Note: As WordEmbedding requires a column with text vector, e.g. %3C'This', 'is', 'good'%3E, users need to create an input column by:
'
apostrophes need to be encoded too: ' #Resolved
</item>
</list>
In the following example, after the NGramFeaturizer, features named ngram.__ are generated. A new column named ngram_TransformedText is
also created with the text vector, similar to running .split(' '). However, due to the variable length of this column it cannot be properly
' '
encode #Resolved
<example name="WordEmbeddings">
<example>
<code language="csharp">
pipeline.Add(new WordEmbeddings(("InTextCol" , "OutTextCol")));
" [](start = 43, length = 1)
encode those all. #Resolved
remove links to internal storage
WordEmbedding. The output from WordEmbedding is named ngram_TransformedText.__
</para>
<para>
License attributes for pretrained models:
License attributes for pretrained models:
@GalOshri does this wording look OK to you?
</summary>
<remarks>
WordEmbeddings wrap different embedding models, such as GloVe. Users can specify which embedding to use.
The available options are various versions of <a href="https://nlp.stanford.edu/projects/glove/">GloVe Models</a>, <a href="https://en.wikipedia.org/wiki/FastText">FastText</a>, and <a href="http://anthology.aclweb.org/P/P14/P14-1146.pdf">Sswe</a>.
fastText
should be lower camel cased [1] #Resolved
and SSWE
is upper cased #Resolved
<para>
Note: As WordEmbedding requires a column with text vector, e.g. %3C%27This%27, %27is%27, %27good%27%3E, users need to create an input column by:
<list type="bullet">
<item><description>concatenating columns with TX type,</description></item>
Concatenating columns of single unigrams is very unlikely.
We should be recommending only to use output_tokens=True
in NGramFeaturizer()
. All of our current pre-trained models require tokens which are lowercased unigrams w/ diacritics removed (which are the defaults for NGramFeaturizer). #Resolved
Thank you for the clarification, I didn't know about this.
Will update the documentation in a later PR.
In reply to: 204193596
</list>
In the following example, after the NGramFeaturizer, features named ngram.__ are generated. A new column named ngram_TransformedText is
also created with the text vector, similar to running .split(%27 %27). However, due to the variable length of this column it cannot be properly
converted to a pandas dataframe, thus any pipelines/transforms that output this text vector column will throw errors. However, we use
"pandas dataframe" won't apply to the ML.NET #Resolved
</item>
<item>
<description>
Glove models by Stanford University, or (Jeffrey Pennington, Richard Socher, Christopher D. Manning) is licensed under <a href="https://opendatacommons.org/licenses/pddl/1.0/">PDDL</a>.
Glove
should be capitalized as GloVe
#Resolved
<item>
<description>
Glove models by Stanford University, or (Jeffrey Pennington, Richard Socher, Christopher D. Manning) is licensed under <a href="https://opendatacommons.org/licenses/pddl/1.0/">PDDL</a>.
More information can be found <a href="https://nlp.stanford.edu/projects/glove/">here</a>.
Linking to their repo would also be nice: https://github.com/stanfordnlp/GloVe #Resolved
</item>
<item>
<description>
Glove models by Stanford University, or (Jeffrey Pennington, Richard Socher, Christopher D. Manning) is licensed under <a href="https://opendatacommons.org/licenses/pddl/1.0/">PDDL</a>.
Recommend adding, Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. [GloVe: Global Vectors for Word Representation](https://nlp.stanford.edu/pubs/glove.pdf).
, which is their asked for citation format, as per: https://nlp.stanford.edu/projects/glove/
#Resolved
@dotnet-bot test OSX10.13 Release
@dotnet-bot test OSX10.13 Debug
int deno = 0;
srcGetter(ref src);
var values = dst.Values;
Utils.EnsureSize(ref values, 3 * dimension, keepOld: false);
With keepOld: false, the Utils.EnsureSize() function always allocates a new array. Would it be faster to perform the size check and just zero the existing array?
array = new T[newSize];
I'm actually not sure why we allocate a new array all the time (considering that we clear it immediately after), so I'd rather remove this "keepOld".
In reply to: 204564374
Reactivating. I read the implementation and documentation of EnsureSize a bit more carefully. The keepOld: false certainly does not always allocate a new array. The difference is, in the situation where it is necessary to resize the array, whether that is created through Array.Resize or new... the latter is faster if we can get away with it, which we certainly can here.
In reply to: 204578950
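For readers following the thread above, the distinction can be sketched like this. This is a simplified stand-in, not ML.NET's actual Utils.EnsureSize implementation, whose exact signature may differ:

```csharp
// Sketch of the resize semantics under discussion: with keepOld: false the
// helper only allocates when the array is too small, and when it does
// allocate it skips the element copy that Array.Resize would perform.
static void EnsureSize<T>(ref T[] array, int min, bool keepOld)
{
    // Already big enough: nothing to do, no allocation at all.
    if (array != null && array.Length >= min)
        return;

    if (keepOld && array != null)
        Array.Resize(ref array, min); // grows and copies existing elements
    else
        array = new T[min];           // cheaper: fresh zero-initialized array, no copy
}
```

Since the caller here overwrites the buffer immediately after the call, the copy performed by Array.Resize would be wasted work, which is why keepOld: false is the faster choice in this code path.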
for (int i = 0; i < dimension; i++)
{
    float currentTerm = wordVector[i];
    if (values[i] > currentTerm)
Can you point out the code fix for #545 (review)? I'm not seeing it. #Resolved
I removed Array.Clear and replaced it with setting MaxValue, 0, MinValue.
In reply to: 204566178
Thanks. I see it now. #Resolved
int offset = 2 * dimension;
for (int i = 0; i < dimension; i++)
{
    values[i] = int.MaxValue;
The array is of type float, so float.MaxValue would be better #Resolved
LGTM
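The fix discussed above replaces the Array.Clear call with explicit initialization of the three pooling segments. A hedged sketch of what that initialization looks like, assuming (as the surrounding diff suggests) that the 3 * dimension buffer is laid out as min, then sum-for-average, then max:

```csharp
// Illustrative sketch only, not the exact ML.NET code: seed each pooling
// accumulator with its identity value so the per-word update loops work.
for (int i = 0; i < dimension; i++)
{
    values[i] = float.MaxValue;                 // min pooling: any value is smaller
    values[dimension + i] = 0f;                 // sum, later divided by count for average
    values[2 * dimension + i] = float.MinValue; // max pooling: any value is larger
}
```

This is also why the reviewer's float.MaxValue comment matters: writing int.MaxValue into a float array compiles (via implicit conversion) but is misleading and loses the intent of "largest representable float".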
<para>
Note: As WordEmbedding requires a column with text vector, e.g. %3C%27this%27, %27is%27, %27good%27%3E, users need to create an input column by
using the output_tokens=True for TextTransform to convert a column with sentences like "This is good" into %3C%27this%27, %27is%27, %27good%27 %3E.
The column for the output token column is renamed original column with a prefix of %27_TranformedText%27.
We can wordsmith this a bit:
The suffix of %27_TransformedText%27 is added to the original column name to create the output token column. For instance if the input column is %27body%27, the output tokens column is named %27body_TransformedText%27.
#Resolved
if (model == null)
    model = new Model(dimension);
if (model.Dimension != dimension)
    ch.Warning($"Dimension mismatch while reading model file: '{_modelFileNameWithPath}', line number 1, expected dimension = {model.Dimension}, received dimension = {dimension}");
We can remove this warning. The purpose of this block of code is to allow the 1st line to be of different length (and ignored if so). Hence the warning is superfluous. Currently this warning is displayed for all fastText models, even though the model is read correctly (by ignoring the 1st line).
Background: In fastText models, the 1st line is: . Other word embedding models don't have a header line. #Resolved
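The header-tolerant reading described above could look roughly like the following. The names and structure here are illustrative, not the actual ML.NET parsing code; the only grounded facts are that fastText files carry a header line and GloVe-style files do not:

```csharp
// Illustrative sketch: fastText .vec files begin with a short header line
// (vocabulary size and dimension), while GloVe-style files start directly
// with "<word> <v1> ... <vN>" vector lines.
static IEnumerable<string> ReadVectorLines(string path, int dimension)
{
    using (var reader = new StreamReader(path))
    {
        string line = reader.ReadLine();
        // A real vector line has 1 + dimension tokens; a fastText header has far fewer.
        if (line != null && line.Split(' ').Length != 1 + dimension)
            line = reader.ReadLine(); // silently skip the header, no warning needed

        while (line != null)
        {
            yield return line;
            line = reader.ReadLine();
        }
    }
}
```

Under this scheme the first line is either consumed as a vector or skipped as a header, so the dimension-mismatch warning never fires for a correctly read fastText model, which is the reviewer's point.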
var name = Path.GetFileName(errorResult.FileName);
throw ch.Except($"{errorMessage}\nModel file for Word Embedding transform could not be found! " +
    $@"Please copy the model file '{name}' from '{url}' to '{directory}'.");
}
The user story here doesn't seem great. As in, I'm not sure this transform is actually usable at all. This ResourceManagerUtils points people to the problematically named "https://aka.ms/tlc-resources/". From an outside user's perspective, does that make it useless?
This transform is clearly written with the expectation of a console application, not a library. If it were a library, we might expect these sorts of "optional things" would be brought in via NuGet dependencies... that is, if you want to use this or that, you subscribe to the appropriate NuGet, then in the Arguments of this thing assign the relevant resource as an actual object published by that NuGet. (It would be some variety of IComponentFactory.)
I feel like this whole approach needs some deeper thought.
Oh boy. That's pretty big. Hmmm... hmmm... this is pretty complicated then. All right let's punt on that for now.
Thank you @Ivanidzo4ka ! But seriously, could you create an issue?
Introduce word embedding transform
I heard word embedding can be a nice thing for text classification.
(edited by @justinormont to fix model type names)
closes #615