
word embedding transform #545


Merged
merged 29 commits into master on Jul 31, 2018
Conversation


@Ivanidzo4ka Ivanidzo4ka commented Jul 17, 2018

I heard word embeddings can be a nice thing for text classification.

  • Create issue
  • Add license attributions for fastText files
  • Add license attributions for GloVe files

(edited by @justinormont to fix model type names)
closes #615

@shauheen (Contributor):

Thanks @Ivanidzo4ka, can you please create an issue and explain what is missing from ML.NET? 👍

if (string.IsNullOrWhiteSpace(_modelFileNameWithPath))
{
    throw Host.Except("Model file for Word Embedding transform could not be found! " +
        @"Please copy the model file '{0}' from '\\cloudmltlc\TLC\releases\Resources\3.8\TextAnalytics\WordVectors\' " +
Member:

\cloudmltlc\TLC\releases\Resources\3.8\TextAnalytics\WordVectors' [](start = 61, length = 67)

what should this look like now?

if (_modelFileNameWithPath == null)
{
    throw Host.Except("Model file for Word Embedding transform could not be found! " +
        @"Please copy the model file '{0}' from '\\cloudmltlc\TLC\releases\Resources\3.8\TextAnalytics\WordVectors\' " +
Member:

\cloudmltlc\TLC\releases\Resources\3.8\TextAnalytics\WordVect [](start = 61, length = 62)

and here


private static Dictionary<PretrainedModelKind, string> _modelsMetaData = new Dictionary<PretrainedModelKind, string>()
{
    { PretrainedModelKind.GloVe50D, "glove.6B.50d.txt" },
Member:

GloVe50D [](start = 35, length = 8)

do we have public locations for all of this?
Do we want to be the ones storing them?

Contributor Author:

we have aka.ms/tlc-resources/


In reply to: 204168101 [](ancestors = 204168101)

WordEmbeddings wrap different embedding models, such as GloVe. Users can specify which embedding to use.
The available options are various versions of <a href="https://nlp.stanford.edu/projects/glove/">GloVe Models</a>, <a href="https://en.wikipedia.org/wiki/FastText">FastText</a>, and <a href="http://anthology.aclweb.org/P/P14/P14-1146.pdf">Sswe</a>.
<para>
Note: As WordEmbedding requires a column with text vector, e.g. %3C'This', 'is', 'good'%3E, users need to create an input column by:
@sfilipi (Member) Jul 20, 2018:

' [](start = 85, length = 1)

apostrophes need to be encoded too: ' #Resolved

Contributor Author:

hope %27 will work


In reply to: 204168630 [](ancestors = 204168630)

</item>
</list>
In the following example, after the NGramFeaturizer, features named ngram.__ are generated. A new column named ngram_TransformedText is
also created with the text vector, similar as running .split(' '). However, due to the variable length of this column it cannot be properly
@sfilipi (Member) Jul 20, 2018:

' ' [](start = 71, length = 3)

encode #Resolved

@sfilipi (Member) commented Jul 20, 2018:

      pipeline.Add(new LightLda(("InTextCol" , "OutTextCol")));

bummer! #Resolved


Refers to: src/Microsoft.ML.Transforms/Text/doc.xml:182 in 68696f4. [](commit_id = 68696f4, deletion_comment = False)

<example name="WordEmbeddings">
<example>
<code language="csharp">
pipeline.Add(new WordEmbeddings(("InTextCol" , "OutTextCol")));
@sfilipi (Member) Jul 20, 2018:

" [](start = 43, length = 1)

encode those all. #Resolved

Ivan Matantsev added 2 commits July 20, 2018 14:29
remove links to internal storage
@Ivanidzo4ka Ivanidzo4ka requested a review from GalOshri July 20, 2018 23:44
WordEmbedding. The output from WordEmbedding is named ngram_TransformedText.__
</para>
<para>
License attributes for pretrained models:
Contributor Author:

License attributes for pretrained models: [](start = 9, length = 42)

@GalOshri does this wording look ok to you?

Ivan Matantsev added 2 commits July 20, 2018 16:46
</summary>
<remarks>
WordEmbeddings wrap different embedding models, such as GloVe. Users can specify which embedding to use.
The available options are various versions of <a href="https://nlp.stanford.edu/projects/glove/">GloVe Models</a>, <a href="https://en.wikipedia.org/wiki/FastText">FastText</a>, and <a href="http://anthology.aclweb.org/P/P14/P14-1146.pdf">Sswe</a>.
@justinormont (Contributor) Jul 20, 2018:

fastText should be lower camel cased [1] #Resolved

@justinormont (Contributor) Jul 21, 2018:

and SSWE is upper cased #Resolved

<para>
Note: As WordEmbedding requires a column with text vector, e.g. %3C%27This%27, %27is%27, %27good%27%3E, users need to create an input column by:
<list type="bullet">
<item><description>concatenating columns with TX type,</description></item>
@justinormont (Contributor) Jul 21, 2018:

Concatenating columns of single unigrams is very unlikely.

We should be recommending only to use output_tokens=True in NGramFeaturizer(). All of our current pre-trained models require tokens which are lowercased unigrams w/ diacritics removed (which are the defaults for NGramFeaturizer). #Resolved

Contributor Author:

Thank you for the clarification, I didn't know about this.
Will update the documentation in a later PR.


In reply to: 204193596 [](ancestors = 204193596)

</list>
In the following example, after the NGramFeaturizer, features named ngram.__ are generated. A new column named ngram_TransformedText is
also created with the text vector, similar as running .split(%27 %27). However, due to the variable length of this column it cannot be properly
converted to pandas dataframe, thus any pipelines/transforms output this text vector column will throw errors. However, we use
@justinormont (Contributor) Jul 21, 2018:

"pandas dataframe" won't apply to the ML.NET #Resolved

Contributor Author:

The curse of copy paste!


In reply to: 204193641 [](ancestors = 204193641)

</item>
<item>
<description>
Glove models by Stanford University, or (Jeffrey Pennington, Richard Socher, Christopher D. Manning) is licensed under <a href="https://opendatacommons.org/licenses/pddl/1.0/">PDDL</a>.
@justinormont (Contributor) Jul 21, 2018:

Glove should be capitalized as GloVe #Resolved

<item>
<description>
Glove models by Stanford University, or (Jeffrey Pennington, Richard Socher, Christopher D. Manning) is licensed under <a href="https://opendatacommons.org/licenses/pddl/1.0/">PDDL</a>.
More information can be found <a href="https://nlp.stanford.edu/projects/glove/">here</a>.
@justinormont (Contributor) Jul 21, 2018:

Linking to their repo would also be nice: https://github.com/stanfordnlp/GloVe #Resolved

</item>
<item>
<description>
Glove models by Stanford University, or (Jeffrey Pennington, Richard Socher, Christopher D. Manning) is licensed under <a href="https://opendatacommons.org/licenses/pddl/1.0/">PDDL</a>.
@justinormont (Contributor) Jul 21, 2018:

Recommend adding, Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. [GloVe: Global Vectors for Word Representation](https://nlp.stanford.edu/pubs/glove.pdf)., which is their asked for citation format, as per: https://nlp.stanford.edu/projects/glove/
#Resolved

@Ivanidzo4ka (Contributor Author):

@dotnet-bot test OSX10.13 Release

@Ivanidzo4ka (Contributor Author):

@dotnet-bot test OSX10.13 Debug

int deno = 0;
srcGetter(ref src);
var values = dst.Values;
Utils.EnsureSize(ref values, 3 * dimension, keepOld: false);
@justinormont (Contributor) Jul 23, 2018:

With keepOld: false, the Utils.EnsureSize() function always allocates a new array. Would it be faster to perform the size check and just zero the existing?

Contributor Author:

I'm actually not sure why we allocate a new array every time (considering that we clear it immediately after), so I'd rather remove this "keepOld".


In reply to: 204564374 [](ancestors = 204564374)

@TomFinley (Contributor) Jul 27, 2018:

Reactivating. Read the implementation or documentation of EnsureSize a bit more carefully. The keepOld: false certainly does not always allocate a new array. The difference is, in the situation where it is necessary to resize the array, whether that is done through Array.Resize or new... the latter is faster if we can get away with it, which we certainly can here.


In reply to: 204578950 [](ancestors = 204578950,204564374)
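To make the keepOld discussion concrete, here is a hedged sketch of EnsureSize-style semantics as described in the comments above (illustrative only, not the actual ML.NET `Utils.EnsureSize` source): no allocation happens at all when the array is already large enough, and keepOld only decides whether old contents are copied when a reallocation is needed.

```csharp
using System;

static class ArrayUtilsSketch
{
    // Grow `array` to at least `min` elements. Sketch of the semantics
    // discussed above; names and signature are illustrative.
    public static void EnsureSize<T>(ref T[] array, int min, bool keepOld)
    {
        if (array != null && array.Length >= min)
            return; // already large enough: no allocation, old contents untouched

        if (keepOld && array != null)
            Array.Resize(ref array, min); // allocates and copies the old contents
        else
            array = new T[min]; // allocates without the copy: faster when the caller
                                // is about to overwrite everything anyway
    }
}
```

Under this reading, keepOld: false is the cheaper option precisely because the caller reinitializes the whole buffer immediately afterward.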

for (int i = 0; i < dimension; i++)
{
    float currentTerm = wordVector[i];
    if (values[i] > currentTerm)
@justinormont (Contributor) Jul 23, 2018:

Can you point out the code fix for #545 (review)? I'm not seeing it. #Resolved

Contributor Author:

I removed Array.Clear and replaced it with setting MaxValue, 0, MinValue.


In reply to: 204566178 [](ancestors = 204566178)

@justinormont (Contributor) Jul 23, 2018:

Thanks. I see it now. #Resolved

int offset = 2 * dimension;
for (int i = 0; i < dimension; i++)
{
    values[i] = int.MaxValue;
@justinormont (Contributor) Jul 23, 2018:

Array is of type float, so float.MaxValue will be better #Resolved
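Putting the excerpts in this thread together, the transform appears to compute per-dimension min/average/max pooling over the word vectors of a sentence into one buffer of length 3 * dimension. The following is a hedged sketch of that scheme (method and parameter names are illustrative, not the actual ML.NET source); it also shows why float.MaxValue is the right sentinel for a float array, as the review comment notes, even though int.MaxValue would still compile via implicit conversion.

```csharp
using System.Collections.Generic;

static class PoolingSketch
{
    // Pool a sentence's word vectors into [min | mean | max] blocks,
    // each of length `dimension`. Illustrative sketch only.
    public static float[] PoolEmbeddings(IEnumerable<float[]> wordVectors, int dimension)
    {
        var values = new float[3 * dimension];
        int offset = 2 * dimension;

        // Seed sentinels instead of calling Array.Clear: min starts at
        // float.MaxValue, the running sum at 0, max at float.MinValue.
        for (int i = 0; i < dimension; i++)
        {
            values[i] = float.MaxValue;
            values[i + dimension] = 0;
            values[i + offset] = float.MinValue;
        }

        int count = 0;
        foreach (var wordVector in wordVectors)
        {
            count++;
            for (int i = 0; i < dimension; i++)
            {
                float currentTerm = wordVector[i];
                if (values[i] > currentTerm)
                    values[i] = currentTerm;          // min block
                values[i + dimension] += currentTerm; // sum, averaged below
                if (values[i + offset] < currentTerm)
                    values[i + offset] = currentTerm; // max block
            }
        }

        if (count > 0)
            for (int i = 0; i < dimension; i++)
                values[i + dimension] /= count; // turn the sum into the mean
        return values;
    }
}
```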

@justinormont (Contributor) left a review comment:

LGTM

<para>
Note: As WordEmbedding requires a column with text vector, e.g. %3C%27this%27, %27is%27, %27good%27%3E, users need to create an input column by
using the output_tokens=True for TextTransform to convert a column with sentences like "This is good" into %3C%27this%27, %27is%27, %27good%27 %3E.
The column for the output token column is renamed original column with a prefix of %27_TranformedText%27.
@justinormont (Contributor) Jul 24, 2018:

We can word smith this a bit..
The suffix of %27_TransformedText%27 is added to the original column name to create the output token column. For instance if the input column is %27body%27, the output tokens column is named %27body_TransformedText%27. #Resolved

if (model == null)
    model = new Model(dimension);
if (model.Dimension != dimension)
    ch.Warning($"Dimension mismatch while reading model file: '{_modelFileNameWithPath}', line number 1, expected dimension = {model.Dimension}, received dimension = {dimension}");
@justinormont (Contributor) Jul 24, 2018:

We can remove this warning. The purpose of this block of code is to allow the 1st line to be of a different length (and ignored if so). Hence the warning is superfluous. Currently this warning is displayed for all fastText models, even though the model is read correctly (by ignoring the 1st line).

Background: In fastText models, the 1st line is a header (the vocabulary size and vector dimension) rather than a word vector. Other word embedding models don't have a header line. #Resolved
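A hedged sketch of the header handling this comment asks for (illustrative names, not the actual ML.NET parser): any line whose token count does not match the inferred dimension, such as a fastText header line, is skipped silently instead of triggering a warning.

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.IO;

static class VectorFileSketch
{
    // Read "word v1 v2 ... vN" lines, silently skipping any line of the
    // wrong width (e.g. a fastText header). Illustrative sketch only.
    public static IEnumerable<KeyValuePair<string, float[]>> ReadVectors(string path)
    {
        int dimension = -1;
        foreach (var line in File.ReadLines(path))
        {
            var tokens = line.TrimEnd().Split(' ');

            // The first wide line fixes the dimension; a 2-token header
            // line never qualifies, so it is skipped without a warning.
            if (dimension < 0 && tokens.Length > 2)
                dimension = tokens.Length - 1;
            if (tokens.Length - 1 != dimension)
                continue;

            var vector = new float[dimension];
            for (int i = 0; i < dimension; i++)
                vector[i] = float.Parse(tokens[i + 1], CultureInfo.InvariantCulture);
            yield return new KeyValuePair<string, float[]>(tokens[0], vector);
        }
    }
}
```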

@Ivanidzo4ka Ivanidzo4ka requested a review from Zruty0 July 25, 2018 19:55
@justinormont (Contributor):

:shipit:

var name = Path.GetFileName(errorResult.FileName);
throw ch.Except($"{errorMessage}\nModel file for Word Embedding transform could not be found! " +
    $@"Please copy the model file '{name}' from '{url}' to '{directory}'.");
}
@TomFinley (Contributor) Jul 27, 2018:

The user story here doesn't seem great. As in, I'm not sure this transform is actually usable at all. This ResourceManagerUtils points people to the problematically named "https://aka.ms/tlc-resources/". From an outside user's perspective, does that make it useless?

This transform is clearly written with the expectation of a console application not a library. If it were a library, we might expect these sorts of "optional things" would be brought in via nuget dependencies... that is, if you want to use this or that, you subscribe to the appropriate nuget, then in the Arguments of this thing assign the relevant resource as an actual object published by that nuget. (It would be some variety of IComponentFactory.)

I feel like this whole approach needs some deeper thought.

Contributor Author:

The problematically named resource has its own issue #546.

The model files we use range from 160 MB (glove50d) to 6 GB (fastText), except SSWE, which is about 70 MB. I think that makes it impossible to fit them into any nuget due to size limitations.


In reply to: 205848439 [](ancestors = 205848439)

Contributor:

Oh boy. That's pretty big. Hmmm... hmmm... this is pretty complicated then. All right let's punt on that for now.

@TomFinley (Contributor) left a review comment:

Thank you @Ivanidzo4ka ! But seriously, could you create an issue?

@Ivanidzo4ka Ivanidzo4ka merged commit b727d10 into dotnet:master Jul 31, 2018
codemzs pushed a commit to codemzs/machinelearning that referenced this pull request Aug 1, 2018
Introduce word embedding transform
@ghost ghost locked as resolved and limited conversation to collaborators Mar 29, 2022
Successfully merging this pull request may close these issues: Word embeddings
5 participants