word embedding transform #545
Conversation
Thanks @Ivanidzo4ka, can you please create an issue and explain what is missing from ML.NET? 👍
if (string.IsNullOrWhiteSpace(_modelFileNameWithPath))
{
    throw Host.Except("Model file for Word Embedding transform could not be found! " +
        @"Please copy the model file '{0}' from '\\cloudmltlc\TLC\releases\Resources\3.8\TextAnalytics\WordVectors\' " +
\cloudmltlc\TLC\releases\Resources\3.8\TextAnalytics\WordVectors'
what should this look like now?
if (_modelFileNameWithPath == null)
{
    throw Host.Except("Model file for Word Embedding transform could not be found! " +
        @"Please copy the model file '{0}' from '\\cloudmltlc\TLC\releases\Resources\3.8\TextAnalytics\WordVectors\' " +
\cloudmltlc\TLC\releases\Resources\3.8\TextAnalytics\WordVect
and here
private static Dictionary<PretrainedModelKind, string> _modelsMetaData = new Dictionary<PretrainedModelKind, string>()
{
    { PretrainedModelKind.GloVe50D, "glove.6B.50d.txt" },
GloVe50D
do we have public locations for all of this?
Do we want to be the ones storing them?
WordEmbeddings wrap different embedding models, such as GloVe. Users can specify which embedding to use.
The available options are various versions of <a href="https://nlp.stanford.edu/projects/glove/">GloVe Models</a>, <a href="https://en.wikipedia.org/wiki/FastText">FastText</a>, and <a href="http://anthology.aclweb.org/P/P14/P14-1146.pdf">Sswe</a>.
<para>
Note: As WordEmbedding requires a column with text vector, e.g. %3C'This', 'is', 'good'%3E, users need to create an input column by:
'
apostrophes need to be encoded too: ' #Resolved
</item>
</list>
In the following example, after the NGramFeaturizer, features named ngram.__ are generated. A new column named ngram_TransformedText is
also created with the text vector, similar to running .split(' '). However, due to the variable length of this column it cannot be properly
' '
encode #Resolved
<example name="WordEmbeddings">
<example>
<code language="csharp">
pipeline.Add(new WordEmbeddings(("InTextCol" , "OutTextCol")));
" [](start = 43, length = 1)
encode those all. #Resolved
remove links to internal storage
WordEmbedding. The output from WordEmbedding is named ngram_TransformedText.__
</para>
<para>
License attributes for pretrained models:
License attributes for pretrained models:
@GalOshri does this wording look OK to you?
</summary>
<remarks>
WordEmbeddings wrap different embedding models, such as GloVe. Users can specify which embedding to use.
The available options are various versions of <a href="https://nlp.stanford.edu/projects/glove/">GloVe Models</a>, <a href="https://en.wikipedia.org/wiki/FastText">FastText</a>, and <a href="http://anthology.aclweb.org/P/P14/P14-1146.pdf">Sswe</a>.
fastText
should be lower camel cased [1] #Resolved
and SSWE
is upper cased #Resolved
<para>
Note: As WordEmbedding requires a column with text vector, e.g. %3C%27This%27, %27is%27, %27good%27%3E, users need to create an input column by:
<list type="bullet">
<item><description>concatenating columns with TX type,</description></item>
Concatenating columns of single unigrams is very unlikely.
We should be recommending only to use output_tokens=True
in NGramFeaturizer()
. All of our current pre-trained models require tokens which are lowercased unigrams w/ diacritics removed (which are the defaults for NGramFeaturizer). #Resolved
Thank you for the clarification, I didn't know about this.
Will update the documentation in a later PR.
In reply to: 204193596
</list>
In the following example, after the NGramFeaturizer, features named ngram.__ are generated. A new column named ngram_TransformedText is
also created with the text vector, similar to running .split(%27 %27). However, due to the variable length of this column it cannot be properly
converted to a pandas dataframe, thus any pipelines/transforms that output this text vector column will throw errors. However, we use
"pandas dataframe" won't apply to the ML.NET #Resolved
</item>
<item>
<description>
Glove models by Stanford University, or (Jeffrey Pennington, Richard Socher, Christopher D. Manning) is licensed under <a href="https://opendatacommons.org/licenses/pddl/1.0/">PDDL</a>.
Glove
should be capitalized as GloVe
#Resolved
<item>
<description>
Glove models by Stanford University, or (Jeffrey Pennington, Richard Socher, Christopher D. Manning) is licensed under <a href="https://opendatacommons.org/licenses/pddl/1.0/">PDDL</a>.
More information can be found <a href="https://nlp.stanford.edu/projects/glove/">here</a>.
Linking to their repo would also be nice: https://github.com/stanfordnlp/GloVe #Resolved
</item>
<item>
<description>
Glove models by Stanford University, or (Jeffrey Pennington, Richard Socher, Christopher D. Manning) is licensed under <a href="https://opendatacommons.org/licenses/pddl/1.0/">PDDL</a>.
Recommend adding, Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. [GloVe: Global Vectors for Word Representation](https://nlp.stanford.edu/pubs/glove.pdf).
, which is their asked for citation format, as per: https://nlp.stanford.edu/projects/glove/
#Resolved
@dotnet-bot test OSX10.13 Release
@dotnet-bot test OSX10.13 Debug
int deno = 0;
srcGetter(ref src);
var values = dst.Values;
Utils.EnsureSize(ref values, 3 * dimension, keepOld: false);
With keepOld: false, the Utils.EnsureSize() function always allocates a new array. Would it be faster to perform the size check and just zero the existing array?
array = new T[newSize];
I'm actually not sure why we allocate a new array all the time (considering that we clear it immediately after), so I'd rather remove this "keepOld".
In reply to: 204564374
Reactivating. I read the implementation and documentation of EnsureSize a bit more carefully. The keepOld: false certainly does not always allocate a new array. The difference is, in the situation where it is necessary to resize the array, whether that is created through Array.Resize or new... the latter is faster if we can get away with it, which we certainly can here.
In reply to: 204578950
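For readers following the thread above, the distinction can be sketched like this. This is a simplified stand-in, not ML.NET's actual Utils.EnsureSize implementation, whose exact signature may differ:

```csharp
// Sketch of the resize semantics under discussion: with keepOld: false the
// helper only allocates when the array is too small, and when it does
// allocate it skips the element copy that Array.Resize would perform.
static void EnsureSize<T>(ref T[] array, int min, bool keepOld)
{
    // Already big enough: nothing to do, no allocation at all.
    if (array != null && array.Length >= min)
        return;

    if (keepOld && array != null)
        Array.Resize(ref array, min); // grows and copies existing elements
    else
        array = new T[min];           // cheaper: fresh zero-initialized array, no copy
}
```

Since the caller here overwrites the buffer immediately after the call, the copy performed by Array.Resize would be wasted work, which is why keepOld: false is the faster choice in this code path.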
for (int i = 0; i < dimension; i++)
{
    float currentTerm = wordVector[i];
    if (values[i] > currentTerm)
Can you point out the code fix for #545 (review)? I'm not seeing it. #Resolved
I removed Array.Clear and replaced it with setting MaxValue, 0, MinValue.
In reply to: 204566178
Thanks. I see it now. #Resolved
int offset = 2 * dimension;
for (int i = 0; i < dimension; i++)
{
    values[i] = int.MaxValue;
The array is of type float, so float.MaxValue would be better #Resolved
LGTM
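The fix discussed above replaces the Array.Clear call with explicit initialization of the three pooling segments. A hedged sketch of what that initialization looks like, assuming (as the surrounding diff suggests) that the 3 * dimension buffer is laid out as min, then sum-for-average, then max:

```csharp
// Illustrative sketch only, not the exact ML.NET code: seed each pooling
// accumulator with its identity value so the per-word update loops work.
for (int i = 0; i < dimension; i++)
{
    values[i] = float.MaxValue;                 // min pooling: any value is smaller
    values[dimension + i] = 0f;                 // sum, later divided by count for average
    values[2 * dimension + i] = float.MinValue; // max pooling: any value is larger
}
```

This is also why the reviewer's float.MaxValue comment matters: writing int.MaxValue into a float array compiles (via implicit conversion) but is misleading and loses the intent of "largest representable float".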
<para>
Note: As WordEmbedding requires a column with text vector, e.g. %3C%27this%27, %27is%27, %27good%27%3E, users need to create an input column by
using the output_tokens=True for TextTransform to convert a column with sentences like "This is good" into %3C%27this%27, %27is%27, %27good%27 %3E.
The column for the output token column is renamed original column with a prefix of %27_TranformedText%27.
We can wordsmith this a bit:
The suffix of %27_TransformedText%27 is added to the original column name to create the output token column. For instance if the input column is %27body%27, the output tokens column is named %27body_TransformedText%27.
#Resolved
if (model == null)
    model = new Model(dimension);
if (model.Dimension != dimension)
    ch.Warning($"Dimension mismatch while reading model file: '{_modelFileNameWithPath}', line number 1, expected dimension = {model.Dimension}, received dimension = {dimension}");
We can remove this warning. The purpose of this block of code is to allow the 1st line to be of different length (and ignored if so). Hence the warning is superfluous. Currently this warning is displayed for all fastText models, even though the model is read correctly (by ignoring the 1st line).
Background: In fastText models, the 1st line is: . Other word embedding models don't have a header line. #Resolved
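The header-tolerant reading described above could look roughly like the following. The names and structure here are illustrative, not the actual ML.NET parsing code; the only grounded facts are that fastText files carry a header line and GloVe-style files do not:

```csharp
// Illustrative sketch: fastText .vec files begin with a short header line
// (vocabulary size and dimension), while GloVe-style files start directly
// with "<word> <v1> ... <vN>" vector lines.
static IEnumerable<string> ReadVectorLines(string path, int dimension)
{
    using (var reader = new StreamReader(path))
    {
        string line = reader.ReadLine();
        // A real vector line has 1 + dimension tokens; a fastText header has far fewer.
        if (line != null && line.Split(' ').Length != 1 + dimension)
            line = reader.ReadLine(); // silently skip the header, no warning needed

        while (line != null)
        {
            yield return line;
            line = reader.ReadLine();
        }
    }
}
```

Under this scheme the first line is either consumed as a vector or skipped as a header, so the dimension-mismatch warning never fires for a correctly read fastText model, which is the reviewer's point.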
var name = Path.GetFileName(errorResult.FileName);
throw ch.Except($"{errorMessage}\nModel file for Word Embedding transform could not be found! " +
    $@"Please copy the model file '{name}' from '{url}' to '{directory}'.");
}
The user story here doesn't seem great. As in, I'm not sure this transform is actually usable at all. This ResourceManagerUtils points people to the problematically named "https://aka.ms/tlc-resources/". From an outside user's perspective, does that make it useless?
This transform is clearly written with the expectation of a console application, not a library. If it were a library, we might expect these sorts of "optional things" would be brought in via NuGet dependencies... that is, if you want to use this or that, you subscribe to the appropriate NuGet, then in the Arguments of this thing assign the relevant resource as an actual object published by that NuGet. (It would be some variety of IComponentFactory.)
I feel like this whole approach needs some deeper thought.
Oh boy. That's pretty big. Hmmm... hmmm... this is pretty complicated then. All right let's punt on that for now.
Thank you @Ivanidzo4ka ! But seriously, could you create an issue?
Introduce word embedding transform
I heard word embedding can be a nice thing for text classification.
(edited by @justinormont to fix model type names)
closes #615