Converted listed text transforms into transformers/estimators. #953


Merged
zeahmed merged 14 commits into dotnet:master on Sep 21, 2018

Conversation

@zeahmed (Contributor) commented Sep 19, 2018

This PR fixes #950.

The following transforms were converted in this PR; a short usage sketch follows the list.

  • Stopwords Remover Transform
  • Text Normalizer Transform
  • Word Bag Transform
  • Word Hash Bag Transform
  • Ngram Transform
  • Ngram Hash Transform
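
As a rough sketch of what the converted API looks like after this change (illustrative only; `Env` and `data` stand for an environment and an IDataView created elsewhere, and the column names are taken from the tests below):

// Sketch: chain two of the converted estimators, fit them, and transform data.
var est = new WordBagEstimator(Env, "text", "bag_of_words")
    .Append(new WordHashBagEstimator(Env, "text", "bag_of_wordshash"));
var transformer = est.Fit(data);          // learns dictionaries / slot names from the data
var transformed = transformer.Transform(data);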


var est = new WordBagEstimator(Env, "text", "bag_of_words").
Append(new WordHashBagEstimator(Env, "text", "bag_of_wordshash"));
//TestEstimatorCore(est, data.AsDynamic, invalidInput: invalidData.AsDynamic);
@zeahmed (Author) commented Sep 19, 2018

//TestEstimatorCore(est, data.AsDynamic, invalidInput: invalidData.AsDynamic);

This method fails when the size of the output vector is determined by the incoming IDataView. The method calls Fit to learn the output schema, and an EmptyDataView is passed in. When fitting on EmptyDataView, the size is zero, which makes the output a variable-sized vector...:)

@Zruty0, we need to find a way to handle this case. #Resolved

@zeahmed (Author) commented Sep 19, 2018

We need to fix how the ngram output vector size is computed, i.e., ensure that the vector size is at least one at the following line.

types[iinfo] = new VectorType(NumberType.Float, _ngramMaps[iinfo].Count);
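
One minimal way to express that guard (a sketch of the idea only, not necessarily the fix that was adopted):

// Assumption: clamp the learned count so an empty fit still yields a fixed-size (length 1) vector.
types[iinfo] = new VectorType(NumberType.Float, Math.Max(1, _ngramMaps[iinfo].Count));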


In reply to: 218967284

Contributor commented:

Why not do this in this PR?


In reply to: 218989146

@zeahmed (Author) commented:

Actually, NgramTransform computes metadata based on the size of the ngrams. I need to look into this more deeply to enable it for EmptyDataView. I am thinking of opening an issue for this. What do you think?


In reply to: 219285825

@zeahmed (Author) commented:

I have created #969 for tracking this.


In reply to: 219287182

Contributor commented:

Is this an issue only with empty inputs? Is it possible to test at least the rest?


In reply to: 219315899

.Append(new TermEstimator(Env, "text", "terms"))
.Append(new NgramEstimator(Env, "terms", "ngrams"))
.Append(new NgramHashEstimator(Env, "terms", "ngramshash"));
//TestEstimatorCore(est, data.AsDynamic, invalidInput: invalidData.AsDynamic);
@zeahmed (Author) commented Sep 19, 2018

//TestEstimatorCore(est, data.AsDynamic, invalidInput: invalidData.AsDynamic);

Same; see the comment on the test above on line 143. #Resolved

@zeahmed (Author) commented:

I have created #969 for tracking this.


In reply to: 218967678

/// <param name="ordered">Whether the position of each source column should be included in the hash (when there are multiple source columns).</param>
/// <param name="invertHash">Limit the number of keys used to generate the slot name to this many. 0 means no invert hashing, -1 means no limit.</param>
/// <returns></returns>
public static Vector<float> BagofHashedWords(this Scalar<string> input,
@zeahmed (Author) commented Sep 19, 2018

BagofHashedWords

@TomFinley, @sfilipi, @Zruty0: Does this name sound good? #Resolved

Contributor commented:

I think ToBagOfHashedWords


In reply to: 218967973

@Zruty0 Zruty0 mentioned this pull request Sep 20, 2018
/// Stopword remover removes language-specific lists of stop words (most common words)
/// This is usually applied after tokenizing text, so it compares individual tokens
/// (case-insensitive comparison) to the stopwords.
/// </summary>
@zeahmed (Author) commented Sep 20, 2018

I am thinking of moving these XML comments into doc.xml. #Resolved

@sfilipi (Member) commented Sep 20, 2018

When I initially put doc.xml there, it was meant to be temporary, so that the docs could live with the runtime and the entry-point classes. The thought then was to remove them after CSharpApi goes away. I will take another survey on whether people actually prefer to have the docs. #Resolved

@zeahmed (Author) commented:

OK, then I will keep the comments as-is for now.


In reply to: 219248683

var est = data.MakeNewEstimator()
.Append(r => (
r.label,
normalized_text: r.text.Normalize(),
@Zruty0 commented Sep 20, 2018

Normalize

NormalizeText? Because we already have Normalize with a completely different meaning, operating on floats. #Closed

var est = data.MakeNewEstimator()
.Append(r => (
r.label,
bagofword: r.text.BagofWords(),
@Zruty0 commented Sep 20, 2018

BagofWords

ToBagOfWords #Closed
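
If both suggested renames were adopted, the static pipelines above would read roughly as follows (a sketch assuming the proposed NormalizeText and ToBagOfWords names):

var est = data.MakeNewEstimator()
    .Append(r => (
        r.label,
        normalized_text: r.text.NormalizeText(),
        bagofword: r.text.ToBagOfWords()));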

/// <param name="allLengths">Whether to include all ngram lengths up to <paramref name="ngramLength"/> or only <paramref name="ngramLength"/>.</param>
/// <param name="maxNumTerms">Maximum number of ngrams to store in the dictionary.</param>
/// <param name="weighting">Statistical measure used to evaluate how important a word is to a document in a corpus.</param>
/// <returns></returns>
@Zruty0 commented Sep 20, 2018

<returns></returns>

Remove the empty element. #Closed

}

/// <summary>
/// Produces a bag of counts of ngrams (sequences of consecutive words ) in a given text.
@Zruty0 commented Sep 20, 2018

text

It's not text; mention that this is typically applied to the output of tokenizers. #Resolved
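
A possible rewording of that summary along these lines (a sketch, not the wording that was actually merged):

/// <summary>
/// Produces a bag of counts of ngrams (sequences of consecutive words) in a given vector of tokens.
/// This is typically applied to the output of a tokenizer rather than to raw text.
/// </summary>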

@Zruty0 left a comment

:shipit:

}

/// <summary>
/// Produces a bag of counts of ngrams (sequences of consecutive words ) in a given tokenized text.
@Zruty0 commented Sep 20, 2018

Stray space before the closing parenthesis. #Resolved

@sfilipi (Member) commented Sep 21, 2018

    public TermEstimator(IHostEnvironment env, params TermTransform.ColumnInfo[] columns)

Internal or private? #WontFix


Refers to: src/Microsoft.ML.Data/Transforms/TermEstimator.cs:38 in 4659b80.

@@ -8,6 +8,8 @@
using Microsoft.ML.Runtime.Data;
using System;
using System.Collections.Generic;
using static Microsoft.ML.Runtime.TextAnalytics.StopWordsRemoverTransform;
@sfilipi (Member) commented Sep 21, 2018

Ctrl+R, Ctrl+G (sort the usings). #Resolved

@zeahmed (Author) commented Sep 21, 2018

This is sorted already...:)


In reply to: 219550492

@zeahmed (Author) commented Sep 21, 2018

    public TermEstimator(IHostEnvironment env, params TermTransform.ColumnInfo[] columns)

Actually, I cannot make it internal because it is being used in CategoricalTransform, which is in another assembly.


In reply to: 423589115


Refers to: src/Microsoft.ML.Data/Transforms/TermEstimator.cs:38 in 4659b80.

int skipLength = 0,
bool allLengths = true,
int maxNumTerms = 10000000,
NgramTransform.WeightingCriteria weighting = NgramTransform.WeightingCriteria.Tf) => new OutPipelineColumn(input, ngramLength, skipLength, allLengths, maxNumTerms, weighting);
@artidoro commented Sep 21, 2018

Maybe more readable if you go to a new line here? #Resolved
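
For example, the expression body could move to its own line (a formatting sketch only; behavior is unchanged):

    int skipLength = 0,
    bool allLengths = true,
    int maxNumTerms = 10000000,
    NgramTransform.WeightingCriteria weighting = NgramTransform.WeightingCriteria.Tf)
    => new OutPipelineColumn(input, ngramLength, skipLength, allLengths, maxNumTerms, weighting);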

text: ctx.LoadText(1)), hasHeader: true)
.Read(new MultiFileSource(sentimentDataPath));

var invalidData = TextLoader.CreateReader(Env, ctx => (
Contributor commented:

Why is the invalid data the same as the valid one?

@zeahmed (Author) commented:

It's not valid. Here, the text column is loaded as a float.


In reply to: 219582043
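
For context, the invalid reader loads the text column with a float loader, roughly like this (a sketch; the label column and the LoadBool/LoadFloat indices are assumptions rather than the exact test code):

var invalidData = TextLoader.CreateReader(Env, ctx => (
        label: ctx.LoadBool(0),
        text: ctx.LoadFloat(1)),   // text column read as float, which the text estimators reject
        hasHeader: true)
    .Read(new MultiFileSource(sentimentDataPath));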

@artidoro left a comment

:shipit:

@zeahmed zeahmed merged commit 9a6c384 into dotnet:master Sep 21, 2018
@zeahmed zeahmed deleted the texttransforms_2_estimators branch January 30, 2019 21:31
@ghost ghost locked as resolved and limited conversation to collaborators Mar 28, 2022
Successfully merging this pull request may close these issues.

Convert all text transforms into transformers/estimators.