snapping featurizeText to the template #3438

sfilipi · 2019-04-19T17:47:38Z

Towards #3204. Snapping featurizeText to the template

codecov · 2019-04-19T18:24:47Z

Codecov Report

Merging #3438 into master will increase coverage by <.01%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #3438      +/-   ##
==========================================
+ Coverage   72.72%   72.73%   +<.01%     
==========================================
  Files         807      807              
  Lines      145206   145206              
  Branches    16230    16230              
==========================================
+ Hits       105608   105615       +7     
+ Misses      35179    35174       -5     
+ Partials     4419     4417       -2

Flag	Coverage Δ
#Debug	`72.73% <ø> (ø)`	⬆️
#production	`68.27% <ø> (ø)`	⬆️
#test	`88.98% <ø> (ø)`	⬆️

Impacted Files	Coverage Δ
src/Microsoft.ML.Transforms/Text/TextCatalog.cs	`41.66% <ø> (ø)`	⬆️
...oft.ML.Transforms/Text/TextFeaturizingEstimator.cs	`90.57% <ø> (ø)`	⬆️
src/Microsoft.ML.Transforms/NormalizerCatalog.cs	`84.78% <0%> (ø)`	⬆️
src/Microsoft.ML.Transforms/Text/LdaTransform.cs	`89.89% <0%> (+0.62%)`	⬆️
src/Microsoft.ML.Maml/MAML.cs	`26.21% <0%> (+1.45%)`	⬆️

artidoro · 2019-04-19T18:26:40Z

src/Microsoft.ML.Transforms/Text/TextCatalog.cs

@@ -17,11 +17,14 @@ namespace Microsoft.ML
    public static class TextCatalog
    {
        /// <summary>
-        /// Transform a text column into featurized float array that represents counts of ngrams and char-grams.
+        /// Create a <see cref="TextFeaturizingEstimator"/>, which transforms a text column into featurized float array that represents normalized counts of ngrams and char-grams.


float array [](start = 108, length = 11)

I think it would be better to avoid float array?
Maybe numeric vector column? Here and below #Resolved

I put System.Single, since numeric implies int, uint etc.

In reply to: 277057968 [](ancestors = 277057968)

natke · 2019-04-19T18:26:27Z

src/Microsoft.ML.Transforms/Text/TextFeaturizingEstimator.cs

+    /// |  |  |
+    /// | -- | -- |
+    /// | Does this estimator need to look at the data to train its parameters? | Yes. |
+    /// | Input column data type | [text](xref:System.ReadOnlyMemory<char[]>) |


Does this render correctly? #Resolved

I thought it was: xref:System.ReadOnlyMemory{System.Char}
As per Shahab's message in Teams.

In reply to: 277057901 [](ancestors = 277057901)

Thanks for the catch!
#Resolved

Actually, I think we're doing TextDataViewType.

In reply to: 277110296 [](ancestors = 277110296)

I am putting the following in there:
Scream now, or ... :)

| Input column data type | [text](xref:Microsoft.ML.Data.TextDataViewType) |

In reply to: 277111141 [](ancestors = 277111141,277110296)

Checked with Shahab. The above is correct.

In reply to: 277111655 [](ancestors = 277111655,277111141,277110296)

natke · 2019-04-19T18:28:19Z

src/Microsoft.ML.Transforms/Text/TextFeaturizingEstimator.cs

+    /// * [Tokenzation](https://en.wikipedia.org/wiki/Lexical_analysis#Tokenization)
+    /// * [Text normalization](https://en.wikipedia.org/wiki/Text_normalization)
+    /// * [Predefined and custom stopwords removal](https://en.wikipedia.org/wiki/Stop_words)
+    /// * [Word-based or character-based Ngram and SkipGram extraction](https://en.wikipedia.org/wiki/N-gram)


Does it actually produce skip-grams? #Resolved

Let me err on the safe side and remove it.

In reply to: 277058416 [](ancestors = 277058416)

It does. It's one of the advanced options in WordBagEstimator.Options, which can be set through TextFeaturizingEstimator.Options. The option is SkipLength.

In reply to: 277111709 [](ancestors = 277111709,277058416)

It would be great if you could add it back!

In reply to: 277113837 [](ancestors = 277113837,277111709,277058416)
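
For readers unfamiliar with the term discussed in this thread: a skip-gram is an n-gram whose tokens need not be adjacent. The sketch below is a generic illustration in plain C#, not ML.NET's implementation. The `SkipBigrams` helper and its `maxSkip` parameter are hypothetical names, and whether ML.NET's `SkipLength` counts skipped tokens in exactly this way is an assumption.

```csharp
using System;
using System.Collections.Generic;

public static class SkipGramSketch
{
    // Hypothetical helper: returns word pairs (2-grams) in which up to
    // maxSkip tokens may be skipped between the first and second word.
    // maxSkip = 0 yields ordinary adjacent bigrams.
    public static List<string> SkipBigrams(string[] tokens, int maxSkip)
    {
        var result = new List<string>();
        for (int i = 0; i < tokens.Length; i++)
        {
            for (int skip = 0; skip <= maxSkip; skip++)
            {
                int j = i + 1 + skip;
                if (j >= tokens.Length)
                    break; // no token left at this distance
                result.Add(tokens[i] + "|" + tokens[j]);
            }
        }
        return result;
    }

    public static void Main()
    {
        string[] tokens = { "the", "quick", "brown", "fox" };
        foreach (string gram in SkipBigrams(tokens, 1))
            Console.WriteLine(gram);
    }
}
```

With `maxSkip = 0` the helper returns only the adjacent bigrams, so allowing skips can only grow the vocabulary, which in turn affects the number of features.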

artidoro · 2019-04-19T18:29:48Z

src/Microsoft.ML.Transforms/Text/TextFeaturizingEstimator.cs

-    // integer index mapping through hashing) as an option.
-    /// <include file='doc.xml' path='doc/members/member[@name="TextFeaturizingEstimator "]/*' />
+    /// <summary>
+    ///  A transform that turns a collection of text documents into numerical feature vectors.


A transform [](start = 9, length = 11)

I would avoid starting the description of an estimator with "A transform".

How about starting with "Featurizes text data by ..."? #Resolved

artidoro · 2019-04-19T18:31:33Z

src/Microsoft.ML.Transforms/Text/TextFeaturizingEstimator.cs

-    /// <include file='doc.xml' path='doc/members/member[@name="TextFeaturizingEstimator "]/*' />
+    /// <summary>
+    ///  A transform that turns a collection of text documents into numerical feature vectors.
+    ///  The feature vectors are normalized counts of word and/or character ngrams (based on the options supplied) in a given tokenized text.


in a given tokenized text. [](start = 115, length = 26)

Does the input text need to be tokenized? I thought the input was just text.
Maybe removing "tokenized" would make it clearer. #Resolved

artidoro · 2019-04-19T18:36:15Z

src/Microsoft.ML.Transforms/Text/TextFeaturizingEstimator.cs

+    /// * [L-p vector normalization](xref: Microsoft.ML.Transforms.LpNormNormalizingTransformer)
+    ///
+    ///  Features are made of (word/character) n-grams/skip-grams and the number of features are equal to the vocabulary size found by analyzing the data.
+    ///     </format>


I believe you can specify the maximum number of ngrams to keep in the advanced options. #Resolved

Maybe you could start with: "By default, ...".
And afterwards say that the number of features can also be specified by selecting the maximum number of n-grams to keep in the advanced options.

In reply to: 277060289 [](ancestors = 277060289)

artidoro

Just one comment about skip-grams; otherwise it looks good!

artidoro · 2019-04-19T23:54:06Z

src/Microsoft.ML.Transforms/Text/TextFeaturizingEstimator.cs

+    /// | -- | -- |
+    /// | Does this estimator need to look at the data to train its parameters? | Yes. |
+    /// | Input column data type | [text](xref:Microsoft.ML.Data.TextDataViewType) |
+    /// | Output column data type | vector of <xref:System.Single> |


vector [](start = 36, length = 6)

Known-sized Vector #Pending

Leaving it as is, since this is an output.

In reply to: 277114222 [](ancestors = 277114222)

artidoro · 2019-04-19T23:56:21Z

src/Microsoft.ML.Transforms/Text/TextFeaturizingEstimator.cs

+    ///
+    ///  By default the features are made of (word/character) n-grams/skip-grams and the number of features are equal to the vocabulary size found by analyzing the data.
+    ///  The number of features can also be specified by selecting the maximum number of n-gram to keep in the <see cref="TextFeaturizingEstimator.Options"/>, where the estimator can be further tuned.
+    ///     </format>


[](start = 7, length = 5)

nit: A few extra spaces.

Could you add the usual See Also comment about examples?

Are we doing that? I haven't put it anywhere but LDA, where they would need to look at the previous steps.

In reply to: 277114384 [](ancestors = 277114384)

artidoro · 2019-04-19T23:59:34Z

src/Microsoft.ML.Transforms/Text/TextFeaturizingEstimator.cs

+    ///
+    ///  By default the features are made of (word/character) n-grams/skip-grams and the number of features are equal to the vocabulary size found by analyzing the data.
+    ///  The number of features can also be specified by selecting the maximum number of n-gram to keep in the <see cref="TextFeaturizingEstimator.Options"/>, where the estimator can be further tuned.
+    ///     </format>


If you think it's important you could mention that the word tokens can be output to a column.
There is a setting in the options class: OutputTokensColumnName #Resolved
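
The options raised across this review (SkipLength, the cap on the number of n-grams, and OutputTokensColumnName) could be combined roughly as follows. This is a sketch assuming the Microsoft.ML 1.x NuGet package. The column names "Text", "Tokens", and "Features" and all numeric values are illustrative, and the `MaximumNgramsCount`, `NgramLength`, and `UseAllLengths` member names are recalled from the released API rather than taken from this thread.

```csharp
using Microsoft.ML;
using Microsoft.ML.Transforms.Text;

public static class FeaturizeTextOptionsSketch
{
    // Builds a text featurizer that also emits the word tokens to a column.
    public static IEstimator<ITransformer> Build(MLContext mlContext)
    {
        var options = new TextFeaturizingEstimator.Options
        {
            // Assumed column name for the tokenized words.
            OutputTokensColumnName = "Tokens",
            WordFeatureExtractor = new WordBagEstimator.Options
            {
                NgramLength = 2,                     // unigrams and bigrams with UseAllLengths
                SkipLength = 1,                      // enables skip-grams
                UseAllLengths = true,
                MaximumNgramsCount = new[] { 10000 } // caps the number of n-grams kept
            }
        };

        // "Features" is the output column, "Text" the input column (both assumed).
        return mlContext.Transforms.Text.FeaturizeText("Features", options, "Text");
    }
}
```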

snapping featurizeText to the template

e8ff2f0

sfilipi requested review from natke and shmoradims April 19, 2019 17:47

sfilipi self-assigned this Apr 19, 2019

sfilipi added the documentation Related to documentation of ML.NET label Apr 19, 2019

sfilipi requested a review from codemzs April 19, 2019 17:48

sfilipi mentioned this pull request Apr 19, 2019

API reference - XML documentation template for transforms #3204

Closed

artidoro reviewed Apr 19, 2019


natke approved these changes Apr 19, 2019


artidoro reviewed Apr 19, 2019


review comments

a3e1cc5

artidoro approved these changes Apr 19, 2019


artidoro reviewed Apr 19, 2019


PR review comments

7bd30e0

sfilipi merged commit 1b277b5 into dotnet:master Apr 21, 2019

sfilipi deleted the featurizeTextDoc branch April 21, 2019 06:24

ghost locked as resolved and limited conversation to collaborators Mar 22, 2022
