Update documentation for WordBag #3440

Ivanidzo4ka · 2019-04-19T19:41:57Z

towards #3204

shmoradims · 2019-04-19T20:42:14Z

src/Microsoft.ML.Transforms/Text/TextCatalog.cs

-        /// <param name="inputColumnNames">Name of the columns to transform.</param>
+        /// <remarks>
+        /// <see cref="WordHashBagEstimator"/> is different from <see cref="NgramHashingEstimator"/> in a way that <see cref="WordHashBagEstimator"/>
+        /// tokenizes text internally while <see cref="NgramHashingEstimator"/> takes tokenized text as input.


-> "in that the former .... and the latter ..."
(we don't want too many identical links)

#Resolved

shmoradims · 2019-04-19T20:45:06Z

src/Microsoft.ML.Transforms/Text/WrappedTextTransformers.cs

+    /// |  |  |
+    /// | -- | -- |
+    /// | Does this estimator need to look at the data to train its parameters? | Yes |
+    /// | Input column data type | Vector of [Text](<xref:Microsoft.ML.Data.TextDataViewType>) |


(xref:Microsoft.ML.Data.TextDataViewType) [](start = 51, length = 43)

(xref:UID) without <> #Resolved

shmoradims · 2019-04-19T20:45:22Z

src/Microsoft.ML.Transforms/Text/WrappedTextTransformers.cs

+    /// | Input column data type | Vector of [Text](<xref:Microsoft.ML.Data.TextDataViewType>) |
+    /// | Output column data type | Vector of known-size of <xref:System.Single> |
+    ///
+    /// The resulting <xref:Microsoft.ML.ITransformer/> creates a new column, named as specified in the output column name parameters, and


/ [](start = 53, length = 1)

no / #Resolved

shmoradims · 2019-04-19T20:50:47Z

src/Microsoft.ML.Transforms/Text/WrappedTextTransformers.cs

+    /// | Output column data type | Vector of known-size of <xref:System.Single> |
+    ///
+    /// The resulting <xref:Microsoft.ML.ITransformer/> creates a new column, named as specified in the output column name parameters, and
+    /// produces a vector of counts of n-grams (sequences of consecutive words of length 1-n) from a given data.


sequences of consecutive words of length 1-n [](start = 48, length = 44)

n-gram = 'sequences of n consecutive words'

details: 'sequences of n consecutive tokens' where token is word, char, etc. depending on the context.

#Resolved

I understand you want me to change this, I just don't understand your proposal

In reply to: 277088838 [](ancestors = 277088838)
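
To make the two readings in this thread concrete, here is a minimal C# sketch; it is not ML.NET's implementation. `Ngrams` follows the "sequences of n consecutive tokens" definition, while `NgramsUpTo` reproduces the "length 1-n" wording by collecting every length from 1 up to a maximum. The class name, the method names, and the `|` separator are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative only: names and the '|' separator are assumptions, not ML.NET API.
public static class NgramSketch
{
    // Every run of exactly n consecutive tokens, joined with '|'.
    public static List<string> Ngrams(IReadOnlyList<string> tokens, int n)
    {
        if (n < 1) throw new ArgumentOutOfRangeException(nameof(n));
        var result = new List<string>();
        for (int start = 0; start + n <= tokens.Count; start++)
            result.Add(string.Join("|", tokens.Skip(start).Take(n)));
        return result;
    }

    // N-grams of every length from 1 up to maxLength (the "length 1-n" reading).
    public static List<string> NgramsUpTo(IReadOnlyList<string> tokens, int maxLength)
    {
        var result = new List<string>();
        for (int n = 1; n <= maxLength; n++)
            result.AddRange(Ngrams(tokens, n));
        return result;
    }

    public static void Main()
    {
        string[] tokens = "the quick brown fox".Split(' ');
        // prints "the|quick, quick|brown, brown|fox"
        Console.WriteLine(string.Join(", ", Ngrams(tokens, 2)));
        // prints "the, quick, brown, fox, the|quick, quick|brown, brown|fox"
        Console.WriteLine(string.Join(", ", NgramsUpTo(tokens, 2)));
    }
}
```

The tokens here are words, but the same loop applies to characters or any other token type, which is the point of the "depending on the context" remark above.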

shmoradims · 2019-04-19T20:51:36Z

src/Microsoft.ML.Transforms/Text/WrappedTextTransformers.cs

@@ -185,6 +204,30 @@ public SchemaShape GetOutputSchema(SchemaShape inputSchema)
    /// Produces a bag of counts of ngrams (sequences of consecutive words of length 1-n) in a given text.


ngrams (sequences of consecutive words of length 1-n) [](start = 36, length = 53)

ditto #Resolved

shmoradims · 2019-04-19T20:53:37Z

src/Microsoft.ML.Transforms/Text/WrappedTextTransformers.cs

+    /// | Output column data type | Vector of known-size of <xref:System.Single> |
+    ///
+    /// The resulting <xref:Microsoft.ML.ITransformer/> creates a new column, named as specified in the output column name parameters, and
+    /// produces a vector of counts of n-grams (sequences of consecutive words of length 1-n) from a given data.


(sequences of consecutive words of length 1-n) [](start = 47, length = 46)

ditto #Resolved

shmoradims · 2019-04-19T20:53:48Z

src/Microsoft.ML.Transforms/Text/WrappedTextTransformers.cs

+    /// |  |  |
+    /// | -- | -- |
+    /// | Does this estimator need to look at the data to train its parameters? | Yes |
+    /// | Input column data type | Vector of [Text](<xref:Microsoft.ML.Data.TextDataViewType>) |


< [](start = 52, length = 1)

ditto #Resolved

shmoradims · 2019-04-19T20:54:10Z

src/Microsoft.ML.Transforms/Text/WrappedTextTransformers.cs

+    /// It does so by hashing each ngram and using the hash value as the index in the bag.
+    ///
+    /// <xref:Microsoft.ML.Transforms.Text.WordHashBagEstimator> is different from <xref:Microsoft.ML.Transforms.Text.NgramHashingEstimator>
+    /// in a way that the former takes tokenizes text internally while the latter takes tokenized text as input.


in a way that the former takes tokenizes text internally while the latter takes tokenized text as input. [](start = 7, length = 105)

ditto: former ... latter... #Resolved

it's already former.. latter...

In reply to: 277089489 [](ancestors = 277089489)

in that the former ...

In reply to: 277095332 [](ancestors = 277095332,277089489)

I don't get your comment

In reply to: 277113027 [](ancestors = 277113027,277095332,277089489)
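
The doc line under review says the hashed bag works "by hashing each ngram and using the hash value as the index in the bag". A minimal C# sketch of that idea follows. It assumes FNV-1a as a stand-in hash (picked only for brevity, not because ML.NET uses it) and a hypothetical `numberOfBits` argument that fixes the vector size at 2^bits; the class and method names are illustrative.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Illustrative only: FNV-1a and all names here are assumptions, not ML.NET's implementation.
public static class HashedBagSketch
{
    // 32-bit FNV-1a over the UTF-8 bytes of the n-gram.
    public static uint Fnv1a(string s)
    {
        unchecked
        {
            uint hash = 2166136261;
            foreach (byte b in Encoding.UTF8.GetBytes(s))
            {
                hash ^= b;
                hash *= 16777619;
            }
            return hash;
        }
    }

    // Counts n-grams into a fixed-size vector; the slot is the hash masked to numberOfBits.
    public static float[] HashedBag(IEnumerable<string> ngrams, int numberOfBits)
    {
        // The upper bound is a limit of this sketch, chosen to keep the array small.
        if (numberOfBits < 1 || numberOfBits > 24)
            throw new ArgumentOutOfRangeException(nameof(numberOfBits));
        var bag = new float[1 << numberOfBits];
        uint mask = (uint)bag.Length - 1;
        foreach (string ngram in ngrams)
            bag[Fnv1a(ngram) & mask] += 1f;
        return bag;
    }

    public static void Main()
    {
        float[] bag = HashedBag(new[] { "the", "fox", "the" }, 4);
        Console.WriteLine(bag.Length);                 // prints 16
        Console.WriteLine(bag[Fnv1a("the") & 15] >= 2); // prints True
    }
}
```

No dictionary is kept, so the output size is known up front, at the cost that distinct n-grams can collide in the same slot.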

codecov · 2019-04-19T22:04:59Z

Codecov Report

Merging #3440 into master will increase coverage by <.01%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #3440      +/-   ##
==========================================
+ Coverage   72.73%   72.73%   +<.01%     
==========================================
  Files         807      807              
  Lines      145206   145206              
  Branches    16230    16230              
==========================================
+ Hits       105609   105613       +4     
+ Misses      35179    35175       -4     
  Partials     4418     4418

Flag	Coverage Δ
#Debug	`72.73% <ø> (ø)`	⬆️
#production	`68.27% <ø> (ø)`	⬆️
#test	`88.98% <ø> (ø)`	⬆️

Impacted Files	Coverage Δ
src/Microsoft.ML.Transforms/Text/TextCatalog.cs	`41.66% <ø> (ø)`	⬆️
...soft.ML.Transforms/Text/NgramHashingTransformer.cs	`88.79% <ø> (ø)`	⬆️
...soft.ML.Transforms/Text/WrappedTextTransformers.cs	`93.63% <ø> (ø)`	⬆️
...ML.Transforms/Text/StopWordsRemovingTransformer.cs	`86.1% <0%> (-0.16%)`	⬇️
...soft.ML.Data/DataLoadSave/Text/TextLoaderCursor.cs	`84.9% <0%> (+0.2%)`	⬆️
src/Microsoft.ML.Transforms/Text/LdaTransform.cs	`89.89% <0%> (+0.62%)`	⬆️

singlis · 2019-04-19T23:09:33Z

src/Microsoft.ML.Transforms/Text/NgramHashingTransformer.cs

@@ -866,7 +866,7 @@ public VBuffer<ReadOnlyMemory<char>>[] SlotNamesMetadata(out VectorDataViewType[
    /// |  |  |
    /// | -- | -- |
    /// | Does this estimator need to look at the data to train its parameters? | Yes |
-    /// | Input column data type | Vector of [Key](<xref:Microsoft.ML.Data.KeyDataViewType>) |
+    /// | Input column data type | Vector of [Key](xref:Microsoft.ML.Data.KeyDataViewType) |


Key [](start = 46, length = 3)

I would just do xref:Microsoft.ML.Data.KeyDataViewType #ByDesign

I think it's desirable to keep it as Ivan did, at least I believe that's the convention for key type: name it key, and link the KeyDataViewType.

In reply to: 277109991 [](ancestors = 277109991)

artidoro · 2019-04-19T23:21:32Z

src/Microsoft.ML.Transforms/Text/TextCatalog.cs

-        /// Produces a bag of counts of ngrams (sequences of consecutive words) in <paramref name="inputColumnName"/>
-        /// and outputs bag of word vector as <paramref name="outputColumnName"/>
+        /// Create a <see cref="WordBagEstimator"/>, which takes the data from the column specified in <paramref name="inputColumnName"/>
+        /// to a new column: <paramref name="outputColumnName"/> and produces a vector of counts of n-grams.


I would suggest slightly different wording that uses 'maps' instead of 'takes' (I think this could be used below as well):

which maps the column specified in <paramref name="inputColumnName"/> to a vector of counts of n-grams in a new column named <paramref name="outputColumnName"/>. #Resolved

artidoro · 2019-04-19T23:23:11Z

src/Microsoft.ML.Transforms/Text/TextCatalog.cs

-        /// <param name="outputColumnName">Name of the column resulting from the transformation of <paramref name="inputColumnName"/>.</param>
-        /// <param name="inputColumnName">Name of the column to transform. If set to <see langword="null"/>, the value of the <paramref name="outputColumnName"/> will be used as source.</param>
+        /// <remarks>
+        /// <see cref="WordBagEstimator"/> is different from <see cref="NgramExtractingEstimator"/> in a way that the former


in a way that the former [](start = 99, length = 25)

Same comment as below:
"in that the former .... and the latter ..." #Resolved

artidoro · 2019-04-19T23:28:50Z

src/Microsoft.ML.Transforms/Text/WrappedTextTransformers.cs

+    /// It does so by building a dictionary of ngrams and using the id in the dictionary as the index in the bag.
+    ///
+    /// <xref:Microsoft.ML.Transforms.Text.WordBagEstimator> is different from <xref:Microsoft.ML.Transforms.Text.NgramExtractingEstimator>
+    /// in a way that the former takes tokenizes text internally while the latter takes tokenized text as input.


a way that [](start = 11, length = 10)

same comment as in the extension: "in that the former ... while the latter ...." #Resolved
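
For comparison with the hashed variant, the dictionary-based wording quoted above ("building a dictionary of ngrams and using the id in the dictionary as the index in the bag") can be sketched in plain C#. This is a simplified illustration under assumed names (`Fit`, `Transform`), not the ML.NET code path; ids are assigned in order of first appearance and n-grams unseen during fitting are ignored, both of which are choices of this sketch.

```csharp
using System;
using System.Collections.Generic;

// Illustrative only: names and behaviors are assumptions, not ML.NET's implementation.
public static class DictionaryBagSketch
{
    // "Training": assign each distinct n-gram an id in order of first appearance.
    public static Dictionary<string, int> Fit(IEnumerable<IEnumerable<string>> trainingRows)
    {
        var dictionary = new Dictionary<string, int>();
        foreach (var row in trainingRows)
            foreach (string ngram in row)
                if (!dictionary.ContainsKey(ngram))
                    dictionary.Add(ngram, dictionary.Count);
        return dictionary;
    }

    // Counts n-grams into a vector whose size equals the dictionary size.
    public static float[] Transform(IReadOnlyDictionary<string, int> dictionary, IEnumerable<string> ngrams)
    {
        var bag = new float[dictionary.Count];
        foreach (string ngram in ngrams)
            if (dictionary.TryGetValue(ngram, out int id))
                bag[id] += 1f; // n-grams missing from the dictionary are skipped
        return bag;
    }

    public static void Main()
    {
        var dictionary = Fit(new[] { new[] { "a", "b" }, new[] { "b", "c" } });
        float[] bag = Transform(dictionary, new[] { "b", "b", "c", "z" });
        Console.WriteLine(string.Join(", ", bag)); // prints "0, 2, 1"
    }
}
```

The separate `Fit` step is why the table in these doc comments answers "Yes" to whether the estimator needs to look at the data: the output vector size is only known once the dictionary has been built.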

artidoro · 2019-04-19T23:38:54Z

src/Microsoft.ML.Transforms/Text/WrappedTextTransformers.cs

+    /// | Output column data type | Vector of known-size of <xref:System.Single> |
+    ///
+    /// The resulting <xref:Microsoft.ML.ITransformer> creates a new column, named as specified in the output column name parameters, and
+    /// produces a vector of counts of n-grams (sequences of n consecutive words) from a given data.


from a given data [](start = 82, length = 17)

maybe:
counts of n-grams -> counts of occurrences of ngrams
from a given data -> in the input vector of words

@singlis suggested "vector of n-gram counts" which I think is good #Resolved

singlis · 2019-04-19T23:44:54Z

src/Microsoft.ML.Transforms/Text/TextCatalog.cs

+        /// <param name="outputColumnName">Name of the column resulting from the transformation of <paramref name="inputColumnName"/>.
+        /// This column's data type will be known-size vector of <see cref="System.Single"/>.</param>
+        /// <param name="inputColumnName">Name of the column to take the data from.
+        /// This estimator operates over vector of text.</param>


text [](start = 51, length = 4)

no cref=System.String? I know text is pretty obvious, I am just thinking for consistency. #ByDesign

We had a discussion a few days ago: we use just "text" in normal comments.

In reply to: 277113448 [](ancestors = 277113448)

singlis · 2019-04-19T23:45:14Z

src/Microsoft.ML.Transforms/Text/TextCatalog.cs

@@ -316,12 +322,18 @@ public static class TextCatalog
                outputColumnName, inputColumnName, ngramLength, skipLength, useAllLengths, maximumNgramsCount, weighting);

        /// <summary>
-        /// Produces a bag of counts of ngrams (sequences of consecutive words) in <paramref name="inputColumnNames"/>
-        /// and outputs bag of word vector as <paramref name="outputColumnName"/>
+        /// Create a <see cref="WordHashBagEstimator"/>, which maps the multople columns specified in <paramref name="inputColumnNames"/>


multople [](start = 72, length = 8)

multiple #Resolved

singlis · 2019-04-19T23:45:51Z

src/Microsoft.ML.Transforms/Text/TextCatalog.cs

-        /// Produces a bag of counts of ngrams (sequences of consecutive words) in <paramref name="inputColumnNames"/>
-        /// and outputs bag of word vector as <paramref name="outputColumnName"/>
+        /// Create a <see cref="WordHashBagEstimator"/>, which maps the multople columns specified in <paramref name="inputColumnNames"/>
+        /// to a vector of counts of n-grams in a new column named <paramref name="outputColumnName"/>.


counts of n-grams [](start = 27, length = 17)

vector of n-gram counts? #Resolved

singlis · 2019-04-19T23:46:44Z

src/Microsoft.ML.Transforms/Text/TextCatalog.cs

+        /// <param name="outputColumnName">Name of the column resulting from the transformation of <paramref name="inputColumnNames"/>.
+        /// This column's data type will be known-size vector of <see cref="System.Single"/>.</param>
+        /// <param name="inputColumnNames">Names of the multiple columns to take the data from.
+        /// This estimator operates over vector of text.</param>


text [](start = 51, length = 4)

cref=system.String? #ByDesign

If you take this change, please update all functions

In reply to: 277113589 [](ancestors = 277113589)

singlis · 2019-04-19T23:47:17Z

src/Microsoft.ML.Transforms/Text/TextCatalog.cs

-        /// Produces a bag of counts of hashed ngrams in <paramref name="inputColumnName"/>
-        /// and outputs bag of word vector as <paramref name="outputColumnName"/>
+        /// Create a <see cref="WordHashBagEstimator"/>, which maps the column specified in <paramref name="inputColumnName"/>
+        /// to a vector of counts of hashed n-grams in a new column named <paramref name="outputColumnName"/>.


counts of hashed n-grams [](start = 27, length = 24)

vector of hashed n-gram counts? #Resolved

singlis · 2019-04-19T23:48:53Z

src/Microsoft.ML.Transforms/Text/TextCatalog.cs

@@ -371,12 +389,18 @@ public static class TextCatalog
                maximumNumberOfInverts: maximumNumberOfInverts);

        /// <summary>
-        /// Produces a bag of counts of hashed ngrams in <paramref name="inputColumnNames"/>
-        /// and outputs bag of word vector as <paramref name="outputColumnName"/>
+        /// Create a <see cref="WordHashBagEstimator"/>, which maps the multople columns specified in <paramref name="inputColumnNames"/>


multople [](start = 72, length = 8)

multiple #Resolved

natke · 2019-04-19T23:56:20Z

src/Microsoft.ML.Transforms/Text/TextCatalog.cs

+        /// <param name="outputColumnName">Name of the column resulting from the transformation of <paramref name="inputColumnName"/>.
+        /// This column's data type will be known-size vector of <see cref="System.Single"/>.</param>
+        /// <param name="inputColumnName">Name of the column to take the data from.
+        /// This estimator operates over vector of text.</param>


This could be a good place to put the clarification about the input text i.e. that it is tokenized.

This estimator operates over a vector of tokenized text. This is different from the <see cref="NgramExtractingEstimator"/>, which tokenizes the text itself.

natke · 2019-04-19T23:57:07Z

src/Microsoft.ML.Transforms/Text/TextCatalog.cs

-        /// <param name="outputColumnName">Name of the column resulting from the transformation of <paramref name="inputColumnName"/>.</param>
-        /// <param name="inputColumnName">Name of the column to transform. If set to <see langword="null"/>, the value of the <paramref name="outputColumnName"/> will be used as source.</param>
+        /// <remarks>
+        /// <see cref="WordBagEstimator"/> is different from <see cref="NgramExtractingEstimator"/> in that the former


Suggest removing this and adding a clarification to the input column definition.

natke · 2019-04-20T00:00:40Z

src/Microsoft.ML.Transforms/Text/WrappedTextTransformers.cs

+    /// | Output column data type | Vector of known-size of <xref:System.Single> |
+    ///
+    /// The resulting <xref:Microsoft.ML.ITransformer> creates a new column, named as specified in the output column name parameters, and
+    /// produces a vector of counts of n-grams (sequences of n consecutive words) from a given data.


@singlis suggested "vector of n-gram counts" which I think is good #Resolved

natke · 2019-04-20T00:08:34Z

src/Microsoft.ML.Transforms/Text/WrappedTextTransformers.cs

+    /// produces a vector of n-gram counts (sequences of n consecutive words) from a given data.
+    /// It does so by building a dictionary of ngrams and using the id in the dictionary as the index in the bag.
+    ///
+    /// <xref:Microsoft.ML.Transforms.Text.WordBagEstimator> is different from <xref:Microsoft.ML.Transforms.Text.NgramExtractingEstimator>


Same comment as above

Update documentation for WordBag

8de7ce4

Ivanidzo4ka added the documentation Related to documentation of ML.NET label Apr 19, 2019

Ivanidzo4ka requested review from sfilipi, natke, shmoradims and artidoro April 19, 2019 19:45

plurals

10bf564

shmoradims reviewed Apr 19, 2019

Ivan Matantsev added 3 commits April 19, 2019 14:25

address what I understand.

d6d39f7

Merge branch 'master' into Ivanidze/WordBagXmDocumentation

8d40275

one more change

d56383e

sfilipi mentioned this pull request Apr 19, 2019

API reference - XML documentation template for transforms #3204

Closed

address comments

d66ec54

singlis reviewed Apr 19, 2019

artidoro reviewed Apr 19, 2019

playing with words

5efcc2c

artidoro reviewed Apr 19, 2019

artidoro approved these changes Apr 19, 2019

singlis reviewed Apr 19, 2019

scott comments

685860c

singlis approved these changes Apr 19, 2019

natke approved these changes Apr 20, 2019

Ivanidzo4ka merged commit c0832b5 into dotnet:master Apr 20, 2019

ghost locked as resolved and limited conversation to collaborators Mar 22, 2022
