-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Update documentation for WordBag #3440
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update documentation for WordBag #3440
Conversation
/// <param name="inputColumnNames">Name of the columns to transform.</param> | ||
/// <remarks> | ||
/// <see cref="WordHashBagEstimator"/> is different from <see cref="NgramHashingEstimator"/> in a way that <see cref="WordHashBagEstimator"/> | ||
/// tokenizes text internally while <see cref="NgramHashingEstimator"/> takes tokenized text as input. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-> "in that the former .... and the latter ..."
(we don't want to many identical links)
#Resolved
/// | | | | ||
/// | -- | -- | | ||
/// | Does this estimator need to look at the data to train its parameters? | Yes | | ||
/// | Input column data type | Vector of [Text](<xref:Microsoft.ML.Data.TextDataViewType>) | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(xref:Microsoft.ML.Data.TextDataViewType) [](start = 51, length = 43)
(xref:UID) withtout <> #Resolved
/// | Input column data type | Vector of [Text](<xref:Microsoft.ML.Data.TextDataViewType>) | | ||
/// | Output column data type | Vector of known-size of <xref:System.Single> | | ||
/// | ||
/// The resulting <xref:Microsoft.ML.ITransformer/> creates a new column, named as specified in the output column name parameters, and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/ [](start = 53, length = 1)
no / #Resolved
/// | Output column data type | Vector of known-size of <xref:System.Single> | | ||
/// | ||
/// The resulting <xref:Microsoft.ML.ITransformer/> creates a new column, named as specified in the output column name parameters, and | ||
/// produces a vector of counts of n-grams (sequences of consecutive words of length 1-n) from a given data. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sequences of consecutive words of length 1-n [](start = 48, length = 44)
n-gram = 'sequences of n consecutive words'
details: 'sequences of n consecutive tokens' where token is word, char, etc. depending on the context.
#Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand you want me to change this, I just don't understand your proposal
In reply to: 277088838 [](ancestors = 277088838)
@@ -185,6 +204,30 @@ public SchemaShape GetOutputSchema(SchemaShape inputSchema) | |||
/// Produces a bag of counts of ngrams (sequences of consecutive words of length 1-n) in a given text. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ngrams (sequences of consecutive words of length 1-n) [](start = 36, length = 53)
ditto #Resolved
/// | Output column data type | Vector of known-size of <xref:System.Single> | | ||
/// | ||
/// The resulting <xref:Microsoft.ML.ITransformer/> creates a new column, named as specified in the output column name parameters, and | ||
/// produces a vector of counts of n-grams (sequences of consecutive words of length 1-n) from a given data. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(sequences of consecutive words of length 1-n) [](start = 47, length = 46)
ditto #Resolved
/// | | | | ||
/// | -- | -- | | ||
/// | Does this estimator need to look at the data to train its parameters? | Yes | | ||
/// | Input column data type | Vector of [Text](<xref:Microsoft.ML.Data.TextDataViewType>) | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
< [](start = 52, length = 1)
ditto #Resolved
/// It does so by hashing each ngram and using the hash value as the index in the bag. | ||
/// | ||
/// <xref:Microsoft.ML.Transforms.Text.WordHashBagEstimator> is different from <xref:Microsoft.ML.Transforms.Text.NgramHashingEstimator> | ||
/// in a way that the former takes tokenizes text internally while the latter takes tokenized text as input. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in a way that the former takes tokenizes text internally while the latter takes tokenized text as input. [](start = 7, length = 105)
ditto: former ... latter... #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Codecov Report
@@ Coverage Diff @@
## master #3440 +/- ##
==========================================
+ Coverage 72.73% 72.73% +<.01%
==========================================
Files 807 807
Lines 145206 145206
Branches 16230 16230
==========================================
+ Hits 105609 105613 +4
+ Misses 35179 35175 -4
Partials 4418 4418
|
@@ -866,7 +866,7 @@ public VBuffer<ReadOnlyMemory<char>>[] SlotNamesMetadata(out VectorDataViewType[ | |||
/// | | | | |||
/// | -- | -- | | |||
/// | Does this estimator need to look at the data to train its parameters? | Yes | | |||
/// | Input column data type | Vector of [Key](<xref:Microsoft.ML.Data.KeyDataViewType>) | | |||
/// | Input column data type | Vector of [Key](xref:Microsoft.ML.Data.KeyDataViewType) | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Key [](start = 46, length = 3)
I would just do xref:Microsoft.ML.Data.KeyDataViewType #ByDesign
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's desirable to keep it as Ivan did, at least I believe that's the convention for key type: name it key, and link the KeyDataViewType.
In reply to: 277109991 [](ancestors = 277109991)
/// Produces a bag of counts of ngrams (sequences of consecutive words) in <paramref name="inputColumnName"/> | ||
/// and outputs bag of word vector as <paramref name="outputColumnName"/> | ||
/// Create a <see cref="WordBagEstimator"/>, which takes the data from the column specified in <paramref name="inputColumnName"/> | ||
/// to a new column: <paramref name="outputColumnName"/> and produces a vector of counts of n-grams. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would suggest a slightly different wording that uses 'maps' instead of 'takes': (I think this could be used below as well)
which maps the column specified in to a vector of counts of n-grams in a new column named . #Resolved
/// <param name="outputColumnName">Name of the column resulting from the transformation of <paramref name="inputColumnName"/>.</param> | ||
/// <param name="inputColumnName">Name of the column to transform. If set to <see langword="null"/>, the value of the <paramref name="outputColumnName"/> will be used as source.</param> | ||
/// <remarks> | ||
/// <see cref="WordBagEstimator"/> is different from <see cref="NgramExtractingEstimator"/> in a way that the former |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in a way that the former [](start = 99, length = 25)
Same comment as below:
"in that the former .... and the latter ..." #Resolved
/// It does so by building a dictionary of ngrams and using the id in the dictionary as the index in the bag. | ||
/// | ||
/// <xref:Microsoft.ML.Transforms.Text.WordBagEstimator> is different from <xref:Microsoft.ML.Transforms.Text.NgramExtractingEstimator> | ||
/// in a way that the former takes tokenizes text internally while the latter takes tokenized text as input. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a way that [](start = 11, length = 10)
same comment as in the extension: "in that the former ... while the latter ...." #Resolved
/// | Output column data type | Vector of known-size of <xref:System.Single> | | ||
/// | ||
/// The resulting <xref:Microsoft.ML.ITransformer> creates a new column, named as specified in the output column name parameters, and | ||
/// produces a vector of counts of n-grams (sequences of n consecutive words) from a given data. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from a given data [](start = 82, length = 17)
maybe:
counts of n-grams -> counts of occurrences of ngrams
from a given data -> in the input vector of words
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@singlis suggested "vector of n-gram counts" which I think is good #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// <param name="outputColumnName">Name of the column resulting from the transformation of <paramref name="inputColumnName"/>. | ||
/// This column's data type will be known-size vector of <see cref="System.Single"/>.</param> | ||
/// <param name="inputColumnName">Name of the column to take the data from. | ||
/// This estimator operates over vector of text.</param> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
text [](start = 51, length = 4)
no cref=System.String? I know text is pretty obvious, I am just thinking for consistency. #ByDesign
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we had discussion few days ago. we use just text in normal comments
In reply to: 277113448 [](ancestors = 277113448)
@@ -316,12 +322,18 @@ public static class TextCatalog | |||
outputColumnName, inputColumnName, ngramLength, skipLength, useAllLengths, maximumNgramsCount, weighting); | |||
|
|||
/// <summary> | |||
/// Produces a bag of counts of ngrams (sequences of consecutive words) in <paramref name="inputColumnNames"/> | |||
/// and outputs bag of word vector as <paramref name="outputColumnName"/> | |||
/// Create a <see cref="WordHashBagEstimator"/>, which maps the multople columns specified in <paramref name="inputColumnNames"/> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
multople [](start = 72, length = 8)
multiple #Resolved
/// Produces a bag of counts of ngrams (sequences of consecutive words) in <paramref name="inputColumnNames"/> | ||
/// and outputs bag of word vector as <paramref name="outputColumnName"/> | ||
/// Create a <see cref="WordHashBagEstimator"/>, which maps the multople columns specified in <paramref name="inputColumnNames"/> | ||
/// to a vector of counts of n-grams in a new column named <paramref name="outputColumnName"/>. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
counts of n-grams [](start = 27, length = 17)
vector of n-gram counts? #Resolved
/// <param name="outputColumnName">Name of the column resulting from the transformation of <paramref name="inputColumnNames"/>. | ||
/// This column's data type will be known-size vector of <see cref="System.Single"/>.</param> | ||
/// <param name="inputColumnNames">Names of the multiple columns to take the data from. | ||
/// This estimator operates over vector of text.</param> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
text [](start = 51, length = 4)
cref=system.String? #ByDesign
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you take this change, please update all functions
In reply to: 277113589 [](ancestors = 277113589)
/// Produces a bag of counts of hashed ngrams in <paramref name="inputColumnName"/> | ||
/// and outputs bag of word vector as <paramref name="outputColumnName"/> | ||
/// Create a <see cref="WordHashBagEstimator"/>, which maps the column specified in <paramref name="inputColumnName"/> | ||
/// to a vector of counts of hashed n-grams in a new column named <paramref name="outputColumnName"/>. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
counts of hashed n-grams [](start = 27, length = 24)
vector of hashed n-gram counts? #Resolved
@@ -371,12 +389,18 @@ public static class TextCatalog | |||
maximumNumberOfInverts: maximumNumberOfInverts); | |||
|
|||
/// <summary> | |||
/// Produces a bag of counts of hashed ngrams in <paramref name="inputColumnNames"/> | |||
/// and outputs bag of word vector as <paramref name="outputColumnName"/> | |||
/// Create a <see cref="WordHashBagEstimator"/>, which maps the multople columns specified in <paramref name="inputColumnNames"/> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
multople [](start = 72, length = 8)
multiple #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// <param name="outputColumnName">Name of the column resulting from the transformation of <paramref name="inputColumnName"/>. | ||
/// This column's data type will be known-size vector of <see cref="System.Single"/>.</param> | ||
/// <param name="inputColumnName">Name of the column to take the data from. | ||
/// This estimator operates over vector of text.</param> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be a good place to put the clarification about the input text i.e. that it is tokenized.
This estimator operates over a vector of tokenized text. This is different from the see cref="NgramExtractingEstimator", which tokenizes the text itself.
/// <param name="outputColumnName">Name of the column resulting from the transformation of <paramref name="inputColumnName"/>.</param> | ||
/// <param name="inputColumnName">Name of the column to transform. If set to <see langword="null"/>, the value of the <paramref name="outputColumnName"/> will be used as source.</param> | ||
/// <remarks> | ||
/// <see cref="WordBagEstimator"/> is different from <see cref="NgramExtractingEstimator"/> in that the former |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggest removing this and adding a clarification to the input column definition.
/// | Output column data type | Vector of known-size of <xref:System.Single> | | ||
/// | ||
/// The resulting <xref:Microsoft.ML.ITransformer> creates a new column, named as specified in the output column name parameters, and | ||
/// produces a vector of counts of n-grams (sequences of n consecutive words) from a given data. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@singlis suggested "vector of n-gram counts" which I think is good #Resolved
/// produces a vector of n-gram counts (sequences of n consecutive words) from a given data. | ||
/// It does so by building a dictionary of ngrams and using the id in the dictionary as the index in the bag. | ||
/// | ||
/// <xref:Microsoft.ML.Transforms.Text.WordBagEstimator> is different from <xref:Microsoft.ML.Transforms.Text.NgramExtractingEstimator> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as above
towards #3204