Skip to content

Update documentation for WordBag #3440

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

Ivanidzo4ka
Copy link
Contributor

towards #3204

@Ivanidzo4ka Ivanidzo4ka added the documentation Related to documentation of ML.NET label Apr 19, 2019
/// <param name="inputColumnNames">Name of the columns to transform.</param>
/// <remarks>
/// <see cref="WordHashBagEstimator"/> is different from <see cref="NgramHashingEstimator"/> in a way that <see cref="WordHashBagEstimator"/>
/// tokenizes text internally while <see cref="NgramHashingEstimator"/> takes tokenized text as input.
Copy link

@shmoradims shmoradims Apr 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-> "in that the former .... and the latter ..."
(we don't want to many identical links)

#Resolved

/// | | |
/// | -- | -- |
/// | Does this estimator need to look at the data to train its parameters? | Yes |
/// | Input column data type | Vector of [Text](<xref:Microsoft.ML.Data.TextDataViewType>) |
Copy link

@shmoradims shmoradims Apr 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(xref:Microsoft.ML.Data.TextDataViewType) [](start = 51, length = 43)

(xref:UID) withtout <> #Resolved

/// | Input column data type | Vector of [Text](<xref:Microsoft.ML.Data.TextDataViewType>) |
/// | Output column data type | Vector of known-size of <xref:System.Single> |
///
/// The resulting <xref:Microsoft.ML.ITransformer/> creates a new column, named as specified in the output column name parameters, and
Copy link

@shmoradims shmoradims Apr 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/ [](start = 53, length = 1)

no / #Resolved

/// | Output column data type | Vector of known-size of <xref:System.Single> |
///
/// The resulting <xref:Microsoft.ML.ITransformer/> creates a new column, named as specified in the output column name parameters, and
/// produces a vector of counts of n-grams (sequences of consecutive words of length 1-n) from a given data.
Copy link

@shmoradims shmoradims Apr 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sequences of consecutive words of length 1-n [](start = 48, length = 44)

n-gram = 'sequences of n consecutive words'

details: 'sequences of n consecutive tokens' where token is word, char, etc. depending on the context.

#Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand you want me to change this, I just don't understand your proposal


In reply to: 277088838 [](ancestors = 277088838)

@@ -185,6 +204,30 @@ public SchemaShape GetOutputSchema(SchemaShape inputSchema)
/// Produces a bag of counts of ngrams (sequences of consecutive words of length 1-n) in a given text.
Copy link

@shmoradims shmoradims Apr 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ngrams (sequences of consecutive words of length 1-n) [](start = 36, length = 53)

ditto #Resolved

/// | Output column data type | Vector of known-size of <xref:System.Single> |
///
/// The resulting <xref:Microsoft.ML.ITransformer/> creates a new column, named as specified in the output column name parameters, and
/// produces a vector of counts of n-grams (sequences of consecutive words of length 1-n) from a given data.
Copy link

@shmoradims shmoradims Apr 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(sequences of consecutive words of length 1-n) [](start = 47, length = 46)

ditto #Resolved

/// | | |
/// | -- | -- |
/// | Does this estimator need to look at the data to train its parameters? | Yes |
/// | Input column data type | Vector of [Text](<xref:Microsoft.ML.Data.TextDataViewType>) |
Copy link

@shmoradims shmoradims Apr 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

< [](start = 52, length = 1)

ditto #Resolved

/// It does so by hashing each ngram and using the hash value as the index in the bag.
///
/// <xref:Microsoft.ML.Transforms.Text.WordHashBagEstimator> is different from <xref:Microsoft.ML.Transforms.Text.NgramHashingEstimator>
/// in a way that the former takes tokenizes text internally while the latter takes tokenized text as input.
Copy link

@shmoradims shmoradims Apr 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in a way that the former takes tokenizes text internally while the latter takes tokenized text as input. [](start = 7, length = 105)

ditto: former ... latter... #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's already former.. latter...


In reply to: 277089489 [](ancestors = 277089489)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in that the former ...


In reply to: 277095332 [](ancestors = 277095332,277089489)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't get your comment


In reply to: 277113027 [](ancestors = 277113027,277095332,277089489)

@codecov
Copy link

codecov bot commented Apr 19, 2019

Codecov Report

Merging #3440 into master will increase coverage by <.01%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #3440      +/-   ##
==========================================
+ Coverage   72.73%   72.73%   +<.01%     
==========================================
  Files         807      807              
  Lines      145206   145206              
  Branches    16230    16230              
==========================================
+ Hits       105609   105613       +4     
+ Misses      35179    35175       -4     
  Partials     4418     4418
Flag Coverage Δ
#Debug 72.73% <ø> (ø) ⬆️
#production 68.27% <ø> (ø) ⬆️
#test 88.98% <ø> (ø) ⬆️
Impacted Files Coverage Δ
src/Microsoft.ML.Transforms/Text/TextCatalog.cs 41.66% <ø> (ø) ⬆️
...soft.ML.Transforms/Text/NgramHashingTransformer.cs 88.79% <ø> (ø) ⬆️
...soft.ML.Transforms/Text/WrappedTextTransformers.cs 93.63% <ø> (ø) ⬆️
...ML.Transforms/Text/StopWordsRemovingTransformer.cs 86.1% <0%> (-0.16%) ⬇️
...soft.ML.Data/DataLoadSave/Text/TextLoaderCursor.cs 84.9% <0%> (+0.2%) ⬆️
src/Microsoft.ML.Transforms/Text/LdaTransform.cs 89.89% <0%> (+0.62%) ⬆️

@@ -866,7 +866,7 @@ public VBuffer<ReadOnlyMemory<char>>[] SlotNamesMetadata(out VectorDataViewType[
/// | | |
/// | -- | -- |
/// | Does this estimator need to look at the data to train its parameters? | Yes |
/// | Input column data type | Vector of [Key](<xref:Microsoft.ML.Data.KeyDataViewType>) |
/// | Input column data type | Vector of [Key](xref:Microsoft.ML.Data.KeyDataViewType) |
Copy link
Member

@singlis singlis Apr 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Key [](start = 46, length = 3)

I would just do xref:Microsoft.ML.Data.KeyDataViewType #ByDesign

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's desirable to keep it as Ivan did, at least I believe that's the convention for key type: name it key, and link the KeyDataViewType.


In reply to: 277109991 [](ancestors = 277109991)

/// Produces a bag of counts of ngrams (sequences of consecutive words) in <paramref name="inputColumnName"/>
/// and outputs bag of word vector as <paramref name="outputColumnName"/>
/// Create a <see cref="WordBagEstimator"/>, which takes the data from the column specified in <paramref name="inputColumnName"/>
/// to a new column: <paramref name="outputColumnName"/> and produces a vector of counts of n-grams.
Copy link
Contributor

@artidoro artidoro Apr 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would suggest a slightly different wording that uses 'maps' instead of 'takes': (I think this could be used below as well)

which maps the column specified in to a vector of counts of n-grams in a new column named . #Resolved

/// <param name="outputColumnName">Name of the column resulting from the transformation of <paramref name="inputColumnName"/>.</param>
/// <param name="inputColumnName">Name of the column to transform. If set to <see langword="null"/>, the value of the <paramref name="outputColumnName"/> will be used as source.</param>
/// <remarks>
/// <see cref="WordBagEstimator"/> is different from <see cref="NgramExtractingEstimator"/> in a way that the former
Copy link
Contributor

@artidoro artidoro Apr 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in a way that the former [](start = 99, length = 25)

Same comment as below:
"in that the former .... and the latter ..." #Resolved

/// It does so by building a dictionary of ngrams and using the id in the dictionary as the index in the bag.
///
/// <xref:Microsoft.ML.Transforms.Text.WordBagEstimator> is different from <xref:Microsoft.ML.Transforms.Text.NgramExtractingEstimator>
/// in a way that the former takes tokenizes text internally while the latter takes tokenized text as input.
Copy link
Contributor

@artidoro artidoro Apr 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a way that [](start = 11, length = 10)

same comment as in the extension: "in that the former ... while the latter ...." #Resolved

/// | Output column data type | Vector of known-size of <xref:System.Single> |
///
/// The resulting <xref:Microsoft.ML.ITransformer> creates a new column, named as specified in the output column name parameters, and
/// produces a vector of counts of n-grams (sequences of n consecutive words) from a given data.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from a given data [](start = 82, length = 17)

maybe:
counts of n-grams -> counts of occurrences of ngrams
from a given data -> in the input vector of words

Copy link
Contributor

@natke natke Apr 20, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@singlis suggested "vector of n-gram counts" which I think is good #Resolved

Copy link
Contributor

@artidoro artidoro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

/// <param name="outputColumnName">Name of the column resulting from the transformation of <paramref name="inputColumnName"/>.
/// This column's data type will be known-size vector of <see cref="System.Single"/>.</param>
/// <param name="inputColumnName">Name of the column to take the data from.
/// This estimator operates over vector of text.</param>
Copy link
Member

@singlis singlis Apr 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

text [](start = 51, length = 4)

no cref=System.String? I know text is pretty obvious, I am just thinking for consistency. #ByDesign

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we had discussion few days ago. we use just text in normal comments


In reply to: 277113448 [](ancestors = 277113448)

@@ -316,12 +322,18 @@ public static class TextCatalog
outputColumnName, inputColumnName, ngramLength, skipLength, useAllLengths, maximumNgramsCount, weighting);

/// <summary>
/// Produces a bag of counts of ngrams (sequences of consecutive words) in <paramref name="inputColumnNames"/>
/// and outputs bag of word vector as <paramref name="outputColumnName"/>
/// Create a <see cref="WordHashBagEstimator"/>, which maps the multople columns specified in <paramref name="inputColumnNames"/>
Copy link
Member

@singlis singlis Apr 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

multople [](start = 72, length = 8)

multiple #Resolved

/// Produces a bag of counts of ngrams (sequences of consecutive words) in <paramref name="inputColumnNames"/>
/// and outputs bag of word vector as <paramref name="outputColumnName"/>
/// Create a <see cref="WordHashBagEstimator"/>, which maps the multople columns specified in <paramref name="inputColumnNames"/>
/// to a vector of counts of n-grams in a new column named <paramref name="outputColumnName"/>.
Copy link
Member

@singlis singlis Apr 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

counts of n-grams [](start = 27, length = 17)

vector of n-gram counts? #Resolved

/// <param name="outputColumnName">Name of the column resulting from the transformation of <paramref name="inputColumnNames"/>.
/// This column's data type will be known-size vector of <see cref="System.Single"/>.</param>
/// <param name="inputColumnNames">Names of the multiple columns to take the data from.
/// This estimator operates over vector of text.</param>
Copy link
Member

@singlis singlis Apr 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

text [](start = 51, length = 4)

cref=system.String? #ByDesign

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you take this change, please update all functions


In reply to: 277113589 [](ancestors = 277113589)

/// Produces a bag of counts of hashed ngrams in <paramref name="inputColumnName"/>
/// and outputs bag of word vector as <paramref name="outputColumnName"/>
/// Create a <see cref="WordHashBagEstimator"/>, which maps the column specified in <paramref name="inputColumnName"/>
/// to a vector of counts of hashed n-grams in a new column named <paramref name="outputColumnName"/>.
Copy link
Member

@singlis singlis Apr 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

counts of hashed n-grams [](start = 27, length = 24)

vector of hashed n-gram counts? #Resolved

@@ -371,12 +389,18 @@ public static class TextCatalog
maximumNumberOfInverts: maximumNumberOfInverts);

/// <summary>
/// Produces a bag of counts of hashed ngrams in <paramref name="inputColumnNames"/>
/// and outputs bag of word vector as <paramref name="outputColumnName"/>
/// Create a <see cref="WordHashBagEstimator"/>, which maps the multople columns specified in <paramref name="inputColumnNames"/>
Copy link
Member

@singlis singlis Apr 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

multople [](start = 72, length = 8)

multiple #Resolved

Copy link
Member

@singlis singlis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

/// <param name="outputColumnName">Name of the column resulting from the transformation of <paramref name="inputColumnName"/>.
/// This column's data type will be known-size vector of <see cref="System.Single"/>.</param>
/// <param name="inputColumnName">Name of the column to take the data from.
/// This estimator operates over vector of text.</param>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be a good place to put the clarification about the input text i.e. that it is tokenized.

This estimator operates over a vector of tokenized text. This is different from the see cref="NgramExtractingEstimator", which tokenizes the text itself.

/// <param name="outputColumnName">Name of the column resulting from the transformation of <paramref name="inputColumnName"/>.</param>
/// <param name="inputColumnName">Name of the column to transform. If set to <see langword="null"/>, the value of the <paramref name="outputColumnName"/> will be used as source.</param>
/// <remarks>
/// <see cref="WordBagEstimator"/> is different from <see cref="NgramExtractingEstimator"/> in that the former
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest removing this and adding a clarification to the input column definition.

/// | Output column data type | Vector of known-size of <xref:System.Single> |
///
/// The resulting <xref:Microsoft.ML.ITransformer> creates a new column, named as specified in the output column name parameters, and
/// produces a vector of counts of n-grams (sequences of n consecutive words) from a given data.
Copy link
Contributor

@natke natke Apr 20, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@singlis suggested "vector of n-gram counts" which I think is good #Resolved

/// produces a vector of n-gram counts (sequences of n consecutive words) from a given data.
/// It does so by building a dictionary of ngrams and using the id in the dictionary as the index in the bag.
///
/// <xref:Microsoft.ML.Transforms.Text.WordBagEstimator> is different from <xref:Microsoft.ML.Transforms.Text.NgramExtractingEstimator>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as above

@Ivanidzo4ka Ivanidzo4ka merged commit c0832b5 into dotnet:master Apr 20, 2019
@ghost ghost locked as resolved and limited conversation to collaborators Mar 22, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
documentation Related to documentation of ML.NET
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants