-
Notifications
You must be signed in to change notification settings - Fork 1.9k
snapping featurizeText to the template #3438
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## master #3438 +/- ##
==========================================
+ Coverage 72.72% 72.73% +<.01%
==========================================
Files 807 807
Lines 145206 145206
Branches 16230 16230
==========================================
+ Hits 105608 105615 +7
+ Misses 35179 35174 -5
+ Partials 4419 4417 -2
|
@@ -17,11 +17,14 @@ namespace Microsoft.ML | |||
public static class TextCatalog | |||
{ | |||
/// <summary> | |||
/// Transform a text column into featurized float array that represents counts of ngrams and char-grams. | |||
/// Create a <see cref="TextFeaturizingEstimator"/>, which transforms a text column into featurized float array that represents normalized counts of ngrams and char-grams. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
float array [](start = 108, length = 11)
I think it would be better to avoid float array?
Maybe numeric vector column
? Here and below #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I put System.Single, since numeric implies int, uint etc.
In reply to: 277057968 [](ancestors = 277057968)
/// | | | | ||
/// | -- | -- | | ||
/// | Does this estimator need to look at the data to train its parameters? | Yes. | | ||
/// | Input column data type | [text](xref:System.ReadOnlyMemory<char[]>) | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this render correctly? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought it was: xref:System.ReadOnlyMemory{System.Char}
As per Shahab's message in teams
In reply to: 277057901 [](ancestors = 277057901)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for the catch!
#Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am putting the following in there:
Scream now, or ... :)
| Input column data type | [text](xref:Microsoft.ML.Data.TextDataViewType) |
In reply to: 277111141 [](ancestors = 277111141,277110296)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Checked with Shahab. The above is correct.
In reply to: 277111655 [](ancestors = 277111655,277111141,277110296)
/// * [Tokenzation](https://en.wikipedia.org/wiki/Lexical_analysis#Tokenization) | ||
/// * [Text normalization](https://en.wikipedia.org/wiki/Text_normalization) | ||
/// * [Predefined and custom stopwords removal](https://en.wikipedia.org/wiki/Stop_words) | ||
/// * [Word-based or character-based Ngram and SkipGram extraction](https://en.wikipedia.org/wiki/N-gram) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it actually produce skip-grams? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does, it's one of the advanced options for the WordBagEstiamator.Options
that can be set in the TextFeaturizingEstimator.Options
. The option is SkipLength
In reply to: 277111709 [](ancestors = 277111709,277058416)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be great if you could add it back!
In reply to: 277113837 [](ancestors = 277113837,277111709,277058416)
// integer index mapping through hashing) as an option. | ||
/// <include file='doc.xml' path='doc/members/member[@name="TextFeaturizingEstimator "]/*' /> | ||
/// <summary> | ||
/// A transform that turns a collection of text documents into numerical feature vectors. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A transform [](start = 9, length = 11)
Would avoid starting with a transform
as the description of an estimator.
How about starting with Featurizes text data by ...
#Resolved
/// <include file='doc.xml' path='doc/members/member[@name="TextFeaturizingEstimator "]/*' /> | ||
/// <summary> | ||
/// A transform that turns a collection of text documents into numerical feature vectors. | ||
/// The feature vectors are normalized counts of word and/or character ngrams (based on the options supplied) in a given tokenized text. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in a given tokenized text. [](start = 115, length = 26)
Does the input text need to be tokenized? I thought the input was just text.
Maybe removing tokenized will make it more clear. #Resolved
/// * [L-p vector normalization](xref: Microsoft.ML.Transforms.LpNormNormalizingTransformer) | ||
/// | ||
/// Features are made of (word/character) n-grams/skip-grams and the number of features are equal to the vocabulary size found by analyzing the data. | ||
/// </format> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe you can specify the maximum number of ngrams to keep in the advanced options. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe you could start by: By default, ...
.
And afterwards say that: the number of features can also be specified by selecting the maximum number of n-gram to keep in the advanced options
.
In reply to: 277060289 [](ancestors = 277060289)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just one comment about skip-grams otherwise it looks good!
/// | -- | -- | | ||
/// | Does this estimator need to look at the data to train its parameters? | Yes. | | ||
/// | Input column data type | [text](xref:Microsoft.ML.Data.TextDataViewType) | | ||
/// | Output column data type | vector of <xref:System.Single> | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
vector [](start = 36, length = 6)
Known-sized Vector #Pending
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// | ||
/// By default the features are made of (word/character) n-grams/skip-grams and the number of features are equal to the vocabulary size found by analyzing the data. | ||
/// The number of features can also be specified by selecting the maximum number of n-gram to keep in the <see cref="TextFeaturizingEstimator.Options"/>, where the estimator can be further tuned. | ||
/// </format> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[](start = 7, length = 5)
nit: A few extra spaces.
Could you add the usual See Also comment about examples?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are we doing that? I haven't put it anywhere but LDA, where they would need to look at the previous steps.
In reply to: 277114384 [](ancestors = 277114384)
/// | ||
/// By default the features are made of (word/character) n-grams/skip-grams and the number of features are equal to the vocabulary size found by analyzing the data. | ||
/// The number of features can also be specified by selecting the maximum number of n-gram to keep in the <see cref="TextFeaturizingEstimator.Options"/>, where the estimator can be further tuned. | ||
/// </format> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you think it's important you could mention that the word tokens can be output to a column.
There is a setting in the options class: OutputTokensColumnName
#Resolved
Towards #3204. Snapping featurizeText to the template