Skip to content

snapping featurizeText to the template #3438

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Apr 21, 2019
Merged

Conversation

sfilipi
Copy link
Member

@sfilipi sfilipi commented Apr 19, 2019

Towards #3204. Snapping featurizeText to the template

@sfilipi sfilipi requested review from natke and shmoradims April 19, 2019 17:47
@sfilipi sfilipi self-assigned this Apr 19, 2019
@sfilipi sfilipi added the documentation Related to documentation of ML.NET label Apr 19, 2019
@sfilipi sfilipi requested a review from codemzs April 19, 2019 17:48
@codecov
Copy link

codecov bot commented Apr 19, 2019

Codecov Report

Merging #3438 into master will increase coverage by <.01%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #3438      +/-   ##
==========================================
+ Coverage   72.72%   72.73%   +<.01%     
==========================================
  Files         807      807              
  Lines      145206   145206              
  Branches    16230    16230              
==========================================
+ Hits       105608   105615       +7     
+ Misses      35179    35174       -5     
+ Partials     4419     4417       -2
Flag Coverage Δ
#Debug 72.73% <ø> (ø) ⬆️
#production 68.27% <ø> (ø) ⬆️
#test 88.98% <ø> (ø) ⬆️
Impacted Files Coverage Δ
src/Microsoft.ML.Transforms/Text/TextCatalog.cs 41.66% <ø> (ø) ⬆️
...oft.ML.Transforms/Text/TextFeaturizingEstimator.cs 90.57% <ø> (ø) ⬆️
src/Microsoft.ML.Transforms/NormalizerCatalog.cs 84.78% <0%> (ø) ⬆️
src/Microsoft.ML.Transforms/Text/LdaTransform.cs 89.89% <0%> (+0.62%) ⬆️
src/Microsoft.ML.Maml/MAML.cs 26.21% <0%> (+1.45%) ⬆️

@@ -17,11 +17,14 @@ namespace Microsoft.ML
public static class TextCatalog
{
/// <summary>
/// Transform a text column into featurized float array that represents counts of ngrams and char-grams.
/// Create a <see cref="TextFeaturizingEstimator"/>, which transforms a text column into featurized float array that represents normalized counts of ngrams and char-grams.
Copy link
Contributor

@artidoro artidoro Apr 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

float array [](start = 108, length = 11)

I think it would be better to avoid float array?
Maybe numeric vector column? Here and below #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I put System.Single, since numeric implies int, uint etc.


In reply to: 277057968 [](ancestors = 277057968)

/// | | |
/// | -- | -- |
/// | Does this estimator need to look at the data to train its parameters? | Yes. |
/// | Input column data type | [text](xref:System.ReadOnlyMemory<char[]>) |
Copy link
Contributor

@natke natke Apr 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this render correctly? #Resolved

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought it was: xref:System.ReadOnlyMemory{System.Char}
As per Shahab's message in teams


In reply to: 277057901 [](ancestors = 277057901)

Copy link
Member Author

@sfilipi sfilipi Apr 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the catch!
#Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually i think we're doing TextDataViewType


In reply to: 277110296 [](ancestors = 277110296)

Copy link
Member Author

@sfilipi sfilipi Apr 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am putting the following in there:
Scream now, or ... :)

| Input column data type | [text](xref:Microsoft.ML.Data.TextDataViewType) |


In reply to: 277111141 [](ancestors = 277111141,277110296)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checked with Shahab. The above is correct.


In reply to: 277111655 [](ancestors = 277111655,277111141,277110296)

/// * [Tokenzation](https://en.wikipedia.org/wiki/Lexical_analysis#Tokenization)
/// * [Text normalization](https://en.wikipedia.org/wiki/Text_normalization)
/// * [Predefined and custom stopwords removal](https://en.wikipedia.org/wiki/Stop_words)
/// * [Word-based or character-based Ngram and SkipGram extraction](https://en.wikipedia.org/wiki/N-gram)
Copy link
Contributor

@natke natke Apr 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it actually produce skip-grams? #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let me error on the safe side and remove it.


In reply to: 277058416 [](ancestors = 277058416)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does, it's one of the advanced options for the WordBagEstiamator.Options that can be set in the TextFeaturizingEstimator.Options. The option is SkipLength


In reply to: 277111709 [](ancestors = 277111709,277058416)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be great if you could add it back!


In reply to: 277113837 [](ancestors = 277113837,277111709,277058416)

// integer index mapping through hashing) as an option.
/// <include file='doc.xml' path='doc/members/member[@name="TextFeaturizingEstimator "]/*' />
/// <summary>
/// A transform that turns a collection of text documents into numerical feature vectors.
Copy link
Contributor

@artidoro artidoro Apr 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A transform [](start = 9, length = 11)

Would avoid starting with a transform as the description of an estimator.

How about starting with Featurizes text data by ... #Resolved

/// <include file='doc.xml' path='doc/members/member[@name="TextFeaturizingEstimator "]/*' />
/// <summary>
/// A transform that turns a collection of text documents into numerical feature vectors.
/// The feature vectors are normalized counts of word and/or character ngrams (based on the options supplied) in a given tokenized text.
Copy link
Contributor

@artidoro artidoro Apr 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in a given tokenized text. [](start = 115, length = 26)

Does the input text need to be tokenized? I thought the input was just text.
Maybe removing tokenized will make it more clear. #Resolved

/// * [L-p vector normalization](xref: Microsoft.ML.Transforms.LpNormNormalizingTransformer)
///
/// Features are made of (word/character) n-grams/skip-grams​ and the number of features are equal to the vocabulary size found by analyzing the data.
/// </format>
Copy link
Contributor

@artidoro artidoro Apr 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe you can specify the maximum number of ngrams to keep in the advanced options. #Resolved

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe you could start by: By default, ....
And afterwards say that: the number of features can also be specified by selecting the maximum number of n-gram to keep in the advanced options.


In reply to: 277060289 [](ancestors = 277060289)

Copy link
Contributor

@artidoro artidoro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one comment about skip-grams otherwise it looks good!

/// | -- | -- |
/// | Does this estimator need to look at the data to train its parameters? | Yes. |
/// | Input column data type | [text](xref:Microsoft.ML.Data.TextDataViewType) |
/// | Output column data type | vector of <xref:System.Single> |
Copy link
Contributor

@artidoro artidoro Apr 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

vector [](start = 36, length = 6)

Known-sized Vector #Pending

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

leaving it, since this is an output.


In reply to: 277114222 [](ancestors = 277114222)

///
/// By default the features are made of (word/character) n-grams/skip-grams​ and the number of features are equal to the vocabulary size found by analyzing the data.
/// The number of features can also be specified by selecting the maximum number of n-gram to keep in the <see cref="TextFeaturizingEstimator.Options"/>, where the estimator can be further tuned.
/// </format>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

 [](start = 7, length = 5)

nit: A few extra spaces.

Could you add the usual See Also comment about examples?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are we doing that? I haven't put it anywhere but LDA, where they would need to look at the previous steps.


In reply to: 277114384 [](ancestors = 277114384)

///
/// By default the features are made of (word/character) n-grams/skip-grams​ and the number of features are equal to the vocabulary size found by analyzing the data.
/// The number of features can also be specified by selecting the maximum number of n-gram to keep in the <see cref="TextFeaturizingEstimator.Options"/>, where the estimator can be further tuned.
/// </format>
Copy link
Contributor

@artidoro artidoro Apr 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you think it's important you could mention that the word tokens can be output to a column.
There is a setting in the options class: OutputTokensColumnName #Resolved

@sfilipi sfilipi merged commit 1b277b5 into dotnet:master Apr 21, 2019
@sfilipi sfilipi deleted the featurizeTextDoc branch April 21, 2019 06:24
@ghost ghost locked as resolved and limited conversation to collaborators Mar 22, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
documentation Related to documentation of ML.NET
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants