-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Exposed ngram extraction options in TextFeaturizer #2911
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## master #2911 +/- ##
==========================================
+ Coverage 71.82% 72.19% +0.36%
==========================================
Files 812 796 -16
Lines 142719 142020 -699
Branches 16092 16044 -48
==========================================
+ Hits 102513 102526 +13
+ Misses 35827 35115 -712
Partials 4379 4379
|
/// The maximum number of grams to store in the dictionary, for each level of ngrams, | ||
/// from 1 (in position 0) up to ngramLength (in position ngramLength-1) | ||
/// </summary> | ||
public int[] MaxNumTerms; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MaxNumTerms [](start = 25, length = 11)
@wschin calls it maximumNgramsCount
would be nice to be consistent. #Resolved
So originally we had Factory for StopWordRemover, which I think I change to this boolean field. Refers to: src/Microsoft.ML.Transforms/Text/TextFeaturizingEstimator.cs:100 in d3e2841. [](commit_id = d3e2841, deletion_comment = False) |
{ | ||
[Argument(ArgumentType.Required, HelpText = "New column definition (optional form: name:srcs).", Name = "Column", ShortName = "col", SortOrder = 1)] | ||
public Column Columns; | ||
internal Column Columns; | ||
|
||
[Argument(ArgumentType.AtMostOnce, HelpText = "Dataset language or 'AutoDetect' to detect language per row.", ShortName = "lang", SortOrder = 3)] | ||
public Language Language = DefaultLanguage; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Language [](start = 28, length = 8)
Not related to your PR, but whole point of this thing is to set or autodetect language, we left autodetect outside of ML.NET so it's looks completely unnecessary here, but as I said in other comment, this is out of scope of this PR. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
public Options() | ||
{ | ||
WordFeatureExtractor = new WordBagEstimator.Options(); | ||
CharFeatureExtractor = new WordBagEstimator.Options() { NgramLength = 3, AllLengths = false }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't these fields initialized to null by default? And then fallback to WordFeatureExtractorFactory / CharFeatureExtractorFactory whenever you try to use them and they are not defined. This way code flow from entrypoint / maml will be same as with API. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WordFeatureExtractor
and CharFeatureExtractor
both can be null. When they are null its means do not apply the respective transforms. So, there is no way to detect if null was set by user or its a default value.
Right now, WordFeatureExtractorFactory / CharFeatureExtractorFactory is preferred if the maml is used.
In reply to: 264465314 [](ancestors = 264465314)
[Argument(ArgumentType.Multiple, HelpText = "Ngram feature extractor to use for characters (WordBag/WordHashBag).", ShortName = "charExtractor", NullName = "<None>", SortOrder = 12)] | ||
public INgramExtractorFactoryFactory CharFeatureExtractor = new NgramExtractorTransform.NgramExtractorArguments() { NgramLength = 3, AllLengths = false }; | ||
[Argument(ArgumentType.Multiple, Name = "CharFeatureExtractor", HelpText = "Ngram feature extractor to use for characters (WordBag/WordHashBag).", ShortName = "charExtractor", NullName = "<None>", SortOrder = 12)] | ||
internal INgramExtractorFactoryFactory CharFeatureExtractorFactory = new NgramExtractorTransform.NgramExtractorArguments() { NgramLength = 3, AllLengths = false }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
INgramExtractorFactoryFactory [](start = 21, length = 29)
FactoryFactory :), is it possible just have name INgramExtractorFactory
#Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point but we already have a class name INgramExtractorFactory
internal interface INgramExtractorFactory |
and there is a relation between
INgramExtractorFactory
and INgramExtractorFactoryFactory
. I leave it for now. Maybe you can create an issue to track it if you think this needs to be renamed?
In reply to: 264465575 [](ancestors = 264465575)
[Argument(ArgumentType.Multiple, HelpText = "Ngram feature extractor to use for words (WordBag/WordHashBag).", ShortName = "wordExtractor", NullName = "<None>", SortOrder = 11)] | ||
public INgramExtractorFactoryFactory WordFeatureExtractor = new NgramExtractorTransform.NgramExtractorArguments(); | ||
[Argument(ArgumentType.Multiple, Name = "WordFeatureExtractor", HelpText = "Ngram feature extractor to use for words (WordBag/WordHashBag).", ShortName = "wordExtractor", NullName = "<None>", SortOrder = 11)] | ||
internal INgramExtractorFactoryFactory WordFeatureExtractorFactory = new NgramExtractorTransform.NgramExtractorArguments(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
INgramExtractorFactoryFactory [](start = 21, length = 29)
same here FactoryFactory :) #Resolved
[Argument(ArgumentType.Multiple, Name = "WordFeatureExtractor", HelpText = "Ngram feature extractor to use for words (WordBag/WordHashBag).", ShortName = "wordExtractor", NullName = "<None>", SortOrder = 11)] | ||
internal INgramExtractorFactoryFactory WordFeatureExtractorFactory = new NgramExtractorTransform.NgramExtractorArguments(); | ||
|
||
public WordBagEstimator.Options WordFeatureExtractor; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Public field needs document. In addition, to internalize IFactory
please follow the pattern Tom and I have built in this PR --- PR description of #2851 #Resolved
[Argument(ArgumentType.Multiple, Name = "CharFeatureExtractor", HelpText = "Ngram feature extractor to use for characters (WordBag/WordHashBag).", ShortName = "charExtractor", NullName = "<None>", SortOrder = 12)] | ||
internal INgramExtractorFactoryFactory CharFeatureExtractorFactory = new NgramExtractorTransform.NgramExtractorArguments() { NgramLength = 3, AllLengths = false }; | ||
|
||
public WordBagEstimator.Options CharFeatureExtractor; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Public field needs document. In addition, to internalize IFactory please follow the pattern Tom and I have built in this PR --- PR description of #2851 #Resolved
I don't know how Refers to: src/Microsoft.ML.Transforms/Text/TextFeaturizingEstimator.cs:434 in e03104b. [](commit_id = e03104b, deletion_comment = False) |
|
||
public Options() | ||
{ | ||
WordFeatureExtractor = new WordBagEstimator.Options(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WordFeatureExtractor [](start = 16, length = 20)
This will kill the initial value of WordFeatureExtractorFactory
. How about remove new NgramExtractorTransform.NgramExtractorArguments()
in internal INgramExtractorFactoryFactory WordFeatureExtractorFactory = new NgramExtractorTransform.NgramExtractorArguments();
? A similar comment applies to CharFeatureExtractor
.
#Resolved
This PR fixes #2802, fixes #2838 and partially addresses #838,.