Skip to content

Exposed ngram extraction options in TextFeaturizer #2911

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Mar 13, 2019

Conversation

zeahmed
Copy link
Contributor

@zeahmed zeahmed commented Mar 11, 2019

This PR fixes #2802, fixes #2838 and partially addresses #838,.

@codecov
Copy link

codecov bot commented Mar 11, 2019

Codecov Report

Merging #2911 into master will increase coverage by 0.36%.
The diff coverage is 94.73%.

@@            Coverage Diff             @@
##           master    #2911      +/-   ##
==========================================
+ Coverage   71.82%   72.19%   +0.36%     
==========================================
  Files         812      796      -16     
  Lines      142719   142020     -699     
  Branches    16092    16044      -48     
==========================================
+ Hits       102513   102526      +13     
+ Misses      35827    35115     -712     
  Partials     4379     4379
Flag Coverage Δ
#Debug 72.19% <94.73%> (+0.36%) ⬆️
#production 67.97% <94.23%> (ø) ⬆️
#test 88.31% <100%> (+2.09%) ⬆️
Impacted Files Coverage Δ
...crosoft.ML.Transforms/EntryPoints/TextAnalytics.cs 41.25% <ø> (ø) ⬆️
...rc/Microsoft.ML.Transforms/Text/TextNormalizing.cs 97.11% <ø> (ø) ⬆️
...icrosoft.ML.Functional.Tests/DataTransformation.cs 100% <100%> (ø) ⬆️
...soft.ML.Transforms/Text/WrappedTextTransformers.cs 95.28% <100%> (+0.38%) ⬆️
...crosoft.ML.Core.Tests/UnitTests/TestEntryPoints.cs 98.08% <100%> (-0.01%) ⬇️
...oft.ML.Transforms/Text/TextFeaturizingEstimator.cs 88.7% <93.18%> (-0.31%) ⬇️
src/Microsoft.ML.Maml/MAML.cs 24.75% <0%> (-1.46%) ⬇️
...soft.ML.Data/DataLoadSave/Text/TextLoaderCursor.cs 84.7% <0%> (-0.21%) ⬇️
...icrosoft.ML.Mkl.Components/MklComponentsCatalog.cs 94.28% <0%> (ø) ⬆️
.../Microsoft.ML.Tests/TrainerEstimators/SdcaTests.cs 100% <0%> (ø) ⬆️
... and 18 more

/// The maximum number of grams to store in the dictionary, for each level of ngrams,
/// from 1 (in position 0) up to ngramLength (in position ngramLength-1)
/// </summary>
public int[] MaxNumTerms;
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Mar 11, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MaxNumTerms [](start = 25, length = 11)

@wschin calls it maximumNgramsCount would be nice to be consistent. #Resolved

@Ivanidzo4ka
Copy link
Contributor

Ivanidzo4ka commented Mar 11, 2019

        // TODO #2802: Update FeaturizeText to allow specifications of word-grams and char-grams.

Can be removed? #Resolved


Refers to: test/Microsoft.ML.Functional.Tests/DataTransformation.cs:138 in d3e2841. [](commit_id = d3e2841, deletion_comment = False)

@Ivanidzo4ka
Copy link
Contributor

Ivanidzo4ka commented Mar 11, 2019

        public bool UsePredefinedStopWordRemover = false;

So originally we had Factory for StopWordRemover, which I think I change to this boolean field.
Would be nice to bring that stuff back, but it's out of scope of this PR.
Should I create separate issue or you think you can do it in this PR? #Resolved


Refers to: src/Microsoft.ML.Transforms/Text/TextFeaturizingEstimator.cs:100 in d3e2841. [](commit_id = d3e2841, deletion_comment = False)

{
[Argument(ArgumentType.Required, HelpText = "New column definition (optional form: name:srcs).", Name = "Column", ShortName = "col", SortOrder = 1)]
public Column Columns;
internal Column Columns;

[Argument(ArgumentType.AtMostOnce, HelpText = "Dataset language or 'AutoDetect' to detect language per row.", ShortName = "lang", SortOrder = 3)]
public Language Language = DefaultLanguage;
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Mar 11, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Language [](start = 28, length = 8)

Not related to your PR, but whole point of this thing is to set or autodetect language, we left autodetect outside of ML.NET so it's looks completely unnecessary here, but as I said in other comment, this is out of scope of this PR. #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, let use this issue for tracking this #838.


In reply to: 264433503 [](ancestors = 264433503)

Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

public Options()
{
WordFeatureExtractor = new WordBagEstimator.Options();
CharFeatureExtractor = new WordBagEstimator.Options() { NgramLength = 3, AllLengths = false };
Copy link
Member

@ganik ganik Mar 11, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't these fields initialized to null by default? And then fallback to WordFeatureExtractorFactory / CharFeatureExtractorFactory whenever you try to use them and they are not defined. This way code flow from entrypoint / maml will be same as with API. #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WordFeatureExtractor and CharFeatureExtractor both can be null. When they are null its means do not apply the respective transforms. So, there is no way to detect if null was set by user or its a default value.

Right now, WordFeatureExtractorFactory / CharFeatureExtractorFactory is preferred if the maml is used.


In reply to: 264465314 [](ancestors = 264465314)

[Argument(ArgumentType.Multiple, HelpText = "Ngram feature extractor to use for characters (WordBag/WordHashBag).", ShortName = "charExtractor", NullName = "<None>", SortOrder = 12)]
public INgramExtractorFactoryFactory CharFeatureExtractor = new NgramExtractorTransform.NgramExtractorArguments() { NgramLength = 3, AllLengths = false };
[Argument(ArgumentType.Multiple, Name = "CharFeatureExtractor", HelpText = "Ngram feature extractor to use for characters (WordBag/WordHashBag).", ShortName = "charExtractor", NullName = "<None>", SortOrder = 12)]
internal INgramExtractorFactoryFactory CharFeatureExtractorFactory = new NgramExtractorTransform.NgramExtractorArguments() { NgramLength = 3, AllLengths = false };
Copy link
Member

@ganik ganik Mar 11, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

INgramExtractorFactoryFactory [](start = 21, length = 29)

FactoryFactory :), is it possible just have name INgramExtractorFactory
#Resolved

Copy link
Contributor Author

@zeahmed zeahmed Mar 11, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point but we already have a class name INgramExtractorFactory

internal interface INgramExtractorFactory

and there is a relation between INgramExtractorFactory and INgramExtractorFactoryFactory. I leave it for now. Maybe you can create an issue to track it if you think this needs to be renamed?


In reply to: 264465575 [](ancestors = 264465575)

[Argument(ArgumentType.Multiple, HelpText = "Ngram feature extractor to use for words (WordBag/WordHashBag).", ShortName = "wordExtractor", NullName = "<None>", SortOrder = 11)]
public INgramExtractorFactoryFactory WordFeatureExtractor = new NgramExtractorTransform.NgramExtractorArguments();
[Argument(ArgumentType.Multiple, Name = "WordFeatureExtractor", HelpText = "Ngram feature extractor to use for words (WordBag/WordHashBag).", ShortName = "wordExtractor", NullName = "<None>", SortOrder = 11)]
internal INgramExtractorFactoryFactory WordFeatureExtractorFactory = new NgramExtractorTransform.NgramExtractorArguments();
Copy link
Member

@ganik ganik Mar 11, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

INgramExtractorFactoryFactory [](start = 21, length = 29)

same here FactoryFactory :) #Resolved

@zeahmed
Copy link
Contributor Author

zeahmed commented Mar 11, 2019

        public bool UsePredefinedStopWordRemover = false;

Yes, let use this issue for tracking this #838.


In reply to: 471732391 [](ancestors = 471732391)


Refers to: src/Microsoft.ML.Transforms/Text/TextFeaturizingEstimator.cs:100 in d3e2841. [](commit_id = d3e2841, deletion_comment = False)

[Argument(ArgumentType.Multiple, Name = "WordFeatureExtractor", HelpText = "Ngram feature extractor to use for words (WordBag/WordHashBag).", ShortName = "wordExtractor", NullName = "<None>", SortOrder = 11)]
internal INgramExtractorFactoryFactory WordFeatureExtractorFactory = new NgramExtractorTransform.NgramExtractorArguments();

public WordBagEstimator.Options WordFeatureExtractor;

Copy link
Member

@wschin wschin Mar 12, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Public field needs document. In addition, to internalize IFactory please follow the pattern Tom and I have built in this PR --- PR description of #2851 #Resolved

[Argument(ArgumentType.Multiple, Name = "CharFeatureExtractor", HelpText = "Ngram feature extractor to use for characters (WordBag/WordHashBag).", ShortName = "charExtractor", NullName = "<None>", SortOrder = 12)]
internal INgramExtractorFactoryFactory CharFeatureExtractorFactory = new NgramExtractorTransform.NgramExtractorArguments() { NgramLength = 3, AllLengths = false };

public WordBagEstimator.Options CharFeatureExtractor;

Copy link
Member

@wschin wschin Mar 12, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Public field needs document. In addition, to internalize IFactory please follow the pattern Tom and I have built in this PR --- PR description of #2851 #Resolved

@wschin
Copy link
Member

wschin commented Mar 12, 2019

I don't know how CharFeatureExtractor can affect this region because there is no change. Can you make user's CharFeatureExtractor can finally affect view? #Resolved


Refers to: src/Microsoft.ML.Transforms/Text/TextFeaturizingEstimator.cs:434 in e03104b. [](commit_id = e03104b, deletion_comment = False)


public Options()
{
WordFeatureExtractor = new WordBagEstimator.Options();
Copy link
Member

@wschin wschin Mar 12, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WordFeatureExtractor [](start = 16, length = 20)

This will kill the initial value of WordFeatureExtractorFactory. How about remove new NgramExtractorTransform.NgramExtractorArguments() in internal INgramExtractorFactoryFactory WordFeatureExtractorFactory = new NgramExtractorTransform.NgramExtractorArguments();? A similar comment applies to CharFeatureExtractor.
#Resolved

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
4 participants