-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Made 'StopWordsRemover' in TextFeaturizer configurable again. #2962
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## master #2962 +/- ##
==========================================
- Coverage 72.26% 72.25% -0.02%
==========================================
Files 802 800 -2
Lines 142901 142894 -7
Branches 16134 16137 +3
==========================================
- Hits 103272 103245 -27
- Misses 35210 35228 +18
- Partials 4419 4421 +2
|
@@ -96,8 +129,42 @@ public sealed class Options : TransformInputBase | |||
[Argument(ArgumentType.AtMostOnce, HelpText = "Dataset language or 'AutoDetect' to detect language per row.", ShortName = "lang", SortOrder = 3)] | |||
public Language Language = DefaultLanguage; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Language [](start = 28, length = 8)
it needs to be part of StopWordsRemovingTransformer #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think Language
parameter is used by a couple components in TextTransform e.g. it used by tokenizer also. I know currently we are using very naïve whitespace tokenizer which does not use it. However, in future if we develop a sophisticated tokenizer for different language then this parameter will be needed. What do you say?
In reply to: 265808271 [](ancestors = 265808271)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know what kind of future lays ahead of us, but for now, this option just look silly.
In reply to: 265812513 [](ancestors = 265812513,265808271)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm...so you think that it should not be here? it should instead be in PredefinedStopWordsRemoverOptions
?
In reply to: 266564835 [](ancestors = 266564835,265812513,265808271)
/// <summary> | ||
/// Defines the different type of stop words remover supported. | ||
/// </summary> | ||
public enum StopWordsRemoverType |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
StopWordsRemoverType [](start = 20, length = 20)
I think it would be much cleaner if we just introduce IStopWordRemover interface and inherit both stop word removers from it and make it part of public interface. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interface make it difficult to find out different type of StopWordsRemover
in the system at API level. When user will use this enum they immediately know that what type of StopWordsRemover
are available and how to use them.
In reply to: 265808886 [](ancestors = 265808886)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have done what you proposed for the interface thing.
In reply to: 265811781 [](ancestors = 265811781,265808886)
/// <summary> | ||
/// List of stop words. It is used when <see cref="StopWordsRemover"/> is equal to <see cref="StopWordsRemoverType.UseCustomStopWordsRemover"/>. | ||
/// </summary> | ||
public string[] StopWords; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
string [](start = 19, length = 6)
Why not IEnumerable<string>
? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I doubt that EntryPoint will break because of this. The actual interface is still using string[]
machinelearning/src/Microsoft.ML.Transforms/Text/StopWordsRemovingTransformer.cs
Line 643 in 91a8703
public string[] Stopwords; |
I have mentioned that in the comments in the my PR. I am waiting for the people to comment on it. So that I can decide what to do.
In reply to: 266099423 [](ancestors = 266099423)
Could you add a test with a custom list of stop words? #Resolved |
/// <summary> | ||
/// Use stop words remover that can removes language-specific list of stop words (most common words) already defined in the system. | ||
/// </summary> | ||
public sealed class CustomStopWordsRemoverOptions : IStopWordsRemoverOptions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CustomStopWordsRemoverOptions [](start = 28, length = 29)
Why this interface and class is part of text transform and not part of StopWordRemover itself? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My question is why it's not implemented same way as word ngram and char ngram implemented?
In reply to: 266565265 [](ancestors = 266565265)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It can be part of StopWordRemover. I will move it.
Word ngram and char ngram are do different options that can be exclusively applied in Text transform (two parameter for two different options). On the contrary, only one type of StopWordRemover can be applied (one parameter to take two different options). Any idea if it can be done better than this?
In reply to: 266565861 [](ancestors = 266565861,266565265)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks all for the useful feedback! |
This PR addresses the StopWordsRemover issue in #838
This PR make available two options to remove stop words.
string[]
is exposed to collect that list.No support is added to load stop words from file. I think if we make No.2 take list as
IEnumerable<string>
instead ofstring[]
then user can load list on their own. Let me know if reviewers have any thought.