Created samples for TokenizeIntoWords and RemoveStopWords APIs. #3156
Codecov Report
@@            Coverage Diff             @@
##           master    #3156     +/-   ##
==========================================
+ Coverage   72.53%   72.53%   +<.01%
==========================================
  Files         808      807       -1
  Lines      144775   144774       -1
  Branches    16209    16208       -1
==========================================
+ Hits       105012   105018       +6
+ Misses      35348    35341       -7
  Partials     4415     4415
var mlContext = new MLContext();

// Create an empty data sample list. The 'RemoveDefaultStopWords' does not require training data as
// the estimator ('StopWordsRemovingEstimator') created by 'RemoveDefaultStopWords' API is not a trainable estimator.
StopWordsRemovingEstimator
Any reason why you prefer this format over <see cref="StopWordsRemovingEstimator">? #Pending
I doubt it works in normal comments; it works in XML comments.
In reply to: 271080226
// as well as the source of randomness.
var mlContext = new MLContext();

// Create an empty data sample list. The 'RemoveDefaultStopWords' does not require training data as
data sample list.
Create an empty list as the dataset. #Resolved
var emptyDataView = mlContext.Data.LoadFromEnumerable(emptySamples);

// A pipeline for converting text into vector of words.
var textPipeline = mlContext.Transforms.Text.TokenizeIntoWords("Words", "Text", separators: new[] { ' ' });
separators: new[] { ' ' }
Can you add details about what to expect here with default values? One thing we may want to mention is removing any whitespace. #Resolved
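To make the question above concrete, here is a minimal sketch of how separator handling changes the tokens. It uses plain `string.Split`, not the ML.NET transform; `TokenizeSketch` and `Tokenize` are hypothetical names, and whether `TokenizeIntoWords` drops empty tokens in the same way is an assumption the sample itself should confirm.

```csharp
using System;

public static class TokenizeSketch
{
    // Hypothetical helper (not part of ML.NET): splits 'text' on the given
    // separators and drops the empty tokens produced by repeated separators.
    public static string[] Tokenize(string text, params char[] separators)
    {
        return text.Split(separators, StringSplitOptions.RemoveEmptyEntries);
    }
}
```

For example, `TokenizeSketch.Tokenize("ML.NET  is   cool", ' ')` yields three tokens even though the input contains runs of spaces; without `RemoveEmptyEntries` the result would also contain empty strings.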
Approved with comments.
@@ -254,8 +261,9 @@ public static class TextCatalog
/// <example>
/// <format type="text/markdown">
/// <![CDATA[
/// ]]></format>
/// [!code-csharp[RemoveStopWords](~/../docs/samples/docs/samples/Microsoft.ML.Samples/Dynamic/Transforms/Text/RemoveDefaultStopWords.cs)]
RemoveStopWords
This should be RemoveDefaultStopWords. Not sure whether it is important. #Resolved
// A pipeline for removing stop words from input text/string.
// The pipeline first tokenizes text into words then removes stop words.
var textPipeline = mlContext.Transforms.Text.TokenizeIntoWords("Words", "Text")
    .Append(mlContext.Transforms.Text.RemoveStopWords("WordsWithoutStopWords", "Words", stopwords: new[] { "a", "the", "from", "by" }));
from
I would add a few words regarding casing.
Imagine you have the string "Something CoOl here" and a stop word remover with "cool": would it remove the word or preserve it?
It's a mystery now, but you can show how it works in this sample. #Resolved
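The two possible outcomes can be sketched in plain C#. This is only an illustration of case-sensitive versus case-insensitive matching, not ML.NET's implementation; `StopWordSketch`, its `RemoveStopWords` method, and the `ignoreCase` flag are hypothetical names introduced here.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class StopWordSketch
{
    // Hypothetical helper (not the ML.NET API): removes every token that
    // matches a stop word, comparing with or without regard to case.
    public static string[] RemoveStopWords(
        IEnumerable<string> tokens, IEnumerable<string> stopWords, bool ignoreCase)
    {
        var comparer = ignoreCase ? StringComparer.OrdinalIgnoreCase : StringComparer.Ordinal;
        var stopWordSet = new HashSet<string>(stopWords, comparer);
        return tokens.Where(token => !stopWordSet.Contains(token)).ToArray();
    }
}
```

With the tokens "Something", "CoOl", "here" and the stop word "cool", `ignoreCase: true` leaves "Something" and "here", while `ignoreCase: false` leaves all three tokens. Showing which of these the real transform produces is what the sample would settle.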
var emptyDataView = mlContext.Data.LoadFromEnumerable(emptySamples);

// A pipeline for removing stop words from input text/string.
// The pipeline first tokenizes text into words then removes stop words.
removes
Shall we add a link to the list of stop words?
https://github.com/dotnet/machinelearning/blob/master/src/Microsoft.ML.Transforms/Text/StopWords/English.txt is the one for English. #Resolved
Also, it would be nice if you could modify the string to include "tHe" or something similar and show that it was removed (because we compare ignoring case, I hope).
In reply to: 271477611
I think the link should go into the documentation instead of sample.
In reply to: 271478137
It's me, your friendly casing neighbor. Refers to: docs/samples/Microsoft.ML.Samples/Dynamic/Transforms/Text/ApplyCustomWordEmbedding.cs:54 in 1bc241d.
Thanks!
Related to #1209.