Created samples for TokenizeIntoWords and RemoveStopWords APIs. #3156

zeahmed · 2019-04-01T19:50:40Z

Related to #1209.

codecov · 2019-04-01T20:31:46Z

Codecov Report

Merging #3156 into master will increase coverage by <.01%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #3156      +/-   ##
==========================================
+ Coverage   72.53%   72.53%   +<.01%     
==========================================
  Files         808      807       -1     
  Lines      144775   144774       -1     
  Branches    16209    16208       -1     
==========================================
+ Hits       105012   105018       +6     
+ Misses      35348    35341       -7     
  Partials     4415     4415

Flag	Coverage Δ
#Debug	`72.53% <ø> (ø)`	⬆️
#production	`68.12% <ø> (ø)`	⬆️
#test	`88.82% <ø> (ø)`	⬆️

Impacted Files	Coverage Δ
src/Microsoft.ML.Transforms/Text/TextCatalog.cs	`41.66% <ø> (ø)`	⬆️
...ML.Transforms/Text/StopWordsRemovingTransformer.cs	`86.1% <0%> (-0.16%)`	⬇️
src/Microsoft.ML.Data/Transforms/Normalizer.cs	`86.03% <0%> (ø)`	⬆️
...icrosoft.ML.Functional.Tests/DataTransformation.cs	`100% <0%> (ø)`	⬆️
...s/Api/CookbookSamples/CookbookSamplesDynamicApi.cs	`93.49% <0%> (ø)`	⬆️
...s/Scenarios/Api/CookbookSamples/CookbookSamples.cs	`99.49% <0%> (ø)`	⬆️
test/Microsoft.ML.Functional.Tests/Training.cs	`100% <0%> (ø)`	⬆️
...est/Microsoft.ML.Tests/FeatureContributionTests.cs	`98.55% <0%> (ø)`	⬆️
...oft.ML.Experimental/TransformsCatalogExtensions.cs
...c/Microsoft.ML.FastTree/Utils/ThreadTaskManager.cs	`100% <0%> (+20.51%)`	⬆️
... and 1 more

Ivanidzo4ka · 2019-04-01T22:53:45Z

docs/samples/Microsoft.ML.Samples/Dynamic/Transforms/Text/RemoveDefaultStopWords.cs

+            var mlContext = new MLContext();
+
+            // Create an empty data sample list. The 'RemoveDefaultStopWords' does not require training data as
+            // the estimator ('StopWordsRemovingEstimator') created by 'RemoveDefaultStopWords' API is not a trainable estimator.


StopWordsRemovingEstimator [](start = 31, length = 26)

Any reason why you prefer this format over <see cref="StopWordsRemovingEstimator"/>? #Pending

I doubt it works in normal comments; it works in XML comments.

In reply to: 271080226 [](ancestors = 271080226)
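
The distinction raised in this thread can be illustrated with a minimal sketch. `HypotheticalEstimator` and `CommentStylesSketch` are made-up names standing in for the real types; the only point is that `cref` is processed inside `///` documentation comments, while inside a regular `//` comment it stays plain text.

```csharp
using System;

/// <summary>
/// Stand-in type for this sketch only (hypothetical name).
/// </summary>
public class HypotheticalEstimator { }

public static class CommentStylesSketch
{
    /// <summary>
    /// XML doc comment: the compiler and doc tooling resolve
    /// <see cref="HypotheticalEstimator"/> and can render it as a link.
    /// </summary>
    public static void Main()
    {
        // Regular comment: <see cref="HypotheticalEstimator"/> is just text here,
        // so quoting the name ('HypotheticalEstimator') reads more naturally.
        Console.WriteLine(nameof(HypotheticalEstimator));
    }
}
```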

rogancarr · 2019-04-01T23:51:19Z

docs/samples/Microsoft.ML.Samples/Dynamic/Transforms/Text/RemoveDefaultStopWords.cs

+            // as well as the source of randomness.
+            var mlContext = new MLContext();
+
+            // Create an empty data sample list. The 'RemoveDefaultStopWords' does not require training data as


data sample list. [](start = 30, length = 18)

Create an empty list as the dataset. #Resolved

rogancarr · 2019-04-01T23:54:02Z

docs/samples/Microsoft.ML.Samples/Dynamic/Transforms/Text/TokenizeIntoWords.cs

+            var emptyDataView = mlContext.Data.LoadFromEnumerable(emptySamples);
+
+            // A pipeline for converting text into vector of words.
+            var textPipeline = mlContext.Transforms.Text.TokenizeIntoWords("Words", "Text", separators: new[] { ' ' });


separators: new[] { ' ' } [](start = 92, length = 25)

Can you add details about what to expect here with default values? One thing we may want to mention is removing any whitespace. #Resolved
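
One way to answer this in the sample is to tokenize the same text twice, once with default separators and once with an explicit set, and print both results. The sketch below assumes the `Microsoft.ML` package is referenced; the class names, the column names `DefaultWords` and `CustomWords`, the separator set, and the input sentence are illustrative and not taken from the PR.

```csharp
using System;
using System.Collections.Generic;
using Microsoft.ML;

public static class TokenizeSeparatorsSketch
{
    private class TextData
    {
        public string Text { get; set; }
    }

    private class TokenizedText : TextData
    {
        public string[] DefaultWords { get; set; }
        public string[] CustomWords { get; set; }
    }

    public static void Main()
    {
        var mlContext = new MLContext();

        // The tokenizer is not trainable, so an empty list is enough to fit it.
        var emptyDataView = mlContext.Data.LoadFromEnumerable(new List<TextData>());

        // "DefaultWords" relies on the default separators;
        // "CustomWords" passes an explicit (illustrative) separator set.
        var pipeline = mlContext.Transforms.Text.TokenizeIntoWords("DefaultWords", "Text")
            .Append(mlContext.Transforms.Text.TokenizeIntoWords(
                "CustomWords", "Text", separators: new[] { ' ', ',', '.' }));

        var transformer = pipeline.Fit(emptyDataView);
        var engine = mlContext.Model
            .CreatePredictionEngine<TextData, TokenizedText>(transformer);

        // Repeated spaces and punctuation make the difference visible.
        var prediction = engine.Predict(
            new TextData { Text = "Hello,  world. How are you?" });

        Console.WriteLine(
            $"Default separators: [{string.Join("|", prediction.DefaultWords)}]");
        Console.WriteLine(
            $"Custom separators:  [{string.Join("|", prediction.CustomWords)}]");
    }
}
```

Printing the tokens joined with `|` shows how repeated whitespace and punctuation are handled in each configuration, without the comment having to assert it.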

rogancarr

Approved with comments.

Ivanidzo4ka

Ivanidzo4ka · 2019-04-02T20:11:32Z

src/Microsoft.ML.Transforms/Text/TextCatalog.cs

@@ -254,8 +261,9 @@ public static class TextCatalog
        /// <example>
        /// <format type="text/markdown">
        /// <![CDATA[
-        ///  [!code-csharp[FastTree](~/../docs/samples/docs/samples/Microsoft.ML.Samples/Dynamic/StopWordRemoverTransform.cs)]
-        /// ]]></format>
+        /// [!code-csharp[RemoveStopWords](~/../docs/samples/docs/samples/Microsoft.ML.Samples/Dynamic/Transforms/Text/RemoveDefaultStopWords.cs)]


RemoveStopWords [](start = 26, length = 15)

RemoveDefaultStopWords
Not sure whether it is important or not. #Resolved

Ivanidzo4ka · 2019-04-02T20:13:59Z

docs/samples/Microsoft.ML.Samples/Dynamic/Transforms/Text/RemoveStopWords.cs

+            // A pipeline for removing stop words from input text/string.
+            // The pipeline first tokenizes text into words then removes stop words.
+            var textPipeline = mlContext.Transforms.Text.TokenizeIntoWords("Words", "Text")
+                .Append(mlContext.Transforms.Text.RemoveStopWords("WordsWithoutStopWords", "Words", stopwords: new[] { "a", "the", "from", "by" }));


from [](start = 132, length = 4)

I would add a few words regarding casing.
Imagine you have the string "Something CoOl here" and a stop word remover with "cool": would it remove the word? Would it preserve it?
It's a mystery now, but you can show how it works in this sample. #Resolved
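
A minimal sketch of how a sample could expose the casing behavior, using the reviewer's own string and stop word. It assumes the `Microsoft.ML` package is referenced; the class names are illustrative. The sketch deliberately prints the result rather than stating it, so the output itself documents whether matching ignores case.

```csharp
using System;
using System.Collections.Generic;
using Microsoft.ML;

public static class CustomStopWordsCasingSketch
{
    private class TextData
    {
        public string Text { get; set; }
    }

    private class TransformedTextData : TextData
    {
        public string[] WordsWithoutStopWords { get; set; }
    }

    public static void Main()
    {
        var mlContext = new MLContext();

        // The stop word remover is not trainable; an empty list is enough.
        var emptyDataView = mlContext.Data.LoadFromEnumerable(new List<TextData>());

        // Tokenize first, then remove the single custom stop word "cool".
        var pipeline = mlContext.Transforms.Text.TokenizeIntoWords("Words", "Text")
            .Append(mlContext.Transforms.Text.RemoveStopWords(
                "WordsWithoutStopWords", "Words", stopwords: new[] { "cool" }));

        var transformer = pipeline.Fit(emptyDataView);
        var engine = mlContext.Model
            .CreatePredictionEngine<TextData, TransformedTextData>(transformer);

        // The input spells the stop word with mixed casing ("CoOl").
        var prediction = engine.Predict(new TextData { Text = "Something CoOl here" });

        // If "CoOl" is missing from the printed list, matching ignored case.
        Console.WriteLine(string.Join(",", prediction.WordsWithoutStopWords));
    }
}
```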

Ivanidzo4ka · 2019-04-02T20:15:27Z

docs/samples/Microsoft.ML.Samples/Dynamic/Transforms/Text/RemoveDefaultStopWords.cs

+            var emptyDataView = mlContext.Data.LoadFromEnumerable(emptySamples);
+
+            // A pipeline for removing stop words from input text/string.
+            // The pipeline first tokenizes text into words then removes stop words.


removes [](start = 65, length = 7)

Shall we add a link to the list of stop words?
https://github.com/dotnet/machinelearning/blob/master/src/Microsoft.ML.Transforms/Text/StopWords/English.txt is the one for English. #Resolved

Also, it would be nice if you could modify the string to include "tHe" or something like that and show that it was removed (because we compare ignoring casing, I hope).

In reply to: 271477611 [](ancestors = 271477611)

I think the link should go into the documentation instead of sample.

In reply to: 271478137 [](ancestors = 271478137,271477611)
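
The "tHe" suggestion could look roughly like the following for the default English list. This is a sketch assuming the `Microsoft.ML` package is referenced; the class names and the input sentence are illustrative. It prints the tokens before and after removal so a reader can compare them directly.

```csharp
using System;
using System.Collections.Generic;
using Microsoft.ML;

public static class DefaultStopWordsCasingSketch
{
    private class TextData
    {
        public string Text { get; set; }
    }

    private class TransformedTextData : TextData
    {
        public string[] Words { get; set; }
        public string[] WordsWithoutStopWords { get; set; }
    }

    public static void Main()
    {
        var mlContext = new MLContext();
        var emptyDataView = mlContext.Data.LoadFromEnumerable(new List<TextData>());

        // RemoveDefaultStopWords uses a built-in list; English is its default language.
        var pipeline = mlContext.Transforms.Text.TokenizeIntoWords("Words", "Text")
            .Append(mlContext.Transforms.Text.RemoveDefaultStopWords(
                "WordsWithoutStopWords", "Words"));

        var transformer = pipeline.Fit(emptyDataView);
        var engine = mlContext.Model
            .CreatePredictionEngine<TextData, TransformedTextData>(transformer);

        // "tHe" is written with mixed casing on purpose.
        var prediction = engine.Predict(
            new TextData { Text = "This sample removes tHe default stop words from a sentence" });

        Console.WriteLine($"Before: {string.Join(",", prediction.Words)}");
        Console.WriteLine($"After:  {string.Join(",", prediction.WordsWithoutStopWords)}");
    }
}
```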

Ivanidzo4ka · 2019-04-02T20:19:28Z

        var data = new TextData() { Text = "This is a great product. I would like to buy it again."  };

It's me, your friendly casing neighbor.
In my example you can see that xbox got treated even if we specify XBOX in the file.

Refers to: docs/samples/Microsoft.ML.Samples/Dynamic/Transforms/Text/ApplyCustomWordEmbedding.cs:54 in 1bc241d. [](commit_id = 1bc241d, deletion_comment = False)

zeahmed · 2019-04-02T22:06:23Z

Thanks!

zeahmed added 5 commits March 29, 2019 12:38

Created sample for 'ApplyWordEmbedding' API.

15be23c

Addressed reviewers' comments.

58e2d4b

Deleted old embedding sample.

a3ec5d3

Created samples for TokenizeIntoWords and RemoveStopWords APIs.

64ff946

Merge remote-tracking branch 'upstream/master' into TokenizeIntoWords

9241a21

zeahmed requested review from shmoradims, singlis, sfilipi and rogancarr April 1, 2019 19:50

sfilipi requested a review from natke April 1, 2019 21:09

Ivanidzo4ka reviewed Apr 1, 2019

rogancarr reviewed Apr 1, 2019

rogancarr approved these changes Apr 1, 2019

zeahmed added 2 commits April 1, 2019 17:18

Merge remote-tracking branch 'upstream/master' into TokenizeIntoWords

ad6f967

Addressed reviewers' comments.

1bc241d

sfilipi mentioned this pull request Apr 2, 2019

API reference - Samples for Transforms #1209

Closed

Ivanidzo4ka approved these changes Apr 2, 2019

Ivanidzo4ka reviewed Apr 2, 2019

Addressed reviewers' comments.

672ade6

zeahmed merged commit 950f133 into dotnet:master Apr 2, 2019

zeahmed added a commit to zeahmed/machinelearning that referenced this pull request Apr 8, 2019

Created samples for TokenizeIntoWords and RemoveStopWords APIs. (dot…

a4a82de

…net#3156)

zeahmed mentioned this pull request Apr 8, 2019

Cherry pick for samples (Text) #3240

Closed

ghost locked as resolved and limited conversation to collaborators Mar 23, 2022
