Created sample for 'LatentDirichletAllocation' API. #3191
```csharp
// before passing tokens to LatentDirichletAllocation.
var pipeline = mlContext.Transforms.Text.NormalizeText("normText", "Text")
    .Append(mlContext.Transforms.Text.TokenizeIntoWords("Tokens", "normText"))
    .Append(mlContext.Transforms.Text.RemoveStopWords("Tokens"))
```
> RemoveStopWords

Funny, this is a custom stop words remover with no stop words, so it does nothing. I guess we need to remove the `params` from `params string[] stopwords` in `RemoveStopWords`, so that callers have to supply the words. #Resolved
Oops... that's why I was not getting the output I was expecting. :)

In reply to: 271928104
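For context, a sketch of the distinction being discussed, assuming the standard `Microsoft.ML.Transforms.Text` catalog extensions (column names here are from the sample above):

```csharp
using Microsoft.ML;
using Microsoft.ML.Transforms.Text;

var mlContext = new MLContext();

// With no stop words supplied, the custom remover has an empty word
// list and removes nothing:
var noOp = mlContext.Transforms.Text.RemoveStopWords("Tokens");

// Supplying the words explicitly makes the custom remover do real work...
var custom = mlContext.Transforms.Text.RemoveStopWords(
    "Tokens", "Tokens", "a", "the", "of");

// ...or use the built-in, language-specific default stop word list:
var defaults = mlContext.Transforms.Text.RemoveDefaultStopWords(
    "Tokens", "Tokens", StopWordsRemovingEstimator.Language.English);
```

Making `stopwords` a required (non-`params`) argument would rule out the accidental no-op call at compile time.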
```csharp
    .Append(mlContext.Transforms.Text.TokenizeIntoWords("Tokens", "normText"))
    .Append(mlContext.Transforms.Text.RemoveStopWords("Tokens"))
    .Append(mlContext.Transforms.Conversion.MapValueToKey("Tokens"))
    .Append(mlContext.Transforms.Text.ProduceNgrams("Tokens"))
```
> ProduceNgrams

Do we actually want to run LDA on top of 2-grams, since 2 is the default value for `ProduceNgrams`, or should we recommend `ngramLength: 1`? #Resolved
2 is fine, as it backs off to unigrams (`useAllLengths = true`). I think higher is better when there is a lot of data available.

In reply to: 271930804
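A sketch of the two options under discussion, assuming the `ProduceNgrams` signature in the `Microsoft.ML` text catalog (the `"Tokens"` column name is from the sample above and must already be a key-typed column):

```csharp
using Microsoft.ML;

var mlContext = new MLContext();

// Defaults: ngramLength: 2, useAllLengths: true. The output contains
// both unigrams and bigrams, so LDA can "back off" to unigrams when
// bigram counts are sparse.
var withBackoff = mlContext.Transforms.Text.ProduceNgrams(
    "Tokens", ngramLength: 2, useAllLengths: true);

// Unigrams only, if that were the recommendation for small datasets:
var unigramsOnly = mlContext.Transforms.Text.ProduceNgrams(
    "Tokens", ngramLength: 1, useAllLengths: false);
```

With `useAllLengths: true` the bigram setting strictly adds features rather than replacing the unigrams, which is why the default is considered safe here.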
Codecov Report

```diff
@@            Coverage Diff             @@
##           master    #3191      +/-   ##
==========================================
- Coverage    72.6%   72.58%    -0.02%
==========================================
  Files         807      807
  Lines      144957   144957
  Branches    16211    16211
==========================================
- Hits       105240   105215       -25
- Misses      35298    35326       +28
+ Partials     4419     4416        -3
```
Codecov Report

```diff
@@            Coverage Diff             @@
##           master    #3191      +/-   ##
==========================================
- Coverage    72.6%   72.58%    -0.02%
==========================================
  Files         807      807
  Lines      144957   144957
  Branches    16211    16211
==========================================
- Hits       105240   105221       -19
- Misses      35298    35321       +23
+ Partials     4419     4415        -4
```
```csharp
// Create a small dataset as an IEnumerable.
var samples = new List<TextData>()
{
    new TextData(){ Text = "ML.NET's LatentDirichletAllocation API computes topic model." },
```
> topic model

models, with an s #Resolved
```csharp
{
    new TextData(){ Text = "ML.NET's LatentDirichletAllocation API computes topic model." },
    new TextData(){ Text = "ML.NET's LatentDirichletAllocation API is the best for topic model." },
    new TextData(){ Text = "I like to eat broccoli and banana." },
```
> banana

bananas #Resolved
```csharp
    new TextData(){ Text = "ML.NET's LatentDirichletAllocation API computes topic model." },
    new TextData(){ Text = "ML.NET's LatentDirichletAllocation API is the best for topic model." },
    new TextData(){ Text = "I like to eat broccoli and banana." },
    new TextData(){ Text = "I eat a banana in the breakfast." },
```
> in the

for #Resolved
```csharp
// A pipeline for featurizing the text/string using LatentDirichletAllocation API.
// To be more accurate in computing the LDA features, the pipeline first normalizes text and removes stop words
// before passing tokens to LatentDirichletAllocation.
```
> tokens

Suggest "tokens (the individual words, lower cased, with common words removed)". Many people won't be familiar with the specific language of NLP. #Resolved
```csharp
// A pipeline for featurizing the text/string using LatentDirichletAllocation API.
// To be more accurate in computing the LDA features, the pipeline first normalizes text and removes stop words
// before passing tokens to LatentDirichletAllocation.
var pipeline = mlContext.Transforms.Text.NormalizeText("normText", "Text")
```
> normText

I would spell it out for the example. #Resolved
```csharp
var transformer = pipeline.Fit(dataview);

// Create the prediction engine to get the LDA features extracted from the text.
var predictionEngine = mlContext.Model.CreatePredictionEngine<TextData, TransformedTextData>(transformer);
```
> predictionEngine

Similar to the other PR, I wonder if we should stay entirely within IDataView and not create a prediction engine. That is, use a `TakeRows` filter followed by a `CreateEnumerable`.
This is done because some of the text-processing transforms (`NormalizeText`, `TokenizeIntoWords`, etc.) don't need training data. In such cases, a prediction engine seems more appropriate. But we can definitely reach consensus on this; I will follow up.

In reply to: 271981536
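For reference, a sketch of the reviewer's suggested IDataView-only alternative. The names `mlContext`, `transformer`, `dataview`, `TransformedTextData`, and `PrintPredictions` are taken from the sample snippets in this thread; the row count of 2 is an arbitrary illustration:

```csharp
// Stay entirely within IDataView: transform the data, keep only the
// first few rows, then enumerate the results as strongly typed objects
// instead of creating a prediction engine.
var transformedData = transformer.Transform(dataview);
var firstRows = mlContext.Data.TakeRows(transformedData, 2);
var predictions = mlContext.Data.CreateEnumerable<TransformedTextData>(
    firstRows, reuseRowObject: false);

foreach (var prediction in predictions)
    PrintPredictions(prediction);
```

The trade-off: the prediction engine is convenient for one-off, single-example scoring, while the `TakeRows` + `CreateEnumerable` route keeps the sample consistent with batch-oriented IDataView usage.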
Approved with comments.
🚴
```csharp
    //   0.5455 0.1818 0.2727
}

private static void PrintPredictions(TransformedTextData prediction)
```
Suggested change:

```diff
- private static void PrintPredictions(TransformedTextData prediction)
+ private static void PrintLdaFeatures(TransformedTextData prediction)
```

#Resolved
```csharp
    Console.WriteLine();
}

public class TextData
```
Suggested change:

```diff
- public class TextData
+ private class TextData
```

#Resolved
```csharp
    public string Text { get; set; }
}

public class TransformedTextData : TextData
```
Suggested change:

```diff
- public class TransformedTextData : TextData
+ private class TransformedTextData : TextData
```

#Resolved
Related to #1209.