Created sample for 'LatentDirichletAllocation' API. #3191


Merged
merged 5 commits into dotnet:master on Apr 5, 2019

Conversation

zeahmed
Contributor

@zeahmed zeahmed commented Apr 3, 2019

Related to #1209.

// before passing tokens to LatentDirichletAllocation.
var pipeline = mlContext.Transforms.Text.NormalizeText("normText", "Text")
.Append(mlContext.Transforms.Text.TokenizeIntoWords("Tokens", "normText"))
.Append(mlContext.Transforms.Text.RemoveStopWords("Tokens"))
Contributor

@Ivanidzo4ka Ivanidzo4ka Apr 3, 2019


RemoveStopWords

Funny, this is a custom stop-words remover with no stop words, so it does nothing.

I guess we need to remove the argument to `params string[] stopwords` in this `RemoveStopWords` call. #Resolved
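For context, a minimal sketch of the two remover overloads involved (assuming the `TextCatalog` API of this era): `RemoveDefaultStopWords` applies a built-in language-specific list, while `RemoveStopWords` removes only the words the caller supplies, so with an empty `stopwords` array it is a no-op.

```csharp
// Built-in English stop-word list; this is what the sample wants.
var withDefaults = mlContext.Transforms.Text
    .RemoveDefaultStopWords("Tokens");

// Custom remover: removes only the supplied words. With an empty
// `stopwords` array this estimator removes nothing.
var withCustom = mlContext.Transforms.Text
    .RemoveStopWords("Tokens", stopwords: new[] { "a", "the", "is" });
```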

Contributor Author


Oops... that's why I was not getting the output I was expecting... :)


In reply to: 271928104

.Append(mlContext.Transforms.Text.TokenizeIntoWords("Tokens", "normText"))
.Append(mlContext.Transforms.Text.RemoveStopWords("Tokens"))
.Append(mlContext.Transforms.Conversion.MapValueToKey("Tokens"))
.Append(mlContext.Transforms.Text.ProduceNgrams("Tokens"))
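Assembled end to end, the featurization chain under review could be sketched as below (column names and `numberOfTopics: 3` are illustrative choices, not necessarily the final sample's values):

```csharp
var pipeline = mlContext.Transforms.Text
    // Lower-case and clean the raw text.
    .NormalizeText("NormalizedText", "Text")
    // Split into individual words.
    .Append(mlContext.Transforms.Text.TokenizeIntoWords("Tokens", "NormalizedText"))
    // Drop common words ("the", "and", ...).
    .Append(mlContext.Transforms.Text.RemoveDefaultStopWords("Tokens"))
    // Map each word to a key (vocabulary index).
    .Append(mlContext.Transforms.Conversion.MapValueToKey("Tokens"))
    // Count n-gram occurrences over the keys.
    .Append(mlContext.Transforms.Text.ProduceNgrams("Tokens"))
    // LDA turns the n-gram counts into a topic-weight vector.
    .Append(mlContext.Transforms.Text.LatentDirichletAllocation(
        "Features", "Tokens", numberOfTopics: 3));
```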
Contributor

@Ivanidzo4ka Ivanidzo4ka Apr 3, 2019


ProduceNgrams

Do we actually want to run LDA on top of bigrams, since 2 is the default value for ProduceNgrams, or should we recommend `ngramLength: 1`? #Resolved

Contributor Author


2 is fine as it backs off to unigrams (`useAllLengths: true`). I think higher is better when a lot of data is available.
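The trade-off can be made explicit in the call (a sketch; `ngramLength` and `useAllLengths` are the relevant `ProduceNgrams` parameters):

```csharp
// Default: bigrams that back off to unigrams, so single words
// are still counted alongside two-word sequences.
var bigrams = mlContext.Transforms.Text
    .ProduceNgrams("Tokens", ngramLength: 2, useAllLengths: true);

// Unigrams only, which may suit very small datasets.
var unigrams = mlContext.Transforms.Text
    .ProduceNgrams("Tokens", ngramLength: 1);
```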


In reply to: 271930804


codecov bot commented Apr 3, 2019

Codecov Report

Merging #3191 into master will decrease coverage by 0.01%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #3191      +/-   ##
==========================================
- Coverage    72.6%   72.58%   -0.02%     
==========================================
  Files         807      807              
  Lines      144957   144957              
  Branches    16211    16211              
==========================================
- Hits       105240   105215      -25     
- Misses      35298    35326      +28     
+ Partials     4419     4416       -3
Flag Coverage Δ
#Debug 72.58% <ø> (-0.02%) ⬇️
#production 68.14% <ø> (-0.03%) ⬇️
#test 88.88% <ø> (ø) ⬆️
Impacted Files Coverage Δ
src/Microsoft.ML.Transforms/Text/TextCatalog.cs 41.66% <ø> (ø) ⬆️
src/Microsoft.ML.Core/Data/ProgressReporter.cs 70.95% <0%> (-6.99%) ⬇️
src/Microsoft.ML.Maml/MAML.cs 24.75% <0%> (-1.46%) ⬇️
src/Microsoft.ML.Transforms/Text/LdaTransform.cs 89.26% <0%> (-0.63%) ⬇️
...ML.Transforms/Text/StopWordsRemovingTransformer.cs 86.26% <0%> (+0.15%) ⬆️


// Create a small dataset as an IEnumerable.
var samples = new List<TextData>()
{
new TextData(){ Text = "ML.NET's LatentDirichletAllocation API computes topic model." },
Contributor

@rogancarr rogancarr Apr 4, 2019


topic model

models with an s #Resolved

{
new TextData(){ Text = "ML.NET's LatentDirichletAllocation API computes topic model." },
new TextData(){ Text = "ML.NET's LatentDirichletAllocation API is the best for topic model." },
new TextData(){ Text = "I like to eat broccoli and banana." },
Contributor

@rogancarr rogancarr Apr 4, 2019


banana

bananas #Resolved

new TextData(){ Text = "ML.NET's LatentDirichletAllocation API computes topic model." },
new TextData(){ Text = "ML.NET's LatentDirichletAllocation API is the best for topic model." },
new TextData(){ Text = "I like to eat broccoli and banana." },
new TextData(){ Text = "I eat a banana in the breakfast." },
Contributor

@rogancarr rogancarr Apr 4, 2019


in the

for #Resolved


// A pipeline for featurizing the text/string using LatentDirichletAllocation API.
// To be more accurate in computing the LDA features, the pipeline first normalizes text and removes stop words
// before passing tokens to LatentDirichletAllocation.
Contributor

@rogancarr rogancarr Apr 4, 2019


tokens

"tokens (the individual words, lower cased, with common words removed)"

Many people won't be familiar with the specific language of NLP. #Resolved

// A pipeline for featurizing the text/string using LatentDirichletAllocation API.
// To be more accurate in computing the LDA features, the pipeline first normalizes text and removes stop words
// before passing tokens to LatentDirichletAllocation.
var pipeline = mlContext.Transforms.Text.NormalizeText("normText", "Text")
Contributor

@rogancarr rogancarr Apr 4, 2019


normText

I would spell it out for the example. #Resolved

var transformer = pipeline.Fit(dataview);

// Create the prediction engine to get the LDA features extracted from the text.
var predictionEngine = mlContext.Model.CreatePredictionEngine<TextData, TransformedTextData>(transformer);
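Once created, the engine scores one example at a time (a sketch; the `Features` field on `TransformedTextData` is assumed to hold the LDA output vector, as in the sample's output class):

```csharp
// Score a single example; Features holds the per-topic weights.
var prediction = predictionEngine.Predict(samples[0]);
Console.WriteLine(string.Join(" ", prediction.Features));
```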
Contributor


predictionEngine

Similar to the other PR, I wonder if we should stay entirely within IDataView and not create a prediction engine. That is, use a TakeRows filter followed by a CreateEnumerable.

Contributor Author


This is done because some of the text-processing transforms, such as `NormalizeText`, `TokenizeIntoWords`, etc., don't need training data. In such cases, a prediction engine seems more appropriate. But we can definitely reach consensus on this. I will follow up.
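For comparison, the IDataView-only approach suggested above could look like this sketch (reusing the sample's `transformer` and `dataview`):

```csharp
// Transform the whole dataset, keep the first few rows, and
// materialize them without creating a prediction engine.
var transformed = transformer.Transform(dataview);
var firstRows = mlContext.Data.TakeRows(transformed, 3);
var predictions = mlContext.Data.CreateEnumerable<TransformedTextData>(
    firstRows, reuseRowObject: false);
```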


In reply to: 271981536

Contributor

@rogancarr rogancarr left a comment


Approved with comments.

🚴

// 0.5455 0.1818 0.2727
}

private static void PrintPredictions(TransformedTextData prediction)
Member

@wschin wschin Apr 4, 2019


Suggested change:
- private static void PrintPredictions(TransformedTextData prediction)
+ private static void PrintLdaFeatures(TransformedTextData prediction)
#Resolved

Console.WriteLine();
}

public class TextData
Member

@wschin wschin Apr 4, 2019


Suggested change:
- public class TextData
+ private class TextData
#Resolved

public string Text { get; set; }
}

public class TransformedTextData : TextData
Member

@wschin wschin Apr 4, 2019


Suggested change:
- public class TransformedTextData : TextData
+ private class TransformedTextData : TextData
#Resolved

Contributor

@Ivanidzo4ka Ivanidzo4ka left a comment


:shipit:

@zeahmed zeahmed merged commit a8915f4 into dotnet:master Apr 5, 2019
zeahmed added a commit to zeahmed/machinelearning that referenced this pull request Apr 8, 2019
@ghost ghost locked as resolved and limited conversation to collaborators Mar 23, 2022