Created sample for 'LatentDirichletAllocation' API. #3191
```csharp
// before passing tokens to LatentDirichletAllocation.
var pipeline = mlContext.Transforms.Text.NormalizeText("normText", "Text")
    .Append(mlContext.Transforms.Text.TokenizeIntoWords("Tokens", "normText"))
    .Append(mlContext.Transforms.Text.RemoveStopWords("Tokens"))
```
> RemoveStopWords

Funny, this is a custom stop words remover with no stop words, so it does nothing. I guess we need to remove the `params` from `params string[] stopwords` in `RemoveStopWords`, so that callers have to supply the words. #Resolved
Oops... that's why I was not getting the output I was expecting. :)

In reply to: 271928104
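For context, a sketch of the distinction being discussed, assuming the standard `Microsoft.ML.Transforms.Text` catalog extensions (column names here are from the sample above):

```csharp
using Microsoft.ML;
using Microsoft.ML.Transforms.Text;

var mlContext = new MLContext();

// With no stop words supplied, the custom remover has an empty word
// list and removes nothing:
var noOp = mlContext.Transforms.Text.RemoveStopWords("Tokens");

// Supplying the words explicitly makes the custom remover do real work...
var custom = mlContext.Transforms.Text.RemoveStopWords(
    "Tokens", "Tokens", "a", "the", "of");

// ...or use the built-in, language-specific default stop word list:
var defaults = mlContext.Transforms.Text.RemoveDefaultStopWords(
    "Tokens", "Tokens", StopWordsRemovingEstimator.Language.English);
```

Making `stopwords` a required (non-`params`) argument would rule out the accidental no-op call at compile time.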
```csharp
    .Append(mlContext.Transforms.Text.TokenizeIntoWords("Tokens", "normText"))
    .Append(mlContext.Transforms.Text.RemoveStopWords("Tokens"))
    .Append(mlContext.Transforms.Conversion.MapValueToKey("Tokens"))
    .Append(mlContext.Transforms.Text.ProduceNgrams("Tokens"))
```
> ProduceNgrams

Do we actually want to run LDA on top of 2-grams, since 2 is the default value for `ProduceNgrams`, or should we recommend `ngramLength: 1`? #Resolved
2 is fine, as it backs off to unigrams (`useAllLengths = true`). I think higher is better when there is a lot of data available.

In reply to: 271930804
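A sketch of the two options under discussion, assuming the `ProduceNgrams` signature in the `Microsoft.ML` text catalog (the `"Tokens"` column name is from the sample above and must already be a key-typed column):

```csharp
using Microsoft.ML;

var mlContext = new MLContext();

// Defaults: ngramLength: 2, useAllLengths: true. The output contains
// both unigrams and bigrams, so LDA can "back off" to unigrams when
// bigram counts are sparse.
var withBackoff = mlContext.Transforms.Text.ProduceNgrams(
    "Tokens", ngramLength: 2, useAllLengths: true);

// Unigrams only, if that were the recommendation for small datasets:
var unigramsOnly = mlContext.Transforms.Text.ProduceNgrams(
    "Tokens", ngramLength: 1, useAllLengths: false);
```

With `useAllLengths: true` the bigram setting strictly adds features rather than replacing the unigrams, which is why the default is considered safe here.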
Codecov Report

```diff
@@            Coverage Diff             @@
##           master    #3191      +/-   ##
==========================================
- Coverage    72.6%   72.58%    -0.02%
==========================================
  Files         807      807
  Lines      144957   144957
  Branches    16211    16211
==========================================
- Hits       105240   105215       -25
- Misses      35298    35326       +28
+ Partials     4419     4416        -3
```
Codecov Report

```diff
@@            Coverage Diff             @@
##           master    #3191      +/-   ##
==========================================
- Coverage    72.6%   72.58%    -0.02%
==========================================
  Files         807      807
  Lines      144957   144957
  Branches    16211    16211
==========================================
- Hits       105240   105221       -19
- Misses      35298    35321       +23
+ Partials     4419     4415        -4
```
```csharp
// Create a small dataset as an IEnumerable.
var samples = new List<TextData>()
{
    new TextData(){ Text = "ML.NET's LatentDirichletAllocation API computes topic model." },
```
> topic model

models, with an s #Resolved
```csharp
{
    new TextData(){ Text = "ML.NET's LatentDirichletAllocation API computes topic model." },
    new TextData(){ Text = "ML.NET's LatentDirichletAllocation API is the best for topic model." },
    new TextData(){ Text = "I like to eat broccoli and banana." },
```
> banana

bananas #Resolved
```csharp
    new TextData(){ Text = "ML.NET's LatentDirichletAllocation API computes topic model." },
    new TextData(){ Text = "ML.NET's LatentDirichletAllocation API is the best for topic model." },
    new TextData(){ Text = "I like to eat broccoli and banana." },
    new TextData(){ Text = "I eat a banana in the breakfast." },
```
> in the

for #Resolved
```csharp
// A pipeline for featurizing the text/string using LatentDirichletAllocation API.
// To be more accurate in computing the LDA features, the pipeline first normalizes text and removes stop words
// before passing tokens to LatentDirichletAllocation.
```
> tokens

Suggest "tokens (the individual words, lower cased, with common words removed)". Many people won't be familiar with the specific language of NLP. #Resolved
```csharp
// A pipeline for featurizing the text/string using LatentDirichletAllocation API.
// To be more accurate in computing the LDA features, the pipeline first normalizes text and removes stop words
// before passing tokens to LatentDirichletAllocation.
var pipeline = mlContext.Transforms.Text.NormalizeText("normText", "Text")
```
> normText

I would spell it out for the example. #Resolved
```csharp
var transformer = pipeline.Fit(dataview);

// Create the prediction engine to get the LDA features extracted from the text.
var predictionEngine = mlContext.Model.CreatePredictionEngine<TextData, TransformedTextData>(transformer);
```
> predictionEngine

Similar to the other PR, I wonder if we should stay entirely within IDataView and not create a prediction engine. That is, use a `TakeRows` filter followed by a `CreateEnumerable`.
This is done because some of the text-processing transforms (`NormalizeText`, `TokenizeIntoWords`, etc.) don't need training data. In such cases, a prediction engine seems more appropriate. But we can definitely reach consensus on this; I will follow up.

In reply to: 271981536
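For reference, a sketch of the reviewer's suggested IDataView-only alternative. The names `mlContext`, `transformer`, `dataview`, `TransformedTextData`, and `PrintPredictions` are taken from the sample snippets in this thread; the row count of 2 is an arbitrary illustration:

```csharp
// Stay entirely within IDataView: transform the data, keep only the
// first few rows, then enumerate the results as strongly typed objects
// instead of creating a prediction engine.
var transformedData = transformer.Transform(dataview);
var firstRows = mlContext.Data.TakeRows(transformedData, 2);
var predictions = mlContext.Data.CreateEnumerable<TransformedTextData>(
    firstRows, reuseRowObject: false);

foreach (var prediction in predictions)
    PrintPredictions(prediction);
```

The trade-off: the prediction engine is convenient for one-off, single-example scoring, while the `TakeRows` + `CreateEnumerable` route keeps the sample consistent with batch-oriented IDataView usage.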
Approved with comments.
🚴
```csharp
    //   0.5455 0.1818 0.2727
}

private static void PrintPredictions(TransformedTextData prediction)
```
Suggested change:

```diff
- private static void PrintPredictions(TransformedTextData prediction)
+ private static void PrintLdaFeatures(TransformedTextData prediction)
```

#Resolved
```csharp
    Console.WriteLine();
}

public class TextData
```
Suggested change:

```diff
- public class TextData
+ private class TextData
```

#Resolved
```csharp
    public string Text { get; set; }
}

public class TransformedTextData : TextData
```
Suggested change:

```diff
- public class TransformedTextData : TextData
+ private class TransformedTextData : TextData
```

#Resolved
Related to #1209.