Lda snapping to template #3442
Conversation
Codecov Report

```diff
@@            Coverage Diff             @@
##           master    #3442      +/-   ##
==========================================
+ Coverage   72.76%   72.76%    +<.01%
==========================================
  Files         808      808
  Lines      145452   145452
  Branches    16244    16244
==========================================
+ Hits       105839   105843        +4
+ Misses      35193    35189        -4
  Partials     4420     4420
```
/// | | |
/// | -- | -- |
/// | Does this estimator need to look at the data to train its parameters? | Yes |
/// | Input column data type | [key](xref:Microsoft.ML.Data.KeyDataViewType) data types|
data types [](start = 80, length = 11)
just 'key type'
#Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just use only xref? That would make references to Key consistent.
In reply to: 277098008 [](ancestors = 277098008)
/// | -- | -- |
/// | Does this estimator need to look at the data to train its parameters? | Yes |
/// | Input column data type | [key](xref:Microsoft.ML.Data.KeyDataViewType) data types|
/// | Output column data type | Vector or <xref:System.Single>|
or [](start = 43, length = 2)
or -> of ?? #Resolved
/// It can be used to featurize any text fields as low-dimensional topical vectors.
/// LightLDA is an extremely efficient implementation of LDA developed in MSR-Asia that incorporates a number of
/// optimization techniques.
/// With the LDA transform, ML.NET users can train a topic model to produce 1 million topics with 1 million vocabulary
/ [](start = 6, length = 3)
newline
/// The most significant innovation is a super-efficient O(1) [Metropolis-Hastings sampling algorithm](https://en.wikipedia.org/wiki/Metropolis–Hastings_algorithm),
/// whose running cost is agnostic of model size, allowing it to converges nearly an order of magnitude faster than other [Gibbs samplers](https://en.wikipedia.org/wiki/Gibbs_sampling).
///
/// In an Ml.Net pipeline, this estimator requires the output of some preprocessing, as its input.
Ml.Net [](start = 15, length = 6)
ML.NET #Resolved
/// If we have the following three lines of text, as data points:
/// * I like to eat bananas.
/// * I eat bananas everyday.
/// * LightLDA improves the sampling throughput and convergence speed via a novel O(1) metropolis-Hastings sampler,
- [](start = 8, length = 3)
should this * be removed? #Resolved
A shorter example might be clearer here :) #Pending
I tried a bunch, but none of them gave nice numbers, like this one.
In reply to: 277138988 [](ancestors = 277138988)
/// Uses <a href="https://arxiv.org/abs/1412.1576">LightLDA</a> to transform a document (represented as a vector of floats)
/// into a vector of floats over a set of topics.
/// Create a <see cref="LatentDirichletAllocationEstimator"/>, which uses <a href="https://arxiv.org/abs/1412.1576">LightLDA</a> to transform text (represented as a vector of floats)
/// into a vector of floats indicating the similarity of the text with each topic identified.
floats [](start = 29, length = 6)
single #Resolved
/// Latent Dirichlet Allocation is a well-known [topic modeling](https://en.wikipedia.org/wiki/Topic_model) algorithm that infers semantic structure from text data,
/// and ultimately helps answer the question on "what is this document about?".
/// It can be used to featurize any text fields as low-dimensional topical vectors.
/// LightLDA is an extremely efficient implementation of LDA developed in MSR-Asia that incorporates a number of
developed in MSR-Asia [](start = 66, length = 21)
does this matter? We probably just say "implementation of LDA that incorporates..."
#Resolved
/// whose running cost is agnostic of model size, allowing it to converges nearly an order of magnitude faster than other [Gibbs samplers](https://en.wikipedia.org/wiki/Gibbs_sampling).
///
/// In an Ml.Net pipeline, this estimator requires the output of some preprocessing, as its input.
/// A typical pipeline operating on text would require performing text normalization, tokenization and producing n-grams to than supply to LDA.
than [](start = 129, length = 4)
suggestion: n-grams to supply to the LDA transformer. #Resolved
/// A typical pipeline operating on text would require performing text normalization, tokenization and producing n-grams to than supply to LDA.
/// See the example usage in the SeeAlso section for usage suggestions.
///
/// If we have the following three lines of text, as data points:
three [](start = 34, length = 5)
it looks like a line is missing? #Resolved
It is the third bullet point. I substituted the line with 'example'.
In reply to: 277109119 [](ancestors = 277109119)
/// * I eat bananas everyday.
/// * LightLDA improves the sampling throughput and convergence speed via a novel O(1) metropolis-Hastings sampler,
/// and allows a small cluster of machines to tackle very large data and model sizes based on the model scheduling
/// and data parallelism capabilities of the DMTK parameter server.(quoted from [LightLDA](http://www.dmtk.io/lightlda.html))
run-on sentence, can this be reworked? #Resolved
It is just an example sentence, but I see how it can be confusing. Let me pick something else.
In reply to: 277109187 [](ancestors = 277109187)
Expected input column type and expected output column type? #Resolved
Refers to: src/Microsoft.ML.Transforms/Text/TextCatalog.cs:504 in 4797745. [](commit_id = 4797745, deletion_comment = False)
Looks good, I left some feedback.
?
In reply to: 485034243 [](ancestors = 485034243)
Refers to: src/Microsoft.ML.Transforms/Text/TextCatalog.cs:504 in 4797745. [](commit_id = 4797745, deletion_comment = False)
/// whose running cost is agnostic of model size, allowing it to converges nearly an order of magnitude faster than other [Gibbs samplers](https://en.wikipedia.org/wiki/Gibbs_sampling).
///
/// In an Ml.Net pipeline, this estimator requires the output of some preprocessing, as its input.
/// A typical pipeline operating on text would require performing text normalization, tokenization and producing n-grams to than supply to LDA.
For the suggested steps, please add xref to them.
The reason I didn't do it is that I don't know how to format extension methods in xref format.
I did point them to the sample, which contains the same steps.
In reply to: 277138943 [](ancestors = 277138943)
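For readers following the thread, the suggested preprocessing steps chain together roughly as below. This is a minimal sketch assuming the standard Microsoft.ML catalog extension methods; the column names ("Text", "Tokens", "Ngrams", "Topics") are illustrative and not taken from the PR:

```csharp
using Microsoft.ML;

// Minimal sketch of the chain under discussion:
// normalize text -> tokenize -> map tokens to keys -> n-grams -> LDA.
var mlContext = new MLContext();
var pipeline = mlContext.Transforms.Text.NormalizeText("NormalizedText", "Text")
    .Append(mlContext.Transforms.Text.TokenizeIntoWords("Tokens", "NormalizedText"))
    .Append(mlContext.Transforms.Conversion.MapValueToKey("Tokens"))
    .Append(mlContext.Transforms.Text.ProduceNgrams("Ngrams", "Tokens"))
    .Append(mlContext.Transforms.Text.LatentDirichletAllocation(
        "Topics", "Ngrams", numberOfTopics: 3));
```

MapValueToKey is included because ProduceNgrams expects a vector of key-typed tokens, which also lines up with the key-type input noted in the characteristics table above.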
/// It can be used to featurize any text fields as low-dimensional topical vectors.
/// LightLDA is an extremely efficient implementation of LDA developed in MSR-Asia that incorporates a number of
/// optimization techniques.
/// With the LDA transform, ML.NET users can train a topic model to produce 1 million topics with 1 million vocabulary
1 million word vocabulary? #Resolved
/// whose running cost is agnostic of model size, allowing it to converges nearly an order of magnitude faster than other [Gibbs samplers](https://en.wikipedia.org/wiki/Gibbs_sampling).
///
/// In an Ml.Net pipeline, this estimator requires the output of some preprocessing, as its input.
/// A typical pipeline operating on text would require performing text normalization, tokenization and producing n-grams to than supply to LDA.
If you're in here editing anyway, you could remove the "performing" #Resolved
/// If we have the following three lines of text, as data points:
/// * I like to eat bananas.
/// * I eat bananas everyday.
/// * LightLDA improves the sampling throughput and convergence speed via a novel O(1) metropolis-Hastings sampler,
A shorter example might be clearer here :) #Pending
/// If we have the following three lines of text, as data points:
/// * I like to eat bananas.
/// * I eat bananas everyday.
/// * LightLDA improves the sampling throughput and convergence speed via a novel O(1) metropolis-Hastings sampler,
O(1) [](start = 87, length = 4)
///
/// If we have the following three lines of text, as data points:
/// * I like to eat bananas.
/// * I eat bananas everyday.
Are those sentences required? You provide some input to the very beginning of this transform and then switch to algorithm details. I feel there might be a missing bridge between them.
Also, the descriptions of the training algorithm should be put into one single place. This section is somehow repeating information described above.
The "above" means:
```
/// on a 1-billion-token document set one a single machine in a few hours(typically, LDA at this scale takes days and requires large clusters).
/// The most significant innovation is a super-efficient O(1) [Metropolis-Hastings sampling algorithm](https://en.wikipedia.org/wiki/Metropolis–Hastings_algorithm),
/// whose running cost is agnostic of model size, allowing it to converges nearly an order of magnitude faster than other [Gibbs samplers](https://en.wikipedia.org/wiki/Gibbs_sampling).
```
#Resolved
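For context on the O(1) claim in the quoted lines: in a generic Metropolis-Hastings sampler (a textbook fact, not specific to this implementation), a proposed topic assignment t' drawn from a proposal distribution q is accepted with probability

```latex
\alpha(t \to t') = \min\left(1, \frac{p(t')\, q(t \mid t')}{p(t)\, q(t' \mid t)}\right)
```

so each step evaluates the unnormalized posterior only at the current and proposed assignments. LightLDA combines this with alias tables so that a proposal can be drawn in amortized O(1) time per token, independent of the number of topics, which matches the "running cost is agnostic of model size" wording.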
I used the sentence as an example text, but I am realizing it is confusing. Let me try to find something unrelated.
In reply to: 277141226 [](ancestors = 277141226)
///
/// To illustrate the effect of this estimator on text, notice the similarity in values of the first and second row, compared to the third,
/// and see how those values are indicative of semantic similarities between those lines.
///
Context is missing.
- What is a topic?
- What are the values of a `Topic`?
- What's the relation between those values and the two inputs `I like to eat bananas.` and `I eat bananas everyday.`?
- The way to describe an operation has a SOP --- first, describe the input; second, describe the output; finally, describe how to compute the output from the input (at least conceptually, if writing equations is not doable). #Pending
The illustration read fine to me. I got that the first two were related to the topic of bananas and the other wasn't. #Resolved
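To make the illustration concrete, here is a minimal end-to-end sketch; the data, seed, topic count, and column names are illustrative rather than taken from the PR, and the printed vectors will vary by version and seed:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.ML;

// Illustrative input row type; not from the PR.
class TextData { public string Text { get; set; } }

class Program
{
    static void Main()
    {
        var mlContext = new MLContext(seed: 0);
        var data = mlContext.Data.LoadFromEnumerable(new List<TextData>
        {
            new TextData { Text = "I like to eat bananas." },
            new TextData { Text = "I eat bananas everyday." },
            new TextData { Text = "The quick brown fox jumps over the lazy dog." },
        });

        // Same preprocessing chain as sketched earlier in the thread.
        var pipeline = mlContext.Transforms.Text.NormalizeText("NormalizedText", "Text")
            .Append(mlContext.Transforms.Text.TokenizeIntoWords("Tokens", "NormalizedText"))
            .Append(mlContext.Transforms.Conversion.MapValueToKey("Tokens"))
            .Append(mlContext.Transforms.Text.ProduceNgrams("Ngrams", "Tokens"))
            .Append(mlContext.Transforms.Text.LatentDirichletAllocation(
                "Topics", "Ngrams", numberOfTopics: 2));

        var transformed = pipeline.Fit(data).Transform(data);

        // Each row becomes a vector of Single, one slot per topic.
        foreach (var topics in transformed.GetColumn<float[]>("Topics"))
            Console.WriteLine(string.Join(", ", topics.Select(t => t.ToString("F4"))));
    }
}
```

The claim under discussion is that the two banana sentences land close together in this topic space while the unrelated line does not.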
We need to specify the accepted input type, such as a vector of
In reply to: 485062948 [](ancestors = 485062948,485034243)
Refers to: src/Microsoft.ML.Transforms/Text/TextCatalog.cs:504 in 4797745. [](commit_id = 4797745, deletion_comment = False)
towards #3204.