Lda snapping to template #3442

Merged 1 commit on Apr 21, 2019
@@ -28,7 +28,7 @@ public static class ConversionsExtensionsCatalog
/// are vectors or scalars.</param>
/// <param name="inputColumnName">Name of the column whose data will be hashed.
/// If set to <see langword="null"/>, the value of the <paramref name="outputColumnName"/> will be used as source.
/// This estimator operates over text, numeric, boolean, key or <see cref="DataViewRowId"/> data types. </param>
/// This estimator operates over vectors or scalars of text, numeric, boolean, key or <see cref="DataViewRowId"/> data types. </param>
/// <param name="numberOfBits">Number of bits to hash into. Must be between 1 and 31, inclusive.</param>
/// <param name="maximumNumberOfInverts">During hashing we construct mappings between original values and the produced hash values.
/// Text representations of the original values are stored in the slot names of the annotations for the new column. Hashing, as such, can map many initial values to one.
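The doc comments above describe the `Hash` transform's key parameters. A minimal usage sketch, assuming the standard ML.NET `Transforms.Conversion.Hash` overload (the row type, column names, and data here are hypothetical):

```csharp
using Microsoft.ML;

var mlContext = new MLContext();
var data = mlContext.Data.LoadFromEnumerable(new[]
{
    new CategoryRow { Category = "cat" },
    new CategoryRow { Category = "dog" },
});

// Hash "Category" into 2^16 buckets; keep up to one original value per hash
// slot so the mapping can be inspected via the output column's slot-name annotations.
var model = mlContext.Transforms.Conversion.Hash(
    outputColumnName: "CategoryHash",
    inputColumnName: "Category",
    numberOfBits: 16,
    maximumNumberOfInverts: 1).Fit(data);
var hashed = model.Transform(data);

class CategoryRow { public string Category { get; set; } }
```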
@@ -21,7 +21,7 @@ namespace Microsoft.ML.Transforms
/// | | |
/// | -- | -- |
/// | Does this estimator need to look at the data to train its parameters? | Yes |
/// | Input column data type | Vector or primitive numeric, boolean, [text](xref:Microsoft.ML.Data.TextDataViewType), [System.DateTime](xref:System.DateTime) and [key](xref:Microsoft.ML.Data.KeyDataViewType) data types.|
/// | Input column data type | Scalar numeric, boolean, [text](xref:Microsoft.ML.Data.TextDataViewType), [System.DateTime](xref:System.DateTime) or [key](xref:Microsoft.ML.Data.KeyDataViewType) data types.|
/// | Output column data type | [key](xref:Microsoft.ML.Data.KeyDataViewType)|
///
/// The ValueToKeyMappingEstimator builds up term vocabularies (dictionaries) mapping the input values to keys in the dictionary.
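A minimal sketch of the value-to-key mapping in use, assuming the standard `Transforms.Conversion.MapValueToKey` overload (the row type, column name, and data are hypothetical):

```csharp
using Microsoft.ML;

var mlContext = new MLContext();
var data = mlContext.Data.LoadFromEnumerable(new[]
{
    new LabelRow { Label = "cat" },
    new LabelRow { Label = "dog" },
    new LabelRow { Label = "cat" },
});

// Fitting builds the term dictionary from the data (hence "needs to look at
// the data to train its parameters: Yes" in the table above).
var model = mlContext.Transforms.Conversion.MapValueToKey("Label").Fit(data);
var keyed = model.Transform(data);   // "Label" is now a key type over {cat, dog}

class LabelRow { public string Label { get; set; } }
```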
55 changes: 53 additions & 2 deletions src/Microsoft.ML.Transforms/Text/LdaTransform.cs
@@ -46,7 +46,9 @@ namespace Microsoft.ML.Transforms.Text
//
// See <a href="https://github.com/dotnet/machinelearning/blob/master/test/Microsoft.ML.TestFramework/DataPipe/TestDataPipe.cs"/>
// for an example on how to use LatentDirichletAllocationTransformer.
/// <include file='doc.xml' path='doc/members/member[@name="LightLDA"]/*' />
/// <summary>
/// <see cref="ITransformer"/> resulting from fitting a <see cref="LatentDirichletAllocationEstimator"/>.
/// </summary>
public sealed class LatentDirichletAllocationTransformer : OneToOneTransformerBase
{
internal sealed class Options : TransformInputBase
@@ -936,7 +938,56 @@ private protected override IRowMapper MakeRowMapper(DataViewSchema schema)
=> new Mapper(this, schema);
}

/// <include file='doc.xml' path='doc/members/member[@name="LightLDA"]/*' />
/// <summary>
/// The LDA transform implements <a href="https://arxiv.org/abs/1412.1576">LightLDA</a>, a state-of-the-art implementation of Latent Dirichlet Allocation.
/// </summary>
/// <remarks>
/// <format type="text/markdown"><![CDATA[
///
/// ### Estimator Characteristics
/// | | |
/// | -- | -- |
/// | Does this estimator need to look at the data to train its parameters? | Yes |
/// | Input column data type | Vector of <xref:System.Single> |
/// | Output column data type | Vector of <xref:System.Single>|
///
/// Latent Dirichlet Allocation is a well-known [topic modeling](https://en.wikipedia.org/wiki/Topic_model) algorithm that infers semantic structure from text data,
/// and ultimately helps answer the question "what is this document about?".
/// It can be used to featurize any text fields as low-dimensional topical vectors.
/// LightLDA is an extremely efficient implementation of LDA that incorporates a number of
/// optimization techniques.
/// With the LDA transform, ML.NET users can train a topic model to produce 1 million topics with a 1-million-word vocabulary
/// on a 1-billion-token document set on a single machine in a few hours (typically, LDA at this scale takes days and requires large clusters).
/// The most significant innovation is a super-efficient $O(1)$ [Metropolis-Hastings sampling algorithm](https://en.wikipedia.org/wiki/Metropolis–Hastings_algorithm),
/// whose running cost is agnostic of model size, allowing it to converge nearly an order of magnitude faster than other [Gibbs samplers](https://en.wikipedia.org/wiki/Gibbs_sampling).
///
/// In an ML.NET pipeline, this estimator requires the output of some preprocessing as its input.
/// A typical pipeline operating on text would require text normalization, tokenization, and n-gram production to supply to the LDA estimator.
/// See the See Also section for example usage.
///
/// If we have the following three examples of text, as data points, and use the LDA transform with the number of topics set to 3,
/// we would get the results displayed in the table below. Example documents:
/// * I like to eat bananas.
/// * I eat bananas everyday.
@wschin (Member) commented on Apr 20, 2019:

Are those sentences required? You provide some input to the very beginning of this transform and then switch to algorithm details. I feel there might be a missing bridge between them.

Also, the descriptions of the training algorithm should be put into one single place. This section is somehow repeating information described above.

The above means:

    ///  on a 1-billion-token document set one a single machine in a few hours(typically, LDA at this scale takes days and requires large clusters).
    ///  The most significant innovation is a super-efficient O(1) [Metropolis-Hastings sampling algorithm](https://en.wikipedia.org/wiki/Metropolis–Hastings_algorithm),
    ///  whose running cost is agnostic of model size, allowing it to converges nearly an order of magnitude faster than other [Gibbs samplers](https://en.wikipedia.org/wiki/Gibbs_sampling).

#Resolved

The PR author replied:

I used the sentence as an example text, but I am realizing it is confusing. Let me try to find something unrelated.


In reply to: 277141226

/// * First celebrated in 1970, Earth Day now includes events in more than 193 countries,
/// which are now coordinated globally by the Earth Day Network.
///
/// Notice the similarity in values of the first and second row, compared to the third,
/// and see how those values are indicative of similarities between those two (small) bodies of text.
///
@wschin (Member) commented on Apr 20, 2019:

Context is missing.

  1. What is a topic?
  2. What are the values of a topic?
  3. What's the relation between those values and the two inputs "I like to eat bananas." and "I eat bananas everyday."?
  4. The way to describe an operation has an SOP --- first, describe the input; second, describe the output; finally, describe how to (at least conceptually, if writing equations is not doable) compute the output from the input. #Pending

@natke (Contributor) commented on Apr 21, 2019:

The illustration read fine to me. I got that the first two were related to the topic of bananas and the other wasn't. #Resolved

The PR author replied:

Leaving it then. Thanks, @natke.


In reply to: 277152635

/// | Topic 1 | Topic 2 | Topic 3 |
/// | ------- | ------- | ------- |
/// | 0.5714 | 0.0000 | 0.4286 |
/// | 0.5714 | 0.0000 | 0.4286 |
/// | 0.2400 | 0.3200 | 0.4400 |
///
/// For more technical details, you can consult the following resources.
/// * [LightLDA: Big Topic Models on Modest Computer Clusters](https://arxiv.org/abs/1412.1576)
/// * [LightLDA](https://github.com/Microsoft/LightLDA)
///
/// ]]></format>
/// </remarks>
/// <seealso cref="TextCatalog.LatentDirichletAllocation(TransformsCatalog.TextTransforms, string, string, int, float, float, int, int, int, int, int, int, int, bool)"/>
public sealed class LatentDirichletAllocationEstimator : IEstimator<LatentDirichletAllocationTransformer>
{
[BestFriend]
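The estimator remarks above call for text normalization, tokenization, and n-gram production ahead of the LDA estimator. A sketch of such a pipeline, loosely following the ML.NET samples (the row type, column names, and tiny in-memory dataset are hypothetical):

```csharp
using Microsoft.ML;

var mlContext = new MLContext();
var data = mlContext.Data.LoadFromEnumerable(new[]
{
    new TextRow { Text = "I like to eat bananas." },
    new TextRow { Text = "I eat bananas everyday." },
});

// Normalize -> tokenize -> map tokens to keys -> n-grams -> LDA, as described
// in the estimator remarks above.
var pipeline = mlContext.Transforms.Text.NormalizeText("NormalizedText", "Text")
    .Append(mlContext.Transforms.Text.TokenizeIntoWords("Tokens", "NormalizedText"))
    .Append(mlContext.Transforms.Conversion.MapValueToKey("Tokens"))
    .Append(mlContext.Transforms.Text.ProduceNgrams("Ngrams", "Tokens"))
    .Append(mlContext.Transforms.Text.LatentDirichletAllocation(
        "Topics", "Ngrams", numberOfTopics: 3));

var transformed = pipeline.Fit(data).Transform(data);

class TextRow { public string Text { get; set; } }
```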
11 changes: 7 additions & 4 deletions src/Microsoft.ML.Transforms/Text/TextCatalog.cs
@@ -556,12 +556,15 @@ internal static NgramHashingEstimator ProduceHashedNgrams(this TransformsCatalog
=> new NgramHashingEstimator(Contracts.CheckRef(catalog, nameof(catalog)).GetEnvironment(), columns);

/// <summary>
/// Uses <a href="https://arxiv.org/abs/1412.1576">LightLDA</a> to transform a document (represented as a vector of floats)
/// into a vector of floats over a set of topics.
/// Create a <see cref="LatentDirichletAllocationEstimator"/>, which uses <a href="https://arxiv.org/abs/1412.1576">LightLDA</a> to transform text (represented as a vector of floats)
/// into a vector of <see cref="System.Single"/> indicating the similarity of the text with each topic identified.
/// </summary>
/// <param name="catalog">The transform's catalog.</param>
/// <param name="outputColumnName">Name of the column resulting from the transformation of <paramref name="inputColumnName"/>.</param>
/// <param name="inputColumnName">Name of the column to transform. If set to <see langword="null"/>, the value of the <paramref name="outputColumnName"/> will be used as source.</param>
/// <param name="outputColumnName">Name of the column resulting from the transformation of <paramref name="inputColumnName"/>.
/// This estimator outputs a vector of <see cref="System.Single"/>.</param>
/// <param name="inputColumnName">Name of the column to transform. If set to <see langword="null"/>, the value of the <paramref name="outputColumnName"/> will be used as source.
/// This estimator operates over a vector of <see cref="System.Single"/>.
/// </param>
/// <param name="numberOfTopics">The number of topics.</param>
/// <param name="alphaSum">Dirichlet prior on document-topic vectors.</param>
/// <param name="beta">Dirichlet prior on vocab-topic vectors.</param>
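As a follow-up sketch, the per-document topic vectors (the rows of the table in the estimator remarks) can be read back from the transformed data; `transformed` and `"Topics"` refer to the hypothetical pipeline sketched earlier:

```csharp
using System;
using Microsoft.ML;
using Microsoft.ML.Data;   // assumed home of the GetColumn extension

// Each row is one document's distribution over the topics,
// e.g. "0.5714, 0.0000, 0.4286" for the first example document.
foreach (var topicVector in transformed.GetColumn<float[]>("Topics"))
    Console.WriteLine(string.Join(", ", topicVector));
```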
34 changes: 0 additions & 34 deletions src/Microsoft.ML.Transforms/Text/doc.xml
@@ -150,40 +150,6 @@
</example>
</member>

<member name="LightLDA">
<summary>
The LDA transform implements LightLDA, a state-of-the-art implementation of Latent Dirichlet Allocation.
</summary>
<remarks>
Latent Dirichlet Allocation is a well-known topic modeling algorithm that infers topical structure from text data,
and can be used to featurize any text fields as low-dimensional topical vectors.
<para>LightLDA is an extremely efficient implementation of LDA developed in MSR-Asia that incorporates a number of
optimization techniques. See <a href="https://arxiv.org/abs/1412.1576">LightLDA: Big Topic Models on Modest Compute Clusters</a>.
</para>
<para>
With the LDA transform, ML.NET users can train a topic model to produce 1 million topics with 1 million vocabulary
on a 1-billion-token document set one a single machine in a few hours (typically, LDA at this scale takes days and requires large clusters).
The most significant innovation is a super-efficient O(1) <a href="https://en.wikipedia.org/wiki/Metropolis–Hastings_algorithm">Metropolis-Hastings sampling algorithm</a>,
whose running cost is (surprisingly) agnostic of model size,
allowing it to converges nearly an order of magnitude faster than other <a href="https://en.wikipedia.org/wiki/Gibbs_sampling">Gibbs samplers.</a>
</para>
<para>
For more details please see original LightLDA paper, and its open source implementation.
<list type="bullet">
<item><description><a href="https://arxiv.org/abs/1412.1576"> LightLDA: Big Topic Models on Modest Computer Clusters</a></description></item>
<item><description><a href=" https://github.com/Microsoft/LightLDA">LightLDA </a></description></item>
</list>
</para>
</remarks>
</member>
<example name="LightLDA">
<example>
<code language="csharp">
pipeline.Add(new LightLda((&quot;InTextCol&quot; , &quot;OutTextCol&quot;)));
</code>
</example>
</example>

<member name="WordEmbeddings">
<summary>
Word Embeddings transform is a text featurizer which converts vectors of text tokens into sentence vectors using a pre-trained model.