Creation of components through MLContext and cleanup (text related transforms) #2393

Merged: 14 commits, Feb 6, 2019

Conversation

abgoswam
Member

@abgoswam abgoswam commented Feb 3, 2019

Towards #1798, #1758, #1760

The following transform estimators are being addressed:

  • LatentDirichletAllocationEstimator
  • WordEmbeddingsExtractingEstimator
  • TokenizingByCharactersEstimator
  • WordTokenizingEstimator
  • WordBagEstimator
  • WordHashBagEstimator
  • NgramExtractingEstimator
  • NgramHashingEstimator
  • StopWordsRemovingEstimator
  • CustomStopWordsRemovingEstimator
  • TextNormalizingEstimator

The changes are as follows:

  1. Internalize constructors of estimators and transformers (#1798, "Creation of components through MLContext: advanced options and other feedback")
  2. Rename Arguments -> Options (#1798)
  3. Internalize Options when they are not used by a public constructor (#1758, "Arguments class should be made internal when possible")
  4. Rename Options objects to options (instead of the args or advancedSettings used so far) (#1798)
  5. Move ColumnInfo to the estimators (#1760, "The ColumnInfo structure should live in the estimators, rather than transformers")
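
For readers unfamiliar with the pattern, the five changes listed above can be sketched roughly as follows. This is an illustrative sketch only: `ExampleCatalog`, `ExampleTokenizingEstimator`, and `ExampleTokenize` are hypothetical names invented here, not the actual ML.NET types touched by this PR. Only the `ColumnInfo(string name, string inputColumnName = null)` constructor shape and the `inputColumnName ?? name` fallback are taken from code quoted later in this thread.

```csharp
using System;

// Hypothetical stand-in for a catalog exposed by MLContext; not a real ML.NET type.
public sealed class ExampleCatalog { }

// Hypothetical estimator illustrating the pattern; not one of the estimators in this PR.
public sealed class ExampleTokenizingEstimator
{
    // Advanced settings live in a nested Options class (previously named Arguments).
    public sealed class Options
    {
        public char[] Separators = { ' ' };
    }

    // ColumnInfo is declared on the estimator rather than on the transformer.
    public sealed class ColumnInfo
    {
        public readonly string Name;
        public readonly string InputColumnName;

        public ColumnInfo(string name, string inputColumnName = null)
        {
            Name = name ?? throw new ArgumentNullException(nameof(name));
            // Falls back to the output name, so the field is always populated.
            InputColumnName = inputColumnName ?? name;
        }
    }

    public ColumnInfo Column { get; }
    public Options Settings { get; }

    // Internal constructor: code outside the assembly has to go through the catalog.
    internal ExampleTokenizingEstimator(ColumnInfo column, Options options)
    {
        Column = column;
        Settings = options ?? new Options();
    }
}

public static class ExampleCatalogExtensions
{
    // The only public way to create the estimator in this sketch.
    public static ExampleTokenizingEstimator ExampleTokenize(
        this ExampleCatalog catalog,
        string outputColumnName,
        string inputColumnName = null,
        ExampleTokenizingEstimator.Options options = null)
    {
        var column = new ExampleTokenizingEstimator.ColumnInfo(outputColumnName, inputColumnName);
        return new ExampleTokenizingEstimator(column, options);
    }
}

public static class Program
{
    public static void Main()
    {
        var estimator = new ExampleCatalog().ExampleTokenize("Tokens", "Text");
        Console.WriteLine(estimator.Column.InputColumnName);
    }
}
```

The parameter is named `options` (rather than `args` or `advancedSettings`), matching item 4 in the list.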

@codecov

codecov bot commented Feb 3, 2019

Codecov Report

Merging #2393 into master will increase coverage by <.01%.
The diff coverage is 87.37%.

@@            Coverage Diff             @@
##           master    #2393      +/-   ##
==========================================
+ Coverage   71.26%   71.26%   +<.01%     
==========================================
  Files         785      785              
  Lines      140946   140939       -7     
  Branches    16108    16108              
==========================================
  Hits       100440   100440              
+ Misses      36039    36031       -8     
- Partials     4467     4468       +1

Flag          Coverage Δ
#Debug        71.26% <87.37%> (ø) ⬆️
#production   67.61% <87.52%> (ø) ⬆️
#test         85.32% <84.61%> (ø) ⬆️

@abgoswam abgoswam changed the title Creation of components through MLContext and cleanup (Text Transforms) Creation of components through MLContext and cleanup (several Text related transforms) Feb 3, 2019
@abgoswam abgoswam changed the title Creation of components through MLContext and cleanup (several Text related transforms) Creation of components through MLContext and cleanup (text related transforms) Feb 3, 2019
@artidoro
Contributor

artidoro commented Feb 3, 2019

    public sealed class Column : OneToOneColumn

can you make this internal as well?
#Resolved


Refers to: src/Microsoft.ML.Transforms/Text/LdaTransform.cs:109 in 92e6f2d.

@artidoro
Contributor

artidoro commented Feb 3, 2019

    public sealed class LdaSummary

If possible this one too should become internal. I am not sure where it is used though, so might need to double check. #Resolved


Refers to: src/Microsoft.ML.Transforms/Text/LdaTransform.cs:167 in 92e6f2d.

{
Contracts.CheckValue(env, nameof(env));
_host = env.Register(nameof(LatentDirichletAllocationEstimator));
_columns = columns.ToImmutableArray();
}

public sealed class ColumnInfo
{
public readonly string Name;
Contributor

@artidoro artidoro Feb 3, 2019

public

Could you add a summary tag for each one of these public entries?

And could you add a summary for the ColumnInfo object? #Resolved

@artidoro
Contributor

artidoro commented Feb 3, 2019

    public LatentDirichletAllocationTransformer Fit(IDataView input)

Could you add a summary for this method? #Resolved


Refers to: src/Microsoft.ML.Transforms/Text/LdaTransform.cs:1172 in 92e6f2d.

@abgoswam
Member Author

abgoswam commented Feb 5, 2019

    public sealed class LdaSummary

this one is currently used by Static API via the OnFit() delegate, to provide details about the topics discovered by LightLDA.

making it internal would break the Static API


In reply to: 460088862

Refers to: src/Microsoft.ML.Transforms/Text/LdaTransform.cs:167 in 92e6f2d.

@Ivanidzo4ka
Contributor

Ivanidzo4ka commented Feb 5, 2019

    /// </summary>

nit: I would end the summary here and wrap the next paragraph in <remarks>.

Refers to: src/Microsoft.ML.Transforms/Text/TextCatalog.cs:483 in a81c3d2.

@Ivanidzo4ka
Contributor

Ivanidzo4ka commented Feb 5, 2019

    public override SchemaShape GetOutputSchema(SchemaShape inputSchema)

        /// <summary>
        /// Returns the <see cref="SchemaShape"/> of the schema which will be produced by the transformer.
        /// Used for schema propagation and verification in a pipeline.
        /// </summary>

#Closed

Refers to: src/Microsoft.ML.Transforms/Text/TokenizingByCharacters.cs:587 in a81c3d2.

@@ -94,7 +94,7 @@ internal bool TryUnparse(StringBuilder sb)
/// </summary>
public sealed class TokenizeColumn : OneToOneColumn { }
Contributor

@Ivanidzo4ka Ivanidzo4ka Feb 5, 2019

public

also internal #Resolved

Contributor

The class is internal, so it should be fine


In reply to: 254030739

Member Author

deleted actually. had 0 references :)


In reply to: 254030739

@Ivanidzo4ka
Contributor

Ivanidzo4ka commented Feb 5, 2019

}

internal #Closed


Refers to: src/Microsoft.ML.Transforms/Text/WordBagTransform.cs:41 in a81c3d2.

@@ -220,7 +220,7 @@ internal bool TryUnparse(StringBuilder sb)

/// <summary>
/// This class is a merger of <see cref="ValueToKeyMappingTransformer.Options"/> and
-/// <see cref="NgramExtractingTransformer.Arguments"/>, with the allLength option removed.
+/// <see cref="NgramExtractingTransformer.Options"/>, with the allLength option removed.
/// </summary>
public abstract class ArgumentsBase
Contributor

@Ivanidzo4ka Ivanidzo4ka Feb 5, 2019

public

internal? #Resolved

Member Author

fixed


In reply to: 254030994

@Ivanidzo4ka
Contributor

Ivanidzo4ka commented Feb 5, 2019

    public static IDataTransform Create(IHostEnvironment env, NgramExtractorArguments extractorArgs, IDataView input,

Can you merge with master? I think I tackled this recently. #Resolved

Refers to: src/Microsoft.ML.Transforms/Text/WordBagTransform.cs:369 in a81c3d2.

public readonly string Name;
public readonly string InputColumnName;

public ColumnInfo(string name, string inputColumnName = null)
Contributor

@Ivanidzo4ka Ivanidzo4ka Feb 5, 2019

Add a summary to the constructor. #Resolved

Member Author

fixed


In reply to: 254031689

@Ivanidzo4ka
Contributor

Ivanidzo4ka commented Feb 5, 2019

    public class Column : OneToOneColumn

internal #Resolved


Refers to: src/Microsoft.ML.Transforms/Text/WordTokenizing.cs:43 in a81c3d2.

InputColumnName = inputColumnName ?? name;
Separators = separators ?? new[] { ' ' };
}
}

public override SchemaShape GetOutputSchema(SchemaShape inputSchema)
{
Contributor

@artidoro artidoro Feb 5, 2019

/// <summary>
/// Returns the <see cref="SchemaShape"/> of the schema which will be produced by the transformer.
/// Used for schema propagation and verification in a pipeline.
/// </summary>
#Resolved

Member Author

fixed


In reply to: 254082141

@abgoswam
Member Author

abgoswam commented Feb 5, 2019

    public sealed class NgramExtractorArguments : ArgumentsBase, INgramExtractorFactoryFactory

I will create a separate issue to scrub all the entry point APIs and fix their naming. As you can see, they derive from ArgumentsBase, so we need to consider whether they should have the suffix *Options.

In reply to: 460829905

Refers to: src/Microsoft.ML.Transforms/Text/WordBagTransform.cs:249 in aae84eb.

@abgoswam
Member Author

abgoswam commented Feb 5, 2019

    public override SchemaShape GetOutputSchema(SchemaShape inputSchema)

fixed


In reply to: 460788078

Refers to: src/Microsoft.ML.Transforms/Text/TokenizingByCharacters.cs:587 in a81c3d2.

@abgoswam
Member Author

abgoswam commented Feb 5, 2019

}

fixed i think ?


In reply to: 460788328

Refers to: src/Microsoft.ML.Transforms/Text/WordBagTransform.cs:41 in a81c3d2.

@abgoswam
Member Author

abgoswam commented Feb 5, 2019

    public static IDataTransform Create(IHostEnvironment env, NgramExtractorArguments extractorArgs, IDataView input,

fixed. looks like u had missed it. ;)


In reply to: 460788737

Refers to: src/Microsoft.ML.Transforms/Text/WordBagTransform.cs:369 in a81c3d2.

@abgoswam
Member Author

abgoswam commented Feb 5, 2019

    public class Column : OneToOneColumn

fixed


In reply to: 460789322

Refers to: src/Microsoft.ML.Transforms/Text/WordTokenizing.cs:43 in a81c3d2.

@abgoswam
Member Author

abgoswam commented Feb 5, 2019

    public void Dispose()

implementing IDisposable, so cannot make this internal.


In reply to: 460827459

Refers to: src/Microsoft.ML.Transforms/Text/LdaTransform.cs:684 in aae84eb.
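
The C# rule behind the comment above is that a member implementing an interface method implicitly has to be declared public, even when the containing type is internal. A minimal sketch, using hypothetical type names that are not taken from LdaTransform.cs; the explicit-implementation variant is shown only as a general language alternative, not as something this PR does:

```csharp
using System;

// Hypothetical type for illustration; not the class discussed in the review comment.
internal sealed class ExampleNativeState : IDisposable
{
    public bool IsDisposed { get; private set; }

    // An implicit interface implementation has to be public.
    // Declaring this as `internal void Dispose()` does not compile.
    public void Dispose()
    {
        IsDisposed = true;
    }
}

// Alternative: an explicit implementation is reachable only through the interface.
internal sealed class ExampleExplicitState : IDisposable
{
    public bool IsDisposed { get; private set; }

    void IDisposable.Dispose()
    {
        IsDisposed = true;
    }
}

public static class Program
{
    public static void Main()
    {
        var first = new ExampleNativeState();
        first.Dispose();

        var second = new ExampleExplicitState();
        // second.Dispose() would not compile; the call goes through the interface.
        ((IDisposable)second).Dispose();

        Console.WriteLine(first.IsDisposed && second.IsDisposed);
    }
}
```

Because both example types are internal, the public `Dispose` method is not part of the assembly's public surface either way.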

@abgoswam
Member Author

abgoswam commented Feb 5, 2019

    public static bool IsColumnTypeValid(ColumnType type) =>

fixed


In reply to: 460835302

Refers to: src/Microsoft.ML.Transforms/Text/StopWordsRemovingTransformer.cs:551 in aae84eb.

@abgoswam
Member Author

abgoswam commented Feb 5, 2019

    public static bool IsColumnTypeValid(ColumnType type) => (type.GetItemType() is TextType);

fixed


In reply to: 460836062

Refers to: src/Microsoft.ML.Transforms/Text/TextNormalizing.cs:453 in aae84eb.

@abgoswam
Member Author

abgoswam commented Feb 5, 2019

    public static bool IsColumnTypeValid(ColumnType type) => type.GetItemType() is TextType;

fixed


In reply to: 460836382

Refers to: src/Microsoft.ML.Transforms/Text/TokenizingByCharacters.cs:557 in aae84eb.

@abgoswam
Member Author

abgoswam commented Feb 5, 2019

    {

fixed


In reply to: 460836546

Refers to: src/Microsoft.ML.Transforms/Text/TextNormalizing.cs:501 in aae84eb.

@abgoswam
Member Author

abgoswam commented Feb 5, 2019

    {

fixed


In reply to: 460836602

Refers to: src/Microsoft.ML.Transforms/Text/StopWordsRemovingTransformer.cs:1077 in aae84eb.

@abgoswam
Member Author

abgoswam commented Feb 5, 2019

    {

fixed


In reply to: 460836637

Refers to: src/Microsoft.ML.Transforms/Text/StopWordsRemovingTransformer.cs:587 in aae84eb.

@abgoswam
Member Author

abgoswam commented Feb 5, 2019

    public static VersionInfo GetVersionInfo()

fixed


In reply to: 460837490

Refers to: src/Microsoft.ML.Transforms/Text/WordEmbeddingsExtractor.cs:79 in aae84eb.

@abgoswam
Member Author

abgoswam commented Feb 5, 2019

    }

fixed


In reply to: 460838030

Refers to: src/Microsoft.ML.Transforms/Text/WordEmbeddingsExtractor.cs:635 in aae84eb.

@abgoswam
Member Author

abgoswam commented Feb 5, 2019

    public abstract class ArgumentsBase : TransformInputBase

fixed


In reply to: 460838528

Refers to: src/Microsoft.ML.Transforms/Text/WordTokenizing.cs:67 in aae84eb.

@abgoswam
Member Author

abgoswam commented Feb 5, 2019

fixed


In reply to: 460838678

Refers to: src/Microsoft.ML.Transforms/Text/WordTokenizing.cs:409 in aae84eb.

@@ -42,7 +42,7 @@ internal sealed class ExtractorColumn : ManyToOneColumn

internal static class WordBagBuildingTransformer
{
-public sealed class Column : ManyToOneColumn
+internal sealed class Column : ManyToOneColumn
Contributor

@Ivanidzo4ka Ivanidzo4ka Feb 5, 2019

internal

It's already part of an internal class, so you no longer need to make the nested class internal :)
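
The point here is that a nested type can never be more accessible than the type that contains it, so a `public` class nested inside an `internal` class is effectively internal. A small sketch using reflection to show this; `ExampleWordBag` is a hypothetical name, not the actual `WordBagBuildingTransformer`:

```csharp
using System;

// Hypothetical container type for illustration only.
internal static class ExampleWordBag
{
    // Declared public, but the internal containing type caps its accessibility.
    public sealed class Column { }
}

public static class Program
{
    public static void Main()
    {
        Type column = typeof(ExampleWordBag.Column);

        // The declared accessibility of the nested type is public...
        Console.WriteLine(column.IsNestedPublic); // prints "True"

        // ...but it cannot be accessed by code outside the assembly.
        Console.WriteLine(column.IsVisible); // prints "False"
    }
}
```

Marking the nested class `internal` explicitly, as the diff does, is therefore harmless but redundant.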

Contributor

@artidoro artidoro left a comment

:shipit:

/// </summary>
public readonly string Name;
/// <summary>
/// Name of column to transform. If set to <see langword="null"/>, the value of the <see cref="Name"/> will be used as source.
Contributor

@Ivanidzo4ka Ivanidzo4ka Feb 5, 2019

/// Name of column to transform. If set to <see langword="null"/>, the value of the <see cref="Name"/> will be used as source.

As Zeeshan A pointed out in other PRs, you need to leave only the first sentence. It's always populated with a value. (Only if it's a property in ColumnInfo.)
I would suggest sweeping the whole PR for this. #Resolved

Member Author

Not quite sure what you mean.

Scrub the whole PR for what?

In reply to: 254093004

Contributor

If you added that comment to InputColumnName inside ColumnInfo anywhere else, you need to change it there as well.
If you didn't add it in other places, you just need to fix this one.

In reply to: 254094414

Member Author

So it should just say

        /// <summary>
        /// Name of column to transform. 
        /// </summary>

?


In reply to: 254094755

Contributor

@artidoro what do you put in them? Sorry, it feels like there are billions of similar PRs and it's hard for me to find, but I know that Artidoro addressed this in his PR recently.

In reply to: 254096126

Contributor

@artidoro artidoro Feb 6, 2019

Yes that's how I fixed it! #Resolved

@abgoswam abgoswam requested a review from sfilipi February 5, 2019 23:55
@Ivanidzo4ka
Contributor

    public static IDataTransform Create(IHostEnvironment env, NgramExtractorArguments extractorArgs, IDataView input,

Well, it's part of an internal class now, so I decided it's not worth the time to change the method modifiers :)

In reply to: 460846657

Refers to: src/Microsoft.ML.Transforms/Text/WordBagTransform.cs:369 in a81c3d2.

@@ -473,7 +473,7 @@ private void TextFeaturizationOn(string dataPath)
BagOfTrichar: r.Message.TokenizeIntoCharacters().ToNgrams(ngramLength: 3, weighting: NgramExtractingEstimator.WeightingCriteria.TfIdf),

// NLP pipeline 4: word embeddings.
-Embeddings: r.Message.NormalizeText().TokenizeText().WordEmbeddings(WordEmbeddingsExtractingTransformer.PretrainedModelKind.GloVeTwitter25D)
+Embeddings: r.Message.NormalizeText().TokenizeText().WordEmbeddings(WordEmbeddingsExtractingEstimator.PretrainedModelKind.GloVeTwitter25D)
Contributor

@Ivanidzo4ka Ivanidzo4ka Feb 6, 2019

NormalizeText

If you update any of the cookbooks, you also need to update https://github.com/dotnet/machinelearning/blob/master/docs/code/MlNetCookBook.md

I wish we had a way to autogenerate it somehow... #Resolved

Contributor

@Ivanidzo4ka Ivanidzo4ka left a comment

:shipit:

@abgoswam abgoswam merged commit 753f158 into dotnet:master Feb 6, 2019
@abgoswam abgoswam deleted the abgoswam/transform_estimator_api_lda branch February 20, 2019 16:58
@ghost ghost locked as resolved and limited conversation to collaborators Mar 24, 2022
3 participants