Converted listed text transforms into transformers/estimators. #953


Merged
zeahmed merged 14 commits into dotnet:master on Sep 21, 2018

Conversation

@zeahmed (Contributor) commented Sep 19, 2018

This PR fixes #950.

The following transforms were converted in this PR; a short usage sketch follows the list.

  • Stopwords Remover Transform
  • Text Normalizer Transform
  • Word Bag Transform
  • Word Hash Bag Transform
  • Ngram Transform
  • Ngram Hash Transform
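
As a rough sketch of what the converted API looks like after this change (illustrative only; `Env` and `data` stand for an environment and an IDataView created elsewhere, and the column names are taken from the tests below):

// Sketch: chain two of the converted estimators, fit them, and transform data.
var est = new WordBagEstimator(Env, "text", "bag_of_words")
    .Append(new WordHashBagEstimator(Env, "text", "bag_of_wordshash"));
var transformer = est.Fit(data);          // learns dictionaries / slot names from the data
var transformed = transformer.Transform(data);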


var est = new WordBagEstimator(Env, "text", "bag_of_words").
Append(new WordHashBagEstimator(Env, "text", "bag_of_wordshash"));
//TestEstimatorCore(est, data.AsDynamic, invalidInput: invalidData.AsDynamic);
@zeahmed (Author) commented Sep 19, 2018

//TestEstimatorCore(est, data.AsDynamic, invalidInput: invalidData.AsDynamic);

This method fails when the size of the output vector is determined by the incoming IDataView. The method calls Fit to learn the output schema, and an EmptyDataView is passed in. When fitting on EmptyDataView, the size is zero, which makes the output a variable-sized vector...:)

@Zruty0, we need to find a way to handle this case. #Resolved

@zeahmed (Author) commented Sep 19, 2018

We need to fix how the ngram output vector size is computed, i.e., ensure that the vector size is at least one at the following line.

types[iinfo] = new VectorType(NumberType.Float, _ngramMaps[iinfo].Count);
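
One minimal way to express that guard (a sketch of the idea only, not necessarily the fix that was adopted):

// Assumption: clamp the learned count so an empty fit still yields a fixed-size (length 1) vector.
types[iinfo] = new VectorType(NumberType.Float, Math.Max(1, _ngramMaps[iinfo].Count));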


In reply to: 218967284

Contributor commented:

Why not do this in this PR?


In reply to: 218989146

@zeahmed (Author) commented:

Actually, NgramTransform computes metadata based on the size of the ngrams. I need to look into this more deeply to enable it for EmptyDataView. I am thinking of opening an issue for this. What do you think?


In reply to: 219285825

@zeahmed (Author) commented:

I have created #969 for tracking this.


In reply to: 219287182

Contributor commented:

Is this an issue only with empty inputs? Is it possible to test at least the rest?


In reply to: 219315899

.Append(new TermEstimator(Env, "text", "terms"))
.Append(new NgramEstimator(Env, "terms", "ngrams"))
.Append(new NgramHashEstimator(Env, "terms", "ngramshash"));
//TestEstimatorCore(est, data.AsDynamic, invalidInput: invalidData.AsDynamic);
@zeahmed (Author) commented Sep 19, 2018

//TestEstimatorCore(est, data.AsDynamic, invalidInput: invalidData.AsDynamic);

Same; see the comment on the test above on line 143. #Resolved

@zeahmed (Author) commented:

I have created #969 for tracking this.


In reply to: 218967678

/// <param name="ordered">Whether the position of each source column should be included in the hash (when there are multiple source columns).</param>
/// <param name="invertHash">Limit the number of keys used to generate the slot name to this many. 0 means no invert hashing, -1 means no limit.</param>
/// <returns></returns>
public static Vector<float> BagofHashedWords(this Scalar<string> input,
@zeahmed (Author) commented Sep 19, 2018

BagofHashedWords

@TomFinley, @sfilipi, @Zruty0: Does this name sound good? #Resolved

Contributor commented:

I think ToBagOfHashedWords


In reply to: 218967973

@Zruty0 Zruty0 mentioned this pull request Sep 20, 2018
/// Stopword remover removes language-specific lists of stop words (most common words)
/// This is usually applied after tokenizing text, so it compares individual tokens
/// (case-insensitive comparison) to the stopwords.
/// </summary>
@zeahmed (Author) commented Sep 20, 2018

I am thinking of moving these XML comments into doc.xml. #Resolved

@sfilipi (Member) commented Sep 20, 2018

When I initially put doc.xml there, it was meant to be temporary, so that the docs could live with the runtime and the entry-point classes. The thought then was to remove them after CSharpApi goes away. I will take another survey on whether people actually prefer to have the docs. #Resolved

@zeahmed (Author) commented:

OK, then I will keep the comments as-is for now.


In reply to: 219248683

var est = data.MakeNewEstimator()
.Append(r => (
r.label,
normalized_text: r.text.Normalize(),
@Zruty0 commented Sep 20, 2018

Normalize

NormalizeText? Because we already have Normalize with a completely different meaning, operating on floats. #Closed

var est = data.MakeNewEstimator()
.Append(r => (
r.label,
bagofword: r.text.BagofWords(),
@Zruty0 commented Sep 20, 2018

BagofWords

ToBagOfWords #Closed
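
If both suggested renames were adopted, the static pipelines above would read roughly as follows (a sketch assuming the proposed NormalizeText and ToBagOfWords names):

var est = data.MakeNewEstimator()
    .Append(r => (
        r.label,
        normalized_text: r.text.NormalizeText(),
        bagofword: r.text.ToBagOfWords()));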

/// <param name="allLengths">Whether to include all ngram lengths up to <paramref name="ngramLength"/> or only <paramref name="ngramLength"/>.</param>
/// <param name="maxNumTerms">Maximum number of ngrams to store in the dictionary.</param>
/// <param name="weighting">Statistical measure used to evaluate how important a word is to a document in a corpus.</param>
/// <returns></returns>
@Zruty0 commented Sep 20, 2018

<returns></returns>

Remove the empty element. #Closed

}

/// <summary>
/// Produces a bag of counts of ngrams (sequences of consecutive words ) in a given text.
@Zruty0 commented Sep 20, 2018

text

It's not text; mention that this is typically applied to the output of tokenizers. #Resolved
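
A possible rewording of that summary along these lines (a sketch, not the wording that was actually merged):

/// <summary>
/// Produces a bag of counts of ngrams (sequences of consecutive words) in a given vector of tokens.
/// This is typically applied to the output of a tokenizer rather than to raw text.
/// </summary>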

@Zruty0 left a comment

:shipit:

}

/// <summary>
/// Produces a bag of counts of ngrams (sequences of consecutive words ) in a given tokenized text.
@Zruty0 commented Sep 20, 2018

Stray space before the closing parenthesis. #Resolved

@sfilipi (Member) commented Sep 21, 2018

    public TermEstimator(IHostEnvironment env, params TermTransform.ColumnInfo[] columns)

Internal or private? #WontFix


Refers to: src/Microsoft.ML.Data/Transforms/TermEstimator.cs:38 in 4659b80.

@@ -8,6 +8,8 @@
using Microsoft.ML.Runtime.Data;
using System;
using System.Collections.Generic;
using static Microsoft.ML.Runtime.TextAnalytics.StopWordsRemoverTransform;
@sfilipi (Member) commented Sep 21, 2018

Ctrl+R, Ctrl+G (sort the usings). #Resolved

@zeahmed (Author) commented Sep 21, 2018

This is sorted already...:)


In reply to: 219550492

@zeahmed (Author) commented Sep 21, 2018

    public TermEstimator(IHostEnvironment env, params TermTransform.ColumnInfo[] columns)

Actually, I cannot make it internal because it is being used in CategoricalTransform, which is in another assembly.


In reply to: 423589115


Refers to: src/Microsoft.ML.Data/Transforms/TermEstimator.cs:38 in 4659b80.

int skipLength = 0,
bool allLengths = true,
int maxNumTerms = 10000000,
NgramTransform.WeightingCriteria weighting = NgramTransform.WeightingCriteria.Tf) => new OutPipelineColumn(input, ngramLength, skipLength, allLengths, maxNumTerms, weighting);
@artidoro commented Sep 21, 2018

Maybe more readable if you go to a new line here? #Resolved
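
For example, the expression body could move to its own line (a formatting sketch only; behavior is unchanged):

    int skipLength = 0,
    bool allLengths = true,
    int maxNumTerms = 10000000,
    NgramTransform.WeightingCriteria weighting = NgramTransform.WeightingCriteria.Tf)
    => new OutPipelineColumn(input, ngramLength, skipLength, allLengths, maxNumTerms, weighting);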

text: ctx.LoadText(1)), hasHeader: true)
.Read(new MultiFileSource(sentimentDataPath));

var invalidData = TextLoader.CreateReader(Env, ctx => (
Contributor commented:

Why is the invalid data the same as the valid one?

@zeahmed (Author) commented:

It's not valid. Here, the text column is loaded as a float.


In reply to: 219582043
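
For context, the invalid reader loads the text column with a float loader, roughly like this (a sketch; the label column and the LoadBool/LoadFloat indices are assumptions rather than the exact test code):

var invalidData = TextLoader.CreateReader(Env, ctx => (
        label: ctx.LoadBool(0),
        text: ctx.LoadFloat(1)),   // text column read as float, which the text estimators reject
        hasHeader: true)
    .Read(new MultiFileSource(sentimentDataPath));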

@artidoro left a comment

:shipit:

@zeahmed zeahmed merged commit 9a6c384 into dotnet:master Sep 21, 2018
@zeahmed zeahmed deleted the texttransforms_2_estimators branch January 30, 2019 21:31
@ghost ghost locked as resolved and limited conversation to collaborators Mar 28, 2022
Successfully merging this pull request may close these issues.

Convert all text transforms into transformers/estimators.