-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Converted listed text transforms into transformers/estimators. #953
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…learning into feature/transform-wrapper
…to texttransforms_2_estimators
|
||
var est = new WordBagEstimator(Env, "text", "bag_of_words"). | ||
Append(new WordHashBagEstimator(Env, "text", "bag_of_wordshash")); | ||
//TestEstimatorCore(est, data.AsDynamic, invalidInput: invalidData.AsDynamic); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
//TestEstimatorCore(est, data.AsDynamic, invalidInput: invalidData.AsDynamic); [](start = 12, length = 78)
This method is failing when the size of output vector is determined by the incoming IDataView. The method calls the fit method to learn the output schema where EmptyDataView
is passed. When fitting on EmptyDataView
, the size is zero which is making the output vector a variable sized vector...:)
@Zruty0, Need to find a way to handle this case. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to fix how the ngram output vector size is computed i.e. need to ensure that vector size is at least one at the following line.
types[iinfo] = new VectorType(NumberType.Float, _ngramMaps[iinfo].Count); |
In reply to: 218967284 [](ancestors = 218967284)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, NgramTransform is computing metadata based on the size of ngrams. I need to deeply look into this to enable it for EmptyDataView. I am thinking to open an issue against this. What do you think?
In reply to: 219285825 [](ancestors = 219285825,218989146,218967284)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this an issue on empty inputs? Is it possible to test at least the rest?
In reply to: 219315899 [](ancestors = 219315899,219287182,219285825,218989146,218967284)
.Append(new TermEstimator(Env, "text", "terms")) | ||
.Append(new NgramEstimator(Env, "terms", "ngrams")) | ||
.Append(new NgramHashEstimator(Env, "terms", "ngramshash")); | ||
//TestEstimatorCore(est, data.AsDynamic, invalidInput: invalidData.AsDynamic); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
//TestEstimatorCore(est, data.AsDynamic, invalidInput: invalidData.AsDynamic); [](start = 12, length = 78)
same. see the comment on the test above on line 143. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// <param name="ordered">Whether the position of each source column should be included in the hash (when there are multiple source columns).</param> | ||
/// <param name="invertHash">Limit the number of keys used to generate the slot name to this many. 0 means no invert hashing, -1 means no limit.</param> | ||
/// <returns></returns> | ||
public static Vector<float> BagofHashedWords(this Scalar<string> input, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BagofHashedWords [](start = 36, length = 16)
@TomFinley, @sfilipi, @Zruty0: Does this name sound good? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// Stopword remover removes language-specific lists of stop words (most common words) | ||
/// This is usually applied after tokenizing text, so it compares individual tokens | ||
/// (case-insensitive comparison) to the stopwords. | ||
/// </summary> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am thinking to move these XML comments into doc.xml. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I put the doc.xml there, initially, it was temporarely, so that the docs can be on the runtime, and the entry point classes. the thought then was to remove them, after the CSharpApi would go. Will take another survey on whether people actually prefer to have the docs. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
var est = data.MakeNewEstimator() | ||
.Append(r => ( | ||
r.label, | ||
normalized_text: r.text.Normalize(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Normalize [](start = 44, length = 9)
NormalizeText? Because we have Normalize with a completely different meaning, working on floats #Closed
var est = data.MakeNewEstimator() | ||
.Append(r => ( | ||
r.label, | ||
bagofword: r.text.BagofWords(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BagofWords [](start = 38, length = 10)
ToBagOfWords #Closed
/// <param name="allLengths">Whether to include all ngram lengths up to <paramref name="ngramLength"/> or only <paramref name="ngramLength"/>.</param> | ||
/// <param name="maxNumTerms">Maximum number of ngrams to store in the dictionary.</param> | ||
/// <param name="weighting">Statistical measure used to evaluate how important a word is to a document in a corpus.</param> | ||
/// <returns></returns> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[](start = 12, length = 19)
remove empty #Closed
} | ||
|
||
/// <summary> | ||
/// Produces a bag of counts of ngrams (sequences of consecutive words ) in a given text. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
text [](start = 92, length = 4)
it's not a text. Mention that it typically is applied to the output of tokenizers #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
} | ||
|
||
/// <summary> | ||
/// Produces a bag of counts of ngrams (sequences of consecutive words ) in a given tokenized text. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[](start = 78, length = 1)
stray space #Resolved
@@ -8,6 +8,8 @@ | |||
using Microsoft.ML.Runtime.Data; | |||
using System; | |||
using System.Collections.Generic; | |||
using static Microsoft.ML.Runtime.TextAnalytics.StopWordsRemoverTransform; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ctrl+R+G #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I cannot do it In reply to: 423589115 [](ancestors = 423589115) Refers to: src/Microsoft.ML.Data/Transforms/TermEstimator.cs:38 in 4659b80. [](commit_id = 4659b80, deletion_comment = False) |
int skipLength = 0, | ||
bool allLengths = true, | ||
int maxNumTerms = 10000000, | ||
NgramTransform.WeightingCriteria weighting = NgramTransform.WeightingCriteria.Tf) => new OutPipelineColumn(input, ngramLength, skipLength, allLengths, maxNumTerms, weighting); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe more readable if you go to a new line here? #Resolved
text: ctx.LoadText(1)), hasHeader: true) | ||
.Read(new MultiFileSource(sentimentDataPath)); | ||
|
||
var invalidData = TextLoader.CreateReader(Env, ctx => ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is invalid data the same as the valid one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Its not valid. Here text is tried to be loaded as float.
In reply to: 219582043 [](ancestors = 219582043)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR fixes #950.
The following transforms were converted in this PR.