Scrub n-gram hashing and n-gram #2898
8b4285c to e58f879
86b3aad to 062d70c
@@ -492,15 +492,15 @@ public bool Equals(Reconciler other)
/// <param name="ngramLength">Ngram length.</param>
/// <param name="skipLength">Maximum number of tokens to skip when constructing an ngram.</param>
/// <param name="allLengths">Whether to include all ngram lengths up to <paramref name="ngramLength"/> or only <paramref name="ngramLength"/>.</param>
/// <param name="maxNumTerms">Maximum number of ngrams to store in the dictionary.</param>
I think we are trying to move away from Terms. Ngrams is more appropriate. #Closed
Will do Ngrams. #Resolved
 /// </summary>
 /// <param name="input">The column to apply to.</param>
-/// <param name="hashBits">Number of bits to hash into. Must be between 1 and 30, inclusive.</param>
+/// <param name="numberOfBits">Number of bits to hash into. Must be between 1 and 30, inclusive.</param>
It would be nice to align with @artidoro's changes in his hashing scrubbing. #Resolved
I renamed hashBits to numberOfHashBits. Which is better? #Resolved
The numberOfBits in a hashing algorithm defaults to the number of output bits, I think. #Resolved
After thinking about it a bit, I think it might be better to name it numberOfHashBits everywhere. numberOfBits works well for HashingTransformer, which only does hashing, but sometimes, as in this case, some settings are relevant to hashing and others to ngrams. I think it would be clearer for users to know that we are talking about the number of bits used for hashing. #Pending
If so, we should rename maximumNumberOfIterations to maximumNumberOfTrainingIterations, ordered to orderedHashing, seed to hashSeed, right?
In reply to: 264369474
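For readers following the naming thread, here is a minimal sketch of what a "number of hash bits" setting controls: a bit count of b leaves 2^b possible slots. This is not ML.NET's implementation; HashBitsDemo, SlotCount, and ToSlot are hypothetical names, and the range check simply mirrors the "between 1 and 30, inclusive" wording in the doc comment quoted above.

```csharp
using System;

// Hypothetical helper, not part of ML.NET: illustrates what a "number of hash bits" setting controls.
public static class HashBitsDemo
{
    // Number of distinct slots for a given bit count.
    // The 1..30 range is an assumption taken from the doc comment in the diff.
    public static int SlotCount(int numberOfBits)
    {
        if (numberOfBits < 1 || numberOfBits > 30)
            throw new ArgumentOutOfRangeException(nameof(numberOfBits), "Must be between 1 and 30, inclusive.");
        return 1 << numberOfBits;
    }

    // Reduce a 32-bit hash to a slot index by keeping only the low numberOfBits bits.
    public static int ToSlot(uint hash, int numberOfBits)
    {
        uint mask = (uint)SlotCount(numberOfBits) - 1;
        return (int)(hash & mask);
    }
}
```

With this reading, a name such as numberOfHashBits makes explicit that the value sizes the hash space and does not describe the ngrams themselves.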
@@ -585,13 +585,13 @@ public bool Equals(Reconciler other)
 /// Text representation of original values are stored in the slot names of the metadata for the new column.Hashing, as such, can map many initial values to one.
 /// <paramref name="invertHash"/> specifies the upper bound of the number of distinct input values mapping to a hash that should be retained.
 /// <value>0</value> does not retain any input values. <value>-1</value> retains all input values mapping to each hash.</param>
-public static Vector<float> ToNgramsHash(this VarVector<Key<uint, string>> input,
-int hashBits = 16,
+public static Vector<float> ApplyNgramHashing(this VarVector<Key<uint, string>> input,
ApplyNgramHashing: in the dynamic API it's Produce. #Resolved
Codecov Report
@@            Coverage Diff             @@
##           master    #2898      +/-   ##
==========================================
+ Coverage   72.19%    72.2%   +0.01%
==========================================
  Files         796      796
  Lines      142019   142023       +4
  Branches    16044    16046       +2
==========================================
+ Hits       102526   102549      +23
+ Misses      35115    35090      -25
- Partials     4378     4384       +6
@@ -492,15 +492,15 @@ public bool Equals(Reconciler other)
 /// <param name="ngramLength">Ngram length.</param>
 /// <param name="skipLength">Maximum number of tokens to skip when constructing an ngram.</param>
 /// <param name="allLengths">Whether to include all ngram lengths up to <paramref name="ngramLength"/> or only <paramref name="ngramLength"/>.</param>
-/// <param name="maxNumTerms">Maximum number of ngrams to store in the dictionary.</param>
+/// <param name="maximumTermCount">Maximum number of ngrams to store in the dictionary.</param>
ngrams: whatever you decide, keep the parameter name consistent with the description. #Resolved
@@ -253,7 +253,7 @@ private static SequencePool[] Train(IHostEnvironment env, NgramExtractingEstimat
 // Note: GetNgramIdFinderAdd will control how many ngrams of a specific length will
 // be added (using lims[iinfo]), therefore we set slotLim to the maximum
 helpers[iinfo] = new NgramBufferBuilder(ngramLength, skipLength, Utils.ArrayMaxSize,
-GetNgramIdFinderAdd(env, counts[iinfo], columns[iinfo].Limits, ngramMaps[iinfo], transformInfos[iinfo].RequireIdf));
+GetNgramIdFinderAdd(env, counts[iinfo], columns[iinfo].MaximumTermCounts, ngramMaps[iinfo], transformInfos[iinfo].RequireIdf));
MaximumTermCounts: should Term be plural here, as in 'MaximumTermsCounts'? #Resolved
@@ -334,20 +334,20 @@ private static void AssertValid(IHostEnvironment env, int[] counts, ImmutableArr
 env.Assert(count == pool.Count);
 }

-private static NgramIdFinder GetNgramIdFinderAdd(IHostEnvironment env, int[] counts, ImmutableArray<int> lims, SequencePool pool, bool requireIdf)
+private static NgramIdFinder GetNgramIdFinderAdd(IHostEnvironment env, int[] counts, IReadOnlyList<int> lims, SequencePool pool, bool requireIdf)
IReadOnlyList: curious, why change this? #Resolved
For consistency with the trees' public interface and other places.
In reply to: 264022662
@@ -809,10 +809,14 @@ public sealed class ColumnOptions
/// <summary>The weighting criteria.</summary>
public readonly WeightingCriteria Weighting;
/// <summary>
/// Underlying state of <see cref="MaximumTermCounts"/>.
/// </summary>
private readonly ImmutableArray<int> _maximumTermCounts;
_maximumTermCounts: similar question, why not keep the public readonly ImmutableArray instead of introducing another field? #Resolved
We encourage the use of IReadOnlyList since the new interface of trees was merged.
In reply to: 264024489
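The pattern being discussed (a private ImmutableArray backing field exposed publicly as IReadOnlyList) can be sketched roughly as follows. ColumnOptionsSketch is a hypothetical class loosely modeled on the diff above, not the actual ML.NET type, and the constructor shape is an assumption.

```csharp
using System.Collections.Generic;
using System.Collections.Immutable;

// Hypothetical class, loosely modeled on the ColumnOptions diff above; not the actual ML.NET type.
public sealed class ColumnOptionsSketch
{
    // Underlying state: immutable, so it cannot change after construction.
    private readonly ImmutableArray<int> _maximumTermCounts;

    // Public view typed as an interface, which keeps the concrete collection type out of the API surface.
    public IReadOnlyList<int> MaximumTermCounts => _maximumTermCounts;

    public ColumnOptionsSketch(IEnumerable<int> maximumTermCounts)
    {
        // Copy the input so later changes to the caller's collection are not observed.
        _maximumTermCounts = maximumTermCounts.ToImmutableArray();
    }
}
```

One trade-off of this sketch: ImmutableArray<int> is a struct, so returning it through the interface boxes it on each property access.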
@@ -467,7 +467,7 @@ private void TextFeaturizationOn(string dataPath)
 BagOfBigrams: r.Message.NormalizeText().ToBagofHashedWords(ngramLength: 2, allLengths: false),

 // NLP pipeline 3: bag of tri-character sequences with TF-IDF weighting.
-BagOfTrichar: r.Message.TokenizeIntoCharacters().ToNgrams(ngramLength: 3, weighting: NgramExtractingEstimator.WeightingCriteria.TfIdf),
+BagOfTrichar: r.Message.TokenizeIntoCharacters().ProduceNgrams(ngramLength: 3, weighting: NgramExtractingEstimator.WeightingCriteria.TfIdf),
ProduceNgrams: Don't forget to update cookbook.md! #WontFix
There is no such example. Maybe the author didn't add it because it's static.
In reply to: 264406110
Any chance you can fix this sample?
Refers to: src/Microsoft.ML.Transforms/Text/TextCatalog.cs:202 in 4ffa12b.
I would also add a check that if NgramLength == 1 then SkipLength == 0.
Refers to: src/Microsoft.ML.Transforms/Text/NgramTransform.cs:854 in 758e3fc.
I will add
if (NgramLength == 1 && SkipLength != 0)
    throw Contracts.ExceptUserArg(nameof(skipLength), $"Number of skips can only be zero when the maximum n-gram's length is one.");
In reply to: 471726071
Refers to: src/Microsoft.ML.Transforms/Text/NgramTransform.cs:854 in 758e3fc.
Yep.
In reply to: 471725225
Refers to: src/Microsoft.ML.Transforms/Text/TextCatalog.cs:202 in 758e3fc.
a405e01 to 4ffa12b
In reply to: 471755627
Refers to: src/Microsoft.ML.Transforms/Text/NgramTransform.cs:854 in 4ffa12b.
Just changed
if (NgramLength == 1 && SkipLength != 0)
    throw Contracts.ExceptUserArg(nameof(skipLength), $"Number of skips can only be one when the maximum n-gram's length is one.");
to
if (NgramLength == 1 && SkipLength != 0)
    throw Contracts.ExceptUserArg(nameof(skipLength), $"Number of skips can only be zero when the maximum n-gram's length is one.");
Does my new description look better?
In reply to: 471774742
Refers to: src/Microsoft.ML.Transforms/Text/NgramTransform.cs:854 in 758e3fc.
don't use
In reply to: 471778439
Refers to: src/Microsoft.ML.Transforms/Text/NgramTransform.cs:854 in 4ffa12b.
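As a self-contained illustration of the check agreed on in this thread, the sketch below rejects a non-zero skip length when the n-gram length is one. NgramOptionsValidator is a hypothetical name, and ArgumentException stands in for ML.NET's Contracts.ExceptUserArg helper, which is not reproduced here.

```csharp
using System;

// Hypothetical stand-in for the check discussed above.
// Uses ArgumentException in place of ML.NET's Contracts helper (an assumption for illustration).
public static class NgramOptionsValidator
{
    public static void Validate(int ngramLength, int skipLength)
    {
        // A unigram has no gaps between tokens, so a non-zero skip length cannot apply.
        if (ngramLength == 1 && skipLength != 0)
            throw new ArgumentException(
                "Number of skips can only be zero when the maximum n-gram's length is one.",
                nameof(skipLength));
    }
}
```

The message is a plain string literal because it contains nothing to interpolate.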
@@ -353,7 +353,7 @@ private static IDataTransform Create(IHostEnvironment env, Options options, IDat
 item.NgramLength ?? options.NgramLength,
 item.SkipLength ?? options.SkipLength,
 item.AllLengths ?? options.AllLengths,
-item.HashBits ?? options.HashBits,
+item.NumberOfBits ?? options.NumberOfBits,
 item.Seed ?? options.Seed,
 item.Ordered ?? options.Ordered,
Ordered: consider UseOrderedHashing. #Resolved
b2fb038 to 97469bc
 /// <param name="invertHash">During hashing we constuct mappings between original values and the produced hash values.
 /// Text representation of original values are stored in the slot names of the metadata for the new column.Hashing, as such, can map many initial values to one.
 /// <paramref name="invertHash"/> specifies the upper bound of the number of distinct input values mapping to a hash that should be retained.
 /// <value>0</value> does not retain any input values. <value>-1</value> retains all input values mapping to each hash.</param>
 public static NgramHashingEstimator ProduceHashedNgrams(this TransformsCatalog.TextTransforms catalog,
 string outputColumnName,
 string inputColumnName = null,
-int hashBits = NgramHashingEstimator.Defaults.HashBits,
+int numberOfBits = NgramHashingEstimator.Defaults.NumberOfBits,
 int ngramLength = NgramHashingEstimator.Defaults.NgramLength,
 int skipLength = NgramHashingEstimator.Defaults.SkipLength,
 bool allLengths = NgramHashingEstimator.Defaults.AllLengths,
allLengths: useAllLengths? (here and in other places with this argument) #Resolved
@@ -209,31 +208,10 @@ public static class TextCatalog
 int ngramLength = NgramExtractingEstimator.Defaults.NgramLength,
 int skipLength = NgramExtractingEstimator.Defaults.SkipLength,
 bool allLengths = NgramExtractingEstimator.Defaults.AllLengths,
-int maxNumTerms = NgramExtractingEstimator.Defaults.MaxNumTerms,
+int maximumNgramsCounts = NgramExtractingEstimator.Defaults.MaximumNgramsCount,
maximumNgramsCounts: Should be maximumNgramsCount. #Resolved
Unfortunately I think you need to merge some conflicts, but it looks good! Just a few comments
@artidoro, hundreds of conflicts.....
One step closer to #2832. This PR only polishes NgramHashingTransform.