Scrub n-gram hashing and n-gram #2898

Merged
merged 13 commits into from
Mar 13, 2019
Member

@wschin wschin commented Mar 9, 2019

One step closer to #2832. This PR only polishes NgramHashingTransform.

@wschin wschin self-assigned this Mar 9, 2019
@wschin wschin requested review from Ivanidzo4ka and sfilipi March 9, 2019 01:02
@wschin wschin force-pushed the scrub-hash-ngrams branch from 8b4285c to e58f879 Compare March 9, 2019 01:18
@wschin wschin changed the title Scrub n-gram hashing Scrub n-gram hashing and n-gram Mar 9, 2019
@wschin wschin force-pushed the scrub-hash-ngrams branch from 86b3aad to 062d70c Compare March 9, 2019 01:31
@@ -492,15 +492,15 @@ public bool Equals(Reconciler other)
/// <param name="ngramLength">Ngram length.</param>
/// <param name="skipLength">Maximum number of tokens to skip when constructing an ngram.</param>
/// <param name="allLengths">Whether to include all ngram lengths up to <paramref name="ngramLength"/> or only <paramref name="ngramLength"/>.</param>
/// <param name="maxNumTerms">Maximum number of ngrams to store in the dictionary.</param>
Contributor

@Ivanidzo4ka Ivanidzo4ka Mar 9, 2019

I think we're trying to move away from Terms. Ngrams is more appropriate. #Closed

Member Author

@wschin wschin Mar 9, 2019

Will do Ngrams. #Resolved
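
For readers unfamiliar with the parameters documented in the hunk above, here is a minimal sketch of what `ngramLength` and `allLengths` control. It is not the ML.NET implementation: `NgramSketch`, `Extract`, and the `|` separator are hypothetical names chosen for illustration, `skipLength` is deliberately left out, and the output order may differ from what the real transform produces.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration only; not the ML.NET NgramExtractingEstimator.
public static class NgramSketch
{
    // Returns contiguous n-grams joined with '|'.
    // allLengths == true  -> lengths 1..ngramLength
    // allLengths == false -> only ngramLength
    public static List<string> Extract(IReadOnlyList<string> tokens, int ngramLength, bool allLengths)
    {
        if (tokens == null)
            throw new ArgumentNullException(nameof(tokens));
        if (ngramLength < 1)
            throw new ArgumentOutOfRangeException(nameof(ngramLength));

        var result = new List<string>();
        int minLength = allLengths ? 1 : ngramLength;
        for (int length = minLength; length <= ngramLength; length++)
        {
            for (int start = 0; start + length <= tokens.Count; start++)
            {
                var parts = new string[length];
                for (int i = 0; i < length; i++)
                    parts[i] = tokens[start + i];
                result.Add(string.Join("|", parts));
            }
        }
        return result;
    }
}
```

With tokens `a`, `b`, `c` and `ngramLength: 2`, the sketch yields the two bigrams when `allLengths` is false, and the three unigrams plus the two bigrams when it is true.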

/// </summary>
/// <param name="input">The column to apply to.</param>
- /// <param name="hashBits">Number of bits to hash into. Must be between 1 and 30, inclusive.</param>
+ /// <param name="numberOfBits">Number of bits to hash into. Must be between 1 and 30, inclusive.</param>
Contributor

@Ivanidzo4ka Ivanidzo4ka Mar 9, 2019

It would be nice to align with @artidoro's changes in his hashing scrubbing. #Resolved

Contributor

@artidoro artidoro Mar 9, 2019

I renamed hashBits to numberOfHashBits. Which is better? #Resolved

Member Author

@wschin wschin Mar 11, 2019

The numberOfBits in a hashing algorithm defaults to the number of output bits, I think. #Resolved

Contributor

@artidoro artidoro Mar 11, 2019

After thinking about it a bit, I think it might be better to name it numberOfHashBits everywhere. numberOfBits works well for HashingTransformer which only does hashing, but sometimes, like in this case, there are settings that are relevant to hashing and some that are relevant to ngrams. I think it might be clearer for users to know that we are talking about the number of bits used for hashing. #Pending

Member Author

If so, we should rename maximumNumberOfIterations to maximumNumberOfTrainingIterations, ordered to orderedHashing, seed to hashSeed, right?


In reply to: 264369474 [](ancestors = 264369474)
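
To make the naming debate concrete, here is a rough sketch of what a "number of bits to hash into" setting typically controls: the size of the output space. This is an assumption-laden illustration, not ML.NET code. `HashBitsSketch`, `SlotCount`, and `ToSlot` are hypothetical names, and the actual hash function used by the transform is not shown.

```csharp
using System;

// Hypothetical illustration only; the real transform's hashing is not reproduced here.
public static class HashBitsSketch
{
    // Number of output slots for a given bit count.
    // The documented range is 1..30; one plausible reason for the upper bound
    // is that 1 << 30 still fits in a positive 32-bit int.
    public static int SlotCount(int numberOfBits)
    {
        if (numberOfBits < 1 || numberOfBits > 30)
            throw new ArgumentOutOfRangeException(nameof(numberOfBits), "Must be between 1 and 30, inclusive.");
        return 1 << numberOfBits;
    }

    // Maps an arbitrary 32-bit hash to a slot by keeping the low numberOfBits bits.
    public static int ToSlot(uint hash, int numberOfBits)
    {
        uint mask = (uint)SlotCount(numberOfBits) - 1;
        return (int)(hash & mask);
    }
}
```

Under this reading, the old default of 16 bits corresponds to 65,536 slots, which is why the setting matters for both collision rate and output vector size regardless of whether it is called `numberOfBits` or `numberOfHashBits`.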

@@ -585,13 +585,13 @@ public bool Equals(Reconciler other)
/// Text representations of original values are stored in the slot names of the metadata for the new column. Hashing, as such, can map many initial values to one.
/// <paramref name="invertHash"/> specifies the upper bound of the number of distinct input values mapping to a hash that should be retained.
/// <value>0</value> does not retain any input values. <value>-1</value> retains all input values mapping to each hash.</param>
- public static Vector<float> ToNgramsHash(this VarVector<Key<uint, string>> input,
- int hashBits = 16,
+ public static Vector<float> ApplyNgramHashing(this VarVector<Key<uint, string>> input,
Contributor

@Ivanidzo4ka Ivanidzo4ka Mar 9, 2019

ApplyNgramHashing [](start = 36, length = 17)

In the dynamic API it's Produce. #Resolved

Member Author

Sure.


In reply to: 263980246 [](ancestors = 263980246)

@codecov

codecov bot commented Mar 9, 2019

Codecov Report

Merging #2898 into master will increase coverage by 0.01%.
The diff coverage is 72.48%.

@@            Coverage Diff             @@
##           master    #2898      +/-   ##
==========================================
+ Coverage   72.19%    72.2%   +0.01%     
==========================================
  Files         796      796              
  Lines      142019   142023       +4     
  Branches    16044    16046       +2     
==========================================
+ Hits       102526   102549      +23     
+ Misses      35115    35090      -25     
- Partials     4378     4384       +6
Flag           Coverage Δ
#Debug         72.2% <72.48%> (+0.01%) ⬆️
#production    67.99% <70.07%> (+0.01%) ⬆️
#test          88.3% <100%> (ø) ⬆️

Impacted Files                                           Coverage Δ
...s/Scenarios/Api/CookbookSamples/CookbookSamples.cs    99.49% <100%> (ø) ⬆️
...icrosoft.ML.Functional.Tests/DataTransformation.cs    100% <100%> (ø) ⬆️
...rosoft.ML.StaticPipelineTesting/StaticPipeTests.cs    95.27% <100%> (ø) ⬆️
...oft.ML.Transforms/Text/TextFeaturizingEstimator.cs    88.7% <100%> (ø) ⬆️
...s/Api/CookbookSamples/CookbookSamplesDynamicApi.cs    93.62% <100%> (ø) ⬆️
src/Microsoft.ML.Transforms/Text/NgramTransform.cs       87.42% <71.05%> (-0.32%) ⬇️
...rc/Microsoft.ML.StaticPipe/TextStaticExtensions.cs    77.55% <75%> (ø) ⬆️
...soft.ML.Transforms/Text/WrappedTextTransformers.cs    93.63% <75%> (-1.65%) ⬇️
...soft.ML.Transforms/Text/NgramHashingTransformer.cs    88.77% <77.27%> (ø) ⬆️
src/Microsoft.ML.Transforms/Text/TextCatalog.cs          53.84% <8.33%> (+2.93%) ⬆️
... and 5 more

@@ -492,15 +492,15 @@ public bool Equals(Reconciler other)
/// <param name="ngramLength">Ngram length.</param>
/// <param name="skipLength">Maximum number of tokens to skip when constructing an ngram.</param>
/// <param name="allLengths">Whether to include all ngram lengths up to <paramref name="ngramLength"/> or only <paramref name="ngramLength"/>.</param>
- /// <param name="maxNumTerms">Maximum number of ngrams to store in the dictionary.</param>
+ /// <param name="maximumTermCount">Maximum number of ngrams to store in the dictionary.</param>
Member

@sfilipi sfilipi Mar 10, 2019

ngrams [](start = 61, length = 6)

Whatever you decide, keep it consistent with the description. #Resolved

Member Author

Yes, thanks!


In reply to: 264022515 [](ancestors = 264022515)

@@ -253,7 +253,7 @@ private static SequencePool[] Train(IHostEnvironment env, NgramExtractingEstimat
// Note: GetNgramIdFinderAdd will control how many ngrams of a specific length will
// be added (using lims[iinfo]), therefore we set slotLim to the maximum
helpers[iinfo] = new NgramBufferBuilder(ngramLength, skipLength, Utils.ArrayMaxSize,
- GetNgramIdFinderAdd(env, counts[iinfo], columns[iinfo].Limits, ngramMaps[iinfo], transformInfos[iinfo].RequireIdf));
+ GetNgramIdFinderAdd(env, counts[iinfo], columns[iinfo].MaximumTermCounts, ngramMaps[iinfo], transformInfos[iinfo].RequireIdf));
Member

@sfilipi sfilipi Mar 10, 2019

MaximumTermCounts [](start = 79, length = 17)

Should Term be plural here: 'MaximumTermsCounts'? #Resolved

Member Author

Ok.


In reply to: 264022560 [](ancestors = 264022560)

@@ -334,20 +334,20 @@ private static void AssertValid(IHostEnvironment env, int[] counts, ImmutableArr
env.Assert(count == pool.Count);
}

- private static NgramIdFinder GetNgramIdFinderAdd(IHostEnvironment env, int[] counts, ImmutableArray<int> lims, SequencePool pool, bool requireIdf)
+ private static NgramIdFinder GetNgramIdFinderAdd(IHostEnvironment env, int[] counts, IReadOnlyList<int> lims, SequencePool pool, bool requireIdf)
Member

@sfilipi sfilipi Mar 10, 2019

IReadOnlyList [](start = 93, length = 13)

Curious, why change this? #Resolved

Member Author

For consistency with tree's public interface and other places.


In reply to: 264022662 [](ancestors = 264022662)

@@ -809,10 +809,14 @@ public sealed class ColumnOptions
/// <summary>The weighting criteria.</summary>
public readonly WeightingCriteria Weighting;
/// <summary>
/// Underlying state of <see cref="MaximumTermCounts"/>.
/// </summary>
private readonly ImmutableArray<int> _maximumTermCounts;
Member

@sfilipi sfilipi Mar 10, 2019

_maximumTermCounts [](start = 49, length = 18)

Similar question: why introduce another field instead of keeping the public readonly ImmutableArray? #Resolved

Member Author

We have encouraged the use of IReadOnlyList since the new interface of trees was merged.


In reply to: 264024489 [](ancestors = 264024489)
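
The pattern under discussion, a private `ImmutableArray<int>` backing field exposed through a public `IReadOnlyList<int>` property, can be sketched as follows. This is a simplified stand-in rather than the actual `ColumnOptions` class: `ColumnOptionsSketch` and its constructor are hypothetical, and only the field and property names are taken from the hunk above.

```csharp
using System.Collections.Generic;
using System.Collections.Immutable;

// Hypothetical, reduced version of the pattern; not the real ColumnOptions.
public sealed class ColumnOptionsSketch
{
    // Underlying state of MaximumTermCounts.
    private readonly ImmutableArray<int> _maximumTermCounts;

    // Public surface uses the interface, so callers do not depend on
    // System.Collections.Immutable types.
    public IReadOnlyList<int> MaximumTermCounts => _maximumTermCounts;

    public ColumnOptionsSketch(params int[] maximumTermCounts)
    {
        // ImmutableArray.Create copies the input, so later changes to the
        // caller's array do not leak into this instance.
        _maximumTermCounts = ImmutableArray.Create(maximumTermCounts);
    }
}
```

One trade-off of this design is that reading the property boxes the `ImmutableArray<int>` struct into the interface, which is usually acceptable for small option lists like this one.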

@@ -467,7 +467,7 @@ private void TextFeaturizationOn(string dataPath)
BagOfBigrams: r.Message.NormalizeText().ToBagofHashedWords(ngramLength: 2, allLengths: false),

// NLP pipeline 3: bag of tri-character sequences with TF-IDF weighting.
- BagOfTrichar: r.Message.TokenizeIntoCharacters().ToNgrams(ngramLength: 3, weighting: NgramExtractingEstimator.WeightingCriteria.TfIdf),
+ BagOfTrichar: r.Message.TokenizeIntoCharacters().ProduceNgrams(ngramLength: 3, weighting: NgramExtractingEstimator.WeightingCriteria.TfIdf),
Contributor

@Ivanidzo4ka Ivanidzo4ka Mar 11, 2019

ProduceNgrams [](start = 69, length = 13)

Don't forget to update cookbook.md! #WontFix

Member Author

@wschin wschin Mar 11, 2019

There is no such example. Maybe the author didn't add it because it's static.


In reply to: 264406110 [](ancestors = 264406110)

@Ivanidzo4ka
Contributor

Ivanidzo4ka commented Mar 11, 2019

    /// [!code-csharp[LpNormalize](~/../docs/samples/docs/samples/Microsoft.ML.Samples/Dynamic/NgramExtraction.cs?range=1-5,11-74)]

Any chance you can fix this sample? It's broken right now.
This is what it has:
var charsTwoGramColumn = transformedData_twochars.GetColumn<VBuffer<float>>(transformedData_onechars.Schema["CharsUnigrams"]);
This is what it should have:
var charsTwoGramColumn = transformedData_twochars.GetColumn<VBuffer<float>>(transformedData_twochars.Schema["CharsTwograms"]); #Closed


Refers to: src/Microsoft.ML.Transforms/Text/TextCatalog.cs:202 in 4ffa12b. [](commit_id = 4ffa12b, deletion_comment = False)

@Ivanidzo4ka
Contributor

Ivanidzo4ka commented Mar 11, 2019

            if (NgramLength + SkipLength > NgramBufferBuilder.MaxSkipNgramLength)

I would also add a check that SkipLength is 0 when NgramLength is 1.
It doesn't make sense to skip elements if your n-gram length is only 1. #Resolved


Refers to: src/Microsoft.ML.Transforms/Text/NgramTransform.cs:854 in 758e3fc. [](commit_id = 758e3fc, deletion_comment = False)

@wschin
Member Author

wschin commented Mar 11, 2019

            if (NgramLength + SkipLength > NgramBufferBuilder.MaxSkipNgramLength)

I will add

                if (NgramLength == 1 && SkipLength != 0)
                    throw Contracts.ExceptUserArg(nameof(skipLength), $"Number of skips can only be zero when the maximum n-gram's length is one.");

In reply to: 471726071 [](ancestors = 471726071)


Refers to: src/Microsoft.ML.Transforms/Text/NgramTransform.cs:854 in 758e3fc. [](commit_id = 758e3fc, deletion_comment = False)

@wschin
Member Author

wschin commented Mar 11, 2019

    /// [!code-csharp[LpNormalize](~/../docs/samples/docs/samples/Microsoft.ML.Samples/Dynamic/NgramExtraction.cs?range=1-5,11-74)]

Yep.


In reply to: 471725225 [](ancestors = 471725225)


Refers to: src/Microsoft.ML.Transforms/Text/TextCatalog.cs:202 in 758e3fc. [](commit_id = 758e3fc, deletion_comment = False)

@wschin wschin force-pushed the scrub-hash-ngrams branch from a405e01 to 4ffa12b Compare March 11, 2019 22:32
@Ivanidzo4ka
Contributor

            if (NgramLength + SkipLength > NgramBufferBuilder.MaxSkipNgramLength)

SkipLength can be used only if ngramLength is more than one.
The condition should be:
if (ngramLength == 1 && skipLength > 0)


In reply to: 471755627 [](ancestors = 471755627,471726071)


Refers to: src/Microsoft.ML.Transforms/Text/NgramTransform.cs:854 in 4ffa12b. [](commit_id = 4ffa12b, deletion_comment = False)

@wschin
Member Author

wschin commented Mar 11, 2019

            if (NgramLength + SkipLength > NgramBufferBuilder.MaxSkipNgramLength)

Just change

                if (NgramLength == 1 && SkipLength != 0)
                    throw Contracts.ExceptUserArg(nameof(skipLength), $"Number of skips can only be one when the maximum n-gram's length is one.");

to

                if (NgramLength == 1 && SkipLength != 0)
                    throw Contracts.ExceptUserArg(nameof(skipLength), $"Number of skips can only be zero when the maximum n-gram's length is one.");

Does my new description look better?


In reply to: 471774742 [](ancestors = 471774742,471755627,471726071)


Refers to: src/Microsoft.ML.Transforms/Text/NgramTransform.cs:854 in 758e3fc. [](commit_id = 758e3fc, deletion_comment = False)

@Ivanidzo4ka
Contributor

            if (NgramLength + SkipLength > NgramBufferBuilder.MaxSkipNgramLength)

Don't use "maximum" next to the n-gram's length. Suggested wording:
nameof(skipLength) can only be zero when nameof(ngramLength) is set to one.


In reply to: 471778439 [](ancestors = 471778439,471774742,471755627,471726071)


Refers to: src/Microsoft.ML.Transforms/Text/NgramTransform.cs:854 in 4ffa12b. [](commit_id = 4ffa12b, deletion_comment = False)
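
Pulling the thread together, the validation being negotiated could look roughly like the sketch below. It is a stand-alone approximation, not the code that was merged: `NgramOptionsValidator` is a hypothetical class, plain `ArgumentException` stands in for `Contracts.ExceptUserArg`, and the value `10` for `MaxSkipNgramLength` is an assumption made for this example rather than something stated in the thread.

```csharp
using System;

// Hypothetical stand-alone validator; the real check lives in NgramTransform.cs.
public static class NgramOptionsValidator
{
    // Assumed value for illustration; the thread only names
    // NgramBufferBuilder.MaxSkipNgramLength without giving its value.
    public const int MaxSkipNgramLength = 10;

    public static void Validate(int ngramLength, int skipLength)
    {
        if (ngramLength < 1)
            throw new ArgumentOutOfRangeException(nameof(ngramLength), "Must be at least one.");
        if (skipLength < 0)
            throw new ArgumentOutOfRangeException(nameof(skipLength), "Must be non-negative.");

        // With a single-token n-gram there is nothing to skip between tokens.
        if (ngramLength == 1 && skipLength != 0)
            throw new ArgumentException(
                $"{nameof(skipLength)} can only be zero when {nameof(ngramLength)} is set to one.",
                nameof(skipLength));

        // Existing check quoted throughout the thread.
        if (ngramLength + skipLength > MaxSkipNgramLength)
            throw new ArgumentException(
                $"The sum of {nameof(ngramLength)} and {nameof(skipLength)} must not exceed {MaxSkipNgramLength}.",
                nameof(skipLength));
    }
}
```

Because negative skip lengths are rejected earlier in this sketch, `skipLength != 0` and `skipLength > 0` are equivalent at the point of the unigram check, which is the distinction the reviewers were circling.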

Contributor

@Ivanidzo4ka Ivanidzo4ka left a comment

:shipit:

@wschin
Member Author

wschin commented Mar 11, 2019

            if (NgramLength + SkipLength > NgramBufferBuilder.MaxSkipNgramLength)

Ok.


In reply to: 471779390 [](ancestors = 471779390,471778439,471774742,471755627,471726071)


Refers to: src/Microsoft.ML.Transforms/Text/NgramTransform.cs:854 in 758e3fc. [](commit_id = 758e3fc, deletion_comment = False)

@@ -353,7 +353,7 @@ private static IDataTransform Create(IHostEnvironment env, Options options, IDat
item.NgramLength ?? options.NgramLength,
item.SkipLength ?? options.SkipLength,
item.AllLengths ?? options.AllLengths,
- item.HashBits ?? options.HashBits,
+ item.NumberOfBits ?? options.NumberOfBits,
item.Seed ?? options.Seed,
item.Ordered ?? options.Ordered,
Member Author

@wschin wschin Mar 12, 2019

Ordered [](start = 29, length = 7)

UseOrderedHashing #Resolved

Member Author

Fixed in Iteration 9.


In reply to: 264917459 [](ancestors = 264917459)

@Ivanidzo4ka
Contributor

Ivanidzo4ka commented Mar 12, 2019

        public readonly int InvertHash;

It would be nice to align with @artidoro's changes in HashingEstimator. #Resolved


Refers to: src/Microsoft.ML.Transforms/Text/NgramHashingTransformer.cs:902 in d1d2e66. [](commit_id = d1d2e66, deletion_comment = False)

@wschin wschin force-pushed the scrub-hash-ngrams branch from b2fb038 to 97469bc Compare March 12, 2019 23:52
/// <param name="invertHash">During hashing we construct mappings between original values and the produced hash values.
/// Text representations of original values are stored in the slot names of the metadata for the new column. Hashing, as such, can map many initial values to one.
/// <paramref name="invertHash"/> specifies the upper bound of the number of distinct input values mapping to a hash that should be retained.
/// <value>0</value> does not retain any input values. <value>-1</value> retains all input values mapping to each hash.</param>
public static NgramHashingEstimator ProduceHashedNgrams(this TransformsCatalog.TextTransforms catalog,
string outputColumnName,
string inputColumnName = null,
- int hashBits = NgramHashingEstimator.Defaults.HashBits,
+ int numberOfBits = NgramHashingEstimator.Defaults.NumberOfBits,
int ngramLength = NgramHashingEstimator.Defaults.NgramLength,
int skipLength = NgramHashingEstimator.Defaults.SkipLength,
bool allLengths = NgramHashingEstimator.Defaults.AllLengths,
Contributor

@artidoro artidoro Mar 13, 2019

allLengths [](start = 17, length = 10)

useAllLengths? (here and in other places with this argument) #Resolved

Member Author

This is another 1hr assignment. Just finished it.


In reply to: 264949611 [](ancestors = 264949611)

@@ -209,31 +208,10 @@ public static class TextCatalog
int ngramLength = NgramExtractingEstimator.Defaults.NgramLength,
int skipLength = NgramExtractingEstimator.Defaults.SkipLength,
bool allLengths = NgramExtractingEstimator.Defaults.AllLengths,
- int maxNumTerms = NgramExtractingEstimator.Defaults.MaxNumTerms,
+ int maximumNgramsCounts = NgramExtractingEstimator.Defaults.MaximumNgramsCount,
Contributor

@artidoro artidoro Mar 13, 2019

maximumNgramsCounts [](start = 16, length = 19)

Should be maximumNgramsCount #Resolved

Contributor

@artidoro artidoro left a comment

Unfortunately, I think you need to merge some conflicts, but it looks good! Just a few comments.

@wschin
Member Author

wschin commented Mar 13, 2019

@artidoro, hundreds of conflicts.....

@wschin wschin merged commit 9d9a3d9 into dotnet:master Mar 13, 2019
@wschin wschin deleted the scrub-hash-ngrams branch March 13, 2019 06:55
@ghost ghost locked as resolved and limited conversation to collaborators Mar 23, 2022
4 participants