Skip to content

Hash estimator #944

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
Sep 20, 2018
Merged

Conversation

Ivanidzo4ka
Copy link
Contributor

@Zruty0 you can pick up this one


[assembly: LoadableClass(typeof(IRowMapper), typeof(HashTransform), null, typeof(SignatureLoadRowMapper),
"Hash Transform", HashTransform.LoaderSignature)]

namespace Microsoft.ML.Runtime.Data
Copy link
Contributor

@Zruty0 Zruty0 Sep 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Microsoft.ML.Runtime.Data [](start = 10, length = 25)

let's get into habit of setting the correct namespace and naming.

  1. namespace Microsoft.ML.Transforms
  2. Estimators should be named the same as the Legacy ones as appropriate, with no Estimator suffix. In this case, it should be HashConverter.
  3. Transformers should be named the same as estimators, with a Transformer suffix: HashConverterTransformer. #Closed

ctx.Writer.WriteBoolByte(Ordered);
}
}

private static string TestType(ColumnType type)
public static string TestType(ColumnType type)
Copy link
Contributor

@Zruty0 Zruty0 Sep 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

public [](start = 8, length = 6)

why public? If it's public, it has to be documented #Closed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, I see now, you are using the same code in the estimator. I think you should turn it into public static bool IsColumnTypeValid(ColumnType type) and internal const string ExpectedColumnType to feed to the exceptions


In reply to: 218906546 [](ancestors = 218906546)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this signature is easier to document, and actually helpful to the consumer. Might as well also move it to the estimator.


In reply to: 218910693 [](ancestors = 218910693,218906546)

TextModelHelper.LoadAll(Host, ctx, Infos.Length, out _keyValues, out _kvTypes);
SetMetadata();
var type = inputSchema.GetColumnType(srcCol);
string reason = TestType(type);
Copy link
Contributor

@Zruty0 Zruty0 Sep 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TestType [](start = 28, length = 8)

In this case, I think we should make it just part of CheckInputColumn and throw ExceptSchemaMismatch. #Closed

_exes[iinfo].Save(ctx);

TextModelHelper.SaveAll(Host, ctx, Infos.Length, _keyValues);
Contracts.CheckValue(columns, nameof(columns));
Copy link
Contributor

@Zruty0 Zruty0 Sep 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CheckValue [](start = 22, length = 10)

CheckNonEmpty? #Closed


// <prefix handled in static Create method>
// <base>
// Exes
Copy link
Contributor

@Zruty0 Zruty0 Sep 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exes [](start = 15, length = 4)

rename #Closed

public HashTransform(IHostEnvironment env, Arguments args, IDataView input)
: base(Contracts.CheckRef(env, nameof(env)), RegistrationName, env.CheckRef(args, nameof(args)).Column,
input, TestType)
public HashTransform(IHostEnvironment env, IDataView input, ColumnInfo[] columns) :
Copy link
Contributor

@Zruty0 Zruty0 Sep 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HashTransform [](start = 15, length = 13)

can I have a convenience ctor with only one column pair? #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have it in estimator, why you want it in transformer?


In reply to: 218907955 [](ancestors = 218907955)

Copy link
Contributor

@Zruty0 Zruty0 Sep 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I want users to be able to create transformer directly with no input.
Can we have two ctors? internal one that you have now, and also public HashTransform(env, columns) that checks that invertHash is not used in any of the columns?

I do NOT want users to be able to train transformers directly without calling Fit.


In reply to: 218909081 [](ancestors = 218909081,218907955)

@@ -1258,5 +1284,65 @@ public override void Process()
}
}
}

public sealed class HashEstimator : IEstimator<HashTransform>
Copy link
Contributor

@Zruty0 Zruty0 Sep 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HashEstimator [](start = 24, length = 13)

summary comments for the estimator and its ctors #Resolved

foreach (var col in _columns)
{
if (col.InvertHash < -1)
throw _host.ExceptParam(nameof(columns), "Value too small, must be -1 or larger");
Copy link
Contributor

@Zruty0 Zruty0 Sep 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ExceptParam [](start = 32, length = 11)

why is it not part of ctor for ColumnInfo? #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had same thoughts, mainly because we don't have IExceptionContext in ColumnInfo, but yeah


In reply to: 218908992 [](ancestors = 218908992)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, I would be ok with losing context info here for the sake of localizing the validation


In reply to: 218909546 [](ancestors = 218909546,218908992)

metadata.Add(slotMeta);
if (colInfo.InvertHash != 0)
metadata.Add(new SchemaShape.Column(MetadataUtils.Kinds.KeyValues, SchemaShape.Column.VectorKind.Vector, TextType.Instance, false));
result[colInfo.Output] = new SchemaShape.Column(colInfo.Output, col.ItemType.IsVector ? SchemaShape.Column.VectorKind.Vector : SchemaShape.Column.VectorKind.Scalar, NumberType.U4, true, new SchemaShape(metadata.ToArray()));
Copy link
Contributor

@Zruty0 Zruty0 Sep 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.ToArray() [](start = 226, length = 10)

no need #Closed

: this(env, new Arguments() {
Column = new[] { new Column() { Source = source ?? name, Name = name } },
HashBits = hashBits, InvertHash = invertHash }, input)
internal ColumnType GetOutputType(ISchema inputSchema, ColumnInfo column)
Copy link
Contributor

@Zruty0 Zruty0 Sep 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

internal [](start = 8, length = 8)

why internal? #Closed

@@ -332,7 +332,7 @@ private string GetSplitColumn(IChannel ch, IDataView input, ref IDataView output
var hashargs = new HashTransform.Arguments();
hashargs.Column = new[] { new HashTransform.Column { Source = origStratCol, Name = stratificationColumn } };
hashargs.HashBits = 30;
output = new HashTransform(Host, hashargs, input);
output = HashTransform.Create(Host, hashargs, input);
Copy link
Contributor

@Zruty0 Zruty0 Sep 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Create [](start = 43, length = 6)

Can we try to limit the use of Create to dependency injection? You should just do new HashTransform(Host, origStratCol, stratificationColumn, hashBits: 30).Transform(input) #Closed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of lines 332-335


In reply to: 218911189 [](ancestors = 218911189)

Copy link
Contributor

@Zruty0 Zruty0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

}
internal const string ExpectedColumnType = "Expected Text, Key, Single or Double item type";

public HashConverter(IHostEnvironment env, string name, string source = null,
Copy link
Contributor

@Zruty0 Zruty0 Sep 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

string name, string source = null [](start = 51, length = 33)

lets follow the pattern and have string inputColumn, string outputColumn = null #Resolved


public HashConverter(IHostEnvironment env, string name, string source = null,
int hashBits = Defaults.HashBits, int invertHash = Defaults.InvertHash)
: this(env, new HashConverterTransformer.ColumnInfo(name, source ?? name, hashBits: hashBits, invertHash: invertHash))
Copy link
Contributor

@Zruty0 Zruty0 Sep 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

name, source ?? name [](start = 64, length = 20)

you mixed them up I think #Resolved

@@ -14,6 +14,7 @@
using Microsoft.ML.Runtime.Model;
using Microsoft.ML.Runtime.TextAnalytics;
using Xunit;
using Microsoft.ML.Transforms;
Copy link
Contributor

@Zruty0 Zruty0 Sep 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

using [](start = 0, length = 5)

(placeholder for Tom's comment) #Resolved

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey Ivan, Pete asked me to tell you to sort your usings. :)


In reply to: 218936325 [](ancestors = 218936325)


var dataView = ComponentCreation.CreateDataView(Env, data);
var pipe = new HashConverter(Env, new[] {
new HashConverterTransformer.ColumnInfo("A", "HashA", invertHash:1, hashBits:10),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HashConverterTransformer [](start = 20, length = 24)

hmm, now that I think of it, we should probably have ColumnInfo be part of the estimator as well, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want this to be part of this PR or as separate PR?


In reply to: 218936515 [](ancestors = 218936515)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this PR, if not too cumbersome


In reply to: 218941758 [](ancestors = 218941758,218936515)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, we have this pattern in all estimators. Wouldn't it be better to move stuff in one PR, rather than in separate? Or you want me to update all estimators as well?


In reply to: 218944749 [](ancestors = 218944749,218941758,218936515)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rather prefer to keep one style in code, and if it needed to flip everything, other than have two styles in code.


In reply to: 218945227 [](ancestors = 218945227,218944749,218941758,218936515)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair. Let's do it in a different PR for all at once, where applicable


In reply to: 218954028 [](ancestors = 218954028,218945227,218944749,218941758,218936515)

@@ -28,19 +36,8 @@ namespace Microsoft.ML.Runtime.Data
/// it hashes each slot separately.
/// It can hash either text values or key values.
/// </summary>
public sealed class HashTransform : OneToOneTransformBase, ITransformTemplate
public sealed class HashConverterTransformer : OneToOneTransformerBase
Copy link
Contributor

@TomFinley TomFinley Sep 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HashConverterTransformer [](start = 24, length = 24)

What is this hash converter?

I do not like this usage of "converter." You have "hash converter transformer". But to transform is also a form of conversion. So it's like you are saying "this is a hash transformer transformer" which makes no sense. What is wrong with "Hash Estimator" and "Hash Transformer"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. namespace Microsoft.ML.Transforms
  2. Estimators should be named the same as the Legacy ones as appropriate, with no Estimator suffix. In this case, it should be HashConverter.
  3. Transformers should be named the same as estimators, with a Transformer suffix: HashConverterTransformer.
    I'm bad with words, so I don't have strong opinion on this.

In reply to: 218939448 [](ancestors = 218939448)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also I not is smithword.
I thought that we should just adopt the naming of the legacy components, assuming that some thought has been put into naming these. Legacy has HashConverter as one step (it uses HashJoin, but I think it's irrelevant).


In reply to: 218942124 [](ancestors = 218942124,218939448)

Copy link
Contributor

@TomFinley TomFinley Sep 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The estimators and (sometimes) transformers are things directly instantiated by users. To be instantiated by users they must be found. In order to be reasonably found the best way I know of is that they have a uniform consistent naming policy, and a common policy that people understand in .NET that I've seen many, mnay frameworks adopt is that an implementor of interface IFoo or some base class FooBase. will have Foo somewhere in the name. This also jibes with how we've written our names everywhere else. E.g.: IDataTransform (nearly?) all transforms have Transform in their name.

So for that reason, the advice in this policy that "with no Estimator suffix" is something that ought to be reviewed.


In reply to: 218945320 [](ancestors = 218945320,218942124,218939448)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with what you said. Still, I think there is value in preserving 'verb-like' nouns like 'ColumnConcatenator' and 'TextTokenizer'. Having something like ColumnConcatenatorEstimator sounds bad to me. Do you have a better suggestion for the naming policy?


In reply to: 218948931 [](ancestors = 218948931,218945320,218942124,218939448)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. Any implementor of IEstimator should be suffixed with Esimator, and implementor of ITransformer ought to be suffixed with Transformer. This will make them easy to find.

"Sounds bad to me" is a little hard for me to address. Could I possibly convince you to be more specific? Is the problem that it's too long, or you don't like the two -or suffix words in a row? If too long, maybe shorten it. If it's the two -or suffix words in a row, then I'd suggest maybe making the first one not that. E.g., we can change ColumnConcatenatorEstimator to ColumnConcatenatingEstimator or ColumnConcatenationEstimator or something.


In reply to: 218975244 [](ancestors = 218975244,218948931,218945320,218942124,218939448)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like that there are two -or suffixes, and that the name is long.
ColumnConcatenationEstimator sounds better, but still long.

I don't like the idea that every estimator should be suffixed with Estimator. What about trainers? Do we also suffix them with Estimator? But they are also trainers, so maybe TrainerEstimator?
StochasticDualCoordinateAscentTrainerEstimator, for example? we also need to throw in something about the task, so StochasticDualCoordinateAscentMulticlassTrainerEstimator.


In reply to: 219182371 [](ancestors = 219182371,218975244,218948931,218945320,218942124,218939448)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should have a naming convention to distinguish among the different classes of objects.


In reply to: 219232649 [](ancestors = 219232649,219182371,218975244,218948931,218945320,218942124,218939448)

@@ -1258,5 +1294,63 @@ public override void Process()
}
}
}

public sealed class HashConverter : IEstimator<HashConverterTransformer>
Copy link
Contributor

@TomFinley TomFinley Sep 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HashConverter [](start = 24, length = 13)

I would prefer if we could maintain a policy that estimators tend to have "estimator" at the end of their names, and transformers tend to have "transformer" at the end of their names. Otherwise picking out what is which will be utterly bewildering for users... #Resolved

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Plus of course, again, I think "Hash converter" is not the best name ever.


In reply to: 218940371 [](ancestors = 218940371)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I got sign off from Pete, with this Converter, I guess I can sneakily rename them back to estimators to get sign off from you :)


In reply to: 218940589 [](ancestors = 218940589,218940371)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could see that!


In reply to: 218943451 [](ancestors = 218943451,218940589,218940371)

@Ivanidzo4ka Ivanidzo4ka changed the title WIP hash estimator hash estimator Sep 19, 2018
@Ivanidzo4ka Ivanidzo4ka changed the title hash estimator Hash estimator Sep 19, 2018
@Ivanidzo4ka Ivanidzo4ka added the API Issues pertaining the friendly API label Sep 19, 2018
Copy link
Contributor

@TomFinley TomFinley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@Ivanidzo4ka Ivanidzo4ka merged commit 95cedf5 into dotnet:master Sep 20, 2018
@Zruty0 Zruty0 mentioned this pull request Sep 24, 2018
@ghost ghost locked as resolved and limited conversation to collaborators Mar 28, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
API Issues pertaining the friendly API
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants