-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Hash estimator #944
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Hash estimator #944
Conversation
|
||
[assembly: LoadableClass(typeof(IRowMapper), typeof(HashTransform), null, typeof(SignatureLoadRowMapper), | ||
"Hash Transform", HashTransform.LoaderSignature)] | ||
|
||
namespace Microsoft.ML.Runtime.Data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Microsoft.ML.Runtime.Data [](start = 10, length = 25)
let's get into habit of setting the correct namespace and naming.
namespace Microsoft.ML.Transforms
- Estimators should be named the same as the
Legacy
ones as appropriate, with no Estimator suffix. In this case, it should beHashConverter
. - Transformers should be named the same as estimators, with a
Transformer
suffix:HashConverterTransformer
. #Closed
ctx.Writer.WriteBoolByte(Ordered); | ||
} | ||
} | ||
|
||
private static string TestType(ColumnType type) | ||
public static string TestType(ColumnType type) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
public [](start = 8, length = 6)
why public? If it's public, it has to be documented #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, I see now, you are using the same code in the estimator. I think you should turn it into public static bool IsColumnTypeValid(ColumnType type)
and internal const string ExpectedColumnType
to feed to the exceptions
In reply to: 218906546 [](ancestors = 218906546)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this signature is easier to document, and actually helpful to the consumer. Might as well also move it to the estimator.
In reply to: 218910693 [](ancestors = 218910693,218906546)
TextModelHelper.LoadAll(Host, ctx, Infos.Length, out _keyValues, out _kvTypes); | ||
SetMetadata(); | ||
var type = inputSchema.GetColumnType(srcCol); | ||
string reason = TestType(type); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TestType [](start = 28, length = 8)
In this case, I think we should make it just part of CheckInputColumn
and throw ExceptSchemaMismatch
. #Closed
_exes[iinfo].Save(ctx); | ||
|
||
TextModelHelper.SaveAll(Host, ctx, Infos.Length, _keyValues); | ||
Contracts.CheckValue(columns, nameof(columns)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CheckValue [](start = 22, length = 10)
CheckNonEmpty? #Closed
|
||
// <prefix handled in static Create method> | ||
// <base> | ||
// Exes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exes [](start = 15, length = 4)
rename #Closed
public HashTransform(IHostEnvironment env, Arguments args, IDataView input) | ||
: base(Contracts.CheckRef(env, nameof(env)), RegistrationName, env.CheckRef(args, nameof(args)).Column, | ||
input, TestType) | ||
public HashTransform(IHostEnvironment env, IDataView input, ColumnInfo[] columns) : |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
HashTransform [](start = 15, length = 13)
can I have a convenience ctor with only one column pair? #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You have it in estimator, why you want it in transformer?
In reply to: 218907955 [](ancestors = 218907955)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I want users to be able to create transformer directly with no input
.
Can we have two ctors? internal
one that you have now, and also public HashTransform(env, columns)
that checks that invertHash
is not used in any of the columns?
I do NOT want users to be able to train transformers directly without calling Fit
.
In reply to: 218909081 [](ancestors = 218909081,218907955)
@@ -1258,5 +1284,65 @@ public override void Process() | |||
} | |||
} | |||
} | |||
|
|||
public sealed class HashEstimator : IEstimator<HashTransform> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
HashEstimator [](start = 24, length = 13)
summary comments for the estimator and its ctors #Resolved
foreach (var col in _columns) | ||
{ | ||
if (col.InvertHash < -1) | ||
throw _host.ExceptParam(nameof(columns), "Value too small, must be -1 or larger"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ExceptParam [](start = 32, length = 11)
why is it not part of ctor for ColumnInfo
? #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had same thoughts, mainly because we don't have IExceptionContext in ColumnInfo, but yeah
In reply to: 218908992 [](ancestors = 218908992)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, I would be ok with losing context info here for the sake of localizing the validation
In reply to: 218909546 [](ancestors = 218909546,218908992)
metadata.Add(slotMeta); | ||
if (colInfo.InvertHash != 0) | ||
metadata.Add(new SchemaShape.Column(MetadataUtils.Kinds.KeyValues, SchemaShape.Column.VectorKind.Vector, TextType.Instance, false)); | ||
result[colInfo.Output] = new SchemaShape.Column(colInfo.Output, col.ItemType.IsVector ? SchemaShape.Column.VectorKind.Vector : SchemaShape.Column.VectorKind.Scalar, NumberType.U4, true, new SchemaShape(metadata.ToArray())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.ToArray() [](start = 226, length = 10)
no need #Closed
: this(env, new Arguments() { | ||
Column = new[] { new Column() { Source = source ?? name, Name = name } }, | ||
HashBits = hashBits, InvertHash = invertHash }, input) | ||
internal ColumnType GetOutputType(ISchema inputSchema, ColumnInfo column) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
internal [](start = 8, length = 8)
why internal? #Closed
@@ -332,7 +332,7 @@ private string GetSplitColumn(IChannel ch, IDataView input, ref IDataView output | |||
var hashargs = new HashTransform.Arguments(); | |||
hashargs.Column = new[] { new HashTransform.Column { Source = origStratCol, Name = stratificationColumn } }; | |||
hashargs.HashBits = 30; | |||
output = new HashTransform(Host, hashargs, input); | |||
output = HashTransform.Create(Host, hashargs, input); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Create [](start = 43, length = 6)
Can we try to limit the use of Create
to dependency injection? You should just do new HashTransform(Host, origStratCol, stratificationColumn, hashBits: 30).Transform(input)
#Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
} | ||
internal const string ExpectedColumnType = "Expected Text, Key, Single or Double item type"; | ||
|
||
public HashConverter(IHostEnvironment env, string name, string source = null, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
string name, string source = null [](start = 51, length = 33)
lets follow the pattern and have string inputColumn, string outputColumn = null
#Resolved
|
||
public HashConverter(IHostEnvironment env, string name, string source = null, | ||
int hashBits = Defaults.HashBits, int invertHash = Defaults.InvertHash) | ||
: this(env, new HashConverterTransformer.ColumnInfo(name, source ?? name, hashBits: hashBits, invertHash: invertHash)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
name, source ?? name [](start = 64, length = 20)
you mixed them up I think #Resolved
@@ -14,6 +14,7 @@ | |||
using Microsoft.ML.Runtime.Model; | |||
using Microsoft.ML.Runtime.TextAnalytics; | |||
using Xunit; | |||
using Microsoft.ML.Transforms; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
using [](start = 0, length = 5)
(placeholder for Tom's comment) #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey Ivan, Pete asked me to tell you to sort your usings. :)
In reply to: 218936325 [](ancestors = 218936325)
|
||
var dataView = ComponentCreation.CreateDataView(Env, data); | ||
var pipe = new HashConverter(Env, new[] { | ||
new HashConverterTransformer.ColumnInfo("A", "HashA", invertHash:1, hashBits:10), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
HashConverterTransformer [](start = 20, length = 24)
hmm, now that I think of it, we should probably have ColumnInfo
be part of the estimator as well, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you want this to be part of this PR or as separate PR?
In reply to: 218936515 [](ancestors = 218936515)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean, we have this pattern in all estimators. Wouldn't it be better to move stuff in one PR, rather than in separate? Or you want me to update all estimators as well?
In reply to: 218944749 [](ancestors = 218944749,218941758,218936515)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would rather prefer to keep one style in code, and if it needed to flip everything, other than have two styles in code.
In reply to: 218945227 [](ancestors = 218945227,218944749,218941758,218936515)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair. Let's do it in a different PR for all at once, where applicable
In reply to: 218954028 [](ancestors = 218954028,218945227,218944749,218941758,218936515)
@@ -28,19 +36,8 @@ namespace Microsoft.ML.Runtime.Data | |||
/// it hashes each slot separately. | |||
/// It can hash either text values or key values. | |||
/// </summary> | |||
public sealed class HashTransform : OneToOneTransformBase, ITransformTemplate | |||
public sealed class HashConverterTransformer : OneToOneTransformerBase |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
HashConverterTransformer [](start = 24, length = 24)
What is this hash converter?
I do not like this usage of "converter." You have "hash converter transformer". But to transform is also a form of conversion. So it's like you are saying "this is a hash transformer transformer" which makes no sense. What is wrong with "Hash Estimator" and "Hash Transformer"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- namespace Microsoft.ML.Transforms
- Estimators should be named the same as the Legacy ones as appropriate, with no Estimator suffix. In this case, it should be HashConverter.
- Transformers should be named the same as estimators, with a Transformer suffix: HashConverterTransformer.
I'm bad with words, so I don't have strong opinion on this.
In reply to: 218939448 [](ancestors = 218939448)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also I not is smithword.
I thought that we should just adopt the naming of the legacy components, assuming that some thought has been put into naming these. Legacy has HashConverter
as one step (it uses HashJoin, but I think it's irrelevant).
In reply to: 218942124 [](ancestors = 218942124,218939448)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The estimators and (sometimes) transformers are things directly instantiated by users. To be instantiated by users they must be found. In order to be reasonably found the best way I know of is that they have a uniform consistent naming policy, and a common policy that people understand in .NET that I've seen many, mnay frameworks adopt is that an implementor of interface IFoo
or some base class FooBase
. will have Foo
somewhere in the name. This also jibes with how we've written our names everywhere else. E.g.: IDataTransform
(nearly?) all transforms have Transform
in their name.
So for that reason, the advice in this policy that "with no Estimator suffix" is something that ought to be reviewed.
In reply to: 218945320 [](ancestors = 218945320,218942124,218939448)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with what you said. Still, I think there is value in preserving 'verb-like' nouns like 'ColumnConcatenator' and 'TextTokenizer'. Having something like ColumnConcatenatorEstimator
sounds bad to me. Do you have a better suggestion for the naming policy?
In reply to: 218948931 [](ancestors = 218948931,218945320,218942124,218939448)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. Any implementor of IEstimator
should be suffixed with Esimator
, and implementor of ITransformer
ought to be suffixed with Transformer
. This will make them easy to find.
"Sounds bad to me" is a little hard for me to address. Could I possibly convince you to be more specific? Is the problem that it's too long, or you don't like the two -or
suffix words in a row? If too long, maybe shorten it. If it's the two -or
suffix words in a row, then I'd suggest maybe making the first one not that. E.g., we can change ColumnConcatenatorEstimator
to ColumnConcatenatingEstimator
or ColumnConcatenationEstimator
or something.
In reply to: 218975244 [](ancestors = 218975244,218948931,218945320,218942124,218939448)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like that there are two -or
suffixes, and that the name is long.
ColumnConcatenationEstimator
sounds better, but still long.
I don't like the idea that every estimator should be suffixed with Estimator
. What about trainers? Do we also suffix them with Estimator
? But they are also trainers, so maybe TrainerEstimator
?
StochasticDualCoordinateAscentTrainerEstimator
, for example? we also need to throw in something about the task, so StochasticDualCoordinateAscentMulticlassTrainerEstimator
.
In reply to: 219182371 [](ancestors = 219182371,218975244,218948931,218945320,218942124,218939448)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should have a naming convention to distinguish among the different classes of objects.
In reply to: 219232649 [](ancestors = 219232649,219182371,218975244,218948931,218945320,218942124,218939448)
@@ -1258,5 +1294,63 @@ public override void Process() | |||
} | |||
} | |||
} | |||
|
|||
public sealed class HashConverter : IEstimator<HashConverterTransformer> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
HashConverter [](start = 24, length = 13)
I would prefer if we could maintain a policy that estimators tend to have "estimator" at the end of their names, and transformers tend to have "transformer" at the end of their names. Otherwise picking out what is which will be utterly bewildering for users... #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Plus of course, again, I think "Hash converter" is not the best name ever.
In reply to: 218940371 [](ancestors = 218940371)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, I got sign off from Pete, with this Converter, I guess I can sneakily rename them back to estimators to get sign off from you :)
In reply to: 218940589 [](ancestors = 218940589,218940371)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Zruty0 you can pick up this one