-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Categorical estimator #899
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Categorical estimator #899
Conversation
#> | ||
/// </summary> | ||
/// <param name="input">The input column.</param> | ||
/// <param name="outputKind"></param> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
put description #Closed
/// We are inputting a key type with values, and in that case the dictionary is considered to be built over the | ||
/// values of the keys, rather than the keys themselves. This also mean the key-values learned for the output | ||
/// will be a subset of the key-values in the input. | ||
<# } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
replace term summary to categorical #Closed
/// <param name="maxItems">The maximum number of items.</param> | ||
|
||
/// <returns>The key-valued column.</returns> | ||
public static <#=isVarVector?"Var":""#>Vector<float> OneHotEncoding<#=inputIsKey?"<T>":""#>(this <#=omitInputArity?"":arityName+"<"#><#=inputIsKey?"Key<T, ":""#><#=typeName#>><#=inputIsKey&&!isScalar?">":""#> input, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isVarVector [](start = 25, length = 11)
I need to handle situation in which VarVector becomes Vector if user choose bagging output.
Only possible solution coming to my mind so far is to have OneHotEncodingBagged and OneHotEncoding methods and expose enumeration without bagging option #Closed
new CategoricalTransform.Column { Name = "CatFeatures", Source = "CatFeatures" } | ||
} | ||
}, loader); | ||
IDataView trans = CategoricalTransform.Create(env, loader, new CategoricalEstimator.ColumnInfo("CatFeatures", "CatFeaures")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CatFeaures [](start = 127, length = 10)
CatFeatures #Closed
} | ||
|
||
|
||
[Fact] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lines #Closed
@@ -14,6 +14,8 @@ | |||
using Microsoft.ML.Runtime.EntryPoints; | |||
using Microsoft.ML.Runtime.Internal.Utilities; | |||
using Microsoft.ML.Runtime.Internal.Internallearn; | |||
using Microsoft.ML.Core.Data; | |||
using Microsoft.ML.Data.StaticPipe.Runtime; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
order #Closed
col.SetTerms(column.Terms); | ||
columns.Add(col); | ||
} | ||
return Create(env, input, columns.ToArray()) as IDataTransform; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as IDataTransform [](start = 56, length = 18)
I don't like it, but I don't see any other options. #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
public class ColumnInfo : TermTransform.ColumnInfo | ||
{ | ||
public readonly CategoricalTransform.OutputKind OutputKind; | ||
public ColumnInfo(string input, string output, CategoricalTransform.OutputKind outputKind = Defaults.OutKind, int maxNumTerms = TermEstimator.Defaults.MaxNumTerms, TermTransform.SortOrder sort = TermEstimator.Defaults.Sort, string[] term = null) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CategoricalTransform.OutputKind outputKind = Defaults.OutKin [](start = 59, length = 60)
shorter lines #Closed
} | ||
var col = new KeyToVectorTransform.ColumnInfo(column.Output, column.Output, bag); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
KeyToVectorTransform.ColumnInfo(column.Output, column.Output, bag) [](start = 30, length = 66)
can be just tuple #Closed
{ | ||
// I am not certain I see a good way to cover the distinct types beyond complete enumeration. | ||
// Raw generics would allow illegal possible inputs, e.g., Scalar<Bitmap>. So, this is a partial | ||
// class, and all the public facing extension methods for each possible type are in a T4 generated result. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove it #Closed
} | ||
} | ||
|
||
private interface IOneHotCol |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OneHot [](start = 27, length = 6)
Categorical probably is better internal name #Closed
@@ -13,6 +13,7 @@ | |||
using Microsoft.ML.Runtime.EntryPoints; | |||
using Microsoft.ML.Runtime.Internal.Utilities; | |||
using Microsoft.ML.Runtime.Internal.Internallearn; | |||
using System.Collections.Generic; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
order #Closed
args); | ||
} | ||
} | ||
|
||
public static IDataTransform CreateTransformCore(CategoricalTransform.OutputKind argsOutputKind, OneToOneColumn[] columns, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
public [](start = 8, length = 6)
Don't have to be public (and static, but I doubt its worth effort to refactor it if we plan to replace this code with estimators soon) #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well we still have this CategoricalHash transform which put HashTransform instead of TermTransform, so this is why I moved it into CatHash. I would like to kill it, but we gonna do it anyway, but first we need to convert HashTransform to estimators.
In reply to: 217236901 [](ancestors = 217236901,217229386)
using System.Linq; | ||
|
||
namespace Microsoft.ML.Runtime.Data | ||
{ | ||
public sealed class TermEstimator : IEstimator<TermTransform> | ||
{ | ||
public static class Defaults |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Defaults [](start = 28, length = 8)
Is this a right way to do? #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bag = true; | ||
break; | ||
} | ||
var col = new KeyToVectorTransform.Column(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Column [](start = 55, length = 6)
object initializer? #ByDesign
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's gonna be dead after hash transform become estimator.
In reply to: 217236161 [](ancestors = 217236161)
{ | ||
public static partial class CategoricalStaticExtensions | ||
{ | ||
// Do not edit this file directly. Rather, it is generated out of TermStaticExtensions.tt. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Term [](start = 74, length = 4)
typo #Closed
return Create(env, input, columns.ToArray()) as IDataTransform; | ||
} | ||
|
||
public static IDataView Create(IHostEnvironment env, IDataView input, params CategoricalEstimator.ColumnInfo[] columns) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Create [](start = 32, length = 6)
that doesn't serve any purpose, does it? Just make the callers do line 149, it's cleaner #Closed
|
||
public SchemaShape GetOutputSchema(SchemaShape inputSchema) | ||
{ | ||
return _estimatorChain.GetOutputSchema(inputSchema); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return [](start = 12, length = 6)
=>
#Closed
@@ -301,4 +307,109 @@ public static CommonOutputs.TransformOutput KeyToText(IHostEnvironment env, KeyT | |||
return new CommonOutputs.TransformOutput { Model = new TransformModel(env, xf, input.Data), OutputData = xf }; | |||
} | |||
} | |||
|
|||
public enum OneHotOutputKind : byte |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OneHotOutputKind [](start = 16, length = 16)
why is it at the top level? #Closed
new CategoricalTransform.Column { Name = "CatFeatures", Source = "CatFeatures" } | ||
} | ||
}, loader); | ||
IDataView trans = CategoricalTransform.Create(env, loader, new CategoricalEstimator.ColumnInfo("CatFeatures", "CatFeaures")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CategoricalTransform [](start = 34, length = 20)
can we have a convenience ctor here that takes params string[] columns or something? #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🕐
it is NOT for public facing API. IDataTransform is to be deprecated. It is not too much work to write line 129 #Resolved Refers to: src/Microsoft.ML.Transforms/CategoricalTransform.cs:119 in 114fec9. [](commit_id = 114fec9, deletion_comment = False) |
|
||
/// <summary> | ||
/// The categorical transform operates on text columns. During construction, it passes through the data to build a dictionary of categories. | ||
/// It does not do any parsing to the text; for each row it sees, the whole text appearing in the input column is defined as a category. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all of this is implementation, and should not be present #Resolved
/// The output of the categorical transform is an indicator vector. Each slot in this vector corresponds to a category in the dictionary | ||
/// (thus, its length is the size of the built dictionary). In each row, it contains a 1 in the slot corresponding to the category in that row, | ||
/// and 0 in the rest of the slots. | ||
/// </summary> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
summary [](start = 14, length = 7)
add params description #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// <param name="outputKind">Specify output type of indicator array: Multiarray, array or binary encoded data.</param> | ||
/// <param name="order">How Id for each value would be assigined: by occurence or by value.</param> | ||
/// <param name="maxItems">Maximum number of ids to keep during data scanning.</param> | ||
public static Vector<float> OneHotEncoding(this Scalar<string> input, OneHotOutputKind outputKind = DefOut, KeyValueOrder order = DefSort, int maxItems = DefMax) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OneHotEncoding [](start = 36, length = 14)
It strikes me that the most useful thing that we could get out of this via an onFit
delegate would be the mapping, so people can know what the features actually mean. #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be the same type that is from the OnTerm
transform... then once that is properly implemented both with benefit.
In reply to: 217569940 [](ancestors = 217569940)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added onFit delegate, let me know is this enough for now.
In reply to: 217570555 [](ancestors = 217570555,217569940)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// Converts the categorical value into an indicator array by building a dictionary of categories based on the data and using the id in the dictionary as the index in the array | ||
/// </summary> | ||
/// <param name="input">Incoming data.</param> | ||
/// <param name="outputKind">Specify output type of indicator array: Multiarray, array or binary encoded data.</param> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Multiarray, array [](start = 77, length = 17)
In the scalar case there is no distinction between these two, so presenting this as an option is at least confusing. Is it worthwhile to clarify this? #Closed
/// </summary> | ||
/// <param name="input">Incoming data.</param> | ||
/// <param name="outputKind">Specify output type of indicator array: Multiarray, array or binary encoded data.</param> | ||
/// <param name="order">How Id for each value would be assigined: by occurence or by value.</param> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
occurence [](start = 77, length = 9)
occurrence, not occurence? #Closed
@@ -65,4 +65,8 @@ | |||
</EmbeddedResource> | |||
</ItemGroup> | |||
|
|||
<ItemGroup> | |||
<Service Include="{508349b6-6b84-4df5-91f0-309beebad82d}" /> | |||
</ItemGroup> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm. What's this for? #Closed
CategoricalHashTransform.Arguments catHashArgs = null) | ||
} | ||
|
||
public sealed class CategoricalEstimator : IEstimator<ITransformer> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IEstimator [](start = 47, length = 24)
Should not the categorical estimator have a more specific type? We have something that we want to publish, do we not? We want something that is (ultimately) capable of exposing whatever ToKey
mapping was learnt in the first step. This strikes me as the only real reason why we would actually have the complication of an implemented type in the first place, am I wrong? #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Converts Categorical transform to Estimator chain