Skip to content

Categorical estimator #899

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 18 commits into from
Sep 18, 2018

Conversation

Ivanidzo4ka
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka commented Sep 12, 2018

Converts Categorical transform to Estimator chain

@Ivanidzo4ka Ivanidzo4ka self-assigned this Sep 12, 2018
@Ivanidzo4ka Ivanidzo4ka added the API Issues pertaining the friendly API label Sep 12, 2018
@Ivanidzo4ka Ivanidzo4ka added this to the 0918 milestone Sep 12, 2018
#>
/// </summary>
/// <param name="input">The input column.</param>
/// <param name="outputKind"></param>
Copy link
Contributor Author

@Ivanidzo4ka Ivanidzo4ka Sep 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

put description #Closed

/// We are inputting a key type with values, and in that case the dictionary is considered to be built over the
/// values of the keys, rather than the keys themselves. This also mean the key-values learned for the output
/// will be a subset of the key-values in the input.
<# }
Copy link
Contributor Author

@Ivanidzo4ka Ivanidzo4ka Sep 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

replace term summary to categorical #Closed

/// <param name="maxItems">The maximum number of items.</param>

/// <returns>The key-valued column.</returns>
public static <#=isVarVector?"Var":""#>Vector<float> OneHotEncoding<#=inputIsKey?"<T>":""#>(this <#=omitInputArity?"":arityName+"<"#><#=inputIsKey?"Key<T, ":""#><#=typeName#>><#=inputIsKey&&!isScalar?">":""#> input,
Copy link
Contributor Author

@Ivanidzo4ka Ivanidzo4ka Sep 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isVarVector [](start = 25, length = 11)

I need to handle situation in which VarVector becomes Vector if user choose bagging output.
Only possible solution coming to my mind so far is to have OneHotEncodingBagged and OneHotEncoding methods and expose enumeration without bagging option #Closed

new CategoricalTransform.Column { Name = "CatFeatures", Source = "CatFeatures" }
}
}, loader);
IDataView trans = CategoricalTransform.Create(env, loader, new CategoricalEstimator.ColumnInfo("CatFeatures", "CatFeaures"));
Copy link
Contributor Author

@Ivanidzo4ka Ivanidzo4ka Sep 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CatFeaures [](start = 127, length = 10)

CatFeatures #Closed

}


[Fact]
Copy link
Contributor Author

@Ivanidzo4ka Ivanidzo4ka Sep 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lines #Closed

@@ -14,6 +14,8 @@
using Microsoft.ML.Runtime.EntryPoints;
using Microsoft.ML.Runtime.Internal.Utilities;
using Microsoft.ML.Runtime.Internal.Internallearn;
using Microsoft.ML.Core.Data;
using Microsoft.ML.Data.StaticPipe.Runtime;
Copy link
Contributor Author

@Ivanidzo4ka Ivanidzo4ka Sep 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

order #Closed

@Ivanidzo4ka
Copy link
Contributor Author

Ivanidzo4ka commented Sep 13, 2018

using Float = System.Single;

can be removed #Closed


Refers to: src/Microsoft.ML.Transforms/CategoricalTransform.cs:5 in ac91db8. [](commit_id = ac91db8, deletion_comment = False)

col.SetTerms(column.Terms);
columns.Add(col);
}
return Create(env, input, columns.ToArray()) as IDataTransform;
Copy link
Contributor Author

@Ivanidzo4ka Ivanidzo4ka Sep 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as IDataTransform [](start = 56, length = 18)

I don't like it, but I don't see any other options. #Closed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea we have to do this for now


In reply to: 217228738 [](ancestors = 217228738)

public class ColumnInfo : TermTransform.ColumnInfo
{
public readonly CategoricalTransform.OutputKind OutputKind;
public ColumnInfo(string input, string output, CategoricalTransform.OutputKind outputKind = Defaults.OutKind, int maxNumTerms = TermEstimator.Defaults.MaxNumTerms, TermTransform.SortOrder sort = TermEstimator.Defaults.Sort, string[] term = null)
Copy link
Contributor Author

@Ivanidzo4ka Ivanidzo4ka Sep 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CategoricalTransform.OutputKind outputKind = Defaults.OutKin [](start = 59, length = 60)

shorter lines #Closed

}
var col = new KeyToVectorTransform.ColumnInfo(column.Output, column.Output, bag);
Copy link
Contributor Author

@Ivanidzo4ka Ivanidzo4ka Sep 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

KeyToVectorTransform.ColumnInfo(column.Output, column.Output, bag) [](start = 30, length = 66)

can be just tuple #Closed

{
// I am not certain I see a good way to cover the distinct types beyond complete enumeration.
// Raw generics would allow illegal possible inputs, e.g., Scalar<Bitmap>. So, this is a partial
// class, and all the public facing extension methods for each possible type are in a T4 generated result.
Copy link
Contributor Author

@Ivanidzo4ka Ivanidzo4ka Sep 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove it #Closed

}
}

private interface IOneHotCol
Copy link
Contributor Author

@Ivanidzo4ka Ivanidzo4ka Sep 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OneHot [](start = 27, length = 6)

Categorical probably is better internal name #Closed

@@ -13,6 +13,7 @@
using Microsoft.ML.Runtime.EntryPoints;
using Microsoft.ML.Runtime.Internal.Utilities;
using Microsoft.ML.Runtime.Internal.Internallearn;
using System.Collections.Generic;
Copy link
Contributor Author

@Ivanidzo4ka Ivanidzo4ka Sep 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

order #Closed

args);
}
}

public static IDataTransform CreateTransformCore(CategoricalTransform.OutputKind argsOutputKind, OneToOneColumn[] columns,
Copy link
Contributor Author

@Ivanidzo4ka Ivanidzo4ka Sep 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

public [](start = 8, length = 6)

Don't have to be public (and static, but I doubt its worth effort to refactor it if we plan to replace this code with estimators soon) #Resolved

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is it needed now?


In reply to: 217229386 [](ancestors = 217229386)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well we still have this CategoricalHash transform which put HashTransform instead of TermTransform, so this is why I moved it into CatHash. I would like to kill it, but we gonna do it anyway, but first we need to convert HashTransform to estimators.


In reply to: 217236901 [](ancestors = 217236901,217229386)

using System.Linq;

namespace Microsoft.ML.Runtime.Data
{
public sealed class TermEstimator : IEstimator<TermTransform>
{
public static class Defaults
Copy link
Contributor Author

@Ivanidzo4ka Ivanidzo4ka Sep 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Defaults [](start = 28, length = 8)

Is this a right way to do? #Closed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good to me


In reply to: 217229486 [](ancestors = 217229486)

bag = true;
break;
}
var col = new KeyToVectorTransform.Column();
Copy link
Contributor

@Zruty0 Zruty0 Sep 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Column [](start = 55, length = 6)

object initializer? #ByDesign

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's gonna be dead after hash transform become estimator.


In reply to: 217236161 [](ancestors = 217236161)

{
public static partial class CategoricalStaticExtensions
{
// Do not edit this file directly. Rather, it is generated out of TermStaticExtensions.tt.
Copy link
Contributor

@Zruty0 Zruty0 Sep 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Term [](start = 74, length = 4)

typo #Closed

return Create(env, input, columns.ToArray()) as IDataTransform;
}

public static IDataView Create(IHostEnvironment env, IDataView input, params CategoricalEstimator.ColumnInfo[] columns)
Copy link
Contributor

@Zruty0 Zruty0 Sep 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Create [](start = 32, length = 6)

that doesn't serve any purpose, does it? Just make the callers do line 149, it's cleaner #Closed


public SchemaShape GetOutputSchema(SchemaShape inputSchema)
{
return _estimatorChain.GetOutputSchema(inputSchema);
Copy link
Contributor

@Zruty0 Zruty0 Sep 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return [](start = 12, length = 6)

=> #Closed

@@ -301,4 +307,109 @@ public static CommonOutputs.TransformOutput KeyToText(IHostEnvironment env, KeyT
return new CommonOutputs.TransformOutput { Model = new TransformModel(env, xf, input.Data), OutputData = xf };
}
}

public enum OneHotOutputKind : byte
Copy link
Contributor

@Zruty0 Zruty0 Sep 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OneHotOutputKind [](start = 16, length = 16)

why is it at the top level? #Closed

new CategoricalTransform.Column { Name = "CatFeatures", Source = "CatFeatures" }
}
}, loader);
IDataView trans = CategoricalTransform.Create(env, loader, new CategoricalEstimator.ColumnInfo("CatFeatures", "CatFeaures"));
Copy link
Contributor

@Zruty0 Zruty0 Sep 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CategoricalTransform [](start = 34, length = 20)

can we have a convenience ctor here that takes params string[] columns or something? #Closed

Copy link
Contributor

@Zruty0 Zruty0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🕐

@Ivanidzo4ka Ivanidzo4ka changed the title WIP Categorical estimator Categorical estimator Sep 13, 2018
@Zruty0
Copy link
Contributor

Zruty0 commented Sep 13, 2018

    /// A helper method to create <see cref="CategoricalTransform"/> for public facing API.

it is NOT for public facing API. IDataTransform is to be deprecated. It is not too much work to write line 129 #Resolved


Refers to: src/Microsoft.ML.Transforms/CategoricalTransform.cs:119 in 114fec9. [](commit_id = 114fec9, deletion_comment = False)


/// <summary>
/// The categorical transform operates on text columns. During construction, it passes through the data to build a dictionary of categories.
/// It does not do any parsing to the text; for each row it sees, the whole text appearing in the input column is defined as a category.
Copy link
Contributor

@Zruty0 Zruty0 Sep 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all of this is implementation, and should not be present #Resolved

/// The output of the categorical transform is an indicator vector. Each slot in this vector corresponds to a category in the dictionary
/// (thus, its length is the size of the built dictionary). In each row, it contains a 1 in the slot corresponding to the category in that row,
/// and 0 in the rest of the slots.
/// </summary>
Copy link
Contributor

@Zruty0 Zruty0 Sep 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

summary [](start = 14, length = 7)

add params description #Resolved

Copy link
Contributor

@Zruty0 Zruty0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

/// <param name="outputKind">Specify output type of indicator array: Multiarray, array or binary encoded data.</param>
/// <param name="order">How Id for each value would be assigined: by occurence or by value.</param>
/// <param name="maxItems">Maximum number of ids to keep during data scanning.</param>
public static Vector<float> OneHotEncoding(this Scalar<string> input, OneHotOutputKind outputKind = DefOut, KeyValueOrder order = DefSort, int maxItems = DefMax)
Copy link
Contributor

@TomFinley TomFinley Sep 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OneHotEncoding [](start = 36, length = 14)

It strikes me that the most useful thing that we could get out of this via an onFit delegate would be the mapping, so people can know what the features actually mean. #Closed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be the same type that is from the OnTerm transform... then once that is properly implemented both with benefit.


In reply to: 217569940 [](ancestors = 217569940)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added onFit delegate, let me know is this enough for now.


In reply to: 217570555 [](ancestors = 217570555,217569940)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good thanks Ivan.


In reply to: 218176803 [](ancestors = 218176803,217570555,217569940)

/// Converts the categorical value into an indicator array by building a dictionary of categories based on the data and using the id in the dictionary as the index in the array
/// </summary>
/// <param name="input">Incoming data.</param>
/// <param name="outputKind">Specify output type of indicator array: Multiarray, array or binary encoded data.</param>
Copy link
Contributor

@TomFinley TomFinley Sep 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Multiarray, array [](start = 77, length = 17)

In the scalar case there is no distinction between these two, so presenting this as an option is at least confusing. Is it worthwhile to clarify this? #Closed

/// </summary>
/// <param name="input">Incoming data.</param>
/// <param name="outputKind">Specify output type of indicator array: Multiarray, array or binary encoded data.</param>
/// <param name="order">How Id for each value would be assigined: by occurence or by value.</param>
Copy link
Contributor

@TomFinley TomFinley Sep 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

occurence [](start = 77, length = 9)

occurrence, not occurence? #Closed

@@ -65,4 +65,8 @@
</EmbeddedResource>
</ItemGroup>

<ItemGroup>
<Service Include="{508349b6-6b84-4df5-91f0-309beebad82d}" />
</ItemGroup>
Copy link
Contributor

@TomFinley TomFinley Sep 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm. What's this for? #Closed

CategoricalHashTransform.Arguments catHashArgs = null)
}

public sealed class CategoricalEstimator : IEstimator<ITransformer>
Copy link
Contributor

@TomFinley TomFinley Sep 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IEstimator [](start = 47, length = 24)

Should not the categorical estimator have a more specific type? We have something that we want to publish, do we not? We want something that is (ultimately) capable of exposing whatever ToKey mapping was learnt in the first step. This strikes me as the only real reason why we would actually have the complication of an implemented type in the first place, am I wrong? #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this look good for you?


In reply to: 217570810 [](ancestors = 217570810)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes thanks.


In reply to: 218176602 [](ancestors = 218176602,217570810)

Copy link
Contributor

@TomFinley TomFinley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@Ivanidzo4ka Ivanidzo4ka merged commit 5e2ed11 into dotnet:master Sep 18, 2018
@ghost ghost locked as resolved and limited conversation to collaborators Mar 29, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
API Issues pertaining the friendly API
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants