Skip to content

No loader on APIs for ValueToKey/OneHotEncoding #2245

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Jan 27, 2019
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
10 changes: 5 additions & 5 deletions src/Microsoft.ML.Data/Transforms/ColumnSelecting.cs
Original file line number Diff line number Diff line change
@@ -199,13 +199,13 @@ public ColumnSelectingTransformer(IHostEnvironment env, string[] keepColumns, st
_host.CheckValueOrNull(keepColumns);
_host.CheckValueOrNull(dropColumns);

bool keepValid = keepColumns != null && keepColumns.Count() > 0;
bool dropValid = dropColumns != null && dropColumns.Count() > 0;
bool keepValid = Utils.Size(keepColumns) > 0;
bool dropValid = Utils.Size(dropColumns) > 0;

// Check that both are not valid
_host.Check(!(keepValid && dropValid), "Both keepColumns and dropColumns are set, only one can be specified.");
_host.Check(!(keepValid && dropValid), "Both " + nameof(keepColumns) + " and " + nameof(dropColumns) + " are set. Exactly one can be specified.");
// Check that both are invalid
_host.Check(!(!keepValid && !dropValid), "Neither keepColumns or dropColumns is set, one must be specified.");
_host.Check(!(!keepValid && !dropValid), "Neither " + nameof(keepColumns) + " and " + nameof(dropColumns) + " is set. Exactly one must be specified.");

_selectedColumns = (keepValid) ? keepColumns : dropColumns;
KeepColumns = keepValid;
@@ -558,7 +558,7 @@ private static int[] BuildOutputToInputMap(IEnumerable<string> selectedColumns,
// given an input of ABC and dropping column B will result in AC.
// In drop mode, we drop all columns with the specified names and keep all the rest,
// ignoring the keepHidden argument.
for(int colIdx = 0; colIdx < inputSchema.Count; colIdx++)
for (int colIdx = 0; colIdx < inputSchema.Count; colIdx++)
{
if (selectedColumns.Contains(inputSchema[colIdx].Name))
continue;
15 changes: 6 additions & 9 deletions src/Microsoft.ML.Data/Transforms/ConversionsExtensionsCatalog.cs
Original file line number Diff line number Diff line change
@@ -112,19 +112,16 @@ public static ValueToKeyMappingEstimator MapValueToKey(this TransformsCatalog.Co
=> new ValueToKeyMappingEstimator(CatalogUtils.GetEnvironment(catalog), inputColumn, outputColumn, maxNumTerms, sort);

/// <summary>
/// Converts value types into <see cref="KeyType"/> loading the keys to use from <paramref name="file"/>.
/// Converts value types into <see cref="KeyType"/>, optionally loading the keys to use from <paramref name="keyData"/>.
/// </summary>
/// <param name="catalog">The categorical transform's catalog.</param>
/// <param name="columns">The data columns to map to keys.</param>
/// <param name="file">The path of the file containing the terms.</param>
/// <param name="termsColumn"></param>
/// <param name="loaderFactory"></param>
/// <param name="keyData">The data view containing the terms. If specified, this should be a single column data
/// view, and the key-values will be taken from taht column. If unspecified, the key-values will be determined
/// from the input data upon fitting.</param>
public static ValueToKeyMappingEstimator MapValueToKey(this TransformsCatalog.ConversionTransforms catalog,
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Jan 25, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ValueToKeyMappingEstimator [](start = 22, length = 26)

I've tried to look it up, but github doesn't allow you to smoothly explore file changes if file get renamed.
Why we even have this public constructor?
We already have ValueMapping which accepts IEnumerable for keys and ZeeshanA adding another constructor which accepts IDataView.
Considering all this restriction required on IDataView, don't you think it's easier to use ValueMapping transform?

It just a question, not a PR blocker.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Ivan, I don't understand this question, could you clarify?

Are you asking why there is a public constructor on the estimators? I did not change the public/non-public nature of this constructor of the estimator. If you are pointing out that the estimator constructors should be non-public since we prefer things be created through MLContext, I agree, and I believe there is already an issue tracking this, #2100. However, the issue I am addressing here is orthogonal to that. I agree it should be done. I see no one is assigned to it.

Also I am not sure what you are talking about w.r.t. another constructor. Certainly I see no other constructor here. I am trying to address the issue I see that whoever did this original estimator API decided, for whatever reason, that using component factories and IDataLoader was a great idea, as discussed in the issue. The most obvious way to maintain the same sort of functionality (loading something from a file) is to provide the constructor I have provided. Are you disagreeing with that?

Also I did not rename the file?

But maybe I am misunderstanding your question. Perhaps you could clarify.

ValueToKeyMappingTransformer.ColumnInfo[] columns,
string file = null,
string termsColumn = null,
IComponentFactory<IMultiStreamSource, IDataLoader> loaderFactory = null)
=> new ValueToKeyMappingEstimator(CatalogUtils.GetEnvironment(catalog), columns, file, termsColumn, loaderFactory);
ValueToKeyMappingTransformer.ColumnInfo[] columns, IDataView keyData = null)
=> new ValueToKeyMappingEstimator(CatalogUtils.GetEnvironment(catalog), columns, keyData);

/// <summary>
/// Maps specified keys to specified values
27 changes: 15 additions & 12 deletions src/Microsoft.ML.Data/Transforms/ValueToKeyMappingEstimator.cs
Original file line number Diff line number Diff line change
@@ -9,7 +9,7 @@
namespace Microsoft.ML.Transforms.Conversions
{
/// <include file='doc.xml' path='doc/members/member[@name="ValueToKeyMappingEstimator"]/*' />
public sealed class ValueToKeyMappingEstimator: IEstimator<ValueToKeyMappingTransformer>
public sealed class ValueToKeyMappingEstimator : IEstimator<ValueToKeyMappingTransformer>
{
public static class Defaults
{
@@ -19,9 +19,7 @@ public static class Defaults

private readonly IHost _host;
private readonly ValueToKeyMappingTransformer.ColumnInfo[] _columns;
private readonly string _file;
private readonly string _termsColumn;
private readonly IComponentFactory<IMultiStreamSource, IDataLoader> _loaderFactory;
private readonly IDataView _keyData;

/// <summary>
/// Initializes a new instance of <see cref="ValueToKeyMappingEstimator"/>.
@@ -33,23 +31,28 @@ public static class Defaults
/// <param name="sort">How items should be ordered when vectorized. If <see cref="ValueToKeyMappingTransformer.SortOrder.Occurrence"/> choosen they will be in the order encountered.
/// If <see cref="ValueToKeyMappingTransformer.SortOrder.Value"/>, items are sorted according to their default comparison, for example, text sorting will be case sensitive (for example, 'A' then 'Z' then 'a').</param>
public ValueToKeyMappingEstimator(IHostEnvironment env, string inputColumn, string outputColumn = null, int maxNumTerms = Defaults.MaxNumTerms, ValueToKeyMappingTransformer.SortOrder sort = Defaults.Sort) :
this(env, new [] { new ValueToKeyMappingTransformer.ColumnInfo(inputColumn, outputColumn ?? inputColumn, maxNumTerms, sort) })
this(env, new[] { new ValueToKeyMappingTransformer.ColumnInfo(inputColumn, outputColumn ?? inputColumn, maxNumTerms, sort) })
{
}

public ValueToKeyMappingEstimator(IHostEnvironment env, ValueToKeyMappingTransformer.ColumnInfo[] columns,
string file = null, string termsColumn = null,
IComponentFactory<IMultiStreamSource, IDataLoader> loaderFactory = null)
public ValueToKeyMappingEstimator(IHostEnvironment env, ValueToKeyMappingTransformer.ColumnInfo[] columns, IDataView keyData = null)
{
Contracts.CheckValue(env, nameof(env));
_host = env.Register(nameof(ValueToKeyMappingEstimator));
_host.CheckNonEmpty(columns, nameof(columns));
_host.CheckValueOrNull(keyData);
if (keyData != null && keyData.Schema.Count != 1)
{
throw _host.ExceptParam(nameof(keyData), "If specified, this data view should contain only a single column " +
$"containing the terms to map, but this had {keyData.Schema.Count} columns.");

}

_columns = columns;
_file = file;
_termsColumn = termsColumn;
_loaderFactory = loaderFactory;
_keyData = keyData;
}

public ValueToKeyMappingTransformer Fit(IDataView input) => new ValueToKeyMappingTransformer(_host, input, _columns, _file, _termsColumn, _loaderFactory);
public ValueToKeyMappingTransformer Fit(IDataView input) => new ValueToKeyMappingTransformer(_host, input, _columns, _keyData, false);

public SchemaShape GetOutputSchema(SchemaShape inputSchema)
{
Loading