Skip to content

KeyToVector estimators #858

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 21 commits into from
Sep 12, 2018

Conversation

Ivanidzo4ka
Copy link
Contributor

Move KeyToVector to estimators land

@@ -232,7 +232,7 @@ public static IDataTransform Create(IHostEnvironment env, Arguments args, IDataV
Column = cols.ToArray()
};

transform = new KeyToVectorTransform(h, keyToVecArgs, input);
transform =KeyToVectorTransform.Create(h, keyToVecArgs, input);
Copy link
Contributor Author

@Ivanidzo4ka Ivanidzo4ka Sep 7, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

KeyToVectorTransform [](start = 31, length = 20)

move to non arguments creator #Closed

Bag = bag;
}
}
internal sealed class ColInfo
Copy link
Contributor Author

@Ivanidzo4ka Ivanidzo4ka Sep 7, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

internal [](start = 8, length = 8)

new line #Resolved

{
if (bag || info.TypeSrc.ValueCount == 1)
InputSchema.TryGetColumnIndex(_infos[i].Source, out int srcCol);
//IVAN: Simplify
Copy link
Contributor Author

@Ivanidzo4ka Ivanidzo4ka Sep 7, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IVAN [](start = 18, length = 4)

get rid of //IVAN #Resolved

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't get rid of Ivan. He's such a useful, competent person to have around.


In reply to: 216100588 [](ancestors = 216100588)

// Output is the concatenation of the multiple output indicator vectors.
concat = true;
type = new VectorType(NumberType.Float, info.TypeSrc.ValueCount, size);
var type = new VectorType(NumberType.Float, _parent._valueCounts[i], _parent._sizes[i]);
if (typeNames != null && type.VectorSize > 0)
Copy link
Contributor Author

@Ivanidzo4ka Ivanidzo4ka Sep 8, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

VectorSize > 0 [](start = 50, length = 14)

IsKnownSizeVector #Resolved


// These arrays are parallel to Infos.
// * _bag indicates whether vector inputs should have their output indicator vectors added
// (instead of concatenated). This is faithful to what the user specified in the Arguments
Copy link
Member

@sfilipi sfilipi Sep 8, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why remove the comment? #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left comment for bags, everything is no longer exist in code


In reply to: 216121559 [](ancestors = 216121559)

}

public override bool CanSavePfa => true;
public override bool CanSaveOnnx => true;
// Factory method for SignatureLoadModel.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// Factor [](start = 8, length = 9)

Convert it to XML doc, since it is a public method (and if there is anything more to add)

public static IDataTransform Create(IHostEnvironment env, IDataView input, params ColumnInfo[] columns) =>
new KeyToVectorTransform(env, input, columns).MakeDataTransform(input);

// Factory method for SignatureDataTransform.
Copy link
Member

@sfilipi sfilipi Sep 8, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// Fa [](start = 8, length = 5)

same here. #ByDesign

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll rather make component catalog to support internal Create methods


In reply to: 216121776 [](ancestors = 216121776)

return new VersionInfo(
modelSignature: "KEY2VECT",
//verWrittenCur: 0x00010001, // Initial
verWrittenCur: 0x00010002, // Convert to Estimators
Copy link
Contributor

@TomFinley TomFinley Sep 8, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

verWrittenCur: 0x00010002, // Convert to Estimators [](start = 16, length = 51)

As far as I know it has not been necessary to break backwards compatibility so far, why is it necessary here? #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No more backward incompatible.


In reply to: 216140582 [](ancestors = 216140582)

Copy link
Contributor

@TomFinley TomFinley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for looking at this @Ivanidzo4ka . Key to vector, as the final step for much of our categorical and text processing pipelines, is one of the most important transforms in our repertoire. By breaking backwards compatibility, we will break many prior models, which does not seem like a good move. Let us try to imagine a way, as we have with all other transforms as far as I am aware, to have the old models successfully load as transformers now.

@Ivanidzo4ka Ivanidzo4ka added the API Issues pertaining the friendly API label Sep 9, 2018
@Ivanidzo4ka Ivanidzo4ka added this to the 0918 milestone Sep 9, 2018
ComputeType(this, Source.Schema, i, Infos[i], _bag[i], Metadata,
out _types[i], out _concat[i]);
if (!schema.TryGetColumnIndex(ColumnPairs[i].input, out int colSrc))
throw Host.ExceptUserArg(nameof(ColumnPairs), "Source column '{0}' not found", ColumnPairs[i].input);
Copy link
Contributor

@Zruty0 Zruty0 Sep 10, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ExceptUserArg [](start = 31, length = 13)

use ExceptSchemaMismatch #Closed

}
}

public sealed class KeyToVectorEstimator : IEstimator<KeyToVectorTransform>
Copy link
Contributor

@Zruty0 Zruty0 Sep 10, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IEstimator [](start = 47, length = 10)

we agreed to make this estimator a trivial estimator, and not enforce the match between train-time schema and transform-time schema #Closed


var pipe = new KeyToVectorEstimator(Env, new KeyToVectorTransform.ColumnInfo("TermA", "CatA", false),
new KeyToVectorTransform.ColumnInfo("TermB", "CatB", true));
TestEstimatorCore(pipe, dataView);
Copy link
Contributor

@Zruty0 Zruty0 Sep 10, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TestEstimatorCore [](start = 12, length = 17)

Make sure to call Done() when you are done #Closed

var result = pipe.Fit(dataView).Transform(dataView);
ValidateMetadata(result);
}
void ValidateMetadata(IDataView result)
Copy link
Contributor

@Zruty0 Zruty0 Sep 10, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

void [](start = 8, length = 4)

spacing, and explicit private #Closed

}

[Fact]
void TestCommandLine()
Copy link
Contributor

@Zruty0 Zruty0 Sep 10, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

void [](start = 8, length = 4)

explicit public as well #Closed

Copy link
Contributor

@Zruty0 Zruty0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🕐

SaveBase(ctx);
// for each added column
// int: keyCount
// int: valueCount
Copy link
Contributor

@Zruty0 Zruty0 Sep 10, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove now? #Closed

@@ -241,6 +241,13 @@ private static (string input, string output)[] GetColumnPairs(ColumnInfo[] colum
return columns.Select(x => (x.Input, x.Output)).ToArray();
}

internal static string TestIsKnownDataKind(ColumnType type)
Copy link
Contributor Author

@Ivanidzo4ka Ivanidzo4ka Sep 10, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

static [](start = 17, length = 6)

don't have to be static #Resolved


var result = pipe.Fit(dataView).Transform(dataView);
ValidateMetadata(result);
}
Copy link
Contributor

@Zruty0 Zruty0 Sep 10, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

} [](start = 7, length = 2)

Done() #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what for?


In reply to: 216488081 [](ancestors = 216488081)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's the contract. Just in case you want to eventually add a baseline comparison here, it's better to end every test in TestDataPipeBase with Done()


In reply to: 216488662 [](ancestors = 216488662,216488081)

Copy link
Contributor

@Zruty0 Zruty0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

/// </summary>
private ValueGetter<VBuffer<Float>> MakeGetterInd(IRow input, int iinfo)
public static Vector<TValue> ToVector<TKey, TValue>(this Key<TKey, TValue> input, bool bag = DefaultBag)
Copy link
Contributor Author

@Ivanidzo4ka Ivanidzo4ka Sep 11, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TValue [](start = 29, length = 6)

Should be float #Closed

}

[Fact]
public void KeyToVectorPigsty()
Copy link
Contributor

@Zruty0 Zruty0 Sep 11, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pigsty [](start = 31, length = 6)

Rename to 'static' #Resolved

/// <summary>
/// Extension methods for the static-pipeline over <see cref="PipelineColumn"/> objects.
/// </summary>
public static class KeyToVectorExtensions
Copy link
Contributor

@Zruty0 Zruty0 Sep 11, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Vector [](start = 29, length = 6)

BinaryVector ? #Resolved

}

private void EncodeValueToBinary(BufferBuilder<float> bldr, uint value, int bitsToConsider, int startIndex)
/// <summary>
/// Takes a column of key type of known cardinality and produces a binary encoded indicator vector of zeros and ones.
Copy link
Contributor

@Zruty0 Zruty0 Sep 11, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

binary encoded indicator vector of zeros and ones [](start = 75, length = 49)

a vector of bits representing the key in binary form. #Resolved

@TomFinley TomFinley dismissed their stale review September 11, 2018 19:48

revoking review

/// </summary>
private ValueGetter<VBuffer<Float>> MakeGetterInd(IRow input, int iinfo)
public static Vector<float> ToVector<TKey, TValue>(this Key<TKey, TValue> input, bool bag = DefaultBag)
Copy link
Contributor

@TomFinley TomFinley Sep 11, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bool bag = DefaultBag [](start = 89, length = 21)

Is bagging a sensible parameter on a scalar value? Nothing is being combined with anything else -- there's only one thing. #Closed

/// <summary>
/// Takes a column of key type of known cardinality and produces an indicator vector of floats.
/// </summary>
public static Vector<float> ToVector<TKey, TValue>(this Vector<Key<TKey, TValue>> input, bool bag = DefaultBag)
Copy link
Contributor

@TomFinley TomFinley Sep 11, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

public static Vector ToVector<TKey, TValue>(this Vector<Key<TKey, TValue>> input, bool bag = DefaultBag) [](start = 8, length = 111)

One of the most important usages of KeyToVector comes in the VarVector input case (e.g., text processing). It is crucially important that we support that scenario. In that case if you wanted to support bagging vs. not, you will need two overloads. I suggest to name one of them ToVector (no bagging, returns VarVector<float>), and the other to name it ToBaggedVector (bagging, returns Vector<float>).

Because of this distinction is necessary for the VarVector input case (which is again really important), I suggest we retain the distiction for consitency sake for the fixed size vector case, for reasons of user consistency. #Resolved

}

/// <summary>
/// Takes a column of key type of known cardinality and produces an indicator vector of floats.
Copy link
Contributor

@TomFinley TomFinley Sep 11, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Takes a column of key type of known cardinality and produces an indicator vector of floats. [](start = 12, length = 91)

This description is insufficient. Do you feel like you could write better documentation, more in line with the other Pigsty extension methods? It is important for our public API that we get that right.

A sufficient description would describe what is happening to the data. Also crucial is a discussion on what happens in bagged vs. not. #Pending

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this better?


In reply to: 216801051 [](ancestors = 216801051)

/// </summary>
private ValueGetter<VBuffer<Float>> MakeGetterInd(IRow input, int iinfo)
public static Vector<float> ToVector<TKey, TValue>(this Key<TKey, TValue> input, bool bag = DefaultBag)
Copy link
Contributor

@TomFinley TomFinley Sep 11, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Key<TKey, TValue> [](start = 64, length = 17)

Hi @Ivanidzo4ka. I suggest you change the input type from Key<TKey, TValue> to just plain old Key<T>. Whether or not there is key-value metadata is immaterial to the functioning of this transform, and Key<T,V> descends from Key<T> anyway. #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this comment only for this case, or in general across all methods?


In reply to: 216801604 [](ancestors = 216801604)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything ought to allow a value of Key<T> and Key<T,V>, by some mechanism. I had, however, forgotten about Vector<Key<T,V>> not being a Vector<Key<T>>... this will require a separate overload I think.


In reply to: 216812660 [](ancestors = 216812660,216801604)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I added Key and Key<TKey,TValue> extensions


In reply to: 216841257 [](ancestors = 216841257,216812660,216801604)

/// </summary>
private ValueGetter<VBuffer<Float>> MakeGetterBag(IRow input, int iinfo)
/// <summary>
/// Extension methods for the static-pipeline over <see cref="PipelineColumn"/> objects.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

static-pipeline over objects [](start = 34, length = 57)

I I think you meant static pipelines over the key-to-vector transform, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just copied this line from other code in hope what this is common pattern. Is it not the case?


In reply to: 216801837 [](ancestors = 216801837)

Copy link
Contributor

@TomFinley TomFinley Sep 12, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it is I or someone did not provide the most helpful message. But if so, I do not view it as terribly consequential -- after using it day in day out for the better part of a decade I could not honestly tell you I have ever once read the documentation of whatever class provides the LINQ extension methods, though assuredly I have used them many thousands of times. And if I can do so for this, what chance would I or anyone have of seeing this doc? Indeed you could put something incredibly NSFW here and I'm almost certain no user would ever know. And this writing is not wrong, just kinda unspecific. So I guess it doesn't matter.


In reply to: 216812383 [](ancestors = 216812383,216801837)

@TomFinley
Copy link
Contributor

Hi @[email protected] thanks for this, just a few more necessary changes.

Copy link
Contributor

@TomFinley TomFinley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much @Ivanidzo4ka , this appears quite nice.

Copy link
Contributor

@TomFinley TomFinley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

Copy link
Contributor

@TomFinley TomFinley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @Ivanidzo4ka. This is an essential piece of the puzzle, arguably our most important mechanism for featurization.

@TomFinley TomFinley merged commit 356cad4 into dotnet:master Sep 12, 2018
@ghost ghost locked as resolved and limited conversation to collaborators Mar 29, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
API Issues pertaining the friendly API
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants