-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Multicolumn mapping for some estimators #3066
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
d08a2f9
to
0896590
Compare
/// <summary> | ||
/// Name of the column to transform. If set to <see langword="null"/>, the value of the <see cref="OutputColumnName"/> will be used as source. | ||
/// </summary> | ||
public readonly string InputColumnName; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
public readonly string InputColumnName [](start = 7, length = 39)
I'm slightly confuse.
We get rid of ColumnOptions because they were immutable, and now we add another immutable class... #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will invite you to check with Tom about this, I am just executing what he asked
In reply to: 268333577 [](ancestors = 268333577)
/// <summary> | ||
/// Specifies input and output column names for a transformation. | ||
/// </summary> | ||
public sealed class InputOutputColumnPair |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
InputOutputColumnPair [](start = 24, length = 21)
what is difference between this one and ColumnOptions
? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not much, I would have used the bellow if it were me, but Tom asked it to be different, for some reason!
In reply to: 268334077 [](ancestors = 268334077)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, a very good reason. ColumnOptions
is a struct meant to serve a specific transformer base, and that is involved in the type heirarchy, and in particular something that captured all of the individual settings and state for each mapping. Our goals here were comparatively more modest: we just needed to . Well designed code does what it is designed to do, and in the simplest possible way. There is no need for this to be part of an elaborate type hierarchy -- this was in fact the mistake that led to the issue #2884 being filed, that we'd conflated two distinct techniques for the extremely bad reasoning that they both had to do with "column." (And, of course, if something deals with the same sort of object, obviously they belong in the same type heirarchy, right?)
In reply to: 268334389 [](ancestors = 268334389,268334077)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just in case, I'm not talking about ColumnOptions
inside Estimator, (for example HashingEstimator.ColumnOptions
.
I'm talking about ColumnOptions
in this exact file few lines below (line 41)
Only difference I see is Input/output column names is public in this one instead of private in other, and name of class.
Ok, two difference, one below has implicit converter from tuple.
Can we delete ColumnOptions
in this file?
In reply to: 268337035 [](ancestors = 268337035,268334389,268334077)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's fine. What Artidoro and I had discussed was actually somewhat different. (Or we were talking about two separate things without realizing it.)
In reply to: 268339442 [](ancestors = 268339442,268337035,268334389,268334077)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess we were talking about something different, glad we are on the same page.
In reply to: 268760195 [](ancestors = 268760195,268339442,268337035,268334389,268334077)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems good to me. The ColumnOptions
I had been talking about (which is to say, the vast majority of things with that name) are as I said meant to serve a different purpose.
We should probably rename them back to ColumnInfo
at some point but this can be delayed as they are internal... ummm except one. Whoops. Opened #3078. :) Aside from that, yeah.
In reply to: 268771510 [](ancestors = 268771510,268760195,268339442,268337035,268334389,268334077)
Since @ivanbasov asked for it the third commit shows the how this would look like if we removed the Let me know what looks better! I will either eliminate this commit or keep it. Notice that even in commit 3 |
Codecov Report
@@ Coverage Diff @@
## master #3066 +/- ##
==========================================
- Coverage 72.53% 72.51% -0.03%
==========================================
Files 806 806
Lines 144282 144642 +360
Branches 16183 16197 +14
==========================================
+ Hits 104661 104889 +228
- Misses 35217 35342 +125
- Partials 4404 4411 +7
|
/// <param name="columns">Specifies the names of the columns on which to apply the transformation.</param> | ||
/// <param name="outputKind">The expected kind of the output column.</param> | ||
public static TypeConvertingEstimator ConvertType(this TransformsCatalog.ConversionTransforms catalog, | ||
InputOutputColumnPair[] columns, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
columns [](start = 36, length = 7)
Don't you need to check for null for the columns arg to avoid null reference exception? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added the checks thanks for pointing out. I only fixed the extensions that are public. Not those that are internal.
In reply to: 268724517 [](ancestors = 268724517)
=> new KeyToValueMappingEstimator(CatalogUtils.GetEnvironment(catalog), ColumnOptions.ConvertToValueTuples(columns)); | ||
/// <param name="catalog">The conversion transform's catalog.</param> | ||
/// <param name="columns">Specifies the names of the columns on which to apply the transformation.</param> | ||
public static KeyToValueMappingEstimator MapKeyToValue(this TransformsCatalog.ConversionTransforms catalog, InputOutputColumnPair[] columns) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
columns [](start = 140, length = 7)
also check for null #Resolved
/// Instantiates a <see cref="ColumnOptions"/> from a tuple of input and output column names. | ||
/// </summary> | ||
public static implicit operator ColumnOptions((string outputColumnName, string inputColumnName) value) | ||
public InputOutputColumnPair(string outputColumnName, string inputColumnName = null) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
public InputOutputColumnPair(string outputColumnName, string inputColumnName = null) [](start = 8, length = 84)
Add another overload for the case when input name = output name. That would make it a lot clearer vs setting one to null. #ByDesign
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately I believe that everywhere in the codebase we have the same pattern string outputColumnName, string inputColumnName = null
. This is found in all the mlContext extensions for transforms. I can add it here, but I think it would make more sense to stick to the general pattern.
In reply to: 268726871 [](ancestors = 268726871)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// <summary> | ||
/// Name of the column to transform. If set to <see langword="null"/>, the value of the <see cref="OutputColumnName"/> will be used as source. | ||
/// </summary> | ||
public readonly string InputColumnName; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
InputColumnName [](start = 31, length = 15)
Shouldn't it be in reverse - the input be set, but output be optional and equal to input if out is null?? #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ValueToKeyMappingEstimator.KeyOrdinality keyOrdinality = ValueToKeyMappingEstimator.Defaults.Ordinality, | ||
IDataView keyData = null) | ||
{ | ||
var columnOptions = columns.Select(x => new OneHotEncodingEstimator.ColumnOptions(x.OutputColumnName, x.InputColumnName, outputKind, maximumNumberOfKeys, keyOrdinality)).ToArray(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
columns [](start = 32, length = 7)
check for null here and elsewhere #Resolved
("out1", "VectorFloat"), | ||
("out2", "VectorDouble") | ||
columns: new[] { | ||
new InputOutputColumnPair("out1", "VectorFloat"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
InputOutputColumnPair [](start = 28, length = 21)
why do we have to use the more verbose initializer here? Ideally we should use the old syntax is possible as it is a lot more compact. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because the other one uses tuples. We have decided not to have tuples in the public surface any longer.
In reply to: 268729946 [](ancestors = 268729946)
/// Instantiates a <see cref="ColumnOptions"/> from a tuple of input and output column names. | ||
/// </summary> | ||
public static implicit operator ColumnOptions((string outputColumnName, string inputColumnName) value) | ||
public InputOutputColumnPair(string outputColumnName, string inputColumnName = null) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
outputColumnName [](start = 44, length = 16)
Check non-empty on outputColumName
probably. #Resolved
/// <summary> | ||
/// Name of the column to transform. If set to <see langword="null"/>, the value of the <see cref="OutputColumnName"/> will be used as source. | ||
/// </summary> | ||
public readonly string InputColumnName; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
InputColumnName [](start = 31, length = 15)
Should these be properties? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for working on this @artidoro !! I see the central thing still isn't using properties, but I am not certain that is absolutely essential. Might be nice though if you get to it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for reviewing, I am updating now with the latest changes. |
Adding multicolumn mapping for some estimators (as per list by @TomFinley and @glebuk):
Leaving out:
Let me know if I should add more estimators.
Fixes #3068
Related to #2884