-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Replaces ChooseColumnsTransform and DropColumnsTransform with SelectColumnsTransform #1371
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
replacing them with SelectColumnsTransform. These changes include: * Updates to SelectColumnsTransform to respect ordering when keeping columns. For example, if the input is ABC and CB is selected, the output will be CB. * Updates to code that used Choose or Drop columns, replacing with SelectColumns. * Updates to baseline output for tests to pass * Re-enabled the SavePipeline tests This fixes dotnet#1342 These changes are also related to dotnet#754
The default behavior of ChooseTransform was the opposite of this. We should consider whether we want to change it here as well. Refers to: src/Microsoft.ML.Data/Transforms/SelectColumnsTransform.cs:182 in 99e8aa5. [](commit_id = 99e8aa5, deletion_comment = False) |
// finding the associated index that should be used. | ||
foreach(var columnName in selectedColumns) | ||
{ | ||
if (columnDict.TryGetValue(columnName, out List<int> columnList)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TryGetValue [](start = 39, length = 11)
I think this should fail if the schema doesn't contain a column by that name. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you referring to in general or specifically in this case? In general there is an ignoreMissing parameter that when it is true will ignore any columns specified not being in the input. This is checked earlier before the Mapper is created.
In reply to: 228247404 [](ancestors = 228247404)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I see. I didn't notice the IgnoreMissing argument. In that case, are we sure we want it to default to true?
In reply to: 228257636 [](ancestors = 228257636,228247404)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it defaults to true, then that means that if the user makes a mistake in the column name, they will not get an error message about it.
In reply to: 228264118 [](ancestors = 228264118,228257636,228247404)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Zruty0 - Yael makes a great point! What are your thoughts on defaulting ignoreMissing to false? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok I will default ignoreMissing to false. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ignoreMissing now defaults to false.
In reply to: 228264779 [](ancestors = 228264779,228264118,228257636,228247404)
I feel the same -- Pete preferred this to be KeepHidden as the default. So there are places in code where I have to explicitly specify the drop. Lets bring Pete into the conversation. In reply to: 433115221 [](ancestors = 433115221) Refers to: src/Microsoft.ML.Data/Transforms/SelectColumnsTransform.cs:182 in 99e8aa5. [](commit_id = 99e8aa5, deletion_comment = False) |
@@ -384,6 +384,27 @@ public static IDataView Create(IHostEnvironment env, ModelLoadContext ctx, IData | |||
return transform.Transform(input); | |||
} | |||
|
|||
public static IDataTransform CreateKeep(IHostEnvironment env, IDataView input, params string[] keepColumns) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IDataTransform CreateKeep [](start = 22, length = 25)
Do you have any place where it requires to be IDataTransform and not IDataView?
If not, I would suggest you to not use arguments class and just create SelectColumnsTransform and pass required parameters into constructor and then call .Transform on input.
#Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the signature is this:
public SelectColumnsTransform(IHostEnvironment env, string[] keepColumns, string[] dropColumns, bool keepHidden=true, bool ignoreMissing=true)
where either the keep columns or drop columns can be set. So I used the CreateKeep and CreateDrop so that its really clear as to what the select is doing.
#Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple of thoughts I have on this:
- Have CreateKeep and CreateDrop return a SelectColumnsTransform. Then you can call Transform on it with the input ( SelectColumnsTransform.CreateKeep(...).Transform(input).
- Change the signature for SelectColumnsTransform to instead take an enumerated type of Keep or Drop (i.e. SelectColumnsTransform(env, Select.Keep, cols,...) or SelectColumnsTransform(env, Select.Drop, cols,...)
I like the #2 option as it would fall more naturally with how we tend to work with Transforms and would fall inline with what you are mentioning.
Thoughts? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry Ivan, I misunderstood - so the feedback is to not use Arguments inside the function and return an IDataView instead of IDataTransform. Got it, I will make those changes. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was able to remove the usage of Arguments however I was not able to move to an IDataView. There are other transforms that will use this transform and expect to return an IDataTransform.
In reply to: 229034034 [](ancestors = 229034034)
FYI - the build failed on the mac debug build but passed on all other platforms which leads me to believe this is one of the intermittent test failures that we see on the mac. I am following up with @TomFinley to confirm since he is investigating this issue. #Resolved |
@singlis can you update the branch so that we get a fresh build? #Resolved |
yes - will do. #Resolved |
var view = new ChooseColumnsTransform(host, args, lookup); | ||
args.Column = copyColumns.ToArray(); | ||
|
||
var view = CopyColumnsTransform.Create(host, args, lookup); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CopyColumnsTransform [](start = 23, length = 20)
can we use a constructor instead of args-based Create
? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't remember a specific reason. Probably, let's do the least destructive conversion, so preserve whatever In reply to: 433130838 [](ancestors = 433130838,433115221) Refers to: src/Microsoft.ML.Data/Transforms/SelectColumnsTransform.cs:182 in 99e8aa5. [](commit_id = 99e8aa5, deletion_comment = False) |
default behavior of ChooseColumns. - Addressed other feedback such as removing use of Argument classes.
@@ -179,14 +179,14 @@ public sealed class Arguments : TransformInputBase | |||
public string[] DropColumns; | |||
|
|||
[Argument(ArgumentType.AtMostOnce, HelpText = "Specifies whether to keep or remove hidden columns.", ShortName = "hidden", SortOrder = 3)] | |||
public bool KeepHidden = true; | |||
public bool KeepHidden = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
false [](start = 37, length = 5)
nit: We usually have Defaults class with const properties, which ease a bit spreading of parameters across code. #Resolved
using Xunit; | ||
using Xunit.Abstractions; | ||
|
||
namespace Microsoft.ML.Tests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Microsoft.ML.Tests [](start = 10, length = 18)
nit: it's Microsoft.ML.Tests.Transformers in other tests. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Select with drop respects keepHidden, defaults to false. - When loading in a KeepColumns transform, defaults keepHidden to true for back compat. - Added additional tests
Replaces ChooseColumnsTransform and DropColumnsTransform with SelectColumnsTransform.
These changes include:
columns. For example, if the input is ABC and CB is selected, the output
will be CB.
SelectColumns.
This fixes #1342
These changes are also related to #754