Skip to content

Replaces ChooseColumnsTransform and DropColumnsTransform with SelectColumnsTransform #1371

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Oct 30, 2018

Conversation

singlis
Copy link
Member

@singlis singlis commented Oct 25, 2018

Replaces ChooseColumnsTransform and DropColumnsTransform with SelectColumnsTransform.

These changes include:

  • Updates to SelectColumnsTransform to respect ordering when keeping
    columns. For example, if the input is ABC and CB is selected, the output
    will be CB.
  • Updates to code that used Choose or Drop columns, replacing with
    SelectColumns.
  • Updates to baseline output for tests to pass
  • Re-enabled the SavePipeline tests

This fixes #1342
These changes are also related to #754

replacing them with SelectColumnsTransform. These changes include:
* Updates to SelectColumnsTransform to respect ordering when keeping
columns. For example, if the input is ABC and CB is selected, the output
will be CB.
* Updates to code that used Choose or Drop columns, replacing with
SelectColumns.
* Updates to baseline output for tests to pass
* Re-enabled the SavePipeline tests

This fixes dotnet#1342
These changes are also related to dotnet#754
@yaeldekel
Copy link

yaeldekel commented Oct 25, 2018

        public bool KeepHidden = true;

The default behavior of ChooseTransform was the opposite of this. We should consider whether we want to change it here as well.
Also, in all the components that use Choose/Drop columns (such as NAHandle, TermLookup, etc.) we should verify that the behavior stays the same, i.e. that no extra columns are dropped/kept.
#Resolved


Refers to: src/Microsoft.ML.Data/Transforms/SelectColumnsTransform.cs:182 in 99e8aa5. [](commit_id = 99e8aa5, deletion_comment = False)

@yaeldekel
Copy link

yaeldekel commented Oct 25, 2018

        //env.CheckDecode(cbFloat == sizeof(Float));

Should this be uncommented? #Resolved


Refers to: src/Microsoft.ML.Data/Transforms/SelectColumnsTransform.cs:235 in 99e8aa5. [](commit_id = 99e8aa5, deletion_comment = False)

// finding the associated index that should be used.
foreach(var columnName in selectedColumns)
{
if (columnDict.TryGetValue(columnName, out List<int> columnList))
Copy link

@yaeldekel yaeldekel Oct 25, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TryGetValue [](start = 39, length = 11)

I think this should fail if the schema doesn't contain a column by that name. #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you referring to in general or specifically in this case? In general there is an ignoreMissing parameter that when it is true will ignore any columns specified not being in the input. This is checked earlier before the Mapper is created.


In reply to: 228247404 [](ancestors = 228247404)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see. I didn't notice the IgnoreMissing argument. In that case, are we sure we want it to default to true?


In reply to: 228257636 [](ancestors = 228257636,228247404)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it defaults to true, then that means that if the user makes a mistake in the column name, they will not get an error message about it.


In reply to: 228264118 [](ancestors = 228264118,228257636,228247404)

Copy link
Member Author

@singlis singlis Oct 25, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Zruty0 - Yael makes a great point! What are your thoughts on defaulting ignoreMissing to false? #Resolved

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree


In reply to: 228266038 [](ancestors = 228266038)

Copy link
Member Author

@singlis singlis Oct 29, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I will default ignoreMissing to false. #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ignoreMissing now defaults to false.


In reply to: 228264779 [](ancestors = 228264779,228264118,228257636,228247404)

@singlis
Copy link
Member Author

singlis commented Oct 25, 2018

        public bool KeepHidden = true;

I feel the same -- Pete preferred this to be KeepHidden as the default. So there are places in code where I have to explicitly specify the drop. Lets bring Pete into the conversation.
@Zruty0 - Can you explain further on why KeepHidden should default to true instead of false?


In reply to: 433115221 [](ancestors = 433115221)


Refers to: src/Microsoft.ML.Data/Transforms/SelectColumnsTransform.cs:182 in 99e8aa5. [](commit_id = 99e8aa5, deletion_comment = False)

@singlis
Copy link
Member Author

singlis commented Oct 25, 2018

        //env.CheckDecode(cbFloat == sizeof(Float));

Yes


In reply to: 433118204 [](ancestors = 433118204)


Refers to: src/Microsoft.ML.Data/Transforms/SelectColumnsTransform.cs:235 in 99e8aa5. [](commit_id = 99e8aa5, deletion_comment = False)

@@ -384,6 +384,27 @@ public static IDataView Create(IHostEnvironment env, ModelLoadContext ctx, IData
return transform.Transform(input);
}

public static IDataTransform CreateKeep(IHostEnvironment env, IDataView input, params string[] keepColumns)
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Oct 25, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IDataTransform CreateKeep [](start = 22, length = 25)

Do you have any place where it requires to be IDataTransform and not IDataView?
If not, I would suggest you to not use arguments class and just create SelectColumnsTransform and pass required parameters into constructor and then call .Transform on input.
#Resolved

Copy link
Member Author

@singlis singlis Oct 27, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the signature is this:

public SelectColumnsTransform(IHostEnvironment env, string[] keepColumns, string[] dropColumns, bool keepHidden=true, bool ignoreMissing=true)

where either the keep columns or drop columns can be set. So I used the CreateKeep and CreateDrop so that its really clear as to what the select is doing.
#Resolved

Copy link
Member Author

@singlis singlis Oct 29, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of thoughts I have on this:

  1. Have CreateKeep and CreateDrop return a SelectColumnsTransform. Then you can call Transform on it with the input ( SelectColumnsTransform.CreateKeep(...).Transform(input).
  2. Change the signature for SelectColumnsTransform to instead take an enumerated type of Keep or Drop (i.e. SelectColumnsTransform(env, Select.Keep, cols,...) or SelectColumnsTransform(env, Select.Drop, cols,...)

I like the #2 option as it would fall more naturally with how we tend to work with Transforms and would fall inline with what you are mentioning.

Thoughts? #Resolved

Copy link
Member Author

@singlis singlis Oct 29, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry Ivan, I misunderstood - so the feedback is to not use Arguments inside the function and return an IDataView instead of IDataTransform. Got it, I will make those changes. #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was able to remove the usage of Arguments however I was not able to move to an IDataView. There are other transforms that will use this transform and expect to return an IDataTransform.


In reply to: 229034034 [](ancestors = 229034034)

@singlis
Copy link
Member Author

singlis commented Oct 25, 2018

FYI - the build failed on the mac debug build but passed on all other platforms which leads me to believe this is one of the intermittent test failures that we see on the mac. I am following up with @TomFinley to confirm since he is investigating this issue. #Resolved

@shauheen
Copy link
Contributor

shauheen commented Oct 26, 2018

@singlis can you update the branch so that we get a fresh build? #Resolved

@singlis
Copy link
Member Author

singlis commented Oct 26, 2018

yes - will do. #Resolved

var view = new ChooseColumnsTransform(host, args, lookup);
args.Column = copyColumns.ToArray();

var view = CopyColumnsTransform.Create(host, args, lookup);
Copy link
Contributor

@Zruty0 Zruty0 Oct 29, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CopyColumnsTransform [](start = 23, length = 20)

can we use a constructor instead of args-based Create ? #Resolved

Copy link
Contributor

@Zruty0 Zruty0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@Zruty0
Copy link
Contributor

Zruty0 commented Oct 29, 2018

        public bool KeepHidden = true;

I don't remember a specific reason. Probably, let's do the least destructive conversion, so preserve whatever ChooseColumns was doing by default


In reply to: 433130838 [](ancestors = 433130838,433115221)


Refers to: src/Microsoft.ML.Data/Transforms/SelectColumnsTransform.cs:182 in 99e8aa5. [](commit_id = 99e8aa5, deletion_comment = False)

 default behavior of ChooseColumns.
 - Addressed other feedback such as removing use of Argument classes.
@@ -179,14 +179,14 @@ public sealed class Arguments : TransformInputBase
public string[] DropColumns;

[Argument(ArgumentType.AtMostOnce, HelpText = "Specifies whether to keep or remove hidden columns.", ShortName = "hidden", SortOrder = 3)]
public bool KeepHidden = true;
public bool KeepHidden = false;
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Oct 29, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

false [](start = 37, length = 5)

nit: We usually have Defaults class with const properties, which ease a bit spreading of parameters across code. #Resolved

using Xunit;
using Xunit.Abstractions;

namespace Microsoft.ML.Tests
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Oct 29, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Microsoft.ML.Tests [](start = 10, length = 18)

nit: it's Microsoft.ML.Tests.Transformers in other tests. #Resolved

Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

 - Select with drop respects keepHidden, defaults to false.
 - When loading in a KeepColumns transform, defaults keepHidden to true
 for back compat.
 - Added additional tests
@singlis singlis merged commit c2b5e76 into dotnet:master Oct 30, 2018
@singlis singlis deleted the singlis/drop4 branch March 7, 2019 00:54
@ghost ghost locked as resolved and limited conversation to collaborators Mar 27, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Change unit tests to use SelectColumnsTransform instead of ChooseColumnsTransform
5 participants