
Samples for FeatureSelection transform estimators #3184

Merged 6 commits on Apr 5, 2019

Conversation

Member

@abgoswam abgoswam commented Apr 3, 2019

Towards #1209

This PR makes the following changes:

  • Adds a sample for the SelectFeaturesBasedOnCount transform estimator.
  • Adds a sample for the SelectFeaturesBasedOnMutualInformation transform estimator.
  • Deletes the old sample.

codecov bot commented Apr 3, 2019

Codecov Report

Merging #3184 into master will increase coverage by 0.06%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #3184      +/-   ##
==========================================
+ Coverage   72.54%    72.6%   +0.06%     
==========================================
  Files         807      807              
  Lines      144774   145077     +303     
  Branches    16208    16213       +5     
==========================================
+ Hits       105021   105332     +311     
+ Misses      35339    35326      -13     
- Partials     4414     4419       +5

| Flag | Coverage Δ |
|---|---|
| #Debug | 72.6% <ø> (+0.06%) ⬆️ |
| #production | 68.14% <ø> (+0.01%) ⬆️ |
| #test | 88.92% <ø> (+0.09%) ⬆️ |

| Impacted Files | Coverage Δ |
|---|---|
| ...Microsoft.ML.Transforms/FeatureSelectionCatalog.cs | 60% <ø> (ø) ⬆️ |
| ...c/Microsoft.ML.FastTree/Utils/ThreadTaskManager.cs | 79.48% <0%> (-20.52%) ⬇️ |
| src/Microsoft.ML.DataView/KeyDataViewType.cs | 74.57% <0%> (-3.76%) ⬇️ |
| src/Microsoft.ML.Maml/MAML.cs | 24.75% <0%> (-1.46%) ⬇️ |
| src/Microsoft.ML.Transforms/Text/LdaTransform.cs | 89.26% <0%> (-0.63%) ⬇️ |
| src/Microsoft.ML.Data/Transforms/ValueMapping.cs | 84.26% <0%> (-0.14%) ⬇️ |
| test/Microsoft.ML.Tests/ImagesTests.cs | 98.69% <0%> (-0.13%) ⬇️ |
| src/Microsoft.ML.Transforms/CategoricalCatalog.cs | 68.42% <0%> (ø) ⬆️ |
| ...Microsoft.ML.Tests/Transformers/NormalizerTests.cs | 100% <0%> (ø) ⬆️ |
| ...crosoft.ML.Tests/Transformers/ValueMappingTests.cs | 100% <0%> (ø) ⬆️ |

... and 7 more

@@ -24,7 +24,7 @@ public static class FeatureSelectionCatalog
/// <example>
/// <format type="text/markdown">
/// <![CDATA[
/// [!code-csharp[SelectFeaturesBasedOnMutualInformation](~/../docs/samples/docs/samples/Microsoft.ML.Samples/Dynamic/FeatureSelectionTransform.cs?range=1-4,10-121)]
/// [!code-csharp[SelectFeaturesBasedOnMutualInformation](~/../docs/samples/docs/samples/Microsoft.ML.Samples/Dynamic/SelectFeaturesBasedOnMutualInformation.cs)]
Member

@sfilipi sfilipi Apr 3, 2019

Dynamic/S [](start = 118, length = 9)

Dynamic/Transforms/ #Resolved

/// <example>
/// <format type="text/markdown">
/// <![CDATA[
/// [!code-csharp[SelectFeaturesBasedOnMutualInformation](~/../docs/samples/docs/samples/Microsoft.ML.Samples/Dynamic/SelectFeaturesBasedOnMutualInformation.cs)]
Member

@sfilipi sfilipi Apr 3, 2019

Dynamic [](start = 118, length = 7)

Dynamic/Transforms/ #Resolved

{
// Downloading a classification dataset from github.com/dotnet/machinelearning.
// It will be stored in the same path as the executable
string dataFilePath = SamplesUtils.DatasetUtils.DownloadBreastCancerDataset();
Member

@sfilipi sfilipi Apr 3, 2019

string dataFilePath = SamplesUtils.DatasetUtils.DownloadBreastCancerDataset(); [](start = 9, length = 81)

Can it be done with a small in-memory dataset? #Resolved

pipeline = mlContext.Transforms.FeatureSelection.SelectFeaturesBasedOnMutualInformation(
new InputOutputColumnPair[] { new InputOutputColumnPair("GroupB"), new InputOutputColumnPair("GroupC") },
labelColumnName: "Label",
slotsInOutput:4);
Member

@sfilipi sfilipi Apr 3, 2019

slotsInOutput:4 [](start = 16, length = 15)

Add a one-line comment about what this does. #Resolved

Member Author

comment in line 28 should clarify this


In reply to: 271808888 [](ancestors = 271808888)

// 3 7 1
// 3 1 1

// Second, we define the transformations that we apply on the data. Remember that an Estimator does not transform data
Member

@sfilipi sfilipi Apr 3, 2019

Remember [](start = 80, length = 8)

remove #Resolved

{
public float[] GroupB { get; set; }

public float[] GroupC { get; set; }
Member

@sfilipi sfilipi Apr 3, 2019

space #Resolved

string dataFilePath = SamplesUtils.DatasetUtils.DownloadBreastCancerDataset();

// Data Preview
// 1. Label 0=benign, 1=malignant
Contributor

@Ivanidzo4ka Ivanidzo4ka Apr 3, 2019

  					 [](start = 26, length = 7)

He uses tabs! Where is my pitchfork! #Resolved

Member Author

:) lol


In reply to: 271907898 [](ancestors = 271907898)

// In this example we define a CountFeatureSelectingEstimator, that selects slots in a feature vector that have more non-default
// values than the specified count. This transformation can be used to remove slots with too many missing values.
var pipeline = mlContext.Transforms.FeatureSelection.SelectFeaturesBasedOnCount(
outputColumnName: "FeaturesSelectedGroupB", inputColumnName: "GroupB", count: 695);
Contributor

@Ivanidzo4ka Ivanidzo4ka Apr 3, 2019

695 [](start = 94, length = 3)

Where is this number coming from?
#Resolved

Member Author

The comment in line 29 should clarify this now; the in-memory example should also make it more intuitive.


In reply to: 271911491 [](ancestors = 271911491)

// 5 7
// 1 2
// 1 3
// 3 2
Contributor

@Ivanidzo4ka Ivanidzo4ka Apr 3, 2019

With the current data it's not obvious at all. Can we switch to a small in-memory sample rather than an unknown dataset? #Resolved

Member Author

true. moved over to use a small in-memory dataset instead


In reply to: 271912555 [](ancestors = 271912555)

// We will use the SelectFeaturesBasedOnCount transform estimator, to retain only those slots which have
// at least 'count' non-default values per slot.

// Multi column example : This pipeline uses two columns for transformation
@shmoradims shmoradims Apr 4, 2019

This pipeline uses two columns for transformation [](start = 38, length = 49)

I think this is clearer: this pipeline transforms two columns using the same options.

just want to make sure it's clear that columns are transformed independently and are not mixed #Resolved
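
The point that each column is transformed independently can be illustrated without ML.NET. What follows is a minimal sketch, not the library's implementation: `SelectSlots`, the class name, and the sample rows are hypothetical, and it assumes that NaN, 0, and empty strings are the values that do not count toward the threshold. The same `count` is applied to two columns, and each column's slots are judged only against that column's own values.

```csharp
using System;
using System.Collections.Generic;

public static class CountSelectionSketch
{
    // Returns the indices of the slots that have at least `count` usable values
    // across all rows. `hasValue` decides what "usable" means for the element type
    // (an assumption of this sketch, not the ML.NET definition).
    public static int[] SelectSlots<T>(IReadOnlyList<T[]> rows, long count, Func<T, bool> hasValue)
    {
        var kept = new List<int>();
        int slotCount = rows.Count == 0 ? 0 : rows[0].Length;
        for (int slot = 0; slot < slotCount; slot++)
        {
            long usable = 0;
            foreach (T[] row in rows)
            {
                if (hasValue(row[slot]))
                    usable++;
            }
            if (usable >= count)
                kept.Add(slot);
        }
        return kept.ToArray();
    }

    public static void Main()
    {
        // Hypothetical in-memory data: one numeric vector column and one text vector column.
        var numeric = new List<float[]>
        {
            new[] { 4f, float.NaN, 6f },
            new[] { 4f, 5f, 6f },
            new[] { 4f, 5f, 6f },
            new[] { 4f, float.NaN, float.NaN },
        };
        var text = new List<string[]>
        {
            new[] { "A", "WA", "Male" },
            new[] { "A", "", "Female" },
            new[] { "A", "NY", "" },
            new[] { "A", "", "Male" },
        };

        // The same threshold is applied to each column separately; no values are mixed.
        int[] numericKept = SelectSlots<float>(numeric, 3, v => !float.IsNaN(v) && v != 0f);
        int[] textKept = SelectSlots<string>(text, 3, v => !string.IsNullOrEmpty(v));

        Console.WriteLine(string.Join(",", numericKept)); // 0,2
        Console.WriteLine(string.Join(",", textKept));    // 0,2
    }
}
```

In both columns the middle slot has only two usable values, so it falls below the threshold of 3 and is dropped, while the decision for one column never looks at the other.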

@shmoradims shmoradims left a comment

:shipit:

@abgoswam abgoswam requested a review from rogancarr April 4, 2019 21:27
// We define a MutualInformationFeatureSelectingEstimator that selects the top k slots in a feature
// vector based on highest mutual information between that slot and a specified label.

var pipeline = mlContext.Transforms.FeatureSelection.SelectFeaturesBasedOnMutualInformation(
Member Author

@abgoswam abgoswam Apr 4, 2019

SelectFeaturesBasedOnMutualInformation [](start = 65, length = 38)

this sample is only for API reference, so small in-memory dataset suffices for this example.

we should have a "tutorial" to show the computation of MI..something along the lines of

https://www.researchgate.net/post/How_can_i_calculate_Mutual_Information_theory_from_a_simple_dataset

Member

@sfilipi sfilipi Apr 4, 2019

+1 @natke was looking into how this transform works.


In reply to: 272375725 [](ancestors = 272375725)
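
As a starting point for the kind of tutorial mentioned above, here is a minimal sketch of how mutual information between one slot and a label can be computed for small discrete data, and how the top k slots can then be picked. It is not the ML.NET implementation: the class, `MutualInformation`, `TopSlots`, and the data are hypothetical, values are treated as already discrete, and any binning of numeric values that a real estimator may perform is skipped. The formula used is MI(X;Y) = Σ p(x,y) · log2(p(x,y) / (p(x)·p(y))).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class MutualInformationSketch
{
    private static void Increment<TKey>(Dictionary<TKey, int> counts, TKey key) where TKey : notnull
    {
        counts.TryGetValue(key, out int current); // current is 0 when the key is absent
        counts[key] = current + 1;
    }

    // Mutual information, in bits, between two equally long sequences of discrete values.
    public static double MutualInformation(int[] x, int[] y)
    {
        if (x.Length != y.Length)
            throw new ArgumentException("Both sequences must have the same length.");

        int n = x.Length;
        var joint = new Dictionary<(int, int), int>();
        var countX = new Dictionary<int, int>();
        var countY = new Dictionary<int, int>();
        for (int i = 0; i < n; i++)
        {
            Increment(joint, (x[i], y[i]));
            Increment(countX, x[i]);
            Increment(countY, y[i]);
        }

        double mi = 0;
        foreach (var pair in joint)
        {
            double pxy = (double)pair.Value / n;
            double px = (double)countX[pair.Key.Item1] / n;
            double py = (double)countY[pair.Key.Item2] / n;
            mi += pxy * Math.Log(pxy / (px * py), 2);
        }
        return mi;
    }

    // Indices of the `slotsInOutput` slots with the highest mutual information
    // with the label, returned in their original order.
    public static int[] TopSlots(int[][] rows, int[] labels, int slotsInOutput)
    {
        return Enumerable.Range(0, rows[0].Length)
            .Select(s => (Slot: s, Score: MutualInformation(rows.Select(r => r[s]).ToArray(), labels)))
            .OrderByDescending(t => t.Score)
            .Take(slotsInOutput)
            .Select(t => t.Slot)
            .OrderBy(s => s)
            .ToArray();
    }

    public static void Main()
    {
        int[] labels = { 1, 0, 1, 0 };
        int[][] rows =
        {
            new[] { 1, 5, 1 },
            new[] { 0, 5, 1 },
            new[] { 1, 5, 0 },
            new[] { 0, 5, 0 },
        };

        // Slot 0 copies the label (1 bit), slot 1 is constant (0 bits),
        // slot 2 is independent of the label (0 bits).
        Console.WriteLine(string.Join(",", TopSlots(rows, labels, 1))); // 0
    }
}
```

A slot that is constant, or whose values are spread evenly across both label values, carries no information about the label and is the first to be dropped when only the top k slots are kept.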

foreach (var item in convertedData)
Console.WriteLine("{0}\t\t\t{1}", string.Join("\t", item.NumericVector), string.Join("\t", item.StringVector));
// 4 NaN 6 A WA Male
// 4 5 6 A Female
Member

@sfilipi sfilipi Apr 4, 2019

 [](start = 63, length = 5)

Just align it here. No need to match the output exactly and have the tabbing look off. #Resolved

Member

or is it a separate column?

It helps to print the headers.


In reply to: 272391190 [](ancestors = 272391190)

Member Author

the alignment here seems good to me... for text this slot is empty or null ..

not sure if u meant something else


In reply to: 272391190 [](ancestors = 272391190)


// We will use the SelectFeaturesBasedOnCount to retain only those slots which have at least 'count' non-default values per slot.

// Usage on numeric column.
Member

@sfilipi sfilipi Apr 4, 2019

remove space #Resolved

// The pipeline can then be trained, using .Fit(), and the resulting transformer can be used to transform data.
var transformedData = pipeline.Fit(data).Transform(data);

Console.WriteLine("Contents of column 'NumericVector'");
Member

@sfilipi sfilipi Apr 4, 2019

Console.WriteLine("Contents of column 'NumericVector'"); [](start = 11, length = 57)

convert to just comment. #Resolved

Console.Write($"{row[i]}\t");
Console.WriteLine();
}
// 4 6
Member

@sfilipi sfilipi Apr 4, 2019

// 4 6 [](start = 12, length = 12)

headers #Resolved

for (var i = 0; i < row.Length; i++)
Console.Write($"{row[i]}\t");
Console.WriteLine();
}
Member

@sfilipi sfilipi Apr 4, 2019

would it polish it a bit if you made a little helper for this, since it is being used twice? #Resolved

Member Author

one of them is a float[] while the other is a string[] .. did not want to over-engineer this...

is there anything specific you had in mind, or can we keep it as is ?


In reply to: 272392068 [](ancestors = 272392068)

Member

object :)


In reply to: 272394600 [](ancestors = 272394600,272392068)
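
One way to cover both the `float[]` and the `string[]` case with a single helper is a generic method. The sketch below is only an illustration of that idea: `RowPrinter` and `FormatRow` are hypothetical names and are not the helper that ended up in the sample. It relies on `string.Join<T>(string, IEnumerable<T>)`, which calls `ToString()` on each element, and uses a comma as the separator.

```csharp
using System;
using System.Collections.Generic;

public static class RowPrinter
{
    // Formats one vector-valued cell as comma-separated text, whatever the element type.
    public static string FormatRow<T>(IEnumerable<T> row) => string.Join(",", row);

    public static void Main()
    {
        // The same helper handles numeric and text vectors.
        Console.WriteLine(FormatRow(new[] { 4f, 5f, 6f }));
        Console.WriteLine(FormatRow(new[] { "A", "WA", "Male" }));
    }
}
```

Typing the parameter as `IEnumerable<object>` would not accept a `float[]` directly, because array covariance does not apply to value types, which is why a generic parameter is used here.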


Console.WriteLine("Contents of two columns 'NumericVector' and 'StringVector'.");
foreach (var item in convertedData)
Console.WriteLine("{0}\t\t\t{1}", string.Join("\t", item.NumericVector), string.Join("\t", item.StringVector));
Member

@sfilipi sfilipi Apr 4, 2019

\t [](start = 63, length = 2)

I think for vectors we are using a comma ',' as the separator. #Resolved


// Usage on text column.
pipeline = mlContext.Transforms.FeatureSelection.SelectFeaturesBasedOnCount(
outputColumnName: "StringVector", count: 3);
Member

@sfilipi sfilipi Apr 4, 2019

I'd probably append to the previous pipeline, and show the prints once. Compacts the sample. #Resolved

Member Author

sounds good. will do.


In reply to: 272396211 [](ancestors = 272396211)


Console.WriteLine("Contents of two columns 'NumericVector' and 'StringVector'.");
foreach (var item in rawData)
Console.WriteLine("{0}\t\t\t{1}", string.Join("\t", item.NumericVector), string.Join("\t", item.StringVector));
Member

@sfilipi sfilipi Apr 4, 2019

\t [](start = 63, length = 2)

same comment, ',' #Resolved


Console.WriteLine("Contents of column 'NumericVector'");
PrintDataColumn(transformedData, "NumericVector");
// 4 0
Member

@sfilipi sfilipi Apr 4, 2019

4 [](start = 15, length = 1)

Curious, why is it dropping 6, but keeping 4? It is not obvious to me. Is it because slotsInOutput is 2? A comment about that might help. #Pending

Member Author

we need to improve this once we have found a better sample. i have noted this in the issue #1209


In reply to: 272397744 [](ancestors = 272397744)

Member

@sfilipi sfilipi left a comment

:shipit:


Console.WriteLine("Contents of columns 'Label', 'NumericVectorA' and 'NumericVectorB'.");
foreach (var item in rawData)
Console.WriteLine("{0}\t\t{1}\t\t{2}", item.Label, string.Join(" ", item.NumericVectorA), string.Join(" ", item.NumericVectorB));
Member

@sfilipi sfilipi Apr 4, 2019

[](start = 80, length = 1)

',' #Resolved

var rawData = GetData();
var data = mlContext.Data.LoadFromEnumerable(rawData);

var convertedData = mlContext.Data.CreateEnumerable<InputData>(data, true);
Contributor

@zeahmed zeahmed Apr 4, 2019

convertedData [](start = 16, length = 13)

Why do you convert that back to Enumerable? rawData is already there. #Resolved

foreach (var item in convertedData)
Console.WriteLine("{0}\t\t{1}", string.Join(" ", item.NumericVectorA), string.Join(" ", item.NumericVectorB));

// Here, we see SelectFeaturesBasedOnMutualInformation selected 4 slots.
Member

@sfilipi sfilipi Apr 4, 2019

4 slot [](start = 76, length = 6)

the 4 slots that had most in common with the respective value in the Label column, maybe? #Resolved

Member Author

yes, these 4 slots carried the most MI with Label.... we should have a better tutorial for this though


In reply to: 272398839 [](ancestors = 272398839)

Member

@sfilipi sfilipi left a comment

:shipit:

Contributor

@rogancarr rogancarr left a comment

:shipit:

@abgoswam abgoswam merged commit 8130567 into dotnet:master Apr 5, 2019
abgoswam added a commit to abgoswam/machinelearning that referenced this pull request Apr 5, 2019
* samples for FeatureSelection transform estimators

* fix review comments

* fix review comments

* review comments

* take care of review comments

* fix copy paste output error
@ghost ghost locked as resolved and limited conversation to collaborators Mar 23, 2022