Samples for FeatureSelection transform estimators #3184
Codecov Report
@@ Coverage Diff @@
## master #3184 +/- ##
==========================================
+ Coverage 72.54% 72.6% +0.06%
==========================================
Files 807 807
Lines 144774 145077 +303
Branches 16208 16213 +5
==========================================
+ Hits 105021 105332 +311
+ Misses 35339 35326 -13
- Partials 4414 4419 +5
@@ -24,7 +24,7 @@ public static class FeatureSelectionCatalog
/// <example>
/// <format type="text/markdown">
/// <![CDATA[
/// [!code-csharp[SelectFeaturesBasedOnMutualInformation](~/../docs/samples/docs/samples/Microsoft.ML.Samples/Dynamic/SelectFeaturesBasedOnMutualInformation.cs)]
Dynamic/S
Dynamic/Transforms/ #Resolved
/// <example>
/// <format type="text/markdown">
/// <![CDATA[
Dynamic
Dynamic/Transforms/ #Resolved
{
// Downloading a classification dataset from github.com/dotnet/machinelearning.
// It will be stored in the same path as the executable
string dataFilePath = SamplesUtils.DatasetUtils.DownloadBreastCancerDataset();
Can it be done with a small in-memory dataset? #Resolved
pipeline = mlContext.Transforms.FeatureSelection.SelectFeaturesBasedOnMutualInformation(
    new InputOutputColumnPair[] { new InputOutputColumnPair("GroupB"), new InputOutputColumnPair("GroupC") },
    labelColumnName: "Label",
    slotsInOutput:4);
slotsInOutput:4
Add a one-line comment about what this does. #Resolved
// 3 7 1
// 3 1 1

// Second, we define the transformations that we apply on the data. Remember that an Estimator does not transform data
Remember
Remove. #Resolved
{
public float[] GroupB { get; set; }

public float[] GroupC { get; set; }
space #Resolved
string dataFilePath = SamplesUtils.DatasetUtils.DownloadBreastCancerDataset();

// Data Preview
// 1. Label 0=benign, 1=malignant
He uses tabs! Where is my pitchfork! #Resolved
// In this example we define a CountFeatureSelectingEstimator, that selects slots in a feature vector that have more non-default
// values than the specified count. This transformation can be used to remove slots with too many missing values.
var pipeline = mlContext.Transforms.FeatureSelection.SelectFeaturesBasedOnCount(
    outputColumnName: "FeaturesSelectedGroupB", inputColumnName: "GroupB", count: 695);
695
Where is this number coming from? #Resolved
The comment on line 29 should clarify this now. The in-memory example should also make it more intuitive.
In reply to: 271911491
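To make the role of the count threshold discussed above concrete, here is a minimal sketch in plain C#. It assumes that NaN marks a missing value in a float slot. `CountSelectionSketch`, `SlotsToKeep`, and `Project` are hypothetical names for illustration only. This is not ML.NET code and may differ from how the library decides what counts as a non-default value.

```csharp
using System;
using System.Collections.Generic;

public static class CountSelectionSketch
{
    // Returns the indices of slots that hold at least `count` non-missing values.
    // Assumption: float.NaN marks a missing value.
    public static int[] SlotsToKeep(float[][] rows, int count)
    {
        var kept = new List<int>();
        int width = rows.Length == 0 ? 0 : rows[0].Length;
        for (int slot = 0; slot < width; slot++)
        {
            int present = 0;
            foreach (var row in rows)
            {
                if (!float.IsNaN(row[slot]))
                    present++;
            }
            if (present >= count)
                kept.Add(slot);
        }
        return kept.ToArray();
    }

    // Copies only the kept slots of one row, preserving their order.
    public static float[] Project(float[] row, int[] slots)
    {
        var result = new float[slots.Length];
        for (int i = 0; i < slots.Length; i++)
            result[i] = row[slots[i]];
        return result;
    }
}
```

With four illustrative rows `{4, NaN, 6}`, `{4, 5, 6}`, `{4, 5, 6}`, `{4, NaN, NaN}` and `count: 3`, slot 1 has only two non-missing values and is dropped, so `SlotsToKeep` returns slots 0 and 2. A threshold such as 695 only makes sense relative to the number of rows in the dataset, which is why a small in-memory dataset reads more clearly.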
// 5 7
// 1 2
// 1 3
// 3 2
With the current data it's not obvious at all. Can we switch to a small in-memory sample rather than an unknown dataset? #Resolved
True. Moved over to a small in-memory dataset instead.
In reply to: 271912555
// We will use the SelectFeaturesBasedOnCount transform estimator, to retain only those slots which have
// at least 'count' non-default values per slot.

// Multi column example : This pipeline uses two columns for transformation
This pipeline uses two columns for transformation
I think this is clearer: "this pipeline transforms two columns using the same options."
I just want to make sure it's clear that the columns are transformed independently and are not mixed. #Resolved
// We define a MutualInformationFeatureSelectingEstimator that selects the top k slots in a feature
// vector based on highest mutual information between that slot and a specified label.

var pipeline = mlContext.Transforms.FeatureSelection.SelectFeaturesBasedOnMutualInformation(
SelectFeaturesBasedOnMutualInformation
This sample is only for API reference, so a small in-memory dataset suffices for this example.
We should have a "tutorial" to show the computation of MI, something along the lines of
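As a rough illustration of the MI computation that the comment above asks a tutorial to cover, here is a minimal sketch in plain C#. It assumes the slot values are already discrete integers and the label is boolean. `MutualInformationSketch`, `MutualInformation`, and `TopSlots` are hypothetical names, and this is not ML.NET's implementation, which may bin numeric values and differ in other details.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class MutualInformationSketch
{
    // Mutual information (in nats) between one discrete slot and a boolean label:
    // I(X;Y) = sum over (x, y) of p(x, y) * ln(p(x, y) / (p(x) * p(y))).
    public static double MutualInformation(int[] slotValues, bool[] labels)
    {
        int n = slotValues.Length;
        var joint = new Dictionary<(int, bool), int>();
        var slotCounts = new Dictionary<int, int>();
        var labelCounts = new Dictionary<bool, int>();
        for (int i = 0; i < n; i++)
        {
            var key = (slotValues[i], labels[i]);
            joint[key] = joint.TryGetValue(key, out int j) ? j + 1 : 1;
            slotCounts[slotValues[i]] = slotCounts.TryGetValue(slotValues[i], out int s) ? s + 1 : 1;
            labelCounts[labels[i]] = labelCounts.TryGetValue(labels[i], out int l) ? l + 1 : 1;
        }

        double mi = 0;
        foreach (var pair in joint)
        {
            double pxy = (double)pair.Value / n;
            double px = (double)slotCounts[pair.Key.Item1] / n;
            double py = (double)labelCounts[pair.Key.Item2] / n;
            mi += pxy * Math.Log(pxy / (px * py));
        }
        return mi;
    }

    // Indices (in ascending order) of the k slots with the highest mutual information with the label.
    public static int[] TopSlots(int[][] rows, bool[] labels, int k)
    {
        int width = rows[0].Length;
        return Enumerable.Range(0, width)
            .OrderByDescending(slot => MutualInformation(rows.Select(r => r[slot]).ToArray(), labels))
            .Take(k)
            .OrderBy(slot => slot)
            .ToArray();
    }
}
```

A slot that matches a balanced boolean label exactly scores ln 2, while a constant slot scores 0, so the first is kept and the second is dropped when k is small. The slots of each column are ranked against the label independently of the other columns.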
foreach (var item in convertedData)
    Console.WriteLine("{0}\t\t\t{1}", string.Join("\t", item.NumericVector), string.Join("\t", item.StringVector));
// 4 NaN 6 A WA Male
// 4 5 6 A Female
Align just here. No need to make it match the output exactly and have the tabbing look off. #Resolved
Or is it a separate column?
It helps to print the headers.
In reply to: 272391190
The alignment here seems good to me. For text, this slot is empty or null.
Not sure if you meant something else.
In reply to: 272391190
// We will use the SelectFeaturesBasedOnCount to retain only those slots which have at least 'count' non-default values per slot.

// Usage on numeric column.
Remove the space. #Resolved
// The pipeline can then be trained, using .Fit(), and the resulting transformer can be used to transform data.
var transformedData = pipeline.Fit(data).Transform(data);

Console.WriteLine("Contents of column 'NumericVector'");
Convert this to just a comment. #Resolved
    Console.Write($"{row[i]}\t");
Console.WriteLine();
}
// 4 6
Add headers. #Resolved
for (var i = 0; i < row.Length; i++)
    Console.Write($"{row[i]}\t");
Console.WriteLine();
}
Would it polish this a bit if you made a little helper for it, since it is used twice? #Resolved
One of them is a float[] while the other is a string[]. I did not want to over-engineer this.
Is there anything specific you had in mind, or can we keep it as is?
In reply to: 272392068
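On the helper question above: both arrays only need element-wise string conversion, so one generic method could cover float[] and string[] alike. A minimal sketch in plain C# follows. `PrintSketch` and `FormatRows` are hypothetical names, not part of this PR or of ML.NET, and the comma separator is an assumption taken from the separator suggestion elsewhere in this review.

```csharp
using System;
using System.Collections.Generic;

public static class PrintSketch
{
    // Formats each row as comma-separated values, one row per line.
    // Works for any element type because string.Join<T> calls ToString() on each
    // element and writes nothing for null elements.
    public static string FormatRows<T>(IEnumerable<T[]> rows)
    {
        var lines = new List<string>();
        foreach (var row in rows)
            lines.Add(string.Join<T>(",", row));
        return string.Join(Environment.NewLine, lines);
    }
}
```

A caller would pass the numeric rows and the text rows to the same method and write the returned string with `Console.WriteLine`, which avoids duplicating the nested loop.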
Console.WriteLine("Contents of two columns 'NumericVector' and 'StringVector'.");
foreach (var item in convertedData)
    Console.WriteLine("{0}\t\t\t{1}", string.Join("\t", item.NumericVector), string.Join("\t", item.StringVector));
\t
I think for vectors we are using a comma ',' as the separator. #Resolved
// Usage on text column.
pipeline = mlContext.Transforms.FeatureSelection.SelectFeaturesBasedOnCount(
    outputColumnName: "StringVector", count: 3);
I'd probably append to the previous pipeline, and show the prints once. Compacts the sample. #Resolved
Console.WriteLine("Contents of two columns 'NumericVector' and 'StringVector'.");
foreach (var item in rawData)
    Console.WriteLine("{0}\t\t\t{1}", string.Join("\t", item.NumericVector), string.Join("\t", item.StringVector));
\t
Same comment: use ','. #Resolved
Console.WriteLine("Contents of column 'NumericVector'");
PrintDataColumn(transformedData, "NumericVector");
// 4 0
4
Curious, why is it dropping 6, but keeping 4? It is not obvious to me. Is it because slotsInOutput is 2? A comment about that might help. #Pending
Console.WriteLine("Contents of columns 'Label', 'NumericVectorA' and 'NumericVectorB'.");
foreach (var item in rawData)
    Console.WriteLine("{0}\t\t{1}\t\t{2}", item.Label, string.Join(" ", item.NumericVectorA), string.Join(" ", item.NumericVectorB));
Use ',' as the separator. #Resolved
var rawData = GetData();
var data = mlContext.Data.LoadFromEnumerable(rawData);

var convertedData = mlContext.Data.CreateEnumerable<InputData>(data, true);
convertedData
Why do you convert that back to an Enumerable? rawData is already there. #Resolved
foreach (var item in convertedData)
    Console.WriteLine("{0}\t\t{1}", string.Join(" ", item.NumericVectorA), string.Join(" ", item.NumericVectorB));

// Here, we see SelectFeaturesBasedOnMutualInformation selected 4 slots.
4 slot
The 4 slots that had the most in common with the respective value in the Label column, maybe? #Resolved
Yes, these 4 slots carried the most MI with Label. We should have a better tutorial for this, though.
In reply to: 272398839
* samples for FeatureSelection transform estimators
* fix review comments
* fix review comments
* review comments
* take care of review comments
* fix copy paste output error
Towards #1209
The PR makes the following changes:
* SelectFeaturesBasedOnCount transform estimator.
* SelectFeaturesBasedOnMutualInformation transform estimator.