Towards #3204 -FeatureSelection #3424
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -19,7 +19,54 @@ | |
|
||
namespace Microsoft.ML.Transforms | ||
{ | ||
/// <include file='doc.xml' path='doc/members/member[@name="MutualInformationFeatureSelection"]/*' /> | ||
/// <summary> | ||
/// Selects the top k slots across all specified columns ordered by their mutual information with the label column | ||
/// (what you can learn about the label by observing the value of the specified column). | ||
/// </summary> | ||
/// <remarks> | ||
/// <format type="text/markdown"><![CDATA[ | ||
/// | […] data types.| | ||
/// | Output column data type | Same as the input column.| | ||
/// | ||
/// Formally, the mutual information can be written as: | ||
/// | ||
/// MI(X,Y) = E[log(P(x,y)) - log(P(x)) - log(P(y))] | ||
/// | ||
/// where the expectation E is taken over the joint distribution of X and Y. | ||
/// Here P(x, y) is the joint probability density function of X and Y, and P(x) and P(y) are the marginal probability density functions of X and Y respectively. | ||
/// In general, a higher mutual information between the dependent variable (or label) and an independent variable (or feature) means | ||
/// that the label has a higher mutual dependence on that feature. | ||
/// The estimator keeps the slots in the output features that have the largest mutual information with the label. | ||
/// | ||
/// For example, for the following Features and Label columns, if we specify that we want the top 2 slots (vector elements) that have the highest mutual information | ||
/// with the label column, the output of applying this Estimator would keep only the first and the third slots, because their values | ||
/// are the most informative about the values in the Label column. | ||
/// | ||
/// | Label | Features | | ||
/// | -- | -- | | ||
/// |True |4,6,0 | | ||
/// |False|0,7,5 | | ||
/// |True |4,7,0 | | ||
/// |False|0,7,5 | | ||
/// | ||
/// This is how the dataset above would look after fitting the estimator and transforming the data with the resulting transformer: | ||
/// | ||
/// | Label | Features | | ||
/// | -- | -- | | ||
/// |True |4,0 | | ||
/// |False|0,5 | | ||
/// |True |4,0 | | ||
/// |False|0,5 | | ||
do we need these examples? we already have similar samples. #Resolved
it is really hard to understand without them, i think.
I like the samples. It helped me understand when I read it. #Resolved |
||
/// | ||
/// ]]></format> | ||
/// </remarks> | ||
/// <seealso cref="FeatureSelectionCatalog.SelectFeaturesBasedOnMutualInformation(TransformsCatalog.FeatureSelectionTransforms, InputOutputColumnPair[], string, int, int)"/> | ||
/// <seealso cref="FeatureSelectionCatalog.SelectFeaturesBasedOnMutualInformation(TransformsCatalog.FeatureSelectionTransforms, string, string, string, int, int)"/> | ||
public sealed class MutualInformationFeatureSelectingEstimator : IEstimator<ITransformer> | ||
{ | ||
internal const string Summary = | ||
|
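As a worked illustration of the formula in the remarks, the mutual information can be computed by hand for the first two slots of the example table. This is only a sketch under two assumptions that the source does not state: each distinct slot value is treated as its own category (the estimator itself may bin numeric values differently), and logarithms are taken base 2, so the results are in bits. Writing the expectation as a sum over the observed (value, label) pairs gives

$$
\text{MI}(X, Y) = \sum_{x, y} P(x, y) \, \log_2 \frac{P(x, y)}{P(x) \, P(y)}
$$

The first slot is 4 on every True row and 0 on every False row, so $P(4, \text{True}) = P(0, \text{False}) = \tfrac12$ and every marginal is $\tfrac12$:

$$
\text{MI}(X_1, Y) = \tfrac12 \log_2 \frac{1/2}{(1/2)(1/2)} + \tfrac12 \log_2 \frac{1/2}{(1/2)(1/2)} = 1 \text{ bit}
$$

The second slot takes the values 6, 7, 7, 7 against the labels True, False, True, False, so $P(6, \text{True}) = \tfrac14$, $P(7, \text{True}) = \tfrac14$, $P(7, \text{False}) = \tfrac12$, with marginals $P(6) = \tfrac14$ and $P(7) = \tfrac34$:

$$
\text{MI}(X_2, Y) = \tfrac14 \log_2 \frac{1/4}{(1/4)(1/2)} + \tfrac14 \log_2 \frac{1/4}{(3/4)(1/2)} + \tfrac12 \log_2 \frac{1/2}{(3/4)(1/2)} \approx 0.250 - 0.146 + 0.208 \approx 0.31 \text{ bits}
$$

Under these assumptions the first slot determines the label exactly, while the second slot carries far less information about it, which is consistent with the second slot being the one that is dropped.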
create new issue for later: we need to specify for each input type what's the default value. for e.g. it's not clear if for text, default is null, empty string, or whitespaces. same goes with key type. #Resolved
Created: #3443