Commit 9d29111

Tree-based featurization (#3812)
* Implement transformer
* Initial draft of porting tree-based featurization
* Internalize something
* Add Tweedie and Ranking cases
* Some small docs
* Customize output column names
* Fix save and load
* Optional output columns
* Fix a test and add some XML docs
* Add samples
* Add a sample
* API docs
* Fix one line
* Add MC test
* Extend a test further
* Address some comments
* Address some comments
* Address comments
* Comment
* Add cache points
* Update test/Microsoft.ML.Tests/TrainerEstimators/TreeEnsembleFeaturizerTest.cs (Co-Authored-By: Justin Ormont <[email protected]>)
* Address comment
* Add Justin's test
* Reduce sample size
* Update sample output
1 parent 9cd0b8e commit 9d29111

26 files changed: +3388 −83 lines changed
Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
### Input and Output Columns
The input label column data must be <xref:System.Boolean>.
The input features column data must be a known-sized vector of <xref:System.Single>.

This estimator outputs the following columns:

| Output Column Name | Column Type | Description |
| -- | -- | -- |
| `Trees` | Known-sized vector of <xref:System.Single> | The output values of all trees. Its size equals the total number of trees in the tree ensemble model. |
| `Leaves` | Known-sized vector of <xref:System.Single> | A 0-1 vector representation of the IDs of all leaves that the input feature vector falls into. Its size equals the total number of leaves in the tree ensemble model. |
| `Paths` | Known-sized vector of <xref:System.Single> | A 0-1 vector representation of the paths the input feature vector passes through to reach the leaves. Its size equals the number of non-leaf nodes in the tree ensemble model. |

All of these output columns are optional, and their names can be changed.
Set the name of any unwanted column to null so that it is not produced.
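The size relationships stated in the table can be made concrete with a short sketch (illustrative Python, not ML.NET code; the per-tree leaf counts are made up). Since each tree is binary, a tree with L leaves has L - 1 internal nodes:

```python
def output_sizes(leaves_per_tree):
    """Given the leaf counts of binary trees, return the lengths of the
    Trees, Leaves, and Paths output vectors."""
    trees_len = len(leaves_per_tree)                 # one output value per tree
    leaves_len = sum(leaves_per_tree)                # one 0-1 slot per leaf
    paths_len = sum(l - 1 for l in leaves_per_tree)  # one 0-1 slot per internal node
    return trees_len, leaves_len, paths_len

# A hypothetical ensemble of two trees with 6 and 5 leaves:
print(output_sizes([6, 5]))  # (2, 11, 9)
```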
Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
### Input and Output Columns
The input label data type must be [key](xref:Microsoft.ML.Data.KeyDataViewType)
type or <xref:System.Single>. The value of the label determines relevance, where
higher values indicate higher relevance. If the label is a
[key](xref:Microsoft.ML.Data.KeyDataViewType) type, then the key index is the
relevance value, where the smallest index is the least relevant. If the label is a
<xref:System.Single>, larger values indicate higher relevance. The feature
column must be a known-sized vector of <xref:System.Single>, and the input row group
column must be of [key](xref:Microsoft.ML.Data.KeyDataViewType) type.

This estimator outputs the following columns:

| Output Column Name | Column Type | Description |
| -- | -- | -- |
| `Trees` | Known-sized vector of <xref:System.Single> | The output values of all trees. Its size equals the total number of trees in the tree ensemble model. |
| `Leaves` | Known-sized vector of <xref:System.Single> | A 0-1 vector representation of the IDs of all leaves that the input feature vector falls into. Its size equals the total number of leaves in the tree ensemble model. |
| `Paths` | Known-sized vector of <xref:System.Single> | A 0-1 vector representation of the paths the input feature vector passes through to reach the leaves. Its size equals the number of non-leaf nodes in the tree ensemble model. |

All of these output columns are optional, and their names can be changed.
Set the name of any unwanted column to null so that it is not produced.
Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
### Input and Output Columns
The input label column data must be <xref:System.Single>.
The input features column data must be a known-sized vector of <xref:System.Single>.

This estimator outputs the following columns:

| Output Column Name | Column Type | Description |
| -- | -- | -- |
| `Trees` | Known-sized vector of <xref:System.Single> | The output values of all trees. Its size equals the total number of trees in the tree ensemble model. |
| `Leaves` | Known-sized vector of <xref:System.Single> | A 0-1 vector representation of the IDs of all leaves that the input feature vector falls into. Its size equals the total number of leaves in the tree ensemble model. |
| `Paths` | Known-sized vector of <xref:System.Single> | A 0-1 vector representation of the paths the input feature vector passes through to reach the leaves. Its size equals the number of non-leaf nodes in the tree ensemble model. |

All of these output columns are optional, and their names can be changed.
Set the name of any unwanted column to null so that it is not produced.
Lines changed: 25 additions & 0 deletions
@@ -0,0 +1,25 @@
### Prediction Details
This estimator produces several output columns from a tree ensemble model. Assume that the model contains only one decision tree:

                         Node 0
                         /    \
                        /      \
                       /        \
                      /          \
                  Node 1        Node 2
                  /   \         /   \
                 /     \       /     \
                /       \  Leaf -3  Node 3
           Leaf -1  Leaf -2        /   \
                                  /     \
                              Leaf -4  Leaf -5

Assume that the input feature vector falls into `Leaf -1`. The output `Trees` may be a 1-element vector whose only value is the decision value carried by `Leaf -1`. The output `Leaves` is a 0-1 vector: if the reached leaf is the $i$-th leaf in the tree (0-based, labeled `Leaf -(i+1)`, so the first leaf is `Leaf -1`), the $i$-th value in `Leaves` is 1 and all other values are 0. The output `Paths` is a 0-1 representation of the internal nodes passed through before reaching the leaf: the $i$-th element in `Paths` indicates whether `Node i` was touched. For example, reaching `Leaf -1` leads to $[1, 1, 0, 0]$ as `Paths`. If there are multiple trees, this estimator simply concatenates the `Trees`, `Leaves`, and `Paths` vectors of all trees (the first tree's information comes first in the concatenated vectors).

Check the See Also section for links to usage examples.
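The walk described above can be sketched in a few lines (an illustrative Python model of the example tree, not ML.NET code; the leaf values are made up):

```python
# The example tree as (left, right) children per internal node;
# negative entries are leaves (Leaf -k), non-negative entries are node indices.
children = {
    0: (1, 2),    # Node 0 -> Node 1, Node 2
    1: (-1, -2),  # Node 1 -> Leaf -1, Leaf -2
    2: (-3, 3),   # Node 2 -> Leaf -3, Node 3
    3: (-4, -5),  # Node 3 -> Leaf -4, Leaf -5
}
leaf_values = {-1: 0.5, -2: 0.1, -3: 0.7, -4: 0.2, -5: 0.9}  # made-up values

def featurize(go_left):
    """go_left(node) decides the branch; returns (Trees, Leaves, Paths)."""
    paths = [0] * len(children)      # one slot per internal node
    node = 0
    while node >= 0:
        paths[node] = 1              # mark every internal node we touch
        left, right = children[node]
        node = left if go_left(node) else right
    leaves = [0] * len(leaf_values)
    leaves[-node - 1] = 1            # Leaf -k occupies slot k-1
    return [leaf_values[node]], leaves, paths

# Always branching left reaches Leaf -1:
trees, leaves, paths = featurize(lambda n: True)
print(trees, leaves, paths)  # [0.5] [1, 0, 0, 0, 0] [1, 1, 0, 0]
```

Note that `Paths` comes out as $[1, 1, 0, 0]$, matching the example in the text; with multiple trees, the per-tree vectors would simply be concatenated.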
@@ -0,0 +1,110 @@
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;
<# if (TrainerOptions != null) { #>
<#=OptionsInclude#>
<# } #>

namespace Samples.Dynamic.Transforms.TreeFeaturization
{
    public static class <#=ClassName#>
    {<#=Comments#>
        public static void Example()
        {
            // Create a new context for ML.NET operations. It can be used for exception tracking and logging,
            // as a catalog of available operations, and as the source of randomness.
            // Setting the seed to a fixed number in this example makes outputs deterministic.
            var mlContext = new MLContext(seed: 0);

            // Create a list of data points to be transformed.
            var dataPoints = GenerateRandomDataPoints(100).ToList();

            // Convert the list of data points to an IDataView object, which is consumable by the ML.NET API.
            var dataView = mlContext.Data.LoadFromEnumerable(dataPoints);
<# if (CacheData) { #>

            // ML.NET doesn't cache data sets by default. Therefore, if one reads a data set from a file and accesses it many times,
            // it can be slow due to expensive featurization and disk operations. When the considered data can fit into memory,
            // a solution is to cache the data in memory. Caching is especially helpful when working with iterative algorithms
            // which need many data passes.
            dataView = mlContext.Data.Cache(dataView);
<# } #>

            // Define input and output columns of the tree-based featurizer.
            string labelColumnName = nameof(DataPoint.Label);
            string featureColumnName = nameof(DataPoint.Features);
            string treesColumnName = nameof(TransformedDataPoint.Trees);
            string leavesColumnName = nameof(TransformedDataPoint.Leaves);
            string pathsColumnName = nameof(TransformedDataPoint.Paths);

            // Define the configuration of the trainer used to train a tree-based model.
            var trainerOptions = new <#=TrainerOptions#>;

            // Define the tree-based featurizer's configuration.
            var options = new <#=Options#>;

            // Define the featurizer.
            var pipeline = mlContext.Transforms.<#=Trainer#>(options);

            // Train the model.
            var model = pipeline.Fit(dataView);

            // Apply the trained transformer to the considered data set.
            var transformed = model.Transform(dataView);

            // Convert the IDataView object to a list. Each element in the resulting list corresponds to a row in the IDataView.
            var transformedDataPoints = mlContext.Data.CreateEnumerable<TransformedDataPoint>(transformed, false).ToList();

            // Print out the transformation of the first 3 data points.
            for (int i = 0; i < 3; ++i)
            {
                var dataPoint = dataPoints[i];
                var transformedDataPoint = transformedDataPoints[i];
                Console.WriteLine($"The original feature vector [{String.Join(",", dataPoint.Features)}] is transformed to three different tree-based feature vectors:");
                Console.WriteLine($"  Trees' output values: [{String.Join(",", transformedDataPoint.Trees)}].");
                Console.WriteLine($"  Leaf IDs' 0-1 representation: [{String.Join(",", transformedDataPoint.Leaves)}].");
                Console.WriteLine($"  Path IDs' 0-1 representation: [{String.Join(",", transformedDataPoint.Paths)}].");
            }

<#=ExpectedOutput#>
        }

        private static IEnumerable<DataPoint> GenerateRandomDataPoints(int count, int seed = 0)
        {
            var random = new Random(seed);
            float randomFloat() => (float)random.NextDouble();
            for (int i = 0; i < count; i++)
            {
                var label = randomFloat() > <#=LabelThreshold#>;
                yield return new DataPoint
                {
                    Label = label,
                    // Create random features that are correlated with the label.
                    // For data points with false label, the feature values are slightly increased by adding a constant.
                    Features = Enumerable.Repeat(label, 3).Select(x => x ? randomFloat() : randomFloat() + <#=DataSepValue#>).ToArray()
                };
            }
        }

        // Example with a label and 3 feature values. A data set is a collection of such examples.
        private class DataPoint
        {
            public bool Label { get; set; }
            [VectorType(3)]
            public float[] Features { get; set; }
        }

        // Class used to capture the output of tree-based featurization.
        private class TransformedDataPoint : DataPoint
        {
            // The i-th value is the output value of the i-th decision tree.
            public float[] Trees { get; set; }
            // The 0-1 encoding of the leaves the input feature vector falls into.
            public float[] Leaves { get; set; }
            // The 0-1 encoding of the paths the input feature vector takes to reach the leaves.
            public float[] Paths { get; set; }
        }
    }
}
@@ -0,0 +1,139 @@
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;
using Microsoft.ML.Trainers.FastTree;

namespace Samples.Dynamic.Transforms.TreeFeaturization
{
    public static class FastForestBinaryFeaturizationWithOptions
    {
        // This example requires installation of the additional NuGet package
        // Microsoft.ML.FastTree (https://www.nuget.org/packages/Microsoft.ML.FastTree/).
        public static void Example()
        {
            // Create a new context for ML.NET operations. It can be used for exception tracking and logging,
            // as a catalog of available operations, and as the source of randomness.
            // Setting the seed to a fixed number in this example makes outputs deterministic.
            var mlContext = new MLContext(seed: 0);

            // Create a list of data points to be transformed.
            var dataPoints = GenerateRandomDataPoints(100).ToList();

            // Convert the list of data points to an IDataView object, which is consumable by the ML.NET API.
            var dataView = mlContext.Data.LoadFromEnumerable(dataPoints);

            // ML.NET doesn't cache data sets by default. Therefore, if one reads a data set from a file and accesses it many times,
            // it can be slow due to expensive featurization and disk operations. When the considered data can fit into memory,
            // a solution is to cache the data in memory. Caching is especially helpful when working with iterative algorithms
            // which need many data passes.
            dataView = mlContext.Data.Cache(dataView);

            // Define input and output columns of the tree-based featurizer.
            string labelColumnName = nameof(DataPoint.Label);
            string featureColumnName = nameof(DataPoint.Features);
            string treesColumnName = nameof(TransformedDataPoint.Trees);
            string leavesColumnName = nameof(TransformedDataPoint.Leaves);
            string pathsColumnName = nameof(TransformedDataPoint.Paths);

            // Define the configuration of the trainer used to train a tree-based model.
            var trainerOptions = new FastForestBinaryTrainer.Options
            {
                // Create a simpler model by penalizing usage of new features.
                FeatureFirstUsePenalty = 0.1,
                // Reduce the number of trees to 3.
                NumberOfTrees = 3,
                // Number of leaves per tree.
                NumberOfLeaves = 6,
                // Feature column name.
                FeatureColumnName = featureColumnName,
                // Label column name.
                LabelColumnName = labelColumnName
            };

            // Define the tree-based featurizer's configuration.
            var options = new FastForestBinaryFeaturizationEstimator.Options
            {
                InputColumnName = featureColumnName,
                TreesColumnName = treesColumnName,
                LeavesColumnName = leavesColumnName,
                PathsColumnName = pathsColumnName,
                TrainerOptions = trainerOptions
            };

            // Define the featurizer.
            var pipeline = mlContext.Transforms.FeaturizeByFastForestBinary(options);

            // Train the model.
            var model = pipeline.Fit(dataView);

            // Apply the trained transformer to the considered data set.
            var transformed = model.Transform(dataView);

            // Convert the IDataView object to a list. Each element in the resulting list corresponds to a row in the IDataView.
            var transformedDataPoints = mlContext.Data.CreateEnumerable<TransformedDataPoint>(transformed, false).ToList();

            // Print out the transformation of the first 3 data points.
            for (int i = 0; i < 3; ++i)
            {
                var dataPoint = dataPoints[i];
                var transformedDataPoint = transformedDataPoints[i];
                Console.WriteLine($"The original feature vector [{String.Join(",", dataPoint.Features)}] is transformed to three different tree-based feature vectors:");
                Console.WriteLine($"  Trees' output values: [{String.Join(",", transformedDataPoint.Trees)}].");
                Console.WriteLine($"  Leaf IDs' 0-1 representation: [{String.Join(",", transformedDataPoint.Leaves)}].");
                Console.WriteLine($"  Path IDs' 0-1 representation: [{String.Join(",", transformedDataPoint.Paths)}].");
            }

            // Expected output:
            //  The original feature vector [0.8173254,0.7680227,0.5581612] is transformed to three different tree-based feature vectors:
            //    Trees' output values: [0.1111111,0.8823529].
            //    Leaf IDs' 0-1 representation: [0,0,0,0,1,0,0,0,0,1,0].
            //    Path IDs' 0-1 representation: [1,1,1,1,1,1,0,1,0].
            //  The original feature vector [0.5888848,0.9360271,0.4721779] is transformed to three different tree-based feature vectors:
            //    Trees' output values: [0.4545455,0.8].
            //    Leaf IDs' 0-1 representation: [0,0,0,1,0,0,0,0,0,0,1].
            //    Path IDs' 0-1 representation: [1,1,1,1,0,1,0,1,1].
            //  The original feature vector [0.2737045,0.2919063,0.4673147] is transformed to three different tree-based feature vectors:
            //    Trees' output values: [0.4545455,0.1111111].
            //    Leaf IDs' 0-1 representation: [0,0,0,1,0,0,1,0,0,0,0].
            //    Path IDs' 0-1 representation: [1,1,1,1,0,1,0,1,1].
        }

        private static IEnumerable<DataPoint> GenerateRandomDataPoints(int count, int seed = 0)
        {
            var random = new Random(seed);
            float randomFloat() => (float)random.NextDouble();
            for (int i = 0; i < count; i++)
            {
                var label = randomFloat() > 0.5f;
                yield return new DataPoint
                {
                    Label = label,
                    // Create random features that are correlated with the label.
                    // For data points with false label, the feature values are slightly increased by adding a constant.
                    Features = Enumerable.Repeat(label, 3).Select(x => x ? randomFloat() : randomFloat() + 0.03f).ToArray()
                };
            }
        }

        // Example with a label and 3 feature values. A data set is a collection of such examples.
        private class DataPoint
        {
            public bool Label { get; set; }
            [VectorType(3)]
            public float[] Features { get; set; }
        }

        // Class used to capture the output of tree-based featurization.
        private class TransformedDataPoint : DataPoint
        {
            // The i-th value is the output value of the i-th decision tree.
            public float[] Trees { get; set; }
            // The 0-1 encoding of the leaves the input feature vector falls into.
            public float[] Leaves { get; set; }
            // The 0-1 encoding of the paths the input feature vector takes to reach the leaves.
            public float[] Paths { get; set; }
        }
    }
}
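A quick consistency check on the sample's expected output (an illustrative Python sketch, not ML.NET code): the printed `Trees` vector has two elements, so the trained ensemble appears to contain two trees, and since each tree is binary, the `Paths` length should equal the `Leaves` length minus the number of trees. The sample's vector lengths bear this out:

```python
# Vector lengths read off the expected output above.
num_trees = 2    # elements in Trees' output values
num_leaves = 11  # length of the Leaves 0-1 vector
num_paths = 9    # length of the Paths 0-1 vector

# For binary trees, internal nodes per tree = leaves - 1,
# so summed over the ensemble: paths = leaves - trees.
assert num_paths == num_leaves - num_trees
print("lengths are consistent")
```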
