
Commit 435a63b

wschin authored and TomFinley committed
Remove auto-cache mechanism (#1780)
* Remove auto-cache mechanism
* Add caching usage into a sample and tests
1 parent d7d4e99 commit 435a63b

37 files changed: +885 -688 lines

docs/code/MlNetCookBook.md

Lines changed: 66 additions & 1 deletion
@@ -443,10 +443,24 @@ var reader = mlContext.Data.TextReader(ctx => (
 // Now read the file (remember though, readers are lazy, so the actual reading will happen when the data is accessed).
 var trainData = reader.Read(trainDataPath);

+// Sometimes, caching data in memory after its first access can save loading time when the data is going to be used
+// several times. The caching mechanism is also lazy; it only caches things after they are used.
+// Users can replace all subsequent uses of "trainData" with "cachedTrainData". We still use "trainData" here because
+// a caching step, which provides the same caching function, will be inserted into the "learningPipeline" below.
+var cachedTrainData = trainData.Cache();
+
 // Step two: define the learning pipeline.

 // We 'start' the pipeline with the output of the reader.
 var learningPipeline = reader.MakeNewEstimator()
+    // We add a step for caching data in memory so that the downstream iterative training
+    // algorithm can efficiently scan through the data multiple times. Otherwise, the following
+    // trainer would read the data from disk multiple times. The caching mechanism uses an on-demand strategy:
+    // data accessed in any downstream step is cached from its first use onward. In general, you only
+    // need to add a caching step before the trainable step, because caching is not helpful if the data is
+    // only scanned once. This step can be removed if there isn't enough memory to store the whole
+    // data set.
+    .AppendCacheCheckpoint()
     // Now we can add any 'training steps' to it. In our case we want to 'normalize' the data (rescale to be
     // between -1 and 1 for all examples)
     .Append(r => (
@@ -486,13 +500,28 @@ var reader = mlContext.Data.TextReader(new TextLoader.Arguments
 // Now read the file (remember though, readers are lazy, so the actual reading will happen when the data is accessed).
 var trainData = reader.Read(trainDataPath);

+// Sometimes, caching data in memory after its first access can save loading time when the data is going to be used
+// several times. The caching mechanism is also lazy; it only caches things after they are used.
+// Users can replace all subsequent uses of "trainData" with "cachedTrainData". We still use "trainData" here because
+// a caching step, which provides the same caching function, will be inserted into the "dynamicPipeline" below.
+var cachedTrainData = mlContext.Data.Cache(trainData);
+
 // Step two: define the learning pipeline.

 // We 'start' the pipeline with the output of the reader.
 var dynamicPipeline =
     // First 'normalize' the data (rescale to be
     // between -1 and 1 for all examples)
     mlContext.Transforms.Normalize("FeatureVector")
+    // We add a step for caching data in memory so that the downstream iterative training
+    // algorithm can efficiently scan through the data multiple times. Otherwise, the following
+    // trainer would read the data from disk multiple times. The caching mechanism uses an on-demand strategy:
+    // data accessed in any downstream step is cached from its first use onward. In general, you only
+    // need to add a caching step before the trainable step, because caching is not helpful if the data is
+    // only scanned once. This step can be removed if there isn't enough memory to store the whole
+    // data set. Notice that the upstream Transforms.Normalize step scans through the data
+    // only once, so adding a caching step before it is not helpful.
+    .AppendCacheCheckpoint(mlContext)
     // Add the SDCA regression trainer.
     .Append(mlContext.Regression.Trainers.StochasticDualCoordinateAscent(label: "Target", features: "FeatureVector"));

@@ -595,6 +624,13 @@ var learningPipeline = reader.MakeNewEstimator()
         r.Label,
         // Concatenate all the features together into one column 'Features'.
         Features: r.SepalLength.ConcatWith(r.SepalWidth, r.PetalLength, r.PetalWidth)))
+    // We add a step for caching data in memory so that the downstream iterative training
+    // algorithm can efficiently scan through the data multiple times. Otherwise, the following
+    // trainer would read the data from disk multiple times. The caching mechanism uses an on-demand strategy:
+    // data accessed in any downstream step is cached from its first use onward. In general, you only
+    // need to add a caching step before the trainable step, because caching is not helpful if the data is
+    // only scanned once.
+    .AppendCacheCheckpoint()
     .Append(r => (
         r.Label,
         // Train the multi-class SDCA model to predict the label using features.
@@ -640,6 +676,8 @@ var dynamicPipeline =
     mlContext.Transforms.Concatenate("Features", "SepalLength", "SepalWidth", "PetalLength", "PetalWidth")
     // Note that the label is text, so it needs to be converted to key.
     .Append(mlContext.Transforms.Categorical.MapValueToKey("Label"), TransformerScope.TrainTest)
+    // Cache data in memory for the steps after this cache checkpoint.
+    .AppendCacheCheckpoint(mlContext)
     // Use the multi-class SDCA model to predict the label using features.
     .Append(mlContext.MulticlassClassification.Trainers.StochasticDualCoordinateAscent())
     // Apply the inverse conversion from 'PredictedLabel' column back to string value.
@@ -741,6 +779,7 @@ var trainData = mlContext.CreateStreamingDataView(churnData);

 var dynamicLearningPipeline = mlContext.Transforms.Categorical.OneHotEncoding("DemographicCategory")
     .Append(mlContext.Transforms.Concatenate("Features", "DemographicCategory", "LastVisits"))
+    .AppendCacheCheckpoint(mlContext) // FastTree will benefit from caching data in memory.
     .Append(mlContext.BinaryClassification.Trainers.FastTree("HasChurned", "Features", numTrees: 20));

 var dynamicModel = dynamicLearningPipeline.Fit(trainData);
@@ -757,6 +796,7 @@ var staticLearningPipeline = staticData.MakeNewEstimator()
     .Append(r => (
         r.HasChurned,
         Features: r.DemographicCategory.OneHotEncoding().ConcatWith(r.LastVisits)))
+    .AppendCacheCheckpoint() // FastTree will benefit from caching data in memory.
     .Append(r => mlContext.BinaryClassification.Trainers.FastTree(r.HasChurned, r.Features, numTrees: 20));

 var staticModel = staticLearningPipeline.Fit(staticData);
@@ -813,6 +853,8 @@ var learningPipeline = reader.MakeNewEstimator()
         // When the normalizer is trained, the below delegate is going to be called.
         // We use it to memorize the scales.
         onFit: (scales, offsets) => normScales = scales)))
+    // Cache the data in memory because the subsequent trainer needs to access it multiple times.
+    .AppendCacheCheckpoint()
     .Append(r => (
         r.Label,
         // Train the multi-class SDCA model to predict the label using features.
@@ -987,6 +1029,10 @@ var catColumns = data.GetColumn(r => r.CategoricalFeatures).Take(10).ToArray();

 // Build several alternative featurization pipelines.
 var learningPipeline = reader.MakeNewEstimator()
+    // Cache data in memory in an on-demand manner. Columns used in any downstream step will be
+    // cached in memory at their first use. This step can be removed if the machine doesn't
+    // have enough memory.
+    .AppendCacheCheckpoint()
     .Append(r => (
         r.Label,
         r.NumericalFeatures,
@@ -1070,6 +1116,9 @@ var workclasses = transformedData.GetColumn<float[]>(mlContext, "WorkclassOneHot
 var fullLearningPipeline = dynamicPipeline
     // Concatenate two of the 3 categorical pipelines, and the numeric features.
     .Append(mlContext.Transforms.Concatenate("Features", "NumericalFeatures", "CategoricalBag", "WorkclassOneHotTrimmed"))
+    // Cache data in memory so that the following trainer will be able to access training examples without
+    // reading them from disk multiple times.
+    .AppendCacheCheckpoint(mlContext)
     // Now we're ready to train. We chose our FastTree trainer for this classification task.
     .Append(mlContext.BinaryClassification.Trainers.FastTree(numTrees: 50));
@@ -1121,6 +1170,10 @@ var messageTexts = data.GetColumn(x => x.Message).Take(20).ToArray();

 // Apply various kinds of text operations supported by ML.NET.
 var learningPipeline = reader.MakeNewEstimator()
+    // Cache data in memory in an on-demand manner. Columns used in any downstream step will be
+    // cached in memory at their first use. This step can be removed if the machine doesn't
+    // have enough memory.
+    .AppendCacheCheckpoint()
     .Append(r => (
         // One-stop shop to run the full text featurization.
         TextFeatures: r.Message.FeaturizeText(),
@@ -1243,6 +1296,9 @@ var learningPipeline = reader.MakeNewEstimator()
         Label: r.Label.ToKey(),
         // Concatenate all the features together into one column 'Features'.
         Features: r.SepalLength.ConcatWith(r.SepalWidth, r.PetalLength, r.PetalWidth)))
+    // Add a step for caching data in memory so that the downstream iterative training
+    // algorithm can efficiently scan through the data multiple times.
+    .AppendCacheCheckpoint()
     .Append(r => (
         r.Label,
         // Train the multi-class SDCA model to predict the label using features.
@@ -1298,6 +1354,10 @@ var dynamicPipeline =
     mlContext.Transforms.Concatenate("Features", "SepalLength", "SepalWidth", "PetalLength", "PetalWidth")
     // Note that the label is text, so it needs to be converted to key.
     .Append(mlContext.Transforms.Conversions.MapValueToKey("Label"), TransformerScope.TrainTest)
+    // Cache data in memory so that the SDCA trainer will be able to randomly access training examples without
+    // reading data from disk multiple times. Data will be cached at its first use in any downstream step.
+    // Notice that unused parts of the data may not be cached.
+    .AppendCacheCheckpoint(mlContext)
     // Use the multi-class SDCA model to predict the label using features.
     .Append(mlContext.MulticlassClassification.Trainers.StochasticDualCoordinateAscent());
@@ -1439,6 +1499,7 @@ public static ITransformer TrainModel(MLContext mlContext, IDataView trainData)
 Action<InputRow, OutputRow> mapping = (input, output) => output.Label = input.Income > 50000;
 // Construct the learning pipeline.
 var estimator = mlContext.Transforms.CustomMapping(mapping, null)
+    .AppendCacheCheckpoint(mlContext)
     .Append(mlContext.BinaryClassification.Trainers.FastTree(label: "Label"));

 return estimator.Fit(trainData);
@@ -1480,8 +1541,12 @@ public class CustomMappings
 var estimator = mlContext.Transforms.CustomMapping<InputRow, OutputRow>(CustomMappings.IncomeMapping, nameof(CustomMappings.IncomeMapping))
     .Append(mlContext.BinaryClassification.Trainers.FastTree(label: "Label"));

+// If there is enough memory, we can cache the data in memory to avoid reading it from file
+// every time it is accessed.
+var cachedTrainData = mlContext.Data.Cache(trainData);
+
 // Train the model.
-var model = estimator.Fit(trainData);
+var model = estimator.Fit(cachedTrainData);

 // Save the model.
 using (var fs = File.Create(modelPath))
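Taken together, the cookbook changes boil down to two explicit caching entry points. A minimal sketch of both, with `mlContext`, `reader`, and the column names assumed from the surrounding examples:

// Entry point 1: cache an already-read IDataView. Caching is lazy; rows are
// stored in memory the first time they are accessed.
var trainData = reader.Read(trainDataPath);
var cachedTrainData = mlContext.Data.Cache(trainData);

// Entry point 2: insert a cache checkpoint into an estimator chain, just
// before the iterative trainer that makes many passes over the data.
var dynamicPipeline =
    mlContext.Transforms.Normalize("FeatureVector")
    .AppendCacheCheckpoint(mlContext)
    .Append(mlContext.Regression.Trainers.StochasticDualCoordinateAscent(
        label: "Target", features: "FeatureVector"));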

docs/samples/Microsoft.ML.Samples/Dynamic/SDCA.cs

Lines changed: 8 additions & 1 deletion
@@ -38,11 +38,18 @@ public static void SDCA_BinaryClassification()
 // Read the data
 var data = reader.Read(dataFile);

+// ML.NET doesn't cache the data set by default. Therefore, if one reads a data set from a file and accesses it many times, it can be slow due to
+// expensive featurization and disk operations. When the data can fit into memory, a solution is to cache it in memory. Caching is especially
+// helpful when working with iterative algorithms that need many data passes. Since SDCA is such an algorithm, we cache. Inserting a
+// cache step into a pipeline is also possible; please see the construction of the pipeline below.
+data = mlContext.Data.Cache(data);
+
 // Step 2: Pipeline
 // Featurize the text column through the FeaturizeText API.
 // Then append a binary classifier, setting the "Label" column as the label of the dataset, and
-// the "Features" column produced by FeaturizeText as the features column. 
+// the "Features" column produced by FeaturizeText as the features column.
 var pipeline = mlContext.Transforms.Text.FeaturizeText("SentimentText", "Features")
+    .AppendCacheCheckpoint(mlContext) // Add a data-cache step within a pipeline.
     .Append(mlContext.BinaryClassification.Trainers.StochasticDualCoordinateAscent(labelColumn: "Sentiment", featureColumn: "Features", l2Const: 0.001f));

 // Step 3: Run Cross-Validation on this pipeline.
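To see what the cache buys in this sample, one can time a fit with and without it. A rough sketch reusing the sample's own names (`mlContext`, `reader`, `dataFile`, `data`, `pipeline`); the uncached pipeline is a hypothetical baseline for comparison, not part of the commit:

var watch = System.Diagnostics.Stopwatch.StartNew();

// Hypothetical baseline: the same chain without any cache step, fit on a
// freshly read view so the mlContext.Data.Cache call above doesn't help it.
var uncachedPipeline = mlContext.Transforms.Text.FeaturizeText("SentimentText", "Features")
    .Append(mlContext.BinaryClassification.Trainers.StochasticDualCoordinateAscent(
        labelColumn: "Sentiment", featureColumn: "Features", l2Const: 0.001f));
uncachedPipeline.Fit(reader.Read(dataFile)); // every SDCA pass re-reads and re-featurizes
System.Console.WriteLine($"Uncached fit: {watch.Elapsed}");

watch.Restart();
pipeline.Fit(data); // passes after the first one read featurized rows from memory
System.Console.WriteLine($"Cached fit: {watch.Elapsed}");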

src/Microsoft.ML.Data/StaticPipe/DataView.cs

Lines changed: 14 additions & 0 deletions
@@ -8,6 +8,7 @@
 using Microsoft.ML.StaticPipe.Runtime;
 using System.Collections.Generic;
 using System;
+using System.Linq;

 namespace Microsoft.ML.StaticPipe
 {
@@ -23,6 +24,19 @@ internal DataView(IHostEnvironment env, IDataView view, StaticSchemaShape shape)
         AsDynamic = view;
         Shape.Check(Env, AsDynamic.Schema);
     }
+
+    /// <summary>
+    /// This function returns a <see cref="DataView{TShape}"/> whose columns are all cached in memory.
+    /// The returned <see cref="DataView{TShape}"/> is almost the same as the source <see cref="DataView{TShape}"/>;
+    /// the only differences are cache-related properties.
+    /// </summary>
+    public DataView<TShape> Cache()
+    {
+        // Generate all column indexes in the source data.
+        var prefetched = Enumerable.Range(0, AsDynamic.Schema.ColumnCount).ToArray();
+        // Create a cached version of the source data by caching all columns.
+        return new DataView<TShape>(Env, new CacheDataView(Env, AsDynamic, prefetched), Shape);
+    }
 }

 public static class DataViewExtensions
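A short usage sketch for the new statically-typed Cache(); the reader and path are assumed from the cookbook examples above, not from this file. Prefetching every column index up front makes the entire schema cacheable, while CacheDataView still fills the cache lazily, row by row, on first access.

// Read a statically-typed view; like all readers, this is lazy.
var trainData = reader.Read(trainDataPath);

// Same shape, now backed by a CacheDataView over every column. The cache
// fills on first access; later passes over the data are served from memory.
var cachedTrainData = trainData.Cache();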

src/Microsoft.ML.Data/StaticPipe/Estimator.cs

Lines changed: 9 additions & 0 deletions
@@ -77,5 +77,14 @@ string NameMap(PipelineColumn col)
         return new Estimator<TInShape, TNewOutShape, ITransformer>(Env, est, _inShape, newOut);
         }
     }
+
+    /// <summary>
+    /// Cache the data produced in memory by this estimator. It appends an extra caching estimator
+    /// to this estimator, and the newly created estimator is returned.
+    /// </summary>
+    public Estimator<TInShape, TOutShape, ITransformer> AppendCacheCheckpoint()
+    {
+        return new Estimator<TInShape, TOutShape, ITransformer>(Env, AsDynamic.AppendCacheCheckpoint(Env), _inShape, Shape);
+    }
 }
}
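Usage mirrors the dynamic AppendCacheCheckpoint(mlContext) seen in the cookbook; because the returned estimator keeps the same in/out shapes, it chains like any other step. A sketch reusing the cookbook's churn pipeline shape (names assumed from that example):

var staticLearningPipeline = staticData.MakeNewEstimator()
    .Append(r => (
        r.HasChurned,
        Features: r.DemographicCategory.OneHotEncoding().ConcatWith(r.LastVisits)))
    // Cache the featurized rows before the iterative FastTree trainer.
    .AppendCacheCheckpoint()
    .Append(r => mlContext.BinaryClassification.Trainers.FastTree(r.HasChurned, r.Features, numTrees: 20));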

src/Microsoft.ML.Data/Training/TrainerEstimatorBase.cs

Lines changed: 2 additions & 5 deletions
@@ -130,11 +130,8 @@ protected virtual void CheckLabelCompatible(SchemaShape.Column labelCol)
 protected TTransformer TrainTransformer(IDataView trainSet,
     IDataView validationSet = null, IPredictor initPredictor = null)
 {
-    var cachedTrain = Info.WantCaching ? new CacheDataView(Host, trainSet, prefetch: null) : trainSet;
-    var cachedValid = Info.WantCaching && validationSet != null ? new CacheDataView(Host, validationSet, prefetch: null) : validationSet;
-
-    var trainRoleMapped = MakeRoles(cachedTrain);
-    var validRoleMapped = validationSet == null ? null : MakeRoles(cachedValid);
+    var trainRoleMapped = MakeRoles(trainSet);
+    var validRoleMapped = validationSet == null ? null : MakeRoles(validationSet);

     var pred = TrainModelCore(new TrainContext(trainRoleMapped, validRoleMapped, null, initPredictor));
     return MakeTransformer(pred, trainSet.Schema);
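This deletion is the heart of the commit: TrainTransformer no longer consults Info.WantCaching and wraps its inputs in a CacheDataView, so training runs on exactly the view the caller passes in. A migration sketch, with `trainer`, `pipeline`, and `trainData` as assumed stand-ins for any iterative trainer setup:

// Before this commit, caching happened implicitly inside TrainTransformer
// whenever the trainer declared Info.WantCaching. Now the caller opts in,
// either by caching the data view itself...
var model = trainer.Fit(mlContext.Data.Cache(trainData));

// ...or by placing a checkpoint in the pipeline ahead of the trainer.
var model2 = pipeline
    .AppendCacheCheckpoint(mlContext)
    .Append(trainer)
    .Fit(trainData);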

test/BaselineOutput/Common/OVA/OVA-CV-iris-out.txt

Lines changed: 16 additions & 16 deletions
@@ -21,35 +21,35 @@ Confusion table
 PREDICTED || 0 | 1 | 2 | Recall
 TRUTH ||========================
 0 || 21 | 0 | 0 | 1.0000
-1 || 0 | 22 | 8 | 0.7333
+1 || 0 | 20 | 10 | 0.6667
 2 || 0 | 0 | 28 | 1.0000
 ||========================
-Precision ||1.0000 |1.0000 |0.7778 |
-Accuracy(micro-avg): 0.898734
-Accuracy(macro-avg): 0.911111
-Log-loss: 0.372620
-Log-loss reduction: 65.736556
+Precision ||1.0000 |1.0000 |0.7368 |
+Accuracy(micro-avg): 0.873418
+Accuracy(macro-avg): 0.888889
+Log-loss: 0.393949
+Log-loss reduction: 63.775293

 Confusion table
 ||========================
 PREDICTED || 0 | 1 | 2 | Recall
 TRUTH ||========================
 0 || 29 | 0 | 0 | 1.0000
-1 || 0 | 18 | 2 | 0.9000
+1 || 0 | 19 | 1 | 0.9500
 2 || 0 | 0 | 22 | 1.0000
 ||========================
-Precision ||1.0000 |1.0000 |0.9167 |
-Accuracy(micro-avg): 0.971831
-Accuracy(macro-avg): 0.966667
-Log-loss: 0.357704
-Log-loss reduction: 67.051654
+Precision ||1.0000 |1.0000 |0.9565 |
+Accuracy(micro-avg): 0.985915
+Accuracy(macro-avg): 0.983333
+Log-loss: 0.299620
+Log-loss reduction: 72.401815

 OVERALL RESULTS
 ---------------------------------------
-Accuracy(micro-avg): 0.935283 (0.0365)
-Accuracy(macro-avg): 0.938889 (0.0278)
-Log-loss: 0.365162 (0.0075)
-Log-loss reduction: 66.394105 (0.6575)
+Accuracy(micro-avg): 0.929667 (0.0562)
+Accuracy(macro-avg): 0.936111 (0.0472)
+Log-loss: 0.346785 (0.0472)
+Log-loss reduction: 68.088554 (4.3133)

 ---------------------------------------
 Physical memory usage(MB): %Number%
Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
 OVA
 Accuracy(micro-avg) Accuracy(macro-avg) Log-loss Log-loss reduction /p Learner Name Train Dataset Test Dataset Results File Run Time Physical Memory Virtual Memory Command Line Settings
-0.935283 0.938889 0.365162 66.3941 AvgPer{lr=0.8} OVA %Data% %Output% 99 0 0 maml.exe CV tr=OVA{p=AvgPer{ lr=0.8 }} threads=- norm=No dout=%Output% data=%Data% seed=1 /p:AvgPer{lr=0.8}
+0.929667 0.936111 0.346785 68.08855 AvgPer{lr=0.8} OVA %Data% %Output% 99 0 0 maml.exe CV tr=OVA{p=AvgPer{ lr=0.8 }} threads=- norm=No dout=%Output% data=%Data% seed=1 /p:AvgPer{lr=0.8}