Making TensorFlowTransform trainable. #1063
Conversation
private class TensorValueGetter<T> : ITensorValueGetter
{
    private readonly ValueGetter<T> _srcgetter;
    private readonly List<T> _bufferedData;
List
Maybe think about making it an array to make it faster. #Resolved
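The reviewer's suggestion can be illustrated with a small sketch (Python rather than the C# source, and all names hypothetical): when the batch size is known up front, a preallocated array avoids the repeated growth and copying of an append-based list.

```python
class BatchBuffer:
    """Buffers values for a fixed-size batch using a preallocated slot array,
    analogous to replacing List<T> with T[] in the C# code under review."""

    def __init__(self, batch_size):
        self._data = [None] * batch_size  # preallocated, never grows
        self._count = 0

    def add(self, value):
        # Write into the next free slot instead of appending.
        self._data[self._count] = value
        self._count += 1

    def reset(self):
        # Reuse the same storage for the next batch.
        self._count = 0

buf = BatchBuffer(3)
for v in (1.0, 2.0, 3.0):
    buf.add(v)
```

The trade-off is the usual one: the array variant assumes the batch size is fixed, which holds here since the transform feeds fixed-size minibatches to TensorFlow.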
[Fact]
public void TensorFlowTransformMNISTTemplateTrainingTest()
{
    var model_location = @"E:\Tensorflow\LogisticRegression";
@"E:\Tensorflow\LogisticRegression"
Need to fix this hard-coded path... #Resolved
trans = new ConcatTransform(env, "Features", "Prediction").Transform(trans);
trans = new CopyColumnsTransform(env, ("LabelOriginal", "Label")).Transform(trans);

var trainer = new LightGbmMulticlassTrainer(env, "Label", "Features");
LightGbmMulticlassTrainer
Note: a trainer on top is actually not a limitation. In the Estimator/Transformer world, the prediction pipeline will work without adding a trainer at the end. This is only because it is being used in the direct-instantiation case. #Resolved
Is enough of this state exposed so that we can write serialization on top, or are things internal / not exposed to the C API? At least the Python and JS APIs are able to save, so there must be a way, but I can imagine that the exports needed aren't exposed by the TF C API (or the underlying structure of the objects isn't exposed in the headers). #Closed
I found out a standard way of doing this through the session. I will update the PR with my changes, most probably today. In reply to: 425152137
For that I also included some other features from TF#. Your feedback will be valuable on that. In reply to: 425159496
trans = CategoricalTransform.Create(env, trans, "Label");
trans = NormalizeTransform.CreateMinMaxNormalizer(env, trans, "Features", "Placeholder");

var args = new TensorFlowTransform.TrainingArguments()
TrainingArguments
So in this scenario we are re-training the entire graph. This is a good place to start the PoC.
Two other patterns for transfer learning we need to look into:
(a) replace the last layer (or last few layers) with a layer that matches the shape of "Label" for the new dataset that we want to re-train on,
i.e. re-train the "lr_model" above, but instead of the final layer, which currently has shape [-1, 10], re-train on the Iris dataset (say), which has shape [-1, 5].
(b) support 'freeze_up_to_node=<node_name>' functionality, i.e. freeze weights for nodes that <node_name> depends on, and only fine-tune weights of nodes that depend on <node_name>.
#ByDesign
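Pattern (b) can be sketched conceptually. This is not the PR's implementation, just a minimal NumPy illustration (with hypothetical names) of what "freeze up to a node" means for the update step: parameters below the freeze point receive no gradient update, while the rest are fine-tuned.

```python
import numpy as np

# Illustrative parameter blocks: w_frozen sits "below" the freeze point,
# w_tuned "above" it. These names are not from the PR.
w_frozen = np.ones((4, 4))
w_tuned = np.ones((4, 2))

def sgd_step(grad_frozen, grad_tuned, lr=0.1):
    """One manual SGD step. The frozen block's gradient is received but
    deliberately ignored, which is exactly what freezing amounts to."""
    global w_tuned
    w_tuned = w_tuned - lr * grad_tuned  # only the fine-tuned block moves

before_frozen = w_frozen.copy()
sgd_step(np.ones((4, 4)), np.ones((4, 2)))
```

In real TensorFlow this is typically achieved by restricting the variables the optimizer is allowed to update, rather than by discarding gradients by hand.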
Thanks @abhishek for pointing out these scenarios. Actually, these are mostly research-related scenarios used for experimentation purposes. They can be added on top of what has been implemented so far, if required.
BTW, it is no longer a PoC... :)
I would suggest reviewing it as a real PR that will go into master.
In reply to: 221026509
Should we track those scenarios above as separate issues?
To me they seem more relevant -- esp. (a). Otherwise we would have to tell customers that we can only do transfer learning if their label dimension is the same as the label dimension used in the pre-trained model.
(Maybe there is a simple fix for this -- I just want to make sure we keep track of the assumptions made in this PR.)
In reply to: 221084395
Changing the label dimension means changing the cost function, prediction function, label input dimension, and whatnot. It's hard to identify the operations in the graph that use the label and adjust them to new dimensions/settings.
If you have a solution, then it would definitely be worth trying... :)
In reply to: 221099796
OK. I can take a shot at this as a separate issue. We need to document the existing behaviour though, i.e. we only support transfer learning on models that have the same feature/label dimensions as the user data.
In reply to: 221102056
for (int j = 1; j < tfInputShapes[i].NumDimensions; j++)
    newShape[j] = tfInputShapes[i][j];
tfInputShapes[i] = new TFShape(newShape);
}
Need a bit more shape checking here due to missing dimensions... #Resolved
Yup, did it the same way. In reply to: 425178632
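The fix-up being discussed, replacing an unknown leading (batch) dimension while keeping the trailing dimensions, can be sketched in Python; `resolve_batch_shape` is a hypothetical helper, not code from the PR.

```python
def resolve_batch_shape(tf_shape, batch_size):
    """Replace the first (batch) dimension of a TF-style shape, which is
    often -1/unknown in a saved graph, with the concrete minibatch size.
    Trailing dimensions are preserved, mirroring the C# loop above that
    copies indices 1..NumDimensions-1."""
    new_shape = [batch_size]
    new_shape.extend(tf_shape[1:])
    return new_shape

resolved = resolve_batch_shape([-1, 784], 64)
```

The "missing dimensions" concern in the comment is about shapes where more than the batch dimension is unknown; a real implementation would need to validate those rather than copy them blindly.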
/// })
/// </summary>
[Fact]
public void TensorFlowTransformMNISTTemplateTrainingTest()
TensorFlowTransformMNISTTemplateTrainingTest
Write more tests... #Resolved
// Save the model on disk
var path = Path.Combine(modelDir, "mlnet_model");
Session.GetRunner().AddInput("save/Const", TFTensor.CreateString(Encoding.UTF8.GetBytes(path)))
    .AddTarget("save/control_dependency").Run();
A few questions:
- Is this what's serializing the Session to disk?
- It will save only the parameter weights, not the saved_model.pb graph itself... am I right?
- Is this a standard technique -- using nodes with the names "save/Const" and "save/control_dependency" to overwrite model weights? #Closed
Yes to all three questions above.
TensorFlow saves/restores models through these operations on the graph.
In reply to: 221098961
For #3, does TF document this somewhere in code / other language bindings, etc.?
Just using a combination of the strings "save/Const" and "save/control_dependency" seems so arbitrary :) ...
I wonder if this is the same technique used when checkpointing models?
In reply to: 221102898
Yes, this is how TensorFlow saves/restores models to/from disk, both for checkpointing and for simple_save. There is a reason to inject those operations into the graph: TF prefers all operations to go through the graph instead of through Python or other language code. I will find proper documentation for that.
I have made this a parameter in the TrainingArguments class so that users can change it if they need to. However, it is highly unlikely to be changed.
In reply to: 221104033
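The split described above, where the checkpoint stores parameter values while the graph itself lives elsewhere, can be illustrated with a hypothetical NumPy round-trip (no TensorFlow involved): only the weights travel through disk.

```python
import os
import tempfile
import numpy as np

# Illustrative stand-in for the model parameters held by a TF Session.
weights = np.array([0.1, 0.2, 0.3])

# Writing the checkpoint: analogous to feeding the target path into
# "save/Const" and running "save/control_dependency". The path and file
# name here are made up for the example.
path = os.path.join(tempfile.mkdtemp(), "mlnet_model.npy")
np.save(path, weights)

# Restoring: analogous to running the restore op. Note that only the
# values come back; the "graph" (this script) was never serialized.
restored = np.load(path)
```

This mirrors the answer to question 2 above: the saved artifact holds parameter weights, not the saved_model.pb graph definition.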
///
/// # Set model weights
/// W = tf.Variable(tf.truncated_normal([784, 10]), name = "W")
/// b = tf.Variable(tf.constant(0.1, shape=[10], dtype=tf.float32), name = "b")
constant
Can we verify that after the fine-tuning we actually end up with different values for the parameters?
E.g. verifying that the bias term we get after fine-tuning the model is different from the initialization value (0.1). #Closed
@@ -38,7 +39,7 @@ namespace Microsoft.ML.Transforms
 /// <include file='doc.xml' path='doc/members/member[@name="TensorflowTransform"]/*' />
 public sealed class TensorFlowTransform : ITransformer, ICanSaveModel
 {
-    public sealed class Arguments : TransformInputBase
+    public class Arguments : TransformInputBase
public class Arguments
Can you make it abstract? Also, I would suggest calling it ArgumentsBase rather than Arguments, because it is confusing to have this base class alongside an Arguments class that actually holds the parameters. #Closed
It cannot be made abstract because the Arguments object is instantiated for the scoring scenario and TrainingArguments is instantiated for the training scenario.
In reply to: 221764289
Why can't it remain sealed if you don't use it as a base class?
Also, maybe rename it to ScoringArguments?
In reply to: 221768356
{
    float loss = 0;
    float metric = 0;
    var runner = Session.GetRunner();
runner
I want to make sure I understand: the runner is tuning the weights of the graph in runner.Run()? And the weights are stored somewhere in the Session? #Closed
Yes, line 451
runner.AddTarget(args.OptimizationOperation);
adds the optimization operation to this runner, which updates the weights. The Session holds those weights.
In reply to: 222501506
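The Session/Runner relationship described in this reply can be mimicked in a toy Python sketch (all names hypothetical): the session owns the mutable weight state, and adding the optimization target to a run is what mutates it.

```python
class Session:
    """Toy stand-in for a TF session: it owns the variable state."""
    def __init__(self):
        self.weights = {"W": 1.0}

class Runner:
    """Toy stand-in for Session.GetRunner(): collects targets, then runs
    them against the session's state."""
    def __init__(self, session):
        self._session = session
        self._targets = []

    def add_target(self, op):
        self._targets.append(op)
        return self  # fluent style, like AddTarget(...) in the C# code

    def run(self):
        for op in self._targets:
            op(self._session)

def optimization_op(session):
    # Stand-in for args.OptimizationOperation: mutates session-held weights.
    session.weights["W"] -= 0.5

sess = Session()
Runner(sess).add_target(optimization_op).run()
```

Running without the optimization target would leave the weights untouched, which is exactly the scoring-only behaviour of the existing transform.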
{
    if (isVector)
        return new TensorValueGetterVec<T>(input, colIndex, tfShape);
    else
else
Nit: the "else" is not needed here. #Closed
@@ -268,7 +268,7 @@ internal class TFStatus : TFDisposable
 {
     // extern TF_Status * TF_NewStatus ();
     [DllImport(NativeBinding.TensorFlowLibrary)]
-    private static extern unsafe TF_Status TF_NewStatus();
+    internal static extern unsafe TF_Status TF_NewStatus();
internal
Instead of exposing the C API directly, you can use the C# API defined on top of it (the ctor below). Same for the other C imports below. #Resolved
@@ -2238,4 +2244,19 @@ public TFTensor AsTensor()
         return shape.AsTensor();
     }
 }

+internal class TFString
TFString
Consider importing these APIs directly in the TFTensor class; that way they can be private. #Resolved
/// The name of the column used as label for training.
/// </summary>
[Argument(ArgumentType.AtMostOnce, HelpText = "Training labels.", ShortName = "label", SortOrder = 4)]
public string LabeLColumn = DefaultColumnNames.Label;
LabeLColumn
This must be identical to the name of the node in the graph corresponding to the label, correct?
If that is the case, then I don't think we should have the default value be "Label". #Resolved
Actually, I think it would make more sense to have two arguments: one for the name of the label column in the data view, and one for the name of the label node in the graph (unless it is possible to find it in the graph without user input?).
In reply to: 222725910
This is the same way the InputColumns and OutputColumns are treated. Do you think we should treat Label differently? What I can do is make it required.
In reply to: 222726613
I don't think we should treat it differently; I think we should also make that change for the input/output columns :-).
In reply to: 222754435
@@ -316,7 +695,14 @@ public void Save(ModelSaveContext ctx)
 {
     ctx.SaveBinaryStream("TFSavedModel", w =>
     {
-        string[] modelFilePaths = Directory.GetFiles(_savedModelPath, "*", SearchOption.AllDirectories);
+        // only these files need to be saved.
+        string[] modelFilePaths =
modelFilePaths
Is this a change in the serialization format, or just a more explicit way of writing it? #Resolved
No change in the format. Previously, our assumption was that the folder would contain only these files, so we were writing all the files in the folder. Now the folder may contain archive files due to retraining, so I am explicitly writing the required files.
In reply to: 222727569
if (args.LossOperation != null)
    pch.SetHeader(new ProgressHeader("Average Loss"), (e) => e.SetProgress(1, loss / ((cursor.Position + 1) / args.BatchSize)));
if (args.MetricOperation != null)
    pch.SetHeader(new ProgressHeader("Average Metric"), (e) => e.SetProgress(2, metric / ((cursor.Position + 1) / args.BatchSize)));
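The progress figure computed in the snippet above is the accumulated loss divided by the number of completed batches. A hypothetical Python helper makes the arithmetic explicit; the integer division mirrors the C# expression `(cursor.Position + 1) / args.BatchSize`.

```python
def average_loss(total_loss, rows_seen, batch_size):
    """Running average of a per-batch loss total, as reported to the
    progress channel above. Guards the zero-batches case, which the C#
    expression would hit before the first full batch completes."""
    batches = rows_seen // batch_size  # integer division, as in the C# code
    return total_loss / batches if batches else 0.0

# Hypothetical numbers: 64 rows seen in batches of 32 with total loss 6.0.
avg = average_loss(6.0, 64, 32)
```

Note the division-by-zero risk: until `rows_seen` reaches one full batch, the C# denominator is zero, which is one reason the reviewer below suggests restructuring the progress reporting.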
/// A common interface for progress reporting.
/// It is expected that the progress channel interface is used from only one thread.
///
/// Supported workflow:
/// 1) Create the channel.
/// 2) Call SetHeader as many times as desired (including 0).
///    Each call to SetHeader supersedes the previous one.
/// 3) Report checkpoints (0 or more) by calling Checkpoint.
/// 4) Repeat steps 2-3 as often as necessary.
/// 5) Dispose the channel.
See 2).
I think you need to do something like this:
pch.SetHeader(new ProgressHeader(metrics.ToArray(), new[] { "Epochs" }), entry => entry.SetProgress(0, epoch, _args.Epochs));
and after each training pass:
if (metrics.Count > 0)
    pch.Checkpoint(avgMetrics);
where metrics and avgMetrics depend on your arguments.
#Closed
// Clear offset table
IntPtr dst = TF_TensorData(handle);
Marshal.WriteInt64(dst, 0);
var status = new TFStatus();
status
using (var status ...) #Resolved
@@ -290,7 +290,7 @@ public TFStatus() : base(TF_NewStatus())

 // extern void TF_DeleteStatus (TF_Status *);
 [DllImport(NativeBinding.TensorFlowLibrary)]
-private static extern unsafe void TF_DeleteStatus(TF_Status status);
+internal static extern unsafe void TF_DeleteStatus(TF_Status status);
internal
Can this one and TF_GetCode go back to being private? #Closed
private (int, bool, TFDataType, TFShape) GetInputMetaData(ISchema inputSchema, string columnName, string tfNodeName, int batchSize)
{
    if (!inputSchema.TryGetColumnIndex(columnName, out int inputColIndices))
inputColIndices
inputColIndex #Resolved
var index = inputsForTraining.Length - 1;
inputsForTraining[index] = args.TensorFlowLabel;
(inputColIndices[index], isInputVector[index], tfInputTypes[index], tfInputShapes[index]) =
    GetInputMetaData(inputSchema, args.LabelColumn, inputsForTraining[index], args.BatchSize);
LabelColumn
Check that this is not null. #Resolved
    return (inputColIndices, isInputVector, tfInputType, tfInputShape);
}
private void TrainCore(Arguments args, string model, IDataView input)
private
Add an empty line. #Resolved
var srcTensorGetters = GetTensorValueGetters(cursor, inputColIndices, isInputVector, tfInputTypes, tfInputShapes);

float loss=0;
float metric=0;
Ctrl+k+d. #Resolved
}, new MultiFileSource(dataPath));

IDataView trans = CategoricalTransform.Create(env, loader, "OneHotLabel","Label");
trans = NormalizeTransform.CreateMinMaxNormalizer(env, trans, "Features", "Placeholder");
CreateMinMaxNormalizer
Why do we add a normalizer here? #Resolved
This is a simple TensorFlow LR model which needs normalization, otherwise the gradients blow up.
In reply to: 223087660
var trainedTfDataView = TensorFlowTransform.Create(env, args, trans);
trans = new ConcatTransform(env, "Features", "Prediction").Transform(trainedTfDataView);
//trans = new CopyColumnsTransform(env, ("LabelOriginal", "Label")).Transform(trans);
//trans = new CopyColumnsTransform(env, ("LabelOriginal", "Label")).Transform(trans);
Delete. #Closed
This PR addresses #951, allowing users to retrain TensorFlow models using TensorFlowTransform in ML.NET.
The implemented training functionality is an add-on feature on top of the current implementation of TensorFlowTransform; model scoring via TensorFlowTransform is untouched.
Current implementation includes
ToDo
Serializing the retrained model into the ML.NET stream. It seems like a big challenge right now because the TensorFlow C API does not support serialization of TFSession objects, which contain all the retrained parameters.
TrainingArguments are passed (will be done as a separate PR: Convert TensorFlowEstimator into Non-TrivialEstimator. #1110).