Enable TensorFlowTransform to work with pre-trained models that are not frozen #853
Conversation
Should the model files go in the separate repo that we use for the other TensorFlow models? #Resolved
Can you merge it with master? Refers to: src/Microsoft.ML.TensorFlow/TensorflowTransform.cs:1 in 57508d3.
}

public void Save(ModelSaveContext ctx)
{
    _host.AssertValue(ctx);
    ctx.CheckAtModel();
    ctx.SetVersionInfo(GetVersionInfo());
    ctx.Writer.Write(_isFrozen ? 1 : 0);
`Write`
I believe you can write a bool here. #Resolved
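A minimal sketch of the suggestion, using a plain `BinaryWriter` (assuming ML.NET's `ModelSaveContext.Writer` exposes a compatible `Write` overload; this is illustrative, not the transform's actual code):

```csharp
using System.IO;

class SaveSketch
{
    static void Save(BinaryWriter writer, bool isFrozen)
    {
        // Instead of widening the flag to an int:
        //   writer.Write(isFrozen ? 1 : 0);  // writes 4 bytes
        // the bool can be written directly:
        writer.Write(isFrozen);               // writes 1 byte; intent is clearer
    }
}
```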
}

public void Save(ModelSaveContext ctx)
{
    _host.AssertValue(ctx);
    ctx.CheckAtModel();
    ctx.SetVersionInfo(GetVersionInfo());
    ctx.Writer.Write(_isFrozen ? 1 : 0);
    if (_isFrozen)
`_isFrozen`
Please preserve the saving order, so that old versions of model files can still be read. #Resolved
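The point here is that the serialized field order is a contract: loading an old model file replays the writes in order. A sketch with hypothetical fields, using plain `BinaryReader`/`BinaryWriter` rather than ML.NET's actual context types:

```csharp
using System.IO;

// Hypothetical layout: the on-disk field order is a contract.
static class ModelFormat
{
    public static void Save(BinaryWriter w, bool isFrozen, int inputCount)
    {
        w.Write(isFrozen);    // field 1: must stay first
        w.Write(inputCount);  // field 2: new fields go after existing ones
    }

    public static (bool isFrozen, int inputCount) Load(BinaryReader r)
    {
        // Reads must mirror the writes, field by field.
        var isFrozen = r.ReadBoolean();
        var inputCount = r.ReadInt32();
        return (isFrozen, inputCount);
    }
}
```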
@@ -235,6 +235,8 @@ private string GetBuildPrefix()
}

[Fact(Skip = "Execute this test if you want to regenerate ep-list and _manifest.json")]
// [Fact]
`// [Fact]`
Not necessary. #Resolved
…machinelearning into abgoswam/tf_savedmodel
    public string[] OutputColumns;
}

private readonly IHost _host;
private const string RegistrationName = "TensorFlowTransform";

internal readonly TFSession Session;
internal readonly bool IsFrozen;
`IsFrozen`
You don't need this. All you care about is the session, and you already have it; this is just a detail of how the session was loaded. #Resolved
…ML.TensorFlow.TestModels contains these models
    return new TensorFlowTransform(env, modelBytes, io.Item1, io.Item2, (isFrozen == 1));
}
else
{
`{`
[observation] The Train() method of the LearningPipeline API ends up calling Save() and Create() multiple times.
For un-frozen models we need a particular directory structure so we can create the TFSession correctly using the TFSession.FromSavedModel API (line 215).
In this call to Create() we do the following:
(a) load the binary stream
(b) write the binary stream out to a zip file
(c) extract the zip file to a temporary folder TempSavedModelDirName (_savedModel)
(d) create the session
(e) delete the zip file
We do not delete the temporary folder, because a subsequent call to Save() writes the contents of the temporary folder out as a byte stream.
So for this scenario, what we currently have is a unique location for the temporary folder, and it gets cleaned up only when required, inside the Create() call (line 139).
Is there some way to ensure the temporary folder is always cleaned up? Or perhaps some way to not have a temporary folder at all?
@ericstj @zeahmed @yaeldekel #Resolved
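Steps (a)-(e) can be sketched roughly as follows. The helper and its names are illustrative, not the PR's actual code, and the TFSession creation itself is elided:

```csharp
using System.IO;
using System.IO.Compression;

static class SavedModelLoader
{
    // Returns the directory from which TFSession.FromSavedModel would load;
    // the session-creation call itself is elided here.
    public static string MaterializeSavedModel(byte[] modelBytes, string tempDir)
    {
        Directory.CreateDirectory(tempDir);
        var zipPath = Path.Combine(tempDir, "SavedModel.zip");

        File.WriteAllBytes(zipPath, modelBytes);            // (b) write the zip
        var savedModelDir = Path.Combine(tempDir, "model");
        ZipFile.ExtractToDirectory(zipPath, savedModelDir); // (c) extract it
        File.Delete(zipPath);                               // (e) delete the zip

        // (d) the caller creates the TFSession from savedModelDir;
        // the folder is NOT deleted here because a later Save() re-reads it.
        return savedModelDir;
    }
}
```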
You should avoid the temporary zip file. If you can get a seekable stream from ctx, just pass that into the ZipArchive constructor: https://msdn.microsoft.com/es-es/library/system.io.compression.ziparchive.ziparchive(v=vs.110).aspx. It looks like you can do this by accessing the BaseStream on the binary reader passed into the TryLoadBinaryStream action.
You should also avoid allocating an extra byte array for the zip and instead do all the reading within the callback from TryLoadBinaryStream, since it looks like TryLoadBinaryStream will close the stream:
using (ent)
I think you still need to have a temporary directory for the model files themselves. It looks like this is a requirement of TensorFlow https://github.com/tensorflow/tensorflow/blob/3be04971716fcaf0c11ad9262e60efa428553e14/tensorflow/c/c_api.h#L1367. It requires the model to be in a directory and doesn't provide a mechanism for loading from some sort of stream/memory representation. So you'll still need a temp path that you'll have to clean up.
Ideally you should provide a way for this path to be specified, rather than always call GetTempPath behind the scenes. I think ML.NET has a pattern for this, if not, perhaps consider a constructor overload that allows for specification of a temp path root. You may also consider some way to say "don't clean up the temp path".
Typically with things like this you should treat the temporary folder as state managed by the object and then cleaned up in the dispose/finalizer.
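The zip-from-a-seekable-stream suggestion above, sketched with plain framework types (assuming the caller can hand over a seekable `Stream`, such as the binary reader's `BaseStream`; names are illustrative):

```csharp
using System.IO;
using System.IO.Compression;

static class ZipFromStream
{
    public static void ExtractAll(Stream seekable, string destDir)
    {
        // No temporary .zip file: ZipArchive reads directly from the stream.
        // leaveOpen: true so the caller's stream is not closed here.
        using (var archive = new ZipArchive(seekable, ZipArchiveMode.Read, leaveOpen: true))
        {
            foreach (var entry in archive.Entries)
            {
                if (string.IsNullOrEmpty(entry.Name))
                    continue; // directory entry, nothing to write

                var target = Path.Combine(destDir, entry.FullName);
                Directory.CreateDirectory(Path.GetDirectoryName(target));
                using (var ent = entry.Open())
                using (var file = File.Create(target))
                    ent.CopyTo(file);
            }
        }
    }
}
```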
In reply to: 217123233
I tried both of the suggestions above, but they seem to have some limitations with respect to this transform. I have since modified the logic of load & restore to do things in-memory.
Documenting the issues hit with the above suggestions:
(1) The ZipArchive approach looked promising as a way to avoid the temporary zip file. However, it works on the entire stream, rather than on a subset of the stream.
(2) Currently the TFTransform does not dispose the TFSession object, so attempts to clean up the temp folder fail with 'file still in use' errors. The approach of using a unique location for the temporary folder is problematic in general (e.g. during CV, or if the pipeline contains multiple TFTransforms).
I have modified the logic so we do the ML.NET serialization/de-serialization using in-memory data structures.
In reply to: 217447210
You should file an issue on (2) above.
I think this approach looks much better than the zip-in-a-zip approach from before. One change I'd suggest: instead of reading into a byte array and storing it in memory, create the file within the callback (var fs = new FileStream(...)) and use br.BaseStream.CopyTo(fs, fileLength); To do this you'll need to replace ReadByteArray/WriteByteArray with your own calls that store the file length, followed by the stream copy.
One thing I notice that could be problematic for the frozen case above (not related to this change) is that it seems to use Graph.Import to import an entire model as a managed byte array. I can imagine that this will potentially hit the upper limit of managed array sizes for large models. I see that TF's graph import functionality expects a single buffer (https://github.com/tensorflow/tensorflow/blob/3be04971716fcaf0c11ad9262e60efa428553e14/tensorflow/c/c_api.h#L1018-L1020).
In reply to: 217462971
The main reason for reading into a byte array and storing it in memory is to handle the scenario where we saw the sequence Save1() -> Create1() -> Save2() -> Create2() being called.
When Create1() is called, we load the contents into a dictionary, so when Save2() is called we use the dictionary to write out the contents.
Because of issue #906 (not being able to clean up unmanaged resources), I am not sure we can do away with the in-memory approach. Thoughts?
In reply to: 217462971
Adopted the stream-copy-based approach instead of reading into a byte array.
Also added a finalizer that closes the session (if it isn't already closed) and deletes the temporary directory.
In reply to: 217545311
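The cleanup described above typically follows the standard dispose/finalizer pattern. A sketch (the session field and its close call are elided, as assumptions about the transform's internals):

```csharp
using System;
using System.IO;

// Sketch: the temp directory (and native session) treated as state owned
// by the object, cleaned up on Dispose or, as a last resort, finalization.
sealed class SavedModelHolder : IDisposable
{
    private readonly string _tempDir;
    private bool _disposed;

    public SavedModelHolder(string tempDir) { _tempDir = tempDir; }

    public void Dispose()
    {
        Cleanup();
        GC.SuppressFinalize(this);
    }

    ~SavedModelHolder() => Cleanup();

    private void Cleanup()
    {
        if (_disposed) return;
        _disposed = true;
        // 1) close the native session first, so files are no longer in use
        //    (e.g. _session.CloseSession(); elided here)
        // 2) only then can the temp folder actually be deleted
        try { Directory.Delete(_tempDir, recursive: true); }
        catch (IOException) { /* best effort; folder may already be gone */ }
    }
}
```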
@eerhardt , @danmosemsft , would this be a security risk? To load the TensorFlow model here, we create a temporary directory and copy some files to it, and then load the model from that directory. #Resolved
Followed the recommendation of the .NET security team to address this risk.
In reply to: 218159541
private const string RegistrationName = "TensorFlowTransform";
private const string TempSavedModelDirName = "MLNET_TensorFlowTransform_SavedModel";
private const string TempSavedModelZipName = "SavedModel.zip";
`private const string TempSavedModelZipName = "SavedModel.zip";`
What will happen if I am working on two different pipelines with TensorflowTransforms executing in parallel? I think having the same name for the .zip (SavedModel.zip) and the model_dir (MLNET_TensorFlowTransform_SavedModel) will create a problem.
Let's see what @eric suggests. However, when I did this initially, my approach was as follows:
- In the constructor, load all the files of the un-frozen model into a dictionary (called the model dictionary) where the keys are the file names and the values are the file streams.
- Load the model from the location that the user provided.
- When saving, save the streams in the model dictionary into the ML.Net stream.
- When loading from the ML.Net stream, load the streams into the model dictionary, create a temporary folder and write all the streams into the folder. Load the model into TF using that temporary folder. Delete the temporary folder once loading is done.
#Resolved
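The model-dictionary approach in the list above might be sketched like this (illustrative helper; a plain `BinaryWriter` stands in for the ML.Net stream, and byte arrays stand in for the file streams):

```csharp
using System.Collections.Generic;
using System.IO;

static class ModelDictionary
{
    // file name -> file bytes, built once when the model directory is first read
    public static Dictionary<string, byte[]> Load(string modelDir)
    {
        var dict = new Dictionary<string, byte[]>();
        foreach (var file in Directory.GetFiles(modelDir, "*", SearchOption.AllDirectories))
        {
            // relative path as the key (assumes modelDir has no trailing separator)
            var key = file.Substring(modelDir.Length + 1);
            dict[key] = File.ReadAllBytes(file);
        }
        return dict;
    }

    public static void Save(BinaryWriter w, Dictionary<string, byte[]> dict)
    {
        w.Write(dict.Count);
        foreach (var kvp in dict)
        {
            w.Write(kvp.Key);          // relative file name
            w.Write(kvp.Value.Length); // payload length
            w.Write(kvp.Value);        // payload
        }
    }
}
```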
This obviously has a limitation on the size of the model.
In reply to: 217133381
True. I have modified the logic so we do the load & restore in-memory (using a dictionary).
In reply to: 217134254
        subDirInfo.Delete(true);
    }
}
internal static TFSession GetSession(IHostEnvironment env, string modelPath)
`internal`
nit: Add an empty line between methods. #Closed
@TomFinley or @Zruty0 might be able to give more insight. I think that with the Estimator/Transformer API it might cause problems: the TensorFlowEstimator instantiates a TensorFlowTransform in its constructor, which is returned to the user when Fit() is called. The estimator still owns the TensorFlowTransform though, and it will return the same one if Fit() is called again, so we wouldn't want someone to dispose it.
In reply to: 423241990
Refers to: src/Microsoft.ML.TensorFlow/TensorflowTransform.cs:401 in ae672d6.
Is the Estimator the better place for this state to live? At the end of the day, the directory represents the "training state" and it needs to be managed by some object. That object should be responsible for cleaning it up and should have a definite lifetime.
In reply to: 423310984
Refers to: src/Microsoft.ML.TensorFlow/TensorflowTransform.cs:401 in ae672d6.
<ItemGroup>
  <PackageReference Include="System.IO.FileSystem.AccessControl" Version="4.5.0" />
  <PackageReference Include="System.Security.Principal.Windows" Version="4.5.0" />
</ItemGroup>
A) Do we really need these? The presence of Principal.Windows in multiplatform code is suspicious.
B) If we really need them, you need to make them part of the NuGet dependencies, and we also need to modify https://github.com/dotnet/machinelearning/blob/master/build/Dependencies.props instead of specifying versions here. #Resolved
We use this as a security measure on the Windows platform. We are using a temp directory for loading the TFSession for SavedModels. Models are considered executable code, so we need to ACL this temp directory in the high-rights process so a low-rights process can't access it.
I will look into making it part of the NuGet dependencies.
In reply to: 219572600
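A sketch of what ACL'ing such a temp directory might look like, restricting it to the current user with a fallback for non-Windows platforms (the names here are illustrative, not the PR's CreateFolderWithAclIfNotExists helper):

```csharp
using System.IO;
using System.Security.AccessControl;
using System.Security.Principal;

static class TempDirHelper
{
    public static void CreateRestrictedDirectory(string path)
    {
        try
        {
            var security = new DirectorySecurity();
            // Drop inherited rules and grant full control only to the current user.
            security.SetAccessRuleProtection(isProtected: true, preserveInheritance: false);
            var currentUser = WindowsIdentity.GetCurrent().User;
            security.AddAccessRule(new FileSystemAccessRule(
                currentUser, FileSystemRights.FullControl, AccessControlType.Allow));
            new DirectoryInfo(path).Create(security);
        }
        catch (System.PlatformNotSupportedException)
        {
            // Non-Windows: the ACL APIs are unavailable; fall back to a normal create.
            Directory.CreateDirectory(path);
        }
    }
}
```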
You can use this one as a reference implementation: https://github.com/dotnet/machinelearning/blob/master/pkg/Microsoft.ML.ImageAnalytics/Microsoft.ML.ImageAnalytics.nupkgproj
In reply to: 219577320
    return CheckFileAndRead(env, modelPath);
}

private static TFSession CheckFileAndRead(IHostEnvironment env, string modelFile)
`private static TFSession CheckFileAndRead(IHostEnvironment env, string modelFile)`
Can you get rid of this method? No one is using it any more except GetSession, and I think it would be much better to move the checks to the start of GetSession. #Resolved
    return attr.HasFlag(FileAttributes.Directory);
}

internal static void CreateTempDirectory(string tempDirPath)
`CreateTempDirectory`
What's the main reason for this? Why do we treat Windows differently?
If this is necessary, why is it part of TensorFlow only and not part of a utility used across the whole project? #Resolved
(Replied on the other comment about its necessity.)
Currently used in the TFTransform only, to protect SavedModels, which are considered executable code by the .NET security team.
In reply to: 219577130
Can you put in a comment describing why we are doing this?
I would be really tempted to remove this code if I didn't know the story behind it.
In reply to: 219578542
As suggested, marking as Pending for now. We do not want to block this PR on this. Will track it as part of issue #906.
In reply to: 423320325
Refers to: src/Microsoft.ML.TensorFlow/TensorflowTransform.cs:401 in ae672d6.
/// Given a folder path, create it with proper ACL if it doesn't exist.
/// Fails if the folder name is empty, or can't create the folder.
/// </summary>
internal static void CreateFolderWithAclIfNotExists(IHostEnvironment env, string folder)
`CreateFolderWithAclIfNotExists`
How does the creation of an ACL'd folder work on Linux/Mac? #Closed
It catches the PlatformNotSupportedException and does a normal create.
In reply to: 219903672
…eleting temp folder (3) deleted test using Legacy Learning API
<PackageReference Include="System.IO.FileSystem.AccessControl" Version="$(SystemIOFileSystemAccessControl)" />
<PackageReference Include="System.Security.Principal.Windows" Version="$(SystemSecurityPrincipalWindows)" />
<ProjectReference Include="../Microsoft.ML/Microsoft.ML.nupkgproj" />
<ProjectReference Include="../Microsoft.ML.TensorFlow.Redist/Microsoft.ML.TensorFlow.Redist.nupkgproj" />
minor, but please use spaces #Resolved
}

internal static TFSession GetSession(IHostEnvironment env, string modelPath)
{
Contracts.Check(env, nameof(env); #Resolved
Added this check.
BTW, I believe you meant `Contracts.Check(env != null, nameof(env));`
In reply to: 220031028
{
    env.CheckValue(exportDirSavedModel, nameof(exportDirSavedModel));
    var sessionOptions = new TFSessionOptions();
    var exportDir = exportDirSavedModel;
`var exportDir = exportDirSavedModel;`
Not needed. #Resolved
…t of Microsoft.ML.TensorFlow
@@ -2,7 +2,7 @@

 <PropertyGroup>
   <OutputType>Exe</OutputType>
-  <TargetFramework>netcoreapp2.1</TargetFramework>
+  <TargetFramework>netcoreapp2.0</TargetFramework>
This should be netcoreapp2.1. Note that netcoreapp and netstandard are different things. We want netcoreapp2.1 for any and all executables and tests. We want netstandard2.0 for any libraries.
Moreover, we don't even want this in the lib folder, as it is just a command-line tool.
Fixes #784