Re-using the same Dataview with Bitmaps in memory, breaks when fitting different models or run cross validation on it #4126
Comments
I ran into a similar issue a few weeks ago while using
From what I've read, issue #4084 seems closely related to this problem. I'm not sure whether the suggestions there were fully implemented.
Was this addressed in the upcoming 1.4.0 version? I need to know whether to wait or try another approach.
@ssaporito I ended up implementing my own cross validation: I split the data set into 5 different train/test permutations (80%/20%), train a model on each permutation, and average the resulting metrics over all 5 runs.
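For anyone wanting to try the same approach, here is a minimal sketch of it in plain C#. It is not the commenter's actual code: the `ManualValidation` class, the `AverageOverRandomSplits` method and the `trainAndScore` delegate are hypothetical names, and the delegate stands in for whatever trains a model and returns one metric. Note that this is repeated random hold-out validation (5 independent 80/20 shuffles), not k-fold cross validation with disjoint folds.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ManualValidation
{
    // Shuffles `items` `runs` times, holds out `testFraction` of each shuffle
    // for testing, and returns the mean of the metric that the caller-supplied
    // `trainAndScore(train, test)` delegate reports for each split.
    public static double AverageOverRandomSplits<T>(
        IReadOnlyList<T> items,
        Func<IReadOnlyList<T>, IReadOnlyList<T>, double> trainAndScore,
        int runs = 5,
        double testFraction = 0.2,
        int seed = 0)
    {
        var rng = new Random(seed);
        var scores = new List<double>();

        for (int run = 0; run < runs; run++)
        {
            // Fisher-Yates shuffle on a copy, so the caller's list is untouched.
            var shuffled = items.ToArray();
            for (int i = shuffled.Length - 1; i > 0; i--)
            {
                int j = rng.Next(i + 1);
                (shuffled[i], shuffled[j]) = (shuffled[j], shuffled[i]);
            }

            // First testFraction of the shuffle is the test set, the rest is train.
            int testCount = (int)Math.Round(shuffled.Length * testFraction);
            var test = shuffled.Take(testCount).ToArray();
            var train = shuffled.Skip(testCount).ToArray();

            scores.Add(trainAndScore(train, test));
        }

        return scores.Average();
    }
}
```

In the context of this issue, the delegate would presumably build a fresh IDataView for each split, fit the pipeline on the train part, and return something like the accuracy on the test part. Whether that sidesteps the exception depends on how the images are loaded for each split.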
Hi, @SnakyBeaky and @ssaporito. Sorry for the late response. I've written the code below, based on this sample but modified to use in-memory images (it uses the sample's dataset and TensorFlow model). (EDIT: With the code below I wasn't able to reproduce the issue, but in my next post I say what to change in order to reproduce it.)
By the way, I don't think issue #4084 is related to this. That issue was about introducing a whole new API specifically for a new Image Classification transformer, which uses a pretrained TensorFlow model and provides more options to the user. This API was introduced here, and it is unrelated to this issue, which uses LoadTensorFlowModel instead of the other API.
using Microsoft.ML;
using Microsoft.ML.Data;
using Microsoft.ML.Transforms.Image;
using System;
using System.Collections.Generic;
using System.Drawing;
using System.IO;
using System.Linq;
namespace is4126CrossValidateInMemoryImages
{
    class Program
    {
        public class ImageData
        {
            public string ImagePath;
            public string Label;
            [ImageType(227, 227)]
            public Bitmap Image { get; set; }
        }
        static void Main(string[] args)
        {
            var inputTensorFlowModelFilePath = @"C:\Users\anvelazq\Desktop\is4126\inception_v3_2016_08_28_frozen.pb";
            var inputDataSetFolder = @"C:\Users\anvelazq\Desktop\is4126\flower_photos_small_set";
            // var outModelPath = @"C:\Users\anvelazq\Desktop\is4126\model.zip";
            var mlContext = new MLContext();
            var imageSet = LoadImagesFromDirectory(inputDataSetFolder);
            IDataView fullImagesDataset = mlContext.Data.LoadFromEnumerable(imageSet);
            IDataView trainDataset = fullImagesDataset;
            var pipeline =
                mlContext.Transforms.Conversion.MapValueToKey("Label")
                //.Append(mlContext.Transforms.LoadImages(outputColumnName: "image_object", imageFolder: null, "ImagePath"))
                //.Append(mlContext.Transforms.CopyColumns("image_object", "Image"))
                .Append(mlContext.Transforms.ResizeImages(outputColumnName: "image_object_resized", imageWidth: 299, imageHeight: 299, inputColumnName: "Image"))
                .Append(mlContext.Transforms.ExtractPixels(outputColumnName: "input", inputColumnName: "image_object_resized", interleavePixelColors: true, offsetImage: 117, scaleImage: 1 / 255f))
                .Append(mlContext.Model.LoadTensorFlowModel(inputTensorFlowModelFilePath).ScoreTensorFlowModel(outputColumnNames: new[] { "InceptionV3/Predictions/Reshape" }, inputColumnNames: new[] { "input" }, addBatchDimensionInput: false))
                .Append(mlContext.MulticlassClassification.Trainers.LbfgsMaximumEntropy(labelColumnName: "Label", featureColumnName: "input"))
                ;
            Console.WriteLine(DateTime.Now);
            var model = pipeline.Fit(trainDataset);
            // var x = model.Transform(trainDataset).Preview();
            Console.WriteLine(DateTime.Now);
            // mlContext.Model.Save(model, trainDataset.Schema, outModelPath);
            Console.WriteLine(DateTime.Now);
            var eval = mlContext.MulticlassClassification.CrossValidate(trainDataset, pipeline, 5);
            Console.WriteLine(DateTime.Now);
        }
        public static IEnumerable<ImageData> LoadImagesFromDirectory(string folder, bool useFolderNameasLabel = true)
        {
            var files = Directory.GetFiles(folder, "*",
                searchOption: SearchOption.AllDirectories);
            foreach (var file in files)
            {
                if ((Path.GetExtension(file) != ".jpg") && (Path.GetExtension(file) != ".png"))
                    continue;
                var label = Path.GetFileName(file);
                if (useFolderNameasLabel)
                    label = Directory.GetParent(file).Name;
                else
                {
                    for (int index = 0; index < label.Length; index++)
                    {
                        if (!char.IsLetter(label[index]))
                        {
                            label = label.Substring(0, index);
                            break;
                        }
                    }
                }
                yield return new ImageData()
                {
                    ImagePath = file,
                    Label = label,
                    Image = (Bitmap) Image.FromFile(file)
                };
            }
        }
    }
}
By changing the following line in my code above, I was able to reproduce this issue and got a "System.ArgumentException: Parameter is not valid". I got it even in version 1.5.0-preview2, so this hasn't been fixed.
IDataView fullImagesDataset = mlContext.Data.LoadFromEnumerable(imageSet.ToList());
The difference is that turning the imageSet enumerable into a List makes every pass reuse the same objects, instead of yielding new objects as in my original code. So I will look into this. In the meantime, a workaround is to create an IEnumerable that yields clones of the original images (probably less performant, but it could be a valid workaround):
IDataView fullImagesDataset = mlContext.Data.LoadFromEnumerable(yieldImages(imageSet.ToList()));
with
public static IEnumerable<ImageData> yieldImages(List<ImageData> imageSet)
{
    foreach (var imageData in imageSet)
    {
        yield return new ImageData()
        {
            ImagePath = imageData.ImagePath,
            Label = imageData.Label,
            Image = (Bitmap) imageData.Image.Clone()
        };
    }
}
EDIT: After fixing this in #5056 this workaround isn't needed. Furthermore, I now realize that this workaround would be problematic because nothing disposes the cloned images. To use it, you would have needed to store references to the cloned images and dispose them after running cross validation.
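For versions without the fix, the disposal concern in the EDIT could be handled along these lines. This is only a sketch under assumptions: `ClonedImageSource` and `YieldClones` are hypothetical names that are not part of ML.NET, the code works on bare Bitmap objects rather than the ImageData class from the repro, and it assumes System.Drawing is available (on recent .NET versions that means Windows). It remembers every clone it hands out so that they can all be disposed once training or cross validation has finished.

```csharp
using System;
using System.Collections.Generic;
using System.Drawing;

// Hypothetical helper: yields clones of the original bitmaps and keeps a
// reference to each clone so the caller can dispose them all afterwards.
public sealed class ClonedImageSource : IDisposable
{
    private readonly List<Bitmap> _clones = new List<Bitmap>();

    public IEnumerable<Bitmap> YieldClones(IEnumerable<Bitmap> originals)
    {
        foreach (var original in originals)
        {
            // Image.Clone() returns object, hence the cast.
            var clone = (Bitmap)original.Clone();
            _clones.Add(clone);
            yield return clone;
        }
    }

    public void Dispose()
    {
        // Disposing a Bitmap that was already disposed elsewhere is harmless.
        foreach (var clone in _clones)
        {
            clone.Dispose();
        }
        _clones.Clear();
    }
}
```

Each enumeration of the sequence creates a new set of clones, so the tracked list grows with every pass over the data. That is the performance cost mentioned above, and it is why the source should be disposed as soon as the last fit has completed.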
Also, this isn't a problem specific to cross validation. For example, if I do the following, I get the same exception when fitting the second model:
IDataView trainDataset = mlContext.Data.LoadFromEnumerable(imageSet.ToList());
var model = pipeline.Fit(trainDataset);
var model2 = pipeline.Fit(trainDataset);
The same workaround from my previous post fixes the exception here as well. In general, this issue is about not being able to reuse Image objects when fitting models more than once. Cross validation fits several models on the same data view (just as above), which is why it throws the same exception.
In particular, I get the same exception even when running the code below, which only includes a ResizeImages transformer and doesn't use TensorFlow, any trainer, or cross validation:
var imageSet = LoadImagesFromDirectory(inputDataSetFolder);
var imageList = imageSet.ToList();
IDataView trainDataset = mlContext.Data.LoadFromEnumerable(imageList);
var pipeline =
    mlContext.Transforms.Conversion.MapValueToKey("Label")
    .Append(mlContext.Transforms.ResizeImages(outputColumnName: "image_object_resized", imageWidth: 299, imageHeight: 299, inputColumnName: "Image"))
    ;
var model1 = pipeline.Fit(trainDataset);
var prev11 = model1.Transform(trainDataset).Preview();
imageList[0].Image.Save(@"C:\Users\anvelazq\Desktop\is4126\saved.jpg"); // I can still access the original Image objects
var model2 = pipeline.Fit(trainDataset);
var prev2 = model2.Transform(trainDataset).Preview(); // Exception
Notice that if I access the image objects inside imageList after fitting model1, I can do so without problems, so the objects themselves are still there. The problem appears when accessing them by applying a ResizeImages transformer again. Again, the same workaround of yielding clones fixes this.
The problem is in the disposer created in machinelearning/src/Microsoft.ML.ImageAnalytics/ImageResizer.cs (lines 283 to 291 in 919bc8b).
If I comment out the content of the disposer so that it's empty, the exceptions go away in the repros I've provided (both when using cross validation and when fitting a pipeline twice). Still, I'd have to investigate whether that disposer is truly necessary, and how to avoid using it when a user wants to use in-memory images to train multiple pipelines and/or run cross validation on them. It is still unclear to me why in my last post I was still able to call
EDIT: It turns out the disposer is only called once, after all the inputs have been processed, so it only disposes the last image. Because of this I could call imageList[0].Image.Save() in my previous post, but I now realize I couldn't have accessed the last image in that list. This means the exception in this issue is caused only by disposing the last image of the input dataset, which then can't be reused when refitting.
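The behaviour described in the EDIT can be illustrated without ML.NET at all. The sketch below is not the ImageResizer code: `DisposedBitmapDemo` and `ThrowsOnAccess` are hypothetical names, and it assumes System.Drawing is available (Windows on recent .NET versions). It disposes only the last bitmap of a small list, mimicking a disposer that runs once after the final input, and then checks which bitmaps can still be read.

```csharp
using System;
using System.Drawing;

public static class DisposedBitmapDemo
{
    // Returns true if reading the bitmap's Width throws ArgumentException,
    // which is how GDI+ reports use of an already disposed image.
    public static bool ThrowsOnAccess(Bitmap bitmap)
    {
        try
        {
            _ = bitmap.Width;
            return false;
        }
        catch (ArgumentException)
        {
            return true;
        }
    }

    public static void Main()
    {
        var images = new[] { new Bitmap(4, 4), new Bitmap(4, 4), new Bitmap(4, 4) };

        // Mimic a disposer that only runs once, after the last input was processed.
        images[images.Length - 1].Dispose();

        for (int i = 0; i < images.Length; i++)
        {
            Console.WriteLine($"image {i}: unusable = {ThrowsOnAccess(images[i])}");
        }
    }
}
```

Only the last entry should be reported as unusable, which matches the observation that imageList[0] could still be saved while a second pass over the whole list failed.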
System information
Issue
I had a working pipeline for training image classification with cross-validation on the previous ML.NET version, using file paths as input. Now that Bitmaps can be loaded directly, I am trying to set up a similar pipeline that allows training and predictions from in-memory bitmaps.
The training works if I just Fit the data,
ITransformer mlModel = pipeline.Fit(trainData);
but it fails if I try to use CrossValidate
var cvResults = _mlContext.MulticlassClassification.CrossValidate(trainData, pipeline, numberOfFolds);
I expected a pipeline that worked with Fit to work with CrossValidate, but it seems the internal multiple passes do something to the Bitmaps (they lose data).
Source code / logs
My current pipeline, based on this sample is this:
The error log includes the following exceptions:
This is my first issue here, and I apologize if I overlooked something. I found no posts about this error anywhere.