Closed
Description
I'm trying out the sample shown here. However, whenever I try to train the model I get an error: "The size of input lines is not consistent". This is using the exact files that are specified in the tutorial so I'm not sure where I'm going wrong - any ideas?
#r "netstandard"
#load @"C:\Users\Isaac\Source\Repos\scratchpad\.paket\load\netstandard2.0\ML\ml.group.fsx"
open Microsoft.ML
open Microsoft.ML.Runtime.Api
open Microsoft.ML.Transforms
open Microsoft.ML.Trainers
let dataPath = @"data\imdb_labelled.txt"
let testDataPath = @"data\yelp_labelled.txt"
type SentimentData =
    { [<Column(ordinal = "0")>] SentimentText : string
      [<Column(ordinal = "1", name = "Label")>] Sentiment : float }
[<CLIMutable>]
type SentimentPrediction =
    { [<ColumnName "PredictedLabel">] Sentiment : bool }
let pipeline = LearningPipeline()
pipeline.Add(TextLoader<SentimentData>(dataPath, useHeader = false, separator = "tab"))
pipeline.Add(TextFeaturizer("Features", "SentimentText"))
pipeline.Add(FastTreeBinaryClassifier(NumLeaves = 5, NumTrees = 5, MinDocumentsInLeafs = 2))
/// Pop!
let model = pipeline.Train<SentimentData, SentimentPrediction>()
eerhardt commented on May 9, 2018
Some notes:
machinelearning/src/Microsoft.ML.Data/DataLoadSave/Text/TextLoader.cs
Lines 507 to 508 in c023727
LF for line endings. Opening in VS prompts me that the line endings are not consistent and asks if I want to normalize the line endings.
@isaacabraham - are you on Windows? If so, can you open the imdb_labelled.txt file in VS and "normalize" the line endings to Windows (CR LF) and see if that fixes the problem?
eerhardt commented on May 9, 2018
The other interesting thing I've noted about this data set is that it has 6 formatting errors:
Looking at the data - there are unmatched double quotes (") in the lines, for example:
It is working for me on .NET Core 2.0. Another thing to try is to scrub these 6 formatting errors out of the file by removing the unmatched double quotes.
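To locate candidate lines before scrubbing them, a short F# script can flag every line that contains an odd number of double quotes. This is only a heuristic sketch: the function names are made up for illustration, and an odd quote count is an assumption about what "unmatched" means here, not a description of how TextLoader parses quotes.

```fsharp
open System.IO

/// True when a line contains an odd number of double-quote characters.
let hasUnmatchedQuote (line : string) =
    (line |> Seq.filter (fun c -> c = '"') |> Seq.length) % 2 = 1

/// Return (1-based line number, line) pairs for lines with an odd quote count.
let findUnmatchedQuotes (lines : seq<string>) =
    lines
    |> Seq.mapi (fun i line -> i + 1, line)
    |> Seq.filter (fun (_, line) -> hasUnmatchedQuote line)
    |> Seq.toList

// Usage against the tutorial file (path taken from the script above):
// File.ReadLines @"data\imdb_labelled.txt"
// |> findUnmatchedQuotes
// |> List.iter (fun (n, line) -> printfn "%d: %s" n line)
```

The flagged lines can then be fixed by hand, which keeps the change to the data set small and reviewable.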
isaacabraham commented on May 9, 2018
Yes, Windows here! I've just tried it in VS2017 (previously I was using Code) and have normalised the line endings. Now I get a completely different error:
Is there any runtime reflection / lookups for this stuff? It all compiles in the script file - just when that Train method is called, it goes pop.
isaacabraham commented on May 9, 2018
Note - even with the normalised file I still get that error in Code.
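For anyone who wants to rule out editor differences, the line endings can also be normalised from a script instead of through the VS prompt. The following is a minimal sketch; the helper names are hypothetical, and it assumes the file is small enough to read into memory and is UTF-8 encoded.

```fsharp
open System.IO

/// Convert every line ending (CR LF, lone CR, or LF) to CR LF.
let normalizeToCrLf (text : string) =
    text.Replace("\r\n", "\n").Replace("\r", "\n").Replace("\n", "\r\n")

/// Rewrite a file in place with consistent CR LF line endings.
let normalizeFile (path : string) =
    let text = File.ReadAllText path
    File.WriteAllText(path, normalizeToCrLf text)

// Usage against the tutorial file (path taken from the script above):
// normalizeFile @"data\imdb_labelled.txt"
```

Collapsing everything to LF first and then expanding to CR LF avoids doubling the carriage return on lines that were already in Windows format.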
eerhardt commented on May 9, 2018
(Sorry, I'm not even a novice in F#) Can you show what is in C:\Users\Isaac\Source\Repos\scratchpad\.paket\load\netstandard2.0\ML\ml.group.fsx?
eerhardt commented on May 9, 2018
Yes, ML.NET uses a "catalog" of components, which are discovered and invoked using reflection. See
machinelearning/src/Microsoft.ML.Core/ComponentModel/ComponentCatalog.cs
Lines 399 to 414 in c023727
zeahmed commented on May 9, 2018
You can set "allowQuotedStrings = false" in TextLoader. The text columns are unquoted in all but a few examples, and those few quoted lines sometimes cause the "The size of input lines is not consistent" error.
isaacabraham commented on May 9, 2018
@zeahmed Thanks - unfortunately changing to that gives a different error: Source column 'SentimentText' not found.
@eerhardt no problem. The file is generated by Paket to load in all the assemblies required as dependencies from the ML library. Here's what it contains:
helloguo commented on May 10, 2018
@eerhardt I also see the Warnings, which are expected based on dotnet/docs#5256 (comment).
I was wondering if there is a way not to show the Warnings on the console?
isaacabraham commented on May 12, 2018
In C# does this sample then work? Or is it the same issue with the sample data file?
zeahmed commented on May 13, 2018
Yes, a working example is here: dotnet/docs#5330
Get rid of all the data loading warning messages by setting "allowQuotedStrings: false".
isaacabraham commented on May 19, 2018
Small update here. I have managed to get this working within a console application by also removing the use of records and replacing them with mutable classes. This is - from an F# perspective - undesirable but at least it's a starting point.
I'm still unable to get it to work from a script however, which is very important in my opinion from a data analysis point of view (@mathias-brandewinder can probably elaborate on the rationale better than I can. Or probably any Python machine learning person...). The error I'm now seeing is:
eerhardt commented on May 21, 2018
The error you are getting is caused by the runtime not finding the "native" assemblies that are used by ML.NET. These assemblies are in the NuGet package under the runtimes/win-x64/native folder. When you use <PackageReference> in an MSBuild project, NuGet will automatically pull the correct native assets into your app's runtime directory.
We had a similar problem as above when using packages.config, because NuGet doesn't automatically pull these native assets. So instead, we had to manually do it in the NuGet package when the project is using packages.config. See #165 that fixed this.
I don't have any real experience with the F# scripting tooling. How does it normally handle native (C++) assemblies contained in a NuGet package? Is there something we can/should do in the NuGet package? Or are native assemblies from a NuGet package not supported in F# scripting?
isaacabraham commented on May 22, 2018
@eerhardt That helped, and I have it working now. There are a few ways of doing this - the issue is that the F# Interactive process (FSI.exe) can't see the native dlls in any path / probing folder by default so it can't find them. F# scripts do have the ability to add a folder / path to probing using the #I directive, but this only works for .NET assemblies.
The most "fully featured" answer I found to this was here http://christoph.ruegg.name/blog/loading-native-dlls-in-fsharp-interactive.html. By adding the path to the native dlls before running the model, I got it to work.
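One way to add the native folder to the search path from inside a script is to prepend it to the PATH of the running FSI process. The snippet below is an illustrative sketch rather than the exact code used in this thread: the helper name and the example folder are hypothetical, and it assumes Windows, where the loader consults PATH when resolving native DLLs.

```fsharp
open System
open System.IO

/// Prepend a directory to the PATH of the current process only,
/// so that native DLLs stored there can be found when they are loaded.
let addToProcessPath (dir : string) =
    let current = Environment.GetEnvironmentVariable "PATH"
    let current = if isNull current then "" else current
    let updated = dir + string Path.PathSeparator + current
    Environment.SetEnvironmentVariable("PATH", updated)

// Hypothetical location: point this at wherever the package's
// runtimes/win-x64/native folder lives on your machine, and call it
// before pipeline.Train is invoked.
// addToProcessPath @"C:\path\to\packages\Microsoft.ML\runtimes\win-x64\native"
```

The change only affects the current process, so it has to run in each new FSI session before the first call that touches native code.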
Unfortunately this is not especially easy to figure out. I've seen a similar issue recently with CosmosDB using some native assemblies - they aren't particularly easy to work with.
Regarding NuGet etc. - the main NuGet tooling is, to be honest, a dead loss from the point of view of F# scripting - you need some form of msbuild project file to mark your dependencies, and there's no easy way to reference the assemblies anyway, which is one of the reasons why many F# developers use Paket instead. Paket already supports the ability to generate a "load dependencies" file for scripts (as seen in my earlier post here) but it doesn't know about native dlls. @forki do you think that this is something that could be added to Paket's generate load scripts functionality? Are native folders a "proper" thing in NuGet packages?
eerhardt commented on May 22, 2018
Check out https://docs.microsoft.com/en-us/nuget/create-packages/supporting-multiple-target-frameworks#architecture-specific-folders for the docs on the runtimes folder:
isaacabraham commented on May 24, 2018
@eerhardt is there any way not to have to fall back to these native dlls?
eerhardt commented on May 24, 2018
Currently, no, the native assemblies are required.
However, we are exploring/thinking of other options here. The CpuMath assembly is written in C++ because it wants to use SIMD instructions, which were only available in C/C++. With .NET Core 2.1, these SIMD instructions are available through .NET APIs. We could replace the CpuMath assembly with .NET code that uses the same instructions. On .NET Framework, we would still require the native assembly in order to use the SIMD instructions, because this support is only for .NET Core.
Another option/thought here is to provide software fallback methods, which of course would be slower. But the advantage is that it would have wider reach where the SIMD instructions aren't available (for example on ARM processors).
dsyme commented on Jul 16, 2018
Please tag this with "F#" (though it might not be specifically related to F#)
dsyme commented on Jul 30, 2018
After doing #600 I think there is no F#-specific issue remaining here, see #180 for the record issue