
Sample fails with "The size of input lines is not consistent" #92

Closed
isaacabraham opened this issue May 9, 2018 · 19 comments
Labels
F# Support of F# language

Comments

@isaacabraham

isaacabraham commented May 9, 2018

I'm trying out the sample shown here. However, whenever I try to train the model I get the error "The size of input lines is not consistent". This is using the exact files specified in the tutorial, so I'm not sure where I'm going wrong - any ideas?

#r "netstandard"
#load @"C:\Users\Isaac\Source\Repos\scratchpad\.paket\load\netstandard2.0\ML\ml.group.fsx"

open Microsoft.ML
open Microsoft.ML.Runtime.Api
open Microsoft.ML.Transforms
open Microsoft.ML.Trainers

let dataPath = @"data\imdb_labelled.txt"
let testDataPath = @"data\yelp_labelled.txt"

type SentimentData =
    { [<Column(ordinal = "0")>] SentimentText : string
      [<Column(ordinal = "1", name = "Label")>] Sentiment : float }

[<CLIMutable>]
type SentimentPrediction =
    { [<ColumnName "PredictedLabel">] Sentiment : bool }

let pipeline = LearningPipeline()
pipeline.Add(TextLoader<SentimentData>(dataPath, useHeader = false, separator = "tab"))
pipeline.Add(TextFeaturizer("Features", "SentimentText"))
pipeline.Add(FastTreeBinaryClassifier(NumLeaves = 5, NumTrees = 5, MinDocumentsInLeafs = 2))

/// Pop!
let model = pipeline.Train<SentimentData, SentimentPrediction>()
@eerhardt
Member

eerhardt commented May 9, 2018

Some notes:

  1. That exception appears to be coming from

         if (min < max)
             throw ch.ExceptUserArg(nameof(Column.Source), "The size of input lines is not consistent");

     which checks that the minimum number of columns across all the lines isn't less than the maximum.
  2. I downloaded the file, and it appears to be using LF for line endings. Opening it in VS prompts me that the line endings are not consistent and asks if I want to normalize them.

@isaacabraham - are you on Windows? If so, can you open the imdb_labelled.txt file in VS and "normalize" the line endings to Windows (CR LF) and see if that fixes the problem?
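
If it's easier to do that step in code rather than in VS, here is a rough F# sketch of the same normalisation (the relative path is simply the one from the script above):

open System.IO

// Re-read and re-write the file: ReadAllLines splits on LF or CR LF,
// and WriteAllLines writes the platform newline (CR LF on Windows).
let normaliseLineEndings (path : string) =
    let lines = File.ReadAllLines path
    File.WriteAllLines(path, lines)

normaliseLineEndings @"data\imdb_labelled.txt"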

@eerhardt
Member

eerhardt commented May 9, 2018

The other interesting thing I've noted about this data set is that it has 6 formatting errors:

Warning: Format error at (20,1)-(20,98): Illegal quoting
Warning: Format error at (323,1)-(323,17): Illegal quoting
Warning: Format error at (348,1)-(348,102): Illegal quoting
Warning: Format error at (197,1)-(197,21): Illegal quoting
Warning: Format error at (213,1)-(213,117): Illegal quoting
Warning: Format error at (845,1)-(845,101): Illegal quoting

Looking at the data - there are unmatched double quotes " in the lines, for example:

" The structure of this film is easily the most tightly constructed in the history of cinema.  	1

It is working for me on .NET Core 2.0. Another thing to try is to scrub these 6 formatting errors out of the file by removing the unmatched double quotes.
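
As a rough sketch, scrubbing those quotes in code might look like this (it simply drops every double quote, since they carry no meaning in this data set; the path is the one from the script above):

open System.IO

// Remove all double quotes so that no line starts an unterminated quoted field.
let scrubQuotes (path : string) =
    File.ReadAllLines path
    |> Array.map (fun line -> line.Replace("\"", ""))
    |> fun cleaned -> File.WriteAllLines(path, cleaned)

scrubQuotes @"data\imdb_labelled.txt"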

@isaacabraham
Author

isaacabraham commented May 9, 2018

Yes, Windows here! I've just tried it in VS2017 (previously I was using Code) and have normalised the line endings. Now I get a completely different error:

System.InvalidOperationException: Entry point 'Transforms.TextFeaturizer' not found
   at Microsoft.ML.Runtime.EntryPoints.EntryPointNode..ctor(IHostEnvironment env, ModuleCatalog moduleCatalog, RunContext context, String id, String entryPointName, JObject inputs, JObject outputs, Boolean checkpoint, String stageId, Single cost)
   at Microsoft.ML.Runtime.EntryPoints.EntryPointNode.ValidateNodes(IHostEnvironment env, RunContext context, JArray nodes, ModuleCatalog moduleCatalog)
   at Microsoft.ML.Runtime.EntryPoints.EntryPointGraph..ctor(IHostEnvironment env, ModuleCatalog moduleCatalog, JArray nodes)
   at Microsoft.ML.Runtime.Experiment.Compile()
   at Microsoft.ML.LearningPipeline.Train[TInput,TOutput]()
   at <StartupCode$FSI_0010>.$FSI_0010.main@() in C:\Users\Isaac\Source\Repos\scratchpad\ml.fsx:line 25
Stopped due to error

Is there any runtime reflection / lookups for this stuff? It all compiles in the script file - just when that Train method is called, it goes pop.

@isaacabraham
Author

Note - even with the normalised file I still get that error in Code.

@eerhardt
Member

eerhardt commented May 9, 2018

(Sorry, I'm not even a novice in F#) Can you show what is in C:\Users\Isaac\Source\Repos\scratchpad\.paket\load\netstandard2.0\ML\ml.group.fsx?

@eerhardt
Member

eerhardt commented May 9, 2018

Is there any runtime reflection / lookups for this stuff?

Yes, ML.NET uses a "catalog" of components, which are discovered and invoked using reflection. See

foreach (Assembly a in AppDomain.CurrentDomain.GetAssemblies())
{
    // Ignore dynamic assemblies.
    if (a.IsDynamic)
        continue;

    _assemblyQueue.Enqueue(a);
    if (!_loadedAssemblies.TryAdd(a.FullName, a))
    {
        // Duplicate loading.
        Console.Error.WriteLine("Duplicate loaded assembly '{0}'", a.FullName);
    }
}

// Load all assemblies in our directory.
var moduleName = typeof(ComponentCatalog).Module.FullyQualifiedName;
for reference.

@zeahmed
Contributor

zeahmed commented May 9, 2018

You can set "allowQuotedStrings = false" in TextLoader. I see that the text columns are unquoted in most examples, with stray quotes in only a few. That sometimes causes the "The size of input lines is not consistent" error.
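
Applied to the F# script above, that would be a one-line change to the loader call (assuming allowQuotedStrings is an optional constructor parameter here, mirroring the C# "allowQuotedStrings: false" mentioned later in the thread):

pipeline.Add(TextLoader<SentimentData>(dataPath,
                                       useHeader = false,
                                       separator = "tab",
                                       // assumed parameter name, matching the C# example
                                       allowQuotedStrings = false))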

@isaacabraham
Author

@zeahmed Thanks - unfortunately changing to that gives a different error: Source column 'SentimentText' not found.

@eerhardt no problem. The file is generated by Paket to load in all the assemblies required as dependencies from the ML library. Here's what it contains:

#r "../../../../../../../.nuget/packages/microsoft.ml/0.1.0/lib/netstandard2.0/Microsoft.ML.Core.dll" 
#r "../../../../../../../.nuget/packages/system.reflection.emit.lightweight/4.3.0/lib/netstandard1.3/System.Reflection.Emit.Lightweight.dll" 
#r "../../../../../../../.nuget/packages/system.reflection.emit.ilgeneration/4.3.0/lib/netstandard1.3/System.Reflection.Emit.ILGeneration.dll" 
#r "../../../../../../../.nuget/packages/google.protobuf/3.5.1/lib/netstandard1.0/Google.Protobuf.dll" 
#r "../../../../../../../.nuget/packages/newtonsoft.json/11.0.2/lib/netstandard2.0/Newtonsoft.Json.dll" 
#r "../../../../../../../.nuget/packages/system.codedom/4.4.0/lib/netstandard2.0/System.CodeDom.dll" 
#r "../../../../../../../.nuget/packages/system.threading.tasks.dataflow/4.8.0/lib/netstandard2.0/System.Threading.Tasks.Dataflow.dll" 
#r "../../../../../../../.nuget/packages/microsoft.ml/0.1.0/lib/netstandard2.0/Microsoft.ML.UniversalModelFormat.dll" 
#r "../../../../../../../.nuget/packages/microsoft.ml/0.1.0/lib/netstandard2.0/Microsoft.ML.Maml.dll" 
#r "../../../../../../../.nuget/packages/microsoft.ml/0.1.0/lib/netstandard2.0/Microsoft.ML.InternalStreams.dll" 
#r "../../../../../../../.nuget/packages/microsoft.ml/0.1.0/lib/netstandard2.0/Microsoft.ML.CpuMath.dll" 
#r "../../../../../../../.nuget/packages/microsoft.ml/0.1.0/lib/netstandard2.0/Microsoft.ML.Data.dll" 
#r "../../../../../../../.nuget/packages/microsoft.ml/0.1.0/lib/netstandard2.0/Microsoft.ML.Transforms.dll" 
#r "../../../../../../../.nuget/packages/microsoft.ml/0.1.0/lib/netstandard2.0/Microsoft.ML.ResultProcessor.dll" 
#r "../../../../../../../.nuget/packages/microsoft.ml/0.1.0/lib/netstandard2.0/Microsoft.ML.PCA.dll" 
#r "../../../../../../../.nuget/packages/microsoft.ml/0.1.0/lib/netstandard2.0/Microsoft.ML.KMeansClustering.dll" 
#r "../../../../../../../.nuget/packages/microsoft.ml/0.1.0/lib/netstandard2.0/Microsoft.ML.FastTree.dll" 
#r "../../../../../../../.nuget/packages/microsoft.ml/0.1.0/lib/netstandard2.0/Microsoft.ML.Api.dll" 
#r "../../../../../../../.nuget/packages/microsoft.ml/0.1.0/lib/netstandard2.0/Microsoft.ML.Sweeper.dll" 
#r "../../../../../../../.nuget/packages/microsoft.ml/0.1.0/lib/netstandard2.0/Microsoft.ML.dll" 
#r "../../../../../../../.nuget/packages/microsoft.ml/0.1.0/lib/netstandard2.0/Microsoft.ML.StandardLearners.dll" 
#r "../../../../../../../.nuget/packages/microsoft.ml/0.1.0/lib/netstandard2.0/Microsoft.ML.PipelineInference.dll" 
#r "System" 
#r "System.ComponentModel.Composition" 
#r "System.Core"

@helloguo

@eerhardt

The other interesting thing I've noted about this data set is that it has 6 formatting errors:

I also see the Warnings, which are expected based on dotnet/docs#5256 (comment).

I was wondering whether there is a way to not show the warnings on the console?

@isaacabraham
Author

In C#, does this sample then work? Or is it the same issue with the sample data file?

@zeahmed
Contributor

zeahmed commented May 13, 2018

Yes, a working example is here: dotnet/docs#5330
Get rid of all the data-loading warning messages by setting "allowQuotedStrings: false".

@isaacabraham
Author

Small update here. I have managed to get this working within a console application by also removing the use of records and replacing them with mutable classes. This is - from an F# perspective - undesirable but at least it's a starting point.
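
The thread doesn't show the exact classes used, but the shape of that workaround is roughly this (a sketch only: plain classes with public mutable fields carrying the same Column attributes as the earlier records):

open Microsoft.ML.Runtime.Api

// Sketch of the class-based workaround; field names and attributes mirror the record definitions above.
type SentimentData() =
    [<DefaultValue; Column(ordinal = "0")>]
    val mutable SentimentText : string
    [<DefaultValue; Column(ordinal = "1", name = "Label")>]
    val mutable Sentiment : float

type SentimentPrediction() =
    [<DefaultValue; ColumnName "PredictedLabel">]
    val mutable Sentiment : bool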

I'm still unable to get it to work from a script, however, which in my opinion is very important from a data-analysis point of view (@mathias-brandewinder can probably explain the rationale for why this is better than I can - or probably any Python machine learning person could...). The error I'm now seeing is:

Binding session to 'c:\Users\Isaac\Source\Repos\scratchpad\.paket\load\netstandard2.0\ML\../../../../../../../.nuget/packages/newtonsoft.json/11.0.2/lib/netstandard2.0/Newtonsoft.Json.dll'...
Not adding a normalizer.
Making per-feature arrays
Changing data from row-wise to column-wise
System.Reflection.TargetInvocationException: Exception has been thrown by the target of an invocation. ---> System.Reflection.TargetInvocationException: Exception has been thrown by the target of an invocation. ---> System.InvalidOperationException: Splitter/consolidator worker encountered exception while consuming source data ---> System.DllNotFoundException: Unable to load DLL 'CpuMathNative': The specified module could not be found. (Exception from HRESULT: 0x8007007E)
   at Microsoft.ML.Runtime.Internal.CpuMath.Thunk.SumSqU(Single* ps, Int32 c)
   at Microsoft.ML.Runtime.Data.LpNormNormalizerTransform.<>c__DisplayClass27_0.<GetGetterCore>b__5(VBuffer`1& dst)
   at Microsoft.ML.Runtime.Data.ConcatTransform.<>c__DisplayClass36_0`1.<MakeGetter>b__0(VBuffer`1& dst)
   at Microsoft.ML.Runtime.Data.DataViewUtils.Splitter.Consolidator.<>c__DisplayClass4_1.<ConsolidateCore>b__2()
   --- End of inner exception stack trace ---
   at Microsoft.ML.Runtime.Data.DataViewUtils.Splitter.Batch.SetAll(OutPipe[] pipes)
   at Microsoft.ML.Runtime.Data.DataViewUtils.Splitter.Cursor.MoveNextCore()
   at Microsoft.ML.Runtime.Data.RootCursorBase.MoveNext()
   at Microsoft.ML.Runtime.Training.TrainingCursorBase.MoveNext()
   at Microsoft.ML.Runtime.FastTree.DataConverter.MemImpl.MakeBoundariesAndCheckLabels(Int64& missingInstances, Int64& totalInstances)
   at Microsoft.ML.Runtime.FastTree.DataConverter.MemImpl..ctor(RoleMappedData data, IHost host, Double[][] binUpperBounds, Single maxLabel, Boolean dummy, Boolean noFlocks, PredictionKind kind, Int32[] categoricalFeatureIndices, Boolean categoricalSplit)
   at Microsoft.ML.Runtime.FastTree.DataConverter.Create(RoleMappedData data, IHost host, Int32 maxBins, Single maxLabel, Boolean diskTranspose, Boolean noFlocks, Int32 minDocsPerLeaf, PredictionKind kind, IParallelTraining parallelTraining, Int32[] categoricalFeatureIndices, Boolean categoricalSplit)
   at Microsoft.ML.Runtime.FastTree.ExamplesToFastTreeBins.FindBinsAndReturnDataset(RoleMappedData data, PredictionKind kind, IParallelTraining parallelTraining, Int32[] categoricalFeaturIndices, Boolean categoricalSplit)
   at Microsoft.ML.Runtime.FastTree.FastTreeTrainerBase`2.ConvertData(RoleMappedData trainData)
   at Microsoft.ML.Runtime.FastTree.FastTreeBinaryClassificationTrainer.Train(RoleMappedData trainData)
   --- End of inner exception stack trace ---
   at System.RuntimeMethodHandle.InvokeMethod(Object target, Object[] arguments, Signature sig, Boolean constructor)
   at System.Reflection.RuntimeMethodInfo.UnsafeInvokeInternal(Object obj, Object[] parameters, Object[] arguments)
   at System.Reflection.RuntimeMethodInfo.Invoke(Object obj, BindingFlags invokeAttr, Binder binder, Object[] parameters, CultureInfo culture)
   at Microsoft.ML.Runtime.Data.TrainUtils.TrainCore(IHostEnvironment env, IChannel ch, RoleMappedData data, ITrainer trainer, String name, RoleMappedData validData, ICalibratorTrainer calibrator, Int32 maxCalibrationExamples, Nullable`1 cacheData, IPredictor inpPredictor)
   at Microsoft.ML.Runtime.EntryPoints.LearnerEntryPointsUtils.Train[TArg,TOut](IHost host, TArg input, Func`1 createTrainer, Func`1 getLabel, Func`1 getWeight, Func`1 getGroup, Func`1 getName, Func`1 getCustom, ICalibratorTrainerFactory calibrator, Int32 maxCalibrationExamples)
   at Microsoft.ML.Runtime.FastTree.FastTree.TrainBinary(IHostEnvironment env, Arguments input)
   --- End of inner exception stack trace ---
   at System.RuntimeMethodHandle.InvokeMethod(Object target, Object[] arguments, Signature sig, Boolean constructor)
   at System.Reflection.RuntimeMethodInfo.UnsafeInvokeInternal(Object obj, Object[] parameters, Object[] arguments)
   at System.Reflection.RuntimeMethodInfo.Invoke(Object obj, BindingFlags invokeAttr, Binder binder, Object[] parameters, CultureInfo culture)
   at Microsoft.ML.Runtime.EntryPoints.EntryPointNode.Run()
   at Microsoft.ML.Runtime.EntryPoints.EntryPointGraph.RunNode(EntryPointNode node)
   at Microsoft.ML.Runtime.EntryPoints.JsonUtils.GraphRunner.RunAllNonMacros()
   at Microsoft.ML.Runtime.EntryPoints.JsonUtils.GraphRunner.RunAll()
   at Microsoft.ML.LearningPipeline.Train[TInput,TOutput]()
   at <StartupCode$FSI_0010>.$FSI_0010.main@() in c:\Users\Isaac\Source\Repos\scratchpad\ml.fsx:line 25 

@eerhardt
Member

The error you are getting is caused by the runtime not finding the "native" assemblies used by ML.NET. These assemblies sit under the runtimes/win-x64/native folder of the NuGet package. When you use <PackageReference> in an MSBuild project, NuGet automatically pulls the correct native assets into your app's runtime directory.

We had a similar problem when using packages.config, because NuGet doesn't automatically pull these native assets there; instead, we had to do it manually in the NuGet package when the project uses packages.config. See #165, which fixed this.

I don't have any real experience with the F# scripting tooling. How does it normally handle native (C++) assemblies contained in a NuGet package? Is there something we can/should do in the NuGet package? Or are native assemblies from a NuGet package not supported in F# scripting?

@isaacabraham
Author

@eerhardt That helped, and I have it working now. There are a few ways of doing this - the issue is that the F# Interactive process (FSI.exe) doesn't have the native dlls on any path / probing folder by default, so it can't find them. F# scripts can add a folder to the probing path using the #I directive, but that only works for .NET assemblies.

The most "fully featured" answer I found to this was here: http://christoph.ruegg.name/blog/loading-native-dlls-in-fsharp-interactive.html. By adding the path to the native dlls before running the model, I got it to work, i.e.

open System

// Append the folder containing ML.NET's native binaries (e.g. CpuMathNative.dll)
// to the process PATH so that FSI can resolve them at run time.
let nativeDirectory = @"C:\Users\Isaac\.nuget\packages\microsoft.ml\0.1.0\runtimes\win-x64\native"
Environment.SetEnvironmentVariable("Path", Environment.GetEnvironmentVariable("Path") + ";" + nativeDirectory)

Unfortunately this is not especially easy to figure out. I've seen a similar issue recently with CosmosDB using some native assemblies - they aren't particularly easy to work with.

Regarding NuGet etc. - the main NuGet tooling is, to be honest, a dead loss from the point of view of F# scripting: you need some form of MSBuild project file to declare your dependencies, and there's no easy way to reference the assemblies anyway, which is one of the reasons many F# developers use Paket instead. Paket can already generate a "load dependencies" file for scripts (as seen in my earlier post here), but it doesn't know about native dlls. @forki do you think this is something that could be added to Paket's generate-load-scripts functionality? Are native folders a "proper" thing in NuGet packages?

@eerhardt
Member

Are native folders a "proper" thing in NuGet packages?

Check out https://docs.microsoft.com/en-us/nuget/create-packages/supporting-multiple-target-frameworks#architecture-specific-folders for the docs on the runtimes folder:

If you have architecture-specific assemblies, that is, separate assemblies that target ARM, x86, and x64, you must place them in a folder named runtimes within sub-folders named {platform}-{architecture}\lib\{framework} or {platform}-{architecture}\native

@isaacabraham
Author

@eerhardt is there any way not to have to fall back to these native dlls?

@eerhardt
Member

Currently, no, the native assemblies are required.

However, we are exploring/thinking of other options here. The CpuMath assembly is written in C++ because it wants to use SIMD instructions, which were only available in C/C++. With .NET Core 2.1, these SIMD instructions are available through .NET APIs. We could replace the CpuMath assembly with .NET code that uses the same instructions. On .NET Framework, we would still require the native assembly in order to use the SIMD instructions, because this support is only for .NET Core.
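
To give a sense of what "through .NET APIs" can look like, here is an illustrative managed sketch (not ML.NET's code) of a sum-of-squares built on System.Numerics.Vector, which the JIT compiles to SIMD where the hardware supports it - the kind of API that could stand in for the native SumSqU call in the stack trace above:

open System.Numerics

// Process the array in SIMD-sized chunks, then handle the leftover tail scalars.
let sumSq (values : float32[]) =
    let width = Vector<float32>.Count
    let mutable acc = Vector<float32>.Zero
    let mutable i = 0
    while i + width <= values.Length do
        let v = Vector<float32>(values, i)
        acc <- acc + v * v
        i <- i + width
    // Horizontal add of the vector lanes.
    let mutable total = Vector.Dot(acc, Vector<float32>.One)
    for j in i .. values.Length - 1 do
        total <- total + values.[j] * values.[j]
    total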

Another option/thought here is to provide software fallback methods, which of course would be slower. But the advantage is that it would have wider reach where the SIMD instructions aren't available (for example on ARM processors).

@dsyme
Contributor

dsyme commented Jul 16, 2018

Please tag this with "F#" (though it might not be specifically related to F#)

@eerhardt added the F# (Support of F# language) label on Jul 16, 2018
@dsyme
Contributor

dsyme commented Jul 30, 2018

After doing #600 I think there is no F#-specific issue remaining here; see #180 for the record issue.

@dsyme closed this as completed Jul 30, 2018
@ghost locked as resolved and limited conversation to collaborators Mar 31, 2022