Fix bug in TextLoader #3011

Closed
wants to merge 19 commits into from
19 commits
7f4e341
Fix bug in TextLoader
yaeldMS Mar 19, 2019
1a89468
Clean FeatureContributionCalculation and PermutationFeatureImportance…
artidoro Mar 19, 2019
aea88dc
Updating LightGBM Arguments (#2948)
singlis Mar 19, 2019
8b1b14f
Hiding of ColumnOptions (#2959)
artidoro Mar 19, 2019
f03c49d
Updating the FunctionalTests to clearly explain why they are not stro…
eerhardt Mar 19, 2019
00a5b35
Added samples for tree regression trainers. (#2999)
Mar 19, 2019
fd1c700
Cleanup the statistics usage API (#2048)
sfilipi Mar 19, 2019
de5d48a
Refactor cancellation mechanism and make it internal, accessible via …
codemzs Mar 19, 2019
c38f81b
Add functional tests for ONNX scenarios (#2984)
rogancarr Mar 19, 2019
3af9a5d
Make Multiclass Linear Trainers Typed Based on Output Model Types. (#…
wschin Mar 20, 2019
807d813
Clean up the SchemaDefinition class (#2995)
yaeldMS Mar 20, 2019
c8a4c7d
Data catalog done (#3021)
sfilipi Mar 20, 2019
ce56462
Activate OnnxTransform unit tests for MacOS (#2695)
jignparm Mar 20, 2019
e00d19d
Added tests for text featurizer options (Part1). (#3006)
zeahmed Mar 20, 2019
a2d7987
Binary FastTree/Forest samples using T4 templates. (#3035)
Mar 20, 2019
77be9d9
Polish standard trainers' catalog (Just rename some variables) (#3029)
wschin Mar 21, 2019
5b22420
Polish train catalog (renaming only) (#3030)
wschin Mar 21, 2019
ce7f0fb
Merge branch 'tryparseschema' of https://github.com/yaeldekel/machine…
yaeldMS Mar 21, 2019
62dda6f
Add more checks for the syntax of the embedded TextLoader options
yaeldMS Mar 21, 2019
5 changes: 0 additions & 5 deletions src/Microsoft.ML.Data/DataLoadSave/Text/TextLoader.cs
@@ -1291,11 +1291,6 @@ private static bool TryParseSchema(IHost host, IMultiStreamSource files,
if (loader == null || string.IsNullOrWhiteSpace(loader.Name))
goto LDone;

// Make sure the loader binds to us.
var info = host.ComponentCatalog.GetLoadableClassInfo<SignatureDataLoader>(loader.Name);
if (info.Type != typeof(ILegacyDataLoader) || info.ArgType != typeof(Options))
Contributor

info [](start = 20, length = 4)

Could we explain this change? As far as I can tell, failing to find this loader in the component catalog would no longer be treated as an error condition. This may be right, but I'd feel more comfortable with this change if we could explain why.

I did read the attached issue, but it was a little vague on precisely why removing this check is a right and desirable thing to do.

Author

It seems that this check was there in order to make sure that the schema defined in the file has the correct syntax for creating a legacy TextLoader. The schema defined in the file should always be in the format TextLoader{col=... }, since it is generated by the TextSaver. Instead of the check that I deleted, should we check that loader.Name==TextLoader.LoaderSignature?


In reply to: 267105243 [](ancestors = 267105243)

Member

Since it is a text file, might it be manually altered after being saved with the TextSaver?


In reply to: 267424076 [](ancestors = 267424076,267105243)
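
The name check proposed in this thread could look roughly like the sketch below. It is a standalone illustration, not the code in TextLoader.cs: `EmbeddedLoaderCheck`, `TrySplit`, and `IsTextLoaderSpec` are hypothetical helpers, and the constant `"TextLoader"` is assumed to match the value of `TextLoader.LoaderSignature`. Because the file may have been edited by hand, the sketch rejects anything that is not of the form `Name{settings}` instead of throwing.

```csharp
using System;

// Hypothetical helper, not part of ML.NET. It only illustrates the
// "loader.Name == TextLoader.LoaderSignature" idea from the review thread.
public static class EmbeddedLoaderCheck
{
    // Assumption: same value as TextLoader.LoaderSignature.
    public const string ExpectedLoaderName = "TextLoader";

    // Splits a string of the form "Name{settings}" into its two parts.
    // Returns false (without throwing) when the text does not have that shape,
    // which covers a header that was manually altered after saving.
    public static bool TrySplit(string text, out string name, out string settings)
    {
        name = null;
        settings = null;
        if (string.IsNullOrWhiteSpace(text))
            return false;

        text = text.Trim();
        int open = text.IndexOf('{');
        if (open <= 0 || !text.EndsWith("}", StringComparison.Ordinal))
            return false;

        name = text.Substring(0, open).Trim();
        settings = text.Substring(open + 1, text.Length - open - 2).Trim();
        return name.Length > 0;
    }

    // True only when the embedded specification names the expected loader.
    // The comparison here is case-sensitive; that is a choice of this sketch.
    public static bool IsTextLoaderSpec(string text)
    {
        return TrySplit(text, out string name, out _)
            && string.Equals(name, ExpectedLoaderName, StringComparison.Ordinal);
    }
}
```

A caller in the spirit of `TryParseSchema` would treat a `false` result the same way as a missing loader name and fall through to `LDone`, leaving the settings string to be validated by the argument parser.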

goto LDone;

var optionsNew = new Options();
// Set the fields of optionsNew to the arguments parsed from the file.
if (!CmdParser.ParseArguments(host, loader.GetSettingsString(), optionsNew, typeof(Options), msg => ch.Error(msg)))
13 changes: 13 additions & 0 deletions test/Microsoft.ML.Tests/TextLoaderTests.cs
@@ -598,6 +598,19 @@ public void ThrowsExceptionWithPropertyName()
catch (NullReferenceException) { };
}

[Fact]
public void ParseSchemaFromTextFile()
{
var mlContext = new MLContext(seed: 1);
var fileName = GetDataPath(TestDatasets.adult.trainFilename);
var loader = mlContext.Data.CreateTextLoader(new TextLoader.Options(), new MultiFileSource(fileName));
var data = loader.Load(new MultiFileSource(fileName));
Assert.NotNull(data.Schema.GetColumnOrNull("Label"));
Assert.NotNull(data.Schema.GetColumnOrNull("Workclass"));
Assert.NotNull(data.Schema.GetColumnOrNull("Categories"));
Assert.NotNull(data.Schema.GetColumnOrNull("NumericFeatures"));
}

public class QuoteInput
{
[LoadColumn(0)]