Skip to content

Change Default Settings in TextLoader #2630

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 12 commits into from
Feb 22, 2019
Merged

Conversation

wschin
Copy link
Member

@wschin wschin commented Feb 19, 2019

To fix #2576, this PR makes the default to sparse- everywhere including both of command line and public APIs.

@wschin wschin added the API Issues pertaining the friendly API label Feb 19, 2019
@wschin wschin self-assigned this Feb 19, 2019
@@ -395,7 +396,7 @@ public sealed class Options : ArgumentsCore
internal static class DefaultArguments
{
internal const bool AllowQuoting = true;
Copy link
Contributor

@TomFinley TomFinley Feb 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AllowQuoting [](start = 32, length = 12)

I feel like this needs to be changed too. That is, quoting is not a good option to have on by default. #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Absolutely. Please take a look at Iteration 3.


In reply to: 258240404 [](ancestors = 258240404)

Copy link
Contributor

@justinormont justinormont Feb 20, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like having quoting on by default is better. Mainly this helps with CSV files w/ text columns (commas in the quoted strings).

row needs quotes
true, "Schrödinger's cat walks into a bar, and doesn't." yes
false, Wanted: Schrodinger's cat. Dead and alive. no

I see more datasets with gains from default on quoting, then hurt by it. #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to have non-standard settings to be non-default. There is no standard on parsing quoted strings.


In reply to: 258341863 [](ancestors = 258341863)

Copy link
Contributor

@justinormont justinormont Feb 20, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear, my recommendation is to keep AllowQuoting defaulting to true.

General reason is that it more datasets will work. The impact of the parameter falsely on, is less than falsely off. The main non-standard part is escaping methods, having quoting is standard.

You'll see others default to quoting:

/cc @TomFinley: thoughts? #Resolved

Copy link
Contributor

@justinormont justinormont Feb 20, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is incorrect. Exactly zero of the issues I ran into would be solved by turning off quoting.

The issue was our quoting support did not go far enough. I main issue I faced was that we haven't added support for newlines in the quoted string. I believe the argument at the time was that we treat lines independently for speed.

Most datasets are in CSV/TSV format. We should be supporting the datasets of our users. #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If quote is not fully-functioning, it'd be better to turn it off by default and have another TSC & CSV reader implementation, I guess.


In reply to: 258668384 [](ancestors = 258668384)

Copy link
Contributor

@TomFinley TomFinley Feb 20, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is incorrect. Exactly zero of the issues I ran into would be solved by turning off quoting.

I don't think that's true. I recall on multiple occasions seeing you run experiments where you were getting these warnings and you just sort of ignored it, even though it was probably corrupting your dataset. It definitely wasn't beneficial on the whole. It isn't a matter of it "not go far enough," because it wasn't intended to do that at any point. That the CSV format is not self synchronizing (due to multi-line issue) and therefore inappropriate for distributed applications, and therefore we did not elect to use it as our primary format, has been explained in other forums. However it again reinforces the point that this is not a CSV reader, which I believe I also mentioned in my reply. I don't mean to belabor the point. Since it is the premise of the claim that we should continue to support quoting, though, this is why I do not find that argument convincing.

There's also the issue that there are other people in the world, and on the whole if I look through my DRI email list, I see way more instances of turning quoting off than I see instances where someone somehow accidentally stumbled into using our quoting system.

So, off.

Copy link
Contributor

@justinormont justinormont Feb 20, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see this as lesser of two evils.

Default ON:

  • Has the potential to cause read errors (dropped rows) when there are newlines or quotes in the quoted string (generally doesn't matter)
  • Non-text datasets are unaffected

Default OFF:

  • Most text datasets fail (bad)
  • Non-text datasets are unaffected

Let me know if I'm wrong, but I think the argument more directly is:
The user should be forced to mung their dataset in to the form we are expecting (eg: no commas/newlines/escaped-quotes in a string) vs. we should do our best on the data the user hands us (expectation of dirty data).

/cc: @CESARDELATORRE

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me it's fairly simpler than that. If I think of the user questions and scenarios that I've handled with text loader over the years, and I tally up (1) those datasets that conformed to our scheme of quoting and (2) those that did not, there's a really clear tendency. I also think of all the problems it's caused over the years.

IMultiStreamSource dataSample = null)
=> new TextLoader(CatalogUtils.GetEnvironment(catalog), columns, hasHeader, separatorChar, dataSample);
IMultiStreamSource dataSample = null,
bool allowSparse = TextLoader.DefaultArguments.AllowSparse)
Copy link
Contributor

@TomFinley TomFinley Feb 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TextLoader [](start = 31, length = 10)

Consider that since this is the code that needs to be changed for #2452, which you've self-assigned anyway, we may as well correct this while we're at it. (Otherwise I see no point in changing it.) #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't agree more. Proposed solution to #2452 can be found in Iteration 4!


In reply to: 258240786 [](ancestors = 258240786)

Copy link
Contributor

@TomFinley TomFinley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks mostly good, thank you @wschin !! My only high level comments are we should also fix the "quoting" thing as discussed in the issue as well as separate out the text loader/saving arguments, since this touches exactly the code we'd have to address in #2452 anyway.

@codecov
Copy link

codecov bot commented Feb 19, 2019

Codecov Report

❗ No coverage uploaded for pull request base (master@412e1f9). Click here to learn what that means.
The diff coverage is 100%.

@@            Coverage Diff            @@
##             master    #2630   +/-   ##
=========================================
  Coverage          ?   71.58%           
=========================================
  Files             ?      805           
  Lines             ?   142025           
  Branches          ?    16130           
=========================================
  Hits              ?   101675           
  Misses            ?    35910           
  Partials          ?     4440
Flag Coverage Δ
#Debug 71.58% <100%> (?)
#production 67.88% <100%> (?)
#test 85.74% <100%> (?)
Impacted Files Coverage Δ
...soft.ML.TestFramework/DataPipe/TestDataPipeBase.cs 73.76% <100%> (ø)
...est/Microsoft.ML.Predictor.Tests/TestPredictors.cs 63.84% <100%> (ø)
test/Microsoft.ML.Tests/AnomalyDetectionTests.cs 100% <100%> (ø)
test/Microsoft.ML.TimeSeries.Tests/TimeSeries.cs 87.61% <100%> (ø)
test/Microsoft.ML.FSharp.Tests/SmokeTests.fs 96.07% <100%> (ø)
test/Microsoft.ML.Functional.Tests/DataIO.cs 100% <100%> (ø)
...L.Data/DataLoadSave/Text/TextLoaderSaverCatalog.cs 100% <100%> (ø)
...crosoft.ML.TestFramework/BaseTestPredictorsMaml.cs 77.01% <100%> (ø)
...cenariosWithDirectInstantiation/TensorflowTests.cs 91.76% <100%> (ø)
.../Microsoft.ML.Data/DataLoadSave/Text/TextLoader.cs 82.09% <100%> (ø)
... and 8 more

@wschin wschin changed the title [WIP] Change Default Settings in TextLoader Change Default Settings in TextLoader Feb 19, 2019
: this(env, MakeArgs(columns, hasHeader, new[] { separatorChar }), dataSample)
/// <param name="allowSparse">Whether the file can contain numerical vectors in sparse format.</param>
/// <param name="allowQuoting">Whether the file can contain numerical vectors in sparse format.</param>
public TextLoader(IHostEnvironment env, Column[] columns, bool hasHeader = false, char separatorChar = '\t', IMultiStreamSource dataSample = null, bool allowSparse = false, bool allowQuoting = false)
Copy link
Member

@eerhardt eerhardt Feb 20, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new parameters should use the default constants. #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure.


In reply to: 258307575 [](ancestors = 258307575)

@@ -1145,6 +1148,7 @@ private sealed class LoaderHolder
// Verify that the current schema-defining arguments are default.
// Get settings just for core arguments, not everything.
string tmp = CmdParser.GetSettings(host, options, new ArgumentsCore());
tmp = Regex.Replace(tmp, @"[(sparse=\+)|(quote\+)]", "");
Copy link
Member

@eerhardt eerhardt Feb 20, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don’t understand what this is for... can you explain? #Resolved

Copy link
Member Author

@wschin wschin Feb 20, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our test framework throws if any non-default setting found (indicating by an non-empty tmp here). As sparse+ and quote+ become non-default settings and are required in many tests, we need to remove them here. The design doesn't look good to me but I feel it may take a while (e.g., maybe 3-4 days but I am not sure) if we need to refactorize the framework.


In reply to: 258307862 [](ancestors = 258307862)

Copy link
Member Author

@wschin wschin Feb 22, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With changes made by (probably) @artidoro, I have dropped the new regexp. #Resolved

// REVIEW: consider saving a command line in a separate file.
public sealed class Arguments
{
[Argument(ArgumentType.AtMostOnce, HelpText = "Separator", ShortName = "sep")]
public string Separator = "tab";
public string Separator = DefaultArguments.Separator.ToString();

Copy link
Contributor

@artidoro artidoro Feb 20, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that what we had before was correct, I think that '\t'.ToString() simply gives a tab as a string. Here it was meant to be an understandable description of separators. We since moved to separator chars in TextLoader and such, in which case we use the character directly to define the separator. Here I believe we have to keep it like it was "tab". #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok.


In reply to: 258714604 [](ancestors = 258714604)

Copy link
Contributor

@TomFinley TomFinley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @wschin !

@@ -144,20 +148,22 @@ public static IDataView ReadFromTextFile(this DataOperationsCatalog catalog, str
/// <param name="headerRow">Whether to write the header row.</param>
/// <param name="schema">Whether to write the header comment with the schema.</param>
/// <param name="keepHidden">Whether to keep hidden columns in the dataset.</param>
/// <param name="forceDense">Whether to save columns in dense format even if they are sparse vectors.</param>
public static void SaveAsText(this DataOperationsCatalog catalog,
IDataView data,
Stream stream,
char separatorChar = TextLoader.DefaultArguments.Separator,
Copy link
Contributor

@artidoro artidoro Feb 21, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be TextSaver.DefaultArguments.Separator? #Resolved

Copy link
Contributor

@artidoro artidoro Feb 21, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually we should match the signature of TextLoader which uses a separatorChar. If this logic works, then my comment about the need to use "tab" instead of '\t'.ToString() was wrong. Could you double check that the default '\t'.ToString() that you are using here actually works? If not, I feel we should change TextSaver to use a separatorChar.


In reply to: 259067133 [](ancestors = 259067133)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it didn't work, some tests may fail. I agree char is better.


In reply to: 259071246 [](ancestors = 259071246,259067133)

@artidoro
Copy link
Contributor

artidoro commented Feb 21, 2019

Could you also rebase/merge with master as I think I made some conflicting changes to TextLoader. #Resolved

@wschin
Copy link
Member Author

wschin commented Feb 22, 2019

Done. Hopefully that's my last time of doing so for this PR.....


In reply to: 466127204 [](ancestors = 466127204)

@@ -7,6 +7,7 @@
using System.Linq;
using System.Reflection;
using System.Text;
using System.Text.RegularExpressions;
Copy link
Member

@eerhardt eerhardt Feb 22, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't necessary anymore, right? #Resolved

internal TextLoader(IHostEnvironment env, Column[] columns, bool hasHeader = false, char separatorChar = '\t', IMultiStreamSource dataSample = null)
: this(env, MakeArgs(columns, hasHeader, new[] { separatorChar }), dataSample)
/// <param name="allowSparse">Whether the file can contain numerical vectors in sparse format.</param>
/// <param name="allowQuoting">Whether the file can contain numerical vectors in sparse format.</param>
Copy link
Member

@eerhardt eerhardt Feb 22, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

copy-paste error here. I don't think this is what allowQuoting means :) #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Got it changed to /// <param name="allowQuoting">Whether the content of a column can be parsed from a string starting and ending with quote.</param>.


In reply to: 259418380 [](ancestors = 259418380)

IMultiStreamSource dataSample = null)
=> new TextLoader(CatalogUtils.GetEnvironment(catalog), columns, hasHeader, separatorChar, dataSample);
IMultiStreamSource dataSample = null,
bool allowSparse = TextLoader.Defaults.AllowSparse,
Copy link
Member

@eerhardt eerhardt Feb 22, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should these go before dataSample? #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reorder those arguments based on their usage frequencies (in my mind). In general, the more ML.NET-specific the later an argument appears.


In reply to: 259418984 [](ancestors = 259418984)

@@ -1,5 +1,6 @@
#@ TextLoader{
#@ header+
#@ sparse+
Copy link
Member

@eerhardt eerhardt Feb 22, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is sparse about this dataset? #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.


In reply to: 259421045 [](ancestors = 259421045)

@@ -19,12 +19,16 @@ public static class TextLoaderSaverCatalog
/// <param name="hasHeader">Whether the file has a header.</param>
/// <param name="separatorChar">The character used as separator between data points in a row. By default the tab character is used as separator.</param>
/// <param name="dataSample">The optional location of a data sample. The sample can be used to infer column names and number of slots in each column.</param>
/// <param name="allowSparse">Whether the file can contain numerical vectors in sparse format.</param>
/// <param name="allowQuoting">Whether the file can contain column defined by a quoted string.</param>
Copy link
Member

@eerhardt eerhardt Feb 22, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ReadFromTextFile calls these options:

            bool allowQuotedStrings = TextLoader.Defaults.AllowQuoting,
            bool supportSparse = TextLoader.Defaults.AllowSparse,

We should be consistent in the names everywhere. #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. I also checked other quote and sparse in this file.


In reply to: 259422639 [](ancestors = 259422639)

Copy link
Contributor

@artidoro artidoro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@CESARDELATORRE
Copy link
Contributor

We had a issue in one of the samples when migrating to 0.11 because the Label column in the dataset file had numeric values in quotes. Best workaround was to remove the quotes from the dataset file...

#2821

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
API Issues pertaining the friendly API
Projects
None yet
6 participants