Change Default Settings in TextLoader #2630

wschin · 2019-02-19T21:28:22Z

To fix #2576, this PR makes sparse- the default everywhere, in both the command line and the public APIs.

Make sparse- the default
Make quote- the default
Fix "DataOperationsCatalog SaveAsText extension method is evil" (#2452), which concerns TextSaver.

TomFinley · 2019-02-19T21:37:44Z

src/Microsoft.ML.Data/DataLoadSave/Text/TextLoader.cs

@@ -395,7 +396,7 @@ public sealed class Options : ArgumentsCore
        internal static class DefaultArguments
        {
            internal const bool AllowQuoting = true;


I feel like this needs to be changed too. That is, quoting is not a good option to have on by default. #Resolved

Absolutely. Please take a look at Iteration 3.

In reply to: 258240404 [](ancestors = 258240404)

I feel like having quoting on by default is better. Mainly this helps with CSV files w/ text columns (commas in the quoted strings).

| row | needs quotes |
| --- | --- |
| `true, "Schrödinger's cat walks into a bar, and doesn't."` | yes |
| `false, Wanted: Schrodinger's cat. Dead and alive.` | no |

I see more datasets that gain from quoting being on by default than are hurt by it. #Resolved
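To make the two example rows concrete, here is a small sketch of how they split with quoting on versus off. It uses Python's `csv` module purely as a stand-in for a quote-aware reader; it is not ML.NET's TextLoader, and the helper name `parse_row` is hypothetical.

```python
import csv
import io

def parse_row(line, allow_quoting):
    """Split one comma-separated line, optionally honoring double quotes."""
    # QUOTE_NONE makes the reader treat '"' as an ordinary character.
    quoting = csv.QUOTE_MINIMAL if allow_quoting else csv.QUOTE_NONE
    reader = csv.reader(io.StringIO(line), skipinitialspace=True, quoting=quoting)
    return next(reader)

quoted = 'true, "Schrödinger\'s cat walks into a bar, and doesn\'t."'
plain = "false, Wanted: Schrodinger's cat. Dead and alive."

# Quoting on: the comma inside the quotes stays part of the text field.
print(len(parse_row(quoted, allow_quoting=True)))
# Quoting off: the same row is split at the embedded comma as well.
print(len(parse_row(quoted, allow_quoting=False)))
# A row without quotes or embedded commas parses the same either way.
print(parse_row(plain, True) == parse_row(plain, False))
```

The row that "needs quotes" yields two fields only when quoting is honored; the other row is unaffected by the setting.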

I'd like non-standard settings to be non-default. There is no standard for parsing quoted strings.

In reply to: 258341863 [](ancestors = 258341863)

To be clear, my recommendation is to keep AllowQuoting defaulting to true.

The general reason is that more datasets will work. The impact of the parameter being falsely on is smaller than the impact of it being falsely off. The main non-standard part is the escaping method; quoting itself is standard.

You'll see others default to quoting:

Python: https://docs.python.org/2/library/csv.html

Perl: https://metacpan.org/pod/Text::CSV#always_quote

Go: https://golang.org/pkg/encoding/csv/

RFC 4180: https://en.wikipedia.org/wiki/Comma-separated_values#RFC_4180_standard

Pandas: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

PyTorch: https://github.com/pytorch/text/blob/499e327ea53bdf67c648f5747ed26764283b968a/torchtext/data/dataset.py#L217

TensorFlow: https://www.tensorflow.org/api_docs/python/tf/io/decode_csv

Excel: ...

/cc @TomFinley: thoughts? #Resolved

This is incorrect. Exactly zero of the issues I ran into would be solved by turning off quoting.

The issue was that our quoting support did not go far enough. The main issue I faced was that we haven't added support for newlines in the quoted string. I believe the argument at the time was that we treat lines independently for speed.

Most datasets are in CSV/TSV format. We should be supporting the datasets of our users. #Resolved

If quoting is not fully functional, it would be better to turn it off by default and have a separate TSV & CSV reader implementation, I guess.

In reply to: 258668384 [](ancestors = 258668384)

> This is incorrect. Exactly zero of the issues I ran into would be solved by turning off quoting.

I don't think that's true. I recall on multiple occasions seeing you run experiments where you were getting these warnings and you just sort of ignored it, even though it was probably corrupting your dataset. It definitely wasn't beneficial on the whole. It isn't a matter of it "not go far enough," because it wasn't intended to do that at any point. That the CSV format is not self synchronizing (due to multi-line issue) and therefore inappropriate for distributed applications, and therefore we did not elect to use it as our primary format, has been explained in other forums. However it again reinforces the point that this is not a CSV reader, which I believe I also mentioned in my reply. I don't mean to belabor the point. Since it is the premise of the claim that we should continue to support quoting, though, this is why I do not find that argument convincing.
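The multi-line point can be illustrated with a short sketch, again using Python's `csv` module as a stand-in rather than TextLoader itself. A quote-aware reader that sees the whole stream recovers three records, while a reader that first splits on newlines (as a line-parallel or distributed loader would) sees four physical lines and can no longer tell that two of them belong to the same field. The sample data is made up for illustration.

```python
import csv
import io

# Hypothetical sample: the second record has a newline inside a quoted field.
text = 'id,text\n1,"line one\nline two"\n2,plain\n'

# A stream-aware CSV reader keeps the quoted newline inside one record.
records = list(csv.reader(io.StringIO(text)))

# Splitting on newlines first treats every physical line as its own row.
physical_lines = text.splitlines()

print(len(records))         # logical records seen by the CSV reader
print(len(physical_lines))  # rows seen by a line-independent reader
print(records[1])
```

A reader that starts at an arbitrary line boundary has no way to know whether it is inside a quoted field, which is the sense in which the format is not self-synchronizing.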

There's also the issue that there are other people in the world, and on the whole if I look through my DRI email list, I see way more instances of turning quoting off than I see instances where someone somehow accidentally stumbled into using our quoting system.

So, off.

I see this as the lesser of two evils.

Default ON:

Has the potential to cause read errors (dropped rows) when there are newlines or quotes in the quoted string (generally doesn't matter)

Non-text datasets are unaffected

Default OFF:

Most text datasets fail (bad)

Non-text datasets are unaffected

Let me know if I'm wrong, but I think the argument more directly is:
The user should be forced to munge their dataset into the form we are expecting (e.g., no commas/newlines/escaped quotes in a string) vs. we should do our best on the data the user hands us (expectation of dirty data).

/cc: @CESARDELATORRE

To me it's rather simpler than that. If I think of the user questions and scenarios that I've handled with text loader over the years, and I tally up (1) those datasets that conformed to our scheme of quoting and (2) those that did not, there's a really clear tendency. I also think of all the problems it's caused over the years.

TomFinley · 2019-02-19T21:38:57Z

src/Microsoft.ML.Data/DataLoadSave/Text/TextLoaderSaverCatalog.cs

-            IMultiStreamSource dataSample = null)
-            => new TextLoader(CatalogUtils.GetEnvironment(catalog), columns, hasHeader, separatorChar, dataSample);
+            IMultiStreamSource dataSample = null,
+            bool allowSparse = TextLoader.DefaultArguments.AllowSparse)


Consider that since this is the code that needs to be changed for #2452, which you've self-assigned anyway, we may as well correct this while we're at it. (Otherwise I see no point in changing it.) #Resolved

Couldn't agree more. Proposed solution to #2452 can be found in Iteration 4!

In reply to: 258240786 [](ancestors = 258240786)

TomFinley

Looks mostly good, thank you @wschin !! My only high level comments are we should also fix the "quoting" thing as discussed in the issue as well as separate out the text loader/saving arguments, since this touches exactly the code we'd have to address in #2452 anyway.

codecov · 2019-02-19T22:12:52Z

Codecov Report

❗ No coverage uploaded for pull request base (master@412e1f9).
The diff coverage is 100%.

@@            Coverage Diff            @@
##             master    #2630   +/-   ##
=========================================
  Coverage          ?   71.58%           
=========================================
  Files             ?      805           
  Lines             ?   142025           
  Branches          ?    16130           
=========================================
  Hits              ?   101675           
  Misses            ?    35910           
  Partials          ?     4440

| Flag | Coverage Δ |
| --- | --- |
| #Debug | `71.58% <100%> (?)` |
| #production | `67.88% <100%> (?)` |
| #test | `85.74% <100%> (?)` |

| Impacted Files | Coverage Δ |
| --- | --- |
| ...soft.ML.TestFramework/DataPipe/TestDataPipeBase.cs | `73.76% <100%> (ø)` |
| ...est/Microsoft.ML.Predictor.Tests/TestPredictors.cs | `63.84% <100%> (ø)` |
| test/Microsoft.ML.Tests/AnomalyDetectionTests.cs | `100% <100%> (ø)` |
| test/Microsoft.ML.TimeSeries.Tests/TimeSeries.cs | `87.61% <100%> (ø)` |
| test/Microsoft.ML.FSharp.Tests/SmokeTests.fs | `96.07% <100%> (ø)` |
| test/Microsoft.ML.Functional.Tests/DataIO.cs | `100% <100%> (ø)` |
| ...L.Data/DataLoadSave/Text/TextLoaderSaverCatalog.cs | `100% <100%> (ø)` |
| ...crosoft.ML.TestFramework/BaseTestPredictorsMaml.cs | `77.01% <100%> (ø)` |
| ...cenariosWithDirectInstantiation/TensorflowTests.cs | `91.76% <100%> (ø)` |
| .../Microsoft.ML.Data/DataLoadSave/Text/TextLoader.cs | `82.09% <100%> (ø)` |

... and 8 more

eerhardt · 2019-02-20T02:05:15Z

src/Microsoft.ML.Data/DataLoadSave/Text/TextLoader.cs

-            : this(env, MakeArgs(columns, hasHeader, new[] { separatorChar }), dataSample)
+        /// <param name="allowSparse">Whether the file can contain numerical vectors in sparse format.</param>
+        /// <param name="allowQuoting">Whether the file can contain numerical vectors in sparse format.</param>
+        public TextLoader(IHostEnvironment env, Column[] columns, bool hasHeader = false, char separatorChar = '\t', IMultiStreamSource dataSample = null, bool allowSparse = false, bool allowQuoting = false)


The new parameters should use the default constants. #Resolved

Sure.

In reply to: 258307575 [](ancestors = 258307575)

eerhardt · 2019-02-20T02:06:48Z

src/Microsoft.ML.Data/DataLoadSave/Text/TextLoader.cs

@@ -1145,6 +1148,7 @@ private sealed class LoaderHolder
            // Verify that the current schema-defining arguments are default.
            // Get settings just for core arguments, not everything.
            string tmp = CmdParser.GetSettings(host, options, new ArgumentsCore());
+            tmp = Regex.Replace(tmp, @"[(sparse=\+)|(quote\+)]", "");


I don’t understand what this is for... can you explain? #Resolved

Our test framework throws if any non-default setting is found (indicated by a non-empty tmp here). As sparse+ and quote+ become non-default settings and are required in many tests, we need to remove them here. The design doesn't look good to me, but I feel it may take a while (maybe 3-4 days, but I am not sure) to refactor the framework.

In reply to: 258307862 [](ancestors = 258307862)

With changes made by (probably) @artidoro, I have dropped the new regexp. #Resolved
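For readers wondering what the pattern in the diff would have matched: the square brackets make it a character class, so it matches any single character from the set `(`, `s`, `p`, `a`, `r`, `e`, `=`, `+`, `)`, `|`, `q`, `u`, `o`, `t`, not the tokens `sparse+` or `quote+`. The sketch below shows this with Python's `re` module, whose character-class rules agree with .NET's for this pattern. The settings string and the token-based alternative are assumptions for illustration, not code from the PR.

```python
import re

# Hypothetical settings string; the real output of CmdParser.GetSettings may differ.
settings = "sep=, header+ sparse+ quote+"

# The pattern from the diff: a character class, so each listed character
# is removed wherever it appears, including inside "sep=" and "header+".
char_class = r"[(sparse=\+)|(quote\+)]"
stripped = re.sub(char_class, "", settings)

# An assumed token-based alternative: remove only the whole words
# "sparse+" and "quote+" together with any preceding whitespace.
token_pattern = r"\s*\b(?:sparse|quote)\+"
cleaned = re.sub(token_pattern, "", settings)

print(repr(stripped))
print(repr(cleaned))
```

The character-class version also mangles unrelated settings, which may be one more reason dropping the regex was the cleaner outcome.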

artidoro · 2019-02-20T23:00:39Z

src/Microsoft.ML.Data/DataLoadSave/Text/TextSaver.cs

        // REVIEW: consider saving a command line in a separate file.
        public sealed class Arguments
        {
            [Argument(ArgumentType.AtMostOnce, HelpText = "Separator", ShortName = "sep")]
-            public string Separator = "tab";
+            public string Separator = DefaultArguments.Separator.ToString();



I think what we had before was correct: '\t'.ToString() simply gives a tab as a string, whereas here the value was meant to be an understandable description of the separator. We have since moved to separator chars in TextLoader and elsewhere, in which case we use the character directly to define the separator. Here I believe we have to keep it as it was, "tab". #Resolved

Ok.

In reply to: 258714604 [](ancestors = 258714604)

TomFinley

Thanks @wschin !

artidoro · 2019-02-21T18:53:37Z

src/Microsoft.ML.Data/DataLoadSave/Text/TextLoaderSaverCatalog.cs

@@ -144,20 +148,22 @@ public static IDataView ReadFromTextFile(this DataOperationsCatalog catalog, str
        /// <param name="headerRow">Whether to write the header row.</param>
        /// <param name="schema">Whether to write the header comment with the schema.</param>
        /// <param name="keepHidden">Whether to keep hidden columns in the dataset.</param>
+        /// <param name="forceDense">Whether to save columns in dense format even if they are sparse vectors.</param>
        public static void SaveAsText(this DataOperationsCatalog catalog,
            IDataView data,
            Stream stream,
            char separatorChar = TextLoader.DefaultArguments.Separator,


Should this be TextSaver.DefaultArguments.Separator? #Resolved

Actually we should match the signature of TextLoader which uses a separatorChar. If this logic works, then my comment about the need to use "tab" instead of '\t'.ToString() was wrong. Could you double check that the default '\t'.ToString() that you are using here actually works? If not, I feel we should change TextSaver to use a separatorChar.

In reply to: 259067133 [](ancestors = 259067133)

If it didn't work, some tests would fail. I agree char is better.
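The distinction under discussion, a human-readable separator name such as "tab" versus the literal character, can be sketched as follows. The lookup table and the function `resolve_separator` are hypothetical and only illustrate why the two defaults are interchangeable only if the consumer accepts both forms; they are not TextSaver's actual parsing logic.

```python
# Hypothetical mapping from descriptive names to separator characters.
SEPARATOR_NAMES = {"tab": "\t", "space": " ", "comma": ","}

def resolve_separator(value):
    """Accept either a descriptive name ("tab") or a literal one-character separator."""
    if value in SEPARATOR_NAMES:
        return SEPARATOR_NAMES[value]
    if len(value) == 1:
        return value
    raise ValueError(f"unrecognized separator: {value!r}")

# Both spellings resolve to the same character under this assumed scheme.
print(resolve_separator("tab") == resolve_separator("\t"))
```

A char-typed parameter sidesteps the question entirely, since there is no name to interpret.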

In reply to: 259071246 [](ancestors = 259071246,259067133)

artidoro · 2019-02-21T19:10:53Z

Could you also rebase/merge with master as I think I made some conflicting changes to TextLoader. #Resolved

wschin · 2019-02-22T01:08:52Z

Done. Hopefully that's the last time I need to do so for this PR.

In reply to: 466127204 [](ancestors = 466127204)

eerhardt · 2019-02-22T16:39:11Z

src/Microsoft.ML.Data/DataLoadSave/Text/TextLoader.cs

@@ -7,6 +7,7 @@
 using System.Linq;
 using System.Reflection;
 using System.Text;
+using System.Text.RegularExpressions;


This isn't necessary anymore, right? #Resolved

eerhardt · 2019-02-22T16:39:44Z

src/Microsoft.ML.Data/DataLoadSave/Text/TextLoader.cs

-        internal TextLoader(IHostEnvironment env, Column[] columns, bool hasHeader = false, char separatorChar = '\t', IMultiStreamSource dataSample = null)
-            : this(env, MakeArgs(columns, hasHeader, new[] { separatorChar }), dataSample)
+        /// <param name="allowSparse">Whether the file can contain numerical vectors in sparse format.</param>
+        /// <param name="allowQuoting">Whether the file can contain numerical vectors in sparse format.</param>


copy-paste error here. I don't think this is what allowQuoting means :) #Resolved

Thanks. Got it changed to /// <param name="allowQuoting">Whether the content of a column can be parsed from a string starting and ending with quote.</param>.

In reply to: 259418380 [](ancestors = 259418380)

eerhardt · 2019-02-22T16:40:49Z

src/Microsoft.ML.Data/DataLoadSave/Text/TextLoaderSaverCatalog.cs

-            IMultiStreamSource dataSample = null)
-            => new TextLoader(CatalogUtils.GetEnvironment(catalog), columns, hasHeader, separatorChar, dataSample);
+            IMultiStreamSource dataSample = null,
+            bool allowSparse = TextLoader.Defaults.AllowSparse,


Should these go before dataSample? #Resolved

I reordered those arguments based on how frequently I expect them to be used. In general, the more ML.NET-specific an argument is, the later it appears.

In reply to: 259418984 [](ancestors = 259418984)

eerhardt · 2019-02-22T16:43:31Z

test/data/adult.tiny.with-schema.txt

@@ -1,5 +1,6 @@
 #@ TextLoader{
 #@   header+
+#@   sparse+


What is sparse about this dataset? #Resolved

Removed.

In reply to: 259421045 [](ancestors = 259421045)

eerhardt · 2019-02-22T16:45:54Z

src/Microsoft.ML.Data/DataLoadSave/Text/TextLoaderSaverCatalog.cs

@@ -19,12 +19,16 @@ public static class TextLoaderSaverCatalog
        /// <param name="hasHeader">Whether the file has a header.</param>
        /// <param name="separatorChar">The character used as separator between data points in a row. By default the tab character is used as separator.</param>
        /// <param name="dataSample">The optional location of a data sample. The sample can be used to infer column names and number of slots in each column.</param>
+        /// <param name="allowSparse">Whether the file can contain numerical vectors in sparse format.</param>
+        /// <param name="allowQuoting">Whether the file can contain column defined by a quoted string.</param>


ReadFromTextFile calls these options:

bool allowQuotedStrings = TextLoader.Defaults.AllowQuoting, bool supportSparse = TextLoader.Defaults.AllowSparse,

We should be consistent in the names everywhere. #Resolved

Ok. I also checked the other quote and sparse names in this file.

In reply to: 259422639 [](ancestors = 259422639)

artidoro

CESARDELATORRE · 2019-03-04T03:39:53Z

We had an issue in one of the samples when migrating to 0.11 because the Label column in the dataset file had numeric values in quotes. The best workaround was to remove the quotes from the dataset file...

#2821
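The failure mode described here can be reproduced in miniature. The sketch below again uses Python's `csv` module as a stand-in for a loader with quoting switched off; the sample line and the helper `read_label` are made up. With quoting off, the quotes stay part of the field and the numeric conversion fails.

```python
import csv
import io

# Hypothetical row whose first (label) column is a number wrapped in quotes.
line = '"1",0.5,0.25'

def read_label(line, allow_quoting):
    """Parse the first field of a comma-separated line as a float."""
    quoting = csv.QUOTE_MINIMAL if allow_quoting else csv.QUOTE_NONE
    fields = next(csv.reader(io.StringIO(line), quoting=quoting))
    return float(fields[0])

# Quoting on: the quotes are stripped and the label converts cleanly.
print(read_label(line, allow_quoting=True))

# Quoting off: the field still contains the quote characters.
try:
    read_label(line, allow_quoting=False)
except ValueError as err:
    print("conversion failed:", err)
```

Under this reading, either removing the quotes from the file or opting back in to quoting resolves the problem.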

wschin added the API label (Issues pertaining the friendly API) Feb 19, 2019

wschin self-assigned this Feb 19, 2019

wschin requested review from eerhardt, TomFinley and artidoro February 19, 2019 21:28

TomFinley reviewed Feb 19, 2019

wschin changed the title ~~[WIP] Change Default Settings in TextLoader~~ Change Default Settings in TextLoader Feb 19, 2019

eerhardt reviewed Feb 20, 2019

artidoro reviewed Feb 20, 2019

TomFinley approved these changes Feb 20, 2019

artidoro reviewed Feb 21, 2019

wschin added 6 commits February 21, 2019 16:16

Use AllowSparse=false as default in TextLoader

7034a09

Update entry point catelog

e954b86

Make quote- default

39beedd

TextLoader uses TextLoader's default settings

1cabfff

Address comments

a537d53

tab to \t

fdd08cf

wschin force-pushed the textloader-args branch from 53c1850 to fdd08cf Compare February 22, 2019 01:06

Revert a weird change

efe8019

eerhardt reviewed Feb 22, 2019

Address comments

e880f3b

wschin added 3 commits February 22, 2019 09:21

Reorder arguments

7ce8a5d

Polish cookbook

ddf3a10

Reorder arguments in static TextLoader

36829ab

artidoro approved these changes Feb 22, 2019

Also change argument order in F#

dd29269

wschin merged commit ec418e4 into dotnet:master Feb 22, 2019

wschin deleted the textloader-args branch February 22, 2019 19:22

Ivanidzo4ka mentioned this pull request Mar 3, 2019

In v0.11 Transforms.Conversion.ConvertType() does not properly convert numeric values if they are "in quotes" #2824

Closed

CESARDELATORRE mentioned this pull request Mar 4, 2019

Exception in CreditCard Fraud Detection sample while migrating to v0.11 #2821

Closed

stephentoub mentioned this pull request Mar 5, 2019

Revert "Turn off supportSparse in GitHubLabeler sample" dotnet/machinelearning-samples#296

Merged

justinormont mentioned this pull request Nov 8, 2019

[AutoML v0.16.0] InferColumn doesn't work on tricky csv file #4460

Closed

ghost locked as resolved and limited conversation to collaborators Mar 24, 2022
