-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Change Default Settings in TextLoader #2630
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@@ -395,7 +396,7 @@ public sealed class Options : ArgumentsCore | |||
internal static class DefaultArguments | |||
{ | |||
internal const bool AllowQuoting = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AllowQuoting [](start = 32, length = 12)
I feel like this needs to be changed too. That is, quoting is not a good option to have on by default. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like having quoting on by default is better. Mainly this helps with CSV files w/ text columns (commas in the quoted strings).
row | needs quotes |
---|---|
true , "Schrödinger's cat walks into a bar, and doesn't." |
yes |
false , Wanted: Schrodinger's cat. Dead and alive. |
no |
I see more datasets with gains from default on quoting, then hurt by it. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to have non-standard settings to be non-default. There is no standard on parsing quoted strings.
In reply to: 258341863 [](ancestors = 258341863)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be clear, my recommendation is to keep AllowQuoting
defaulting to true
.
General reason is that it more datasets will work. The impact of the parameter falsely on, is less than falsely off. The main non-standard part is escaping methods, having quoting is standard.
You'll see others default to quoting:
- Python: https://docs.python.org/2/library/csv.html
- Perl: https://metacpan.org/pod/Text::CSV#always_quote
- Go: https://golang.org/pkg/encoding/csv/
- RFC 4180: https://en.wikipedia.org/wiki/Comma-separated_values#RFC_4180_standard
- Pandas: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
- PyTorch: https://github.com/pytorch/text/blob/499e327ea53bdf67c648f5747ed26764283b968a/torchtext/data/dataset.py#L217
- TensorFlow: https://www.tensorflow.org/api_docs/python/tf/io/decode_csv
- Excel: ...
/cc @TomFinley: thoughts? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is incorrect. Exactly zero of the issues I ran into would be solved by turning off quoting.
The issue was our quoting support did not go far enough. I main issue I faced was that we haven't added support for newlines in the quoted string. I believe the argument at the time was that we treat lines independently for speed.
Most datasets are in CSV/TSV format. We should be supporting the datasets of our users. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If quote is not fully-functioning, it'd be better to turn it off by default and have another TSC & CSV reader implementation, I guess.
In reply to: 258668384 [](ancestors = 258668384)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is incorrect. Exactly zero of the issues I ran into would be solved by turning off quoting.
I don't think that's true. I recall on multiple occasions seeing you run experiments where you were getting these warnings and you just sort of ignored it, even though it was probably corrupting your dataset. It definitely wasn't beneficial on the whole. It isn't a matter of it "not go far enough," because it wasn't intended to do that at any point. That the CSV format is not self synchronizing (due to multi-line issue) and therefore inappropriate for distributed applications, and therefore we did not elect to use it as our primary format, has been explained in other forums. However it again reinforces the point that this is not a CSV reader, which I believe I also mentioned in my reply. I don't mean to belabor the point. Since it is the premise of the claim that we should continue to support quoting, though, this is why I do not find that argument convincing.
There's also the issue that there are other people in the world, and on the whole if I look through my DRI email list, I see way more instances of turning quoting off than I see instances where someone somehow accidentally stumbled into using our quoting system.
So, off.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see this as lesser of two evils.
Default ON:
- Has the potential to cause read errors (dropped rows) when there are newlines or quotes in the quoted string (generally doesn't matter)
- Non-text datasets are unaffected
Default OFF:
- Most text datasets fail (bad)
- Non-text datasets are unaffected
Let me know if I'm wrong, but I think the argument more directly is:
The user should be forced to mung their dataset in to the form we are expecting (eg: no commas/newlines/escaped-quotes in a string) vs. we should do our best on the data the user hands us (expectation of dirty data).
/cc: @CESARDELATORRE
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To me it's fairly simpler than that. If I think of the user questions and scenarios that I've handled with text loader over the years, and I tally up (1) those datasets that conformed to our scheme of quoting and (2) those that did not, there's a really clear tendency. I also think of all the problems it's caused over the years.
IMultiStreamSource dataSample = null) | ||
=> new TextLoader(CatalogUtils.GetEnvironment(catalog), columns, hasHeader, separatorChar, dataSample); | ||
IMultiStreamSource dataSample = null, | ||
bool allowSparse = TextLoader.DefaultArguments.AllowSparse) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TextLoader [](start = 31, length = 10)
Consider that since this is the code that needs to be changed for #2452, which you've self-assigned anyway, we may as well correct this while we're at it. (Otherwise I see no point in changing it.) #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Codecov Report
@@ Coverage Diff @@
## master #2630 +/- ##
=========================================
Coverage ? 71.58%
=========================================
Files ? 805
Lines ? 142025
Branches ? 16130
=========================================
Hits ? 101675
Misses ? 35910
Partials ? 4440
|
: this(env, MakeArgs(columns, hasHeader, new[] { separatorChar }), dataSample) | ||
/// <param name="allowSparse">Whether the file can contain numerical vectors in sparse format.</param> | ||
/// <param name="allowQuoting">Whether the file can contain numerical vectors in sparse format.</param> | ||
public TextLoader(IHostEnvironment env, Column[] columns, bool hasHeader = false, char separatorChar = '\t', IMultiStreamSource dataSample = null, bool allowSparse = false, bool allowQuoting = false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new parameters should use the default constants. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -1145,6 +1148,7 @@ private sealed class LoaderHolder | |||
// Verify that the current schema-defining arguments are default. | |||
// Get settings just for core arguments, not everything. | |||
string tmp = CmdParser.GetSettings(host, options, new ArgumentsCore()); | |||
tmp = Regex.Replace(tmp, @"[(sparse=\+)|(quote\+)]", ""); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don’t understand what this is for... can you explain? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Our test framework throws if any non-default setting found (indicating by an non-empty tmp
here). As sparse+
and quote+
become non-default settings and are required in many tests, we need to remove them here. The design doesn't look good to me but I feel it may take a while (e.g., maybe 3-4 days but I am not sure) if we need to refactorize the framework.
In reply to: 258307862 [](ancestors = 258307862)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With changes made by (probably) @artidoro, I have dropped the new regexp. #Resolved
// REVIEW: consider saving a command line in a separate file. | ||
public sealed class Arguments | ||
{ | ||
[Argument(ArgumentType.AtMostOnce, HelpText = "Separator", ShortName = "sep")] | ||
public string Separator = "tab"; | ||
public string Separator = DefaultArguments.Separator.ToString(); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that what we had before was correct, I think that '\t'.ToString()
simply gives a tab as a string. Here it was meant to be an understandable description of separators. We since moved to separator chars in TextLoader and such, in which case we use the character directly to define the separator. Here I believe we have to keep it like it was "tab". #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @wschin !
@@ -144,20 +148,22 @@ public static IDataView ReadFromTextFile(this DataOperationsCatalog catalog, str | |||
/// <param name="headerRow">Whether to write the header row.</param> | |||
/// <param name="schema">Whether to write the header comment with the schema.</param> | |||
/// <param name="keepHidden">Whether to keep hidden columns in the dataset.</param> | |||
/// <param name="forceDense">Whether to save columns in dense format even if they are sparse vectors.</param> | |||
public static void SaveAsText(this DataOperationsCatalog catalog, | |||
IDataView data, | |||
Stream stream, | |||
char separatorChar = TextLoader.DefaultArguments.Separator, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be TextSaver.DefaultArguments.Separator
? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually we should match the signature of TextLoader
which uses a separatorChar. If this logic works, then my comment about the need to use "tab" instead of '\t'.ToString()
was wrong. Could you double check that the default '\t'.ToString()
that you are using here actually works? If not, I feel we should change TextSaver
to use a separatorChar.
In reply to: 259067133 [](ancestors = 259067133)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it didn't work, some tests may fail. I agree char is better.
In reply to: 259071246 [](ancestors = 259071246,259067133)
Could you also rebase/merge with master as I think I made some conflicting changes to |
53c1850
to
fdd08cf
Compare
Done. Hopefully that's my last time of doing so for this PR..... In reply to: 466127204 [](ancestors = 466127204) |
@@ -7,6 +7,7 @@ | |||
using System.Linq; | |||
using System.Reflection; | |||
using System.Text; | |||
using System.Text.RegularExpressions; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't necessary anymore, right? #Resolved
internal TextLoader(IHostEnvironment env, Column[] columns, bool hasHeader = false, char separatorChar = '\t', IMultiStreamSource dataSample = null) | ||
: this(env, MakeArgs(columns, hasHeader, new[] { separatorChar }), dataSample) | ||
/// <param name="allowSparse">Whether the file can contain numerical vectors in sparse format.</param> | ||
/// <param name="allowQuoting">Whether the file can contain numerical vectors in sparse format.</param> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
copy-paste error here. I don't think this is what allowQuoting
means :) #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Got it changed to /// <param name="allowQuoting">Whether the content of a column can be parsed from a string starting and ending with quote.</param>
.
In reply to: 259418380 [](ancestors = 259418380)
IMultiStreamSource dataSample = null) | ||
=> new TextLoader(CatalogUtils.GetEnvironment(catalog), columns, hasHeader, separatorChar, dataSample); | ||
IMultiStreamSource dataSample = null, | ||
bool allowSparse = TextLoader.Defaults.AllowSparse, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should these go before dataSample
? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reorder those arguments based on their usage frequencies (in my mind). In general, the more ML.NET-specific the later an argument appears.
In reply to: 259418984 [](ancestors = 259418984)
test/data/adult.tiny.with-schema.txt
Outdated
@@ -1,5 +1,6 @@ | |||
#@ TextLoader{ | |||
#@ header+ | |||
#@ sparse+ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is sparse about this dataset? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -19,12 +19,16 @@ public static class TextLoaderSaverCatalog | |||
/// <param name="hasHeader">Whether the file has a header.</param> | |||
/// <param name="separatorChar">The character used as separator between data points in a row. By default the tab character is used as separator.</param> | |||
/// <param name="dataSample">The optional location of a data sample. The sample can be used to infer column names and number of slots in each column.</param> | |||
/// <param name="allowSparse">Whether the file can contain numerical vectors in sparse format.</param> | |||
/// <param name="allowQuoting">Whether the file can contain column defined by a quoted string.</param> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ReadFromTextFile
calls these options:
bool allowQuotedStrings = TextLoader.Defaults.AllowQuoting,
bool supportSparse = TextLoader.Defaults.AllowSparse,
We should be consistent in the names everywhere. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. I also checked other quote
and sparse
in this file.
In reply to: 259422639 [](ancestors = 259422639)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We had a issue in one of the samples when migrating to 0.11 because the Label column in the dataset file had numeric values in quotes. Best workaround was to remove the quotes from the dataset file... |
To fix #2576, this PR makes the default to
sparse-
everywhere including both of command line and public APIs.sparse-
as defaultquote-
as default