-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Make text loaders consistent #2710
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
/// <param name="dataSample">The optional location of a data sample. The sample can be used to infer column names and number of slots in each column.</param> | ||
public static TextLoader CreateTextLoader(this DataOperationsCatalog catalog, | ||
TextLoader.Column[] columns, | ||
char separatorChar = TextLoader.Defaults.Separator, | ||
bool hasHeader = TextLoader.Defaults.HasHeader, | ||
bool allowSparse = TextLoader.Defaults.AllowSparse, | ||
bool allowQuoting = TextLoader.Defaults.AllowQuoting, | ||
bool trimWhitespace = TextLoader.Defaults.TrimWhitespace, | ||
IMultiStreamSource dataSample = null) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The data sample is really important. (See #2663.) If there are feature names, if there is a header, that's where they're coming from. It's really important. Burying it... what... eight levels deep seems a bit much. To bury it beyond such things as allowing sparsity and quoting itself seems really tough.
I might say the following order is most sensible:
- Columns,
- Separator char,
- Has header,
- Data sample,
- Quoting,
- Trimming,
- Sparsity.
I might even think that 5, 6, and 7 could be relegated to advanced options, but no need to insist.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just got them reordered. Thanks for sharing the list!
In reply to: 259963185 [](ancestors = 259963185)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that if hasHeader is true, then the dataSample should not be allowed to be null (because if it is, there can be no slot names). Does it make sense to check this? Another alternative is instead of having a bool argument, to have an IMultiStreamSource header
instead. But on the other hand, if there are two IMultiStreamSource
s that might be confusing. Not sure what the right signature is here. @Ivanidzo4ka, @TomFinley, what do you think?
In reply to: 260017117 [](ancestors = 260017117,259963185)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that if hasHeader is true, then the dataSample should not be allowed to be null (because if it is, there can be no slot names). Does it make sense to check this? Another alternative is instead of having a bool argument, to have an
IMultiStreamSource header
instead. But on the other hand, if there are twoIMultiStreamSource
s that might be confusing. Not sure what the right signature is here. @Ivanidzo4ka, @TomFinley, what do you think?In reply to: 260017117 [](ancestors = 260017117,259963185)
So, let's imagine that I declare that there is a header, but I provide no sample. I think this is fine. I am merely describing how to read data, and I am saying that when you read the data with this thing, that you will have a header to contend with. If headers were always only used for defining slot names, I might agree, but sometimes files have headers just because they have headers, and there's no larger reason for it.
The presence of absence of a sample does not mean that we cannot fully define the schema, even if header is true. (This is in contrast to some other options, like, "this is fixed size, but I would like you to auto-determine what that size is, which obviously requires a sample.)
So no check would be my reaction.
/// <param name="allowQuoting">Whether the input may include quoted values, | ||
/// which can contain separator characters, colons, | ||
/// and distinguish empty values from missing values. When true, consecutive separators | ||
/// denote a missing value and an empty value is denoted by \"\". | ||
/// When false, consecutive separators denote an empty value.</param> | ||
/// <param name="allowSparse">Whether the input may include sparse representations for example, | ||
/// if one of the row contains "5 2:6 4:3" that's mean there are 5 columns all zero | ||
/// except for 3rd and 5th columns which have values 6 and 3</param> | ||
/// <param name="trimWhitespace">Remove trailing whitespace from lines</param> | ||
public static TextLoader CreateTextLoader<TInput>(this DataOperationsCatalog catalog, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CreateTextLoader [](start = 33, length = 16)
This is not consistent. The file is missing altogether, which means it cannot be used to read slot names. Whoops.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. I wrongly thought TInput can be used to encode SlotNames. IMultiStreamSource dataSample = null
is just added.
In reply to: 259963380 [](ancestors = 259963380)
internal TextLoader(IHostEnvironment env, Column[] columns, char separatorChar = Defaults.Separator, | ||
bool hasHeader = Defaults.HasHeader, bool allowSparse = Defaults.AllowSparse, | ||
bool allowQuoting = Defaults.AllowQuoting, IMultiStreamSource dataSample = null) | ||
: this(env, MakeArgs(columns, hasHeader, new[] { separatorChar }, allowSparse, allowQuoting), dataSample) | ||
bool allowQuoting = Defaults.AllowQuoting, IMultiStreamSource dataSample = null, bool trimWhitespace = Defaults.TrimWhitespace) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMultiStreamSource dataSample = null [](start = 55, length = 36)
If you wanted you could just get rid of this constructor opportunistically. It serves no purpose as far as I can tell. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -47,71 +49,78 @@ public static class TextLoaderSaverCatalog | |||
/// <param name="catalog">The <see cref="DataOperationsCatalog"/> catalog.</param> | |||
/// <param name="separatorChar">Column separator character. Default is '\t'</param> | |||
/// <param name="hasHeader">Does the file contains header?</param> | |||
/// <param name="allowSparse">Whether the input may include sparse representations for example, | |||
/// if one of the row contains "5 2:6 4:3" that's mean there are 5 columns all zero | |||
/// except for 3rd and 5th columns which have values 6 and 3</param> | |||
/// <param name="allowQuoting">Whether the input may include quoted values, | |||
/// which can contain separator characters, colons, | |||
/// and distinguish empty values from missing values. When true, consecutive separators | |||
/// denote a missing value and an empty value is denoted by \"\". | |||
/// When false, consecutive separators denote an empty value.</param> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you could make the comments consistent as well it would be great.
#Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
55cde9e
to
96d561e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Codecov Report
@@ Coverage Diff @@
## master #2710 +/- ##
=========================================
Coverage ? 71.66%
=========================================
Files ? 808
Lines ? 142292
Branches ? 16139
=========================================
Hits ? 101972
Misses ? 35879
Partials ? 4441
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @wschin this looks like it's in a much better place now!
@@ -47,79 +61,98 @@ public static class TextLoaderSaverCatalog | |||
/// <param name="catalog">The <see cref="DataOperationsCatalog"/> catalog.</param> | |||
/// <param name="separatorChar">Column separator character. Default is '\t'</param> | |||
/// <param name="hasHeader">Does the file contains header?</param> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hasHeader [](start = 25, length = 9)
Nit pick - I would rephrase this as "Set to true if the file contains a header row".
public static IDataView ReadFromTextFile(this DataOperationsCatalog catalog, string path, TextLoader.Options options = null) | ||
/// <param name="dataSample">The optional location of a data sample. The sample can be used to infer column names and number of slots in each column.</param> | ||
public static IDataView ReadFromTextFile(this DataOperationsCatalog catalog, string path, | ||
TextLoader.Options options = null, IMultiStreamSource dataSample = null) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
, IMultiStreamSource dataSample = null [](start = 45, length = 38)
Hi @wschin, somehow I missed this -- definitely we don't need to have a data sample if we're talking about the path to a data file already.
/// <returns>The data view.</returns> | ||
public static IDataView ReadFromTextFile<TInput>(this DataOperationsCatalog catalog, | ||
string path, | ||
char separatorChar = TextLoader.Defaults.Separator, | ||
bool hasHeader = TextLoader.Defaults.HasHeader, | ||
IMultiStreamSource dataSample = null, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMultiStreamSource dataSample [](start = 12, length = 29)
Similar here.
bool trimWhitespace = TextLoader.Defaults.TrimWhitespace, | ||
bool allowSparse = TextLoader.Defaults.AllowSparse) | ||
=> TextLoader.CreateTextReader<TInput>(CatalogUtils.GetEnvironment(catalog), hasHeader, separatorChar, allowQuoting, | ||
allowSparse, trimWhitespace, dataSample: dataSample); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dataSample [](start = 45, length = 10)
why pass the arguments this way? Why not add dataSample after hasHeader?
/// <param name="allowQuoting">Whether the input may include quoted values, | ||
/// which can contain separator characters, colons, | ||
/// and distinguish empty values from missing values. When true, consecutive separators | ||
/// denote a missing value and an empty value is denoted by \"\". | ||
/// When false, consecutive separators denote an empty value.</param> | ||
/// <param name="trimWhitespace">Remove trailing whitespace from lines</param> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor nit - Add a period.
Another attempt to fix #2472.