Skip to content

Make text loaders consistent #2710

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Feb 26, 2019
Merged

Conversation

wschin
Copy link
Member

@wschin wschin commented Feb 25, 2019

Another attempt to fix #2472.

@wschin wschin added the API Issues pertaining the friendly API label Feb 25, 2019
@wschin wschin self-assigned this Feb 25, 2019
@wschin wschin requested review from TomFinley and artidoro February 25, 2019 18:04
/// <param name="dataSample">The optional location of a data sample. The sample can be used to infer column names and number of slots in each column.</param>
public static TextLoader CreateTextLoader(this DataOperationsCatalog catalog,
TextLoader.Column[] columns,
char separatorChar = TextLoader.Defaults.Separator,
bool hasHeader = TextLoader.Defaults.HasHeader,
bool allowSparse = TextLoader.Defaults.AllowSparse,
bool allowQuoting = TextLoader.Defaults.AllowQuoting,
bool trimWhitespace = TextLoader.Defaults.TrimWhitespace,
IMultiStreamSource dataSample = null)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The data sample is really important. (See #2663.) If there are feature names, if there is a header, that's where they're coming from. It's really important. Burying it... what... eight levels deep seems a bit much. To bury it beyond such things as allowing sparsity and quoting itself seems really tough.

I might say the following order is most sensible:

  1. Columns,
  2. Separator char,
  3. Has header,
  4. Data sample,
  5. Quoting,
  6. Trimming,
  7. Sparsity.

I might even think that 5, 6, and 7 could be relegated to advanced options, but no need to insist.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just got them reordered. Thanks for sharing the list!


In reply to: 259963185 [](ancestors = 259963185)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that if hasHeader is true, then the dataSample should not be allowed to be null (because if it is, there can be no slot names). Does it make sense to check this? Another alternative is instead of having a bool argument, to have an IMultiStreamSource header instead. But on the other hand, if there are two IMultiStreamSources that might be confusing. Not sure what the right signature is here. @Ivanidzo4ka, @TomFinley, what do you think?


In reply to: 260017117 [](ancestors = 260017117,259963185)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that if hasHeader is true, then the dataSample should not be allowed to be null (because if it is, there can be no slot names). Does it make sense to check this? Another alternative is instead of having a bool argument, to have an IMultiStreamSource header instead. But on the other hand, if there are two IMultiStreamSources that might be confusing. Not sure what the right signature is here. @Ivanidzo4ka, @TomFinley, what do you think?

In reply to: 260017117 [](ancestors = 260017117,259963185)

So, let's imagine that I declare that there is a header, but I provide no sample. I think this is fine. I am merely describing how to read data, and I am saying that when you read the data with this thing, that you will have a header to contend with. If headers were always only used for defining slot names, I might agree, but sometimes files have headers just because they have headers, and there's no larger reason for it.

The presence of absence of a sample does not mean that we cannot fully define the schema, even if header is true. (This is in contrast to some other options, like, "this is fixed size, but I would like you to auto-determine what that size is, which obviously requires a sample.)

So no check would be my reaction.

/// <param name="allowQuoting">Whether the input may include quoted values,
/// which can contain separator characters, colons,
/// and distinguish empty values from missing values. When true, consecutive separators
/// denote a missing value and an empty value is denoted by \"\".
/// When false, consecutive separators denote an empty value.</param>
/// <param name="allowSparse">Whether the input may include sparse representations for example,
/// if one of the row contains "5 2:6 4:3" that's mean there are 5 columns all zero
/// except for 3rd and 5th columns which have values 6 and 3</param>
/// <param name="trimWhitespace">Remove trailing whitespace from lines</param>
public static TextLoader CreateTextLoader<TInput>(this DataOperationsCatalog catalog,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CreateTextLoader [](start = 33, length = 16)

This is not consistent. The file is missing altogether, which means it cannot be used to read slot names. Whoops.

Copy link
Member Author

@wschin wschin Feb 25, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. I wrongly thought TInput can be used to encode SlotNames. IMultiStreamSource dataSample = null is just added.


In reply to: 259963380 [](ancestors = 259963380)

internal TextLoader(IHostEnvironment env, Column[] columns, char separatorChar = Defaults.Separator,
bool hasHeader = Defaults.HasHeader, bool allowSparse = Defaults.AllowSparse,
bool allowQuoting = Defaults.AllowQuoting, IMultiStreamSource dataSample = null)
: this(env, MakeArgs(columns, hasHeader, new[] { separatorChar }, allowSparse, allowQuoting), dataSample)
bool allowQuoting = Defaults.AllowQuoting, IMultiStreamSource dataSample = null, bool trimWhitespace = Defaults.TrimWhitespace)
Copy link
Contributor

@TomFinley TomFinley Feb 25, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMultiStreamSource dataSample = null [](start = 55, length = 36)

If you wanted you could just get rid of this constructor opportunistically. It serves no purpose as far as I can tell. #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dropped.


In reply to: 259963695 [](ancestors = 259963695)

@@ -47,71 +49,78 @@ public static class TextLoaderSaverCatalog
/// <param name="catalog">The <see cref="DataOperationsCatalog"/> catalog.</param>
/// <param name="separatorChar">Column separator character. Default is '\t'</param>
/// <param name="hasHeader">Does the file contains header?</param>
/// <param name="allowSparse">Whether the input may include sparse representations for example,
/// if one of the row contains "5 2:6 4:3" that's mean there are 5 columns all zero
/// except for 3rd and 5th columns which have values 6 and 3</param>
/// <param name="allowQuoting">Whether the input may include quoted values,
/// which can contain separator characters, colons,
/// and distinguish empty values from missing values. When true, consecutive separators
/// denote a missing value and an empty value is denoted by \"\".
/// When false, consecutive separators denote an empty value.</param>
Copy link
Contributor

@artidoro artidoro Feb 25, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you could make the comments consistent as well it would be great.
#Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure.


In reply to: 259975940 [](ancestors = 259975940)

@wschin wschin force-pushed the consistent-text-loaders branch from 55cde9e to 96d561e Compare February 25, 2019 21:59
Copy link
Contributor

@artidoro artidoro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@codecov
Copy link

codecov bot commented Feb 25, 2019

Codecov Report

❗ No coverage uploaded for pull request base (master@f6d55f3). Click here to learn what that means.
The diff coverage is 50.8%.

@@            Coverage Diff            @@
##             master    #2710   +/-   ##
=========================================
  Coverage          ?   71.66%           
=========================================
  Files             ?      808           
  Lines             ?   142292           
  Branches          ?    16139           
=========================================
  Hits              ?   101972           
  Misses            ?    35879           
  Partials          ?     4441
Flag Coverage Δ
#Debug 71.66% <50.8%> (?)
#production 67.92% <95.74%> (?)
#test 85.82% <23.37%> (?)
Impacted Files Coverage Δ
...t/Microsoft.ML.Benchmarks/PredictionEngineBench.cs 0% <0%> (ø)
...s/StochasticDualCoordinateAscentClassifierBench.cs 0% <0%> (ø)
.../Microsoft.ML.Data/DataLoadSave/Text/TextLoader.cs 81.71% <100%> (ø)
...osoft.ML.Functional.Tests/Datasets/TypeTestData.cs 96.34% <100%> (ø)
test/Microsoft.ML.Functional.Tests/Prediction.cs 100% <100%> (ø)
...ML.Data/Transforms/ValueToKeyMappingTransformer.cs 78.21% <100%> (ø)
...ML.Tests/TrainerEstimators/MetalinearEstimators.cs 100% <100%> (ø)
test/Microsoft.ML.Tests/TextLoaderTests.cs 98.9% <100%> (ø)
...soft.ML.Functional.Tests/Datasets/MnistOneClass.cs 83.33% <100%> (ø)
test/Microsoft.ML.Functional.Tests/Evaluation.cs 100% <100%> (ø)
... and 3 more

Copy link
Contributor

@TomFinley TomFinley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @wschin this looks like it's in a much better place now!

@wschin wschin merged commit c90fa51 into dotnet:master Feb 26, 2019
@wschin wschin deleted the consistent-text-loaders branch February 26, 2019 19:29
@@ -47,79 +61,98 @@ public static class TextLoaderSaverCatalog
/// <param name="catalog">The <see cref="DataOperationsCatalog"/> catalog.</param>
/// <param name="separatorChar">Column separator character. Default is '\t'</param>
/// <param name="hasHeader">Does the file contains header?</param>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hasHeader [](start = 25, length = 9)

Nit pick - I would rephrase this as "Set to true if the file contains a header row".

public static IDataView ReadFromTextFile(this DataOperationsCatalog catalog, string path, TextLoader.Options options = null)
/// <param name="dataSample">The optional location of a data sample. The sample can be used to infer column names and number of slots in each column.</param>
public static IDataView ReadFromTextFile(this DataOperationsCatalog catalog, string path,
TextLoader.Options options = null, IMultiStreamSource dataSample = null)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

, IMultiStreamSource dataSample = null [](start = 45, length = 38)

Hi @wschin, somehow I missed this -- definitely we don't need to have a data sample if we're talking about the path to a data file already.

/// <returns>The data view.</returns>
public static IDataView ReadFromTextFile<TInput>(this DataOperationsCatalog catalog,
string path,
char separatorChar = TextLoader.Defaults.Separator,
bool hasHeader = TextLoader.Defaults.HasHeader,
IMultiStreamSource dataSample = null,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMultiStreamSource dataSample [](start = 12, length = 29)

Similar here.

bool trimWhitespace = TextLoader.Defaults.TrimWhitespace,
bool allowSparse = TextLoader.Defaults.AllowSparse)
=> TextLoader.CreateTextReader<TInput>(CatalogUtils.GetEnvironment(catalog), hasHeader, separatorChar, allowQuoting,
allowSparse, trimWhitespace, dataSample: dataSample);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dataSample [](start = 45, length = 10)

why pass the arguments this way? Why not add dataSample after hasHeader?

/// <param name="allowQuoting">Whether the input may include quoted values,
/// which can contain separator characters, colons,
/// and distinguish empty values from missing values. When true, consecutive separators
/// denote a missing value and an empty value is denoted by \"\".
/// When false, consecutive separators denote an empty value.</param>
/// <param name="trimWhitespace">Remove trailing whitespace from lines</param>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor nit - Add a period.

@ghost ghost locked as resolved and limited conversation to collaborators Mar 24, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
API Issues pertaining the friendly API
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Any reason for inconsistency in params for TextLoader?
5 participants