-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Textloader internalizations, documentation, and Arguments refactoring #2417
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## master #2417 +/- ##
==========================================
- Coverage 71.55% 71.53% -0.02%
==========================================
Files 800 800
Lines 141846 141847 +1
Branches 16119 16119
==========================================
- Hits 101497 101474 -23
- Misses 35893 35922 +29
+ Partials 4456 4451 -5
|
@@ -199,6 +237,9 @@ internal bool IsValid() | |||
} | |||
} | |||
|
|||
/// <summary> | |||
/// Specifies the range of indices of input columns that should be mapped to an output column. | |||
/// </summary> | |||
public sealed class Range | |||
{ | |||
public Range() { } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
public Range() { } [](start = 11, length = 19)
is this used? #Pending
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is! And I think you are right it looks a bit strange. After discussing with Tom, I am going to open an issue about reshaping the public surface of TextLoader and fix it separately. Let's keep this PR only for internalization and refactoring of Arguments.
In reply to: 254397852 [](ancestors = 254397852)
I'm glad what you awere of git fu, but rebase and force push screw up codeflow, it get;s hard to track comments. Can you just merge and push? On other note, do you plan to update this PR? #Resolved |
Related issue: #2472 "Any reason for inconsistency in params for TextLoader?" #Resolved |
I am still trying to learn from master Tom about git magic, but I understand the problem with rebase once the PR is out. In reply to: 461882216 [](ancestors = 461882216) |
Absolutely, I will address that and further concerns regarding TextLoader in another PR. This one is just to reduce the current public surface to what is essential. In that issue, and in a future PR I think we should rethink what we expose and possibly simplify it. In reply to: 461952978 [](ancestors = 461952978) |
I have resolved Ivan's comments, so if @Ivanidzo4ka and @sfilipi could take another look it would be great. #Resolved |
/// Describes how an input column should be mapped to an <see cref="IDataView"/> column. | ||
/// </summary> | ||
/// <param name="name">Name of the column.</param> | ||
/// <param name="type"><see cref="DataKind"/> of the items in the column. If <see langword="null"/> defaults to a <see cref="DataKind.R4"/>.</param> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If [](start = 86, length = 25)
Just a rant.
Only reason why we have it is CommandLine in which you can say something like `col=ColA:0" and expect it to work.
On programmatic level idea of "Hey its a nullable field but if you set it null we will treat it as float" is weird.
If you want to change it, one tricky place is TryParse method line 123, just set Type straight to DataKind.Num since in other cases it can be never setup.
#Pending
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree it does not make much sense to have this as a nullable type. We should just have a default. I would like to address this in another PR as I mention in my other comment.
In reply to: 255805797 [](ancestors = 255805797)
@@ -334,66 +395,98 @@ internal bool TryUnparse(StringBuilder sb) | |||
} | |||
} | |||
|
|||
public class ArgumentsCore | |||
/// <summary> | |||
/// The settings for <see cref="TextLoader"/> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
settings [](start = 16, length = 8)
options. #Resolved
/// <summary> | ||
/// Whether the input may include quoted values, which can contain separator characters, colons, | ||
/// and distinguish empty values from missing values. When true, consecutive separators denote a | ||
/// missing value and an empty value is denoted by \"\". When false, consecutive separators denote an empty value. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"" [](start = 63, length = 4)
where should be specific documentation magic to properly represent such strings. ` maybe? #Pending
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my comment, but let me know if you think otherwise.
In reply to: 255806299 [](ancestors = 255806299)
|
||
/// <summary> | ||
/// Whether the input may include sparse representations. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whether the input may include sparse representations. [](start = 16, length = 53)
[citation needed] (https://xkcd.com/285/)
not sure how it should look like, but we should specify how sparse format looks like. it's a index:value separated pairs where each index should be bigger than previous one. #Pending
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also not sure can we have multiple sparse columns or not.
In reply to: 255806745 [](ancestors = 255806745)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would like to do another PR focusing on improving the public surface of text loader. I think these are all important points, and I will come back to them to improve the documentation.
Before I focus on improving the documentation for these public fields I would like to make sure that we want to expose them all and that the current form is the final one.
In reply to: 255806866 [](ancestors = 255806866,255806745)
|
||
/// <summary> | ||
/// Number of source columns in the text data. Default is that sparse rows contain their size information. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Number [](start = 16, length = 6)
would be nice to have example
theoretically we can have sparse format like SparseSize separtor pairs of index:value devided by separator
743 1:2 5:-1 8:0.4 742:1 743:2
not sure what would happen if we have 744:7 in that line. #Pending
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my other comment regarding another PR after discussing the public surface of TextLoader.
In reply to: 255807082 [](ancestors = 255807082)
|
||
/// <summary> | ||
/// The characters that should be used as separators column separator. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
separators column separator [](start = 53, length = 28)
does it sound right for you? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
string tmp = CmdParser.GetSettings(host, options, new Options() | ||
{ | ||
// It's fine if the user sets the following three arguments that are not considered core arguments. | ||
// Setting the defaults to the user provided values will avoid these in the output of the call CmdParser.GetSettings. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why we override them here? Can you override them in upstream function? #Pending
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GetSettings
builds a string with the fields in the second parameter options
that are different from the fields in the third parameter in this case new Options () {...}
. Previously, only the core arguments were considered when taking the difference between the user provided arguments and the defaults. Since now there is no distinction between core and non core arguments, I achieve exactly the same behavior as before by setting the non core arguments in the third parameter to the user provided values.
In reply to: 255808097 [](ancestors = 255808097)
@@ -1145,7 +1238,14 @@ private sealed class LoaderHolder | |||
|
|||
// Verify that the current schema-defining arguments are default. | |||
// Get settings just for core arguments, not everything. | |||
string tmp = CmdParser.GetSettings(host, args, new ArgumentsCore()); | |||
string tmp = CmdParser.GetSettings(host, options, new Options() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Null check.
You still have conflicting files :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @artidoro sorry for not looking at this sooner, I thought from the context that this was more a "public surface API" change moreso than anything architectural. Looks good, thank you.
goto LDone; | ||
|
||
cols = argsNew.Columns; | ||
// Set the non core arguments in the new options. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
non core arguments [](start = 27, length = 18)
perhaps we should clarify what are "non core" arguments here ? would be especially useful for the future.
without #2046 i did not have enough context to know what this meant
// If not, set error to true if there was an error condition. | ||
private static bool TryParseSchema(IHost host, IMultiStreamSource files, | ||
ref Options options, out Column[] cols, out bool error) | ||
{ | ||
host.AssertValue(host); | ||
host.AssertValue(files); | ||
host.CheckValue(options, nameof(options)); | ||
|
||
cols = null; | ||
error = false; | ||
|
||
// Verify that the current schema-defining arguments are default. | ||
// Get settings just for core arguments, not everything. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
core arguments [](start = 37, length = 14)
perhaps we should rephrase this , so it makes sense for future readers ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixes #2046.
Related to #1798.
This PR does the following:
Arguments
class from the original two classesArguments
andArgumentsCore
. The issue suggested the first approach out of two possible ways to do this, which is the one I followed in this PR.Arguments
toOptions
.