-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Code generate TextLoader API and enhance it with convenience API. #142
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…test data file instead of train file.
…to loader # Conflicts: # src/Microsoft.ML/TextLoader.cs
[Argument(ArgumentType.Required, ShortName = "data", HelpText = "Location of the input file", SortOrder = 1)] | ||
public IFileHandle InputFile; | ||
|
||
[Argument(ArgumentType.Required, ShortName = "args", HelpText = "Arguments", SortOrder = 1)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1 [](start = 101, length = 1)
SortOrder
is the same here. #Closed
else if (type == typeof(Double)) | ||
kind = DataKind.R8; | ||
else if (type == typeof(DvText)) | ||
else if (type == typeof(DvText) || type == typeof(string)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Zeeshan, this mapping here is inappropriate. Actually the below bool
below is wrong as well. This is not a many-to-one utility mapping of "close enough." It means that if something has that data kind, then the underlying .NET representation type will be that, and vice versa.
Remove both. You'll unfortunately have to figure out what code is relying on the bool
mistake below and fix that too, to use something more appropriate... #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's another utility function I mentioned in the other code review, GetDataType
, in the ModuleArgs.cs
that has this more "informal" mapping. But assuredly the core definitions of IDataView
and its supporting type structures are not the place to have this sort of stuff. :D
In reply to: 188000280 [](ancestors = 188000280)
[Argument(ArgumentType.AtMostOnce, Visibility = ArgumentAttribute.VisibilityType.CmdLineOnly, HelpText = "Source column separator. Options: tab, space, comma, single character", ShortName = "sep")] | ||
public string Separator = "tab"; | ||
|
||
[Argument(ArgumentType.AtMostOnce, Visibility = ArgumentAttribute.VisibilityType.EntryPointsOnly, HelpText = "Source column separator. Option: single character ONLY", ShortName = "del")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Option: single character ONLY [](start = 147, length = 29)
I think the fact that it's a single character is implied by the fact that the argument is of type char
. We don't usually clarify that non-array arguments can only have a single value specified, since that information is part of the type anyway. #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Though it would be nice if this argument was as capable as Separator
, which can support multiple delimiters.
In reply to: 188001984 [](ancestors = 188001984)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When you shift to multiple delimiters, take the same care as is taken with Separator
to ensure uniqueness, etc.
In reply to: 188006909 [](ancestors = 188006909,188001984)
|
||
_host.CheckNonEmpty(args.Separator, nameof(args.Separator), "Must specify a separator"); | ||
string sep = args.Separator.ToLowerInvariant(); | ||
_host.CheckNonEmpty(args.Delimiter.ToString(), nameof(args.Delimiter), "Must specify a delimeter"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
args.Delimiter.ToString(), [](start = 32, length = 26)
What is the purpose of this check? Under what circumstances do we suppose that a character, when represented as a string, will be empty? #Closed
if (sep == ",") | ||
_separators = new char[] { ',' }; | ||
if (args.Delimiter != '\t') | ||
_separators = new char[] { args.Delimiter }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe comment on the code how this relies on the fact that the default arg.Separator
is "tab"
, since to a reader this will look odd. That is, it may look like a mistake, even though it isn't one. #Closed
|
||
<PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|AnyCPU'"> | ||
<NoWarn>;1591;0618</NoWarn> | ||
</PropertyGroup> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we clarify this? What warnings are these, and what is the relationship to your change? #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had added 0618 but now I have removed it and applied to code sections.
In reply to: 188006377 [](ancestors = 188006377)
[Argument(ArgumentType.AtMostOnce, Visibility = ArgumentAttribute.VisibilityType.CmdLineOnly, HelpText = "Source column separator. Options: tab, space, comma, single character", ShortName = "sep")] | ||
public string Separator = "tab"; | ||
|
||
[Argument(ArgumentType.AtMostOnce, Visibility = ArgumentAttribute.VisibilityType.EntryPointsOnly, HelpText = "Source column separator. Option: single character ONLY", ShortName = "del")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Argument [](start = 13, length = 8)
This is a little unfortunate... on the one hand we have Separator
. On the other hand we have Delimiter
. I know you're doing this to avoid a name collision, but it would be nice if, from the user's perspective, the same concept had the same name in API usage and command line usage.
I think the Name
attribute on Argument
exists for this purpose. You can name the field whatever (maybe Delimiter
could instead be SeparatorChar
, or SeparatorChars
if you fix it to be char[]
instead). Then you can use Name = nameof(Separator)
to make sure it has the same name in command line usage. The shortname I hope can I think be consistent between the two as well. #Closed
Assert.NotNull(new TextLoader<Input>("fakeFile.txt", useHeader: false, separator: "tab", supportSparse: false)); | ||
Assert.NotNull(new TextLoader<Input>("fakeFile.txt", useHeader: false, separator: "tab", allowQuotedStrings: false)); | ||
Assert.NotNull(new TextLoader<Input>("fakeFile.txt", delimeter: ',')); | ||
Assert.NotNull(new TextLoader<Input>("fakeFile.txt", useHeader: false, delimeter: ',')); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
',' [](start = 94, length = 3)
I'm not sure it matters given the nature of this test, but was there any particular reason to shift from tab
to ,
?
Given how you've made \t
an exceptional case in the constructor, might want to do both. #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
<PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|AnyCPU'"> | ||
<AllowUnsafeBlocks>true</AllowUnsafeBlocks> | ||
<PlatformTarget>AnyCPU</PlatformTarget> | ||
<NoWarn>1701;1702;1705;0618</NoWarn> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps if the warning really is inappropriate in one very specific case, we could mark the very specific case rather than suppressing it. By doing this, forevermore, we will never, under any circumstances, be aware when these warning is generated. And if I look up some of these, some of these are good warnings, things we will probably want to be aware of. #Closed
new TextLoaderColumn() | ||
{ | ||
Name = "Id", | ||
Source = new [] { new TextLoaderRange() { Min = 0, Max = 0} }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[](start = 40, length = 1)
Maybe VS could format the file to clean up thse whitespace issues, I think default VS formatting is new[]
not new []
. #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// </summary> | ||
public KeyRange KeyRange { get; set; } | ||
|
||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An observation:
These classes are generated from the entry-point, which is somewhat unfortunate since we don't have the ability to set some common conveniences. Like say, we have TextLoaderRange[] Source
, but there are two common cases: the case where only a single range is specified for the vector valued column case, and the case where only a single value is specified. If we could somehow add to this class, we could add some conveniences to allow this to be set.
I wonder: if we changed the CSharpCodeGenerator.cs
code to make these partial classes, then we could keep this auto-generated file. But then there could be another file in the same assembly containing hand written enhancements to these classes to make them more convenient. I suppose it all depends on whether we view these classes as being something the user will often directly interact with (in which case, a convenience may be appropriate), or whether our view is that these entry-point generated APIs will generally be wrapped, as is currently the case in this Microsoft.ML./Data/TextLoader.cs
file. (in which case convenience at this level are irrelevant).
If we did have partial classes, then we could add some facilities to TextLoaderColumn
maybe, to allow (1) single column setting, or (2) single range setting. Or add them as constructors to TextLoaderRange
itself, either way. #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I too wanted to do this as I converted test cases to use the code generated API and realized it was inconvenient to specify an ordinal twice when it could be done by specifying once.
In reply to: 188015114 [](ancestors = 188015114)
src/Microsoft.ML/Data/TextLoader.cs
Outdated
var col = Runtime.Data.TextLoader.Column.Parse( | ||
$"{name}:" + | ||
$"{dk.ToString()}:" + | ||
$"{mappingAttr.Ordinal}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is unnecessary. We only needed to parse out Ordinal
. We could use TextLoader.Range(Ordinal)
for that, and take the name and kind as is.
Alternately, we could exploit the fact that we now have a range type defined, and use that instead of making our users learn the mini-language. #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still a big improvement over the old system of course.
In reply to: 188021126 [](ancestors = 188021126)
/// Import a dataset from a text file | ||
/// </summary> | ||
[Obsolete("Use TextLoader instead.")] | ||
public sealed partial class CustomTextLoader |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice #ByDesign
|
||
[TlcModule.EntryPoint(Name = "Data.TextLoader", Desc = "Import a dataset from a text file")] | ||
#pragma warning disable 0618 | ||
[Obsolete("Use TextLoader instead.", false)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
, false [](start = 43, length = 7)
This is the default for the overload without it specified anyway, so it can be omitted. #Closed
src/Microsoft.ML/CSharpApi.cs
Outdated
/// Import a dataset from a text file | ||
/// </summary> | ||
public sealed partial class TextLoader | ||
public partial class TextLoader : Microsoft.ML.ILearningPipelineLoader |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
public partial [](start = 8, length = 14)
So why not sealed? #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I see. Because of that other issue. Reseal, maybe make another part of the partial class to introduce a Create<TInput>
convenience method that returns a TextLoader
.
In reply to: 188022924 [](ancestors = 188022924)
src/Microsoft.ML/Data/TextLoader.cs
Outdated
|
||
namespace Microsoft.ML.Data | ||
{ | ||
public sealed class TextLoader<TInput> : TextLoader |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TextLoader [](start = 45, length = 10)
What is the purpose of making this descend from the other?
AFAIK this thing has to be a ILearningPipelineLoader
, and is intended to be a convenience wrapper over the original TextLoader
. #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having read this, it looks like this "subclass"'s only job is to take the type and a few constructor arguments, and that's the beginning and end of what it is used for.
This is a fairly odd use of inheritance, one without parallel in our codebase. Further it violates the general practice of non-sealed/abstract classes. This ought to be structured differently, either as a static generic utility method on top of TextLoader
, but to declare a new class just for that convenience is odd. If you really want it to be a separate class (which, really, it shouldn't be), then it'll have to be a wrapper as the old loader was structured to be.
In reply to: 188032656 [](ancestors = 188032656)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's also misleading. Let's imagine that I have a TextLoader<T>
. Is T
used anywhere? Nope. Is T
even slightly relevant? Not really -- by defining this to be an inheritance relationship, you are retaining the mutable properties that usual usage of TextLoader
depends on. Which means, a user is free to change/ignore any information garnered from T
. :) I know what a List<T>
is. I know what an IEnumerable<T>
is. I, even if I'm totally familiar with this class, have absolutely no idea whatsoever what a TextLoader<T>
is.
We're using inheritance, but as soon as we rely on any of the schema setting functionality of that base class, the fact that this type is generic on <T>
is no longer useful information. That's I'd say another sign something's wrong.
In reply to: 188038035 [](ancestors = 188038035,188032656)
public sealed class EntryPointInfo | ||
{ | ||
public readonly bool NoSeal; | ||
public readonly string Name; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's revert this change of NoSeal
. (The other change, the ObsoleteAttribute
change is good, so please don't revert the whole file.)
Firstly, there's the point as made elsewhere that inheritance wasn't the right way to solve that problem, but there's a larger point...
The EntryPointAttribute
is a description of the component. You wanted that class to be unsealed specifically because you had an idea of how to structure the C# API specifically. Setting aside the merits of that idea, or even supposing that it was a good idea, we should avoid putting things that are so extremely specific stuff in one environment (in this case, the code in CSharpApiGenerator.cs
) in the entry-point definition meant to service everyone. Let's think about everything else in this info class... it's useful and necessary practically everywhere. This is not the case with NoSeal
.
#Closed
private string _inputFilePath; | ||
private string CustomSchema; | ||
private Data.TextLoader ImportTextInput; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not a fan of this old code, but it had the benefit that TInput
meant something, since we had assurances (as it was wrapped as a private field) that the schema produced by this thing would be faithful to the type. (That's not a good reason to make a generic type, but it is at least somewhat understandable.) Now we're doing inheritance, it means nothing. If you want to generate a TextLoader
, make a method to generate a text loader. #Closed
…to loader # Conflicts: # src/Microsoft.ML/LearningPipeline.cs
src/Microsoft.ML/Data/TextLoader.cs
Outdated
public TextLoaderRange(int ordinal) | ||
{ | ||
|
||
Contracts.Assert(ordinal >= 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Contracts.Assert(ordinal >= 0); [](start = 12, length = 31)
Not asserts! These are public facing methods. #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More detail: Asserts are debug only constructs, and we only use them to indicate what we expect is true -- e.g., verification of invariants in code. These should be .CheckParam
, etc.
In reply to: 189048133 [](ancestors = 189048133)
src/Microsoft.ML/Data/TextLoader.cs
Outdated
|
||
/// <summary> | ||
/// Convenience constructor for the scalar case. | ||
/// Min and Max are set to the single value ordinal. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Min [](start = 12, length = 3)
<see
, also ordinal
should be in a <paramref
. #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TomFinley These rules could be enforced as errors too. Though the problem currently is that many of the interfaces aren't commented. I'm not sure if going in stages could be a viable alternative.
A good take on commenting: dotnet/orleans#209 (comment). #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding the <paramref
. Min and max still not in a <see
In reply to: 189082232 [](ancestors = 189082232)
src/Microsoft.ML/Data/TextLoader.cs
Outdated
/// Convenience constructor for the scalar case. | ||
/// Min and Max are set to the single value ordinal. | ||
/// When a given column spans only a single column in the | ||
/// dataset. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're using two distinct meanings of the word column in one sentence. This is fairly confusing. #Closed
src/Microsoft.ML/Data/TextLoader.cs
Outdated
/// Convenience constructor for the scalar case. | ||
/// Min and Max are set to the single value ordinal. | ||
/// When a given column spans only a single column in the | ||
/// dataset. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not a sentence. It almost works if you were to append it to the first sentence, followed by a comma. #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I concur, as a non-native speaker, I'd go to read the code (and be slightly annoyed). :) #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still not a sentence. Perhaps reading this summary comment aloud to yourself would help?
In reply to: 189083569 [](ancestors = 189083569)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh OK... I see what went wrong here.
Append this third "sentence" to the first sentence followed by a comma.
By this I mean, I am not proposing the following transformation.
"I like to run. When the sun shines." => "I like to run. When the sun shines," because that's still not a sentence and its punctuation makes no sense.
I am proposing the following transformation.
"I like to run, when the sun shines."
In reply to: 189340150 [](ancestors = 189340150,189083569)
src/Microsoft.ML/Data/TextLoader.cs
Outdated
/// <summary> | ||
/// Convenience constructor for the vector case. | ||
/// When a given column in the schema spans contiguous columns in the dataset, | ||
/// <see cref="ML.Runtime.Data.TextLoader.Range"/> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[](start = 12, length = 46)
So, why are we having code in the public API reference classes in the runtime namespaces? #Closed
src/Microsoft.ML/Data/TextLoader.cs
Outdated
{ | ||
|
||
Contracts.CheckParam(min >= 0, nameof(min)); | ||
Contracts.CheckParam(max >= 0, nameof(max)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Max should be >= than min. You also do not have any message at all for the user for what actually happened.
The TextLoader
class is arguably one of the most important components in this API, wouldn't we say? If a user misuses it, and we throw an exception, we really ought to be clear. #Closed
/// When a given column in the schema spans contiguous columns in the dataset, | ||
/// <see cref="ML.Runtime.Data.TextLoader.Range"/> | ||
/// </summary> | ||
public TextLoaderRange(int min, int max) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
<param
tags? Isn't it important to explain these? #Closed
src/Microsoft.ML/Data/TextLoader.cs
Outdated
{ | ||
|
||
Contracts.CheckParam(min >= 0, nameof(min), "Cannot be negative number."); | ||
Contracts.CheckParam(max >= min, nameof(max), $"Cannot be less than {nameof(min)}."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
$"Cannot be less than {nameof(min)}." [](start = 58, length = 37)
Note that this $
interpolation means this isn't a constant. Note however, that "Cannot be less than " + nameof(min) + "."
would be a constant, and so we could use it in a Check
error message. #Resolved
src/Microsoft.ML/Data/TextLoader.cs
Outdated
public TextLoaderRange(int min, int max) | ||
{ | ||
|
||
Contracts.CheckParam(min >= 0, nameof(min), "Cannot be negative number."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cannot be negative number. [](start = 57, length = 26)
Make consistent with the above "Cannot be a negative number." #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
…to loader # Conflicts: # src/Microsoft.ML.Core/Data/DataKind.cs
*Code generate support for IDataLoader
*Make TextLoader API code generated so that it's at functional parity with the text loader in the ML.Net infrastructure.
*Move TextLoader API under Microsoft.ML.Data namespace
*Add convenience TextLoader API.
*Add error checking for invalid loader arguments such as ordinal, column names.
*Update baselines.
*Update samples with new loader API and backward compatibility with old loader API.
ADDRESS COMMENTS FROM PR# 38
fixes #15