Skip to content

Conversation

codemzs
Copy link
Member

@codemzs codemzs commented May 14, 2018

*Code generate support for IDataLoader
*Make TextLoader API code generated so that it's at functional parity with the text loader in the ML.Net infrastructure.
*Move TextLoader API under Microsoft.ML.Data namespace
*Add convenience TextLoader API.
*Add error checking for invalid loader arguments such as ordinal, column names.
*Update baselines.
*Update samples with new loader API and backward compatibility with old loader API.

ADDRESS COMMENTS FROM PR# 38
fixes #15

@codemzs codemzs requested review from eerhardt and TomFinley May 14, 2018 08:43
[Argument(ArgumentType.Required, ShortName = "data", HelpText = "Location of the input file", SortOrder = 1)]
public IFileHandle InputFile;

[Argument(ArgumentType.Required, ShortName = "args", HelpText = "Arguments", SortOrder = 1)]
Copy link
Contributor

@TomFinley TomFinley May 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 [](start = 101, length = 1)

SortOrder is the same here. #Closed

else if (type == typeof(Double))
kind = DataKind.R8;
else if (type == typeof(DvText))
else if (type == typeof(DvText) || type == typeof(string))
Copy link
Contributor

@TomFinley TomFinley May 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Zeeshan, this mapping here is inappropriate. Actually the below bool below is wrong as well. This is not a many-to-one utility mapping of "close enough." It means that if something has that data kind, then the underlying .NET representation type will be that, and vice versa.

Remove both. You'll unfortunately have to figure out what code is relying on the bool mistake below and fix that too, to use something more appropriate... #Closed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's another utility function I mentioned in the other code review, GetDataType, in the ModuleArgs.cs that has this more "informal" mapping. But assuredly the core definitions of IDataView and its supporting type structures are not the place to have this sort of stuff. :D


In reply to: 188000280 [](ancestors = 188000280)

[Argument(ArgumentType.AtMostOnce, Visibility = ArgumentAttribute.VisibilityType.CmdLineOnly, HelpText = "Source column separator. Options: tab, space, comma, single character", ShortName = "sep")]
public string Separator = "tab";

[Argument(ArgumentType.AtMostOnce, Visibility = ArgumentAttribute.VisibilityType.EntryPointsOnly, HelpText = "Source column separator. Option: single character ONLY", ShortName = "del")]
Copy link
Contributor

@TomFinley TomFinley May 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Option: single character ONLY [](start = 147, length = 29)

I think the fact that it's a single character is implied by the fact that the argument is of type char. We don't usually clarify that non-array arguments can only have a single value specified, since that information is part of the type anyway. #Closed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Though it would be nice if this argument was as capable as Separator, which can support multiple delimiters.


In reply to: 188001984 [](ancestors = 188001984)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When you shift to multiple delimiters, take the same care as is taken with Separator to ensure uniqueness, etc.


In reply to: 188006909 [](ancestors = 188006909,188001984)


_host.CheckNonEmpty(args.Separator, nameof(args.Separator), "Must specify a separator");
string sep = args.Separator.ToLowerInvariant();
_host.CheckNonEmpty(args.Delimiter.ToString(), nameof(args.Delimiter), "Must specify a delimeter");
Copy link
Contributor

@TomFinley TomFinley May 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

args.Delimiter.ToString(), [](start = 32, length = 26)

What is the purpose of this check? Under what circumstances do we suppose that a character, when represented as a string, will be empty? #Closed

if (sep == ",")
_separators = new char[] { ',' };
if (args.Delimiter != '\t')
_separators = new char[] { args.Delimiter };
Copy link
Contributor

@TomFinley TomFinley May 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe comment on the code how this relies on the fact that the default arg.Separator is "tab", since to a reader this will look odd. That is, it may look like a mistake, even though it isn't one. #Closed


<PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|AnyCPU'">
<NoWarn>;1591;0618</NoWarn>
</PropertyGroup>
Copy link
Contributor

@TomFinley TomFinley May 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we clarify this? What warnings are these, and what is the relationship to your change? #Closed

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had added 0618 but now I have removed it and applied to code sections.


In reply to: 188006377 [](ancestors = 188006377)

[Argument(ArgumentType.AtMostOnce, Visibility = ArgumentAttribute.VisibilityType.CmdLineOnly, HelpText = "Source column separator. Options: tab, space, comma, single character", ShortName = "sep")]
public string Separator = "tab";

[Argument(ArgumentType.AtMostOnce, Visibility = ArgumentAttribute.VisibilityType.EntryPointsOnly, HelpText = "Source column separator. Option: single character ONLY", ShortName = "del")]
Copy link
Contributor

@TomFinley TomFinley May 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Argument [](start = 13, length = 8)

This is a little unfortunate... on the one hand we have Separator. On the other hand we have Delimiter. I know you're doing this to avoid a name collision, but it would be nice if, from the user's perspective, the same concept had the same name in API usage and command line usage.

I think the Name attribute on Argument exists for this purpose. You can name the field whatever (maybe Delimiter could instead be SeparatorChar, or SeparatorChars if you fix it to be char[] instead). Then you can use Name = nameof(Separator) to make sure it has the same name in command line usage. The shortname I hope can I think be consistent between the two as well. #Closed

Assert.NotNull(new TextLoader<Input>("fakeFile.txt", useHeader: false, separator: "tab", supportSparse: false));
Assert.NotNull(new TextLoader<Input>("fakeFile.txt", useHeader: false, separator: "tab", allowQuotedStrings: false));
Assert.NotNull(new TextLoader<Input>("fakeFile.txt", delimeter: ','));
Assert.NotNull(new TextLoader<Input>("fakeFile.txt", useHeader: false, delimeter: ','));
Copy link
Contributor

@TomFinley TomFinley May 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

',' [](start = 94, length = 3)

I'm not sure it matters given the nature of this test, but was there any particular reason to shift from tab to ,?

Given how you've made \t an exceptional case in the constructor, might want to do both. #Closed

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was on an accident. Reverted back.


In reply to: 188011098 [](ancestors = 188011098)

<PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|AnyCPU'">
<AllowUnsafeBlocks>true</AllowUnsafeBlocks>
<PlatformTarget>AnyCPU</PlatformTarget>
<NoWarn>1701;1702;1705;0618</NoWarn>
Copy link
Contributor

@TomFinley TomFinley May 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps if the warning really is inappropriate in one very specific case, we could mark the very specific case rather than suppressing it. By doing this, forevermore, we will never, under any circumstances, be aware when these warning is generated. And if I look up some of these, some of these are good warnings, things we will probably want to be aware of. #Closed

new TextLoaderColumn()
{
Name = "Id",
Source = new [] { new TextLoaderRange() { Min = 0, Max = 0} },
Copy link
Contributor

@TomFinley TomFinley May 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[](start = 40, length = 1)

Maybe VS could format the file to clean up thse whitespace issues, I think default VS formatting is new[] not new []. #Closed

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried ctrl-k-d but still there is space.


In reply to: 188013046 [](ancestors = 188013046)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah. How strange. Oh well.


In reply to: 188121093 [](ancestors = 188121093,188013046)

/// </summary>
public KeyRange KeyRange { get; set; }

}
Copy link
Contributor

@TomFinley TomFinley May 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An observation:

These classes are generated from the entry-point, which is somewhat unfortunate since we don't have the ability to set some common conveniences. Like say, we have TextLoaderRange[] Source, but there are two common cases: the case where only a single range is specified for the vector valued column case, and the case where only a single value is specified. If we could somehow add to this class, we could add some conveniences to allow this to be set.

I wonder: if we changed the CSharpCodeGenerator.cs code to make these partial classes, then we could keep this auto-generated file. But then there could be another file in the same assembly containing hand written enhancements to these classes to make them more convenient. I suppose it all depends on whether we view these classes as being something the user will often directly interact with (in which case, a convenience may be appropriate), or whether our view is that these entry-point generated APIs will generally be wrapped, as is currently the case in this Microsoft.ML./Data/TextLoader.cs file. (in which case convenience at this level are irrelevant).

If we did have partial classes, then we could add some facilities to TextLoaderColumn maybe, to allow (1) single column setting, or (2) single range setting. Or add them as constructors to TextLoaderRange itself, either way. #Closed

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I too wanted to do this as I converted test cases to use the code generated API and realized it was inconvenient to specify an ordinal twice when it could be done by specifying once.


In reply to: 188015114 [](ancestors = 188015114)

var col = Runtime.Data.TextLoader.Column.Parse(
$"{name}:" +
$"{dk.ToString()}:" +
$"{mappingAttr.Ordinal}"
Copy link
Contributor

@TomFinley TomFinley May 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is unnecessary. We only needed to parse out Ordinal. We could use TextLoader.Range(Ordinal) for that, and take the name and kind as is.

Alternately, we could exploit the fact that we now have a range type defined, and use that instead of making our users learn the mini-language. #Closed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still a big improvement over the old system of course.


In reply to: 188021126 [](ancestors = 188021126)

/// Import a dataset from a text file
/// </summary>
[Obsolete("Use TextLoader instead.")]
public sealed partial class CustomTextLoader
Copy link
Contributor

@TomFinley TomFinley May 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice #ByDesign


[TlcModule.EntryPoint(Name = "Data.TextLoader", Desc = "Import a dataset from a text file")]
#pragma warning disable 0618
[Obsolete("Use TextLoader instead.", false)]
Copy link
Contributor

@TomFinley TomFinley May 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

, false [](start = 43, length = 7)

This is the default for the overload without it specified anyway, so it can be omitted. #Closed

/// Import a dataset from a text file
/// </summary>
public sealed partial class TextLoader
public partial class TextLoader : Microsoft.ML.ILearningPipelineLoader
Copy link
Contributor

@TomFinley TomFinley May 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

public partial [](start = 8, length = 14)

So why not sealed? #Closed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see. Because of that other issue. Reseal, maybe make another part of the partial class to introduce a Create<TInput> convenience method that returns a TextLoader.


In reply to: 188022924 [](ancestors = 188022924)


namespace Microsoft.ML.Data
{
public sealed class TextLoader<TInput> : TextLoader
Copy link
Contributor

@TomFinley TomFinley May 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TextLoader [](start = 45, length = 10)

What is the purpose of making this descend from the other?

AFAIK this thing has to be a ILearningPipelineLoader, and is intended to be a convenience wrapper over the original TextLoader. #Closed

Copy link
Contributor

@TomFinley TomFinley May 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having read this, it looks like this "subclass"'s only job is to take the type and a few constructor arguments, and that's the beginning and end of what it is used for.

This is a fairly odd use of inheritance, one without parallel in our codebase. Further it violates the general practice of non-sealed/abstract classes. This ought to be structured differently, either as a static generic utility method on top of TextLoader, but to declare a new class just for that convenience is odd. If you really want it to be a separate class (which, really, it shouldn't be), then it'll have to be a wrapper as the old loader was structured to be.


In reply to: 188032656 [](ancestors = 188032656)

Copy link
Contributor

@TomFinley TomFinley May 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's also misleading. Let's imagine that I have a TextLoader<T>. Is T used anywhere? Nope. Is T even slightly relevant? Not really -- by defining this to be an inheritance relationship, you are retaining the mutable properties that usual usage of TextLoader depends on. Which means, a user is free to change/ignore any information garnered from T. :) I know what a List<T> is. I know what an IEnumerable<T> is. I, even if I'm totally familiar with this class, have absolutely no idea whatsoever what a TextLoader<T> is.

We're using inheritance, but as soon as we rely on any of the schema setting functionality of that base class, the fact that this type is generic on <T> is no longer useful information. That's I'd say another sign something's wrong.


In reply to: 188038035 [](ancestors = 188038035,188032656)

public sealed class EntryPointInfo
{
public readonly bool NoSeal;
public readonly string Name;
Copy link
Contributor

@TomFinley TomFinley May 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's revert this change of NoSeal. (The other change, the ObsoleteAttribute change is good, so please don't revert the whole file.)

Firstly, there's the point as made elsewhere that inheritance wasn't the right way to solve that problem, but there's a larger point...

The EntryPointAttribute is a description of the component. You wanted that class to be unsealed specifically because you had an idea of how to structure the C# API specifically. Setting aside the merits of that idea, or even supposing that it was a good idea, we should avoid putting things that are so extremely specific stuff in one environment (in this case, the code in CSharpApiGenerator.cs) in the entry-point definition meant to service everyone. Let's think about everything else in this info class... it's useful and necessary practically everywhere. This is not the case with NoSeal.
#Closed

private string _inputFilePath;
private string CustomSchema;
private Data.TextLoader ImportTextInput;

Copy link
Contributor

@TomFinley TomFinley May 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a fan of this old code, but it had the benefit that TInput meant something, since we had assurances (as it was wrapped as a private field) that the schema produced by this thing would be faithful to the type. (That's not a good reason to make a generic type, but it is at least somewhat understandable.) Now we're doing inheritance, it means nothing. If you want to generate a TextLoader, make a method to generate a text loader. #Closed

@shauheen shauheen added the API Issues pertaining the friendly API label May 17, 2018
public TextLoaderRange(int ordinal)
{

Contracts.Assert(ordinal >= 0);
Copy link
Contributor

@TomFinley TomFinley May 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Contracts.Assert(ordinal >= 0); [](start = 12, length = 31)

Not asserts! These are public facing methods. #Closed

Copy link
Contributor

@TomFinley TomFinley May 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More detail: Asserts are debug only constructs, and we only use them to indicate what we expect is true -- e.g., verification of invariants in code. These should be .CheckParam, etc.


In reply to: 189048133 [](ancestors = 189048133)


/// <summary>
/// Convenience constructor for the scalar case.
/// Min and Max are set to the single value ordinal.
Copy link
Contributor

@TomFinley TomFinley May 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Min [](start = 12, length = 3)

<see, also ordinal should be in a <paramref. #Closed

Copy link
Contributor

@veikkoeeva veikkoeeva May 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TomFinley These rules could be enforced as errors too. Though the problem currently is that many of the interfaces aren't commented. I'm not sure if going in stages could be a viable alternative.

A good take on commenting: dotnet/orleans#209 (comment). #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, Veikko.


In reply to: 189102883 [](ancestors = 189102883)

Copy link
Contributor

@TomFinley TomFinley May 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding the <paramref. Min and max still not in a <see


In reply to: 189082232 [](ancestors = 189082232)

/// Convenience constructor for the scalar case.
/// Min and Max are set to the single value ordinal.
/// When a given column spans only a single column in the
/// dataset.
Copy link
Contributor

@TomFinley TomFinley May 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're using two distinct meanings of the word column in one sentence. This is fairly confusing. #Closed

/// Convenience constructor for the scalar case.
/// Min and Max are set to the single value ordinal.
/// When a given column spans only a single column in the
/// dataset.
Copy link
Contributor

@TomFinley TomFinley May 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a sentence. It almost works if you were to append it to the first sentence, followed by a comma. #Closed

Copy link
Contributor

@veikkoeeva veikkoeeva May 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I concur, as a non-native speaker, I'd go to read the code (and be slightly annoyed). :) #Resolved

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still not a sentence. Perhaps reading this summary comment aloud to yourself would help?


In reply to: 189083569 [](ancestors = 189083569)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh OK... I see what went wrong here.

Append this third "sentence" to the first sentence followed by a comma.

By this I mean, I am not proposing the following transformation.

"I like to run. When the sun shines." => "I like to run. When the sun shines," because that's still not a sentence and its punctuation makes no sense.

I am proposing the following transformation.

"I like to run, when the sun shines."


In reply to: 189340150 [](ancestors = 189340150,189083569)

/// <summary>
/// Convenience constructor for the vector case.
/// When a given column in the schema spans contiguous columns in the dataset,
/// <see cref="ML.Runtime.Data.TextLoader.Range"/>
Copy link
Contributor

@TomFinley TomFinley May 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[](start = 12, length = 46)

So, why are we having code in the public API reference classes in the runtime namespaces? #Closed

{

Contracts.CheckParam(min >= 0, nameof(min));
Contracts.CheckParam(max >= 0, nameof(max));
Copy link
Contributor

@TomFinley TomFinley May 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Max should be >= than min. You also do not have any message at all for the user for what actually happened.

The TextLoader class is arguably one of the most important components in this API, wouldn't we say? If a user misuses it, and we throw an exception, we really ought to be clear. #Closed

/// When a given column in the schema spans contiguous columns in the dataset,
/// <see cref="ML.Runtime.Data.TextLoader.Range"/>
/// </summary>
public TextLoaderRange(int min, int max)
Copy link
Contributor

@TomFinley TomFinley May 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

<param tags? Isn't it important to explain these? #Closed

{

Contracts.CheckParam(min >= 0, nameof(min), "Cannot be negative number.");
Contracts.CheckParam(max >= min, nameof(max), $"Cannot be less than {nameof(min)}.");
Copy link
Contributor

@TomFinley TomFinley May 21, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

$"Cannot be less than {nameof(min)}." [](start = 58, length = 37)

Note that this $ interpolation means this isn't a constant. Note however, that "Cannot be less than " + nameof(min) + "." would be a constant, and so we could use it in a Check error message. #Resolved

public TextLoaderRange(int min, int max)
{

Contracts.CheckParam(min >= 0, nameof(min), "Cannot be negative number.");
Copy link
Contributor

@TomFinley TomFinley May 21, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cannot be negative number. [](start = 57, length = 26)

Make consistent with the above "Cannot be a negative number." #Resolved

Copy link
Contributor

@TomFinley TomFinley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

Copy link

@yaeldekel yaeldekel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

codemzs added 2 commits May 21, 2018 19:11
…to loader

# Conflicts:
#	src/Microsoft.ML.Core/Data/DataKind.cs
@codemzs codemzs merged commit 86f5ee6 into dotnet:master May 22, 2018
@codemzs codemzs deleted the loader branch May 22, 2018 02:32
@ghost ghost locked as resolved and limited conversation to collaborators Mar 30, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
API Issues pertaining the friendly API enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Need to support more types than string and float
5 participants