Skip to content

DataOperationsCatalog SaveAsText extension method is evil #2452

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
TomFinley opened this issue Feb 7, 2019 · 0 comments · Fixed by #2630
Closed

DataOperationsCatalog SaveAsText extension method is evil #2452

TomFinley opened this issue Feb 7, 2019 · 0 comments · Fixed by #2630
Assignees
Labels
API Issues pertaining the friendly API
Milestone

Comments

@TomFinley
Copy link
Contributor

TomFinley commented Feb 7, 2019

I was attempting to migrate some of our tests, when I discovered we have a few problems in our new API on text saving. I know @artidoro had some thoughts on text saving/loading so mentioning him. Also I know @sfilipi and @rogancarr are handling many issues w.r.t. API completeness and consistency as they work on samples and docs, so maybe they have some thoughts on this. (Of course everyone is free to chime in.)

From what I can see, whoever authored this method confused the defaults with the text loader and the text saver. (Note how the defaults used in the text saving utility method are coming from TextLoader, which is incorrect.) It might seem intuitive if you don't think about it too hard that text saving and loading should have the same defaults, but practically it becomes clear they should not. The situations where one is "loading" into ML.NET and "saving" out of ML.NET are in fact very different situations. When someone is using a text loader with non-default settings they're usually asking us to ingest their format (so we chose the most helpful defaults for that more common scenario), whereas our text saver makes some attempt at schema. (Note also that under default settings, the text loader loads our own format without trouble, since it detects that a schema and settings is embedded in the file itself.)

This is the offending method:

public static void SaveAsText(this DataOperationsCatalog catalog,
IDataView data,
Stream stream,
char separatorChar = TextLoader.DefaultArguments.Separator,
bool headerRow = TextLoader.DefaultArguments.HasHeader,
bool schema = true,
bool keepHidden = false)

As point of reference, this is that constant:

internal const bool HasHeader = false;

Now, compare this with the actual defaults on our text saver:

[Argument(ArgumentType.AtMostOnce, HelpText = "Separator", ShortName = "sep")]
public string Separator = "tab";
[Argument(ArgumentType.AtMostOnce, HelpText = "Force dense format", ShortName = "dense")]
public bool Dense;
// REVIEW: This and the corresponding BinarySaver option should be removed,
// with the silence being handled, somehow, at the environment level. (Task 6158846.)
[Argument(ArgumentType.LastOccurenceWins, HelpText = "Suppress any info output (not warnings or errors)", Hide = true)]
public bool Silent;
[Argument(ArgumentType.AtMostOnce, HelpText = "Output the comment containing the loader settings", ShortName = "schema")]
public bool OutputSchema = true;
[Argument(ArgumentType.AtMostOnce, HelpText = "Output the header", ShortName = "header")]
public bool OutputHeader = true;

  • The default for whether header row is saved has gone from true to false. The primary practical effect of this is, we're now dropping feature names (or more precisely, slot names) by default. This seems silly. Feature names are important. I think we ought to keep them by default.

  • We've lost the ability to force saving as dense format at all through this new API. This is often important for comprehensibility.

I ran into it while I was trying to clean up some of our tests to use more of the actual public surface. Consider this test.

using (var ch = Env.Start("save"))
{
var saver = new TextSaver(Env, new TextSaver.Arguments { Silent = true, Dense = true });
using (var fs = File.Create(outputPath))
DataSaverUtils.SaveDataView(ch, saver, data, fs, keepHidden: false);
}

That's kind of obnoxious, and not using our public API. I'd love to migrate it over to something like this:

using (var fs = File.Create(outputPath)) 
    ML.Data.SaveAsText(data, fs, keepHidden: false, forceDense: true);

But I can't specify equivalent settings because there's no way to force dense.

So I suggest this: we change the defaults of this text saving utility function to use the text saver defaults, instead of the text loader defaults, and also restore the ability to force a dense format.

Why forcing dense is kind of useful sometimes...

It may not be obvious why forcing to dense is kind of useful. So imagine this input file foo.txt.

1,2,3,4,5,6
0,0,1,0,0,0

Then imagine I have this MML command line:

dotnet MML.dll savedata loader=text{sep=comma} data=foo.txt dout=foo1.txt

The resulting foo1.txt file is this:

#@ TextLoader{
#@   header+
#@   sep=tab
#@   col=Label:R4:0
#@   col=Features:R4:1-5
#@ }
Label   5       0:""
1       2       3       4       5       6
6       2:1

What is that 6 2:1 line? It is encoding the information that this line has 6 fields, and the field at index 2 is the only one with a non-default value. That is, it has detected, "hey, this can be sparsely encoded," and it has done so. Similarly with this seemingly crazy Label 5 0:"" line. But sometimes we find that confusing!! I've spent over the years probably, cumulatively, days trying to explain the ins and outs of the mixed sparse/dense format. So we have this setting to say, "you know what, I don't care about efficient, give me a dense format."

dotnet MML.dll savedata loader=text{sep=comma} data=foo.txt dout=foo2.txt saver=text{dense+}

The result is this:

#@ TextLoader{
#@   header+
#@   sep=tab
#@   col=Label:R4:0
#@   col=Features:R4:1-5
#@ }
Label   ""      ""      ""      ""      ""
1       2       3       4       5       6
0       0       1       0       0       0

Less efficient? Sure. Eaier to understand? I'd say so. And lots of our tests use it, since our tests are usually writing small amounts of data and we found comprehensibility of test files to be valuable.

@TomFinley TomFinley added the API Issues pertaining the friendly API label Feb 7, 2019
@wschin wschin self-assigned this Feb 15, 2019
@shauheen shauheen added this to the 0219 milestone Feb 20, 2019
@ghost ghost locked as resolved and limited conversation to collaborators Mar 24, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
API Issues pertaining the friendly API
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants