DataOperationsCatalog SaveAsText extension method is evil

I was attempting to migrate some of our tests, when I discovered we have a few problems in our new API on text saving. I know @artidoro had some thoughts on text saving/loading so mentioning him. Also I know @sfilipi and @rogancarr are handling many issues w.r.t. API completeness and consistency as they work on samples and docs, so maybe they have some thoughts on this. (Of course everyone is free to chime in.)

From what I can see, whoever authored this method confused the defaults with the text *loader* and the text *saver*. (Note how the defaults used in the text saving utility method are coming from `TextLoader`, which is incorrect.) It might seem intuitive if you don't think about it too hard that text saving and loading should have the same defaults, but practically it becomes clear they should not. The situations where one is "loading" into ML.NET and "saving" out of ML.NET are in fact very different situations. When someone is using a text loader ***with non-default settings*** they're usually asking us to ingest *their* format (so we chose the most helpful defaults for that more common scenario), whereas our text saver makes some attempt at schema. (Note also that under default settings, the text loader loads our own format without trouble, since it detects that a schema and settings is embedded in the file itself.)

This is the offending method:

https://github.com/dotnet/machinelearning/blob/e192a18d553a71ded3728495e9aaf4349169290d/src/Microsoft.ML.Data/DataLoadSave/Text/TextLoaderSaverCatalog.cs#L147-L153

As point of reference, this is that constant:

https://github.com/dotnet/machinelearning/blob/e192a18d553a71ded3728495e9aaf4349169290d/src/Microsoft.ML.Data/DataLoadSave/Text/TextLoader.cs#L401

Now, compare this with the *actual* defaults on our text saver:

https://github.com/dotnet/machinelearning/blob/e192a18d553a71ded3728495e9aaf4349169290d/src/Microsoft.ML.Data/DataLoadSave/Text/TextSaver.cs#L28-L43

* The default for whether header row is saved has gone from `true` to `false`. The primary practical effect of this is, we're now dropping feature names (or more precisely, slot names) by default. This seems silly. Feature names are *important*. I think we ought to keep them by default.

* We've lost the ability to force saving as dense format *at all* through this new API. This is often important for comprehensibility.

I ran into it while I was trying to clean up some of our tests to use more of the actual public surface. Consider this test.

https://github.com/dotnet/machinelearning/blob/e192a18d553a71ded3728495e9aaf4349169290d/test/Microsoft.ML.Tests/Transformers/ConcatTests.cs#L133-L138

That's kind of obnoxious, and not using our public API. I'd love to migrate it over to something like this:

```csharp
using (var fs = File.Create(outputPath)) 
    ML.Data.SaveAsText(data, fs, keepHidden: false, forceDense: true);
```

But I can't specify equivalent settings because there's no way to force dense.

***So I suggest this:*** we change the defaults of this text *saving* utility function to use the text *saver* defaults, instead of the text loader defaults, and also restore the ability to force a dense format.

## Why forcing dense is kind of useful sometimes...

It may not be obvious why forcing to dense is kind of useful. So imagine this input file `foo.txt`.

```
1,2,3,4,5,6
0,0,1,0,0,0
```

Then imagine I have this MML command line:

    dotnet MML.dll savedata loader=text{sep=comma} data=foo.txt dout=foo1.txt

The resulting `foo1.txt` file is this:

```
#@ TextLoader{
#@   header+
#@   sep=tab
#@   col=Label:R4:0
#@   col=Features:R4:1-5
#@ }
Label   5       0:""
1       2       3       4       5       6
6       2:1
```

What is that `6 2:1` line? It is encoding the information that this line has 6 fields, and the field at index 2 is the only one with a non-default value. That is, it has detected, "hey, this can be sparsely encoded," and it has done so. Similarly with this seemingly crazy `Label   5       0:""` line. But sometimes we find that confusing!! I've spent over the years probably, cumulatively, *days* trying to explain the ins and outs of the mixed sparse/dense format. So we have this setting to say, "you know what, I don't care about efficient, give me a dense format."

    dotnet MML.dll savedata loader=text{sep=comma} data=foo.txt dout=foo2.txt saver=text{dense+}

The result is this:

```
#@ TextLoader{
#@   header+
#@   sep=tab
#@   col=Label:R4:0
#@   col=Features:R4:1-5
#@ }
Label   ""      ""      ""      ""      ""
1       2       3       4       5       6
0       0       1       0       0       0
```

Less efficient? Sure. Eaier to understand? I'd say so. And lots of our tests use it, since our tests are usually writing small amounts of data and we found comprehensibility of test files to be valuable.

	public static void SaveAsText(this DataOperationsCatalog catalog,
	IDataView data,
	Stream stream,
	char separatorChar = TextLoader.DefaultArguments.Separator,
	bool headerRow = TextLoader.DefaultArguments.HasHeader,
	bool schema = true,
	bool keepHidden = false)

	[Argument(ArgumentType.AtMostOnce, HelpText = "Separator", ShortName = "sep")]
	public string Separator = "tab";

	[Argument(ArgumentType.AtMostOnce, HelpText = "Force dense format", ShortName = "dense")]
	public bool Dense;

	// REVIEW: This and the corresponding BinarySaver option should be removed,
	// with the silence being handled, somehow, at the environment level. (Task 6158846.)
	[Argument(ArgumentType.LastOccurenceWins, HelpText = "Suppress any info output (not warnings or errors)", Hide = true)]
	public bool Silent;

	[Argument(ArgumentType.AtMostOnce, HelpText = "Output the comment containing the loader settings", ShortName = "schema")]
	public bool OutputSchema = true;

	[Argument(ArgumentType.AtMostOnce, HelpText = "Output the header", ShortName = "header")]
	public bool OutputHeader = true;

	using (var ch = Env.Start("save"))
	{
	var saver = new TextSaver(Env, new TextSaver.Arguments { Silent = true, Dense = true });
	using (var fs = File.Create(outputPath))
	DataSaverUtils.SaveDataView(ch, saver, data, fs, keepHidden: false);
	}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

DataOperationsCatalog SaveAsText extension method is evil #2452

Why forcing dense is kind of useful sometimes...

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

DataOperationsCatalog SaveAsText extension method is evil #2452

Description

Why forcing dense is kind of useful sometimes...

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions