Description
I was attempting to migrate some of our tests, when I discovered we have a few problems in our new API on text saving. I know @artidoro had some thoughts on text saving/loading so mentioning him. Also I know @sfilipi and @rogancarr are handling many issues w.r.t. API completeness and consistency as they work on samples and docs, so maybe they have some thoughts on this. (Of course everyone is free to chime in.)
From what I can see, whoever authored this method confused the defaults of the text loader with those of the text saver. (Note how the defaults used in the text saving utility method are coming from TextLoader, which is incorrect.) It might seem intuitive, if you don't think about it too hard, that text saving and loading should have the same defaults, but in practice it becomes clear they should not. "Loading" into ML.NET and "saving" out of ML.NET are in fact very different situations. When someone is using a text loader with non-default settings they're usually asking us to ingest their format (so we chose the most helpful defaults for that more common scenario), whereas our text saver writes our own format and makes some attempt at embedding the schema. (Note also that under default settings, the text loader loads our own format without trouble, since it detects that a schema and settings are embedded in the file itself.)
This is the offending method:
As a point of reference, this is that constant:
Now, compare this with the actual defaults on our text saver:
machinelearning/src/Microsoft.ML.Data/DataLoadSave/Text/TextSaver.cs
Lines 28 to 43 in e192a18
- The default for whether header row is saved has gone from true to false. The primary practical effect of this is, we're now dropping feature names (or more precisely, slot names) by default. This seems silly. Feature names are important. I think we ought to keep them by default.
- We've lost the ability to force saving as dense format at all through this new API. This is often important for comprehensibility.
I ran into it while I was trying to clean up some of our tests to use more of the actual public surface. Consider this test.
machinelearning/test/Microsoft.ML.Tests/Transformers/ConcatTests.cs
Lines 133 to 138 in e192a18
That's kind of obnoxious, and not using our public API. I'd love to migrate it over to something like this:
using (var fs = File.Create(outputPath))
ML.Data.SaveAsText(data, fs, keepHidden: false, forceDense: true);
But I can't specify equivalent settings because there's no way to force dense.
So I suggest this: we change the defaults of this text saving utility function to use the text saver defaults, instead of the text loader defaults, and also restore the ability to force a dense format.
Why forcing dense is kind of useful sometimes...
It may not be obvious why forcing to dense is kind of useful. So imagine this input file foo.txt.
1,2,3,4,5,6
0,0,1,0,0,0
Then imagine I have this MML command line:
dotnet MML.dll savedata loader=text{sep=comma} data=foo.txt dout=foo1.txt
The resulting foo1.txt file is this:
#@ TextLoader{
#@ header+
#@ sep=tab
#@ col=Label:R4:0
#@ col=Features:R4:1-5
#@ }
Label 5 0:""
1 2 3 4 5 6
6 2:1
What is that 6 2:1 line? It is encoding the information that this line has 6 fields, and the field at index 2 is the only one with a non-default value. That is, it has detected, "hey, this can be sparsely encoded," and it has done so. Similarly with this seemingly crazy Label 5 0:"" line. But sometimes we find that confusing!! Over the years I've probably spent, cumulatively, days trying to explain the ins and outs of the mixed sparse/dense format. So we have this setting to say, "you know what, I don't care about efficiency, give me a dense format."
dotnet MML.dll savedata loader=text{sep=comma} data=foo.txt dout=foo2.txt saver=text{dense+}
The result is this:
#@ TextLoader{
#@ header+
#@ sep=tab
#@ col=Label:R4:0
#@ col=Features:R4:1-5
#@ }
Label "" "" "" "" ""
1 2 3 4 5 6
0 0 1 0 0 0
Less efficient? Sure. Easier to understand? I'd say so. And lots of our tests use it, since our tests are usually writing small amounts of data and we found comprehensibility of test files to be valuable.
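To make the sparse versus dense distinction above concrete, here is a minimal, self-contained sketch of how a single row of numbers could be written either way. This is not the TextSaver implementation: the class RowEncodingSketch, its method names, and the "use sparse only when it produces fewer fields" rule are all assumptions made for illustration. It also treats the whole row as one unit, whereas the real format can mix dense and sparse pieces on one line (as the Label 5 0:"" header shows).

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

// Illustrative only: a hypothetical helper, not part of ML.NET.
public static class RowEncodingSketch
{
    // Dense: write every field, tab-separated.
    public static string EncodeDense(IReadOnlyList<float> fields)
    {
        return string.Join("\t",
            fields.Select(v => v.ToString(CultureInfo.InvariantCulture)));
    }

    // Sparse: write the total field count, then index:value
    // for each field that differs from the default (zero).
    public static string EncodeSparse(IReadOnlyList<float> fields)
    {
        var parts = new List<string>
        {
            fields.Count.ToString(CultureInfo.InvariantCulture)
        };
        for (int i = 0; i < fields.Count; i++)
        {
            if (fields[i] != 0)
            {
                parts.Add(i.ToString(CultureInfo.InvariantCulture) + ":" +
                          fields[i].ToString(CultureInfo.InvariantCulture));
            }
        }
        return string.Join("\t", parts);
    }

    // Assumed selection rule: sparse only if it yields fewer fields
    // than dense, unless the caller forces dense output.
    public static string Encode(IReadOnlyList<float> fields, bool forceDense)
    {
        if (forceDense)
            return EncodeDense(fields);

        int nonDefault = fields.Count(v => v != 0);
        return nonDefault + 1 < fields.Count
            ? EncodeSparse(fields)
            : EncodeDense(fields);
    }

    public static void Main()
    {
        var row = new float[] { 0, 0, 1, 0, 0, 0 };
        Console.WriteLine(Encode(row, forceDense: false)); // 6<TAB>2:1
        Console.WriteLine(Encode(row, forceDense: true));  // 0<TAB>0<TAB>1<TAB>0<TAB>0<TAB>0
    }
}
```

With the two rows from foo.txt, the first row (1 through 6) has no default values, so this sketch keeps it dense either way; only the second row changes shape when dense is forced, which mirrors the difference between foo1.txt and foo2.txt.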