I was attempting to migrate some of our tests, when I discovered we have a few problems in our new API on text saving. I know @artidoro had some thoughts on text saving/loading so mentioning him. Also I know @sfilipi and @rogancarr are handling many issues w.r.t. API completeness and consistency as they work on samples and docs, so maybe they have some thoughts on this. (Of course everyone is free to chime in.)
From what I can see, whoever authored this method confused the defaults of the text loader with those of the text saver. (Note how the defaults used in the text saving utility method are coming from TextLoader, which is incorrect.) It might seem intuitive, if you don't think about it too hard, that text saving and loading should have the same defaults, but in practice it becomes clear they should not. "Loading" into ML.NET and "saving" out of ML.NET are in fact very different situations. When someone is using a text loader with non-default settings they're usually asking us to ingest their format (so we chose the most helpful defaults for that more common scenario), whereas our text saver makes some attempt at writing out the schema. (Note also that under default settings, the text loader loads our own format without trouble, since it detects that a schema and settings are embedded in the file itself.)
The offending method is in machinelearning/src/Microsoft.ML.Data/DataLoadSave/Text/TextLoaderSaverCatalog.cs, lines 147 to 153 in e192a18. As a point of reference, the constant it uses is in machinelearning/src/Microsoft.ML.Data/DataLoadSave/Text/TextLoader.cs, line 401 in e192a18. Now, compare this with the actual defaults on our text saver (machinelearning/src/Microsoft.ML.Data/DataLoadSave/Text/TextSaver.cs, lines 28 to 43 in e192a18):
// REVIEW: This and the corresponding BinarySaver option should be removed,
// with the silence being handled, somehow, at the environment level. (Task 6158846.)
[Argument(ArgumentType.LastOccurenceWins, HelpText = "Suppress any info output (not warnings or errors)", Hide = true)]
public bool Silent;
[Argument(ArgumentType.AtMostOnce, HelpText = "Output the comment containing the loader settings", ShortName = "schema")]
public bool OutputSchema = true;
[Argument(ArgumentType.AtMostOnce, HelpText = "Output the header", ShortName = "header")]
public bool OutputHeader = true;
The default for whether the header row is saved has gone from true to false. The primary practical effect is that we're now dropping feature names (or more precisely, slot names) by default. This seems silly. Feature names are important. I think we ought to keep them by default.
We've lost the ability to force saving as dense format at all through this new API. This is often important for comprehensibility.
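I ran into it while I was trying to clean up some of our tests to use more of the actual public surface. Consider the test in machinelearning/test/Microsoft.ML.Tests/Transformers/ConcatTests.cs, lines 133 to 138 in e192a18. That's kind of obnoxious, and not using our public API. I'd love to migrate it over to the text saving utility method, but I can't specify equivalent settings because there's no way to force dense.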
So I suggest this: we change the defaults of this text saving utility function to use the text saver defaults, instead of the text loader defaults, and also restore the ability to force a dense format.
Why forcing dense is kind of useful sometimes...
It may not be obvious why forcing to dense is kind of useful. So imagine an input file foo.txt, and imagine I run an MML command line over it that saves the data back out as text to foo1.txt.
In the resulting foo1.txt, what is that 6 2:1 line? It is encoding the information that this line has 6 fields, and the field at index 2 is the only one with a non-default value. That is, the saver has detected, "hey, this can be sparsely encoded," and it has done so. Similarly with the seemingly crazy Label 5 0:"" line. But sometimes we find that confusing!! Over the years I've probably spent, cumulatively, days trying to explain the ins and outs of the mixed sparse/dense format. So we have this setting to say, "you know what, I don't care about efficiency, give me a dense format."
Less efficient? Sure. Easier to understand? I'd say so. And lots of our tests use it, since our tests are usually writing small amounts of data and we found comprehensibility of test files to be valuable.
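To make the sparse/dense distinction concrete, here is a minimal Python sketch of the idea behind a line like 6 2:1. It is only an illustration and not ML.NET's actual saver or loader logic. The function names, the tab separator, the "0" default value, and the "pick whichever text is shorter" rule are all assumptions made for this example.

```python
def encode_row(values, force_dense=False, default="0", sep="\t"):
    """Encode one row of string fields as a line of text.

    Hypothetical rule for this sketch: use the sparse form only when it is
    strictly shorter than the dense form and force_dense is not set.
    """
    dense = sep.join(values)
    if force_dense:
        return dense
    # Sparse form: total field count, then index:value for each non-default field.
    non_default = [(i, v) for i, v in enumerate(values) if v != default]
    sparse = sep.join([str(len(values))] + [f"{i}:{v}" for i, v in non_default])
    return sparse if len(sparse) < len(dense) else dense


def expand_row(line, default="0", sep="\t"):
    """Expand a line produced by encode_row back into a full list of fields."""
    fields = line.split(sep)
    # Toy detection: a leading count followed only by index:value pairs.
    looks_sparse = (
        len(fields) > 1
        and fields[0].isdigit()
        and all(":" in f for f in fields[1:])
    )
    if not looks_sparse:
        return fields
    values = [default] * int(fields[0])
    for f in fields[1:]:
        idx, val = f.split(":", 1)
        values[int(idx)] = val
    return values


row = ["0", "0", "1", "0", "0", "0"]
print(repr(encode_row(row)))                    # sparse: count, then 2:1
print(repr(encode_row(row, force_dense=True)))  # dense: all six fields
```

With force_dense set, every row comes out with all of its fields written explicitly, which is the property that makes small test baselines easy to read.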