SlotNames for TextLoader are lost #2663

Ivanidzo4ka · 2019-02-20T18:03:41Z

Before refactoring if we had header in file and we read it we filled slot names metadata with values in that header for columns.
This way we can have mapping between field "A" in csv file and slot number 5 in feature vector.

This functionality is lost right now.
Mostly because we split functionality of schema construction which done without file and reading data from file with already defined schema.
If I have this header:
Label A B C D E F G ....
and this source code:

            var reader = mlContext.Data.CreateTextLoader(new Microsoft.ML.Data.TextLoader.Column[] { new Microsoft.ML.Data.TextLoader.Column("Label", Microsoft.ML.Data.DataKind.R4, 0), new Microsoft.ML.Data.TextLoader.Column("Features", Microsoft.ML.Data.DataKind.R4, 1, 100) },
                hasHeader: true, separatorChar: ' ');
            var data = reader.Read("data.txt");

I expect Features column to have SlotNames metadata with values A B C D E F G, etc.

The text was updated successfully, but these errors were encountered:

Ivanidzo4ka · 2019-02-20T18:04:19Z

Sparked by #2240

Additionally, the model is not explorable, because the class I have to represent a single prediction has each column mapped to a different field, but those fields cannot be identified in the model.

CESARDELATORRE · 2019-02-20T18:10:17Z

@Ivanidzo4ka - Days ago, using 0.10, we changed the two multi-class classification samples in the ML.NET samples repo and both of them are using this info coming from the schema so we know the "best three labels" for the classification based on the scores.

This should work on 0.11 or it'll break the code that customers might also be using.

@prathyusha12345 - Be aware of this issue. Can yo test these apps with 0.11, as well?

prathyusha12345 · 2019-02-20T18:17:56Z

@CESARDELATORRE Sure I will test this in 0.11 preview today at sometime and update whether its executing fine or not.

Ivanidzo4ka · 2019-02-20T18:18:26Z

@CESARDELATORRE Can you point me on changes? I think we speak about slightly different things.

What I'm talking is you have option of reading each column separately and then concat them into feature columns and I think we can look into SlotNames metadata for Features which would have information of previous column names.

Or you can read one big Feature column by specifying range of columns in text loader.
And in this case you have no way to trace back original names.

Ivanidzo4ka · 2019-02-20T18:23:53Z

@rogancarr I know you working on samples do you have any examples where slot names can be useful? Like feature importance?

CESARDELATORRE · 2019-02-20T18:30:17Z

@Ivanidzo4ka - OK, it might be a different thing. For the samples we just need to be able to get the slot names and the best scores.

We get the array indexes of the best 3 scores and obtain their related label names based on those array indexes.

Can you confirm that still works in 0.11?

Ivanidzo4ka · 2019-02-20T18:34:29Z

@CESARDELATORRE
https://github.com/dotnet/machinelearning/pull/2651/files#diff-385bfdbfa8bede51aa194c8f50f2b54c is our current work with metadata.
We will still have: GetSlotNames, GetKeyValues, IsNormalized,HasSlotNames, HasKeyValues feel free to provide your feedback in that PR.

yaeldekel · 2019-02-20T19:08:12Z

@Ivanidzo4ka, should this be fixed by changing the signature of CreateTextLoader to take a file name instead of a bool? Or adding an overload that takes a file name? Or another solution?

Ivanidzo4ka · 2019-02-20T19:17:00Z

@yaeldekel I think were is a reason why we separate read method from TextLoader. But I'm honestly not sure. Maybe @TomFinley remembers.

Can we augment schema during reading of the file? In same time I question myself why we have that hasHeader, separator and anything else other than schema definition in text loader if we have two separate steps as Schema Creation and data reading separated. Or at least why I can't override them during reading.

Sorry for not bringing clarity,

yaeldekel · 2019-02-21T18:28:13Z

TextLoader is an IDataReader which has a GetOutputSchema method, so as Tom mentioned, the schema cannot change after Read is called.
The other solution that I can think of is adding an overload of CreateTextLoader that takes an IMultiStreamSource to read the header from.

Also, the current API should probably be changed as well:

CreateTextLoader(this DataOperationsCatalog catalog,
            TextLoader.Column[] columns,
            bool hasHeader = TextLoader.Defaults.HasHeader,
            char separatorChar = TextLoader.Defaults.Separator,
            IMultiStreamSource dataSample = null)

The hasHeader argument is used to get the header if dataSample is not null, but if it is null, then it is only used to determine whether to skip the first row or not, which is also important information, but could be confusing to a user.

TomFinley · 2019-02-21T18:34:09Z

@Ivanidzo4ka, should this be fixed by changing the signature of CreateTextLoader to take a file name instead of a bool?

This suggestion by @yaeldekel is the correct solution, and in fact what we have always done with this loader for years.

The text loader takes an optional file path for precisely this reason, so that it can "sniff" the header and determine things like options, sizes of columns. (In fact, the constructor of a text loader will itself fail, because it cannot determine the schema, which includes not only things like feature names, but auto-detected lengths of the default Features column, to list just the most obvious example.) This is in fact why we have this argument.

Can we augment schema during reading of the file?

Just to be clear, we cannot have schema change; the schema advertised by a reader must be the "same" as the schema returned by actually reading the data. We depend on this to do meaningful schema propagation. Similar reasonings apply for why things like IDataView and estimators and transformers are, once created, immutable w.r.t. their schema -- it would be impossible to do any meaningful schema propagation if they were to suddenly start changing. Changing metadata would break it "less," but would still break it.

Composability is very important to this API. We form pipelines all the time. One of the important tenants to that composability is that schemas remain consistent.

TomFinley · 2019-02-21T18:54:07Z

Let's work through the practical effects of this. Let's imagine IDataReader no longer contains the correct schema, that is, it is no longer responsible for holding the feature names. That is inferred from the data, and the schema can change once we have data (including as this issue suggests, having it returned once Read is called).

Then let's imagine we train a learner on this. We then save the model, including the IDataReader, the chain of whatever ITransformer`, as we have done elsewhere.

Now consider, in actual deployment of models, you load a model without reading from another data file. You're done training. If those feature names only happen when you read data, even if (miraculously!) somehow we structured our code to be flexible to this scenario of mutating schemas, we'd have a bunch of machine learning models with no feature names. Because, the schema that produces the feature names has never been produced, since read does not enter into this! So you now have models with no feature names.

So, following this policy would ensure we never have feature names in this circumstance, which is broken.

Hopefully this practical worked example makes clear why this is why it is a broken scenario. We simply can't have composability of pipelines without immutability among the key IDataView components, including especially DataViewSchema. That is the most important thing we have to get right. And, I'd say getting that head of the schema "right" is the most important thing exactly to serve this scenario.

Recall in prior versions of this toolkit, we always considered that the old interface IDataLoader, when loaded over no data, should produce the same schema as before, including feature names. We must have the same policy today. If we don't, we've broken something.

rogancarr · 2019-02-21T18:58:25Z

@rogancarr I know you working on samples do you have any examples where slot names can be useful? Like feature importance?

Yes, explainability is the key here:

PFI when you want to load as a feature vector off a file, then present the results with the name of a column.
GAMs when you want to inspect shape functions and associate them with a name
Really any sort of feature importance because it helps to know which feature you're talking about — linear model weights, feature gains for trees, WTF/FCC for local interpretability, etc.

Those are just for loading off a file. I have a bunch more examples if you generate them via a pipeline.

TomFinley · 2019-02-21T20:51:11Z

GAMs when you want to inspect shape functions and associate them with a name

Right. For all these @Ivanidzo4ka note that the key issue is that we need to save the feature names with the model. We can't just load them willy-nilly per read from a file, because often there is no file.

yaeldekel · 2019-02-26T16:46:28Z

I think this issue is about the problem that currently if we create a text loader with hasHeader=true but without a dataSample we don't get slot names. I have created another issue #2735 for the problem of not being able to load the schema with the slot names from a serialized model.

yaeldekel · 2019-03-14T21:31:48Z

@Ivanidzo4ka , I want to make sure I understand the issue you raised here. Was it about losing the slot names when you load a model from a file, or about not having slot names when you create a TextLoader using the API with hasHeader=true?

If it the former, then I guess we can close this issue as a duplicate of #2735 .
If it is the latter, then perhaps it is just a documentation issue? Either that, or we need to split the hasHeader argument into two arguments, one to indicate to TextLoader that the first line should be skipped, and the other can either be a bool (then we would throw if dataSample is not there), or an IMultiStreamSource that we use to read the slot names. What is your preference?
cc @TomFinley .

Ivanidzo4ka · 2019-03-15T17:23:31Z

It's probably documentation issue at this point.
I'm still think it's slightly confusing to name hasHeader parameter considering we don't do anything with header. Maybe we should rename it to skipFirstLine in case if data file is not provided.
Or leave it as is, and provide better documentation.

TomFinley · 2019-03-21T15:31:24Z

I feel like the original "bug" was based on what amounted to a misunderstanding and a lack of capability, which @yaeldekel has already fixed via #2735 and #2858, and which I refine just a bit more in #3025. So we should have an issue I suppose to describe how schema propagation actually works, which is as @yaeldekel says above mostly a documentation issue. (It totally does work! You just have to use it, is the only problem. 😉 And we ought to describe how.) Whether that documentation issue is just this issue repurposed or a new issue (with this one closed), I am somewhat indifferent to.

I feel like though this issue at this point probably does not still belong in project 13, since the relevant public API issues are addressed, have already been addressed, or something.

yaeldekel · 2019-03-21T20:30:48Z

Created a documentation issue for the remaining work, and closing this issue.

Ivanidzo4ka added the bug Something isn't working label Feb 20, 2019

yaeldekel self-assigned this Feb 20, 2019

shauheen added the API Issues pertaining the friendly API label Feb 22, 2019

TomFinley mentioned this issue Feb 25, 2019

Make text loaders consistent #2710

Merged

yaeldekel mentioned this issue Feb 26, 2019

Add API to save/load models with their input schema #2735

Closed

TomFinley mentioned this issue Mar 19, 2019

About models, ITransform, and IDataLoader, saving/loading #3025

Closed

yaeldekel mentioned this issue Mar 21, 2019

Defining a TextLoader with hasHeader=true and no data sample results in no slot names #3053

Closed

yaeldekel closed this as completed Mar 21, 2019

TomFinley mentioned this issue Mar 21, 2019

Remove model saving/loading inconsistencies #3044

Merged

ghost locked as resolved and limited conversation to collaborators Mar 24, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SlotNames for TextLoader are lost #2663

SlotNames for TextLoader are lost #2663

Ivanidzo4ka commented Feb 20, 2019 •

edited

Loading

Ivanidzo4ka commented Feb 20, 2019

CESARDELATORRE commented Feb 20, 2019

prathyusha12345 commented Feb 20, 2019

Ivanidzo4ka commented Feb 20, 2019

Ivanidzo4ka commented Feb 20, 2019

CESARDELATORRE commented Feb 20, 2019

Ivanidzo4ka commented Feb 20, 2019

yaeldekel commented Feb 20, 2019

Ivanidzo4ka commented Feb 20, 2019 •

edited

Loading

yaeldekel commented Feb 21, 2019

TomFinley commented Feb 21, 2019

TomFinley commented Feb 21, 2019

rogancarr commented Feb 21, 2019 •

edited

Loading

TomFinley commented Feb 21, 2019

yaeldekel commented Feb 26, 2019

yaeldekel commented Mar 14, 2019

Ivanidzo4ka commented Mar 15, 2019

TomFinley commented Mar 21, 2019

yaeldekel commented Mar 21, 2019

SlotNames for TextLoader are lost #2663

SlotNames for TextLoader are lost #2663

Comments

Ivanidzo4ka commented Feb 20, 2019 • edited Loading

Ivanidzo4ka commented Feb 20, 2019

CESARDELATORRE commented Feb 20, 2019

prathyusha12345 commented Feb 20, 2019

Ivanidzo4ka commented Feb 20, 2019

Ivanidzo4ka commented Feb 20, 2019

CESARDELATORRE commented Feb 20, 2019

Ivanidzo4ka commented Feb 20, 2019

yaeldekel commented Feb 20, 2019

Ivanidzo4ka commented Feb 20, 2019 • edited Loading

yaeldekel commented Feb 21, 2019

TomFinley commented Feb 21, 2019

TomFinley commented Feb 21, 2019

rogancarr commented Feb 21, 2019 • edited Loading

TomFinley commented Feb 21, 2019

yaeldekel commented Feb 26, 2019

yaeldekel commented Mar 14, 2019

Ivanidzo4ka commented Mar 15, 2019

TomFinley commented Mar 21, 2019

yaeldekel commented Mar 21, 2019

Ivanidzo4ka commented Feb 20, 2019 •

edited

Loading

Ivanidzo4ka commented Feb 20, 2019 •

edited

Loading

rogancarr commented Feb 21, 2019 •

edited

Loading