Skip to content

SlotNames for TextLoader are lost #2663

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
Ivanidzo4ka opened this issue Feb 20, 2019 · 19 comments
Closed

SlotNames for TextLoader are lost #2663

Ivanidzo4ka opened this issue Feb 20, 2019 · 19 comments
Assignees
Labels
API Issues pertaining the friendly API bug Something isn't working

Comments

@Ivanidzo4ka
Copy link
Contributor

Ivanidzo4ka commented Feb 20, 2019

Before refactoring if we had header in file and we read it we filled slot names metadata with values in that header for columns.
This way we can have mapping between field "A" in csv file and slot number 5 in feature vector.

This functionality is lost right now.
Mostly because we split functionality of schema construction which done without file and reading data from file with already defined schema.
If I have this header:
Label A B C D E F G ....
and this source code:

            var reader = mlContext.Data.CreateTextLoader(new Microsoft.ML.Data.TextLoader.Column[] { new Microsoft.ML.Data.TextLoader.Column("Label", Microsoft.ML.Data.DataKind.R4, 0), new Microsoft.ML.Data.TextLoader.Column("Features", Microsoft.ML.Data.DataKind.R4, 1, 100) },
                hasHeader: true, separatorChar: ' ');
            var data = reader.Read("data.txt");

I expect Features column to have SlotNames metadata with values A B C D E F G, etc.

@Ivanidzo4ka Ivanidzo4ka added the bug Something isn't working label Feb 20, 2019
@Ivanidzo4ka
Copy link
Contributor Author

Sparked by #2240

Additionally, the model is not explorable, because the class I have to represent a single prediction has each column mapped to a different field, but those fields cannot be identified in the model.

@CESARDELATORRE
Copy link
Contributor

@Ivanidzo4ka - Days ago, using 0.10, we changed the two multi-class classification samples in the ML.NET samples repo and both of them are using this info coming from the schema so we know the "best three labels" for the classification based on the scores.

This should work on 0.11 or it'll break the code that customers might also be using.

@prathyusha12345 - Be aware of this issue. Can yo test these apps with 0.11, as well?

@prathyusha12345
Copy link

@CESARDELATORRE Sure I will test this in 0.11 preview today at sometime and update whether its executing fine or not.

@Ivanidzo4ka
Copy link
Contributor Author

@CESARDELATORRE Can you point me on changes? I think we speak about slightly different things.

What I'm talking is you have option of reading each column separately and then concat them into feature columns and I think we can look into SlotNames metadata for Features which would have information of previous column names.

Or you can read one big Feature column by specifying range of columns in text loader.
And in this case you have no way to trace back original names.

@Ivanidzo4ka
Copy link
Contributor Author

@rogancarr I know you working on samples do you have any examples where slot names can be useful? Like feature importance?

@CESARDELATORRE
Copy link
Contributor

@Ivanidzo4ka - OK, it might be a different thing. For the samples we just need to be able to get the slot names and the best scores.

We get the array indexes of the best 3 scores and obtain their related label names based on those array indexes.

Can you confirm that still works in 0.11?

@Ivanidzo4ka
Copy link
Contributor Author

@CESARDELATORRE
https://github.com/dotnet/machinelearning/pull/2651/files#diff-385bfdbfa8bede51aa194c8f50f2b54c is our current work with metadata.
We will still have: GetSlotNames, GetKeyValues, IsNormalized,HasSlotNames, HasKeyValues feel free to provide your feedback in that PR.

@yaeldekel
Copy link

@Ivanidzo4ka, should this be fixed by changing the signature of CreateTextLoader to take a file name instead of a bool? Or adding an overload that takes a file name? Or another solution?

@yaeldekel yaeldekel self-assigned this Feb 20, 2019
@Ivanidzo4ka
Copy link
Contributor Author

Ivanidzo4ka commented Feb 20, 2019

@yaeldekel I think were is a reason why we separate read method from TextLoader. But I'm honestly not sure. Maybe @TomFinley remembers.

Can we augment schema during reading of the file? In same time I question myself why we have that hasHeader, separator and anything else other than schema definition in text loader if we have two separate steps as Schema Creation and data reading separated. Or at least why I can't override them during reading.

Sorry for not bringing clarity,

@yaeldekel
Copy link

TextLoader is an IDataReader which has a GetOutputSchema method, so as Tom mentioned, the schema cannot change after Read is called.
The other solution that I can think of is adding an overload of CreateTextLoader that takes an IMultiStreamSource to read the header from.

Also, the current API should probably be changed as well:

CreateTextLoader(this DataOperationsCatalog catalog,
            TextLoader.Column[] columns,
            bool hasHeader = TextLoader.Defaults.HasHeader,
            char separatorChar = TextLoader.Defaults.Separator,
            IMultiStreamSource dataSample = null)

The hasHeader argument is used to get the header if dataSample is not null, but if it is null, then it is only used to determine whether to skip the first row or not, which is also important information, but could be confusing to a user.

@TomFinley
Copy link
Contributor

@Ivanidzo4ka, should this be fixed by changing the signature of CreateTextLoader to take a file name instead of a bool?

This suggestion by @yaeldekel is the correct solution, and in fact what we have always done with this loader for years.

The text loader takes an optional file path for precisely this reason, so that it can "sniff" the header and determine things like options, sizes of columns. (In fact, the constructor of a text loader will itself fail, because it cannot determine the schema, which includes not only things like feature names, but auto-detected lengths of the default Features column, to list just the most obvious example.) This is in fact why we have this argument.

Can we augment schema during reading of the file?

Just to be clear, we cannot have schema change; the schema advertised by a reader must be the "same" as the schema returned by actually reading the data. We depend on this to do meaningful schema propagation. Similar reasonings apply for why things like IDataView and estimators and transformers are, once created, immutable w.r.t. their schema -- it would be impossible to do any meaningful schema propagation if they were to suddenly start changing. Changing metadata would break it "less," but would still break it.

Composability is very important to this API. We form pipelines all the time. One of the important tenants to that composability is that schemas remain consistent.

@TomFinley
Copy link
Contributor

Let's work through the practical effects of this. Let's imagine IDataReader no longer contains the correct schema, that is, it is no longer responsible for holding the feature names. That is inferred from the data, and the schema can change once we have data (including as this issue suggests, having it returned once Read is called).

Then let's imagine we train a learner on this. We then save the model, including the IDataReader, the chain of whatever ITransformer`, as we have done elsewhere.

Now consider, in actual deployment of models, you load a model without reading from another data file. You're done training. If those feature names only happen when you read data, even if (miraculously!) somehow we structured our code to be flexible to this scenario of mutating schemas, we'd have a bunch of machine learning models with no feature names. Because, the schema that produces the feature names has never been produced, since read does not enter into this! So you now have models with no feature names.

So, following this policy would ensure we never have feature names in this circumstance, which is broken.

Hopefully this practical worked example makes clear why this is why it is a broken scenario. We simply can't have composability of pipelines without immutability among the key IDataView components, including especially DataViewSchema. That is the most important thing we have to get right. And, I'd say getting that head of the schema "right" is the most important thing exactly to serve this scenario.

Recall in prior versions of this toolkit, we always considered that the old interface IDataLoader, when loaded over no data, should produce the same schema as before, including feature names. We must have the same policy today. If we don't, we've broken something.

@rogancarr
Copy link
Contributor

rogancarr commented Feb 21, 2019

@rogancarr I know you working on samples do you have any examples where slot names can be useful? Like feature importance?

Yes, explainability is the key here:

  • PFI when you want to load as a feature vector off a file, then present the results with the name of a column.
  • GAMs when you want to inspect shape functions and associate them with a name
  • Really any sort of feature importance because it helps to know which feature you're talking about — linear model weights, feature gains for trees, WTF/FCC for local interpretability, etc.

Those are just for loading off a file. I have a bunch more examples if you generate them via a pipeline.

@TomFinley
Copy link
Contributor

  • GAMs when you want to inspect shape functions and associate them with a name

Right. For all these @Ivanidzo4ka note that the key issue is that we need to save the feature names with the model. We can't just load them willy-nilly per read from a file, because often there is no file.

@yaeldekel
Copy link

I think this issue is about the problem that currently if we create a text loader with hasHeader=true but without a dataSample we don't get slot names. I have created another issue #2735 for the problem of not being able to load the schema with the slot names from a serialized model.

@yaeldekel
Copy link

@Ivanidzo4ka , I want to make sure I understand the issue you raised here. Was it about losing the slot names when you load a model from a file, or about not having slot names when you create a TextLoader using the API with hasHeader=true?

If it the former, then I guess we can close this issue as a duplicate of #2735 .
If it is the latter, then perhaps it is just a documentation issue? Either that, or we need to split the hasHeader argument into two arguments, one to indicate to TextLoader that the first line should be skipped, and the other can either be a bool (then we would throw if dataSample is not there), or an IMultiStreamSource that we use to read the slot names. What is your preference?
cc @TomFinley .

@Ivanidzo4ka
Copy link
Contributor Author

It's probably documentation issue at this point.
I'm still think it's slightly confusing to name hasHeader parameter considering we don't do anything with header. Maybe we should rename it to skipFirstLine in case if data file is not provided.
Or leave it as is, and provide better documentation.

@TomFinley
Copy link
Contributor

I feel like the original "bug" was based on what amounted to a misunderstanding and a lack of capability, which @yaeldekel has already fixed via #2735 and #2858, and which I refine just a bit more in #3025. So we should have an issue I suppose to describe how schema propagation actually works, which is as @yaeldekel says above mostly a documentation issue. (It totally does work! You just have to use it, is the only problem. 😉 And we ought to describe how.) Whether that documentation issue is just this issue repurposed or a new issue (with this one closed), I am somewhat indifferent to.

I feel like though this issue at this point probably does not still belong in project 13, since the relevant public API issues are addressed, have already been addressed, or something.

@yaeldekel
Copy link

Created a documentation issue for the remaining work, and closing this issue.

@ghost ghost locked as resolved and limited conversation to collaborators Mar 24, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
API Issues pertaining the friendly API bug Something isn't working
Projects
None yet
Development

No branches or pull requests

7 participants