SlotNames for TextLoader are lost #2663
Sparked by #2240
@Ivanidzo4ka - A few days ago, using 0.10, we changed the two multi-class classification samples in the ML.NET samples repo, and both of them use this info from the schema to find the "best three labels" for the classification based on the scores. This should work in 0.11, or it will break code that customers might also be using. @prathyusha12345 - Be aware of this issue. Can you test these apps with 0.11 as well? |
@CESARDELATORRE Sure, I will test this in the 0.11 preview sometime today and report whether it runs fine or not. |
@CESARDELATORRE Can you point me to the changes? I think we are talking about slightly different things. What I'm talking about is this: you have the option of reading each column separately and then concatenating them into a features column, and I think we can look into the SlotNames metadata for Features, which would hold the names of the original columns. Or you can read one big Features column by specifying a range of columns in the text loader. |
@rogancarr I know you are working on samples. Do you have any examples where slot names can be useful? Like feature importance? |
@Ivanidzo4ka - OK, it might be a different thing. For the samples we just need to be able to get the slot names and the best scores. We get the array indexes of the best 3 scores and obtain their related label names based on those array indexes. Can you confirm that still works in 0.11? |
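For readers following along, the "best three labels" lookup described above can be sketched roughly as follows. This is a language-neutral illustration in Python, not the code from the ML.NET samples; `top_k_labels`, the label names, and the scores are all hypothetical.

```python
def top_k_labels(scores, slot_names, k=3):
    """Return the k (label, score) pairs with the highest scores."""
    if len(scores) != len(slot_names):
        raise ValueError("scores and slot names must have the same length")
    # Sort the slot indexes by score, highest first, and keep the first k.
    best = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # Each index into the score vector doubles as an index into the slot names.
    return [(slot_names[i], scores[i]) for i in best]


# Hypothetical label names (as they would come from the slot names) and scores.
slot_names = ["bug", "docs", "feature", "question"]
scores = [0.10, 0.55, 0.05, 0.30]
top3 = top_k_labels(scores, slot_names)
print(top3)
```

The lookup only works if the slot names line up index-for-index with the score vector, which is why the samples depend on that metadata surviving in the schema.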
@CESARDELATORRE |
@Ivanidzo4ka, should this be fixed by changing the signature of CreateTextLoader to take a file name instead of a bool? Or adding an overload that takes a file name? Or another solution? |
@yaeldekel I think there is a reason why we separate the read method from TextLoader, but I'm honestly not sure. Maybe @TomFinley remembers. Can we augment the schema while reading the file? At the same time, I ask myself why we have hasHeader, separator, and anything other than the schema definition in the text loader if schema creation and data reading are two separate steps. Or at least why I can't override them during reading. Sorry for not bringing clarity.
Also, the current API should probably be changed as well:
The |
This suggestion by @yaeldekel is the correct solution, and in fact what we have always done with this loader for years. The text loader takes an optional file path for precisely this reason, so that it can "sniff" the header and determine things like options, sizes of columns. (In fact, the constructor of a text loader will itself fail, because it cannot determine the schema, which includes not only things like feature names, but auto-detected lengths of the default
Just to be clear, we cannot have the schema change; the schema advertised by a reader must be the "same" as the schema returned by actually reading the data. We depend on this to do meaningful schema propagation. Similar reasonings apply for why things like Composability is very important to this API. We form pipelines all the time. One of the important tenets of that composability is that schemas remain consistent. |
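A minimal sketch of the idea, under stated assumptions: a toy loader that sniffs the header from an optional sample when it is constructed, so the schema it advertises is the same before and after any read. `SketchLoader` and its parameters are hypothetical and only mimic the behaviour described in this comment; this is not the ML.NET `TextLoader` API.

```python
import csv
import io


class SketchLoader:
    """Toy loader: the schema is fixed when the loader is built, never on read."""

    def __init__(self, feature_count, sample=None, separator="\t"):
        self.separator = separator
        self.feature_count = feature_count
        self.slot_names = None  # stays None when there is no sample to sniff
        if sample is not None:
            header = next(csv.reader(io.StringIO(sample), delimiter=separator))
            # Column 0 is the label; the following columns form the feature vector.
            self.slot_names = tuple(header[1:1 + feature_count])

    def read(self, text):
        """Parse the rows after the header line; the schema is left untouched."""
        rows = list(csv.reader(io.StringIO(text), delimiter=self.separator))
        return [[float(v) for v in row[1:1 + self.feature_count]] for row in rows[1:]]


data = "Label\tA\tB\tC\n1\t0.5\t0.25\t2.0\n0\t1.5\t0.75\t3.0\n"
loader = SketchLoader(feature_count=3, sample=data)
schema_before = loader.slot_names
rows = loader.read(data)
schema_after = loader.slot_names
```

Because the names are captured at construction, anything downstream can rely on them without waiting for a read to happen.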
Let's work through the practical effects of this. Let's imagine Then let's imagine we train a learner on this. We then save the model, including the Now consider, in actual deployment of models, you load a model without reading from another data file. You're done training. If those feature names only happen when you read data, even if (miraculously!) somehow we structured our code to be flexible to this scenario of mutating schemas, we'd have a bunch of machine learning models with no feature names. Because, the schema that produces the feature names has never been produced, since read does not enter into this! So you now have models with no feature names. So, following this policy would ensure we never have feature names in this circumstance, which is broken. Hopefully this practical worked example makes clear why it is a broken scenario. We simply can't have composability of pipelines without immutability among the key Recall in prior versions of this toolkit, we always considered that the old interface |
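The deployment scenario above can be sketched like this, assuming a toy linear model whose feature names are serialized next to its weights. The helpers `save_model`, `load_model`, and `explain`, and the JSON layout, are hypothetical and stand in for whatever the real model format does.

```python
import json


def save_model(weights, slot_names):
    """Serialize the weights together with the feature names from the schema."""
    return json.dumps({"weights": weights, "slot_names": list(slot_names)})


def load_model(blob):
    """Restore the model; no data file is involved at this point."""
    return json.loads(blob)


def explain(model):
    """Pair each weight with its feature name, largest magnitude first."""
    pairs = zip(model["slot_names"], model["weights"])
    return sorted(pairs, key=lambda p: abs(p[1]), reverse=True)


# Training time: the names are already known from the loader's schema.
blob = save_model([0.2, -1.5, 0.7], ["A", "B", "C"])

# Deployment time: only the saved model is available, yet the names survive.
restored = load_model(blob)
ranking = explain(restored)
print(ranking)
```

If the names were only attached while reading a file, `restored` would have nothing to pair the weights with, which is the broken scenario the comment describes.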
Yes, explainability is the key here:
Those are just for loading off a file. I have a bunch more examples if you generate them via a pipeline. |
Right. For all these @Ivanidzo4ka note that the key issue is that we need to save the feature names with the model. We can't just load them willy-nilly per read from a file, because often there is no file. |
I think this issue is about the problem that currently if we create a text loader with |
@Ivanidzo4ka, I want to make sure I understand the issue you raised here. Was it about losing the slot names when you load a model from a file, or about not having slot names when you create a If it is the former, then I guess we can close this issue as a duplicate of #2735. |
It's probably a documentation issue at this point. |
I feel like the original "bug" was based on what amounted to a misunderstanding and a lack of capability, which @yaeldekel has already fixed via #2735 and #2858, and which I refine just a bit more in #3025. So we should have an issue I suppose to describe how schema propagation actually works, which is as @yaeldekel says above mostly a documentation issue. (It totally does work! You just have to use it, is the only problem. 😉 And we ought to describe how.) Whether that documentation issue is just this issue repurposed or a new issue (with this one closed), I am somewhat indifferent to. I feel like though this issue at this point probably does not still belong in project 13, since the relevant public API issues are addressed, have already been addressed, or something. |
Created a documentation issue for the remaining work, and closing this issue. |
Before the refactoring, if the file had a header and we read it, we filled the slot names metadata with the header values for the columns.
This way we could map between field "A" in the csv file and slot number 5 in the feature vector.
This functionality is lost right now,
mostly because we split schema construction, which is done without a file, from reading data from a file with an already defined schema.
If I have this header:
Label A B C D E F G ....
and this source code:
I expect the Features column to have SlotNames metadata with the values
A B C D E F G, etc.
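As a rough illustration of the expectation stated here, the sketch below derives the slot names for a feature vector built from a range of header columns, plus the reverse mapping from field name to slot number. The function name, the space separator, and the column range are assumptions made for the example, and the header is cut off at G.

```python
def slot_names_for_range(header_line, first, last, separator=" "):
    """Names of the header fields in columns first..last (inclusive, 0-based)."""
    fields = header_line.split(separator)
    return fields[first:last + 1]


header = "Label A B C D E F G"
# Column 0 is the label; columns 1 through 7 become the Features vector.
names = slot_names_for_range(header, first=1, last=7)
# Reverse lookup: which slot of the feature vector holds a given field?
slot_of = {name: slot for slot, name in enumerate(names)}
print(names, slot_of["F"])
```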