Metadata (FYI, to be later named annotations per #2297) is an essential mechanism for attaching optional information about columns. This ranges from public-facing stuff that users should be aware of (slot names, which we use for feature names on feature columns, and key values), to a bunch of stuff that is arguably useful for users but primarily for our internal infrastructure (e.g., whether something has already been normalized), to stuff intended purely for our internal infrastructure.
We ought to decide what we really want to be part of our initial public surface (as small as possible but no smaller), and internalize the rest of it.
So, we will keep Metadata (to be Annotations) as is:
machinelearning/src/Microsoft.Data.DataView/DataViewSchema.cs
Line 172 in a56caee
This by itself is little more than an arbitrary string/object store, which is as intended, so that will not change. What will change, however, is the class we've made to make access a little more structured.
machinelearning/src/Microsoft.ML.Core/Data/MetadataUtils.cs
Line 17 in a56caee
This has stuff that is "good" in that we want to keep it as part of the public surface, but also stuff that is internal and should not be part of the public surface.
The good
A small amount of this stuff we probably want to keep.
However we should probably move it somewhere else... perhaps, the static class SchemaColumnAnnotationsExtensions as a series of extension methods on top of DataViewSchema.Column to access the associated metadata. This might include things like these methods.
machinelearning/src/Microsoft.ML.Core/Data/MetadataUtils.cs
Line 297 in a56caee
machinelearning/src/Microsoft.ML.Core/Data/MetadataUtils.cs
Line 321 in a56caee
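To make the proposed shape concrete, here is a minimal sketch of typed accessors exposed as extension methods over an untyped string/object store, with the kind strings kept internal. Everything in it is hypothetical: SketchColumn is a stand-in and not the real DataViewSchema.Column, and the class names, method names, and the "SlotNames" kind string are assumptions for illustration, not the actual ML.NET API.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a schema column: a name plus an arbitrary
// string/object annotation store. This is NOT the real ML.NET type.
public sealed class SketchColumn
{
    private readonly Dictionary<string, object> _annotations;

    public string Name { get; }

    public SketchColumn(string name, IDictionary<string, object> annotations = null)
    {
        Name = name;
        _annotations = annotations == null
            ? new Dictionary<string, object>()
            : new Dictionary<string, object>(annotations);
    }

    // The untyped store stays available as the general-purpose access path.
    public bool TryGetAnnotation<T>(string kind, out T value)
    {
        if (_annotations.TryGetValue(kind, out object raw) && raw is T typed)
        {
            value = typed;
            return true;
        }
        value = default(T);
        return false;
    }
}

// Kind strings are kept internal: useful for infrastructure consistency,
// but not something users need to see. The value is an assumption.
internal static class SketchKinds
{
    internal const string SlotNames = "SlotNames";
}

// The small public surface: typed accessors as extension methods on the column.
public static class SketchColumnAnnotationsExtensions
{
    // True when the column carries slot names of the expected type.
    public static bool HasSlotNames(this SketchColumn column)
    {
        return column.TryGetAnnotation<string[]>(SketchKinds.SlotNames, out _);
    }

    // Returns the slot names, or an empty array when none are attached.
    public static string[] GetSlotNames(this SketchColumn column)
    {
        return column.TryGetAnnotation<string[]>(SketchKinds.SlotNames, out var names)
            ? names
            : Array.Empty<string>();
    }
}
```

With something like this, a user would write column.HasSlotNames() or column.GetSlotNames() and never touch the kind strings directly, while the raw store remains available for anything not covered by an accessor.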
The bad
Much of this class, though, should be internal.
So for example, we have this static class of Kinds of metadata. Absolutely quite a nice thing to have for our own infrastructure for consistency, but this is not what we want to show users. Similarly with these sorts of labels for types of scorings, which is a scenario irrelevant to the ML.NET API as defined (since people evaluate scores by saying, "here, evaluate these scores" explicitly by calling some code). Also stuff on the ranges of categorical variables which, while essential, is mostly for the benefit of trainers downstream consuming data. (Users who want the raw categoricals can, by just programming, consume the source data themselves, since they control the pipeline.)
There's also a lot of stuff built around implementing metadata, which is of questionable worth at this time given the changes that have happened to schema in the past year, and which is of no use whatever.
/cc @Ivanidzo4ka , @eerhardt , @rogancarr , @sfilipi