Metadata/annotations public surface of the API #2622

Closed
TomFinley opened this issue Feb 19, 2019 · 0 comments
Labels
API Issues pertaining the friendly API

Metadata (FYI, to be later named annotations per #2297) is an essential mechanism for attaching optional information about columns. This ranges from publicly facing stuff that users should be aware of (slot names, which we use for feature names on feature columns, and key values), to stuff that is arguably useful for users but exists primarily for our internal infrastructure (e.g., whether something has already been normalized), to stuff intended purely for our internal infrastructure.

We ought to decide what we really want to be part of our initial public surface (as small as possible but no smaller), and internalize the rest of it.

So, we will keep Metadata (to be Annotations) as is:

This by itself is little more than an arbitrary string/object store, which is as intended. So that will not change. What will change however is the class we've made to make access a little more structured.

public static class MetadataUtils

This has stuff that is "good" in that we want to keep it as part of the public surface, but also stuff that is internal and should not be part of the public surface.

The good

A small amount of this stuff we probably want to keep.

However, we should probably move it somewhere else, perhaps to a static class SchemaColumnAnnotationsExtensions, as a series of extension methods on DataViewSchema.Column that access the associated metadata.

This might include things like these methods.

public static bool HasSlotNames(this DataViewSchema.Column column)

public static void GetSlotNames(this DataViewSchema.Column column, ref VBuffer<ReadOnlyMemory<char>> slotNames)
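
To make the proposed split concrete, here is a minimal sketch of the pattern: a plain string/object store on the column, internal kind names, and a small public set of extension methods layered on top. Everything in it is a hypothetical stand-in, not the actual ML.NET code: SketchColumn plays the role of DataViewSchema.Column, SketchAnnotationKinds plays the role of the internal Kinds class, the "SlotNames" key is illustrative, and string[] is used in place of VBuffer<ReadOnlyMemory<char>> to keep the example self-contained.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for DataViewSchema.Column: a name plus an
// arbitrary string/object store of annotations.
public sealed class SketchColumn
{
    private readonly Dictionary<string, object> _annotations;

    public string Name { get; }

    public SketchColumn(string name, Dictionary<string, object> annotations)
    {
        Name = name;
        _annotations = annotations ?? new Dictionary<string, object>();
    }

    // Raw access: false if the kind is absent or stored with another type.
    public bool TryGetAnnotation<T>(string kind, out T value)
    {
        if (_annotations.TryGetValue(kind, out var raw) && raw is T typed)
        {
            value = typed;
            return true;
        }
        value = default;
        return false;
    }
}

// Kind names stay internal: useful for consistency inside the library,
// but not something users need to see. The key value is illustrative.
internal static class SketchAnnotationKinds
{
    internal const string SlotNames = "SlotNames";
}

// The small public surface: structured accessors as extension methods.
public static class SketchColumnAnnotationsExtensions
{
    public static bool HasSlotNames(this SketchColumn column)
        => column.TryGetAnnotation<string[]>(SketchAnnotationKinds.SlotNames, out _);

    public static void GetSlotNames(this SketchColumn column, ref string[] slotNames)
    {
        if (!column.TryGetAnnotation<string[]>(SketchAnnotationKinds.SlotNames, out var names))
            throw new InvalidOperationException($"Column '{column.Name}' has no slot names.");
        slotNames = names;
    }
}
```

With something like this, a caller would write column.HasSlotNames() and column.GetSlotNames(ref names) without ever needing to know the kind string, which is the point of keeping the kinds internal.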

The bad

Much of this class though should be internal.

So, for example, we have this static class of Kinds of metadata. It is quite a nice thing to have for consistency in our own infrastructure, but it is not what we want to show users. The same goes for these labels for types of scorings, a scenario irrelevant to the ML.NET API as defined (since people evaluate scores explicitly, by calling some code that says, "here, evaluate these scores"). The same also goes for the stuff on the ranges of categorical variables which, while essential, is mostly for the benefit of downstream trainers consuming the data. (Users who want the raw categoricals can consume the source data themselves, just by programming, since they control the pipeline.)

There's also a lot of stuff built around implementing metadata, which is of questionable worth at this time given the changes that have happened to schema in the past year, and which is of no use whatever.

/cc @Ivanidzo4ka , @eerhardt , @rogancarr , @sfilipi
