
Metadata/annotations public surface of the API #2622

Closed
@TomFinley

Metadata (to be renamed annotations per #2297) is an essential mechanism for attaching optional information to columns. It ranges from public-facing information that users should be aware of (slot names, which we use for feature names on feature columns, and key values), to information that is arguably useful for users but exists primarily for our internal infrastructure (e.g., whether something has already been normalized), to information intended purely for our internal infrastructure.

We ought to decide what we really want to be part of our initial public surface (as small as possible but no smaller), and internalize the rest of it.

So, we will keep Metadata (to be Annotations) as is:

This by itself is little more than an arbitrary string/object store, which is as intended. So that will not change. What will change, however, is the class we've made to make access a little more structured.

public static class MetadataUtils

This has stuff that is "good" in that we want to keep it as part of the public surface, but also stuff that is internal and should not be part of the public surface.

The good

A small amount of this stuff we probably want to keep.

However we should probably move it somewhere else... perhaps, the static class SchemaColumnAnnotationsExtensions as a series of extension methods on top of DataViewSchema.Column to access the associated metadata.

This might include things like these methods.

public static bool HasSlotNames(this DataViewSchema.Column column)

public static void GetSlotNames(this DataViewSchema.Column column, ref VBuffer<ReadOnlyMemory<char>> slotNames)
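As a rough illustration of what such extension methods could look like, here is a minimal sketch. It assumes the post-rename API shape (a `DataViewSchema.Column.Annotations` property exposing `Schema` and `GetValue`, plus the `VectorDataViewType` and `TextDataViewType` type names) and assumes the slot-name annotation is stored under the kind string "SlotNames". It needs a reference to the Microsoft.ML package and is not necessarily how the final implementation will look.

```csharp
using System;
using Microsoft.ML;
using Microsoft.ML.Data;

public static class SchemaColumnAnnotationsExtensions
{
    // Assumption: slot names live under this annotation kind.
    private const string SlotNamesKind = "SlotNames";

    // True only when the column is a known-size vector and carries a text
    // vector annotation of matching length under the slot-names kind.
    public static bool HasSlotNames(this DataViewSchema.Column column)
    {
        if (!(column.Type is VectorDataViewType vectorType) || !vectorType.IsKnownSize)
            return false;

        // The annotations of a column have a schema of their own; look the kind up there.
        DataViewSchema.Column? annotation =
            column.Annotations.Schema.GetColumnOrNull(SlotNamesKind);

        return annotation.HasValue
            && annotation.Value.Type is VectorDataViewType namesType
            && namesType.Size == vectorType.Size
            && namesType.ItemType is TextDataViewType;
    }

    // Copies the slot names into the caller's buffer. This throws if the
    // annotation is missing, so callers should check HasSlotNames first.
    public static void GetSlotNames(
        this DataViewSchema.Column column,
        ref VBuffer<ReadOnlyMemory<char>> slotNames)
        => column.Annotations.GetValue(SlotNamesKind, ref slotNames);
}
```

A caller would check `HasSlotNames` before calling `GetSlotNames` with a default `VBuffer<ReadOnlyMemory<char>>`. If the library version in use already ships extension methods with these names, calling through the static class name (`SchemaColumnAnnotationsExtensions.HasSlotNames(column)`) avoids an ambiguity error.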

The bad

Much of this class though should be internal.

So for example, we have this static class of Kinds of metadata. It is quite a nice thing to have for our own infrastructure for consistency, but this is not what we want to show users. Similarly with this sort of labels for types of scorings, which serve a scenario irrelevant to the ML.NET API as defined (since people evaluate scores explicitly, by calling code that says "here, evaluate these scores"). Also the stuff on the ranges of categorical variables which, while essential, is mostly for the benefit of downstream trainers consuming the data. (Users who want the raw categoricals can consume the source data themselves with ordinary code, since they control the pipeline.)

There's also a lot of stuff built around implementing metadata, which is of questionable worth at this time given the changes that have happened to schema in the past year, and which is of no use whatever.

/cc @Ivanidzo4ka , @eerhardt , @rogancarr , @sfilipi
