Proposal for Major Change in API

In this issue we describe a proposal to change the API. The core of the
proposal is, instead of working via the entry-point runtime abstraction lying
on top of the implementing code, we encourage people to use the implementing
code directly.

# Current State

Within ML.NET, for a component to be exposed in the "public" API, a component
author takes the following steps (from an extremely high level):

1. The author writes a component, implementing some sort of central interface.
   Often this is something like `IDataLoader`, `IDataTransform`, `ITrainer`,
   or some other such type of object.
2. An "entry-point" wrapping object is created for that component. This is a
   purely functional view of components as having inputs (as fields in some
   sort of input class) and outputs (as fields in some sort of output class).
   This is decorated with attributes, to allow the dependency injection
   framework to do its work.
3. A JSON "manifest" describing all such components is created, through some
   process involving a scan of all `.dll`s and the aforementioned attributes.
4. Some other code reads this JSON "manifest" and out of it generates a number
   of C# classes. (This process being the code in `CSharpApiGenerator.cs`, the
   artifact of which is described in `CSharpApi.cs`.)

A user then works with this component in the following fashion.

1. The user constructs a `LearningPipeline` object.
2. They add implementations of `ILearningPipelineItem`, which are sort of
   configuration objects. (These are some of the objects that were code
   generated.)
3. Through some process that is probably too complex to describe here, these
   `ILearningPipelineItem` objects are transmuted into a sort of abstract
   "graph" structure composed of inputs and outputs. (This is an "entry-point"
   experiment graph.)
4. This graph structure is then serialized to JSON, de-serialized back out of
   JSON, then the actual underlying code that implements the operations is
   loaded using dependency injection.
5. Once loaded, the associated "settings" objects (which are actual types
   explicitly written in ML.NET) have their fields populated from values in
   this JSON.
6. There is some higher level runtime coordinating this process of graph nodes
   (the entry-point graph runner). This is a sort of runtime for the nodes,
   and handles job scheduling, variable setting, and whatnot.

The way this process works is via something called entry-points. Entry-points
were conceived as a mechanism to enable a "regular" way to invoke ML.NET
components from native code, that was more expressive and powerful than the
command line. Essentially: they are a command-line on steroids, that instead
of inventing a new DSL utilizes JSON. This is effective at alleviating the
burden of writing "bridges" from R and Python into ML.NET. It also has
advantages in situations where you need to send a sequence of commands "over
the wire" in some complex fashion. While a few types would need to be handled
(e.g., standard numeric types, `IDataView`, `IFileHandle`, and some others),
so long as the entry-points used *only* those supported types, composing an
experiment in those non-.NET environments would be possible.

# Possible Alternate State

Instead of working indirectly with ML.NET components through the entry-point
abstraction, you could just instantiate and use the existing classes directly.
That is, the aforementioned `IDataLoader`, `IDataTransform`, `ITrainer`, and
so forth would be instantiated and operated on directly.

While entry-points would still be necessary for any components we wished to
expose through R or Python, we would constrain our usage to those applications
where the added level of abstraction served some purpose.

This alternate pattern of usage is already well tested, as it actually
reflects how ML.NET itself is written.

# Changes for ML.NET

In order to move towards this state, a few high level adjustments will be
necessary.

* The low-level API is based on direct instantiation of `IDataView`/`ITrainer`
  and other fundamental types and utilities already used within ML.NET code.
* We will work to actively identify and improve that low-level API from the
  point of view of usage. See the next section for more in-depth discussion of
  this point.
* Writing higher level abstractions to make things easier should be
  encouraged, however always with the aim of making them non-opaque. That is,
  in edge cases when the abstraction fails, integrating what *can* be done
  with the abstraction with the lower level explicit API should be possible.
  Generally: Easy things should be easy and hard things should be possible.
* To clarify: We are not getting rid of entry-points, because they remain the
  mechanism by which interop from non-.NET programming environments into
  ML.NET will continue to happen, and are therefore important. The shift is:
  the lower level C# API will not use entry-points. For the purpose of
  servicing GUI/Python/non-.NET bindings, we will continue in our own code to
  provide entry-points, while allowing user code to work by implementing the
  core interfaces directly.

# Examples of Potential Improvements in "Direct Access" API

We give the following concrete examples of areas that probably need
improvement. The examples are meant to be illustrative only. That is: the list
is not exhaustive, nor are specific "solutions" to problems meant to convey
that something *must* be done in a particular way.

* Instantiation of late binding components was previously always done via
  dependency injection. Therefore, all components have constructors or static
  create methods that have had *identical* signatures (e.g., for transforms,
  `IHostEnvironment env, Arguments args, IDataView input`). Direct
  instantiation by the user *could* use that, but would doubtless be better
  served by a more contextually appropriate constructor that reflects common
  use-cases. For example, this:

  ```csharp
  IDataTransform trans = new ConcatTransform(env, new ConcatTransform.Arguments()
  {
      Column = new[]
      {
          new ConcatTransform.Column()
          {
              Name = "NumericalFeatures",
              Source = new[] { "SqftLiving", "SqftLot", "SqftAbove", "SqftBasement",
                  "Lat", "Long", "SqftLiving15", "SqftLot15" }
          }
      }
  }, loader);
  ```

  may become this:

  ```csharp
  IDataTransform trans = new ConcatTransform(env, loader, "NumericalFeatures",
      "SqftLiving", "SqftLot", "SqftAbove", "SqftBasement", "Lat", "Long",
      "SqftLiving15", "SqftLot15");
  ```

  This can work both ways: if these objects are directly instantiated, the
  objects could provide richer information than merely being an
  `IDataTransform`, or what have you. Because components have so far been used
  via the command line, entry-points, or a GUI, it was considered almost
  useless for a component to offer any purely programmatic access. So for
  example: we could have had the `AffineNormalizer` expose its slope and
  intercept directly, but we instead expose them through metadata. A direct
  accessor in ML.NET may be appropriate if we directly use these components.

* Creating a transform and creating a loader feel similar. However, creating a
  trainer, using it to provide a predictor, and then ultimately parameterizing
  a scorer transform with that predictor feels quite different. Where possible
  we can try to harmonize the interfaces to make them seem more consistent.
  (Obviously this is not always possible, since the underlying abstractions
  may in fact be genuinely different.)

* Some parts of the current library introduce needless complexity: the `Train`
  method on a trainer returns `void`, and is always followed by a call to
  `CreatePredictor`. Other instances of needless complexity may be less easy
  to resolve.

* Some parts of the current library introduce *needful* complexity, but could
  probably be improved somehow. `RoleMappedData` creation and usage, while
  providing an essential service ("use this column for this purpose"), is
  incredibly difficult to use. When it was just an "internal" structure we
  just sort of dealt with it, but we would like to improve it. (In some cases
  we can hide its creation in auxiliary helper methods, for example.)

* Simple things like better naming may help a lot. For example:
  `ScoreUtils.GetScorer` returns a transform with the predictor's scores
  applied to data. `ScoreUtils.GetScoredData` or something similar may be a
  better name.

* Our so-called "internal" methods do not always direct people towards pits of
  success. For example: some pipeline components should probably apply only
  during training (e.g., filtering, sampling, caching). Some distinction or
  other engineering nicety (e.g., have the utilities for saving models throw
  by default) may help warn people off this common misuse case.

* Components of the existing API that deal with
  late-binding/dependency-injection stuff could potentially use delegates or
  something like entry-point style factory interfaces instead. This means
  among other things lifting out things like `SubComponent` from most code.
  Whether these delegates happen to be composed from the command line parser
  calling `SubComponent.CreateInstance`, or some entry-point "subgraph"
  generating a delegate out of its own graph, is the business of the command
  line parser and entry-point engine, not the component code itself. (Maybe
  the delegate just calls Run graph or something then binds the values.)

  So for example what is currently this:

  ```csharp
  new Ova(env, new Ova.Arguments() { Trainer = new SubComponent("sdcaR") });
  ```

  might become this:

  ```csharp
  new Ova(env, host => new SdcaRegression(host));
  ```

* When we think about transform chains and pipelines, both the existing and suggested systems have a need for an intermediate object capable of representing a pipeline *before* it is instantiated. That intermediate form must be something you can reason over, both to pre-verify pipelines and to support certain applications like suggested transforms/auto-ML. One example is issue #267.

  Entry-points were *an* intermediate object, but being logically only `JObject`s you could not get rich information about what they would do or how they would operate. (Given a pipeline in entry-points you could tell that something might be outputting *an* `IDataView`, for example, but have no information about what columns were actually in that output.)

  This suggests that the API will want something *like* `LearningPipeline`, though I am quite confident `LearningPipeline` is an incorrect level of abstraction. (See the previous point about opaque abstractions, among other points.)

Note that many of these enhancements will serve not only users, but component
authors (including us), and so improve the whole platform.
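
To make the `Train`/`CreatePredictor` point above a little more concrete, here is a minimal sketch of the two shapes. Everything in it is a hypothetical stand-in written for illustration (`IPredictor`, `ITwoStepTrainer`, `IOneStepTrainer`, `MeanTrainer`, `OneStepAdapter`); none of these are actual ML.NET types or signatures, and the real interfaces take richer inputs than two arrays. It is only meant to show that a trainer whose `Train` returns the predictor removes a step the caller would otherwise have to remember.

```csharp
using System;

// Hypothetical stand-in for a trained model; NOT an ML.NET type.
public interface IPredictor
{
    float Predict(float x);
}

// The shape described above: Train returns void, so the caller must
// remember to call CreatePredictor afterwards.
public interface ITwoStepTrainer
{
    void Train(float[] xs, float[] ys);
    IPredictor CreatePredictor();
}

// One possible alternative: training hands back the predictor directly.
public interface IOneStepTrainer
{
    IPredictor Train(float[] xs, float[] ys);
}

// Toy predictor that ignores its input and returns a constant.
public sealed class MeanPredictor : IPredictor
{
    private readonly float _mean;
    public MeanPredictor(float mean) { _mean = mean; }
    public float Predict(float x) => _mean;
}

// Toy two-step trainer that "learns" the mean of the labels.
public sealed class MeanTrainer : ITwoStepTrainer
{
    private float? _mean;

    public void Train(float[] xs, float[] ys)
    {
        if (ys == null || ys.Length == 0)
            throw new ArgumentException("At least one label is required.", nameof(ys));
        float sum = 0f;
        foreach (float y in ys)
            sum += y;
        _mean = sum / ys.Length;
    }

    public IPredictor CreatePredictor()
    {
        // The two-step shape makes this misuse possible in the first place.
        if (_mean == null)
            throw new InvalidOperationException("Train must be called before CreatePredictor.");
        return new MeanPredictor(_mean.Value);
    }
}

// Adapter exposing an existing two-step trainer through the one-step shape.
public sealed class OneStepAdapter : IOneStepTrainer
{
    private readonly ITwoStepTrainer _inner;

    public OneStepAdapter(ITwoStepTrainer inner)
    {
        _inner = inner ?? throw new ArgumentNullException(nameof(inner));
    }

    public IPredictor Train(float[] xs, float[] ys)
    {
        _inner.Train(xs, ys);
        return _inner.CreatePredictor();
    }
}
```

With the one-step shape, user code reads as `IPredictor p = trainer.Train(xs, ys);`, and the "called `CreatePredictor` before `Train`" failure mode is no longer reachable from that interface. Whether an adapter, a changed interface, or something else entirely is the right approach for ML.NET is left open here.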

# Miscellaneous Details

Note that C# code generation from entry-point graphs will still be possible:
all entry-point invocations come down to (1) defining input objects, (2)
calling a static method and (3) doing something with the output object.
However it will probably not be possible to make it seem "natural" any more
than an attempt to do code-generation from a `mml` command line would seem
"natural."

When we decided to make the public facing API entry-points based, this
necessarily required shifting related infrastructure (e.g., `GraphRunner`,
`JsonManifestUtils`) into more central assemblies. Once that "idiom" is
deconstructed, this infrastructure should resume its prior state of being in
an isolated assembly.

Along similar lines of isolation, once we shift the components to not use
`SubComponent` directly, we can "uplift" what is currently the command line
parsing code out into a separate assembly.
