Code documentation: Improve IEstimator/ITransformer/IDataView XML and high level docs

The central concept in the ML.NET API in the aftermath of #581 has become the `IEstimator`/`ITransformer`/`IDataView` triad, with the less essential but still important `IDataReader`/`IDataReaderEstimator`. That old issue, as you read it, defines a loose outline of what those now key interfaces should do, but in a somewhat vague and indefinite form, because it was not always obvious from the outset what would be correct or incorrect.

Yet, over the course of the last half-year or so as we pursued the practical work of making these structures and working with them, we've refined what was once indefinite to more definite statements, about what makes a correct vs. incorrect, what invariants we assume they do and do not apply.

This documentation will take, as far as I can tell, two forms. One is refinements on the XML documentation of the appropriate types themselves, to clarify the "pointwise" accuracy of the individual components. The second is a more general document describing how all those components work together, to give a broader context not just of what they must do (which mostly belongs in the pointwise XML documentation), but also *why* things are the way they are.

To give a simple example of this:

https://github.com/dotnet/machinelearning/blob/366332047340bdb12cf5fe2a11f9a6e257af9b84/src/Microsoft.ML.Core/Data/IEstimator.cs#L268-L278

We have here arguably the two most important methods in `ITransformer`. Now these two descriptions are not *wrong*, but they are lacking a critical bit of information, specifically: given an `IDataView data`, we have come to rely on the fact that `GetOutputSchema(data.Schema)` will return the "same" schema (not in the reference sense) as `Transform(data).Schema` would. That should be described.

Speaking to the second point, we should explain *why* this must be so, that is the correctness of our composability, chaining, or ability to extract useful attribute information has come to rely upon that.

I sort of view the second part as a companion to, or even a successor to, the existing [IDataView implementation](https://github.com/dotnet/machinelearning/blob/366332047340bdb12cf5fe2a11f9a6e257af9b84/docs/code/IDataViewImplementation.md) document, except one that also treats on `IEstimator` and `ITransformer`. That document, like this proposed document, encapsulated information that *existed* in the form of PR comments and other discussions, but concentrating that documentation "in one place" has proven an enormous time saver over the years to be able to point to that document to explain why things must be so, rather than isolated harder-to-find parts of conversations.

While it may be of use to end users (certainly if I am a user of an API, knowing what I can expect out of the key interfaces for a library is of some worth), the primary goal of the documentation is to ensure that, going forward, people "do the right thing" and have the right set of expectations.

	/// <summary>
	/// Schema propagation for transformers.
	/// Returns the output schema of the data, if the input schema is like the one provided.
	/// </summary>
	DataViewSchema GetOutputSchema(DataViewSchema inputSchema);

	/// <summary>
	/// Take the data in, make transformations, output the data.
	/// Note that <see cref="IDataView"/>'s are lazy, so no actual transformations happen here, just schema validation.
	/// </summary>
	IDataView Transform(IDataView input);

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Code documentation: Improve IEstimator/ITransformer/IDataView XML and high level docs #3096

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Code documentation: Improve IEstimator/ITransformer/IDataView XML and high level docs #3096

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions