Skip to content

Code documentation: Improve IEstimator/ITransformer/IDataView XML and high level docs #3096

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
TomFinley opened this issue Mar 26, 2019 · 0 comments
Assignees
Labels
code-sanitation Code consistency, maintainability, and best practices, moreso than any public API. documentation Related to documentation of ML.NET P3 Doc bugs, questions, minor issues, etc.

Comments

@TomFinley
Copy link
Contributor

The central concept in the ML.NET API in the aftermath of #581 has become the IEstimator/ITransformer/IDataView triad, with the less essential but still important IDataReader/IDataReaderEstimator. That old issue, as you read it, defines a loose outline of what those now key interfaces should do, but in a somewhat vague and indefinite form, because it was not always obvious from the outset what would be correct or incorrect.

Yet, over the course of the last half-year or so as we pursued the practical work of making these structures and working with them, we've refined what was once indefinite to more definite statements, about what makes a correct vs. incorrect, what invariants we assume they do and do not apply.

This documentation will take, as far as I can tell, two forms. One is refinements on the XML documentation of the appropriate types themselves, to clarify the "pointwise" accuracy of the individual components. The second is a more general document describing how all those components work together, to give a broader context not just of what they must do (which mostly belongs in the pointwise XML documentation), but also why things are the way they are.

To give a simple example of this:

/// <summary>
/// Schema propagation for transformers.
/// Returns the output schema of the data, if the input schema is like the one provided.
/// </summary>
DataViewSchema GetOutputSchema(DataViewSchema inputSchema);
/// <summary>
/// Take the data in, make transformations, output the data.
/// Note that <see cref="IDataView"/>'s are lazy, so no actual transformations happen here, just schema validation.
/// </summary>
IDataView Transform(IDataView input);

We have here arguably the two most important methods in ITransformer. Now these two descriptions are not wrong, but they are lacking a critical bit of information, specifically: given an IDataView data, we have come to rely on the fact that GetOutputSchema(data.Schema) will return the "same" schema (not in the reference sense) as Transform(data).Schema would. That should be described.

Speaking to the second point, we should explain why this must be so, that is the correctness of our composability, chaining, or ability to extract useful attribute information has come to rely upon that.

I sort of view the second part as a companion to, or even a successor to, the existing IDataView implementation document, except one that also treats on IEstimator and ITransformer. That document, like this proposed document, encapsulated information that existed in the form of PR comments and other discussions, but concentrating that documentation "in one place" has proven an enormous time saver over the years to be able to point to that document to explain why things must be so, rather than isolated harder-to-find parts of conversations.

While it may be of use to end users (certainly if I am a user of an API, knowing what I can expect out of the key interfaces for a library is of some worth), the primary goal of the documentation is to ensure that, going forward, people "do the right thing" and have the right set of expectations.

@TomFinley TomFinley added documentation Related to documentation of ML.NET code-sanitation Code consistency, maintainability, and best practices, moreso than any public API. labels Mar 26, 2019
@TomFinley TomFinley self-assigned this Mar 26, 2019
@shmoradims shmoradims added P1 Priority of the issue for triage purpose: Needs to be fixed soon. P2 Priority of the issue for triage purpose: Needs to be fixed at some point. and removed P1 Priority of the issue for triage purpose: Needs to be fixed soon. P2 Priority of the issue for triage purpose: Needs to be fixed at some point. labels May 21, 2019
@ganik ganik added P3 Doc bugs, questions, minor issues, etc. and removed P1 Priority of the issue for triage purpose: Needs to be fixed soon. labels Jan 10, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
code-sanitation Code consistency, maintainability, and best practices, moreso than any public API. documentation Related to documentation of ML.NET P3 Doc bugs, questions, minor issues, etc.
Projects
None yet
Development

No branches or pull requests

3 participants