Code documentation: Improve IEstimator/ITransformer/IDataView XML and high level docs #3096
Labels
code-sanitation
Code consistency, maintainability, and best practices, moreso than any public API.
documentation
Related to documentation of ML.NET
P3
Doc bugs, questions, minor issues, etc.
The central concept in the ML.NET API in the aftermath of #581 has become the IEstimator/ITransformer/IDataView triad, with the less essential but still important IDataReader/IDataReaderEstimator. That old issue, as you read it, gives a loose outline of what those now-key interfaces should do, but in a somewhat vague and indefinite form, because it was not always obvious from the outset what would be correct or incorrect.
Yet over the last half-year or so, as we pursued the practical work of building these structures and working with them, we have refined what was once indefinite into more definite statements about what makes an implementation correct vs. incorrect, and about which invariants we assume do and do not hold.
This documentation will take, as far as I can tell, two forms. One is refinements on the XML documentation of the appropriate types themselves, to clarify the "pointwise" accuracy of the individual components. The second is a more general document describing how all those components work together, to give a broader context not just of what they must do (which mostly belongs in the pointwise XML documentation), but also why things are the way they are.
To give a simple example of this:
machinelearning/src/Microsoft.ML.Core/Data/IEstimator.cs
Lines 268 to 278 in 3663320
We have here arguably the two most important methods in ITransformer. Now these two descriptions are not wrong, but they are lacking a critical bit of information, specifically: given an IDataView data, we have come to rely on the fact that GetOutputSchema(data.Schema) will return the "same" schema (not in the reference sense) as Transform(data).Schema would. That should be described.
Speaking to the second point, we should also explain why this must be so: the correctness of our composability, chaining, and ability to extract useful attribute information has come to rely upon it.
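To make the invariant above concrete, here is a minimal sketch of how it could be checked. It assumes ML.NET 1.x API names (MLContext, LoadFromEnumerable, CopyColumns, DataViewSchema). The SchemaContract class, its SameShape and Holds helpers, and the Row type are hypothetical names made up for this illustration and are not part of ML.NET. The comparison is by value over column name, type, and hidden status. It deliberately ignores annotations, so it is weaker than whatever "same" should finally mean in the documentation.

```csharp
using System;
using Microsoft.ML;
using Microsoft.ML.Data;

// Hypothetical helper for illustration only; not part of ML.NET.
public static class SchemaContract
{
    // Value comparison, column by column: name, type, and hidden status.
    // Annotations are not compared in this sketch.
    public static bool SameShape(DataViewSchema a, DataViewSchema b)
    {
        if (a.Count != b.Count)
            return false;
        for (int i = 0; i < a.Count; i++)
        {
            if (a[i].Name != b[i].Name
                || !a[i].Type.Equals(b[i].Type)
                || a[i].IsHidden != b[i].IsHidden)
                return false;
        }
        return true;
    }

    // The schema promised from the input schema alone should match
    // the schema of the data view actually produced by Transform.
    public static bool Holds(ITransformer transformer, IDataView data)
    {
        DataViewSchema promised = transformer.GetOutputSchema(data.Schema);
        DataViewSchema actual = transformer.Transform(data).Schema;
        return SameShape(promised, actual);
    }
}

// Hypothetical row type used only to build a tiny in-memory data view.
public class Row
{
    public float A { get; set; }
}

public static class Program
{
    public static void Main()
    {
        var ml = new MLContext();
        IDataView data = ml.Data.LoadFromEnumerable(new[] { new Row { A = 1f } });

        // Any fitted transformer would do; column copying is just a simple one.
        ITransformer transformer = ml.Transforms.CopyColumns("B", "A").Fit(data);

        Console.WriteLine(SchemaContract.Holds(transformer, data));
    }
}
```

A check of this shape is one way to see why chaining depends on the invariant: a chain can only be validated from schemas alone, before any data flows, if each link's GetOutputSchema agrees with what its Transform later produces.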
I sort of view the second part as a companion to, or even a successor to, the existing IDataView implementation document, except one that also treats IEstimator and ITransformer. That document, like this proposed one, encapsulated information that existed in the form of PR comments and other discussions. Concentrating that documentation "in one place" has proven an enormous time saver over the years: we can point to that document to explain why things must be so, rather than to isolated, harder-to-find parts of conversations.
While it may be of use to end users (certainly if I am a user of an API, knowing what I can expect out of the key interfaces for a library is of some worth), the primary goal of the documentation is to ensure that, going forward, people "do the right thing" and have the right set of expectations.