-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Hide more of Microsoft.ML.Data #2842
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
/// uniformity on the public API surface, wherever it is judged where columns with default names should be consumed. | ||
/// </summary> | ||
[BestFriend] | ||
internal static class DefaultColumnNames |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While helpful for enabling a uniform public surface, this is of course not the same as it needing to be in the public surface.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(I agree we should internalize this - just raising as a general question in this area)
How do we imagine users to find out about these "magic" strings? Currently, one way is the default values in the MLContext APIs. But I wonder if that is enough. For example, when making a Prediction "output" class, how do users know to name/attribute their properties as PredictedLabel
, Probability
, Score
? Just through documentation?
In reply to: 262298366 [](ancestors = 262298366)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, unfortunately I think both helpful displays in the debugger (for live discovery) and documentation (for offline discovery) are the only real answer we have right now -- both seem insufficient or at least suboptimal. This is a problem I think in most places we've chosen to believe that producing "default" columns names is more helpful than requiring the user to specify them. I am unsure how to address that problem. Something resembling the static typed API will I think be a solution to that problem, since the columns will be typed and addressable and right there.
Edit: Just to clarify, I am not saying that producing default column names is unhelpful or we should always insist that people provide them, just, this is one of the effects of that conscious choice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have seen quite a few instances where users use DefaultColumnNames
. I don't understand why we want to hide it. We will always have default column names, why not allow users to access them instead of hard coding them in their applications?
searching the samples repo shows there are quite a few instances where people use DefaultColumnNames
.
https://github.com/dotnet/machinelearning-samples/search?q=defaultcolumnnames&unscoped_q=defaultcolumnnames
or this other user provided code:
#2486 (comment)
In reply to: 262302369 [](ancestors = 262302369)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In those places, what would have happened in those samples if instead of DefaultColumnNames.Features
we instead wrote "Features"
? I understand the appeal inside the tool of having a consistent "location" for them. For user code, I am less certain.
I am very uncomfortable with having as part of our public surface a class that is merely a collection of "magic strings" that we've used internally to denote certain concepts. That's not great API design.
The samples can and should change. To use this class of magic constants was, frankly, a bad move by the sample author, since it was totally needless and it obscures what's going on more than it clarifies it.
@@ -148,11 +152,13 @@ IRowToRowMapper ITransformer.GetRowToRowMapper(DataViewSchema inputSchema) | |||
/// <summary> | |||
/// Estimator for trained wrapped transformers. | |||
/// </summary> | |||
public abstract class TrainedWrapperEstimatorBase : IEstimator<TransformWrapper> | |||
internal abstract class TrainedWrapperEstimatorBase : IEstimator<TransformWrapper> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a funny one. I was able to remove all subclasses of it, but some notes I read elsewhere led me to suspect there may be more than this than meets the eye. Started an issue about this for #2841. (Though that does not touch the public surface and is therefore inessential.)
{ | ||
protected readonly IHost Host; | ||
[BestFriend] | ||
private protected readonly IHost Host; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have checked and in the public surface there is currently no protected
member anywhere in Microsoft.ML.Data
.
@@ -28,16 +28,6 @@ internal ModelOperationsCatalog(IHostEnvironment env) | |||
Explainability = new ExplainabilityTransforms(this); | |||
} | |||
|
|||
public abstract class SubCatalogBase |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I may do more work to get rid of this sort of thing. These mysterious subclasses mean nothing to the user, and as part of the "central" object MLContext
, we should probably choose some alternate mechanism. (I got rid of it here, but there's similar code elsewhere.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very good, I have done as I suggested, and got rid of the rest. We typically hide infrastructure behind this using the tricks of explicit implementations of internal interfaces, so I used the same in introducing a new interface IInternalCatalog
(not the best name, perhaps, but as it is internal we are free to improve it later). This also allowed me to simplify CatalogUtils
considerably. So, now unless it has some other reason to exist (e.g., train catalog base does have a reason to exist, whereas SubCatalogBase
did not), these mysterious "infrastructure" subclasses for catalogs are now gone.
@@ -103,7 +103,7 @@ public abstract class PredictionEngineBase<TSrc, TDst> : IDisposable | |||
/// <summary> | |||
/// Provides output schema. | |||
/// </summary> | |||
public DataViewSchema OutputSchema; | |||
public DataViewSchema OutputSchema { get; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A mutable schema on the base class of the prediction engine. That would have been embarrassing. :)
@@ -14,8 +15,9 @@ namespace Microsoft.ML.Transforms.Text | |||
/// Produces a bag of counts of ngrams (sequences of consecutive words) in a given text. | |||
/// It does so by building a dictionary of ngrams and using the id in the dictionary as the index in the bag. | |||
/// </summary> | |||
public sealed class WordBagEstimator : TrainedWrapperEstimatorBase | |||
public sealed class WordBagEstimator : IEstimator<ITransformer> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The TrainedWrapperEstimatorBase
was meant to facilitate the translation from the legacy IDataTransform
interface to the new ITransformer
interface. It is intended to be a temporary situation; however, by making this class descend from that abstract base class, that temporary shim would have had to become permanent -- you cannot "undescend" from a class without it being a breaking change in your API. Therefore we would have had to live with it forever.
So we had to duplicate the logic that existed in that base class. Fortunately there were only two subclasses remaining (both in this file, as it happened), so it was not such a big problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a really good idea! until we do the proper translation
In reply to: 262299912 [](ancestors = 262299912)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Codecov Report
@@ Coverage Diff @@
## master #2842 +/- ##
==========================================
- Coverage 71.71% 71.69% -0.02%
==========================================
Files 809 809
Lines 142494 142522 +28
Branches 16114 16114
==========================================
+ Hits 102185 102186 +1
- Misses 35874 35907 +33
+ Partials 4435 4429 -6
|
} | ||
|
||
public SchemaShape GetOutputSchema(SchemaShape inputSchema) | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a summary for this new public method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose I could. While acknowledging the worth of documentation, I would point out documentation can be added any time. We have three weeks and only three weeks to finalize the public shape of the API. Anything that is there must remain there forever.
This method is not really "new" as such. It is merely copied from the existing base class, where it has precisely the same documentation as you see now.
If other people are making similar requests on other PRs, I might advise them to desist.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there’s an “inheritdoc” syntax you can use.
Aspnet uses it:
It looks like some doc tools respect these. Do ours?
Here’s the C# Lang feature request to add it as an official support in the language.
return new TransformWrapper(_host, WordHashBagProducingTransformer.Create(_host, options, input), true); | ||
} | ||
|
||
public SchemaShape GetOutputSchema(SchemaShape inputSchema) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SchemaShape [](start = 15, length = 11)
Same here: Can you add a summary for this new public method?
…ractically unsubclassable.
…lic surface. Simplify CatalogUtils.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🕐
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The documentation is wrong, but this line (381) is not at least showing as a change for me either on the Github web review tool, nor in the non-web tools we use to review, so, maybe something somehow got screwed up on your end? The last line it shows that I've changed is 258, and the next line is 409. This line (381) definitely did not change. I agree that the documentation should be changed to actually reflect reality, but I hope we can refine that later. In reply to: 469504933 [](ancestors = 469504933,469501477) Refers to: src/Microsoft.ML.Data/TrainCatalog.cs:381 in 658a128. [](commit_id = 658a128, deletion_comment = False) |
Towards #1602. When performing what I hope is one of my final reviews of the public surface area of
Microsoft.ML.Data
I saw many "small" items, each too petty to warrant separate issues, but that we nonetheless do not want in the public surface. This is in the vein of #2300 and other similar PRs. Special emphasis was placed on making sure we don't have abstract protected members visible as part of the public surface, and other such things as this.In my review of the assembly I did not address those types or members I knew were being taken care of through other channels. (E.g., the attribute bearing marker interfaces for
IComponentFactory
, or the calibrator.)