Skip to content

Hide more of Microsoft.ML.Data #2842

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Mar 5, 2019
Merged

Hide more of Microsoft.ML.Data #2842

merged 8 commits into from
Mar 5, 2019

Conversation

TomFinley
Copy link
Contributor

Towards #1602. When performing what I hope is one of my final reviews of the public surface area of Microsoft.ML.Data I saw many "small" items, each too petty to warrant separate issues, but that we nonetheless do not want in the public surface. This is in the vein of #2300 and other similar PRs. Special emphasis was placed on making sure we don't have abstract protected members visible as part of the public surface, and other such things as this.

In my review of the assembly I did not address those types or members I knew were being taken care of through other channels. (E.g., the attribute bearing marker interfaces for IComponentFactory, or the calibrator.)

/// uniformity on the public API surface, wherever it is judged where columns with default names should be consumed.
/// </summary>
[BestFriend]
internal static class DefaultColumnNames
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While helpful for enabling a uniform public surface, this is of course not the same as it needing to be in the public surface.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I agree we should internalize this - just raising as a general question in this area)

How do we imagine users to find out about these "magic" strings? Currently, one way is the default values in the MLContext APIs. But I wonder if that is enough. For example, when making a Prediction "output" class, how do users know to name/attribute their properties as PredictedLabel, Probability, Score? Just through documentation?


In reply to: 262298366 [](ancestors = 262298366)

Copy link
Contributor Author

@TomFinley TomFinley Mar 5, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, unfortunately I think both helpful displays in the debugger (for live discovery) and documentation (for offline discovery) are the only real answer we have right now -- both seem insufficient or at least suboptimal. This is a problem I think in most places we've chosen to believe that producing "default" columns names is more helpful than requiring the user to specify them. I am unsure how to address that problem. Something resembling the static typed API will I think be a solution to that problem, since the columns will be typed and addressable and right there.

Edit: Just to clarify, I am not saying that producing default column names is unhelpful or we should always insist that people provide them, just, this is one of the effects of that conscious choice.

Copy link
Contributor

@artidoro artidoro Mar 5, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have seen quite a few instances where users use DefaultColumnNames. I don't understand why we want to hide it. We will always have default column names, why not allow users to access them instead of hard coding them in their applications?

searching the samples repo shows there are quite a few instances where people use DefaultColumnNames.
https://github.com/dotnet/machinelearning-samples/search?q=defaultcolumnnames&unscoped_q=defaultcolumnnames

or this other user provided code:
#2486 (comment)


In reply to: 262302369 [](ancestors = 262302369)

Copy link
Contributor Author

@TomFinley TomFinley Mar 5, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In those places, what would have happened in those samples if instead of DefaultColumnNames.Features we instead wrote "Features"? I understand the appeal inside the tool of having a consistent "location" for them. For user code, I am less certain.

I am very uncomfortable with having as part of our public surface a class that is merely a collection of "magic strings" that we've used internally to denote certain concepts. That's not great API design.

The samples can and should change. To use this class of magic constants was, frankly, a bad move by the sample author, since it was totally needless and it obscures what's going on more than it clarifies it.

@@ -148,11 +152,13 @@ IRowToRowMapper ITransformer.GetRowToRowMapper(DataViewSchema inputSchema)
/// <summary>
/// Estimator for trained wrapped transformers.
/// </summary>
public abstract class TrainedWrapperEstimatorBase : IEstimator<TransformWrapper>
internal abstract class TrainedWrapperEstimatorBase : IEstimator<TransformWrapper>
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a funny one. I was able to remove all subclasses of it, but some notes I read elsewhere led me to suspect there may be more than this than meets the eye. Started an issue about this for #2841. (Though that does not touch the public surface and is therefore inessential.)

{
protected readonly IHost Host;
[BestFriend]
private protected readonly IHost Host;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have checked and in the public surface there is currently no protected member anywhere in Microsoft.ML.Data.

@@ -28,16 +28,6 @@ internal ModelOperationsCatalog(IHostEnvironment env)
Explainability = new ExplainabilityTransforms(this);
}

public abstract class SubCatalogBase
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may do more work to get rid of this sort of thing. These mysterious subclasses mean nothing to the user, and as part of the "central" object MLContext, we should probably choose some alternate mechanism. (I got rid of it here, but there's similar code elsewhere.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this change. 👍


In reply to: 262299101 [](ancestors = 262299101)

Copy link
Contributor Author

@TomFinley TomFinley Mar 5, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good, I have done as I suggested, and got rid of the rest. We typically hide infrastructure behind this using the tricks of explicit implementations of internal interfaces, so I used the same in introducing a new interface IInternalCatalog (not the best name, perhaps, but as it is internal we are free to improve it later). This also allowed me to simplify CatalogUtils considerably. So, now unless it has some other reason to exist (e.g., train catalog base does have a reason to exist, whereas SubCatalogBase did not), these mysterious "infrastructure" subclasses for catalogs are now gone.

@@ -103,7 +103,7 @@ public abstract class PredictionEngineBase<TSrc, TDst> : IDisposable
/// <summary>
/// Provides output schema.
/// </summary>
public DataViewSchema OutputSchema;
public DataViewSchema OutputSchema { get; }
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A mutable schema on the base class of the prediction engine. That would have been embarrassing. :)

@@ -14,8 +15,9 @@ namespace Microsoft.ML.Transforms.Text
/// Produces a bag of counts of ngrams (sequences of consecutive words) in a given text.
/// It does so by building a dictionary of ngrams and using the id in the dictionary as the index in the bag.
/// </summary>
public sealed class WordBagEstimator : TrainedWrapperEstimatorBase
public sealed class WordBagEstimator : IEstimator<ITransformer>
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The TrainedWrapperEstimatorBase was meant to facilitate the translation from the legacy IDataTransform interface to the new ITransformer interface. It is intended to be a temporary situation; however, by making this class descend from that abstract base class, that temporary shim would have had to become permanent -- you cannot "undescend" from a class without it being a breaking change in your API. Therefore we would have had to live with it forever.

So we had to duplicate the logic that existed in that base class. Fortunately there were only two subclasses remaining (both in this file, as it happened), so it was not such a big problem.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a really good idea! until we do the proper translation


In reply to: 262299912 [](ancestors = 262299912)

Copy link
Member

@eerhardt eerhardt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@codecov
Copy link

codecov bot commented Mar 5, 2019

Codecov Report

Merging #2842 into master will decrease coverage by 0.01%.
The diff coverage is 82.78%.

@@            Coverage Diff             @@
##           master    #2842      +/-   ##
==========================================
- Coverage   71.71%   71.69%   -0.02%     
==========================================
  Files         809      809              
  Lines      142494   142522      +28     
  Branches    16114    16114              
==========================================
+ Hits       102185   102186       +1     
- Misses      35874    35907      +33     
+ Partials     4435     4429       -6
Flag Coverage Δ
#Debug 71.69% <82.78%> (-0.02%) ⬇️
#production 67.92% <82.78%> (-0.02%) ⬇️
#test 85.9% <ø> (ø) ⬆️
Impacted Files Coverage Δ
src/Microsoft.ML.StaticPipe/SchemaBearing.cs 100% <ø> (ø) ⬆️
src/Microsoft.ML.FastTree/FastTree.cs 80.77% <ø> (ø) ⬆️
src/Microsoft.ML.Data/DataLoadSave/FakeSchema.cs 83.33% <ø> (ø) ⬆️
src/Microsoft.ML.Data/Transforms/ValueMapping.cs 84.37% <ø> (ø) ⬆️
...Microsoft.ML.Data/Scorers/PredictionTransformer.cs 97.02% <ø> (ø) ⬆️
...ML.Data/Transforms/ValueToKeyMappingTransformer.cs 78.21% <ø> (ø) ⬆️
src/Microsoft.ML.Transforms/OneHotEncoding.cs 85.44% <ø> (ø) ⬆️
...Microsoft.ML.Data/DataLoadSave/TransformWrapper.cs 22% <0%> (-11%) ⬇️
...soft.ML.Data/Transforms/OneToOneTransformerBase.cs 95.08% <100%> (ø) ⬆️
...Microsoft.ML.Data/Training/TrainerEstimatorBase.cs 78.26% <100%> (ø) ⬆️
... and 21 more

}

public SchemaShape GetOutputSchema(SchemaShape inputSchema)
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a summary for this new public method?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose I could. While acknowledging the worth of documentation, I would point out documentation can be added any time. We have three weeks and only three weeks to finalize the public shape of the API. Anything that is there must remain there forever.

This method is not really "new" as such. It is merely copied from the existing base class, where it has precisely the same documentation as you see now.

If other people are making similar requests on other PRs, I might advise them to desist.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there’s an “inheritdoc” syntax you can use.

Aspnet uses it:

https://github.com/aspnet/AspNetCore/blob/c565386a3ed135560bc2e9017aa54a950b4e35dd/src/Mvc/Mvc.Core/src/Infrastructure/NonDisposableStream.cs

It looks like some doc tools respect these. Do ours?

Here’s the C# Lang feature request to add it as an official support in the language.

dotnet/csharplang#313

return new TransformWrapper(_host, WordHashBagProducingTransformer.Create(_host, options, input), true);
}

public SchemaShape GetOutputSchema(SchemaShape inputSchema)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SchemaShape [](start = 15, length = 11)

Same here: Can you add a summary for this new public method?

@artidoro
Copy link
Contributor

artidoro commented Mar 5, 2019

    /// </summary>

Should this be more specific to calibrators? #Resolved


Refers to: src/Microsoft.ML.Data/TrainCatalog.cs:381 in 392cee6. [](commit_id = 392cee6, deletion_comment = False)

@artidoro
Copy link
Contributor

artidoro commented Mar 5, 2019

    /// </summary>

Why is this showing as a change? I see it was added as part of ivan's PR #2766


In reply to: 469501477 [](ancestors = 469501477)


Refers to: src/Microsoft.ML.Data/TrainCatalog.cs:381 in 392cee6. [](commit_id = 392cee6, deletion_comment = False)

Copy link
Contributor

@artidoro artidoro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🕐

Copy link
Contributor

@artidoro artidoro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@TomFinley
Copy link
Contributor Author

    /// </summary>

The documentation is wrong, but this line (381) is not at least showing as a change for me either on the Github web review tool, nor in the non-web tools we use to review, so, maybe something somehow got screwed up on your end?

The last line it shows that I've changed is 258, and the next line is 409. This line (381) definitely did not change.

I agree that the documentation should be changed to actually reflect reality, but I hope we can refine that later.


In reply to: 469504933 [](ancestors = 469504933,469501477)


Refers to: src/Microsoft.ML.Data/TrainCatalog.cs:381 in 658a128. [](commit_id = 658a128, deletion_comment = False)

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants