-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Reduce public surface area of ColumnType and family. #1543
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Contracts.Assert(IsKey); | ||
} | ||
|
||
public KeyType(Type type, ulong min, int count, bool contiguous = true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is potentially controversial. We need the ability at least in the tests to be able to create column types, but if not over DataKind
then the options as I see it are static create methods for the 4 different types (which might be acceptable) or we make the type somehow itself generic (which again might be acceptable, though I'd prefer to think a bit more about that), or do what I do here and just have basically the same public constructor except over System.Type
and not DataKind
, which seemed like the least revolutionary approach given the current state.
return size; | ||
} | ||
|
||
public int DimCount { get { return _dims != null ? _dims.Length : _size > 0 ? 1 : 0; } } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is also potentially controversial; this was not part of the original review comments but it was so bad I felt compelled to improve it. I've replaced all this machinery around _dims
and DimCount
and GetDim
with a single ImmutableArray<int> Dimensions
. (In general I think it would be positive to start using this library. Roslyn uses it a lot and we have a similar desire for immutable collections of numbers.)
DimCount
was an incredibly odd structure, reflecting _dims
. Let's imagine you created something with dims [2, 5]
, [7]
, and [0, 5]
. In those cases, the size would be 10
, 7
, 0
respectively, and the dimensions would be equal to what you had passed in. So far so good. But if you passed in [0]
, the number of dimensions would equal 0, not 1. This was confusing. Also of course we have this thing that is kind of a "list light" because we could not expose _dims
directly, but neither I guess did we want to expose some sort of .NET type like IReadOnlyList
or somesuch, so we had this DimCount
and GetDim
. Exposing this other structure makes a lot more sense to me.
|
||
[assembly: InternalsVisibleTo(assemblyName: "Microsoft.ML.TestFramework, PublicKey=002400000480000094000000060200000024000052534131000400000100010015c01ae1f50e8cc09ba9eac9147cf8fd9fce2cfe9f8dce4f7301c4132ca9fb50ce8cbf1df4dc18dd4d210e4345c744ecb3365ed327efdbc52603faa5e21daa11234c8c4a73e51f03bf192544581ebe107adee3a34928e39d04e524a9ce729d5090bfd7dad9d10c722c0def9ccc08ff0a03790e48bcd1f9b6c476063e1966a1c4")] | ||
[assembly: InternalsVisibleTo(assemblyName: "Microsoft.ML.TestFramework" + PublicKey.TestValue)] | ||
//[assembly: InternalsVisibleTo(assemblyName: "Microsoft.ML.StaticPipelineTesting" + PublicKey.TestValue)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should remove this commented out section.
@@ -844,20 +843,20 @@ public void TextNormalizeStatic() | |||
|
|||
Assert.True(schema.TryGetColumnIndex("norm", out int norm)); | |||
var type = schema.GetColumnType(norm); | |||
Assert.True(!type.IsVector && type.ItemType.IsText); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I read this I think we're probably doing a good thing by removing the conveniences, since I feel like while they can be convenient they can also lead to confusion. Here we have someone that tested to make sure something wasn't vector and that it was also text, thinking that perhaps IsText
would return for a vector of text (a totally reasonable expectation), when in fact it is just some shorthand for is TextType
.
var expectedType = TensorFlowUtils.Tf2MlNetType(_parent.TFInputTypes[i]); | ||
if (type.ItemType != expectedType) | ||
throw _host.ExceptSchemaMismatch(nameof(inputSchema), "input", _parent.Inputs[i], expectedType.ToString(), type.ToString()); | ||
var originalShape = _parent.TFInputShapes[i]; | ||
var shape = originalShape.ToIntArray(); | ||
|
||
var colTypeDims = Enumerable.Range(0, type.AsVector.DimCount + 1).Select(d => d == 0 ? 1 : (long)type.AsVector.GetDim(d - 1)).ToArray(); | ||
var colTypeDims = vecType.Dimensions.Prepend(1).Select(dim => (long)dim).ToArray(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is another place where I feel like the conveniences worked against us. By having this AsVector
here, I think the API was forming the expectation of whoever wrote this code that something "magical" would happen, but no, it is quite literally just as VectorType
, so this could conceivably be null. This is not to say that some other bug wouldn't have arisen, but at least under standard syntax perhaps this would have been more obvious.
@@ -123,7 +123,7 @@ public void HashScalarDouble() | |||
[GlobalSetup(Target = nameof(HashScalarKey))] | |||
public void SetupHashScalarKey() | |||
{ | |||
InitMap(6u, new KeyType(DataKind.U4, 0, 100)); | |||
InitMap(6u, new KeyType(typeof(uint), 0, 100)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually kind of like this better.
4e6bc14
to
e0ad0bd
Compare
@@ -14,7 +14,7 @@ | |||
namespace Microsoft.ML.Runtime.Data | |||
{ | |||
/// <summary> | |||
/// Utilities for implementing and using the metadata API of ISchema. | |||
/// Utilities for implementing and using the metadata API of <see cref="ISchema"/>. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ISchema [](start = 76, length = 7)
Schema
now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Zruty0 !
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[BestFriend] | ||
internal static class PublicKey | ||
{ | ||
public const string Value = ", PublicKey=00240000048000009400000006020000002400005253413100040000010001004b86c4cb78549b34bab61a3b1800e23bfeb5b3ec390074041536a7e3cbd97f5f04cf0f857155a8928eaa29ebfd11cfbbad3ba70efea7bda3226c6a8d370a4cd303f714486b6ebc225985a638471e6ef571cc92a4613c00b8fa65d61ccee0cbe5f36330c9a01f4183559f1bef24cc2917c6d913e3a541333a1d05d9bed22b38cb"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
, [](start = 37, length = 2)
I would prefer the separator not to be part of Value
:)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, what alternative do you envision? Are you suggesting I have a separate constant with the ", "
in it? That I put that with all the assembly names, so that instead of having this awkward ",
at the head of this string, we instead have an awkward `, " at the end of every assembly name?
Perhaps you meant something else though? If you did not though, sorry, I will let it stand as is since either of those suggestions seem worse.
private readonly bool _isVector; | ||
private readonly bool _isNumber; | ||
private readonly bool _isKey; | ||
|
||
// This private constructor sets all the _isXxx flags. It is invoked by other ctors. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(nit) _isXxx flags
is out of date.
|
||
/// <summary> | ||
/// The DataKind corresponding to RawType, if there is one (zero otherwise). It is equivalent | ||
/// to the result produced by DataKindExtensions.TryGetDataKind(RawType, out kind). | ||
/// </summary> | ||
public DataKind RawKind { get { return _rawKind; } } | ||
[BestFriend] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be better if we were able to design ColumnType in such a way that it didn't need [BestFriend] internal
on any of its members. Imagine when we try to extract IDataView from ML.NET. We will need to redesign this type again in order to get rid of this internal usage of ColumnType to the rest of ML.NET.
It would be better to design it up front in such a way that would work anywhere.
For example, the comment below: External code should use <c>is <see cref="PrimitiveType"/></c>.
- why isn't that good enough for everyone?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be better to design it up front in such a way that would work anywhere.
Hi @eerhardt , it does not need any of these internal conveniences, since everything you can get through the internal properties can be fetched through the public properties. So let's first distinguish whether we could use the public surface internally (yes, in case not clear) from the question of whether we should do that work right now.
I think it's pretty clear we could, so you are I guess insisting we do that right now. Perhaps you're not aware of the amount of effort entailed by this request, so I will enumerate it.
If I right now just remove all these things in my change where I added this BestFriend
attribute in this file, I get 1549 errors. Each one of those requires, based on my experience in doing the tests yesterday when I was preparing this PR, about one minute to resolve each as I think about how to restructure the code so that it is equivalent -- the changes are not mechanical. They are easy, but not mechanical -- great first issues, by the way. 26 hours of work, let's say. Factor in breaks, meetings, and other work, I'd be surprised if I could actually reasonably get that done in a week.
I was actually kind of hoping to be done with this today so that I could move on to hiding many other public surfaces in time for the API review. Now then, that's not going to happen for various reasons (since @Zruty0 is the only other one that has signed off so far), which is a pity, but whatever.
It may be that having someone do this work eventually would be a good idea for the reasons you have outlined, but I'd say in this precise moment insisting I do it when there are many other things that almost certainly require my unique attention would not be wise. What I proposed in #1533, and which I still believe is the wiser course of action, was that I not do that, and instead make the public surface what we want it to be.
* Introduce internals-visible-to on core, and [BestFriend] attributes on key members of ColumnType so internally the implementation can use the old conveniences. * Make type-specific data available only on the relevant type, for example, Size on VectorType, which replaces all of VectorSize, ValueCount, IsKnownSizeVectorType. * All `IsX` and `AsX` should be replaced with `is XType` or `as XType`. * Also in the spirit of the above, hide other redundant conveniences. * De-emphasize rather entirely DataKind as having anything to do with ColumnType, at least publicly. * Validate new public surface by having the tests use it instead of the old way.
e0ad0bd
to
6de7f20
Compare
@@ -17,12 +17,16 @@ public void TestEqualAndGetHashCode() | |||
{ | |||
var dict = new Dictionary<ColumnType, string>(); | |||
// add PrimitiveTypes, KeyType & corresponding VectorTypes | |||
PrimitiveType tmp; | |||
//PrimitiveType tmp; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
//PrimitiveType tmp; [](start = 12, length = 20)
This can be deleted. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Huh yeah. That's pretty bad. Should be only keys and R4/R8 I think. Also the error message makes associated with the thing that calls this makes absolutely no sense whatsoever, and can be improved. In reply to: 436456100 [](ancestors = 436456100) Refers to: src/Microsoft.ML.Data/Transforms/NAFilter.cs:198 in 6de7f20. [](commit_id = 6de7f20, deletion_comment = False) |
Fixes #1533 .
Introduce internals-visible-to on core, and [BestFriend] attributes on key members of ColumnType (see issue Best friends to limit friend access #1519) so internally the implementation can use the old conveniences.
Make type-specific data available only on the relevant type, for example, Size on VectorType, which replaces all of VectorSize, ValueCount, IsKnownSizeVectorType.
All
IsX
andAsX
should be replaced withis XType
oras XType
.Also in the spirit of the above, hide other redundant conveniences.
De-emphasize rather entirely DataKind as having anything to do with ColumnType, at least publicly.
Validate new public surface by having the tests use it instead of the old way.
No more odd
DimCount
/GetDim
accessors on vector types, instead anImmutableArray<int> Dimensions
.Note that I was deliberately trying to err on the side of hiding as much as possible, but I believe the set of actual information you can get out of a type is identical. (Up to the presence of
DataKind
, which I am trying to de-emphasize possibly to the point of killing it outright, though not in this PR.)