Skip to content

Reduce public surface area of ColumnType and family. #1543

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Nov 7, 2018

Conversation

TomFinley
Copy link
Contributor

@TomFinley TomFinley commented Nov 6, 2018

Fixes #1533 .

  • Introduce internals-visible-to on core, and [BestFriend] attributes on key members of ColumnType (see issue Best friends to limit friend access #1519) so internally the implementation can use the old conveniences.

  • Make type-specific data available only on the relevant type, for example, Size on VectorType, which replaces all of VectorSize, ValueCount, IsKnownSizeVectorType.

  • All IsX and AsX should be replaced with is XType or as XType.

  • Also in the spirit of the above, hide other redundant conveniences.

  • De-emphasize rather entirely DataKind as having anything to do with ColumnType, at least publicly.

  • Validate new public surface by having the tests use it instead of the old way.

  • No more odd DimCount/GetDim accessors on vector types, instead an ImmutableArray<int> Dimensions.

Note that I was deliberately trying to err on the side of hiding as much as possible, but I believe the set of actual information you can get out of a type is identical. (Up to the presence of DataKind, which I am trying to de-emphasize possibly to the point of killing it outright, though not in this PR.)

Contracts.Assert(IsKey);
}

public KeyType(Type type, ulong min, int count, bool contiguous = true)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is potentially controversial. We need the ability at least in the tests to be able to create column types, but if not over DataKind then the options as I see it are static create methods for the 4 different types (which might be acceptable) or we make the type somehow itself generic (which again might be acceptable, though I'd prefer to think a bit more about that), or do what I do here and just have basically the same public constructor except over System.Type and not DataKind, which seemed like the least revolutionary approach given the current state.

return size;
}

public int DimCount { get { return _dims != null ? _dims.Length : _size > 0 ? 1 : 0; } }
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is also potentially controversial; this was not part of the original review comments but it was so bad I felt compelled to improve it. I've replaced all this machinery around _dims and DimCount and GetDim with a single ImmutableArray<int> Dimensions. (In general I think it would be positive to start using this library. Roslyn uses it a lot and we have a similar desire for immutable collections of numbers.)

DimCount was an incredibly odd structure, reflecting _dims. Let's imagine you created something with dims [2, 5], [7], and [0, 5]. In those cases, the size would be 10, 7, 0 respectively, and the dimensions would be equal to what you had passed in. So far so good. But if you passed in [0], the number of dimensions would equal 0, not 1. This was confusing. Also of course we have this thing that is kind of a "list light" because we could not expose _dims directly, but neither I guess did we want to expose some sort of .NET type like IReadOnlyList or somesuch, so we had this DimCount and GetDim. Exposing this other structure makes a lot more sense to me.


[assembly: InternalsVisibleTo(assemblyName: "Microsoft.ML.TestFramework, PublicKey=002400000480000094000000060200000024000052534131000400000100010015c01ae1f50e8cc09ba9eac9147cf8fd9fce2cfe9f8dce4f7301c4132ca9fb50ce8cbf1df4dc18dd4d210e4345c744ecb3365ed327efdbc52603faa5e21daa11234c8c4a73e51f03bf192544581ebe107adee3a34928e39d04e524a9ce729d5090bfd7dad9d10c722c0def9ccc08ff0a03790e48bcd1f9b6c476063e1966a1c4")]
[assembly: InternalsVisibleTo(assemblyName: "Microsoft.ML.TestFramework" + PublicKey.TestValue)]
//[assembly: InternalsVisibleTo(assemblyName: "Microsoft.ML.StaticPipelineTesting" + PublicKey.TestValue)]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should remove this commented out section.

@@ -844,20 +843,20 @@ public void TextNormalizeStatic()

Assert.True(schema.TryGetColumnIndex("norm", out int norm));
var type = schema.GetColumnType(norm);
Assert.True(!type.IsVector && type.ItemType.IsText);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I read this I think we're probably doing a good thing by removing the conveniences, since I feel like while they can be convenient they can also lead to confusion. Here we have someone that tested to make sure something wasn't vector and that it was also text, thinking that perhaps IsText would return for a vector of text (a totally reasonable expectation), when in fact it is just some shorthand for is TextType.

var expectedType = TensorFlowUtils.Tf2MlNetType(_parent.TFInputTypes[i]);
if (type.ItemType != expectedType)
throw _host.ExceptSchemaMismatch(nameof(inputSchema), "input", _parent.Inputs[i], expectedType.ToString(), type.ToString());
var originalShape = _parent.TFInputShapes[i];
var shape = originalShape.ToIntArray();

var colTypeDims = Enumerable.Range(0, type.AsVector.DimCount + 1).Select(d => d == 0 ? 1 : (long)type.AsVector.GetDim(d - 1)).ToArray();
var colTypeDims = vecType.Dimensions.Prepend(1).Select(dim => (long)dim).ToArray();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is another place where I feel like the conveniences worked against us. By having this AsVector here, I think the API was forming the expectation of whoever wrote this code that something "magical" would happen, but no, it is quite literally just as VectorType, so this could conceivably be null. This is not to say that some other bug wouldn't have arisen, but at least under standard syntax perhaps this would have been more obvious.

@@ -123,7 +123,7 @@ public void HashScalarDouble()
[GlobalSetup(Target = nameof(HashScalarKey))]
public void SetupHashScalarKey()
{
InitMap(6u, new KeyType(DataKind.U4, 0, 100));
InitMap(6u, new KeyType(typeof(uint), 0, 100));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually kind of like this better.

@@ -14,7 +14,7 @@
namespace Microsoft.ML.Runtime.Data
{
/// <summary>
/// Utilities for implementing and using the metadata API of ISchema.
/// Utilities for implementing and using the metadata API of <see cref="ISchema"/>.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ISchema [](start = 76, length = 7)

Schema now

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Zruty0 !

Copy link
Contributor

@Zruty0 Zruty0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

[BestFriend]
internal static class PublicKey
{
public const string Value = ", PublicKey=00240000048000009400000006020000002400005253413100040000010001004b86c4cb78549b34bab61a3b1800e23bfeb5b3ec390074041536a7e3cbd97f5f04cf0f857155a8928eaa29ebfd11cfbbad3ba70efea7bda3226c6a8d370a4cd303f714486b6ebc225985a638471e6ef571cc92a4613c00b8fa65d61ccee0cbe5f36330c9a01f4183559f1bef24cc2917c6d913e3a541333a1d05d9bed22b38cb";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

, [](start = 37, length = 2)

I would prefer the separator not to be part of Value :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, what alternative do you envision? Are you suggesting I have a separate constant with the ", " in it? That I put that with all the assembly names, so that instead of having this awkward ", at the head of this string, we instead have an awkward `, " at the end of every assembly name?

Perhaps you meant something else though? If you did not though, sorry, I will let it stand as is since either of those suggestions seem worse.

private readonly bool _isVector;
private readonly bool _isNumber;
private readonly bool _isKey;

// This private constructor sets all the _isXxx flags. It is invoked by other ctors.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(nit) _isXxx flags is out of date.


/// <summary>
/// The DataKind corresponding to RawType, if there is one (zero otherwise). It is equivalent
/// to the result produced by DataKindExtensions.TryGetDataKind(RawType, out kind).
/// </summary>
public DataKind RawKind { get { return _rawKind; } }
[BestFriend]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be better if we were able to design ColumnType in such a way that it didn't need [BestFriend] internal on any of its members. Imagine when we try to extract IDataView from ML.NET. We will need to redesign this type again in order to get rid of this internal usage of ColumnType to the rest of ML.NET.

It would be better to design it up front in such a way that would work anywhere.

For example, the comment below: External code should use <c>is <see cref="PrimitiveType"/></c>. - why isn't that good enough for everyone?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be better to design it up front in such a way that would work anywhere.

Hi @eerhardt , it does not need any of these internal conveniences, since everything you can get through the internal properties can be fetched through the public properties. So let's first distinguish whether we could use the public surface internally (yes, in case not clear) from the question of whether we should do that work right now.

I think it's pretty clear we could, so you are I guess insisting we do that right now. Perhaps you're not aware of the amount of effort entailed by this request, so I will enumerate it.

If I right now just remove all these things in my change where I added this BestFriend attribute in this file, I get 1549 errors. Each one of those requires, based on my experience in doing the tests yesterday when I was preparing this PR, about one minute to resolve each as I think about how to restructure the code so that it is equivalent -- the changes are not mechanical. They are easy, but not mechanical -- great first issues, by the way. 26 hours of work, let's say. Factor in breaks, meetings, and other work, I'd be surprised if I could actually reasonably get that done in a week.

I was actually kind of hoping to be done with this today so that I could move on to hiding many other public surfaces in time for the API review. Now then, that's not going to happen for various reasons (since @Zruty0 is the only other one that has signed off so far), which is a pity, but whatever.

It may be that having someone do this work eventually would be a good idea for the reasons you have outlined, but I'd say in this precise moment insisting I do it when there are many other things that almost certainly require my unique attention would not be wise. What I proposed in #1533, and which I still believe is the wiser course of action, was that I not do that, and instead make the public surface what we want it to be.

* Introduce internals-visible-to on core, and [BestFriend] attributes on key
  members of ColumnType so internally the implementation can use the old
  conveniences.

* Make type-specific data available only on the relevant type, for example,
  Size on VectorType, which replaces all of VectorSize, ValueCount,
  IsKnownSizeVectorType.

* All `IsX` and `AsX` should be replaced with `is XType` or `as XType`.

* Also in the spirit of the above, hide other redundant conveniences.

* De-emphasize rather entirely DataKind as having anything to do with
  ColumnType, at least publicly.

* Validate new public surface by having the tests use it instead of the old
  way.
@yaeldekel
Copy link

            case DataKind.I8:

Not related to this PR, but this looks wrong. All the Time types should also return false.


Refers to: src/Microsoft.ML.Data/Transforms/NAFilter.cs:198 in 6de7f20. [](commit_id = 6de7f20, deletion_comment = False)

@@ -17,12 +17,16 @@ public void TestEqualAndGetHashCode()
{
var dict = new Dictionary<ColumnType, string>();
// add PrimitiveTypes, KeyType & corresponding VectorTypes
PrimitiveType tmp;
//PrimitiveType tmp;
Copy link

@yaeldekel yaeldekel Nov 7, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

//PrimitiveType tmp; [](start = 12, length = 20)

This can be deleted. #Resolved

Copy link

@yaeldekel yaeldekel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@TomFinley
Copy link
Contributor Author

TomFinley commented Nov 7, 2018

            case DataKind.I8:

Huh yeah. That's pretty bad. Should be only keys and R4/R8 I think. Also the error message makes associated with the thing that calls this makes absolutely no sense whatsoever, and can be improved.


In reply to: 436456100 [](ancestors = 436456100)


Refers to: src/Microsoft.ML.Data/Transforms/NAFilter.cs:198 in 6de7f20. [](commit_id = 6de7f20, deletion_comment = False)

@TomFinley TomFinley merged commit 5566d2a into dotnet:master Nov 7, 2018
@TomFinley TomFinley deleted the tfinley/ColumnType branch November 7, 2018 03:34
@ghost ghost locked as resolved and limited conversation to collaborators Mar 27, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants