Skip to content

IDataView Cleanup: ColumnType cleanup #1533

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
TomFinley opened this issue Nov 5, 2018 · 1 comment
Closed

IDataView Cleanup: ColumnType cleanup #1533

TomFinley opened this issue Nov 5, 2018 · 1 comment
Assignees

Comments

@TomFinley
Copy link
Contributor

TomFinley commented Nov 5, 2018

The IDataView type system is extensible (as we see with ImageType).

This is fine, but there is something confusing about ColumnType as well, since there are lots of methods and properties on the base class ColumnType that are specific for derived types. For example, .IsVector, .KeyCount, and other such things, that are only really relevant if the type is either vector, key, or whatever.

Why clean up?

There are lots of things on ColumnType that are unappealing. There are things like AsVector and IsVector which, as the documentation states, are equivalent to as VectorType and is VectorType. I mean, why? You save a few characters here and there, but at the cost of complicating one of the most central classes in the API.

Some things are just plain old silly. Why IsTimeSpan? How useful is that, really? Some things are like this.

There's also DataKind. This is so strange. This has already caused a fair amount of confusion among some people: they see this, and they think, "oh the types are just from this enum." No, they're not.

The reality is, these things are conveniences, but they're conveniences I think that confuse people (multiple smart people have thought their presence meant the type system was not extensible), so maybe we ought not to expose them, at least, not in their current form.

Why not clean up?

In a sense, forming an analogue between the IDV and .NET type systems, there is some precedent for this sort of thing: if we consider System.Type. This has the property IsArray, with the methods GetArrayRank, which is only sensible to to use if the the IsArray property was true. However in our case, ColumnType is an abstract class, and while System.Type is abstract, its inheritance structure does not capture specific types of values in the same sense we do, e.g., there is no specific string type descended from Type. If, hypothetically, .GetType of an int[] returned some type System.ArrayType that descended from System.Type, then we might equally hypothetically imagine that the method to get the array rank would be on that derived ArrayType, rather than on Type directly.

There is also a practical consideration. The reality though is that some types are definitely more important and more heavily used than others.

Let's imagine that we kept the ColumnType inheritance structure as it is now, but removed any properties relevant only to any derived type, specifically. What would hypothetically happen? I picked this usage more or less randomly from our codebase.

var type = schema.GetColumnType(ScoreIndex);
if (!type.IsKnownSizeVector || type.ItemType != NumberType.Float)
throw Host.Except("Score column '{0}' has type {1}, but must be a float vector of known-size", ScoreCol, type);

Now then, let's imagine that we have none of the "specialty" properties on ColumnType used above, but instead have a IsKnownSize and ItemType on VectorType specifically, that is, not on the root class, and you must rephrase this thing as a VectorType if you wished to access these. The most clear way I can imagine to deliver equivalent logic to the linked condition is this:

if (!(type is VectorType vecType && vecType.IsKnownSize && vecType.ItemType == NumberType.Float))

That's not so bad really... the condition went from 60 to 92 characters, which while not great, is hardly ridiculous.

It's even conceivable that had we had pattern matching at the time this code is written, we would have done this. Prior to C# 7.0 (and this code is way prior to C# 7.0), there was no such things as this pattern matching, as we see used here in the type is VectorType vecType expression. So the equivalent in the pre-pattern match days would have been considerably more obnoxious and verbose.

Let's talk about DataKind. Unquestionably it is confusing, but if you take a look at it, it is also really, really helpful to have, for the common builtin types, an enum.

Proposed balance

I think it's possible to sort of have our cake and eat it too. Now that we have #1520, we can sort of make our public surface as sparse as possible, while allowing the conveniences we currently enjoy for the internal implementation to remain more or less unmolested.

  • Let us mark these questionable things as internal, but with BestFriend attributes on them. The internal code can retain is sparsity,
  • Let us add to the public surface the same information that we get from the types on the specific relevant types themselves. (E.g., the vector size would just be publicly part of VectorSize itself.)
  • In any event, some of the properties have no reason to exist in any world. For example, a property like IsTimeSpan exists just because someone misunderstood what was going on, and thought, "gee I'm adding a new class, I see we have tests for things like vector and text, let me just add this." Nope, we just have those special things as conveniences.

We could then, if we like either (1) make the conveniences public or (2) remove them altogether, at our own pace, without jeopardizing the public surface of the API at all.

/cc @Zruty0 @shauheen @terrajobst

@TomFinley
Copy link
Contributor Author

TomFinley commented Nov 6, 2018

Doing this now. For the most part, this is relatively straightforward. Lots of things being marked as internal with best friend attribute set. The only thing I actually had to add was a size property on VectorType, since aside from VectorSize (which existed on ColumnType) there was nothing.

VectorType has the notion of "dimensions," which is fine and valuable, but the interface to it is absolutely bewildering. The number of dimensions is equal to the rank of the output vectors, and can be multi-dimensional (e.g., tensors).

I'd also like to do something to KeyType, but just filing a separate issue for this #1540, since I feel like that might prove a bit too controversial.

eerhardt added a commit to eerhardt/machinelearning that referenced this issue Jan 7, 2019
- IsKey
- KeyCount
- KeyCountCore

Part of the work necessary for dotnet#1860 and contributes to dotnet#1533.
eerhardt added a commit to eerhardt/machinelearning that referenced this issue Jan 11, 2019
- IsKey
- KeyCount
- KeyCountCore

Part of the work necessary for dotnet#1860 and contributes to dotnet#1533.
eerhardt added a commit to eerhardt/machinelearning that referenced this issue Jan 11, 2019
Remove the following members from ColumnType:

- IsVector
- ItemType
- IsKnownSizeVector
- VectorSize
- ValueCount

Part of the work necessary for dotnet#1860 and contributes to dotnet#1533.
eerhardt added a commit that referenced this issue Jan 11, 2019
* Remove "KeyType" specific members on ColumnType.

- IsKey
- KeyCount
- KeyCountCore

Part of the work necessary for #1860 and contributes to #1533.
eerhardt added a commit to eerhardt/machinelearning that referenced this issue Jan 11, 2019
Remove the following members from ColumnType:

- IsVector
- ItemType
- IsKnownSizeVector
- VectorSize
- ValueCount

Part of the work necessary for dotnet#1860 and contributes to dotnet#1533.
eerhardt added a commit to eerhardt/machinelearning that referenced this issue Jan 14, 2019
Remove the following members from ColumnType:

- IsVector
- ItemType
- IsKnownSizeVector
- VectorSize
- ValueCount

Part of the work necessary for dotnet#1860 and contributes to dotnet#1533.
eerhardt added a commit to eerhardt/machinelearning that referenced this issue Jan 15, 2019
Remove all usages of RawKind that are outside of ML.Core and ML.Data assemblies. The next round will completely remove ColumnType.RawKind.

Part of the work necessary for dotnet#1860 and contributes to dotnet#1533.
eerhardt added a commit to eerhardt/machinelearning that referenced this issue Jan 15, 2019
Remove all usages of RawKind that are outside of ML.Core and ML.Data assemblies. The next round will completely remove ColumnType.RawKind.

Part of the work necessary for dotnet#1860 and contributes to dotnet#1533.
eerhardt added a commit to eerhardt/machinelearning that referenced this issue Jan 15, 2019
Remove the following members from ColumnType:

- IsVector
- ItemType
- IsKnownSizeVector
- VectorSize
- ValueCount

Part of the work necessary for dotnet#1860 and contributes to dotnet#1533.
eerhardt added a commit to eerhardt/machinelearning that referenced this issue Jan 15, 2019
Remove the following members from ColumnType:

- IsVector
- ItemType
- IsKnownSizeVector
- VectorSize
- ValueCount

Part of the work necessary for dotnet#1860 and contributes to dotnet#1533.
eerhardt added a commit that referenced this issue Jan 16, 2019
* Remove "VectorType" specific members on ColumnType.

Remove the following members from ColumnType:

- IsVector
- ItemType
- IsKnownSizeVector
- VectorSize
- ValueCount

Part of the work necessary for #1860 and contributes to #1533.

* Address review comments.

- Make extension methods verbs.
- Add doc to GetItemType extension.
- Fix one place using Size > 0 => IsKnownSize.
eerhardt added a commit to eerhardt/machinelearning that referenced this issue Jan 16, 2019
Remove all usages of RawKind that are outside of ML.Core and ML.Data assemblies. The next round will completely remove ColumnType.RawKind.

Part of the work necessary for dotnet#1860 and contributes to dotnet#1533.
eerhardt added a commit that referenced this issue Jan 17, 2019
* Remove ColumnType.RawKind usages Round 1.

Remove all usages of RawKind that are outside of ML.Core and ML.Data assemblies. The next round will completely remove ColumnType.RawKind.

Part of the work necessary for #1860 and contributes to #1533.
eerhardt added a commit to eerhardt/machinelearning that referenced this issue Jan 17, 2019
Removes the "easy" usages of ColumnType.RawKind.

Part of the work necessary for dotnet#1860 and contributes to dotnet#1533.
eerhardt added a commit to eerhardt/machinelearning that referenced this issue Jan 22, 2019
eerhardt added a commit that referenced this issue Jan 23, 2019
* Remove ColumnType.RawKind usages Round 2.

Removes the "easy" usages of ColumnType.RawKind.

Part of the work necessary for #1860 and contributes to #1533.
eerhardt added a commit to eerhardt/machinelearning that referenced this issue Jan 23, 2019
eerhardt added a commit to eerhardt/machinelearning that referenced this issue Jan 23, 2019
eerhardt added a commit that referenced this issue Jan 23, 2019
* Remove more usages of ColumnType.RawKind.

* Remove Conversions usage of DataKind and instead use System.Type.

* Remove the rest of RawKind usages in Conversions.

* Remove ColumnType.RawKind.

Fix #1533

* PR feedback
@ghost ghost locked as resolved and limited conversation to collaborators Mar 27, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant