IDataView Cleanup: ColumnType cleanup #1533

Closed
@TomFinley

The IDataView type system is extensible (as we see with ImageType).

This is fine, but there is something confusing about ColumnType as well, since there are lots of methods and properties on the base class ColumnType that are specific to derived types. For example, .IsVector, .KeyCount, and other such things are only really relevant if the type is a vector, a key, or whatever.

Why clean up?

There are lots of things on ColumnType that are unappealing. There are things like AsVector and IsVector which, as the documentation states, are equivalent to as VectorType and is VectorType. I mean, why? You save a few characters here and there, but at the cost of complicating one of the most central classes in the API.

Some things are just plain old silly. Why IsTimeSpan? How useful is that, really? Some things are like this.

There's also DataKind. This is so strange. This has already caused a fair amount of confusion among some people: they see this, and they think, "oh the types are just from this enum." No, they're not.

The reality is, these things are conveniences, but they're conveniences I think that confuse people (multiple smart people have thought their presence meant the type system was not extensible), so maybe we ought not to expose them, at least, not in their current form.

Why not clean up?

In a sense, drawing an analogy between the IDV and .NET type systems, there is some precedent for this sort of thing if we consider System.Type. This has the property IsArray and the method GetArrayRank, which is only sensible to use if the IsArray property is true. However, in our case ColumnType is an abstract class, and while System.Type is also abstract, its inheritance structure does not capture specific types of values in the same sense ours does; e.g., there is no specific string type descended from Type. If, hypothetically, .GetType of an int[] returned some type System.ArrayType that descended from System.Type, then we might equally hypothetically imagine that the method to get the array rank would be on that derived ArrayType, rather than on Type directly.
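To make the System.Type precedent concrete, here is a minimal sketch of the "check the flag, then call the method" pattern. The helper name RankOrNull is made up for illustration; Type.IsArray and Type.GetArrayRank are the real .NET members.

```csharp
using System;

public static class ArrayRankDemo
{
    // Hypothetical helper: returns the array rank when 'type' is an array
    // type, or null otherwise.
    public static int? RankOrNull(Type type)
    {
        // GetArrayRank throws ArgumentException on a non-array type,
        // so the IsArray check has to come first.
        return type.IsArray ? type.GetArrayRank() : (int?)null;
    }
}
```

The member lives on the root class and is only meaningful after a flag check, which is the same shape as IsVector and friends on ColumnType.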

There is also a practical consideration: some types are definitely more important and more heavily used than others.

Let's imagine that we kept the ColumnType inheritance structure as it is now, but removed any properties relevant only to a specific derived type. What would hypothetically happen? I picked this usage more or less randomly from our codebase.

var type = schema.GetColumnType(ScoreIndex);
if (!type.IsKnownSizeVector || type.ItemType != NumberType.Float)
    throw Host.Except("Score column '{0}' has type {1}, but must be a float vector of known-size", ScoreCol, type);

Now then, let's imagine that we have none of the "specialty" properties on ColumnType used above, but instead have an IsKnownSize and ItemType on VectorType specifically, that is, not on the root class, so you must treat the type as a VectorType if you wish to access them. The clearest way I can imagine to deliver logic equivalent to the condition above is this:

if (!(type is VectorType vecType && vecType.IsKnownSize && vecType.ItemType == NumberType.Float))

That's not so bad really... the condition went from 60 to 92 characters, which while not great, is hardly ridiculous.

It's even conceivable that, had we had pattern matching at the time this code was written, we would have done this. Prior to C# 7.0 (and this code is way prior to C# 7.0), there was no such thing as pattern matching, which we see used here in the type is VectorType vecType expression. So the equivalent in the pre-pattern-matching days would have been considerably more obnoxious and verbose.
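For comparison, here is a rough sketch of both styles side by side. The ColumnType, NumberType and VectorType classes below are simplified stand-ins written only for this illustration (including the assumption that a Size of 0 means "unknown"); they are not the actual ML.NET classes.

```csharp
namespace PatternMatchSketch
{
    // Hypothetical, simplified stand-ins; NOT the real ML.NET types.
    public abstract class ColumnType { }

    public sealed class NumberType : ColumnType
    {
        public static readonly NumberType Float = new NumberType();
        private NumberType() { }
    }

    public sealed class VectorType : ColumnType
    {
        public ColumnType ItemType { get; }
        public int Size { get; } // assumption for this sketch: 0 means unknown size
        public bool IsKnownSize => Size > 0;

        public VectorType(ColumnType itemType, int size)
        {
            ItemType = itemType;
            Size = size;
        }
    }

    public static class ScoreColumnCheck
    {
        // C# 7.0 and later: the type test and the cast fit in one expression.
        public static bool IsKnownSizeFloatVector(ColumnType type)
        {
            return type is VectorType vecType
                && vecType.IsKnownSize
                && vecType.ItemType == NumberType.Float;
        }

        // Before C# 7.0: the cast needs its own statement and a null check.
        public static bool IsKnownSizeFloatVectorOldStyle(ColumnType type)
        {
            var vecType = type as VectorType;
            return vecType != null
                && vecType.IsKnownSize
                && vecType.ItemType == NumberType.Float;
        }
    }
}
```

The old style cannot be written inline in an if condition without first declaring vecType, which is presumably a large part of why the convenience properties on the base class were attractive at the time.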

Let's talk about DataKind. Unquestionably it is confusing, but if you take a look at it, it is also really, really helpful to have, for the common builtin types, an enum.

Proposed balance

I think it's possible to sort of have our cake and eat it too. Now that we have #1520, we can sort of make our public surface as sparse as possible, while allowing the conveniences we currently enjoy for the internal implementation to remain more or less unmolested.

  • Let us mark these questionable things as internal, but with BestFriend attributes on them. The internal code can retain its sparsity.
  • Let us add to the public surface the same information that we get from these conveniences, but on the specific relevant types themselves. (E.g., the vector size would just be publicly part of VectorType itself.)
  • In any event, some of the properties have no reason to exist in any world. For example, a property like IsTimeSpan exists just because someone misunderstood what was going on, and thought, "gee I'm adding a new class, I see we have tests for things like vector and text, let me just add this." Nope, we just have those special things as conveniences.
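The shape of the first two points could look roughly like the sketch below. Everything in it is an assumption made for illustration: the BestFriendAttribute declared here is a local stand-in for the attribute from #1520, and the member names (IsVector, VectorSize, Size, IsKnownSize) are simplified and may not match the real classes.

```csharp
using System;

namespace ProposedBalanceSketch
{
    // Hypothetical stand-in for the BestFriend attribute introduced in #1520.
    [AttributeUsage(AttributeTargets.All)]
    internal sealed class BestFriendAttribute : Attribute { }

    public abstract class ColumnType
    {
        // Conveniences stay available to internal code only,
        // so they are no longer part of the public surface.
        [BestFriend]
        internal bool IsVector => this is VectorType;

        [BestFriend]
        internal int VectorSize => (this as VectorType)?.Size ?? 0;
    }

    public sealed class VectorType : ColumnType
    {
        // The same information, exposed publicly on the type it belongs to.
        public int Size { get; } // assumption for this sketch: 0 means unknown size
        public bool IsKnownSize => Size > 0;

        public VectorType(int size)
        {
            Size = size;
        }
    }

    public sealed class TextType : ColumnType { }
}
```

With this arrangement, external callers would write type is VectorType v && v.IsKnownSize, while internal code could keep using type.IsVector and type.VectorSize until we decide what to do with them.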

We could then, if we like, either (1) make the conveniences public or (2) remove them altogether, at our own pace, without jeopardizing the public surface of the API at all.

/cc @Zruty0 @shauheen @terrajobst
