Description
The `IDataView` type system is extensible (as we see with `ImageType`).
This is fine, but there is something confusing about `ColumnType` as well: the base class `ColumnType` has lots of methods and properties that are specific to derived types. For example, `.IsVector`, `.KeyCount`, and other such things are only really relevant if the type is a vector, a key, or whatever.
Why clean up?
There are lots of things on `ColumnType` that are unappealing. There are things like `AsVector` and `IsVector` which, as the documentation states, are equivalent to `as VectorType` and `is VectorType`. I mean, why? You save a few characters here and there, but at the cost of complicating one of the most central classes in the API.
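To make the equivalence concrete, here is a minimal sketch. The class bodies below are simplified, hypothetical stand-ins and not the actual ML.NET definitions; only the names `ColumnType`, `VectorType`, `IsVector`, and `AsVector` come from the discussion above, and `TextType`, `Size`, and `ConvenienceDemo` are invented for illustration.

```csharp
// Hypothetical, simplified stand-ins; not the real ML.NET classes.
public abstract class ColumnType
{
    // Convenience members on the base class, as described above.
    public bool IsVector => this is VectorType;
    public VectorType AsVector => this as VectorType;
}

public sealed class VectorType : ColumnType
{
    public int Size { get; }
    public VectorType(int size) { Size = size; }
}

// Hypothetical non-vector type, used only for contrast.
public sealed class TextType : ColumnType { }

public static class ConvenienceDemo
{
    // Returns true when the convenience members agree with the plain C# operators.
    public static bool Agrees(ColumnType type)
    {
        bool sameIs = type.IsVector == (type is VectorType);
        bool sameAs = ReferenceEquals(type.AsVector, type as VectorType);
        return sameIs && sameAs;
    }
}
```

Under these assumptions the convenience members add nothing that the `is` and `as` operators do not already provide, for vector and non-vector types alike.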
Some things are just plain old silly. Why `IsTimeSpan`? How useful is that, really? Some things are like this.
There's also `DataKind`. This is so strange. It has already caused a fair amount of confusion among some people: they see this, and they think, "oh, the types are just from this `enum`." No, they're not.
The reality is, these things are conveniences, but I think they're conveniences that confuse people (multiple smart people have thought their presence meant the type system was not extensible), so maybe we ought not to expose them, at least not in their current form.
Why not clean up?
In a sense, drawing an analogy between the IDV and .NET type systems, there is some precedent for this sort of thing if we consider `System.Type`. It has the property `IsArray` and the method `GetArrayRank`, which is only sensible to use if the `IsArray` property is true. However, in our case `ColumnType` is an abstract class, and while `System.Type` is also abstract, its inheritance structure does not capture specific types of values in the same sense ours does, e.g., there is no specific string type descended from `Type`. If, hypothetically, `.GetType` of an `int[]` returned some type `System.ArrayType` that descended from `System.Type`, then we might equally hypothetically imagine that the method to get the array rank would be on that derived `ArrayType`, rather than on `Type` directly.
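For reference, the `System.Type` precedent looks roughly like this in practice. `IsArray` and `GetArrayRank` are real members of `System.Type`; the `ArrayRankDemo` helper and its method names are invented here purely for illustration.

```csharp
using System;

public static class ArrayRankDemo
{
    // Only asks for the rank after checking IsArray; returns -1 for non-array types.
    public static int RankOrMinusOne(Type type)
    {
        return type.IsArray ? type.GetArrayRank() : -1;
    }

    // Shows why the guard matters: GetArrayRank throws ArgumentException
    // when the type is not an array.
    public static bool ThrowsForNonArray(Type type)
    {
        try
        {
            type.GetArrayRank();
            return false;
        }
        catch (ArgumentException)
        {
            return true;
        }
    }
}
```

The member lives on the root `Type` class even though it is meaningful only for arrays, which is the same shape as `IsVector` and friends living on `ColumnType`.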
There is also a practical consideration: some types are definitely more important and more heavily used than others.
Let's imagine that we kept the `ColumnType` inheritance structure as it is now, but removed any properties relevant only to a specific derived type. What would hypothetically happen? I picked this usage more or less randomly from our codebase.
machinelearning/src/Microsoft.ML.Data/Evaluators/ClusteringEvaluator.cs
Lines 799 to 801 in f920262
Now then, let's imagine that we have none of the "specialty" properties on `ColumnType` used above, but instead have an `IsKnownSize` and `ItemType` on `VectorType` specifically, that is, not on the root class, so you must treat the type as a `VectorType` if you wish to access them. The clearest way I can imagine to deliver logic equivalent to the linked condition is this:
if (!(type is VectorType vecType && vecType.IsKnownSize && vecType.ItemType == NumberType.Float))
That's not so bad really... the condition went from 60 to 92 characters, which while not great, is hardly ridiculous.
It's even conceivable that, had we had pattern matching at the time this code was written, we would have done this. Prior to C# 7.0 (and this code is way prior to C# 7.0), there was no such thing as pattern matching, as we see used here in the `type is VectorType vecType` expression. So the equivalent in the pre-pattern-matching days would have been considerably more obnoxious and verbose.
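As a rough illustration of that difference, the sketch below writes the same check both ways. The class bodies are hypothetical stand-ins rather than the real ML.NET types, and treating a size of 0 as "unknown" is an assumption made for this sketch; `VectorChecks` and its method names are likewise invented.

```csharp
// Hypothetical, simplified stand-ins; not the real ML.NET classes.
public abstract class ColumnType { }

public sealed class NumberType : ColumnType
{
    public static readonly NumberType Float = new NumberType();
    private NumberType() { }
}

public sealed class VectorType : ColumnType
{
    public ColumnType ItemType { get; }
    public int Size { get; }                 // assumption: 0 means "unknown size"
    public bool IsKnownSize => Size > 0;

    public VectorType(ColumnType itemType, int size)
    {
        ItemType = itemType;
        Size = size;
    }
}

public static class VectorChecks
{
    // C# 7.0 and later: the type pattern tests and binds in one expression.
    public static bool IsKnownSizeFloatVector(ColumnType type)
    {
        return type is VectorType vecType
            && vecType.IsKnownSize
            && vecType.ItemType == NumberType.Float;
    }

    // Before C# 7.0: a separate local, an 'as' cast, and an explicit null check.
    public static bool IsKnownSizeFloatVectorOldStyle(ColumnType type)
    {
        VectorType vecType = type as VectorType;
        return vecType != null
            && vecType.IsKnownSize
            && vecType.ItemType == NumberType.Float;
    }
}
```

The older form needs an extra statement, so it cannot be dropped directly into an `if` condition the way the pattern-matching form can.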
Let's talk about `DataKind`. Unquestionably it is confusing, but if you take a look at it, it is also really, really helpful to have an `enum` for the common builtin types.
Proposed balance
I think it's possible to sort of have our cake and eat it too. Now that we have #1520, we can sort of make our public surface as sparse as possible, while allowing the conveniences we currently enjoy for the internal implementation to remain more or less unmolested.
- Let us mark these questionable things as `internal`, but with `BestFriend` attributes on them. The internal code can retain its sparsity.
- Let us add to the public surface the same information, exposed on the specific relevant types themselves. (E.g., the vector size would just be publicly part of `VectorType` itself.)
- In any event, some of the properties have no reason to exist in any world. For example, a property like `IsTimeSpan` exists just because someone misunderstood what was going on, and thought, "gee I'm adding a new class, I see we have tests for things like vector and text, let me just add this." Nope, we just have those special things as conveniences.
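A minimal sketch of what the first two points might look like together follows. Everything here is an assumption for illustration: the `BestFriendAttribute` below is a hypothetical marker standing in for whatever #1520 actually provides, the class bodies are simplified stand-ins, and `SurfaceDemo` exists only to inspect the resulting public surface.

```csharp
using System;

// Hypothetical marker attribute; the real one from #1520 and any
// enforcement around it are not shown here.
[AttributeUsage(AttributeTargets.All)]
internal sealed class BestFriendAttribute : Attribute { }

public abstract class ColumnType
{
    // Convenience kept for internal code, hidden from the public surface.
    [BestFriend]
    internal bool IsVector => this is VectorType;
}

public sealed class VectorType : ColumnType
{
    // The same information, exposed publicly on the specific type.
    public int Size { get; }
    public VectorType(int size) { Size = size; }
}

public static class SurfaceDemo
{
    // True if the named property is part of the type's public surface.
    // Type.GetProperty(string) only searches public members.
    public static bool IsPublicProperty(Type type, string name)
    {
        return type.GetProperty(name) != null;
    }
}
```

With this arrangement, external callers see only `VectorType.Size`, while code inside the assembly can keep using `IsVector` until the conveniences are either made public or removed.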
We could then, if we like, either (1) make the conveniences public or (2) remove them altogether, at our own pace, without jeopardizing the public surface of the API at all.