IDataView Cleanup: ColumnType cleanup #1533

The `IDataView` type system is extensible (as we see with [`ImageType`](https://github.com/dotnet/machinelearning/blob/f9202628fbfac9e599e8c63dc5ed26eae77afbee/src/Microsoft.ML.ImageAnalytics/ImageType.cs#L12)).

This is fine, but there is something confusing about `ColumnType` as well, since there are lots of methods and properties on the base class `ColumnType` that are specific to derived types. For example, `.IsVector`, `.KeyCount`, and other such things are only really relevant if the type *is* a vector, a key, or whatever.

## Why clean up?

There are lots of things on `ColumnType` that are unappealing. There are things like `AsVector` and `IsVector` which, as the documentation states, are equivalent to `as VectorType` and `is VectorType`. I mean, *why*? You save a few characters here and there, but at the cost of complicating one of the most central classes in the API.

Some things are just plain old silly. Why `IsTimeSpan`? How useful is that, really?

There's also `DataKind`. This is so strange. This has already caused a fair amount of confusion among some people: they see this, and they think, "oh the types are just from this `enum`." No, they're not.

The reality is, these things are *conveniences*, but they're conveniences that I think confuse people (multiple smart people have thought their presence meant the type system was *not* extensible), so maybe we ought not to expose them, at least not in their current form.
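To make the confusion concrete, here is a minimal sketch using stand-in types. These are not the real ML.NET classes: `BuiltInKind`, `PrimitiveColumnType`, `GeoPointType`, and `KindLookup` are all hypothetical names invented for illustration. It shows a type that is perfectly valid in an extensible type system yet has no entry in the `enum`.

```csharp
using System;

// Hypothetical stand-in for an enum of built-in kinds (a tiny subset, for illustration only).
public enum BuiltInKind { Float, Text, Bool }

// Hypothetical stand-in for the root of the type hierarchy.
public abstract class ColumnType
{
    public Type RawType { get; }
    protected ColumnType(Type rawType) { RawType = rawType; }
}

// Built-in types carry an enum value as a convenience.
public sealed class PrimitiveColumnType : ColumnType
{
    public BuiltInKind Kind { get; }
    public PrimitiveColumnType(Type rawType, BuiltInKind kind) : base(rawType) { Kind = kind; }
}

// A user-defined type: a legitimate ColumnType with no corresponding enum value.
public sealed class GeoPointType : ColumnType
{
    public GeoPointType() : base(typeof(double[])) { }
}

public static class KindLookup
{
    // Returns null for any type outside the built-in set, which is exactly
    // the case a reader misses if they assume "the types are just from this enum".
    public static BuiltInKind? TryGetKind(ColumnType type)
    {
        if (type is PrimitiveColumnType prim)
            return prim.Kind;
        return null;
    }
}
```

The point of the sketch is only that the `enum` describes a subset of the types; anything that treats it as exhaustive will mishandle `GeoPointType`.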

## Why not clean up?

In a sense, drawing an analogy between the IDV and .NET type systems, there is some precedent for this sort of thing in `System.Type`. It has the property `IsArray` and the method `GetArrayRank`, which is only sensible to use if `IsArray` is true. However, in *our* case `ColumnType` is an abstract class, and while `System.Type` is also abstract, its inheritance structure does not capture specific types of values in the same sense ours does; e.g., there is no specific string type descended from `Type`. If, hypothetically, `.GetType` of an `int[]` returned some type `System.ArrayType` that descended from `System.Type`, then we might equally hypothetically imagine that the method to get the array rank would be on that derived `ArrayType`, rather than on `Type` directly.

There is also a practical consideration: some types are definitely more important and more heavily used than others.

Let's imagine that we kept the `ColumnType` inheritance structure as it is now, but removed any properties relevant only to a specific derived type. What would hypothetically happen? I picked this usage more or less randomly from our codebase.

https://github.com/dotnet/machinelearning/blob/f9202628fbfac9e599e8c63dc5ed26eae77afbee/src/Microsoft.ML.Data/Evaluators/ClusteringEvaluator.cs#L799-L801

Now then, let's imagine that we have *none* of the "specialty" properties on `ColumnType` used above, but instead have an `IsKnownSize` and an `ItemType` on `VectorType` *specifically*, that is, not on the root class, and you must rephrase this thing as a `VectorType` if you wish to access them. The clearest way I can imagine to deliver equivalent logic to the linked condition is this:

```csharp
if (!(type is VectorType vecType && vecType.IsKnownSize && vecType.ItemType == NumberType.Float))
```
That's not *so* bad really... the condition went from 60 to 92 characters, which, while not great, is hardly ridiculous.

It's even conceivable that, had we had pattern matching at the time this code was written, we would have done this. Prior to C# 7.0 (and this code is *way* prior to C# 7.0), there was no such thing as the pattern matching we see used here in the `type is VectorType vecType` expression. So the equivalent in the pre-pattern-matching days would have been considerably more obnoxious and verbose.
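For comparison, here is roughly what the two styles look like side by side. This is a sketch under assumptions, not the actual ML.NET code: the `ColumnType`, `NumberType`, and `VectorType` classes below are minimal stand-ins (including the convention that a size of 0 means "unknown"), and `ScoreChecks` is a hypothetical helper.

```csharp
using System;

// Minimal stand-ins for the types discussed above; NOT the real ML.NET classes.
public abstract class ColumnType { }

public sealed class NumberType : ColumnType
{
    public static readonly NumberType Float = new NumberType();
    private NumberType() { }
}

public sealed class VectorType : ColumnType
{
    public ColumnType ItemType { get; }
    public int Size { get; }                 // assumption for this sketch: 0 means unknown size
    public bool IsKnownSize => Size > 0;

    public VectorType(ColumnType itemType, int size = 0)
    {
        ItemType = itemType;
        Size = size;
    }
}

// Hypothetical helper holding both phrasings of the same check.
public static class ScoreChecks
{
    // Pre-C# 7.0: an explicit `as` cast into a local, then a null check.
    public static bool IsKnownSizeFloatVectorOld(ColumnType type)
    {
        VectorType vecType = type as VectorType;
        return vecType != null && vecType.IsKnownSize && vecType.ItemType == NumberType.Float;
    }

    // C# 7.0 and later: the type test and the variable declaration in one expression.
    public static bool IsKnownSizeFloatVectorNew(ColumnType type)
    {
        return type is VectorType vecType && vecType.IsKnownSize && vecType.ItemType == NumberType.Float;
    }
}
```

The older form cannot be written as a single condition inside an `if` without declaring the local first, which is where the extra verbosity comes from.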

Let's talk about `DataKind`. Unquestionably it is confusing, but if you take a look at it, it is also really, really helpful to have, for the common builtin types, an `enum`.

## Proposed balance

I think it's possible to have our cake and eat it too. Now that we have #1520, we can make our public surface as sparse as possible, while allowing the conveniences we currently enjoy for the internal implementation to remain more or less unmolested.

* Let us mark these questionable things as `internal`, but with `BestFriend` attributes on them. The internal code can retain its brevity.
* Let us add to the public surface the same information that we get from these conveniences, but on the specific relevant types themselves. (E.g., the vector size would just be *publicly* part of `VectorType` itself.)
* In any event, some of the properties have no reason to exist in *any* world. For example, a property like `IsTimeSpan` exists just because someone misunderstood what was going on, and thought, "gee I'm adding a new class, I see we have tests for things like vector and text, let me just add this." Nope, we just have those special things as conveniences.

We could then, if we *like*, either (1) make the conveniences public or (2) remove them altogether, at our own pace, without jeopardizing the public surface of the API at all.
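As a rough sketch of what that split might look like (assumptions throughout: the attribute below is a local stand-in for the `BestFriend` attribute mentioned above, declared only so the example compiles on its own, and the member shapes are illustrative rather than the actual ML.NET definitions):

```csharp
using System;

// Local stand-in for the `BestFriend` attribute; the real one is defined in ML.NET.
[AttributeUsage(AttributeTargets.All)]
internal sealed class BestFriendAttribute : Attribute { }

public abstract class ColumnType
{
    // Conveniences remain on the root class for internal callers,
    // but are no longer part of the public surface.
    [BestFriend]
    internal bool IsKnownSizeVector => this is VectorType vec && vec.IsKnownSize;

    [BestFriend]
    internal int VectorSize => this is VectorType vec ? vec.Size : 0;
}

public sealed class VectorType : ColumnType
{
    // The same information is public here, on the type it actually describes.
    public ColumnType ItemType { get; }
    public int Size { get; }                 // assumption for this sketch: 0 means unknown size
    public bool IsKnownSize => Size > 0;

    public VectorType(ColumnType itemType, int size = 0)
    {
        ItemType = itemType;
        Size = size;
    }
}

// Illustrative non-vector type, to show the conveniences degrade gracefully.
public sealed class TextType : ColumnType
{
    public static readonly TextType Instance = new TextType();
    private TextType() { }
}
```

External users would only ever see `VectorType.Size` and friends, while internal code could keep writing `type.IsKnownSizeVector` until we decide what to do with it.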

/cc @Zruty0 @shauheen @terrajobst 


The lines referenced by the `ClusteringEvaluator.cs` link above:

```csharp
var type = schema.GetColumnType(ScoreIndex);
if (!type.IsKnownSizeVector || type.ItemType != NumberType.Float)
    throw Host.Except("Score column '{0}' has type {1}, but must be a float vector of known-size", ScoreCol, type);
```
