-
Notifications
You must be signed in to change notification settings - Fork 1.9k
IDataView Cleanup: ColumnType cleanup #1533
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Doing this now. For the most part, this is relatively straightforward. Lots of things being marked as internal with best friend attribute set. The only thing I actually had to add was a size property on
I'd also like to do something to |
- IsKey - KeyCount - KeyCountCore Part of the work necessary for dotnet#1860 and contributes to dotnet#1533.
- IsKey - KeyCount - KeyCountCore Part of the work necessary for dotnet#1860 and contributes to dotnet#1533.
Remove the following members from ColumnType: - IsVector - ItemType - IsKnownSizeVector - VectorSize - ValueCount Part of the work necessary for dotnet#1860 and contributes to dotnet#1533.
Remove the following members from ColumnType: - IsVector - ItemType - IsKnownSizeVector - VectorSize - ValueCount Part of the work necessary for dotnet#1860 and contributes to dotnet#1533.
Remove the following members from ColumnType: - IsVector - ItemType - IsKnownSizeVector - VectorSize - ValueCount Part of the work necessary for dotnet#1860 and contributes to dotnet#1533.
Remove all usages of RawKind that are outside of ML.Core and ML.Data assemblies. The next round will completely remove ColumnType.RawKind. Part of the work necessary for dotnet#1860 and contributes to dotnet#1533.
Remove all usages of RawKind that are outside of ML.Core and ML.Data assemblies. The next round will completely remove ColumnType.RawKind. Part of the work necessary for dotnet#1860 and contributes to dotnet#1533.
Remove the following members from ColumnType: - IsVector - ItemType - IsKnownSizeVector - VectorSize - ValueCount Part of the work necessary for dotnet#1860 and contributes to dotnet#1533.
Remove the following members from ColumnType: - IsVector - ItemType - IsKnownSizeVector - VectorSize - ValueCount Part of the work necessary for dotnet#1860 and contributes to dotnet#1533.
* Remove "VectorType" specific members on ColumnType. Remove the following members from ColumnType: - IsVector - ItemType - IsKnownSizeVector - VectorSize - ValueCount Part of the work necessary for #1860 and contributes to #1533. * Address review comments. - Make extension methods verbs. - Add doc to GetItemType extension. - Fix one place using Size > 0 => IsKnownSize.
Remove all usages of RawKind that are outside of ML.Core and ML.Data assemblies. The next round will completely remove ColumnType.RawKind. Part of the work necessary for dotnet#1860 and contributes to dotnet#1533.
Removes the "easy" usages of ColumnType.RawKind. Part of the work necessary for dotnet#1860 and contributes to dotnet#1533.
* Remove more usages of ColumnType.RawKind. * Remove Conversions usage of DataKind and instead use System.Type. * Remove the rest of RawKind usages in Conversions. * Remove ColumnType.RawKind. Fix #1533 * PR feedback
The
IDataView
type system is extensible (as we see withImageType
).This is fine, but there is something confusing about
ColumnType
as well, since there are lots of methods and properties on the base classColumnType
that are specific for derived types. For example,.IsVector
,.KeyCount
, and other such things, that are only really relevant if the type is either vector, key, or whatever.Why clean up?
There are lots of things on
ColumnType
that are unappealing. There are things likeAsVector
andIsVector
which, as the documentation states, are equivalent toas VectorType
andis VectorType
. I mean, why? You save a few characters here and there, but at the cost of complicating one of the most central classes in the API.Some things are just plain old silly. Why
IsTimeSpan
? How useful is that, really? Some things are like this.There's also
DataKind
. This is so strange. This has already caused a fair amount of confusion among some people: they see this, and they think, "oh the types are just from thisenum
." No, they're not.The reality is, these things are conveniences, but they're conveniences I think that confuse people (multiple smart people have thought their presence meant the type system was not extensible), so maybe we ought not to expose them, at least, not in their current form.
Why not clean up?
In a sense, forming an analogue between the IDV and .NET type systems, there is some precedent for this sort of thing: if we consider
System.Type
. This has the propertyIsArray
, with the methodsGetArrayRank
, which is only sensible to to use if the theIsArray
property was true. However in our case,ColumnType
is an abstract class, and whileSystem.Type
is abstract, its inheritance structure does not capture specific types of values in the same sense we do, e.g., there is no specific string type descended fromType
. If, hypothetically,.GetType
of anint[]
returned some typeSystem.ArrayType
that descended fromSystem.Type
, then we might equally hypothetically imagine that the method to get the array rank would be on that derivedArrayType
, rather than onType
directly.There is also a practical consideration. The reality though is that some types are definitely more important and more heavily used than others.
Let's imagine that we kept the
ColumnType
inheritance structure as it is now, but removed any properties relevant only to any derived type, specifically. What would hypothetically happen? I picked this usage more or less randomly from our codebase.machinelearning/src/Microsoft.ML.Data/Evaluators/ClusteringEvaluator.cs
Lines 799 to 801 in f920262
Now then, let's imagine that we have none of the "specialty" properties on
ColumnType
used above, but instead have aIsKnownSize
andItemType
onVectorType
specifically, that is, not on the root class, and you must rephrase this thing as aVectorType
if you wished to access these. The most clear way I can imagine to deliver equivalent logic to the linked condition is this:That's not so bad really... the condition went from 60 to 92 characters, which while not great, is hardly ridiculous.
It's even conceivable that had we had pattern matching at the time this code is written, we would have done this. Prior to C# 7.0 (and this code is way prior to C# 7.0), there was no such things as this pattern matching, as we see used here in the
type is VectorType vecType
expression. So the equivalent in the pre-pattern match days would have been considerably more obnoxious and verbose.Let's talk about
DataKind
. Unquestionably it is confusing, but if you take a look at it, it is also really, really helpful to have, for the common builtin types, anenum
.Proposed balance
I think it's possible to sort of have our cake and eat it too. Now that we have #1520, we can sort of make our public surface as sparse as possible, while allowing the conveniences we currently enjoy for the internal implementation to remain more or less unmolested.
internal
, but withBestFriend
attributes on them. The internal code can retain is sparsity,VectorSize
itself.)IsTimeSpan
exists just because someone misunderstood what was going on, and thought, "gee I'm adding a new class, I see we have tests for things like vector and text, let me just add this." Nope, we just have those special things as conveniences.We could then, if we like either (1) make the conveniences public or (2) remove them altogether, at our own pace, without jeopardizing the public surface of the API at all.
/cc @Zruty0 @shauheen @terrajobst
The text was updated successfully, but these errors were encountered: