-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Internalize DataKind #2661
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Internalize DataKind #2661
Conversation
@@ -34,8 +34,8 @@ public static void Example() | |||
HasHeader = true, | |||
Columns = new[] | |||
{ | |||
new TextLoader.Column("Sentiment", DataKind.BL, 0), | |||
new TextLoader.Column("SentimentText", DataKind.Text, 1) | |||
new TextLoader.Column("Sentiment", typeof(bool), 0), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In issue #2006, I had a comment. You will note that you are picking the option that I argued was the worst of all possible worlds. It offers absolutely no hints to the user of what might be an acceptable type. How does a user know that a bool
is acceptable, but, say, a List<bool>
is not>? This is much, much worse than what it is supposedly meant to replace. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. I will make ColumnInfo
strongly-typed based on the desired output type. Users will have to create, for example, ToFloatColumnInfo
, if they want to conver things to float
.
In reply to: 258657461 [](ancestors = 258657461)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is one possibility. We should also remain open to the possibility that DataKind
will persist, just in a changed form. So while doing this work, think about whether it is making the usage more or less clear.
It is a perfectly acceptable outcome to say, "you know what, this enumeration is still useful, just in a vastly abbreviated scope," if we can make that argument. It is unclear to me which is correct. I would not start with the assumption that datakind needs to be internalized and destroyed. (W.r.t. its prior role in ColumnType
, absolutely!!) #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can replace Type
with PrimitiveDataViewType
?
In reply to: 258755422 [](ancestors = 258755422)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussed offline. We decided internalizing DataKind and creating an simplified alternative.
In reply to: 259025888 [](ancestors = 259025888,258755422)
929df77
to
0fc552c
Compare
Codecov Report
@@ Coverage Diff @@
## master #2661 +/- ##
==========================================
- Coverage 71.7% 71.68% -0.02%
==========================================
Files 808 808
Lines 142221 142214 -7
Branches 16131 16131
==========================================
- Hits 101981 101953 -28
- Misses 35799 35824 +25
+ Partials 4441 4437 -4
|
Before reviewing this code two questions:
|
@Ivanidzo4ka, your suggestion was one of the 3 solutions I have tried. Unfortunately, having specialized functions cannot tackle the case where multiple columns are casted to different types in one transformer. If you want to specialize |
I had a similar concern initially. I tried to think of a better name, and the only name I thought was better was |
Having a new name sounds reasonable to me. #Resolved |
/// </summary> | ||
/// <remarks> | ||
/// <list type="bullet"> | ||
/// <item><description><see cref="SByte"/>: 1-byte integer, type of <see cref="System.SByte"/>.</description></item> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't these comments typically go on each individual value in the enum?
public enum ActionType
{
/// <summary>The New Subscription is called</summary>
Insert,
/// <summary>The Cancel Subscription is called</summary>
Update,
/// <summary>The Cancel Subscription is called</summary>
Cancel,
/// <summary>The Get Subscription is called</summary>
Retrieve
}
#Resolved
@@ -8,40 +8,83 @@ | |||
namespace Microsoft.ML.Data | |||
{ | |||
/// <summary> | |||
/// Data type specifier. | |||
/// Data type specifier used in text loader and type converters. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't really like giving where a type is used as part of its definition/summary of what it is. What happens if a new place starts using this enum? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then we need to update this doc string. We have this new enum only because of the two classes mentioned. If we don't usually have design doc for each component, my feeling is that we should somehow describe the motivation of having an object here.
In reply to: 259519002 [](ancestors = 259519002)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have this new enum only because of the two classes mentioned.
Right now, sure. But nothing is stopping some new API from using this enum. Then this doc would be out of date. Saying where something is used now doesn't explain what this thing is.
How about simply:
Specifies a simple data type.
other enum summaries are pretty simple:
https://docs.microsoft.com/en-us/dotnet/api/system?view=netframework-4.7.2#enums
TypeCode | Specifies the type of an object.
DayOfWeek | Specifies the day of the week.
If you want to add where it is used, that would be better in a <remarks>
section. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// <see cref="DataKind"/> of the items in the column. | ||
/// </summary> | ||
/// It's a public interface to access the information in an internal DataKind. | ||
public DataKind ItemType => Type.HasValue ? Type.Value.ToScalarType() : DataKind.Single; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ToScalarType()
? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// <param name="source">Source index range(s) of the column.</param> | ||
/// <param name="keyCount">For a key column, this defines the range of values.</param> | ||
public Column(string name, DataKind? type, Range[] source, KeyCount keyCount = null) | ||
private Column(string name, InternalDataKind? kind, Range[] source, KeyCount keyCount = null) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Who calls this constructor? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
{ | ||
} | ||
|
||
public Column(string name, DataKind dataKind, Range[] source, KeyCount keyCount = null) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
public API should have XML doc. #Resolved
/// <see cref="DataKind"/> of the items in the column. | ||
/// </summary> | ||
/// It's a public interface to access the information in an internal DataKind. | ||
public DataKind ItemType => Type.HasValue ? Type.Value.ToScalarType() : DataKind.Single; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why ItemType
? Typically we use that name for the type of items inside of something - like a Vector. Why not just Type
as the public name? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can't. Internal name, nameof(Type), is used in naming command line arguments.
In reply to: 259520596 [](ancestors = 259520596)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand this reasoning. Can you explain further?
In general, we shouldn't be sacrificing our public API for "command line" reasonings. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We use nameof(Type) as an argument name in entry point (classical APIs) and command line. If we rename Type
to, for example, InternalType
, old applications may be broken.
In reply to: 259523340 [](ancestors = 259523340)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still not convinced.... We are breaking all sorts of old applications with the API changes we've been making. We should define our public API as we want, and the other things will need to change.
@TomFinley - thoughts here? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TextLoader is a widely used class. Braking BC may cost a lot. Also, Type
is not very accurate. When we load a float vector column, its type will still be Float
instead of Float[]
.
In reply to: 259541585 [](ancestors = 259541585)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, we have this design guideline:
https://docs.microsoft.com/en-us/dotnet/standard/design-guidelines/constructor
✓ DO use the same name for constructor parameters and a property if the constructor parameters are used to simply set the property.
The only difference between such parameters and the properties should be casing.
So this name should be in sync with the constructor parameter name. Maybe that would solve all your problems?
public DataKind DataKind
to match the constructor parameter name: public Column(string name, DataKind dataKind, ...
#Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TextLoader is a widely used class.
Agreed. That is why it is critical we get the API right. We can never change it after this. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -539,7 +540,7 @@ public sealed class ColumnInfo | |||
/// <summary> | |||
/// The expected kind of the converted column. | |||
/// </summary> | |||
public readonly DataKind OutputKind; | |||
internal readonly DataKind OutputKind; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this internal? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -243,7 +243,7 @@ public void TypeConvertKeyBackCompatTest() | |||
{ | |||
// Model generated using the following command before the change removing Min and Count from KeyType. | |||
// ML.Transforms.Conversion.ConvertType(new[] { new TypeConvertingEstimator.ColumnInfo("key", "convertedKey", | |||
// DataKind.U8, new KeyCount(4)) }).Fit(dataView); | |||
// ScalarType.Ulong, new KeyCount(4)) }).Fit(dataView); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ScalarType
? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
f4cedc6
to
3803bec
Compare
public enum DataKind : byte | ||
{ | ||
/// <summary><see cref="SByte"/>: 1-byte integer, type of <see cref="System.SByte"/>.</summary> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need <see cref
in the summary for the thing you are providing a summary for:
/// <summary>1-byte integer, type of <see cref="System.SByte"/>.</summary>
``` #Resolved
@@ -55,50 +95,67 @@ public enum DataKind : byte | |||
[BestFriend] | |||
internal static class DataKindExtensions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(nit) - maybe this could be InternalDataKindExtensions
? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// <see cref="Data.DataKind"/> of the items in the column. | ||
/// </summary> | ||
/// It's a public interface to access the information in an internal DataKind. | ||
public DataKind DataKind => Type.ToDataKind(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be settable. All the other fields are settable on this class:
Name
, KeyCount
, Source
, but this one isn't. So if I use the default constructor - there is no way to set this property. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. Will make it
public DataKind DataKind
{
get { return Type.ToDataKind(); }
set { Type = value.ToInternalDataKind(); }
}
In reply to: 259954397 [](ancestors = 259954397)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking really good @wschin. I just have the one comment on the property setter. Once that is resolved, I can sign off.
11b567d
to
02e4374
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @wschin !
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix #2583. In general, we don't want to see
DataKind
in any public area. Depending on the actual amount of works required, this PR might first address cases mentioned #2583 and then more PRs will be published later.Note that to specify target types in type converters and text loader, we introduce a cleaner version of existing
DataKind
,and the existing
DataKind
is renamed asInternalDataKind
.Other user-facing changes were made to work with
DataKind
. Command line and entry point are not affected at all.