Skip to content

Conversation

wschin
Copy link
Member

@wschin wschin commented Feb 20, 2019

Fix #2583. In general, we don't want to see DataKind in any public area. Depending on the actual amount of works required, this PR might first address cases mentioned #2583 and then more PRs will be published later.

  • TextLoader
  • TypeConverting
  • Rest (DataKind is internal-best-friend now).

Note that to specify target types in type converters and text loader, we introduce a cleaner version of existing DataKind,

    public enum DataKind : byte 
    { 
        SByte = 1, 
        Byte = 2, 
        Int16 = 3, 
        UInt16 = 4, 
        Int32 = 5, 
        UInt32 = 6, 
        Int64 = 7, 
        UInt64 = 8, 
        Single = 9, 
        Double = 10, 
        String = 11, 
        Boolean = 12, 
        TimeSpan = 13, 
        DateTime = 14, 
        DateTimeOffset = 15, 
    }

and the existing DataKind is renamed as InternalDataKind.
Other user-facing changes were made to work with DataKind. Command line and entry point are not affected at all.

@wschin wschin self-assigned this Feb 20, 2019
@@ -34,8 +34,8 @@ public static void Example()
HasHeader = true,
Columns = new[]
{
new TextLoader.Column("Sentiment", DataKind.BL, 0),
new TextLoader.Column("SentimentText", DataKind.Text, 1)
new TextLoader.Column("Sentiment", typeof(bool), 0),
Copy link
Contributor

@TomFinley TomFinley Feb 20, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In issue #2006, I had a comment. You will note that you are picking the option that I argued was the worst of all possible worlds. It offers absolutely no hints to the user of what might be an acceptable type. How does a user know that a bool is acceptable, but, say, a List<bool> is not>? This is much, much worse than what it is supposedly meant to replace. #Resolved

Copy link
Member Author

@wschin wschin Feb 20, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. I will make ColumnInfo strongly-typed based on the desired output type. Users will have to create, for example, ToFloatColumnInfo, if they want to conver things to float.


In reply to: 258657461 [](ancestors = 258657461)

Copy link
Contributor

@TomFinley TomFinley Feb 21, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is one possibility. We should also remain open to the possibility that DataKind will persist, just in a changed form. So while doing this work, think about whether it is making the usage more or less clear.

It is a perfectly acceptable outcome to say, "you know what, this enumeration is still useful, just in a vastly abbreviated scope," if we can make that argument. It is unclear to me which is correct. I would not start with the assumption that datakind needs to be internalized and destroyed. (W.r.t. its prior role in ColumnType, absolutely!!) #Resolved

Copy link
Member Author

@wschin wschin Feb 21, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can replace Type with PrimitiveDataViewType?


In reply to: 258755422 [](ancestors = 258755422)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed offline. We decided internalizing DataKind and creating an simplified alternative.


In reply to: 259025888 [](ancestors = 259025888,258755422)

@wschin wschin force-pushed the internalize-datakind branch 2 times, most recently from 929df77 to 0fc552c Compare February 21, 2019 21:46
@wschin wschin changed the title [WIP] Internalize DataKind Internalize DataKind Feb 21, 2019
@codecov
Copy link

codecov bot commented Feb 22, 2019

Codecov Report

Merging #2661 into master will decrease coverage by 0.01%.
The diff coverage is 73.52%.

@@            Coverage Diff             @@
##           master    #2661      +/-   ##
==========================================
- Coverage    71.7%   71.68%   -0.02%     
==========================================
  Files         808      808              
  Lines      142221   142214       -7     
  Branches    16131    16131              
==========================================
- Hits       101981   101953      -28     
- Misses      35799    35824      +25     
+ Partials     4441     4437       -4
Flag Coverage Δ
#Debug 71.68% <73.52%> (-0.02%) ⬇️
#production 67.94% <54.82%> (-0.03%) ⬇️
#test 85.86% <87.69%> (+0.02%) ⬆️
Impacted Files Coverage Δ
src/Microsoft.ML.StaticPipe/StaticSchemaShape.cs 76.5% <ø> (ø) ⬆️
test/Microsoft.ML.Functional.Tests/Common.cs 98.54% <ø> (-0.02%) ⬇️
...ML.Data/Transforms/ConversionsExtensionsCatalog.cs 76.47% <ø> (ø) ⬆️
src/Microsoft.ML.Recommender/RecommenderUtils.cs 57.14% <ø> (ø) ⬆️
...osoft.ML.Core.Tests/UnitTests/CoreBaseTestClass.cs 68.78% <ø> (ø) ⬆️
src/Microsoft.ML.Data/Commands/TypeInfoCommand.cs 0% <0%> (ø) ⬆️
...s/StochasticDualCoordinateAscentClassifierBench.cs 0% <0%> (ø) ⬆️
....ML.Benchmarks/KMeansAndLogisticRegressionBench.cs 0% <0%> (ø) ⬆️
...c/Microsoft.ML.SamplesUtils/SamplesDatasetUtils.cs 26.77% <0%> (ø) ⬆️
...t/Microsoft.ML.Benchmarks/PredictionEngineBench.cs 0% <0%> (ø) ⬆️
... and 56 more

@wschin wschin requested a review from TomFinley February 22, 2019 00:41
@Ivanidzo4ka
Copy link
Contributor

Ivanidzo4ka commented Feb 22, 2019

Before reviewing this code two questions:

  • Are we happy with ScalarType name?
  • Do we ever plan to use this enum outside of TextLoader and TypeConverting? If no, I would suggest to have multiple ConvertToFloat, ConvertToDouble, ConvertToKey, etc to replace TypeConverting and just move that enum to TextLoader #Closed

@wschin
Copy link
Member Author

wschin commented Feb 22, 2019

@Ivanidzo4ka, your suggestion was one of the 3 solutions I have tried. Unfortunately, having specialized functions cannot tackle the case where multiple columns are casted to different types in one transformer. If you want to specialize ColumnInfo to different allowed types, you will end up with TWO groups of those ColumnInfo independently for TextLoader and TypeConverting. Therefore, @TomFinley and I feel it's ok to have a simplified version of DataKind. #Resolved

@eerhardt
Copy link
Member

eerhardt commented Feb 22, 2019

  • Are we happy with ScalarType name?

I had a similar concern initially. I tried to think of a better name, and the only name I thought was better was DataKind. What do you think about renaming the existing DataKind to InternalDataKind (or anything but DataKind), and then rename public enum ScalarType to public enum DataKind? #Resolved

@wschin
Copy link
Member Author

wschin commented Feb 22, 2019

  • Are we happy with ScalarType name?

I had a similar concern initially. I tried to think of a better name, and the only name I thought was better was DataKind. What do you think about renaming the existing DataKind to InternalDataKind (or anything but DataKind), and then rename public enum ScalarType to public enum DataKind?

Having a new name sounds reasonable to me. #Resolved

/// </summary>
/// <remarks>
/// <list type="bullet">
/// <item><description><see cref="SByte"/>: 1-byte integer, type of <see cref="System.SByte"/>.</description></item>
Copy link
Member

@eerhardt eerhardt Feb 22, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't these comments typically go on each individual value in the enum?

public enum ActionType
{
  /// <summary>The New Subscription is called</summary>
  Insert,
  /// <summary>The Cancel Subscription is called</summary>
  Update,
  /// <summary>The Cancel Subscription is called</summary>
  Cancel, 
  /// <summary>The Get Subscription is called</summary>
  Retrieve
}

#Resolved

@@ -8,40 +8,83 @@
namespace Microsoft.ML.Data
{
/// <summary>
/// Data type specifier.
/// Data type specifier used in text loader and type converters.
Copy link
Member

@eerhardt eerhardt Feb 22, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really like giving where a type is used as part of its definition/summary of what it is. What happens if a new place starts using this enum? #Resolved

Copy link
Member Author

@wschin wschin Feb 22, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then we need to update this doc string. We have this new enum only because of the two classes mentioned. If we don't usually have design doc for each component, my feeling is that we should somehow describe the motivation of having an object here.


In reply to: 259519002 [](ancestors = 259519002)

Copy link
Member

@eerhardt eerhardt Feb 22, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have this new enum only because of the two classes mentioned.

Right now, sure. But nothing is stopping some new API from using this enum. Then this doc would be out of date. Saying where something is used now doesn't explain what this thing is.

How about simply:

Specifies a simple data type.

other enum summaries are pretty simple:

https://docs.microsoft.com/en-us/dotnet/api/system?view=netframework-4.7.2#enums

TypeCode | Specifies the type of an object.
DayOfWeek | Specifies the day of the week.

If you want to add where it is used, that would be better in a <remarks> section. #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. I will move it to //.


In reply to: 259536248 [](ancestors = 259536248)

/// <see cref="DataKind"/> of the items in the column.
/// </summary>
/// It's a public interface to access the information in an internal DataKind.
public DataKind ItemType => Type.HasValue ? Type.Value.ToScalarType() : DataKind.Single;
Copy link
Member

@eerhardt eerhardt Feb 22, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ToScalarType()? #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will rename it to ToDataKind.


In reply to: 259519970 [](ancestors = 259519970)

/// <param name="source">Source index range(s) of the column.</param>
/// <param name="keyCount">For a key column, this defines the range of values.</param>
public Column(string name, DataKind? type, Range[] source, KeyCount keyCount = null)
private Column(string name, InternalDataKind? kind, Range[] source, KeyCount keyCount = null)
Copy link
Member

@eerhardt eerhardt Feb 22, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Who calls this constructor? #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only the three public constructors above.


In reply to: 259520279 [](ancestors = 259520279)

{
}

public Column(string name, DataKind dataKind, Range[] source, KeyCount keyCount = null)
Copy link
Member

@eerhardt eerhardt Feb 22, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

public API should have XML doc. #Resolved

/// <see cref="DataKind"/> of the items in the column.
/// </summary>
/// It's a public interface to access the information in an internal DataKind.
public DataKind ItemType => Type.HasValue ? Type.Value.ToScalarType() : DataKind.Single;
Copy link
Member

@eerhardt eerhardt Feb 22, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why ItemType? Typically we use that name for the type of items inside of something - like a Vector. Why not just Type as the public name? #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't. Internal name, nameof(Type), is used in naming command line arguments.


In reply to: 259520596 [](ancestors = 259520596)

Copy link
Member

@eerhardt eerhardt Feb 22, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand this reasoning. Can you explain further?

In general, we shouldn't be sacrificing our public API for "command line" reasonings. #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We use nameof(Type) as an argument name in entry point (classical APIs) and command line. If we rename Type to, for example, InternalType, old applications may be broken.


In reply to: 259523340 [](ancestors = 259523340)

Copy link
Member

@eerhardt eerhardt Feb 22, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still not convinced.... We are breaking all sorts of old applications with the API changes we've been making. We should define our public API as we want, and the other things will need to change.

@TomFinley - thoughts here? #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TextLoader is a widely used class. Braking BC may cost a lot. Also, Type is not very accurate. When we load a float vector column, its type will still be Float instead of Float[].


In reply to: 259541585 [](ancestors = 259541585)

Copy link
Member

@eerhardt eerhardt Feb 22, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, we have this design guideline:

https://docs.microsoft.com/en-us/dotnet/standard/design-guidelines/constructor

✓ DO use the same name for constructor parameters and a property if the constructor parameters are used to simply set the property.

The only difference between such parameters and the properties should be casing.

So this name should be in sync with the constructor parameter name. Maybe that would solve all your problems?

public DataKind DataKind to match the constructor parameter name: public Column(string name, DataKind dataKind, ...
#Resolved

Copy link
Member

@eerhardt eerhardt Feb 22, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TextLoader is a widely used class.

Agreed. That is why it is critical we get the API right. We can never change it after this. #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. Let's use DataKind.


In reply to: 259542584 [](ancestors = 259542584)

@@ -539,7 +540,7 @@ public sealed class ColumnInfo
/// <summary>
/// The expected kind of the converted column.
/// </summary>
public readonly DataKind OutputKind;
internal readonly DataKind OutputKind;
Copy link
Member

@eerhardt eerhardt Feb 22, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this internal? #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It shouldn't be. Thanks.


In reply to: 259521046 [](ancestors = 259521046)

@@ -243,7 +243,7 @@ public void TypeConvertKeyBackCompatTest()
{
// Model generated using the following command before the change removing Min and Count from KeyType.
// ML.Transforms.Conversion.ConvertType(new[] { new TypeConvertingEstimator.ColumnInfo("key", "convertedKey",
// DataKind.U8, new KeyCount(4)) }).Fit(dataView);
// ScalarType.Ulong, new KeyCount(4)) }).Fit(dataView);
Copy link
Member

@eerhardt eerhardt Feb 22, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ScalarType ? #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be DataKind.UInt64. Thanks.


In reply to: 259522599 [](ancestors = 259522599)

@wschin wschin force-pushed the internalize-datakind branch from f4cedc6 to 3803bec Compare February 22, 2019 22:13
public enum DataKind : byte
{
/// <summary><see cref="SByte"/>: 1-byte integer, type of <see cref="System.SByte"/>.</summary>
Copy link
Member

@eerhardt eerhardt Feb 22, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need <see cref in the summary for the thing you are providing a summary for:

/// <summary>1-byte integer, type of <see cref="System.SByte"/>.</summary>
``` #Resolved

@@ -55,50 +95,67 @@ public enum DataKind : byte
[BestFriend]
internal static class DataKindExtensions
Copy link
Member

@eerhardt eerhardt Feb 25, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(nit) - maybe this could be InternalDataKindExtensions? #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure.


In reply to: 259953370 [](ancestors = 259953370)

/// <see cref="Data.DataKind"/> of the items in the column.
/// </summary>
/// It's a public interface to access the information in an internal DataKind.
public DataKind DataKind => Type.ToDataKind();
Copy link
Member

@eerhardt eerhardt Feb 25, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be settable. All the other fields are settable on this class:

Name, KeyCount, Source, but this one isn't. So if I use the default constructor - there is no way to set this property. #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. Will make it

            public DataKind DataKind
            {
                get { return Type.ToDataKind(); }
                set { Type = value.ToInternalDataKind(); }
            }

In reply to: 259954397 [](ancestors = 259954397)

Copy link
Member

@eerhardt eerhardt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking really good @wschin. I just have the one comment on the property setter. Once that is resolved, I can sign off.

@wschin wschin force-pushed the internalize-datakind branch from 11b567d to 02e4374 Compare February 25, 2019 19:20
Copy link
Member

@eerhardt eerhardt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @wschin !

@wschin wschin requested a review from Ivanidzo4ka February 25, 2019 20:59
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@wschin wschin merged commit 2ef0614 into dotnet:master Feb 25, 2019
@wschin wschin deleted the internalize-datakind branch February 25, 2019 21:07
@ghost ghost locked as resolved and limited conversation to collaborators Mar 24, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants