Skip to content

Need to simplify Data Types to simpler/less common .NET types #1386

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
CESARDELATORRE opened this issue Oct 25, 2018 · 17 comments
Closed

Need to simplify Data Types to simpler/less common .NET types #1386

CESARDELATORRE opened this issue Oct 25, 2018 · 17 comments
Labels
bug Something isn't working

Comments

@CESARDELATORRE
Copy link
Contributor

CESARDELATORRE commented Oct 25, 2018

In the dynamic API we have too many and not .NET "standard types" you need to use when creating the file columns schema for loading data. The list of types is not very "dotnetty":

image

This is what you see in intellisense:

image

For instance for text you have Text, TX, and TXT, many types not very clear like R4, etc.

We need to have a simpler and more standard list of data types.

@Zruty0
Copy link
Contributor

Zruty0 commented Oct 25, 2018

Do you have a suggestion in mind @CESARDELATORRE ?

@CESARDELATORRE
Copy link
Contributor Author

Maybe something like the following:

    public enum ValueTypes : byte
    {
	//...
	Float = 9,
        String = 11,
        Bool = 12,
        TimeSpan = 13,
        DateTime = 14,
        TimeZoneInfo = 15,
	//etc.
    }

The closer it can be to data types in C#, the better:
https://docs.microsoft.com/en-us/dotnet/csharp/tour-of-csharp/types-and-variables

https://docs.microsoft.com/en-us/dotnet/api/system.timezoneinfo?view=netframework-4.7.2

@Zruty0
Copy link
Contributor

Zruty0 commented Oct 26, 2018

I think mapping to .NET types is positive. In the past, we chose to have 2-character descriptions of our types, but I was never a big fan. Maybe @TomFinley was? He was the creator of UG after all :)

The one type I am suspicious about is String (since we use ROM<char>). Maybe Text is OK in that particular case?

@GalOshri
Copy link
Contributor

This looks good to me. What purpose do the current granular types play after the work done in #673?

@Zruty0
Copy link
Contributor

Zruty0 commented Oct 26, 2018

We will still have most of the list: I1..I8, U1..U16, R4, R8, TX, TS, DT, BL, it is a matter of renaming and removing duplicates. And I'm not sure if we currently keep DZ.

@CESARDELATORRE
Copy link
Contributor Author

Sound good! Thank you guys! 👍

@jwood803
Copy link
Contributor

jwood803 commented Jan 1, 2019

Is there currently a way to map these to their respective .NET types?

I mainly ask since quite a few examples use the context.Data.TextReader and set the columns with the new TextLoader.Column where the DataKind gets passed in. I also ask in case this is something I can possibly help with. 😄

@CESARDELATORRE
Copy link
Contributor Author

Any refresh on this issue? I strongly think that the data types we currently have (The list of DataKind types shown above looks "surprising") should be simplified and aligned to regular .NET types, if possible.

@eerhardt @danmosemsft @TomFinley @markusweimer , any opinion about it?

@jwood803
Copy link
Contributor

jwood803 commented Jan 10, 2019 via email

@eerhardt
Copy link
Member

I'd be glad to help

Great! We could really use some help here.

The overall goal (as I understand it) is that we want to remove the DataKind enum. Checkout the discussion in #2006 (and related links).

So basically, if you could search for usages of DataKind and remove them and instead use things like ColumnType.RawType or other APIs. I'd start with usages of DataKind RawKind on ColumnType.

Maybe do a file/class at a time and send a PR. Once I'm done doing my current work I was going to start in this area. So it would be good to do small incremental chunks, so we don't overlap too much.

but I'm not sure how to map the current Data Kind types to .NET types

Check out

/// <summary>
/// Maps a DataKind to the associated .Net representation type.
/// </summary>
public static Type ToType(this DataKind kind)
{
switch (kind)
{
case DataKind.I1:
return typeof(sbyte);
case DataKind.U1:
return typeof(byte);
case DataKind.I2:
return typeof(short);
case DataKind.U2:
return typeof(ushort);
case DataKind.I4:
return typeof(int);
case DataKind.U4:
return typeof(uint);
case DataKind.I8:
return typeof(long);
case DataKind.U8:
return typeof(ulong);
case DataKind.R4:
return typeof(Single);
case DataKind.R8:
return typeof(Double);
case DataKind.TX:
return typeof(ReadOnlyMemory<char>);
case DataKind.BL:
return typeof(bool);
case DataKind.TS:
return typeof(TimeSpan);
case DataKind.DT:
return typeof(DateTime);
case DataKind.DZ:
return typeof(DateTimeOffset);
case DataKind.UG:
return typeof(RowId);
}

@shauheen shauheen added the bug Something isn't working label Feb 4, 2019
@singlis
Copy link
Member

singlis commented Feb 7, 2019

I wanted to ping the thread as I see this is in 0.11. After reading #2006, what is this issue tracking for 0.11? Is it to remove DataKind all up or is it to internalize DataKind (which mainly looks to be on TextLoader).

@eerhardt
Copy link
Member

eerhardt commented Feb 7, 2019

I would say we should internalize/remove usages of DataKind in as many places as possible. I see public usages by TextLoader, PartitionedFileLoader, and TypeConvertingTransformer so far... I also see some code that writes the byte value of the DataKind into the model file...

If we decide that DataKind has to be a public enum for the above public usages, then we should also do some renaming, as outlined by this issue.

Note - the names we should use should match the System.XXX type names in .NET. So Single instead of Float. Boolean instead of Bool. etc. This is how to refer to these types in public API. We don't use language keyword names.

https://docs.microsoft.com/en-us/dotnet/standard/design-guidelines/general-naming-conventions#avoiding-language-specific-names

@DevLob-zz
Copy link

one point to mention here when read from IEnumerable

if i have int , double , float , bool data types always map it to Float DataKind.R4

so when try to generate model it throw exception for mismatch as he expected to find R4 and find some else

i worked around and convert every type to be float which in not fine to do that is there ability to Explicitly set this or some attributes that load correct type for numeric variable

best regards

@wschin
Copy link
Member

wschin commented Mar 21, 2019

This is the current DataKind in ML.NET. Would it close this thread?

    public enum DataKind : byte
    {
        /// <summary>1-byte integer, type of <see cref="System.SByte"/>.</summary>
        SByte = 1,
        /// <summary>1-byte unsigned integer, type of <see cref="System.Byte"/>.</summary>
        Byte = 2,
        /// <summary>2-byte integer, type of <see cref="System.Int16"/>.</summary>
        Int16 = 3,
        /// <summary>2-byte usigned integer, type of <see cref="System.UInt16"/>.</summary>
        UInt16 = 4,
        /// <summary>4-byte integer, type of <see cref="System.Int32"/>.</summary>
        Int32 = 5,
        /// <summary>4-byte usigned integer, type of <see cref="System.UInt32"/>.</summary>
        UInt32 = 6,
        /// <summary>8-byte integer, type of <see cref="System.Int64"/>.</summary>
        Int64 = 7,
        /// <summary>8-byte usigned integer, type of <see cref="System.UInt64"/>.</summary>
        UInt64 = 8,
        /// <summary>4-byte floating-point number, type of <see cref="System.Single"/>.</summary>
        Single = 9,
        /// <summary>8-byte floating-point number, type of <see cref="System.Double"/>.</summary>
        Double = 10,
        /// <summary>string, type of <see cref="System.String"/>.</summary>
        String = 11,
        /// <summary>boolean variable type, type of <see cref="System.Boolean"/>.</summary>
        Boolean = 12,
        /// <summary>type of <see cref="System.TimeSpan"/>.</summary>
        TimeSpan = 13,
        /// <summary>type of <see cref="System.DateTime"/>.</summary>
        DateTime = 14,
        /// <summary>type of <see cref="System.DateTimeOffset"/>.</summary>
        DateTimeOffset = 15,
    }

@eerhardt
Copy link
Member

Yes - I believe our change to the public DataKind enum closing this issue. Please re-open if this isn't the case.

@eerhardt
Copy link
Member

Fixed by #2661

@CESARDELATORRE
Copy link
Contributor Author

The additional issue is that error messages are still showing those internal data types.. ;)

@ghost ghost locked as resolved and limited conversation to collaborators Mar 27, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests