Extract IDataView into its own assembly and NuGet package #1860
In order to extract these types, the following work will need to be performed:
/cc @singlis
Ah, that reminds me: there are some more open questions we might ask. Do we consider things like the existing implementations of
One thing I'm somewhat interested in hearing is whether this is useful to other types of applications. We specifically engineered this to be useful for ML applications, and eschewed anything that didn't seem to have a direct application. If this idiom is "uplifted," I wonder whether it will prove useful for other things. The danger I see, which is theoretical at this moment, is that we might succeed in making something useless in our quest to make it universally applicable. 😄 But we'll see.
- IsKey - KeyCount - KeyCountCore Part of the work necessary for dotnet#1860 and contributes to dotnet#1533.
Remove the following members from ColumnType: - IsVector - ItemType - IsKnownSizeVector - VectorSize - ValueCount Part of the work necessary for dotnet#1860 and contributes to dotnet#1533.
Remove all usages of RawKind that are outside of ML.Core and ML.Data assemblies. The next round will completely remove ColumnType.RawKind. Part of the work necessary for dotnet#1860 and contributes to dotnet#1533.
* Remove "VectorType" specific members on ColumnType.
  Remove the following members from ColumnType:
  - IsVector
  - ItemType
  - IsKnownSizeVector
  - VectorSize
  - ValueCount
  Part of the work necessary for #1860 and contributes to #1533.
* Address review comments.
  - Make extension methods verbs.
  - Add doc to GetItemType extension.
  - Fix one place using Size > 0 => IsKnownSize.
My current thinking is that So my current proposal is to leave
Removes the "easy" usages of ColumnType.RawKind. Part of the work necessary for dotnet#1860 and contributes to dotnet#1533.
* Move IDataView files into new Microsoft.Data.DataView folder
* Add Microsoft.Data.DataView assembly and NuGet package. Get it building.
* Make ML.NET build on the new Microsoft.Data.DataView assembly. Fix #1860
* Fix tests for the new Data assembly.
IDataView is a very flexible, efficient way of describing tabular data (columns and rows) in a read-only manner.
https://github.com/dotnet/machinelearning/blob/master/docs/code/IDataViewDesignPrinciples.md
At its heart are 2 key concepts:
It has other capabilities that I won’t enumerate here; the link above describes them in more detail.
IDataView is very useful as an abstraction for tabular data that will allow users to pass data between two independent libraries.
For example, ML.NET is able to both consume and produce IDataView instances. Say there was a .NET library for Apache Arrow. If the Arrow .NET data type implements IDataView, the Apache Arrow data can be passed directly into ML.NET without having to copy it into a format that ML.NET consumes.
Another example is: say we had a visualization/graphing/plotting library in .NET that could consume data using IDataView. Then we could take data that was produced by ML.NET, or Apache Arrow, and feed it directly into the graphing library. There would be no need to copy, or change the shape of the data at all. And there is no need for this graphing library to know anything about ML.NET or Apache Arrow.
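To make the producer/consumer idea in the two examples concrete, here is a minimal sketch. It is written in Python rather than C#, and none of it is the actual IDataView API: `TabularView`, `ColumnarSource`, and `column_mean` are hypothetical names invented for illustration. The only point is that the consumer depends on a small read-only abstraction (a schema plus lazily produced rows) and knows nothing about the library that produced the data.

```python
from typing import Iterator, Protocol, Sequence, Tuple


class TabularView(Protocol):
    """Hypothetical read-only table abstraction (NOT the IDataView API)."""

    @property
    def schema(self) -> Sequence[str]:
        """Column names, in order."""
        ...

    def rows(self) -> Iterator[Tuple[object, ...]]:
        """Lazily yields one tuple of values per row."""
        ...


class ColumnarSource:
    """Producer: stands in for a columnar library; keeps one list per column."""

    def __init__(self, columns: dict):
        self._columns = columns

    @property
    def schema(self):
        return tuple(self._columns)

    def rows(self):
        # zip walks the column lists in step and yields rows on demand,
        # so the table is never copied into a row-oriented structure.
        return zip(*self._columns.values())


def column_mean(view: TabularView, name: str) -> float:
    """Consumer: uses only the shared abstraction, not the producer's type."""
    index = list(view.schema).index(name)
    total, count = 0.0, 0
    for row in view.rows():
        total += row[index]
        count += 1
    return total / count


# Hypothetical data, purely for illustration.
source = ColumnarSource({"x": [1.0, 2.0, 3.0], "y": [10.0, 20.0, 30.0]})
print(column_mean(source, "y"))
```

Any other producer that exposes the same two members could be passed to `column_mean` unchanged, which is the kind of decoupling the Arrow and plotting examples describe.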
In my mind, we can use IDataView in a similar manner to what OData was promised to be: an exchange format that allows producers and consumers of data to communicate in a standardized way. (OData has more capabilities, such as filtering, sorting, and updating data, which I am not proposing we add to IDataView; I am only using it as an analogy.)
/cc @TomFinley @Zruty0 @markusweimer @danmosemsft @stephentoub