Extract IDataView into its own assembly and NuGet package #1860

Closed
eerhardt opened this issue Dec 11, 2018 · 3 comments

@eerhardt
Member

IDataView is a very flexible, efficient way of describing tabular data (columns and rows) in a read-only manner.

https://github.com/dotnet/machinelearning/blob/master/docs/code/IDataViewDesignPrinciples.md

At its heart are two key concepts:

  • Schema (describing the columns)
  • Cursoring (how to read the rows of data)

It has other capabilities that I won’t enumerate here; the above link describes them in more detail.
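To make the two concepts concrete, here is a minimal, language-neutral sketch in Python. It is not the ML.NET API: `Column`, `RowCursor`, `InMemoryView`, `get_row_cursor`, `get_getter`, and `read_column` are hypothetical names chosen for illustration. The assumption is only that a view exposes a schema and hands out forward-only cursors that read the columns the caller asked for.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, List, Sequence, Set


@dataclass(frozen=True)
class Column:
    """One schema entry: a column name and a type label (hypothetical)."""
    name: str
    type_name: str


class RowCursor:
    """Forward-only reader over rows; values are fetched through getters."""

    def __init__(self, rows: Sequence[Sequence[Any]], active: Set[int]):
        self._rows = rows
        self._active = active
        self._position = -1  # positioned before the first row

    def move_next(self) -> bool:
        """Advance one row; return False once the data is exhausted."""
        if self._position + 1 >= len(self._rows):
            return False
        self._position += 1
        return True

    def get_getter(self, column_index: int) -> Callable[[], Any]:
        """Return a function that reads the current row's value for one column."""
        if column_index not in self._active:
            raise ValueError(f"column {column_index} was not requested")

        def getter() -> Any:
            if self._position < 0:
                raise RuntimeError("call move_next() before reading a value")
            return self._rows[self._position][column_index]

        return getter


class InMemoryView:
    """Read-only tabular view: a schema plus a way to open cursors."""

    def __init__(self, schema: Sequence[Column], rows: Sequence[Sequence[Any]]):
        self.schema = tuple(schema)
        self._rows = tuple(tuple(r) for r in rows)

    def get_row_cursor(self, columns_needed: Iterable[str]) -> RowCursor:
        names = [c.name for c in self.schema]
        active = {names.index(n) for n in columns_needed}
        return RowCursor(self._rows, active)


def read_column(view: InMemoryView, name: str) -> List[Any]:
    """Consume one column using only the schema and a cursor."""
    index = [c.name for c in view.schema].index(name)
    cursor = view.get_row_cursor([name])
    getter = cursor.get_getter(index)
    values = []
    while cursor.move_next():
        values.append(getter())
    return values


view = InMemoryView(
    [Column("label", "bool"), Column("score", "float")],
    [(True, 0.9), (False, 0.2)],
)
print(read_column(view, "score"))  # prints [0.9, 0.2]
```

Asking the cursor for specific columns up front is one plausible way to let a producer skip work for columns nobody reads; the sketch only enforces that restriction, it does not optimize anything.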

IDataView is very useful as an abstraction for tabular data that will allow users to pass data between two independent libraries.

For example, ML.NET is able to both consume and produce IDataView instances. Say there was a .NET library for Apache Arrow. If the Arrow .NET data type implements IDataView, the Apache Arrow data can be passed directly into ML.NET without having to copy it into a format that ML.NET consumes.

As another example, say we had a visualization/graphing/plotting library in .NET that could consume data using IDataView. Then we could take data that was produced by ML.NET, or Apache Arrow, and feed it directly into the graphing library. There would be no need to copy the data or change its shape at all. And there is no need for this graphing library to know anything about ML.NET or Apache Arrow.
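The decoupling described in these two examples can be sketched as follows. This is an illustration in Python under stated assumptions, not real Arrow or ML.NET code: `TabularView`, `ColumnarTable`, `ScoredRows`, and `axis_bounds` are hypothetical stand-ins for the shared abstraction, a column-oriented producer, a model-output producer, and a plotting consumer.

```python
from typing import Any, Dict, Iterator, List, Protocol, Sequence, Tuple


class TabularView(Protocol):
    """The only contract shared by producers and consumers (hypothetical)."""

    @property
    def column_names(self) -> Sequence[str]: ...

    def iter_rows(self) -> Iterator[Tuple[Any, ...]]: ...


class ColumnarTable:
    """Producer 1: stand-in for a column-oriented store such as an Arrow table."""

    def __init__(self, columns: Dict[str, List[Any]]):
        self._columns = columns

    @property
    def column_names(self) -> Sequence[str]:
        return list(self._columns)

    def iter_rows(self) -> Iterator[Tuple[Any, ...]]:
        # zip walks the column lists in place; no row-major copy is built.
        return zip(*self._columns.values())


class ScoredRows:
    """Producer 2: stand-in for model output computed lazily, row by row."""

    def __init__(self, inputs: Sequence[float], weight: float):
        self._inputs = inputs
        self._weight = weight

    @property
    def column_names(self) -> Sequence[str]:
        return ["x", "score"]

    def iter_rows(self) -> Iterator[Tuple[Any, ...]]:
        return ((x, x * self._weight) for x in self._inputs)


def axis_bounds(view: TabularView, column: str) -> Tuple[Any, Any]:
    """Consumer: what a plotting library might compute; knows only TabularView."""
    index = list(view.column_names).index(column)
    values = [row[index] for row in view.iter_rows()]
    return min(values), max(values)


print(axis_bounds(ColumnarTable({"x": [1, 2, 3], "y": [10, 5, 7]}), "y"))  # prints (5, 10)
print(axis_bounds(ScoredRows([1.0, 2.0, 4.0], 0.5), "score"))  # prints (0.5, 2.0)
```

Neither producer imports the consumer, and the consumer never names a producer type; both sides depend only on the small shared contract, which is the role the issue proposes for a standalone IDataView package.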

In my mind, we can use IDataView in a similar manner to what OData was promised to be: an exchange format which allows producers and consumers of data to communicate in a standardized way. (Although OData has more capabilities, such as filtering, sorting, updating data, etc., which I am not proposing we add to IDataView. I was just using it as an analogy.)

/cc @TomFinley @Zruty0 @markusweimer @danmosemsft @stephentoub

@eerhardt eerhardt added the API Issues pertaining the friendly API label Dec 11, 2018
@eerhardt
Member Author

eerhardt commented Dec 11, 2018

In order to extract these types, the following work will need to be performed:

/cc @singlis

@TomFinley
Contributor

Ah that reminds me, IRowCursorConsolidator actually ought to go away altogether, pursuant to API review meetings. But I see I forgot to enter an associated issue... OK entered as #1867.

There are some more open questions we might ask. Do we consider things like the existing implementations of ColumnType that we presently enjoy to be inherent in this, or ancillary to it? It is technically ancillary, but in order to be meaningful as any sort of interchange between different frameworks there should be some shared opinion. This would include things like vector valued types, and the like.

One thing I'm somewhat interested in hearing is whether this is useful to other types of applications. We specifically engineered this so that it would be useful for ML applications, and eschewed anything that didn't seem to have a direct application. If this idiom is "uplifted," I wonder if it will prove useful to other things. The danger I see, which is at this moment theoretical, is that we might succeed in making something useless in our quest to make it universally applicable. 😄 But we'll see.

eerhardt added a commit to eerhardt/machinelearning that referenced this issue Jan 7, 2019
- IsKey
- KeyCount
- KeyCountCore

Part of the work necessary for dotnet#1860 and contributes to dotnet#1533.
eerhardt added a commit to eerhardt/machinelearning that referenced this issue Jan 11, 2019
- IsKey
- KeyCount
- KeyCountCore

Part of the work necessary for dotnet#1860 and contributes to dotnet#1533.
eerhardt added a commit to eerhardt/machinelearning that referenced this issue Jan 11, 2019
Remove the following members from ColumnType:

- IsVector
- ItemType
- IsKnownSizeVector
- VectorSize
- ValueCount

Part of the work necessary for dotnet#1860 and contributes to dotnet#1533.
eerhardt added a commit that referenced this issue Jan 11, 2019
* Remove "KeyType" specific members on ColumnType.

- IsKey
- KeyCount
- KeyCountCore

Part of the work necessary for #1860 and contributes to #1533.
eerhardt added a commit to eerhardt/machinelearning that referenced this issue Jan 14, 2019
Remove the following members from ColumnType:

- IsVector
- ItemType
- IsKnownSizeVector
- VectorSize
- ValueCount

Part of the work necessary for dotnet#1860 and contributes to dotnet#1533.
eerhardt added a commit to eerhardt/machinelearning that referenced this issue Jan 15, 2019
Remove all usages of RawKind that are outside of ML.Core and ML.Data assemblies. The next round will completely remove ColumnType.RawKind.

Part of the work necessary for dotnet#1860 and contributes to dotnet#1533.
eerhardt added a commit to eerhardt/machinelearning that referenced this issue Jan 15, 2019
Remove the following members from ColumnType:

- IsVector
- ItemType
- IsKnownSizeVector
- VectorSize
- ValueCount

Part of the work necessary for dotnet#1860 and contributes to dotnet#1533.
eerhardt added a commit that referenced this issue Jan 16, 2019
* Remove "VectorType" specific members on ColumnType.

Remove the following members from ColumnType:

- IsVector
- ItemType
- IsKnownSizeVector
- VectorSize
- ValueCount

Part of the work necessary for #1860 and contributes to #1533.

* Address review comments.

- Make extension methods verbs.
- Add doc to GetItemType extension.
- Fix one place using Size > 0 => IsKnownSize.
eerhardt added a commit to eerhardt/machinelearning that referenced this issue Jan 16, 2019
Remove all usages of RawKind that are outside of ML.Core and ML.Data assemblies. The next round will completely remove ColumnType.RawKind.

Part of the work necessary for dotnet#1860 and contributes to dotnet#1533.
@eerhardt
Member Author

> Do we consider things like the existing implementations of ColumnType that we presently enjoy to be inherent in this, or ancillary to it? It is technically ancillary, but in order to be meaningful as any sort of interchange between different frameworks there should be some shared opinion. This would include things like vector valued types, and the like.

My current thinking is that ColumnType and its "primitive" implementations will be extracted as well. I think KeyType and VectorType might be a little too "ML.NET"-specific. For example, the VBuffer type (which VectorType references) might not be the best "general vector" representation for .NET. Instead, we might imagine different "dense" or "sparse" vector types that were originally envisioned in #608. And then possibly someone could write an ITransformer from those types into VBuffer for ML.NET to consume. (Note, this whole buffer discussion can be designed/implemented at a later date. The first round of work would just be the primitive types outside of ML.NET, and we can build on that.)

So my current proposal is to leave KeyType and VectorType as extension types of the extracted IDataView, just like how the ImageType class isn't part of ML.Core.
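A rough sketch of that layering, in Python rather than C# and with hypothetical names throughout (`NUMBER`, `TEXT`, `get_item_type`, and `is_known_size_vector` are illustrative analogues, not the actual ML.NET members): the core base type carries no vector-specific members, and the vector type and its helpers live in a separate layer built on top. The sketch assumes that a size of 0 means "length unknown" and that a non-vector type acts as its own item type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnType:
    """Core package: the base type has no vector- or key-specific members."""
    name: str


# Core package: primitive types every consumer is expected to understand.
NUMBER = ColumnType("Single")
TEXT = ColumnType("Text")


# --- Extension layer (for example, the ML library), built on the core ---

@dataclass(frozen=True)
class VectorType(ColumnType):
    """Vector of items; defined outside the core, like an add-on type."""
    item_type: ColumnType = NUMBER
    size: int = 0  # assumption: 0 means the length is not known in advance


def get_item_type(column_type: ColumnType) -> ColumnType:
    """Item type of a vector; a non-vector type is treated as its own item type."""
    if isinstance(column_type, VectorType):
        return column_type.item_type
    return column_type


def is_known_size_vector(column_type: ColumnType) -> bool:
    """True only for vectors whose length is fixed (size > 0)."""
    return isinstance(column_type, VectorType) and column_type.size > 0


features = VectorType("Vector", item_type=NUMBER, size=3)
print(get_item_type(features).name)          # prints Single
print(is_known_size_vector(features))        # prints True
print(is_known_size_vector(VectorType("Vector")))  # prints False
```

With this split, a library that depends only on the core never sees `VectorType`, while code that does understand vectors asks through the helper functions instead of through members on the base class.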

eerhardt added a commit that referenced this issue Jan 17, 2019
* Remove ColumnType.RawKind usages Round 1.

Remove all usages of RawKind that are outside of ML.Core and ML.Data assemblies. The next round will completely remove ColumnType.RawKind.

Part of the work necessary for #1860 and contributes to #1533.
eerhardt added a commit to eerhardt/machinelearning that referenced this issue Jan 17, 2019
Removes the "easy" usages of ColumnType.RawKind.

Part of the work necessary for dotnet#1860 and contributes to dotnet#1533.
eerhardt added a commit that referenced this issue Jan 23, 2019
* Remove ColumnType.RawKind usages Round 2.

Removes the "easy" usages of ColumnType.RawKind.

Part of the work necessary for #1860 and contributes to #1533.
eerhardt added a commit to eerhardt/machinelearning that referenced this issue Jan 24, 2019
eerhardt added a commit to eerhardt/machinelearning that referenced this issue Jan 25, 2019
eerhardt added a commit that referenced this issue Jan 25, 2019
* Move IDataView files into new Microsoft.Data.DataView folder

* Add Microsoft.Data.DataView assembly and NuGet package.

Get it building.

* Make ML.NET build on the new Microsoft.Data.DataView assembly.

Fix #1860

* Fix tests for the new Data assembly.
@shauheen shauheen added this to the 0119 milestone Jan 29, 2019
@ghost ghost locked as resolved and limited conversation to collaborators Mar 26, 2022