Description
IDataView is a flexible, efficient way of describing tabular data (columns and rows) in a read-only manner.
https://github.com/dotnet/machinelearning/blob/master/docs/code/IDataViewDesignPrinciples.md
At its heart are two key concepts:
- Schema (describing the columns)
- Cursoring (how to read the rows of data)
It has other capabilities that I won’t enumerate here; the document linked above describes them in more detail.
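To make the two concepts concrete, here is a minimal, language-neutral sketch in Python. It is not the ML.NET API: `Column`, `InMemoryView`, and `cursor` are hypothetical names I made up for illustration. The view exposes a schema describing its columns, and a cursor that walks the rows lazily while only surfacing the columns the caller asked for.

```python
from dataclasses import dataclass
from typing import Iterator, Sequence, Tuple


@dataclass(frozen=True)
class Column:
    """One schema entry: a column name and the type of its values."""
    name: str
    dtype: type


class InMemoryView:
    """Hypothetical read-only table exposing a schema and a row cursor."""

    def __init__(self, schema: Sequence[Column], rows: Sequence[tuple]):
        # Copy into tuples so the view cannot be mutated after construction.
        self._schema = tuple(schema)
        self._rows = tuple(tuple(r) for r in rows)

    @property
    def schema(self) -> Tuple[Column, ...]:
        return self._schema

    def cursor(self, columns: Sequence[str]) -> Iterator[tuple]:
        """Yield rows one at a time, restricted to the requested columns."""
        names = [c.name for c in self._schema]
        indices = [names.index(c) for c in columns]
        for row in self._rows:
            yield tuple(row[i] for i in indices)


view = InMemoryView(
    schema=[Column("label", bool), Column("score", float)],
    rows=[(True, 0.9), (False, 0.2)],
)

# Read only the "score" column; "label" is never handed to the caller.
scores = [row[0] for row in view.cursor(["score"])]
```

The point of the sketch is only the shape of the abstraction: a consumer inspects the schema first, then pulls rows through a cursor, and never gets a handle it could use to modify the data.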
IDataView is very useful as an abstraction for tabular data because it lets users pass data between two independent libraries.
For example, ML.NET is able to both consume and produce IDataView instances. Say there were a .NET library for Apache Arrow. If its data type implemented IDataView, the Apache Arrow data could be passed directly into ML.NET without first being copied into a format that ML.NET consumes.
As another example, say we had a visualization/graphing/plotting library in .NET that could consume data through IDataView. Then we could take data produced by ML.NET or Apache Arrow and feed it directly into the graphing library. There would be no need to copy the data or change its shape, and no need for the graphing library to know anything about ML.NET or Apache Arrow.
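The decoupling described in these two examples can be sketched roughly as follows, again in Python and under loud assumptions: `TabularView`, `RowStore`, `ColumnStore`, and `column_mean` are all hypothetical stand-ins, not types from ML.NET, Apache Arrow, or any real plotting library. Two producers store their data differently, and a single consumer works with both because it depends only on the shared abstraction.

```python
from typing import Iterator, Protocol, Sequence


class TabularView(Protocol):
    """Hypothetical shared abstraction: column names plus a row cursor."""

    def column_names(self) -> Sequence[str]: ...
    def cursor(self) -> Iterator[tuple]: ...


class RowStore:
    """Stand-in for a producer that keeps its data row by row."""

    def __init__(self, names, rows):
        self._names, self._rows = list(names), list(rows)

    def column_names(self):
        return self._names

    def cursor(self):
        return iter(self._rows)


class ColumnStore:
    """Stand-in for a columnar producer (the role Arrow plays in the text)."""

    def __init__(self, columns: dict):
        self._columns = columns

    def column_names(self):
        return list(self._columns)

    def cursor(self):
        # zip walks the columns in lockstep; no row-major copy is built.
        return zip(*self._columns.values())


def column_mean(view: TabularView, name: str) -> float:
    """Stand-in for a consumer such as a plotting library."""
    index = list(view.column_names()).index(name)
    total, count = 0.0, 0
    for row in view.cursor():
        total += row[index]
        count += 1
    return total / count


row_mean = column_mean(RowStore(["x", "y"], [(1.0, 10.0), (3.0, 30.0)]), "y")
col_mean = column_mean(ColumnStore({"x": [1.0, 3.0], "y": [10.0, 30.0]}), "y")
```

Neither producer knows about the consumer, and the consumer never names a concrete producer type, which is the property I am after for IDataView.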
In my mind, we can use IDataView in a manner similar to what OData was promised to be: an exchange format that allows producers and consumers of data to communicate in a standardized way. (OData has more capabilities, such as filtering, sorting, and updating data, which I am not proposing we add to IDataView; I am only using it as an analogy.)
/cc @TomFinley @Zruty0 @markusweimer @danmosemsft @stephentoub