IDataView Cleanup: Predicates from int to Column

As seen in #1500, schema is being changed so that schemas contain columns.

For various reasons, the change proposed here may be easiest to make once `IDataView` is a class.

Let us consider the use of predicates in the `IDataView` system, e.g., when getting a cursor:

https://github.com/dotnet/machinelearning/blob/f9202628fbfac9e599e8c63dc5ed26eae77afbee/src/Microsoft.ML.Core/Data/IDataView.cs#L103

or elsewhere when forming a mapper, and getting dependencies:

https://github.com/dotnet/machinelearning/blob/f9202628fbfac9e599e8c63dc5ed26eae77afbee/src/Microsoft.ML.Core/Data/ISchemaBindableMapper.cs#L77

The use of integer indices here has sometimes led to confusion or even bugs. With the change of #1500 under consideration, this suggests a possibly better way: express these predicates over column objects rather than over integer indices.

It may be worth considering whether the columns in the new scheme suggested in #1500 should have a backref to the original schema (even if only as an internal field that is checked by the data-view abstract class). That would enable an easy way to check whether a column in fact came from that schema. Even without that backref, we could at least check whether the columns exist.
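To make the two kinds of check concrete, here is a minimal sketch. `SketchSchema`, `SketchColumn`, `Owns`, and `HasMatching` are hypothetical names invented for this illustration; they are not the types from #1500 or anything that exists in ML.NET.

```csharp
using System;

// Hypothetical stand-ins for illustration; not the ML.NET schema types.
public sealed class SketchSchema
{
    private readonly SketchColumn[] _columns;

    public SketchSchema(params string[] names)
    {
        _columns = new SketchColumn[names.Length];
        for (int i = 0; i < names.Length; i++)
            _columns[i] = new SketchColumn(this, i, names[i]);
    }

    public int Count => _columns.Length;
    public SketchColumn this[int index] => _columns[index];

    // With a back-reference: the column must have been created by this schema.
    public bool Owns(SketchColumn column)
        => column != null && ReferenceEquals(column.Owner, this);

    // Without a back-reference: only check that a column with the same index
    // and name exists here. Weaker, since a look-alike schema also passes.
    public bool HasMatching(SketchColumn column)
        => column != null
           && column.Index >= 0 && column.Index < _columns.Length
           && _columns[column.Index].Name == column.Name;
}

public sealed class SketchColumn
{
    internal SketchSchema Owner { get; }   // the internal back-reference
    public int Index { get; }
    public string Name { get; }

    internal SketchColumn(SketchSchema owner, int index, string name)
    {
        Owner = owner;
        Index = index;
        Name = name;
    }
}
```

Under these assumptions, a column taken from a second schema with identical names fails `Owns` but passes `HasMatching`, which is the difference between the two options.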

We could also consider expressing this dependency not as a *delegate*, but instead as some sort of collection of columns, since that would also make this easier to explain.
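As a rough illustration of why a collection may be easier to explain than a delegate, consider the sketch below. `Col` and `Dependencies` are hypothetical names used only here. A collection can be listed directly, and turning it into a predicate is trivial, whereas a predicate can only be probed one column at a time.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical column type for illustration only.
public sealed class Col
{
    public int Index { get; }
    public string Name { get; }
    public Col(int index, string name) { Index = index; Name = name; }
}

public static class Dependencies
{
    // Delegate form: opaque. The only way to learn what is needed is to
    // call it for every candidate column.
    public static Func<Col, bool> AsPredicate(IEnumerable<Col> needed)
    {
        var set = new HashSet<Col>(needed);   // reference equality by default
        return set.Contains;
    }

    // Collection form: the needed columns can simply be enumerated.
    public static string Describe(IEnumerable<Col> needed)
        => string.Join(", ", needed.Select(c => c.Name));
}
```

Going from the collection to a delegate loses nothing, so code that still wants a predicate internally could keep one.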

Note that while this makes the *interface* to `IDataView` easier, it makes the *implementation* harder, at least if we suppose that all dataviews are responsible for handling these inputs correctly and for verifying that there aren't any shenanigans going on with input columns being from a different schema (which we can do under this new scheme, and so almost certainly should). This suggests a change to `IDataView`, possibly done once `IDataView` is a class, so that the utility mapping from these column objects back down to indices (which must still happen internally) is handled by common code. If these column objects have some sort of internal backreference to the schema, it would also enable checking that the input schemas are in fact correct. (This we obviously cannot do today with indices!)
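One possible shape for that common code is sketched below, assuming `IDataView` has become an abstract class. All names (`DataViewBase`, `GetRowCursorCore`, `DemoSchema`, `DemoColumn`, `DemoCursor`, `InMemoryView`) are hypothetical, and the cursor is a placeholder that only records which columns are active. The public method validates the columns and maps them to indices once, so each implementation sees only the index-based form.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical names throughout; a sketch, not the ML.NET design.
public sealed class DemoSchema
{
    private readonly DemoColumn[] _columns;
    public DemoSchema(params string[] names)
    {
        _columns = new DemoColumn[names.Length];
        for (int i = 0; i < names.Length; i++)
            _columns[i] = new DemoColumn(this, i, names[i]);
    }
    public int Count => _columns.Length;
    public DemoColumn this[int index] => _columns[index];
}

public sealed class DemoColumn
{
    internal DemoSchema Owner { get; }   // internal back-reference
    public int Index { get; }
    public string Name { get; }
    internal DemoColumn(DemoSchema owner, int index, string name)
    {
        Owner = owner; Index = index; Name = name;
    }
}

// Placeholder cursor: only records which column indices are active.
public sealed class DemoCursor
{
    public bool[] Active { get; }
    public DemoCursor(bool[] active) { Active = active; }
}

public abstract class DataViewBase
{
    public abstract DemoSchema Schema { get; }

    // Common code: check ownership, then map column objects to indices.
    public DemoCursor GetRowCursor(IEnumerable<DemoColumn> columnsNeeded)
    {
        if (columnsNeeded == null)
            throw new ArgumentNullException(nameof(columnsNeeded));
        var active = new bool[Schema.Count];
        foreach (var col in columnsNeeded)
        {
            if (col == null || !ReferenceEquals(col.Owner, Schema))
                throw new ArgumentException(
                    "Column does not belong to this data view's schema.",
                    nameof(columnsNeeded));
            active[col.Index] = true;
        }
        return GetRowCursorCore(active);
    }

    // Implementers keep working with indices, as they do today.
    protected abstract DemoCursor GetRowCursorCore(bool[] active);
}

public sealed class InMemoryView : DataViewBase
{
    public override DemoSchema Schema { get; }
    public InMemoryView(DemoSchema schema) { Schema = schema; }
    protected override DemoCursor GetRowCursorCore(bool[] active)
        => new DemoCursor(active);
}
```

With this split, a column from some other schema is rejected in one place, before any implementation-specific code runs.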

/cc @Zruty0 @shauheen @terrajobst 

IDataView Cleanup: Predicates from int to Column #1529
