Three major concepts: Estimators, Transformers and Data #581
And this is approximately how we can dissect the pipeline (before or after training) and recompose it. Here, I'm going to strip out a loader and make it into a prediction engine:

```csharp
ITransformer<IMultiStreamSource> loader;
IEnumerable<IDataTransformer> steps;
(loader, steps) = model.GetParts();

var engine = new MyPredictionEngine<IrisData, IrisPrediction>(env, loader.GetOutputSchema(), steps);
IrisPrediction prediction = engine.Predict(new IrisData()
{
    SepalLength = 5.1f,
    SepalWidth = 3.3f,
    PetalLength = 1.6f,
    PetalWidth = 0.2f,
});
```

And this is how I can take out a normalizer, because I'm crazy:

```csharp
var bogusEngine = new MyPredictionEngine<IrisData, IrisPrediction>(env, loader.GetOutputSchema(), new[] { steps.First(), steps.Last() });
IrisPrediction bogusPrediction = bogusEngine.Predict(new IrisData()
{
    SepalLength = 5.1f,
    SepalWidth = 3.3f,
    PetalLength = 1.6f,
    PetalWidth = 0.2f,
});
```
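For concreteness, here is a hedged sketch of what a `MyPredictionEngine` along these lines might look like. Everything in it (`MyPredictionEngine`, `IDataTransformer`, the one-row-view helpers) is an assumption inferred from the snippet above, not the actual ML.NET surface:

```csharp
// Hypothetical sketch only: a prediction engine assembled from an input
// schema and a sequence of fitted transformers. IDataTransformer, ISchema,
// IHostEnvironment, and the two helpers are assumed, not real APIs.
public sealed class MyPredictionEngine<TSrc, TDst>
    where TSrc : class
    where TDst : class, new()
{
    private readonly IHostEnvironment _env;
    private readonly ISchema _inputSchema;
    private readonly IEnumerable<IDataTransformer> _steps;

    public MyPredictionEngine(IHostEnvironment env, ISchema inputSchema, IEnumerable<IDataTransformer> steps)
    {
        _env = env;
        _inputSchema = inputSchema;
        _steps = steps;
    }

    public TDst Predict(TSrc example)
    {
        // Wrap the single example as a one-row IDataView matching the input
        // schema, push it through every transformer in order, then read the
        // prediction columns back into TDst.
        IDataView data = CreateFromExamples(_env, new[] { example }, _inputSchema); // hypothetical helper
        foreach (var step in _steps)
            data = step.Transform(data);
        return ReadSingleRow<TDst>(data); // hypothetical helper
    }
}
```

The point of the sketch is only that, once a model decomposes into `(loader, steps)`, a prediction engine needs nothing beyond the loader's output schema and the transformer list.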
Hi @Zruty0, this seems positive. The "estimator" logic is what we consider the "ideal" solution to #267. It could also be a declarative structure. I would like input from @interesaaat and @tcondie, if they can be persuaded to provide notes. Also, separating out the conflation between model and data would avoid #580. Strong typing seems like a problem in the current proposal. To take an example: a linear trainer produces a linear predictor (
Regarding MakeTextLoaderArgs: inside the constructor, you can do such a thing, which ensures that you'll always have a correctly initialized object to deal with for further processing:
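As a hedged sketch of that idea (the type and member names here are stand-ins, not the real `TextLoader`), the constructor can own the `Args` instance and only let callers mutate it through a delegate:

```csharp
// Hypothetical sketch: the component owns its Args and applies a
// caller-supplied delegate inside the constructor, so no half-initialized
// Args object is ever observable from outside.
public sealed class TextLoaderArgs
{
    public bool HasHeader = true;
    public char Separator = '\t';
}

public sealed class TextLoader
{
    public TextLoaderArgs Args { get; } = new TextLoaderArgs();

    public TextLoader(Action<TextLoaderArgs> configure = null)
    {
        // The delegate mutates the defaults in place; Args is always
        // fully constructed by the time the constructor returns.
        configure?.Invoke(Args);
    }
}

// Usage: the caller never constructs Args directly.
// var loader = new TextLoader(args => { args.Separator = ','; args.HasHeader = false; });
```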
Oh that's an interesting idea @alexdegroot ... that way you don't have to expose the details of constructing this little object at all. Hmmm. Something about that is very appealing. |
Yes @alexdegroot, this sounds like a great idea to me. I am a little bit suspicious of introducing yet another level of indirection into the API (data ->
There's also a chance to do both: simply inject the object as an argument, or use the Args-mutating delegate. When it comes to consistency, I'd opt for a single way to produce Args across all these objects. If you want to do things fluently, then basically you should never have to leave your stack of calls. As a bonus, you can simply comment out a few lines.
This might be just a typo, but what is the difference between
Will the difference of

For example, the TextTransform is trainable if using the dictionary method, but not when using hashing. I'm unsure how NAHandleTransform is coded, but simply replacing the default value for the datatype doesn't need a trainable transform, whereas replacing with the mean value would.
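A hedged sketch of that distinction (interface and class names are illustrative, not the real ML.NET types): a hashing featurizer is a pure function of its input, while a dictionary featurizer must first scan the data to learn its vocabulary, so only the latter needs an estimator/`Fit` step:

```csharp
// Hypothetical sketch. IDataView is assumed from the discussion; the rest
// of the names are invented for illustration.
public interface IDataTransformer
{
    IDataView Transform(IDataView input);
}

public interface IDataEstimator
{
    IDataTransformer Fit(IDataView input);
}

// Non-trainable: hashing needs no pass over the data, so it can be
// constructed directly as a transformer.
public sealed class HashingFeaturizer : IDataTransformer
{
    public IDataView Transform(IDataView input)
    {
        // Map each token to a bucket with a fixed hash function; no learned state.
        return ApplyHashing(input); // hypothetical helper
    }
}

// Trainable: the dictionary method must see the data once to build its
// term dictionary, so it is exposed as an estimator whose Fit returns
// a transformer carrying the learned vocabulary.
public sealed class DictionaryFeaturizerEstimator : IDataEstimator
{
    public IDataTransformer Fit(IDataView input)
    {
        var vocabulary = BuildVocabulary(input); // hypothetical helper: one scan over input
        return new DictionaryFeaturizer(vocabulary); // hypothetical transformer type
    }
}
```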
@eerhardt, it's not a typo.
@justinormont, generally speaking, yes. For both trainable and non-trainable transforms, there will also be corresponding

For example, if you try to instantiate a

For
Do we need two completely separate schema types? It would be unfortunate if we had two parallel "schema" type graphs, and developers had to duplicate code to inspect/construct/etc. the two different schema graphs we had.
The current framework separates the 'relaxed schema' into a separate collection. I don't really like it, and I would much rather have one, but it would be a lot of work to reconcile the two: mainly, the existing schema-handling code somehow needs to define what it will do with a relaxed schema.
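For illustration, a hedged sketch of what such a 'relaxed schema' might carry (all names are invented; the real design may differ): before `Fit`, an estimator may know each column's name and item type but not its full concrete type, e.g. a vector column whose length is only known after training.

```csharp
// Hypothetical sketch of a relaxed schema kept as a separate, lighter-weight
// collection alongside the concrete ISchema.
public enum ColumnShape { Scalar, Vector, VariableVector }

public sealed class RelaxedColumn
{
    public string Name;
    public ColumnShape Shape;
    public DataKind ItemType; // item type is known; the exact vector size is not
}

public sealed class RelaxedSchema
{
    public IReadOnlyList<RelaxedColumn> Columns;
}

// An estimator would map RelaxedSchema -> RelaxedSchema (what it *will*
// produce); the transformer returned by Fit maps concrete ISchema -> ISchema.
```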
Regarding the two schema types: it seems to me that
@Zruty0 @TomFinley - how much work is left for this issue? Do you think this can be closed?
Yep, I think we can close it. |
This is still an incomplete proposal, but I played for a bit with what I had, and it looks promising to me so far.
The general idea is that we narrow our 'zoo' of components (transforms, predictors, scorers, loaders etc) down to three kinds:
- Data: IDataView with schema, like before.

Obviously, a chain of transformers can itself behave as a transformer, and a chain of estimators can behave like an estimator.
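As a hedged sketch of that composition property (interface names here are illustrative, not the real API): fitting a chain walks the estimators in order, feeding each one the data as transformed by the steps fitted so far, and the result is itself a single transformer.

```csharp
// Hypothetical sketch; IDataView is assumed from the discussion.
public interface IDataTransformer
{
    IDataView Transform(IDataView input);
}

public interface IDataEstimator
{
    IDataTransformer Fit(IDataView input);
}

public sealed class EstimatorChain : IDataEstimator
{
    private readonly List<IDataEstimator> _steps = new List<IDataEstimator>();

    public EstimatorChain Append(IDataEstimator step)
    {
        _steps.Add(step);
        return this; // fluent, so pipelines read top to bottom
    }

    public IDataTransformer Fit(IDataView data)
    {
        var fitted = new List<IDataTransformer>();
        foreach (var step in _steps)
        {
            var transformer = step.Fit(data);
            fitted.Add(transformer);
            data = transformer.Transform(data); // the next step trains on transformed data
        }
        return new TransformerChain(fitted); // hypothetical: applies each fitted step in order
    }
}
```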
We also introduce a 'data reader' (and its estimator), responsible for bringing the data 'from outside' (think loaders):
I have gone through the motions of creating a 'pipeline estimator' and 'pipeline transformer' objects, which then allows me to write this code to train and test:
Here, the only catch is the 'MakeTextLoaderArgs', which is an obnoxiously long way to define the original schema of the text loader. But it is obviously subject to improvement.
The full 'playground' is available at https://github.com/Zruty0/machinelearning/tree/feature/estimators