Skip to content

Initial replacement of SubComponent with IComponentFactory #622

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
Aug 8, 2018

Conversation

eerhardt
Copy link
Member

@eerhardt eerhardt commented Aug 1, 2018

This is the first round of code changes necessary in order to remove SubComponent and instead use IComponentFactory.

This change removes SubComponent from the following public APIs:

  • CompositeDataLoader
    • In order to completely remove it here, I needed to also remove it in some of the DataCommand classes as well - since they were calling methods on CompositeDataLoader that used SubComponent.
  • Ova and Pkpd
  • TermTransform

Working towards #585

[Argument(ArgumentType.Multiple, HelpText = "Transforms to apply prior to splitting the data into folds", ShortName = "prexf")]
public KeyValuePair<string, SubComponent<IDataTransform, SignatureDataTransform>>[] PreTransform;
[Argument(ArgumentType.Multiple, HelpText = "Transforms to apply prior to splitting the data into folds", ShortName = "prexf", SignatureType = typeof(SignatureDataTransform))]
public KeyValuePair<string, IComponentFactory<IDataView, IDataTransform>>[] PreTransform;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not necessarily how I'd prefer to do these things, for two reasons. Firstly, while IComponentFactory is an interface, usually when appearing in these arguments objects we actually are indicating a specific inheriting interface appropriate for this task... e.g., I would expect to see something like IDataTransformFactory.

Also: our goal would be to also get rid of this little "signature" concept.

Signature is used for two major purposes, which I list here and describe how I think they could be better accomplished.

  1. It constrains the type based on the generic type (in the case where signatures are used. e.g., we might imagine an ITrainerFactory<IBinaryPredictor> in the pre-estimator world). This would I think suffice.

  2. It indicates what parameters will be necessary to pass in to instantiate the object in the Create method or the constructor. However, in the case where the signature is indicating "globally appropriate" input (in this case, IDataTransform is always instantiated with the input data), it would be better for the subinterface to just have a method directly, without using all this reflection based stuff at all.

I would have an actual subinterface type (that is, IDataTransformFactory or something similarly named. Then this interface would itself have a method to create the transform, and that method would have at least two parameters (in the current world): the env and the input data.

Copy link
Member Author

@eerhardt eerhardt Aug 1, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also: our goal would be to also get rid of this little "signature" concept.

I 100% agree. But it is too baked into the system right now to remove it in a single commit. Moving to IComponentFactory will lessen the dependence on the "signature" concept, but it will have to stay for a while until we can completely revamp the ComponentCatalog system.

usually when appearing in these arguments objects we actually are indicating a specific inheriting interface appropriate for this task... e.g., I would expect to see something like IDataTransformFactory.

I would have an actual subinterface type (that is, IDataTransformFactory or something similarly named. Then this interface would itself have a method to create the transform, and that method would have at least two parameters (in the current world): the env and the input data.

I'd need to think about this a bit more. My initial reaction is that this would cause the following disadvantages:

  1. We would have a lot of different IComponentFactory interfaces instead of the 3 we have so far. As we added new types of components, we would constantly be needing to create new interfaces.
    • I think this is analogous with delegates in .NET 1.0 and 2.0. If you needed a delegate that took in an int and returned a bool, you needed to make a brand new delegate type. But in .NET 3.5, we added Action<> and Func<> delegates. And now you barely ever need to define custom delegate types. Instead you can just use Func<int, bool>.
  2. We couldn't have the small set of SimpleComponentFactory classes that just takes a delegate to create the component. We would have to define a new class for each subinterface.
  3. Similarly as (2) above, the CmdParser class would have to (1) know about (2) define a new class, and (3) create different objects for each of the subinterfaces the API defines.
    • (Obviously I know the command line isn't a high priority, or selling reason, but we still need to support it and other tooling scenarios) #Pending

Copy link
Member Author

@eerhardt eerhardt Aug 1, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/cc @Zruty0 - any thoughts here? According to history you were the original author #Resolved

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @[email protected] 's reasoning. I think we could check it in as is.


In reply to: 207046122 [](ancestors = 207046122)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, I am OK with this solution.


In reply to: 207057777 [](ancestors = 207057777)

{
Name ="Term",
Type = DataKind.TX,
KeyRange = new KeyRange() { Min = 0 }
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Aug 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

KeyRange = new KeyRange() { Min = 0 } [](start = 36, length = 37)

It should be Source = new[] {new TextLoader.Range() { Min = 0 } }
instead of current line #Resolved

@Ivanidzo4ka
Copy link
Contributor

Ivanidzo4ka commented Aug 2, 2018

            .Select(x => new KeyValuePair<string, string>(x.Key, x.Value.ToString()))

this can throw exception. #Resolved


Refers to: src/Microsoft.ML.Data/DataLoadSave/CompositeDataLoader.cs:136 in 05141a8. [](commit_id = 05141a8, deletion_comment = False)

@Ivanidzo4ka
Copy link
Contributor

Ivanidzo4ka commented Aug 2, 2018

            .Select(x => new KeyValuePair<string, string>(x.Key, x.Value.ToString()))

so before I can had something like: <key:"", value:"ProduceId"> and value.ToString() return me "ProduceId", and now i'm not even sure what it contains.


In reply to: 410081350 [](ancestors = 410081350)


Refers to: src/Microsoft.ML.Data/DataLoadSave/CompositeDataLoader.cs:136 in 05141a8. [](commit_id = 05141a8, deletion_comment = False)

@eerhardt
Copy link
Member Author

eerhardt commented Aug 3, 2018

            .Select(x => new KeyValuePair<string, string>(x.Key, x.Value.ToString()))

I don't understand the "this can throw exception" comment. Can you explain further?

As for the 2nd comment, I think I see what was happening here. Before, when it was a SubComponent, we would ToString() the SubComponent (which produces the command-line style "name + args" of the transform) and later write that string to the model.

However, here we are introducing a new scenario where you can new up any transform using this factory, and you may not have the "name + args" string available to write to the model. But when using the command-line (and GUI, etc) we still should be writing the string as before (which we can do from the CmdParser factory classes).

To solve this, I'll check for ICommandLineFactory interface . If the x.Value implements this interface, use its ToString(). If it doesn't, then just write null for the tag value, like we are doing here:

.Select((xf, i) => new TransformEx(GenerateTag(i), null, xf))

I considered adding a new interface ICanProduceTagData, but I don't know of any usages of it, so for now I am not introducing it.


In reply to: 410085392 [](ancestors = 410085392,410081350)


Refers to: src/Microsoft.ML.Data/DataLoadSave/CompositeDataLoader.cs:136 in 05141a8. [](commit_id = 05141a8, deletion_comment = False)

@Ivanidzo4ka
Copy link
Contributor

Ivanidzo4ka commented Aug 3, 2018

            .Select(x => new KeyValuePair<string, string>(x.Key, x.Value.ToString()))

Sorry for this cryptic messages, I just took your code, move it into internal repo, run bunch of tests, and notice what it fails, but since I had only one version of code, and I wasn't sure what happened originally, I just mark this place, revert code, and then tried to explain why it fails. Sometimes. As example it fails if in command line we call text loader, and then ProduceId transform with no arguments.

I think @Zruty0 is person who come up with this tags, so maybe he can elaborate is it still required, and why we accept nulls as tags. Or maybe we can use just use name of class instead of implementing interface


In reply to: 410393424 [](ancestors = 410393424,410085392,410081350)


Refers to: src/Microsoft.ML.Data/DataLoadSave/CompositeDataLoader.cs:136 in 05141a8. [](commit_id = 05141a8, deletion_comment = False)

Fix NRE.
Use existing PlattCalibratorTrainerFactory.
@@ -139,5 +140,11 @@ public bool IsRequired
{
get { return ArgumentType.Required == (_type & ArgumentType.Required); }
}

public Type SignatureType
Copy link
Contributor

@Zruty0 Zruty0 Aug 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SignatureType [](start = 20, length = 13)

Does it have to be here? Let me see why it would be needed #Closed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ooh, I see now. You need to tell the arg parser what's the signature type to invoke the ctor properly.
Hmm, this is unfortunate. Are you sure there's no better way?


In reply to: 208054268 [](ancestors = 208054268)

Copy link
Member Author

@eerhardt eerhardt Aug 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't found one yet...

We could have some sort of mapping from IComponentFactory<...> to SignatureType. But that would seem fragile to me.

My other thinking (that seems to be aligned with @TomFinley's above) is that we should be moving away from SignatureTypes. This at least puts the Signature type into the attribute, and out of the API directly. Hopefully, eventually we can remove signature types all together. #Pending

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, the fact that it's prominent in the attribute makes the whole approach less desirable. But I think it's an OK intermediate step


In reply to: 208058930 [](ancestors = 208058930)

@Zruty0
Copy link
Contributor

Zruty0 commented Aug 7, 2018

            .Select(x => new KeyValuePair<string, string>(x.Key, x.Value.ToString()))

Tags are still required, and we shouldn't have failing tests after this change. If the failing tests are in TLC and not ML.NET, it's not reasonable to expect Eric to write code that doesn't break it :)
I would still like to see the error occurrence to understand what's going on.


In reply to: 410395240 [](ancestors = 410395240,410393424,410085392,410081350)


Refers to: src/Microsoft.ML.Data/DataLoadSave/CompositeDataLoader.cs:136 in 05141a8. [](commit_id = 05141a8, deletion_comment = False)

@@ -37,6 +37,21 @@ public interface IComponentFactory<in TArg1, out TComponent> : IComponentFactory
TComponent CreateComponent(IHostEnvironment env, TArg1 argument1);
}

public class SimpleComponentFactory<TArg1, TComponent> : IComponentFactory<TArg1, TComponent>
Copy link
Contributor

@Zruty0 Zruty0 Aug 7, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add a summary comment to public class #Closed

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


In reply to: 208311231 [](ancestors = 208311231)

Copy link
Contributor

@Zruty0 Zruty0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🕐

Copy link
Contributor

@Zruty0 Zruty0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@eerhardt
Copy link
Member Author

eerhardt commented Aug 7, 2018

            .Select(x => new KeyValuePair<string, string>(x.Key, x.Value.ToString()))

I've fixed the NullReferenceException (NRE) in the latest commits. It was caused by the ComponentFactory.ToString() method I added to CmdParser.cs.


In reply to: 411130583 [](ancestors = 411130583,410395240,410393424,410085392,410081350)


Refers to: src/Microsoft.ML.Data/DataLoadSave/CompositeDataLoader.cs:136 in 05141a8. [](commit_id = 05141a8, deletion_comment = False)

@Zruty0
Copy link
Contributor

Zruty0 commented Aug 8, 2018

@codemzs @Ivanidzo4ka please take a look as well.

@@ -451,8 +451,8 @@ public sealed class TermLoaderArguments
[Argument(ArgumentType.AtMostOnce, IsInputFileName = true, HelpText = "Data file containing the terms", ShortName = "data", SortOrder = 2, Visibility = ArgumentAttribute.VisibilityType.CmdLineOnly)]
public string DataFile;

[Argument(ArgumentType.Multiple, HelpText = "Data loader", NullName = "<Auto>", SortOrder = 3, Visibility = ArgumentAttribute.VisibilityType.CmdLineOnly)]
public SubComponent<IDataLoader, SignatureDataLoader> Loader;
[Argument(ArgumentType.Multiple, HelpText = "Data loader", NullName = "<Auto>", SortOrder = 3, Visibility = ArgumentAttribute.VisibilityType.CmdLineOnly, SignatureType = typeof(SignatureDataLoader))]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Visibility = ArgumentAttribute.VisibilityType.CmdLineOnly [](start = 103, length = 57)

unrelated to your PR, but we need to look at some point on this visibility. It's CmdLineOnly now because Subcomponent can't be exposed to EntryPoints, but since you moving to IComponentFactory, maybe we should remove this visibility restriction.

@@ -10,6 +10,7 @@
using Microsoft.ML.Runtime.Internal.Calibration;
using Microsoft.ML.Runtime.Internal.Internallearn;
using Microsoft.ML.Runtime.Training;
using Microsoft.ML.Runtime.EntryPoints;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: sorting

Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@eerhardt eerhardt merged commit 1b2f181 into dotnet:master Aug 8, 2018
[TGUI(Label = "Predictor Type", Description = "Type of underlying binary predictor")]
public SubComponent<TScalarTrainer, SignatureBinaryClassifierTrainer> PredictorType =
new SubComponent<TScalarTrainer, SignatureBinaryClassifierTrainer>(LinearSvm.LoadNameValue);
public IComponentFactory<TScalarTrainer> PredictorType;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PredictorType [](start = 53, length = 13)

This is force us to go and put ITrainer<IPredictorProducing> to all our binary learners.
For example FastTree is just ITrainer in current moment.
Which I think is a good thing, but we need to do it, overwise we can't use OVA in same way as before this change.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I agree it's a good thing that we need to make our trainers produce IPredictorProducing


In reply to: 208768180 [](ancestors = 208768180)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did an internal test catch this? Or did you find it some other way?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All internal tests use old way of instantiating data, via string :)
I've tried to write OVA example with new changes, and found I can't pass anything here, because nothing satisfy to this interface.


In reply to: 208774082 [](ancestors = 208774082)

Copy link
Member Author

@eerhardt eerhardt Aug 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was the idea with the SimpleComponentFactory class(es):

using (var env = new TlcEnvironment())
{
    var ovaArgs = new Ova.Arguments();
    ovaArgs.PredictorType = new SimpleComponentFactory<ITrainer<IPredictorProducing<float>>>(
        (e) => new FastTreeBinaryClassificationTrainer(e, new FastTreeBinaryClassificationTrainer.Arguments()));

    var ova = new Ova(env, ovaArgs);
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you put this code into setter for PredictorType?
public IComponentFactory PredictorType = new SimpleComponentFactory<ITrainer<IPredictorProducing>>((e)=> new LinearSvm(e, new LinearSvm.Arguments())));

So we will have example of how it's done.

Also I personally prefer to throw exception if user set PredictorType to null, rather than falling to default value.


In reply to: 209000280 [](ancestors = 209000280)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason I did it the way I did was because it skips using the Factory all together in the default case. No need to create an extra factory object or a delegate - if the user didn't provide a PredictorType, we just use LinearSvm.

It also solves the case where the user set the field to null, as an added bonus. There's no reason to throw an exception in this case. We have a default we can use if the user doesn't give us one.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants