-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Initial replacement of SubComponent with IComponentFactory #622
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
And also a couple simple transforms.
…ands. This allows for CompositeDataLoader to no longer need to know about SubComponent.
[Argument(ArgumentType.Multiple, HelpText = "Transforms to apply prior to splitting the data into folds", ShortName = "prexf")] | ||
public KeyValuePair<string, SubComponent<IDataTransform, SignatureDataTransform>>[] PreTransform; | ||
[Argument(ArgumentType.Multiple, HelpText = "Transforms to apply prior to splitting the data into folds", ShortName = "prexf", SignatureType = typeof(SignatureDataTransform))] | ||
public KeyValuePair<string, IComponentFactory<IDataView, IDataTransform>>[] PreTransform; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not necessarily how I'd prefer to do these things, for two reasons. Firstly, while IComponentFactory
is an interface, usually when appearing in these arguments objects we actually are indicating a specific inheriting interface appropriate for this task... e.g., I would expect to see something like IDataTransformFactory
.
Also: our goal would be to also get rid of this little "signature" concept.
Signature is used for two major purposes, which I list here and describe how I think they could be better accomplished.
-
It constrains the type based on the generic type (in the case where signatures are used. e.g., we might imagine an
ITrainerFactory<IBinaryPredictor>
in the pre-estimator world). This would I think suffice. -
It indicates what parameters will be necessary to pass in to instantiate the object in the
Create
method or the constructor. However, in the case where the signature is indicating "globally appropriate" input (in this case,IDataTransform
is always instantiated with the input data), it would be better for the subinterface to just have a method directly, without using all this reflection based stuff at all.
I would have an actual subinterface type (that is, IDataTransformFactory
or something similarly named. Then this interface would itself have a method to create the transform, and that method would have at least two parameters (in the current world): the env
and the input
data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also: our goal would be to also get rid of this little "signature" concept.
I 100% agree. But it is too baked into the system right now to remove it in a single commit. Moving to IComponentFactory
will lessen the dependence on the "signature" concept, but it will have to stay for a while until we can completely revamp the ComponentCatalog system.
usually when appearing in these arguments objects we actually are indicating a specific inheriting interface appropriate for this task... e.g., I would expect to see something like IDataTransformFactory.
I would have an actual subinterface type (that is, IDataTransformFactory or something similarly named. Then this interface would itself have a method to create the transform, and that method would have at least two parameters (in the current world): the env and the input data.
I'd need to think about this a bit more. My initial reaction is that this would cause the following disadvantages:
- We would have a lot of different IComponentFactory interfaces instead of the 3 we have so far. As we added new types of components, we would constantly be needing to create new interfaces.
- I think this is analogous with delegates in .NET 1.0 and 2.0. If you needed a delegate that took in an
int
and returned abool
, you needed to make a brand new delegate type. But in .NET 3.5, we addedAction<>
andFunc<>
delegates. And now you barely ever need to define custom delegate types. Instead you can just useFunc<int, bool>
.
- I think this is analogous with delegates in .NET 1.0 and 2.0. If you needed a delegate that took in an
- We couldn't have the small set of
SimpleComponentFactory
classes that just takes a delegate to create the component. We would have to define a new class for each subinterface. - Similarly as (2) above, the CmdParser class would have to (1) know about (2) define a new class, and (3) create different objects for each of the subinterfaces the API defines.
- (Obviously I know the command line isn't a high priority, or selling reason, but we still need to support it and other tooling scenarios) #Pending
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/cc @Zruty0 - any thoughts here? According to history you were the original author #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @[email protected] 's reasoning. I think we could check it in as is.
In reply to: 207046122 [](ancestors = 207046122)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
{ | ||
Name ="Term", | ||
Type = DataKind.TX, | ||
KeyRange = new KeyRange() { Min = 0 } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
KeyRange = new KeyRange() { Min = 0 } [](start = 36, length = 37)
It should be Source = new[] {new TextLoader.Range() { Min = 0 } }
instead of current line #Resolved
so before I can had something like: <key:"", value:"ProduceId"> and value.ToString() return me "ProduceId", and now i'm not even sure what it contains. In reply to: 410081350 [](ancestors = 410081350) Refers to: src/Microsoft.ML.Data/DataLoadSave/CompositeDataLoader.cs:136 in 05141a8. [](commit_id = 05141a8, deletion_comment = False) |
I don't understand the "this can throw exception" comment. Can you explain further? As for the 2nd comment, I think I see what was happening here. Before, when it was a SubComponent, we would ToString() the SubComponent (which produces the command-line style "name + args" of the transform) and later write that string to the model. However, here we are introducing a new scenario where you can new up any transform using this factory, and you may not have the "name + args" string available to write to the model. But when using the command-line (and GUI, etc) we still should be writing the string as before (which we can do from the CmdParser factory classes). To solve this, I'll check for
I considered adding a new interface In reply to: 410085392 [](ancestors = 410085392,410081350) Refers to: src/Microsoft.ML.Data/DataLoadSave/CompositeDataLoader.cs:136 in 05141a8. [](commit_id = 05141a8, deletion_comment = False) |
Sorry for this cryptic messages, I just took your code, move it into internal repo, run bunch of tests, and notice what it fails, but since I had only one version of code, and I wasn't sure what happened originally, I just mark this place, revert code, and then tried to explain why it fails. Sometimes. As example it fails if in command line we call text loader, and then ProduceId transform with no arguments. I think @Zruty0 is person who come up with this tags, so maybe he can elaborate is it still required, and why we accept nulls as tags. Or maybe we can use just use name of class instead of implementing interface In reply to: 410393424 [](ancestors = 410393424,410085392,410081350) Refers to: src/Microsoft.ML.Data/DataLoadSave/CompositeDataLoader.cs:136 in 05141a8. [](commit_id = 05141a8, deletion_comment = False) |
Fix NRE. Use existing PlattCalibratorTrainerFactory.
@@ -139,5 +140,11 @@ public bool IsRequired | |||
{ | |||
get { return ArgumentType.Required == (_type & ArgumentType.Required); } | |||
} | |||
|
|||
public Type SignatureType |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SignatureType [](start = 20, length = 13)
Does it have to be here? Let me see why it would be needed #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ooh, I see now. You need to tell the arg parser what's the signature type to invoke the ctor properly.
Hmm, this is unfortunate. Are you sure there's no better way?
In reply to: 208054268 [](ancestors = 208054268)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't found one yet...
We could have some sort of mapping from IComponentFactory<...>
to SignatureType
. But that would seem fragile to me.
My other thinking (that seems to be aligned with @TomFinley's above) is that we should be moving away from SignatureTypes
. This at least puts the Signature type into the attribute, and out of the API directly. Hopefully, eventually we can remove signature types all together. #Pending
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, the fact that it's prominent in the attribute makes the whole approach less desirable. But I think it's an OK intermediate step
In reply to: 208058930 [](ancestors = 208058930)
Tags are still required, and we shouldn't have failing tests after this change. If the failing tests are in TLC and not ML.NET, it's not reasonable to expect Eric to write code that doesn't break it :) In reply to: 410395240 [](ancestors = 410395240,410393424,410085392,410081350) Refers to: src/Microsoft.ML.Data/DataLoadSave/CompositeDataLoader.cs:136 in 05141a8. [](commit_id = 05141a8, deletion_comment = False) |
@@ -37,6 +37,21 @@ public interface IComponentFactory<in TArg1, out TComponent> : IComponentFactory | |||
TComponent CreateComponent(IHostEnvironment env, TArg1 argument1); | |||
} | |||
|
|||
public class SimpleComponentFactory<TArg1, TComponent> : IComponentFactory<TArg1, TComponent> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please add a summary comment to public class #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🕐
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've fixed the NullReferenceException (NRE) in the latest commits. It was caused by the ComponentFactory.ToString() method I added to CmdParser.cs. In reply to: 411130583 [](ancestors = 411130583,410395240,410393424,410085392,410081350) Refers to: src/Microsoft.ML.Data/DataLoadSave/CompositeDataLoader.cs:136 in 05141a8. [](commit_id = 05141a8, deletion_comment = False) |
@codemzs @Ivanidzo4ka please take a look as well. |
@@ -451,8 +451,8 @@ public sealed class TermLoaderArguments | |||
[Argument(ArgumentType.AtMostOnce, IsInputFileName = true, HelpText = "Data file containing the terms", ShortName = "data", SortOrder = 2, Visibility = ArgumentAttribute.VisibilityType.CmdLineOnly)] | |||
public string DataFile; | |||
|
|||
[Argument(ArgumentType.Multiple, HelpText = "Data loader", NullName = "<Auto>", SortOrder = 3, Visibility = ArgumentAttribute.VisibilityType.CmdLineOnly)] | |||
public SubComponent<IDataLoader, SignatureDataLoader> Loader; | |||
[Argument(ArgumentType.Multiple, HelpText = "Data loader", NullName = "<Auto>", SortOrder = 3, Visibility = ArgumentAttribute.VisibilityType.CmdLineOnly, SignatureType = typeof(SignatureDataLoader))] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Visibility = ArgumentAttribute.VisibilityType.CmdLineOnly [](start = 103, length = 57)
unrelated to your PR, but we need to look at some point on this visibility. It's CmdLineOnly now because Subcomponent can't be exposed to EntryPoints, but since you moving to IComponentFactory, maybe we should remove this visibility restriction.
@@ -10,6 +10,7 @@ | |||
using Microsoft.ML.Runtime.Internal.Calibration; | |||
using Microsoft.ML.Runtime.Internal.Internallearn; | |||
using Microsoft.ML.Runtime.Training; | |||
using Microsoft.ML.Runtime.EntryPoints; | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: sorting
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[TGUI(Label = "Predictor Type", Description = "Type of underlying binary predictor")] | ||
public SubComponent<TScalarTrainer, SignatureBinaryClassifierTrainer> PredictorType = | ||
new SubComponent<TScalarTrainer, SignatureBinaryClassifierTrainer>(LinearSvm.LoadNameValue); | ||
public IComponentFactory<TScalarTrainer> PredictorType; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PredictorType [](start = 53, length = 13)
This is force us to go and put ITrainer<IPredictorProducing> to all our binary learners.
For example FastTree is just ITrainer in current moment.
Which I think is a good thing, but we need to do it, overwise we can't use OVA in same way as before this change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I agree it's a good thing that we need to make our trainers produce IPredictorProducing
In reply to: 208768180 [](ancestors = 208768180)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did an internal test catch this? Or did you find it some other way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All internal tests use old way of instantiating data, via string :)
I've tried to write OVA example with new changes, and found I can't pass anything here, because nothing satisfy to this interface.
In reply to: 208774082 [](ancestors = 208774082)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was the idea with the SimpleComponentFactory
class(es):
using (var env = new TlcEnvironment())
{
var ovaArgs = new Ova.Arguments();
ovaArgs.PredictorType = new SimpleComponentFactory<ITrainer<IPredictorProducing<float>>>(
(e) => new FastTreeBinaryClassificationTrainer(e, new FastTreeBinaryClassificationTrainer.Arguments()));
var ova = new Ova(env, ovaArgs);
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you put this code into setter for PredictorType?
public IComponentFactory PredictorType = new SimpleComponentFactory<ITrainer<IPredictorProducing>>((e)=> new LinearSvm(e, new LinearSvm.Arguments())));
So we will have example of how it's done.
Also I personally prefer to throw exception if user set PredictorType to null, rather than falling to default value.
In reply to: 209000280 [](ancestors = 209000280)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason I did it the way I did was because it skips using the Factory all together in the default case. No need to create an extra factory object or a delegate - if the user didn't provide a PredictorType, we just use LinearSvm
.
It also solves the case where the user set the field to null
, as an added bonus. There's no reason to throw an exception in this case. We have a default we can use if the user doesn't give us one.
This is the first round of code changes necessary in order to remove SubComponent and instead use IComponentFactory.
This change removes SubComponent from the following public APIs:
CompositeDataLoader
Ova
andPkpd
TermTransform
Working towards #585