-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Trainer estimator cleanup for FastTrees and LightGBM #1352
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Adding tests, and refactoring catalog and pigsty statics
adding groupid to the trainer estimator base refactoring the catalog and static extensions for trees
@@ -377,9 +377,10 @@ public static SchemaShape.Column MakeR4VecFeature(string featureColumn) | |||
/// The <see cref="SchemaShape.Column"/> for the weight column. | |||
/// </summary> | |||
/// <param name="weightColumn">name of the weight column</param> | |||
public static SchemaShape.Column MakeR4ScalarWeightColumn(string weightColumn) | |||
/// <param name="isExplicit">whether the column is implicitely, or explicitely defined</param> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spelling: implicitly & explicitly #Closed
/// <summary> | ||
/// The optional groupID column that the ranking trainers expects. | ||
/// </summary> | ||
public readonly SchemaShape.Column GroupIdColumn; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GroupIdColumn [](start = 43, length = 13)
I am not sure I like the idea that every estimator will have the group ID column, but it only makes sense for ranking trainers. #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The alternative would be to modify the role mapped data after the estimator constructs it, and inject the groupId, and that feels error-prone. It's just one reference :)
In reply to: 227866406 [](ancestors = 227866406)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we have a TrainerEstimatorBase
, is there anything that prevents us from having another subclass, if we wanted to have additional roles? We may have to change something below to be a virtual method or have some other extension point, but I feel like this would be helpful anyway.
I am not really a fan of having a single utility base class handle everything, since if it is capable of doing everything I suspect it will become too unwieldly.
In reply to: 227895594 [](ancestors = 227895594,227866406)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
idk what i like less; the GroupId field here, or making MakeRoles protected virtual ... changing it anyways.
In reply to: 228312151 [](ancestors = 228312151,227895594,227866406)
|
||
internal static class LightGbmStaticsUtils { | ||
/// <summary> | ||
/// LightGbm <see cref="RankingContext"/> extension method. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LightGbm extension method. [](start = 12, length = 55)
Change the summary to look like the below summary #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
||
if (advancedSettings != null) | ||
CheckArgsAndAdvancedSettingMismatch(numLeaves, minDataPerLeaf, learningRate, numBoostRound, new LightGbmArguments(), Args); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there has been a misconception.
It is OK to have the parameters (like numLeaves
and learningRate
) available in 2 places (inside Args
and as a ctor argument.
The correct pattern is
var args = new Arguments();
args.NumLeaves = numLeaves;
args.LearningRate = learningRate;
//...
advancedSettings?.Invoke(args);
That is, we set the args' fields to the user-provided ctor arguments, and then let the user optionally change them again in the delegate: this is not a bug.
One exception is for column names: we are always ignoring the column names if they are set in the arguments. So I suggested to check (and throw) if the user sets something like FeatureColumn
in the args to the value different to featureColumn
ctor argument (as we will not respect that change).
Apparently, the latter is not yet done, and it's probably fine for now. But let's not check the other args. #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what I have been doing, is the exact opposite, and i do prefer that tbh because of two reasons:
1- We do have parms in two places, and i think whenever we can deprecate those params from the arguments, we should do so. I don;t think we should promote the buried arguments in favor of the ones that surfaced in the signature.
2- It keeps the behavior consistent with the column names.
cc @TomFinley for his opinion.
In reply to: 227869641 [](ancestors = 227869641)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not a fan of enforcing a mismatch, since I could easily imagine situations where someone sets some reasonable arguments, but might thanks to some complicated logic decide to think better of it in the more complex setter.
However there is an addition point I wanted to address, since my first reaction is to disagree with it somewhat.
We do have parms in two places, and i think whenever we can deprecate those params from the arguments, we should do so.
I am strongly not in favor of this. I believe it is essential that the object with the advanced setting also continue to have the basic settings.
The reason is, a delegate for setting the object might be something used alongside the constructor, but in more programmatic, non-direct instantiation settings, especially in situations where several decisions have to be taken, it might be a real honest-to-god separate method living in a class somewhere. If the options argument has omitted the most common options, as suggested here, then we would force this scenario to spread its setting objects across two places, in a way that becomes suddenly rather more complex than just setting a bunch of properties in a single object.
A more minor consideration is that it seems more in-line with the idiom from ASP.NET, which is where @alexdegroot originally suggested we lift the idiom here.
My inclination therefore is to agree with @Zruty0 on this one, sorry @sfilipi. :)
In reply to: 227886248 [](ancestors = 227886248,227869641)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for the elaborate and convincing argument @[email protected]. changing it everywhere.
In reply to: 228319239 [](ancestors = 228319239,227886248,227869641)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🕐
/// <param name="columnName">name of the weight column</param> | ||
public static SchemaShape.Column MakeU4ScalarColumn(string columnName) | ||
{ | ||
if (columnName == null) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this check needed? It looks like SchemaShape.Column constructor also checks the name for null or empty string. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep, because the check inside the constructor will throw if we pass null.
In reply to: 227989462 [](ancestors = 227989462)
/// </summary> | ||
/// <param name="ctx">The <see cref="BinaryClassificationContext"/>.</param> | ||
/// <param name="label">The label column.</param> | ||
/// <param name="features">The features colum.</param> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
column, same on the other function comments. #Resolved
float l2Weight, | ||
float optimizationTolerance, | ||
int memorySize, | ||
bool enforceNoNegativity) | ||
: this(env, ArgsInit(featureColumn, labelColumn, weightColumn, advancedSettings), labelColumn, | ||
l1Weight, l2Weight, optimizationTolerance, memorySize, enforceNoNegativity) | ||
: this(env, new TArgs(), labelColumn, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TArgs() [](start = 27, length = 8)
this needs to get all the other parameters in #Resolved
if (Args.LearningRates != learningRate) | ||
{ | ||
using (var ch = Host.Start($"Setting learning rate to: {learningRate} as supplied in the direct arguments.")) | ||
Args.LearningRates = learningRate; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Args [](start = 16, length = 4)
indenting #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -51,12 +51,18 @@ internal RegressionGamTrainer(IHostEnvironment env, Arguments args) | |||
/// <param name="labelColumn">The name of the label column.</param> | |||
/// <param name="featureColumn">The name of the feature column.</param> | |||
/// <param name="weightColumn">The name for the column containing the initial weight.</param> | |||
/// <param name="learningRate">The learning rate.</param> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor nit, the ordering of parameters, minDocumentsInLeafs and learningRate. #Resolved
/// <param name="weightColumn">The name for the column containing the initial weight.</param> | ||
/// <param name="learningRate">The learning rate.</param> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ordering of parameters here too. #Resolved
|
||
internal class FastTreeStaticsUtils | ||
{ | ||
internal static void CheckUserValues(PipelineColumn label, Vector<float> features, Scalar<float> weights, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indent is off here... #Resolved
public LinearClassificationTrainer(IHostEnvironment env, | ||
string featureColumn, | ||
string featureColumn, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indent is off here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM - I have a few minor nit picks on formatting.
1- Adding the GroupId to the TrainerEstimatorBase class.
2- Adding the static xtensions for LightGM Multiclass and Ranking. Fixes #1314
3- Reorganizing the static and dynamic xtensions for FastTree and LightGBM
Addresses part of #1318