In my opinion, the default column names are more a source of trouble than a benefit.
Providing defaults for the numeric values is one thing: we know the algorithms and what ranges might work best for most datasets, and we also want to give a guideline on their range.
Column names are different. Across datasets, the columns are unlikely to be called what ML.NET calls them, and it is easy to omit them from the call when they are set to defaults.
Consider this pipeline:
var pipeline = mlContext.Transforms.Text.FeaturizeText("SentimentText", "Features")
.Append(mlContext.BinaryClassification.Trainers.StochasticDualCoordinateAscent(label: "Sentiment", features: "Features", l2Const: 0.001f));
// Step 3: Run Cross-Validation on this pipeline, and dataFile.
var cvResult = mlContext.BinaryClassification.CrossValidate(data, pipeline);
This call does not specify the label on CrossValidate. Unless the label is passed there as well:
var cvResult = mlContext.BinaryClassification.CrossValidate(data, pipeline, labelColumn: "Sentiment");
the call fails with the message: 'Label column 'Label' not found'
It then takes some looking around to eventually figure out the mismatch between your data and the defaults on CV.
Why push that onto users when we can just guide them towards providing the right names where the APIs need them?
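A minimal sketch of the difference between a defaulted and a required column name. The types below (`Trainer`, `Evaluation`, and both `CrossValidate...` methods) are hypothetical stand-ins that only model the column-name lookup; they are not the ML.NET API, and the set of strings merely stands in for the data's schema.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in, not an ML.NET type: it only records the label column name.
public sealed class Trainer
{
    public string LabelColumn { get; }
    public Trainer(string labelColumn) => LabelColumn = labelColumn;
}

public static class Evaluation
{
    // Variant A: the label column defaults to "Label", so callers can omit it silently.
    public static string CrossValidateWithDefault(
        ISet<string> dataColumns, Trainer trainer, string labelColumn = "Label")
        => Check(dataColumns, labelColumn);

    // Variant B: no default, so leaving the name out is a compile-time error.
    public static string CrossValidateExplicit(
        ISet<string> dataColumns, Trainer trainer, string labelColumn)
        => Check(dataColumns, labelColumn);

    // Returns the label column name if the data has it, otherwise throws.
    private static string Check(ISet<string> dataColumns, string labelColumn)
    {
        if (!dataColumns.Contains(labelColumn))
            throw new InvalidOperationException($"Label column '{labelColumn}' not found");
        return labelColumn;
    }
}

public static class Program
{
    public static void Main()
    {
        var columns = new HashSet<string> { "SentimentText", "Sentiment", "Features" };
        var trainer = new Trainer(labelColumn: "Sentiment");

        try
        {
            // Compiles, but fails at run time: the default "Label" is not in the data.
            Evaluation.CrossValidateWithDefault(columns, trainer);
        }
        catch (InvalidOperationException e)
        {
            Console.WriteLine(e.Message);
        }

        // The explicit variant cannot be called without naming the column.
        Console.WriteLine(Evaluation.CrossValidateExplicit(columns, trainer, "Sentiment"));
    }
}
```

With the required parameter, forgetting the label is caught by the compiler instead of surfacing as a run-time lookup failure. The cost is that the same name has to be repeated on the trainer and on the cross-validation call.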
I think that, regardless of what we do, we should be consistent between components.
Either all trainers have defaults for all columns (like label, features, weight), or none do.
Frankly, I think having default names is just fine: I believe that users don't often have 'inherent' names for their columns; they are forced to give them names just because that's how our data views work. In that case, there is no incentive NOT to use the default names, and no reason to force the names to be specified multiple times.
cc @Zruty0 @shauheen @GalOshri @TomFinley for opinions