Dynamic number of features for the trainer / schema #4903
I'll take a look.
After creating a simplified example, I see that LightGBM always returns the first item of the training set as the prediction, no matter what test data is provided.

```csharp
public class StrategyInputModel
{
    [ColumnName("Strategy"), LoadColumn(0)]
    public string Strategy { get; set; } // will be used as a label (classifier)

    [ColumnName("Pitch"), LoadColumn(1)]
    public float Pitch { get; set; } // will be used as a part of dynamic Features

    [ColumnName("Energy"), LoadColumn(2)]
    public float Energy { get; set; } // will always be 0

    [ColumnName("Contrast"), LoadColumn(3, 8), VectorType(6)]
    public float[] Contrast { get; set; } // will always be [0, 0, 0, 0, 0, 0]
}

public class StrategyOutputModel
{
    [ColumnName("Prediction")]
    public string Prediction { get; set; }

    public float[] Score { get; set; }
}
```
```csharp
public IEstimator<ITransformer> GetPipeline(IEnumerable<string> columns)
{
    var pipeline = Context
        .Transforms
        .Conversion
        .MapValueToKey(new[] { new InputOutputColumnPair("Label", "Strategy") })
        .Append(Context.Transforms.Concatenate("Combination", columns.ToArray())) // merge "dynamic" columns into a single property
        .Append(Context.Transforms.NormalizeMinMax(new[] { new InputOutputColumnPair("Features", "Combination") })) // normalize merged columns into Features
        .Append(Context.Transforms.SelectColumns(new string[] { "Label", "Features" })); // remove everything from the data view except the transformed columns
    return pipeline;
}
```
```csharp
public IEstimator<ITransformer> GetEstimator()
{
    var estimator = Context
        .MulticlassClassification
        .Trainers
        .LightGbm()
        .Append(Context.Transforms.Conversion.MapKeyToValue(new[]
        {
            new InputOutputColumnPair("Prediction", "PredictedLabel") // set the trainer to use the Prediction property as output
        }));
    return estimator;
}
```
```csharp
public byte[] SaveModel(IEnumerable<string> columns, IEnumerable<StrategyInputModel> items)
{
    var estimator = GetEstimator();
    var pipeline = GetPipeline(columns);
    var inputs = Context.Data.LoadFromEnumerable(items);
    var estimatorModel = pipeline.Append(estimator).Fit(inputs);
    var model = new byte[0];
    using (var memoryStream = new MemoryStream())
    {
        Context.Model.Save(estimatorModel, inputs.Schema, memoryStream);
        model = memoryStream.ToArray();
    }
    return model;
}
```

Test method:

```csharp
public string Estimate()
{
    var aInput = new StrategyInputModel
    {
        Strategy = "A",
        Pitch = 130F,
        Energy = 0,
        Contrast = new float[] { 0, 0, 0, 0, 0, 0 }
    };
    var bInput = new StrategyInputModel
    {
        Strategy = "B",
        Pitch = 131F,
        Energy = 0,
        Contrast = new float[] { 0, 0, 0, 0, 0, 0 }
    };
    var columns = new[] { "Pitch" };
    var predictor = SaveModel(columns, new[] { aInput, bInput }); // train model on "A" and "B"
    using (var stream = new MemoryStream(predictor))
    {
        var model = Context.Model.Load(stream, out var schema);
        var inputs = Context.Data.LoadFromEnumerable(new[] { bInput }); // pass "B" as test data
        var predictions = model.Transform(inputs);
        var output = Context.MulticlassClassification.Evaluate(data: predictions); // Log Loss = 0.69, Micro / Macro Accuracy = 1
        var modelPrediction = predictions.GetColumn<string>("Prediction").ToArray().FirstOrDefault(); // get "A" as prediction [WRONG]
        var engine = Context.Model.CreatePredictionEngine<StrategyInputModel, StrategyOutputModel>(model);
        var enginePrediction = engine.Predict(bInput).Prediction; // get "A" as prediction [WRONG]
        return modelPrediction;
    }
}
```
Now, shocking news :)

Training set = 3 records. Test set = item #2 from the training set.

**Results**
A model created and trained with AveragedPerceptronOva always produces correct results. For example, if I create a model with AveragedPerceptronOva and then test an item using the same AveragedPerceptronOva or LightGbmMulti, the prediction is correct. If I create and train a model using any other architecture, then no matter which model I use with the test data, it always returns item #1 from the training set as the best match, which is wrong.

**Conclusion**
There is no issue with the Prediction Engine. There is something wrong with
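A minimal sketch of the trainer swap described above, assuming the same `Context`, pipeline, and output mapping as in the earlier snippets (`GetPerceptronEstimator` is a hypothetical helper name; AveragedPerceptronOva here stands for AveragedPerceptron wrapped in One-Versus-All):

```csharp
// Hypothetical variant of GetEstimator(): same output mapping, but the
// trainer is replaced with AveragedPerceptron inside One-Versus-All,
// which produced correct predictions on the 3-record training set.
public IEstimator<ITransformer> GetPerceptronEstimator()
{
    var estimator = Context
        .MulticlassClassification
        .Trainers
        .OneVersusAll(Context.BinaryClassification.Trainers.AveragedPerceptron())
        .Append(Context.Transforms.Conversion.MapKeyToValue(new[]
        {
            new InputOutputColumnPair("Prediction", "PredictedLabel")
        }));
    return estimator;
}
```

Passing this estimator to `SaveModel` in place of `GetEstimator()` reproduces the "correct" case described in the results.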
**Update**

**Algorithms**

Train data that gives CORRECT classification:

```csharp
var inputs = new InputModel[]
{
    new InputModel
    {
        Label = "Sample #1",
        Factors = new float[] { 163.22714f, 2.8778636f, 0.5324864f, 1.5412121f, 0.64363956f, 0.1371824f, -0.021679323f, 0.42805633f, -0.712864f, -0.2189847f, 0.12471165f, 0.07920727f, 0.47652832f }
    },
    new InputModel
    {
        Label = "Sample #2",
        Factors = new float[] { 148.25192f, 4.3155456f, 0.70223117f, 1.5649862f, 1.1754155f, 0.13773751f, 0.2579985f, -0.26886848f, -0.6455144f, -0.073765576f, -0.15425977f, 0.19466293f, 0.43180266f }
    },
    new InputModel
    {
        Label = "Sample #3",
        Factors = new float[] { 164.9029f, 4.810955f, 0.87685776f, 1.4808261f, 0.9378684f, 0.13101591f, -0.06908134f, -0.067622736f, -0.8588759f, -0.038343582f, 0.36045787f, -0.25861377f, 0.63997686f }
    }
};
```

Train data that gives WRONG classification:

```csharp
var inputs = new InputModel[]
{
    new InputModel { Label = "Sample #1", Factors = 154.1958F },
    new InputModel { Label = "Sample #2", Factors = 130.47337F },
    new InputModel { Label = "Sample #3", Factors = 135.6923F }
};
```

**Results**
When I use the "wrong" data set for training and then try to use each of its items as test data, AveragedPerceptronOva can successfully identify "Sample #1" with the value 154, but fails to distinguish the values 130 and 135. When I use the "correct" data set, AveragedPerceptronOva and SdcaMaximumEntropy can correctly identify each item when it's used as test data. Tree-based algorithms always fail and return an incorrect result on small data sets, no matter what training set is provided and which of its items is used as test data. At the same time, trees work approximately fine on 500+ records. Perhaps a tree cannot be built from only 2-3 items?

**Question**
Is there an activation function, a threshold, or some parameter to these algorithms that can make them more sensitive, so that they correctly separate the values 130 and 135?
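One knob worth checking is LightGBM's minimum leaf size. A sketch (not verified to fix this issue): `MinimumExampleCountPerLeaf` defaults to a value large enough that, on a 2-3 record dataset, no split can be made at all, which would explain a constant prediction. The specific option values below are illustrative assumptions:

```csharp
// Sketch: lower the minimum leaf size so a split is possible on a
// tiny dataset; the other values are illustrative, not recommendations.
var options = new LightGbmMulticlassTrainer.Options
{
    MinimumExampleCountPerLeaf = 1,
    NumberOfLeaves = 2,
    NumberOfIterations = 50
};
var trainer = Context.MulticlassClassification.Trainers.LightGbm(options);
```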
**Update**
Tried to provide various options to the LGBM trainer.

**Results**
No changes. LGBM (and possibly all tree-based algorithms) gives incorrect predictions on small data sets.
Tried to use XGBoost implemented in this library.
Hi @artemiusgreat, sorry for the late response. Regarding your first comment: you found that the output schema has more than 10 columns and the result of prediction is always the same. I'm not 100% sure, but it's likely that when you call CreatePredictionEngine(), all the fields defined in your MyInputModel are considered, which ought to be the correct behavior of ML.NET. As a workaround, you can manually define an input model containing only the columns you actually use. Meanwhile, can you please provide the full pipeline so that I can investigate further and make sure I give you the right answer?
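The suggested workaround could look like this sketch (`ReducedInputModel` is a hypothetical name, and the column set assumes only `Strategy` and `Pitch` are actually used):

```csharp
// Hypothetical workaround: a separate input class containing only the
// columns actually used for training, instead of the full input model
// with its unused Energy and Contrast fields.
public class ReducedInputModel
{
    [ColumnName("Strategy"), LoadColumn(0)]
    public string Strategy { get; set; }

    [ColumnName("Pitch"), LoadColumn(1)]
    public float Pitch { get; set; }
}

// Build the prediction engine against the reduced schema.
var engine = Context.Model.CreatePredictionEngine<ReducedInputModel, StrategyOutputModel>(model);
```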
Regarding this comment, I think the network structure of the model you implemented does not have any obvious problem.
As for this comment: yes, I think it's likely that tree-based algorithms do not work well on small datasets, and this is probably the expected behavior. In general practice, it's recommended to train tree-based algorithms on larger datasets.
Closing this issue for now. Please feel free to reopen it if you need more help!
Issue
Trying to use a variable number of properties (dynamic schema) for the trainer via dataView.SelectColumns. This creates a correct trainer with only 2 features, but the prediction engine still requires the original input model and uses all 10+ features, even though all features except the selected 2 were set to 0.