How to inspect OneVersusAll models #3701


Open · rauhs opened this issue May 10, 2019 · 3 comments

Labels: image (Bugs related image datatype tasks), lightgbm (Bugs related lightgbm), P1 (Priority of the issue for triage purpose: Needs to be fixed soon.)


rauhs commented May 10, 2019

Version: 1.0

Since 2b417bb made SubModelParameters private, there is no way to get at any of the sub-models, which is needed for feature importance.

How do I inspect OVA models? In particular, how do I get feature importance?

FWIW, using PFI for feature importance is not an option for me. It would take a day to run; it's 1000x slower than training any of my models.

@shmoradims shmoradims self-assigned this May 13, 2019
shmoradims commented

@rauhs making SubModelParameters public isn't useful by itself. Since its entries are typed as object, they need to be cast to their concrete types to be useful (e.g. Microsoft.ML.Calibrators.ParameterMixingCalibratedModelParameters<Microsoft.ML.Model.IPredictorWithFeatureWeights<float>, Microsoft.ML.Calibrators.ICalibrator>). To do that, at least 6 other internal/private interfaces and classes would need to be made public, so that route needs further discussion.
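
Purely for illustration, here is a sketch of what that inspection could look like if those members were public. It does not compile against the current release, since SubModelParameters and some of the types involved are internal, which is exactly the problem described above:

// Hypothetical sketch only: SubModelParameters and some of the types below
// are internal in the current release, so this does not compile today.
using Microsoft.ML.Calibrators;
using Microsoft.ML.Data;
using Microsoft.ML.Model;
using Microsoft.ML.Trainers;

public static class OvaInspection
{
    public static void PrintSubModelWeights(OneVersusAllModelParameters ova)
    {
        foreach (var sub in ova.SubModelParameters) // internal today
        {
            // Each entry is typed as object and has to be cast to its concrete type.
            var calibrated = (ParameterMixingCalibratedModelParameters<
                IPredictorWithFeatureWeights<float>, ICalibrator>)sub;

            // One weight per feature slot for this binary sub-model.
            var weights = default(VBuffer<float>);
            calibrated.SubModel.GetFeatureWeights(ref weights);
        }
    }
}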

Is your main motivation to find feature importance? If so, it's easier to address your PFI issue with OVA.

AFAIK, PFI runtime for OVA is a linear function of n_rows * n_features * n_classes. It should be very close to the PFI runtime for the underlying binary trainer multiplied by n_classes, because OVA has that many binary models that it needs to evaluate during PFI.

My PFI code below finishes in under 7 minutes for 1M rows, 40 features, and 6 classes on a very modest VM (specs below). What are the properties of your data such that it's taking a day to finish? What binary trainer are you using with OVA? Have you tried OVA with other binary trainers? Please share your code if possible.

[Screenshot: VM specs]

using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;

namespace Samples.Dynamic.Trainers.MulticlassClassification
{
    public static class OvaPfi
    {
        public static void Example()
        {
            var mlContext = new MLContext(seed: 0);
            var dataPoints = GenerateRandomDataPoints(1000*1000);
            var trainingData = mlContext.Data.LoadFromEnumerable(dataPoints);
            trainingData = mlContext.Transforms.Conversion.MapValueToKey("Label").Fit(trainingData).Transform(trainingData);
            var pipeline = mlContext.MulticlassClassification.Trainers.OneVersusAll(mlContext.BinaryClassification.Trainers.SdcaNonCalibrated());
            var model = pipeline.Fit(trainingData);

            var stopWatch = new System.Diagnostics.Stopwatch();
            stopWatch.Start();
            var x = mlContext.MulticlassClassification.PermutationFeatureImportance(model, trainingData);
            stopWatch.Stop();
            Console.WriteLine(stopWatch.Elapsed);
            
            // Output:
            // 00:06:29.0938075
        }

        private static IEnumerable<DataPoint> GenerateRandomDataPoints(int count, int seed = 0)
        {
            var random = new Random(seed);
            float randomFloat() => (float)(random.NextDouble() - 0.5);
            for (int i = 0; i < count; i++)
            {
                var label = random.Next(6);
                yield return new DataPoint
                {
                    Label = (uint)label,
                    // Create random features that are correlated with the label.
                    // The feature values are slightly increased by adding a constant multiple of label.
                    Features = Enumerable.Repeat(label, 40).Select(x => randomFloat() + label * 0.2f).ToArray()
                };
            }
        }

        private class DataPoint
        {
            public uint Label { get; set; }
            [VectorType(40)]
            public float[] Features { get; set; }
        }
    }
}
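
One practical mitigation (my suggestion, not something the thread requires): since the runtime is linear in the number of rows, you can run PFI on a subsample of the data, e.g. via mlContext.Data.TakeRows. A minimal sketch, reusing mlContext, model, and trainingData from the example above:

// Sketch: bound PFI runtime by evaluating it on a subsample of the data.
// Assumes `mlContext`, `model`, and `trainingData` from the example above.
var sample = mlContext.Data.TakeRows(trainingData, 100 * 1000); // first 100k rows
var pfiOnSample = mlContext.MulticlassClassification
    .PermutationFeatureImportance(model, sample);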

@shmoradims shmoradims added the need info This issue needs more info before triage label May 16, 2019

rauhs commented May 17, 2019

My hunch is that the Evaluate method for multiclass classifiers is actually super slow. That's why we're not using it at all and have implemented our own evaluation code, which is much faster (e.g. #744).
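
(For illustration of the idea only, this is not the code from #744: a hand-rolled micro-accuracy pass over the scored data can look like the sketch below, assuming a fitted model and key-typed Label and PredictedLabel columns.)

// Sketch of a minimal hand-rolled micro-accuracy computation; assumes a
// fitted `model` and a `testData` IDataView with a key-typed Label column.
private sealed class LabelPair
{
    public uint Label { get; set; }
    public uint PredictedLabel { get; set; }
}

private static double MicroAccuracy(MLContext mlContext, ITransformer model, IDataView testData)
{
    var scored = model.Transform(testData);
    long correct = 0, total = 0;
    foreach (var row in mlContext.Data.CreateEnumerable<LabelPair>(scored, reuseRowObject: true))
    {
        if (row.Label == row.PredictedLabel) correct++;
        total++;
    }
    return total == 0 ? 0 : (double)correct / total;
}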

We have hundreds of classes, sometimes even 2-3k classes.

This particular setup:

  • Learner: LightGBM Multiclass "objective=multiclassova"
  • Labels: 249
  • Features: 2416 + 5
  • Samples/Instances: 10_000

Log:

Number of categorical features 15 (516, 725, 23, 26, 6, 4, 1, 7, 5, 3, 27, 40, 33, 972, 28) (Sum: 2416)
Number of real valued features 5
2019-05-17 08:19:53,195 DEBUG: [Source=LightGBMMulticlass; Training with LightGBM, Kind=Info] LightGBM objective=multiclassova
2019-05-17 08:20:04,698 DEBUG: [Source=LightGBMMulticlass; Training with LightGBM, Kind=Info] Met early stopping, best iteration: 30, best score: 0.19
Finished training 15s. Data points: 10000, Labels 249, (CarrierId)
Timed Predict train set: 0.300888888888889ms (per iter)
Train acc: 0.9774445, 0.9915556, 0.994, 0.9942222, 0.9944444. Kappa: 0.9717428, 0.9880579, 0.9908474, 0.9905593, 0.9903494
Test  acc: 0.8124373, 0.8676028, 0.8866599, 0.89669, 0.9037111. Kappa: 0.7803715, 0.8311259, 0.8451638, 0.848962, 0.84884
2019-05-17 08:20:14,328 DEBUG: [Source=PermutationFeatureImportance; GetImportanceMetrics, Kind=Info] Number of slots: 2421
2019-05-17 08:20:14,356 DEBUG: [Source=PermutationFeatureImportance; GetImportanceMetrics, Kind=Info] Detected 1000 examples for evaluation.
PFI: 863s

So ~15min for PFI and only 15s for the training. And I've reduced our usual problem size here.

I've passed in the test set (1000 Samples) to the PFI call.

@ganik ganik added the P1 Priority of the issue for triage purpose: Needs to be fixed soon. label May 21, 2019
shmoradims commented

We need to do some profiling for this. The large number of classes is making it very slow.
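
Back-of-the-envelope, applying the linear cost model from my earlier comment to the run above: 2421 permuted slots × 249 binary sub-models × 1000 evaluation rows ≈ 6.0 × 10^8 per-row binary scorings, versus 40 × 6 × 1,000,000 = 2.4 × 10^8 in the ~7-minute synthetic example, and LightGBM trees are more expensive to score than SDCA's linear model. So the class count alone would account for much of the gap.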

@shmoradims shmoradims removed the need info This issue needs more info before triage label Jun 5, 2019
@shmoradims shmoradims removed their assignment Jun 5, 2019
@harishsk harishsk added image Bugs related image datatype tasks lightgbm Bugs related lightgbm labels Apr 29, 2020