How to inspect OneVersusAll models #3701


Open · rauhs opened this issue May 10, 2019 · 3 comments

Labels: image (Bugs related image datatype tasks), lightgbm (Bugs related lightgbm), P1 (Priority of the issue for triage purpose: Needs to be fixed soon.)


rauhs commented May 10, 2019

Version: 1.0

Since 2b417bb made SubModelParameters private, there is no way to get at any of the sub-models, which is needed for feature importance.

How do I inspect OVA models? In particular, how do I get feature importance?

FWIW, using PFI for feature importance is not an option for me. It would take a day to run; it's 1000x slower than training any of my models.

@shmoradims shmoradims self-assigned this May 13, 2019
shmoradims commented

@rauhs making SubModelParameters public isn't useful by itself. Since its entries are typed as object, they need to be cast to their concrete types to be useful (e.g. Microsoft.ML.Calibrators.ParameterMixingCalibratedModelParameters<Microsoft.ML.Model.IPredictorWithFeatureWeights<float>, Microsoft.ML.Calibrators.ICalibrator>). To do that, at least 6 other internal/private interfaces and classes would need to be made public, so that route needs further discussion.
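
Purely for illustration, here is a sketch of what that inspection could look like if those members were public. It does not compile against the current release, since SubModelParameters and some of the types involved are internal, which is exactly the problem described above:

// Hypothetical sketch only: SubModelParameters and some of the types below
// are internal in the current release, so this does not compile today.
using Microsoft.ML.Calibrators;
using Microsoft.ML.Data;
using Microsoft.ML.Model;
using Microsoft.ML.Trainers;

public static class OvaInspection
{
    public static void PrintSubModelWeights(OneVersusAllModelParameters ova)
    {
        foreach (var sub in ova.SubModelParameters) // internal today
        {
            // Each entry is typed as object and has to be cast to its concrete type.
            var calibrated = (ParameterMixingCalibratedModelParameters<
                IPredictorWithFeatureWeights<float>, ICalibrator>)sub;

            // One weight per feature slot for this binary sub-model.
            var weights = default(VBuffer<float>);
            calibrated.SubModel.GetFeatureWeights(ref weights);
        }
    }
}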

Is your main motivation to find feature importance? If so, it's easier to address your PFI issue with OVA.

AFAIK, PFI runtime for OVA is a linear function of n_rows * n_features * n_classes. It should be very close to the PFI runtime for the underlying binary trainer multiplied by n_classes, because OVA has that many binary models that it needs to evaluate during PFI.

My PFI code below finishes in under 7 minutes for 1M rows, 40 features, and 6 classes on a very modest VM (specs below). What are the properties of your data such that it's taking a day to finish? What binary trainer are you using with OVA? Have you tried OVA with other binary trainers? Please share your code if possible.

[Screenshot: VM specs]

using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;

namespace Samples.Dynamic.Trainers.MulticlassClassification
{
    public static class OvaPfi
    {
        public static void Example()
        {
            var mlContext = new MLContext(seed: 0);
            var dataPoints = GenerateRandomDataPoints(1000*1000);
            var trainingData = mlContext.Data.LoadFromEnumerable(dataPoints);
            trainingData = mlContext.Transforms.Conversion.MapValueToKey("Label").Fit(trainingData).Transform(trainingData);
            var pipeline = mlContext.MulticlassClassification.Trainers.OneVersusAll(mlContext.BinaryClassification.Trainers.SdcaNonCalibrated());
            var model = pipeline.Fit(trainingData);

            var stopWatch = new System.Diagnostics.Stopwatch();
            stopWatch.Start();
            var x = mlContext.MulticlassClassification.PermutationFeatureImportance(model, trainingData);
            stopWatch.Stop();
            Console.WriteLine(stopWatch.Elapsed);
            
            // Output:
            // 00:06:29.0938075
        }

        private static IEnumerable<DataPoint> GenerateRandomDataPoints(int count, int seed = 0)
        {
            var random = new Random(seed);
            float randomFloat() => (float)(random.NextDouble() - 0.5);
            for (int i = 0; i < count; i++)
            {
                var label = random.Next(6);
                yield return new DataPoint
                {
                    Label = (uint)label,
                    // Create random features that are correlated with the label.
                    // The feature values are slightly increased by adding a constant multiple of label.
                    Features = Enumerable.Repeat(label, 40).Select(x => randomFloat() + label * 0.2f).ToArray()
                };
            }
        }

        private class DataPoint
        {
            public uint Label { get; set; }
            [VectorType(40)]
            public float[] Features { get; set; }
        }
    }
}
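
One practical mitigation (my suggestion, not something the thread requires): since the runtime is linear in the number of rows, you can run PFI on a subsample of the data, e.g. via mlContext.Data.TakeRows. A minimal sketch, reusing mlContext, model, and trainingData from the example above:

// Sketch: bound PFI runtime by evaluating it on a subsample of the data.
// Assumes `mlContext`, `model`, and `trainingData` from the example above.
var sample = mlContext.Data.TakeRows(trainingData, 100 * 1000); // first 100k rows
var pfiOnSample = mlContext.MulticlassClassification
    .PermutationFeatureImportance(model, sample);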

@shmoradims shmoradims added the need info This issue needs more info before triage label May 16, 2019

rauhs commented May 17, 2019

My hunch is that the Evaluate method for multiclass classifiers is actually super slow. That's why we're not using it at all and have implemented our own evaluation code, which is much faster (e.g. #744).
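
(For illustration of the idea only, this is not the code from #744: a hand-rolled micro-accuracy pass over the scored data can look like the sketch below, assuming a fitted model and key-typed Label and PredictedLabel columns.)

// Sketch of a minimal hand-rolled micro-accuracy computation; assumes a
// fitted `model` and a `testData` IDataView with a key-typed Label column.
private sealed class LabelPair
{
    public uint Label { get; set; }
    public uint PredictedLabel { get; set; }
}

private static double MicroAccuracy(MLContext mlContext, ITransformer model, IDataView testData)
{
    var scored = model.Transform(testData);
    long correct = 0, total = 0;
    foreach (var row in mlContext.Data.CreateEnumerable<LabelPair>(scored, reuseRowObject: true))
    {
        if (row.Label == row.PredictedLabel) correct++;
        total++;
    }
    return total == 0 ? 0 : (double)correct / total;
}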

We have hundreds of classes, sometimes even 2-3k classes.

This particular setup:

  • Learner: LightGBM Multiclass "objective=multiclassova"
  • Labels: 249
  • Features: 2416 + 5
  • Samples/Instances: 10_000

Log:

Number of categorical features 15 (516, 725, 23, 26, 6, 4, 1, 7, 5, 3, 27, 40, 33, 972, 28) (Sum: 2416)
Number of real valued features 5
2019-05-17 08:19:53,195 DEBUG: [Source=LightGBMMulticlass; Training with LightGBM, Kind=Info] LightGBM objective=multiclassova
2019-05-17 08:20:04,698 DEBUG: [Source=LightGBMMulticlass; Training with LightGBM, Kind=Info] Met early stopping, best iteration: 30, best score: 0.19
Finished training 15s. Data points: 10000, Labels 249, (CarrierId)
Timed Predict train set: 0.300888888888889ms (per iter)
Train acc: 0.9774445, 0.9915556, 0.994, 0.9942222, 0.9944444. Kappa: 0.9717428, 0.9880579, 0.9908474, 0.9905593, 0.9903494
Test  acc: 0.8124373, 0.8676028, 0.8866599, 0.89669, 0.9037111. Kappa: 0.7803715, 0.8311259, 0.8451638, 0.848962, 0.84884
2019-05-17 08:20:14,328 DEBUG: [Source=PermutationFeatureImportance; GetImportanceMetrics, Kind=Info] Number of slots: 2421
2019-05-17 08:20:14,356 DEBUG: [Source=PermutationFeatureImportance; GetImportanceMetrics, Kind=Info] Detected 1000 examples for evaluation.
PFI: 863s

So ~15min for PFI and only 15s for the training. And I've reduced our usual problem size here.

I've passed in the test set (1000 Samples) to the PFI call.

@ganik ganik added the P1 Priority of the issue for triage purpose: Needs to be fixed soon. label May 21, 2019
shmoradims commented

We need to do some profiling for this. The large number of classes is making it very slow.
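
Back-of-the-envelope, applying the linear cost model from my earlier comment to the run above: 2421 permuted slots × 249 binary sub-models × 1000 evaluation rows ≈ 6.0 × 10^8 per-row binary scorings, versus 40 × 6 × 1,000,000 = 2.4 × 10^8 in the ~7-minute synthetic example, and LightGBM trees are more expensive to score than SDCA's linear model. So the class count alone would account for much of the gap.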

@shmoradims shmoradims removed the need info This issue needs more info before triage label Jun 5, 2019
@shmoradims shmoradims removed their assignment Jun 5, 2019
@harishsk harishsk added image Bugs related image datatype tasks lightgbm Bugs related lightgbm labels Apr 29, 2020