Train binary classification with text label #2826

Closed
daholste opened this issue Mar 4, 2019 · 6 comments
Labels
  • API: Issues pertaining the friendly API
  • classification: Bugs related classification tasks
  • lightgbm: Bugs related lightgbm
  • P1: Priority of the issue for triage purpose: Needs to be fixed soon.
  • usability: Smoothing user interaction or experience

Comments

@daholste
Contributor

daholste commented Mar 4, 2019

@justinormont points out (https://github.com/dotnet/machinelearning-automl/issues/255):

Key type is needed for binary classification learners:

  • Dataset w/ text labels (as seen here)
  • Datasets w/ missing labels -- BL no longer supports NA (changed in dotnet/machinelearning#673)

When the "Label" column is text, calling

var pipeline = mlContext.Transforms.Conversion.MapValueToKey("Label", "Label");
var trainer = mlContext.BinaryClassification.Trainers.LightGbm(labelColumnName: "Label", featureColumnName: "Features");
var trainingPipeline = pipeline.Append(trainer);
var crossValidationResults = mlContext.BinaryClassification.CrossValidateNonCalibrated(trainingDataView, trainingPipeline, numFolds: 5, labelColumn: "Label");

results in the exception

System.ArgumentOutOfRangeException
  HResult=0x80131502
  Message=Schema mismatch for label column '': expected Bool, got Key<U4>
  Source=Microsoft.ML.Data
  StackTrace:
   at Microsoft.ML.Trainers.TrainerEstimatorBase`2.CheckLabelCompatible(Column labelCol)
   at Microsoft.ML.Trainers.TrainerEstimatorBase`2.CheckInputSchema(SchemaShape inputSchema)
   at Microsoft.ML.Trainers.TrainerEstimatorBase`2.GetOutputSchema(SchemaShape inputSchema)
   at Microsoft.ML.Data.EstimatorChain`1.GetOutputSchema(SchemaShape inputSchema)
   at Microsoft.ML.Data.EstimatorChain`1.Fit(IDataView input)
   at Microsoft.ML.TrainCatalogBase.<>c__DisplayClass7_0.<CrossValidateTrain>b__0(Int32 fold)
   at Microsoft.ML.TrainCatalogBase.CrossValidateTrain(IDataView data, IEstimator`1 estimator, Int32 numFolds, String samplingKeyColumn, Nullable`1 seed)
   at Microsoft.ML.BinaryClassificationCatalog.CrossValidateNonCalibrated(IDataView data, IEstimator`1 estimator, Int32 numFolds, String labelColumn, String samplingKeyColumn, Nullable`1 seed)
   at DogFruitNLP_14KB_735_rows_BinaryClassification.Program.BuildTrainEvaluateAndSaveModel(MLContext mlContext) in C:\AutoMLDotNet\bin\AnyCPU.Debug\mlnet\netcoreapp2.1\DogFruitNLP_14KB_735_rows_BinaryClassification\Program.cs:line 74

Would you have any recommendation for handling these kinds of scenarios?

@Ivanidzo4ka
Contributor

For now, the plan is as follows:
Binary classification will support only Boolean labels.
If your data contains missing values, load the label as float or text and either filter those rows out or create a mapping from the values to Boolean.
Float-to-Boolean conversion should start working after this PR: #2804

For text labels, I think we currently support 'True' and 'False' values in the text loader as Boolean values.
For anything else, like 'Positive', 'Negative', 'Cool', 'Not cool', you currently need to implement a custom mapping or a ValueMap.
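
The value-mapping route for arbitrary text labels could look roughly like the following. This is a minimal sketch, not the maintainers' recommendation: it assumes ML.NET 1.x (which postdates this thread), where `Conversion.MapValue` accepts a collection of key-value pairs. The classes `TextLabelRow` and `BoolLabelRow`, the helper `MapLabels`, the column name `BoolLabel`, and the choice of "Positive" as the true class are all hypothetical.

```csharp
using System.Collections.Generic;
using System.Linq;
using Microsoft.ML;

// Hypothetical input row: a text label with exactly two distinct values.
public class TextLabelRow
{
    public string Label { get; set; }
}

// Hypothetical output row, used only to read the mapped column back.
public class BoolLabelRow
{
    public bool BoolLabel { get; set; }
}

public static class TextLabelMapping
{
    // Assumption: "Positive" is the class that should become true.
    public static readonly Dictionary<string, bool> LabelMap =
        new Dictionary<string, bool>
        {
            ["Positive"] = true,
            ["Negative"] = false,
        };

    // Maps the text "Label" column to a new Boolean "BoolLabel" column
    // and returns the mapped values in row order.
    public static List<bool> MapLabels(MLContext mlContext, IEnumerable<string> labels)
    {
        var data = mlContext.Data.LoadFromEnumerable(
            labels.Select(l => new TextLabelRow { Label = l }));

        // Output column first, then the lookup table, then the input column.
        var pipeline = mlContext.Transforms.Conversion.MapValue(
            "BoolLabel", LabelMap, "Label");

        var transformed = pipeline.Fit(data).Transform(data);

        return mlContext.Data
            .CreateEnumerable<BoolLabelRow>(transformed, reuseRowObject: false)
            .Select(r => r.BoolLabel)
            .ToList();
    }
}
```

In a training pipeline, the mapped column would then be passed to the binary trainer as its label column (for example `labelColumnName: "BoolLabel"`), so the trainer sees a Boolean label rather than a key type.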

justinormont added the API and usability labels on Mar 4, 2019
@daholste
Contributor Author

daholste commented Mar 4, 2019

Thanks, @Ivanidzo4ka !

Float to boolean conversion should start work after this PR: #2804

Do you have any plans for key-to-Boolean conversion? This would help on our side.

@Ivanidzo4ka
Contributor

That can be quite tricky. We can convert a key back to its original type, but converting it to some other specific type feels somewhat weird. A key is basically a dictionary built at runtime. It doesn't make much sense to me to cast a dictionary, which can contain whatever you want, to Boolean.

Why do you need this conversion?

@daholste
Contributor Author

daholste commented Mar 4, 2019

If a dataset has a text label with only 2 values, we want to do something like:

mlContext.Transforms.Conversion.MapValueToKey("Label")
         .Append(mlContext.Transforms.Conversion.ConvertType("Label", outputKind: DataKind.Boolean))
         .Append(mlContext.BinaryClassification.Trainers.LightGbm())

I noticed that

mlContext.Transforms.Conversion.MapValueToKey("Label")
         .Append(mlContext.Transforms.Conversion.ConvertType("Label", outputKind: DataKind.Single))

converts a key type to a float? Is this correct?
If so, after your PR (#2804), could we do something like

mlContext.Transforms.Conversion.MapValueToKey("Label")
         .Append(mlContext.Transforms.Conversion.ConvertType("Label", outputKind: DataKind.Single))
         .Append(mlContext.Transforms.Conversion.ConvertType("Label", outputKind: DataKind.Boolean))
         .Append(mlContext.BinaryClassification.Trainers.LightGbm())

?
Does a better way come to mind to transform a text label to a Boolean form (that a binary classification trainer requires)?
Thanks for your time!

@rogancarr
Contributor

@daholste Perhaps a custom transform would be in order? You can specify the exact mapping you want. This would let you map user-supplied values to booleans. Like @Ivanidzo4ka said, it's not clear a priori what value(s) would map to true or false.

You can define a custom transform like this:

// Define a custom function.
Action<ClassWithKey, ClassWithBool> convertLabelToBoolean = (input, output) =>
{
    output.Label = ConversionLogic(input.Label);
    // Copy the rest over too.
};

// Create a pipeline to execute the custom function.
// A null contract name means the resulting transformer cannot be saved.
var pipeline = mlContext.Transforms.CustomMapping(convertLabelToBoolean, contractName: null);
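
For a text label specifically, the custom-transform idea above might be fleshed out as follows. This is a sketch under assumptions, not code from the thread: `RowWithTextLabel`, `RowWithBoolLabel`, `ConvertLabels`, the output column name `BoolLabel`, and the rule that "Cool" is the positive class are all hypothetical stand-ins for `ClassWithKey`, `ClassWithBool`, and `ConversionLogic`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.ML;

// Hypothetical input schema: the raw text label.
public class RowWithTextLabel
{
    public string Label { get; set; }
}

// Hypothetical output schema: the Boolean label a binary trainer expects.
public class RowWithBoolLabel
{
    public bool BoolLabel { get; set; }
}

public static class LabelConversion
{
    // Hypothetical conversion rule: "Cool" (any casing) is the positive class.
    public static bool ConversionLogic(string label) =>
        string.Equals(label, "Cool", StringComparison.OrdinalIgnoreCase);

    // Runs the custom mapping over the given labels and returns the
    // resulting Boolean column in row order.
    public static List<bool> ConvertLabels(MLContext mlContext, IEnumerable<string> labels)
    {
        var data = mlContext.Data.LoadFromEnumerable(
            labels.Select(l => new RowWithTextLabel { Label = l }));

        Action<RowWithTextLabel, RowWithBoolLabel> convertLabelToBoolean =
            (input, output) => output.BoolLabel = ConversionLogic(input.Label);

        // A null contract name is fine here because the model is never saved.
        var pipeline = mlContext.Transforms.CustomMapping(
            convertLabelToBoolean, contractName: null);

        var transformed = pipeline.Fit(data).Transform(data);

        return mlContext.Data
            .CreateEnumerable<RowWithBoolLabel>(transformed, reuseRowObject: false)
            .Select(r => r.BoolLabel)
            .ToList();
    }
}
```

Writing the result to a separate `BoolLabel` column, rather than overwriting `Label`, keeps the original text available for inspection; the trainer would then be pointed at `BoolLabel` as its label column.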

daholste changed the title from "Train binary classification with key type label" to "Train binary classification with text label" on Mar 19, 2019
ganik added the P1 label on May 23, 2019
harishsk added the classification and lightgbm labels on Apr 29, 2020
@frank-dong-ms-zz
Contributor

Closing this issue, as a suggestion has already been given and we have not heard back from the user for more than a year. Feel free to reopen if necessary.

ghost locked the conversation as resolved and limited it to collaborators on Mar 23, 2022

7 participants