Help utilizing multi-column vectors #2240

Closed
ghost opened this issue Jan 25, 2019 · 4 comments

Comments

@ghost

ghost commented Jan 25, 2019

System information

  • OS version/distro: Windows 10
  • .NET Version (eg., dotnet --info): 2.1.5

Issue

  • What did you do?
    In my migration to v0.8, I'm moving away from the legacy API. My data consists of over 400 columns. Previously, by manually mapping each column, I got these evaluation results:
RMS = 1.02567798627118
RSquared = 0.993830469856289

Now, I'm reading in the columns as multi-column vectors (note: column names have been obfuscated here):

TextLoader textLoader = mlContext.Data.TextReader(new TextLoader.Arguments()
{
    Column = new TextLoader.Column[] {
        new TextLoader.Column("NumericRelatedData", DataKind.R4, 0, 359),
        new TextLoader.Column("CategoricalRelatedData", DataKind.Text, 360, 407),
        new TextLoader.Column("SpecificData1", DataKind.Text, 408),
        new TextLoader.Column("SpecificData2", DataKind.Text, 409),
        new TextLoader.Column("SpecificData3", DataKind.R4, 410),
        new TextLoader.Column("Label", DataKind.R4, 411)
    },
    HasHeader = true,
    Separator = ","
});
  • What happened?
    I got vastly different results from my model evaluation:
*       L1 Loss:        1.543
*       L2 Loss:        182.015
*       RMS:            13.491
*       Loss Function:  182.015
*       R-squared:      -0.067

Additionally, the model is not explorable, because the class I have to represent a single prediction has each column mapped to a different field, but those fields cannot be identified in the model.

  • What did you expect?
    I expected identical metrics, since I'm using the same trainer (FastTree)

It's obvious that I'm not understanding how the multi-column vectors are supposed to work.

My question is primarily: Do I have to continue to map each column in the TextLoader (and thus all subsequent uses of it in transformers) to get the results I'd like?

Source code / logs

I'm asking a very similar question in the documentation repo

@Ivanidzo4ka
Contributor

If you had something like float[400] Features, and instead you have float[300] PartOne, float[90] PartTwo, and float[10] PartThree, and you define your Features column as Concatenate("Features", "PartOne", "PartTwo", "PartThree"), we should produce the same results.
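
As a rough illustration of that equivalence, here is a minimal sketch in plain C# that does not use ML.NET at all. The `ConcatSketch` class, its `Concatenate` helper, and the 0..399 sample values are hypothetical stand-ins; the sketch only assumes that concatenation joins the parts in the order they are listed.

```csharp
using System;
using System.Linq;

public static class ConcatSketch
{
    // Hypothetical helper: joins the parts, in order, into one flat vector.
    public static float[] Concatenate(params float[][] parts) =>
        parts.SelectMany(p => p).ToArray();

    public static void Main()
    {
        // Hypothetical 400-slot row with values 0, 1, ..., 399.
        float[] features = Enumerable.Range(0, 400).Select(i => (float)i).ToArray();

        // The same row split into three vector columns of 300, 90, and 10 slots.
        float[] partOne = features.Take(300).ToArray();
        float[] partTwo = features.Skip(300).Take(90).ToArray();
        float[] partThree = features.Skip(390).ToArray();

        float[] rebuilt = Concatenate(partOne, partTwo, partThree);

        Console.WriteLine(rebuilt.Length);                  // 400
        Console.WriteLine(rebuilt.SequenceEqual(features)); // True
    }
}
```

If the parts are listed in a different order, the slots land in different positions, so the two pipelines only match when the concatenation order and the set of columns are the same.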

But from what I see in your snippet, you don't have such a case. You have Text features which you probably transform somehow, and I don't see the rest of your pipeline, so it's quite possible the pipelines are different.

Any chance you can share the old and new pipelines?

@ghost
Author

ghost commented Jan 31, 2019

Old Pipeline:

public static async Task TrainAsync(string dataPath, string modelPath) {
  var pipeline = new LearningPipeline() {
    new TextLoader(dataPath).CreateFrom<InputData>(useHeader: true, separator: ','),
    new ColumnCopier(("Label", "Label")),
    // Convert string columns into numerics
    new CategoricalOneHotVectorizer("SpecificData1",
      "SpecificData2",
      "CategoricalRelatedData0",
      "CategoricalRelatedData1",
      "CategoricalRelatedData2",
      ...
      "CategoricalRelatedDataN"),
    // Add all relevant feature columns
    new ColumnConcatenator("Features",
      "SpecificData1",
      "SpecificData2",
      "SpecificData3",
      "Label",
      "CategoricalRelatedData0",
      "NumericRelatedData0.0",
      "NumericRelatedData0.1",
      "NumericRelatedData0.2",
      "NumericRelatedData0.3",
      ...
      "NumericRelatedData0.23",
      "CategoricalRelatedData1",
      "NumericRelatedData1.0",
      ...
      "NumericRelatedData23.23"
    ),
    new FastTreeRegressor()
    };

  var model = pipeline.Train<InputData, Prediction>();

  // Store the model
  await model.WriteAsync(modelPath);
}

New Pipeline:

var dataProcessPipeline = mlContext.Transforms.CopyColumns("Label", "Label")
  .Append(mlContext.Transforms.Categorical.OneHotEncoding("CategoricalData", "CategoricalDataEncoded"))
  .Append(mlContext.Transforms.Categorical.OneHotEncoding("SpecificData1", "SpecificData1Encoded"))
  .Append(mlContext.Transforms.Categorical.OneHotEncoding("SpecificData2", "SpecificData2Encoded"))
  .Append(mlContext.Transforms.Concatenate(outputColumn: "Features", "NumericData", "CategoricalDataEncoded", "SpecificData1Encoded", "SpecificData2Encoded", "SpecificData3"));

var trainer = mlContext.Regression.Trainers.FastTree();
var trainingPipeline = dataProcessPipeline.Append(trainer);

Hopefully this helps - let me know if you need any more information.

@ghost
Author

ghost commented Feb 20, 2019

@Ivanidzo4ka Do you have any suggestions on where to go from here? Thanks!

@Lynx1820 added the P1 label (Priority of the issue for triage purpose: Needs to be fixed soon.) on Jan 10, 2020
@najeeb-kazmi
Member

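@Fedoranimus I see that in your old pipeline you are concatenating the Label column into your Features column, so the old model was trained with the value it was supposed to predict as one of its inputs. The new pipeline leaves Label out of Features, and this is the most likely explanation for the big change in RMS.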

Please feel free to reopen if this is still an issue.
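
To make the effect of a label inside the feature vector concrete, here is a minimal sketch in plain C# without ML.NET. The `LeakageSketch` class, the four label values, and the two "models" are hypothetical and not taken from the issue's data; the metric formulas are the standard definitions of RMS error and R-squared, and are assumed rather than confirmed to match what the evaluator in the thread reports.

```csharp
using System;
using System.Linq;

public static class LeakageSketch
{
    // Root mean squared error between labels and predictions.
    public static double Rms(double[] y, double[] yHat) =>
        Math.Sqrt(y.Zip(yHat, (a, b) => (a - b) * (a - b)).Average());

    // R-squared: 1 - (residual sum of squares / total sum of squares).
    public static double RSquared(double[] y, double[] yHat)
    {
        double mean = y.Average();
        double ssRes = y.Zip(yHat, (a, b) => (a - b) * (a - b)).Sum();
        double ssTot = y.Sum(a => (a - mean) * (a - mean));
        return 1.0 - ssRes / ssTot;
    }

    public static void Main()
    {
        // Hypothetical labels, chosen so the mean is exactly 10.
        double[] labels = { 3.0, 7.0, 11.0, 19.0 };

        // "Leaky" model: Label is one of the feature slots, so it is simply copied out.
        double[] leaky = labels.ToArray();

        // Baseline with no useful signal: always predict the mean label.
        double[] baseline = labels.Select(_ => labels.Average()).ToArray();

        Console.WriteLine(Rms(labels, leaky));         // 0
        Console.WriteLine(RSquared(labels, leaky));    // 1
        Console.WriteLine(RSquared(labels, baseline)); // 0
    }
}
```

Under these definitions, an R-squared close to 1 is what a model with access to the label would be expected to show, while a value at or below 0 means the model does no better than always predicting the mean.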

@najeeb-kazmi self-assigned this on Jan 30, 2020
@ghost locked as resolved and limited conversation to collaborators on Mar 25, 2022