Help utilizing multi-column vectors #2240

Closed
@ghost

Description

System information

  • OS version/distro: Windows 10
  • .NET Version (eg., dotnet --info): 2.1.5

Issue

  • What did you do?
    While migrating to v0.8, I'm moving away from the legacy API. My data consists of over 400 columns. Previously, with each column mapped manually, evaluation produced these results:
RMS = 1.02567798627118
RSquared = 0.993830469856289

Now, I'm reading in the columns as multi-column vectors (note: column names have been obfuscated here):

TextLoader textLoader = mlContext.Data.TextReader(new TextLoader.Arguments()
{
    Column = new TextLoader.Column[]
    {
        // Source columns 0-359 are loaded as a single vector column of floats.
        new TextLoader.Column("NumericRelatedData", DataKind.R4, 0, 359),
        // Source columns 360-407 are loaded as a single vector column of text.
        new TextLoader.Column("CategoricalRelatedData", DataKind.Text, 360, 407),
        // The remaining source columns are loaded individually.
        new TextLoader.Column("SpecificData1", DataKind.Text, 408),
        new TextLoader.Column("SpecificData2", DataKind.Text, 409),
        new TextLoader.Column("SpecificData3", DataKind.R4, 410),
        new TextLoader.Column("Label", DataKind.R4, 411)
    },
    HasHeader = true,
    Separator = ","
});
  • What happened?
    I got vastly different results from my model evaluation:
*       L1 Loss:        1.543
*       L2 Loss:        182.015
*       RMS:            13.491
*       Loss Function:  182.015
*       R-squared:      -0.067

Additionally, the model is not explorable: the class I use to represent a single prediction maps each column to its own field, and those fields cannot be identified in the model.

  • What did you expect?
    I expected identical metrics, since I'm using the same trainer (FastTree).

Clearly, I don't understand how multi-column vectors are supposed to work.

My question is primarily: Do I have to continue to map each column in the TextLoader (and thus all subsequent uses of it in transformers) to get the results I'd like?

Source code / logs

I'm asking a very similar question in the documentation repo

Labels

P1 (Priority of the issue for triage purpose: Needs to be fixed soon.)
