Help utilizing multi-column vectors #2240
Comments
If you had something like […] But from what I see in your snippet, you don't have such a case: you have Text features, which you probably transform somehow, and I don't see the rest of your pipeline, so it's quite possible the pipelines are different. Any chance you can share the old and new pipelines?
Old Pipeline:
public static async Task TrainAsync(string dataPath, string modelPath) {
    var pipeline = new LearningPipeline() {
        new TextLoader(dataPath).CreateFrom<InputData>(useHeader: true, separator: ','),
        new ColumnCopier(("Label", "Label")),
        // Convert string columns into numerics
        new CategoricalOneHotVectorizer("SpecificData1",
            "SpecificData2",
            "CategoricalRelatedData0",
            "CategoricalRelatedData1",
            "CategoricalRelatedData2",
            ...
            "CategoricalRelatedDataN"),
        // Add all relevant feature columns
        new ColumnConcatenator("Features",
            "SpecificData1",
            "SpecificData2",
            "SpecificData3",
            "Label",
            "CategoricalRelatedData0",
            "NumericRelatedData0.0",
            "NumericRelatedData0.1",
            "NumericRelatedData0.2",
            "NumericRelatedData0.3",
            ...
            "NumericRelatedData0.23",
            "CategoricalRelatedData1",
            "NumericRelatedData1.0",
            ...
            "NumericRelatedData23.23"
        ),
        new FastTreeRegressor()
    };
    var model = pipeline.Train<InputData, Prediction>();
    // Store the model
    await model.WriteAsync(modelPath);
}
New Pipeline:
var dataProcessPipeline = mlContext.Transforms.CopyColumns("Label", "Label")
    .Append(mlContext.Transforms.Categorical.OneHotEncoding("CategoricalData", "CategoricalDataEncoded"))
    .Append(mlContext.Transforms.Categorical.OneHotEncoding("SpecificData1", "SpecificData1Encoded"))
    .Append(mlContext.Transforms.Categorical.OneHotEncoding("SpecificData2", "SpecificData2Encoded"))
    .Append(mlContext.Transforms.Concatenate(outputColumn: "Features", "NumericData", "CategoricalDataEncoded", "SpecificData1Encoded", "SpecificData2Encoded", "SpecificData3"));
var trainer = mlContext.Regression.Trainers.FastTree();
var trainingPipeline = dataProcessPipeline.Append(trainer);
Hopefully this helps - let me know if you need any more information.
@Ivanidzo4ka Do you have any suggestions on where to go from here? Thanks!
@Fedoranimus I see that in your old pipeline you are concatenating the […] Please feel free to reopen if this is still an issue.
System information
Issue
In my migration to v0.8, I'm moving away from the legacy API, and my data consists of over 400 columns. Previously, by manually mapping each column, I got the following evaluation results:
Now, I'm reading in the columns as multi-column vectors (note: column names have been obfuscated here):
I got vastly different results from my model evaluation:
Additionally, the model is not explorable, because the class I have to represent a single prediction has each column mapped to a different field, but those fields cannot be identified in the model.
I expected identical metrics, since I'm using the same trainer (FastTree). It's obvious that I'm not understanding how the multi-column vectors are supposed to work.
My question is primarily: Do I have to continue to map each column in the TextLoader (and thus all subsequent uses of it in transformers) to get the results I'd like?
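For readers unfamiliar with the term, here is a minimal sketch of what reading a range of columns as one vector means for a single row. It is plain C# and does not use ML.NET; `VectorColumnSketch`, `LoadRange`, and `SourceColumn` are hypothetical names introduced only for this illustration, and the sample row is made up. It also shows why individual fields lose their names: inside the vector, a value is identified only by its slot index, which can be mapped back to a source column by adding the range start.

```csharp
using System;
using System.Globalization;
using System.Linq;

// Hypothetical helper, NOT part of ML.NET: illustrates what "loading a range
// of columns as one vector" means for a single comma-separated row.
public static class VectorColumnSketch
{
    // Parses columns [first, last] (inclusive, zero-based) of a row into one float vector.
    // Assumes a simple CSV row with no quoted or escaped fields.
    public static float[] LoadRange(string row, int first, int last)
    {
        string[] cells = row.Split(',');
        return cells
            .Skip(first)
            .Take(last - first + 1)
            .Select(c => float.Parse(c, CultureInfo.InvariantCulture))
            .ToArray();
    }

    // Maps a slot inside the vector back to the source column it was read from.
    public static int SourceColumn(int first, int slot) => first + slot;

    public static void Main()
    {
        // Made-up data: column 0 is the label, columns 1..4 are numeric features.
        string row = "10.5,1,2,3,4";
        float[] numericData = LoadRange(row, 1, 4);

        Console.WriteLine(numericData.Length);   // 4
        Console.WriteLine(numericData[2]);       // 3
        Console.WriteLine(SourceColumn(1, 2));   // 3
    }
}
```

Under this view, a vector column carries the same values as the individually mapped columns, so any difference in metrics would have to come from which columns end up in "Features" and how they are transformed, not from the grouping itself.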
Source code / logs
I'm asking a very similar question in the documentation repo