Adding a sample for LightGbm ranking #2704
Codecov Report
@@ Coverage Diff @@
## master #2704 +/- ##
=========================================
Coverage ? 71.56%
=========================================
Files ? 805
Lines ? 142039
Branches ? 16122
=========================================
Hits ? 101655
Misses ? 35949
Partials ? 4435
Found while fixing dotnet#689, moved to separate commit.
…et rid of trivial transformWrapper (dotnet#2701)
* Move MetadataBuilder to be DataViewSchema.Metadata.Builder.
* Move SchemaBuilder to DataViewSchema.Builder.
* Rename `GetMetadata` and `GetSchema` to `ToMetadata` and `ToSchema` to follow the immutable collections pattern (and StringBuilder).
Working towards dotnet#2297
…#2646) * Adding functional tests for all training and evaluation tasks
docs/samples/Microsoft.ML.Samples/Dynamic/Trainers/Ranking/LightGBMRankingWithOptions.cs
{
    var fileName = "MSLRWeb10KTrain720kRows.tsv";
    if (!File.Exists(fileName))
        Download("https://tlcresources.blob.core.windows.net/datasets/MSLR-WEB10K/MSLR-WEB10K_Fold1.TRAIN.500MB_720k-rows.tsv", fileName);
https://tlcresources.blob.core.windows.net/datasets/MSLR-WEB10K/MSLR-WEB10K_Fold1.TRAIN.500MB_720k-rows.tsv
This thing is 500 MB. Why are we using it?
@Ivanidzo4ka This is the only actual ranking dataset we have. I know we use the Adult dataset to do some ranking tests with a fake group id column, but I'm not going to add it to our ranking samples. It will only invite confused questions as to why we are using column X as group id, why we are doing ranking on a binary classification dataset, etc. I think it's best to avoid all of that and show an actual ranking example.
I can either switch to using the validation set for this example, which is 160MB, or make a smaller subset of the 500MB dataset. What do you think?
Can we just generate one?
160 MB is too much.
You could generate a dataset with 10 features and 10 different groups, where the label depends on the value in feature_x if x is the value in the group column.
Or, since this is an example for LightGBM, you can use the LightGBM lambdarank dataset: https://github.com/Microsoft/LightGBM/tree/master/examples/lambdarank
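A minimal sketch of the generated dataset suggested above, assuming 10 groups, 10 features, and a label derived from the feature whose index matches the group id. The `RankingRow` type, the `SyntheticRankingData` class, the row count per group, and the 0 to 4 relevance scale are all assumptions made for illustration; none of them come from the PR.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical row type for illustration; not an ML.NET type.
public sealed class RankingRow
{
    public float Label;
    public uint GroupId;
    public float[] Features;
}

public static class SyntheticRankingData
{
    // Builds groupCount groups of rowsPerGroup rows with featureCount random features each.
    // For a row in group g, the label is taken from feature number (g % featureCount),
    // bucketed into an assumed relevance scale of 0..4.
    public static List<RankingRow> Generate(
        int groupCount = 10, int rowsPerGroup = 20, int featureCount = 10, int seed = 1)
    {
        var random = new Random(seed);
        var rows = new List<RankingRow>(groupCount * rowsPerGroup);

        for (int g = 0; g < groupCount; g++)
        {
            for (int r = 0; r < rowsPerGroup; r++)
            {
                var features = new float[featureCount];
                for (int f = 0; f < featureCount; f++)
                    features[f] = (float)random.NextDouble();

                // The feature that drives the label is selected by the group id.
                float driver = features[g % featureCount];
                float label = Math.Min(4, (int)(driver * 5));

                rows.Add(new RankingRow { Label = label, GroupId = (uint)g, Features = features });
            }
        }

        return rows;
    }
}
```

Calling `SyntheticRankingData.Generate()` with the defaults yields 200 rows (10 groups of 20). A fixed seed keeps the generated rows reproducible between runs.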
Hmm, that's not a very easy-to-follow example. There is no group id in the data, nor does the .conf file specify anything about it. It's very unclear how that is being handled or whether the ranker is being trained with one row per group (and if so, why). I'll just make a small sample of the MSLR dataset, something like 10,000 rows, ~7MB.
It specifies a query file, which contains a list of numbers. The first number is the number of examples in the first group, the second number is the number of examples in the second group, and so on.
The same goes for test.
My main concern was whether the MSLR dataset is public, but it looks like it became public quite a while ago: https://www.microsoft.com/en-us/research/project/mslr/
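The query-file convention described above (one count per group, in row order) can be turned into an explicit group id per row. This is a rough sketch under that reading; `QueryFile` and `ExpandCounts` are hypothetical names, and parsing the file itself is left out.

```csharp
using System;
using System.Collections.Generic;

public static class QueryFile
{
    // Expands per-group row counts into one group id per data row, in file order.
    // Group ids are assigned as 0, 1, 2, ... in the order the counts appear.
    public static uint[] ExpandCounts(IReadOnlyList<int> counts)
    {
        var ids = new List<uint>();
        for (int g = 0; g < counts.Count; g++)
        {
            if (counts[g] < 0)
                throw new ArgumentException("Group sizes must be non-negative.", nameof(counts));

            for (int i = 0; i < counts[g]; i++)
                ids.Add((uint)g);
        }

        return ids.ToArray();
    }
}
```

For example, the counts 2, 3, 1 expand to the group ids 0, 0, 1, 1, 1, 2, which could then be attached to the data rows as a GroupId column.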
Yes, and we also have approval to use this in our samples with attribution information in the [README](https://github.com/dotnet/machinelearning/blob/master/test/data/README.md).
var data = reader.Read(dataFile);

// Create the featurization pipeline. First, hash the GroupId column.
var pipeline = mlContext.Transforms.Conversion.Hash("GroupId")
mlContext.Transforms.Conversion.Hash("GroupId")
Doesn't TrainTestSplit take care of that?
No, GroupId needs to be converted to Key first.
private void EnsureGroupPreservationColumn(ref IDataView data, ref string samplingKeyColumn, uint? seed = null)
Are you sure?
I mean, I get that error when I try this pipeline without hashing GroupId
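For context on why the group column matters in a split: the usual goal is that every row of a query group lands on the same side of the train/test boundary. The sketch below illustrates that idea in plain C# by hashing the group id. It is not ML.NET's implementation; `GroupSplit`, its methods, and the integer mixing function are stand-ins chosen for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class GroupSplit
{
    // Maps a group id to a value in [0, 1) with a simple integer mix.
    // This is an assumed stand-in hash, not the one ML.NET uses.
    private static double ToUnitInterval(uint groupId, uint seed)
    {
        unchecked
        {
            uint h = groupId ^ seed;
            h ^= h >> 16;
            h *= 0x85ebca6b;
            h ^= h >> 13;
            h *= 0xc2b2ae35;
            h ^= h >> 16;
            return h / 4294967296.0;
        }
    }

    // The decision depends only on the group id, so rows that share
    // a group id always receive the same answer.
    public static bool IsTestGroup(uint groupId, double testFraction, uint seed = 0)
        => ToUnitInterval(groupId, seed) < testFraction;

    // Splits rows so that each group ends up entirely in train or entirely in test.
    public static (List<T> Train, List<T> Test) Split<T>(
        IEnumerable<T> rows, Func<T, uint> groupOf, double testFraction, uint seed = 0)
    {
        var train = new List<T>();
        var test = new List<T>();

        foreach (var row in rows)
        {
            if (IsTestGroup(groupOf(row), testFraction, seed))
                test.Add(row);
            else
                train.Add(row);
        }

        return (train, test);
    }
}
```

Because the split is decided per group rather than per row, the realized test fraction only approximates `testFraction`, especially when there are few groups or the groups differ a lot in size.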
* local tests run fine
* made fixes for (feature, label) as well
* update cookbook md file
* Hide the uses of DataKind in TypeConverting
* Hide DataKind used in TextLoader
* Internalize-best-friend DataKind
* DataKind ---> InternalDataKind
* ScalarType ---> DataKind (massive renaming)
* Address comments
* Address comments
* Address comments
* Make R4 as default
* Ok. I updated entry point...
* Sync with new things from master
* Address comments
…nt (dotnet#2688)
* Added samples & docs for BinaryClassification.StochasticGradientDescent, plus a bunch of typo fixing.
* Addressed PR comments.
* Mentioned Hogwild
* Updates to exampleWeightColumnName.
* Fixed trailing whitespaces.
* Make DataViewRowId not act like a number.
  - Remove it from the NumberDataViewType.
  - Remove any method/operator that makes it feel like a number.
Working towards dotnet#2297
Replacing PR #2650 as I messed up commit history there.
Fixes #2530
Fixes #776