Adding a sample for LightGbm ranking #2704

Closed
wants to merge 26 commits into from

najeeb-kazmi
Member

Replacing PR #2650 as I messed up commit history there.

Fixes #2530
Fixes #776

codecov bot commented Feb 23, 2019

Codecov Report

❗ No coverage uploaded for pull request base (master@44c3113).
The diff coverage is 0%.

@@            Coverage Diff            @@
##             master    #2704   +/-   ##
=========================================
  Coverage          ?   71.56%           
=========================================
  Files             ?      805           
  Lines             ?   142039           
  Branches          ?    16122           
=========================================
  Hits              ?   101655           
  Misses            ?    35949           
  Partials          ?     4435

| Flag | Coverage Δ |
|---|---|
| #Debug | `71.56% <0%> (?)` |
| #production | `67.85% <0%> (?)` |
| #test | `85.73% <ø> (?)` |

| Impacted Files | Coverage Δ |
|---|---|
| ...rosoft.ML.Data/Evaluators/Metrics/RankerMetrics.cs | `90.9% <ø> (ø)` |
| src/Microsoft.ML.SamplesUtils/ConsoleUtils.cs | `0% <0%> (ø)` |
| ...c/Microsoft.ML.SamplesUtils/SamplesDatasetUtils.cs | `25.19% <0%> (ø)` |

singlis and others added 7 commits February 23, 2019 08:24
* Move MetadataBuilder to be DataViewSchema.Metadata.Builder.

* Move SchemaBuilder to DataViewSchema.Builder.

* Rename `GetMetadata` and `GetSchema` to `ToMetadata` and `ToSchema` to follow the immutable collections pattern (and StringBuilder).

Working towards dotnet#2297
…#2646)

* Adding functional tests for all training and evaluation tasks
{
var fileName = "MSLRWeb10KTrain720kRows.tsv";
if (!File.Exists(fileName))
Download("https://tlcresources.blob.core.windows.net/datasets/MSLR-WEB10K/MSLR-WEB10K_Fold1.TRAIN.500MB_720k-rows.tsv", fileName);
Contributor

Member Author

@Ivanidzo4ka This is the only actual ranking dataset we have. I know we use the Adult dataset for some ranking tests with a fake group ID column, but I'm not going to add it to our ranking samples. It would only invite confused questions about why we are using column X as the group ID, why we are doing ranking on a binary classification dataset, and so on. I think it's best to avoid all of that and show an actual ranking example.

I can either switch to using the validation set for this example, which is 160MB, or make a smaller subset of the 500MB dataset. What do you think?

Contributor

Can we just generate one?
160 MB is too much.
You could generate a dataset with 10 features and 10 different groups, where the label depends on the value in feature_x, with x being the value in the group column.
Or, since this example is for LightGBM, you can use the LightGBM dataset at https://github.com/Microsoft/LightGBM/tree/master/examples/lambdarank
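
A minimal sketch of the generated dataset suggested above, written in plain Python rather than the sample's C# so it stays independent of any ML.NET API. The function name, the row layout, and the bucketing of the label into grades 0..4 are assumptions made for illustration; the comment only asks for 10 features, 10 groups, and a label driven by the feature whose index equals the group value.

```python
import random


def make_ranking_rows(n_groups=10, rows_per_group=20, n_features=10, seed=0):
    """Return (label, group_id, features) tuples for a toy ranking dataset."""
    rng = random.Random(seed)  # fixed seed so the toy data is reproducible
    rows = []
    for group_id in range(n_groups):
        for _ in range(rows_per_group):
            features = [rng.random() for _ in range(n_features)]
            # The label depends only on feature_x, where x is the group value.
            # Bucketing into grades 0..4 is an illustrative assumption.
            label = min(4, int(features[group_id % n_features] * 5))
            rows.append((label, group_id, features))
    return rows


rows = make_ranking_rows()
```

Because each group ranks by a different feature, a ranker has to use the group structure to do well, which is what makes a toy set like this usable for a ranking test.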

Member Author

Hmm, that's not a very easy-to-follow example. There is no group ID in the data, nor does the .conf file specify anything about it. It's very unclear how that is being handled or whether the ranker is being trained with one row per group (and if so, why). I'll just make a small sample of the MSLR dataset, something like 10,000 rows, ~7MB.
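
One way such a sample could be cut, sketched in Python under the assumption that rows of the same query are contiguous in the file: keep leading groups until the row budget is reached, so no query group is split. The helper name and the tuple layout are hypothetical, and this is not necessarily how the sample in this PR was produced.

```python
from itertools import groupby


def take_whole_groups(rows, max_rows, group_of=lambda row: row[1]):
    """Keep leading rows up to max_rows without cutting a group in half.

    Assumes rows belonging to the same group are adjacent.
    """
    kept = []
    for _, members in groupby(rows, key=group_of):
        members = list(members)
        if len(kept) + len(members) > max_rows:
            break  # the next group would exceed the budget, so stop here
        kept.extend(members)
    return kept


# Toy input: (label, group_id) pairs, already ordered by group.
toy_rows = [(0, "q1"), (1, "q1"), (0, "q2"), (1, "q2"), (2, "q2"), (0, "q3")]
sample = take_whole_groups(toy_rows, max_rows=4)
```

With a budget of 4, only the two `q1` rows are kept, because adding the three `q2` rows would exceed it.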

Contributor

It specifies a query file, which contains a list of numbers. The first number is the number of examples in the first group, the second number is the number of examples in the second group, and so on.
The same goes for test.
My main concern was whether the MSLR dataset is public, but it looks like it became public quite a while ago: https://www.microsoft.com/en-us/research/project/mslr/
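
The query-file layout described above can be turned into an explicit per-row group ID column with a few lines. This is a Python sketch; the function name and the inline example counts are made up for illustration.

```python
def group_ids_from_query_counts(counts):
    """Expand per-group row counts into one group id per data row."""
    ids = []
    for group_id, count in enumerate(counts):
        ids.extend([group_id] * count)
    return ids


# Stand-in for a query file with one count per line: 3 rows, then 2, then 4.
query_file_text = "3\n2\n4\n"
counts = [int(line) for line in query_file_text.split()]
row_group_ids = group_ids_from_query_counts(counts)
# row_group_ids == [0, 0, 0, 1, 1, 2, 2, 2, 2]
```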

Member Author

Yes, and we also have approval to use this in our samples with attribution information in the [README](https://github.com/dotnet/machinelearning/blob/master/test/data/README.md).

var data = reader.Read(dataFile);

// Create the featurization pipeline. First, hash the GroupId column.
var pipeline = mlContext.Transforms.Conversion.Hash("GroupId")
Contributor

mlContext.Transforms.Conversion.Hash("GroupId")

Doesn't TrainTestSplit take care of that?

Member Author

No, GroupId needs to be converted to Key first.

Contributor

private void EnsureGroupPreservationColumn(ref IDataView data, ref string samplingKeyColumn, uint? seed = null)

Are you sure?

Member Author

I mean, I get that error when I try this pipeline without hashing GroupId
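
For readers following this exchange, the underlying idea of a group-preserving split can be sketched independently of ML.NET: hash each group ID to a stable bucket and send every row of that group to the same side. This Python sketch is an illustration only. It is not the implementation behind `TrainTestSplit` or `EnsureGroupPreservationColumn`, and the function name, the CRC32 hash, and the row layout are assumptions.

```python
import zlib


def split_by_group(rows, test_fraction=0.2, group_of=lambda row: row[0]):
    """Split rows into (train, test) so that no group spans both sides."""
    train, test = [], []
    for row in rows:
        # crc32 is stable across runs; Python's built-in hash() of a str
        # is salted per process and would not give a reproducible split.
        digest = zlib.crc32(str(group_of(row)).encode("utf-8"))
        bucket = digest / 2**32  # map the 32-bit hash into [0, 1)
        (test if bucket < test_fraction else train).append(row)
    return train, test


# Toy input: (group_id, row_index) pairs, 50 groups with 4 rows each.
toy_rows = [(f"q{g}", i) for g in range(50) for i in range(4)]
train_rows, test_rows = split_by_group(toy_rows)
```

Since the side is decided from the group ID alone, the fraction of rows that lands in the test set only approximates `test_fraction`.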

abgoswam and others added 10 commits February 25, 2019 18:37
* local tests run fine

* made fixes for (feature, label) as well

* update cookbook md file
* Hide the uses of DataKind in TypeConverting

* Hide DataKind used in TextLoader

* Internalize-best-friend DataKind

* DataKind ---> InternalDataKind

* ScalarType ---> DataKind (massive renaming)

* Address comments

* Address comments

* Address comments

* Make R4 as default

* Ok. I updated entry point...

* Sync with new things from master

* Address comments
…nt (dotnet#2688)

* Added samples & docs for BinaryClassification.StochasticGradientDescent, plus a bunch of typo fixing.

* Addressed PR comments.

* Mentioned Hogwild

* Updates to exampleWeightColumnName.

* Fixed trailing whitespaces.
* Make DataViewRowId not act like a number.

- Remove it from the NumberDataViewType.
- Remove any method/operator that makes it feel like a number.

Working towards dotnet#2297
@ghost ghost locked as resolved and limited conversation to collaborators Mar 24, 2022