Adding a sample for LightGbm ranking #2704

najeeb-kazmi · 2019-02-23T03:59:49Z

Replacing PR #2650 as I messed up commit history there.

Adds a sample for LightGbm ranking.
Cleans up namespaces in Microsoft.ML.Samples project.
Addresses feedback from first round of Adding a sample for LightGbm Ranking #2650.

codecov · 2019-02-23T04:49:10Z

Codecov Report

❗ No coverage uploaded for pull request base (master@44c3113).
The diff coverage is 0%.

@@            Coverage Diff            @@
##             master    #2704   +/-   ##
=========================================
  Coverage          ?   71.56%           
=========================================
  Files             ?      805           
  Lines             ?   142039           
  Branches          ?    16122           
=========================================
  Hits              ?   101655           
  Misses            ?    35949           
  Partials          ?     4435

| Flag | Coverage Δ |
|---|---|
| #Debug | `71.56% <0%> (?)` |
| #production | `67.85% <0%> (?)` |
| #test | `85.73% <ø> (?)` |

| Impacted Files | Coverage Δ |
|---|---|
| ...rosoft.ML.Data/Evaluators/Metrics/RankerMetrics.cs | `90.9% <ø> (ø)` |
| src/Microsoft.ML.SamplesUtils/ConsoleUtils.cs | `0% <0%> (ø)` |
| ...c/Microsoft.ML.SamplesUtils/SamplesDatasetUtils.cs | `25.19% <0%> (ø)` |

* Move MetadataBuilder to be DataViewSchema.Metadata.Builder. * Move SchemaBuilder to DataViewSchema.Builder. * Rename `GetMetadata` and `GetSchema` to `ToMetadata` and `ToSchema` to follow the immutable collections pattern (and StringBuilder). Working towards dotnet#2297

docs/samples/Microsoft.ML.Samples/Dynamic/Trainers/Ranking/LightGBMRankingWithOptions.cs

Ivanidzo4ka · 2019-02-25T18:32:55Z

src/Microsoft.ML.SamplesUtils/SamplesDatasetUtils.cs

+        {
+            var fileName = "MSLRWeb10KTrain720kRows.tsv";
+            if (!File.Exists(fileName))
+                Download("https://tlcresources.blob.core.windows.net/datasets/MSLR-WEB10K/MSLR-WEB10K_Fold1.TRAIN.500MB_720k-rows.tsv", fileName);


https://tlcresources.blob.core.windows.net/datasets/MSLR-WEB10K/MSLR-WEB10K_Fold1.TRAIN.500MB_720k-rows.tsv

This file is 500 MB. Why are we using it?

@Ivanidzo4ka This is the only actual ranking dataset we have. I know we use the Adult dataset to do some ranking tests with a fake group id column, but I'm not going to add it to our ranking samples. It will only invite confused questions as to why we are using column X as group id, why we are doing ranking on a binary classification dataset, etc. I think it's best to avoid all of that and show an actual ranking example.

I can either switch to using the validation set for this example, which is 160MB, or make a smaller subset of the 500MB dataset. What do you think?

Can we just generate one?
160 MB is too much.
You could generate a dataset with 10 features and 10 different groups, where the label depends on the value of feature_x, with x being the value in the group column.
Or, since this is a LightGBM example, you could use the LightGBM lambdarank dataset: https://github.com/Microsoft/LightGBM/tree/master/examples/lambdarank
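
For illustration, a generated dataset along those lines might look like the sketch below. This is not code from the PR: the `RankingRow` and `SyntheticRankingData` names are hypothetical, and the choice to bucket the feature value into relevance labels 0 to 4 is an assumption made only to get a concrete label.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical row type for this sketch; not part of ML.NET or the PR.
public sealed class RankingRow
{
    public int GroupId;       // 0..9
    public float[] Features;  // 10 random values in [0, 1]
    public uint Label;        // relevance grade 0..4 (assumed scale)
}

public static class SyntheticRankingData
{
    // Generates rowsPerGroup rows for each of 10 groups. The label of a row
    // in group x is derived only from feature x of that row.
    public static List<RankingRow> Generate(int rowsPerGroup, int seed = 1)
    {
        const int groupCount = 10;
        const int featureCount = 10;
        const int maxLabel = 4;

        var rng = new Random(seed);
        var rows = new List<RankingRow>(groupCount * rowsPerGroup);

        for (int g = 0; g < groupCount; g++)
        {
            for (int i = 0; i < rowsPerGroup; i++)
            {
                var features = new float[featureCount];
                for (int f = 0; f < featureCount; f++)
                    features[f] = (float)rng.NextDouble();

                // Assumption: split the feature's range into five equal
                // buckets and clamp so the label never exceeds maxLabel.
                uint label = (uint)Math.Min(maxLabel, (int)(features[g] * (maxLabel + 1)));

                rows.Add(new RankingRow { GroupId = g, Features = features, Label = label });
            }
        }
        return rows;
    }
}
```

Calling `SyntheticRankingData.Generate(100)` would give 1,000 rows, which is small enough to build in memory instead of downloading a file.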


Hmm, that's not a very easy example to follow. There is no group id in the data, nor does the .conf file specify anything about it. It's very unclear how that is being handled or whether the ranker is being trained with one row per group (and if so, why). I'll just make a small sample of the MSLR dataset, something like 10,000 rows, ~7MB.

It specifies a query file, which contains a list of numbers. The first number is the number of examples in the first group, the second number is the number of examples in the second group, and so on.
Same for test.
My main concern was whether the MSLR dataset is public, but it looks like it became public quite a while ago: https://www.microsoft.com/en-us/research/project/mslr/
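
To make the query file idea concrete, here is a minimal sketch of turning those per-group counts into one group id per data row. The `QueryFile.ExpandGroupIds` helper is hypothetical (it is not a LightGBM or ML.NET API), and it assumes one count per line, with the ids simply numbered from 0 in file order.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical helper for this sketch; not a LightGBM or ML.NET API.
public static class QueryFile
{
    // Expands per-group row counts (assumed: one integer per line) into a
    // list holding one group id per data row, in the same order as the rows.
    public static List<int> ExpandGroupIds(IEnumerable<string> queryLines)
    {
        var groupIds = new List<int>();
        int groupId = 0;

        foreach (var line in queryLines)
        {
            // Skip blank lines, such as a trailing newline at the end of the file.
            if (string.IsNullOrWhiteSpace(line))
                continue;

            int count = int.Parse(line.Trim(), CultureInfo.InvariantCulture);
            for (int i = 0; i < count; i++)
                groupIds.Add(groupId);

            groupId++;
        }
        return groupIds;
    }
}
```

For example, a query file with the lines `3` and `2` describes five rows: the first three belong to group 0 and the next two to group 1. The resulting list could then be attached to the data rows as a GroupId column.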


Yes, and we also have approval to use this in our samples with attribution information in the [README](https://github.com/dotnet/machinelearning/blob/master/test/data/README.md).

Ivanidzo4ka · 2019-02-25T18:33:16Z

src/Microsoft.ML.SamplesUtils/SamplesDatasetUtils.cs

+            var data = reader.Read(dataFile);
+
+            // Create the featurization pipeline. First, hash the GroupId column.
+            var pipeline = mlContext.Transforms.Conversion.Hash("GroupId")


mlContext.Transforms.Conversion.Hash("GroupId")

Doesn't TrainTestSplit take care of that?

No, GroupId needs to be converted to Key first.

machinelearning/src/Microsoft.ML.Data/TrainCatalog.cs

Line 201 in 3b9d407

private void EnsureGroupPreservationColumn(ref IDataView data, ref string samplingKeyColumn, uint? seed = null)

Are you sure?

I mean, I get that error when I try this pipeline without hashing GroupId

* Hide the uses of DataKind in TypeConverting * Hide DataKind used in TextLoader * Internalize-best-friend DataKind * DataKind ---> InternalDataKind * ScalarType ---> DataKind (massive renaming) * Address comments * Address comments * Address comments * Make R4 as default * Ok. I updated entry point... * Sync with new things from master * Address comments

…nt (dotnet#2688) * Added samples & docs for BinaryClassification.StochasticGradientDescent, plus a bunch of typo fixing. * Addressed PR comments. * Mentioned Hogwild * Updates to exampleWeightColumnName. * Fixed trailing whitespaces.

najeeb-kazmi added 6 commits February 19, 2019 23:19

Adding a sample for LightGbm Ranking

b572614

PR feedback + cleaning up namespaces in Microsoft.ML.Samples project

f3d5d82

Adding a sample for LightGbm Ranking

ba14a9d

PR feedback + cleaning up namespaces in Microsoft.ML.Samples project

f20d7bf

nit

d862c3b

merge conflicts

269619f

najeeb-kazmi requested review from Ivanidzo4ka, shmoradims and zeahmed February 23, 2019 03:59

singlis and others added 7 commits February 23, 2019 08:24

- Fixes the project reference path for OnnxTransformer. (dotnet#2705)

9fe8233

Found while fixing dotnet#689, moved to separate commit.

- Removes ResultProcessor, Maml and Sweeper from Microsoft.ML nuget. (…

160eade

…dotnet#2690) This fixes dotnet#689

Remove MD5Hasher. (dotnet#2706)

eecf272

Hide delegates, model parameters classes, move onFit to staticPIpe, g…

f063510

…et rid of trivial transformWrapper (dotnet#2701)

Adding functional tests for all training and evaluation tasks (dotnet…

8001ccc

…#2646) * Adding functional tests for all training and evaluation tasks

Introduce order for pixel extraction (dotnet#2602)

850559f

Ivanidzo4ka reviewed Feb 25, 2019

docs/samples/Microsoft.ML.Samples/Dynamic/Trainers/Ranking/LightGBMRankingWithOptions.cs

Ivanidzo4ka reviewed Feb 25, 2019

abgoswam and others added 10 commits February 25, 2019 18:37

Fixing parameters in ML.NET Public API (dotnet#2665)

7cc208c

* local tests run fine * made fixes for (feature, label) as well * update cookbook md file

Explicit implementation for IsRowToRowMapper and GetRowToRowMapper (d…

4acf5aa

…otnet#2673)

Make DataViewRowId not act like a number. (dotnet#2707)

f6d55f3

* Make DataViewRowId not act like a number. - Remove it from the NumberDataViewType. - Remove any method/operator that makes it feel like a number. Working towards dotnet#2297

Changed Ranker to Ranking in evaluation related files. (dotnet#2675)

4420cc7

Adding a sample for LightGbm Ranking

18801ab

PR feedback + cleaning up namespaces in Microsoft.ML.Samples project

1e1a803

Adding a sample for LightGbm Ranking

345cf60

PR feedback + cleaning up namespaces in Microsoft.ML.Samples project

c25a3c3

najeeb-kazmi added 3 commits February 25, 2019 16:46

nit

34ecd4a

merge conflicts

b8bbf21

Changing dataset to small sample and other feedback

1c99a4f

najeeb-kazmi closed this Feb 26, 2019

najeeb-kazmi mentioned this pull request Feb 26, 2019

Adding sample for LightGbm ranking #2729

Merged

ghost locked as resolved and limited conversation to collaborators Mar 24, 2022
