Skip to content

Enable a QuantileRegression Test & Fix Duplicated Baseline Files #1193

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Oct 10, 2018

Conversation

eerhardt
Copy link
Member

@eerhardt eerhardt commented Oct 9, 2018

Enable CommandTrainScoreEvaluateQuantileRegression, add the dataset and the necessary baselines.

The only baseline changes were of the form that were caused by dotnet/corefx#31847.

Allow BaseTestBaseline to check Common first.

A lot of baseline tests have duplicated baselines between debug and release. Allow BaseTestBaseline to check the Common baseline directory for a baseline file first.

Also, I cleaned a bunch of dead code from BaseTestBaseline.

Fix #410

Enable CommandTrainScoreEvaluateQuantileRegression, add the dataset and the necessary baselines.

The only baseline changes were of the form that were caused by https://github.com/dotnet/corefx/issues/31847.
A lot of baseline tests have duplicated baselines between debug and release. Allow BaseTestBaseline to check the Common baseline directory for a baseline file first.

Fix dotnet#410
@eerhardt
Copy link
Member Author

eerhardt commented Oct 9, 2018

/cc @justinormont - in case you want to review the Boston Housing dataset attribution. #Closed

@@ -838,7 +748,7 @@ protected static StreamReader OpenReader(string path)
{
Contracts.CheckNonWhiteSpace(path, nameof(path));
#if CORECLR
return new StreamReader(File.Open(path, FileMode.Open, FileAccess.Read));
return new StreamReader(File.Open(path, FileMode.Open, FileAccess.Read, FileShare.Read));
Copy link
Member Author

@eerhardt eerhardt Oct 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI - this is a nicety when debugging tests and trying to use Beyond Compare (or similar) to diff the output and baseline files. As currently implemented, it is impossible to open these files in 2 processes at the same time. #Resolved

@@ -0,0 +1,506 @@
24.00 0.00632 18.00 2.310 0 0.5380 6.5750 65.20 4.0900 1 296.0 15.30 396.90 4.98
21.60 0.02731 0.00 7.070 0 0.4690 6.4210 78.90 4.9671 2 242.0 17.80 396.90 9.14
Copy link
Member

@sfilipi sfilipi Oct 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Last time i asked about this dataset, I believe we couldn't use it.
I switched my test to the wine quality.
Might have to use the regression dataset that is in the repo already.

@shauheen @artidoro to comment further #Resolved

Copy link
Member Author

@eerhardt eerhardt Oct 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From all I could find online, this dataset appears to be in the public domain.

This site claims the dataset is in the public domain:
http://b-course.hiit.fi/obc/housingexpl.html
“ B-Course provides a public domain data set called Boston Housing Data so that one can try out B-Course without own data. Below you can find some details of our example data set.”

The dataset is also distributed in an R package here:
https://www.rdocumentation.org/packages/mlbench/versions/2.1-1

It's used in books as well: https://github.com/rasbt/python-machine-learning-book-2nd-edition/blob/master/code/ch10/housing.data.txt

I've added the attribution to the README.

By the way, this is a different dataset from the "King County Housing" (kc_housing) dataset. #Closed

Copy link
Contributor

@justinormont justinormont Oct 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CELA has approved the re-distribution of the Boston Housing dataset. #Closed

protected string ExternalTestBinariesDir { get { return Path.Combine(RootDir, TestExtDir); } }

protected string TestDirectory { get { return Path.Combine(RootDir, TestDir); } }

Copy link
Member

@sfilipi sfilipi Oct 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @yaeldekel to check whether this conflicts with her work. #Resolved

Copy link
Member Author

@eerhardt eerhardt Oct 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for thinking of that work, @sfilipi.

If there are things that are only used in the internal repo, we should move them to only be in the internal repo. It isn't sustainable to have a bunch of methods in dotnet/machinelearning that are not used.

@yaeldekel - if any of these are necessary for your work, can you add them to a partial class in the internal repo? #Resolved

Copy link
Member

@sfilipi sfilipi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

Copy link
Contributor

@Anipik Anipik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -0,0 +1,506 @@
24.00 0.00632 18.00 2.310 0 0.5380 6.5750 65.20 4.0900 1 296.0 15.30 396.90 4.98
Copy link
Contributor

@justinormont justinormont Oct 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be reasonable to add a header to the file to help users understand it:
CrimesPerCapita PercentResidental PercentNonRetail CharlesRiver NitricOxides RoomsPerDwelling PercentPre40s EmploymentDistance HighwayDistance TaxRate TeacherRatio BlackIndex PercentLowIncome MedianHomeValue

The headed is also used when producing the feature importance as it will then print the column headers (textual form) instead of just a semi-opaque 'F13'. #Resolved

Copy link
Member Author

@eerhardt eerhardt Oct 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. #Closed

@@ -65,6 +65,12 @@ The dataset is under a CC-by 4.0 license.
}
```

### Boston Housing Data
Copy link
Contributor

@justinormont justinormont Oct 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a more info link? I also added a link to our copy of the dataset so folks know what we are referencing.

### Boston Housing Data
 Redistributing the dataset "[housing.txt](housing.txt)" with attribution:
 > Harrison, D. and Rubinfeld, D.L. 'Hedonic prices and the demand for clean air', J. Environ. Economics & Management, vol.5, 81-102, 1978.

More information: http://b-course.hiit.fi/obc/housingexpl.html
``` #Resolved

Copy link
Member Author

@eerhardt eerhardt Oct 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. #Closed

Copy link
Contributor

@artidoro artidoro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

Add more info to the README.
Add column headers to the housing.txt data file.
@justinormont
Copy link
Contributor

Closing/re-opening to restart CI tests

@justinormont justinormont reopened this Oct 9, 2018
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@eerhardt eerhardt merged commit 024bd44 into dotnet:master Oct 10, 2018
@eerhardt eerhardt deleted the QuantileRegressionTest branch November 7, 2018 14:56
@ghost ghost locked as resolved and limited conversation to collaborators Mar 28, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants