-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Enable a QuantileRegression Test & Fix Duplicated Baseline Files #1193
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Enable CommandTrainScoreEvaluateQuantileRegression, add the dataset and the necessary baselines. The only baseline changes were of the form that were caused by https://github.com/dotnet/corefx/issues/31847.
A lot of baseline tests have duplicated baselines between debug and release. Allow BaseTestBaseline to check the Common baseline directory for a baseline file first. Fix dotnet#410
/cc @justinormont - in case you want to review the Boston Housing dataset attribution. #Closed |
@@ -838,7 +748,7 @@ protected static StreamReader OpenReader(string path) | |||
{ | |||
Contracts.CheckNonWhiteSpace(path, nameof(path)); | |||
#if CORECLR | |||
return new StreamReader(File.Open(path, FileMode.Open, FileAccess.Read)); | |||
return new StreamReader(File.Open(path, FileMode.Open, FileAccess.Read, FileShare.Read)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI - this is a nicety when debugging tests and trying to use Beyond Compare (or similar) to diff the output and baseline files. As currently implemented, it is impossible to open these files in 2 processes at the same time. #Resolved
@@ -0,0 +1,506 @@ | |||
24.00 0.00632 18.00 2.310 0 0.5380 6.5750 65.20 4.0900 1 296.0 15.30 396.90 4.98 | |||
21.60 0.02731 0.00 7.070 0 0.4690 6.4210 78.90 4.9671 2 242.0 17.80 396.90 9.14 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From all I could find online, this dataset appears to be in the public domain.
This site claims the dataset is in the public domain:
http://b-course.hiit.fi/obc/housingexpl.html
“ B-Course provides a public domain data set called Boston Housing Data so that one can try out B-Course without own data. Below you can find some details of our example data set.”
The dataset is also distributed in an R package here:
https://www.rdocumentation.org/packages/mlbench/versions/2.1-1
It's used in books as well: https://github.com/rasbt/python-machine-learning-book-2nd-edition/blob/master/code/ch10/housing.data.txt
I've added the attribution to the README.
By the way, this is a different dataset from the "King County Housing" (kc_housing) dataset. #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CELA has approved the re-distribution of the Boston Housing dataset. #Closed
protected string ExternalTestBinariesDir { get { return Path.Combine(RootDir, TestExtDir); } } | ||
|
||
protected string TestDirectory { get { return Path.Combine(RootDir, TestDir); } } | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @yaeldekel to check whether this conflicts with her work. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for thinking of that work, @sfilipi.
If there are things that are only used in the internal repo, we should move them to only be in the internal repo. It isn't sustainable to have a bunch of methods in dotnet/machinelearning that are not used.
@yaeldekel - if any of these are necessary for your work, can you add them to a partial
class in the internal repo? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@@ -0,0 +1,506 @@ | |||
24.00 0.00632 18.00 2.310 0 0.5380 6.5750 65.20 4.0900 1 296.0 15.30 396.90 4.98 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be reasonable to add a header to the file to help users understand it:
CrimesPerCapita PercentResidental PercentNonRetail CharlesRiver NitricOxides RoomsPerDwelling PercentPre40s EmploymentDistance HighwayDistance TaxRate TeacherRatio BlackIndex PercentLowIncome MedianHomeValue
The headed is also used when producing the feature importance as it will then print the column headers (textual form) instead of just a semi-opaque 'F13'. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. #Closed
@@ -65,6 +65,12 @@ The dataset is under a CC-by 4.0 license. | |||
} | |||
``` | |||
|
|||
### Boston Housing Data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a more info link? I also added a link to our copy of the dataset so folks know what we are referencing.
### Boston Housing Data
Redistributing the dataset "[housing.txt](housing.txt)" with attribution:
> Harrison, D. and Rubinfeld, D.L. 'Hedonic prices and the demand for clean air', J. Environ. Economics & Management, vol.5, 81-102, 1978.
More information: http://b-course.hiit.fi/obc/housingexpl.html
``` #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add more info to the README. Add column headers to the housing.txt data file.
Closing/re-opening to restart CI tests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Enable CommandTrainScoreEvaluateQuantileRegression, add the dataset and the necessary baselines.
The only baseline changes were of the form that were caused by dotnet/corefx#31847.
Allow BaseTestBaseline to check Common first.
A lot of baseline tests have duplicated baselines between debug and release. Allow BaseTestBaseline to check the Common baseline directory for a baseline file first.
Also, I cleaned a bunch of dead code from BaseTestBaseline.
Fix #410