Skip to content

switch housing dataset to wine #17

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 3 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -328,3 +328,5 @@ ASALocalRun/

# MSBuild Binary and Structured Log
*.binlog
# Ignore external test datasets.
/test/data/external/
16 changes: 14 additions & 2 deletions build.proj
Original file line number Diff line number Diff line change
Expand Up @@ -8,7 +8,7 @@
<Import Project="$([MSBuild]::GetDirectoryNameOfFileAbove($(MSBuildThisFileDirectory), Directory.Build.props))\Directory.Build.props" />

<Import Project="$(ToolsDir)VersionTools.targets" Condition="Exists('$(ToolsDir)VersionTools.targets')" />

<UsingTask TaskName="DownloadFilesFromUrl" AssemblyFile="$(ToolsDir)Microsoft.DotNet.Build.Tasks.dll"/>
<PropertyGroup>
<!-- To disable the restoration of packages, set RestoreDuringBuild=false or pass /p:RestoreDuringBuild=false.-->
<RestoreDuringBuild Condition="'$(RestoreDuringBuild)'==''">true</RestoreDuringBuild>
Expand All @@ -33,6 +33,7 @@
RestoreProjects;
BuildNative;
$(TraversalBuildDependsOn);
DownloadExternalTestFiles;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this thing download a file on every build or only if the file is not present already?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm trying to make it download file only once, but struggle for right now, see Eric comment and my reply.

RunTests;
</TraversalBuildDependsOn>
</PropertyGroup>
Expand Down Expand Up @@ -62,7 +63,18 @@
<MSBuild Projects="@(PkgProject)"
Targets="Pack" />
</Target>


<ItemGroup>
<TestFile Include="d" Url="https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv"/>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(minor) Can you make the "Include" a better name?

<TestFile Include="winequality" Url="[https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv](https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv)"/>

</ItemGroup>

<Target Name="DownloadExternalTestFiles" Condition="'$(RunTests)'=='true'">
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should always do this, not just when RunTests is on. That way I can still run tests from VS Test Explorer even if I just "build.cmd" from command line.

However, we should also have "Inputs and Outputs" hooked up, so that way the target is skipped if the files already exist. check out https://msdn.microsoft.com/en-us/library/ms171483.aspx for how to make incremental builds work. Let me know if you need help.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any chance you can help me with that?

     <TestFile Include="winequality-white.csv" Url="https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv">
         <Result>$(MSBuildThisFileDirectory)test\data\external\winequality-white.csv</Result>
     </TestFile>
      <TestFile Include="winequality-red.csv" Url="https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv">
         <Result>$(MSBuildThisFileDirectory)test\data\external\winequality-red.csv</Result>
     </TestFile>
  </ItemGroup>
  
  <Target Name="DownloadExternalTestFiles" Inputs="@(TestFile)" Outputs="%(TestFile.Result)">
    <Message Importance="High" Text="Downloading external test files... %(TestFile.Result)" />
      <DownloadFilesFromUrl Items="@(TestFile)"
                           DestinationDir="test/data/external"
                           TreatErrorsAsWarnings="true"/>
  </Target>

It keeps download files and i'm not sure how to debug it properly

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eerhardt Can you help me?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way Inputs and Outputs work is based on file timestamps. You need an actual Input file, so MSBuild can compare its last write time to the output file's last write time.

Here when you specify:

Inputs="@(TestFile)"

It uses the TestFile's ItemSpec (what is put into the Include attribute). I'm assuming we don't have files named "winequality-white.csv", so MSBuild probably decides to always do the task, since it can't find the input file.

Another option is to check if the local file already exists, and if it does, then don't do the download.
Here's that approach we do in a different repo that is doing a similar thing (downloading files from the internet during the build, and caching them so they aren't downloaded for every build invocation):

https://github.com/dotnet/cli/blob/69062dda4edd7670cffa5c23876b38eccc130e09/build/Prepare.targets#L36-L56

The above will download any URL in the @(_DownloadAndExtractItem) Item. And then where those items are defined:

https://github.com/dotnet/cli/blob/69062dda4edd7670cffa5c23876b38eccc130e09/build/BundledRuntimes.props#L61-L66

The item is "Condition"d on whether the file exists or not:

    <_DownloadAndExtractItem Include="CombinedSharedHostAndFrameworkArchive"
                             Condition="!Exists('$(CombinedSharedHostAndFrameworkArchive)')">
      <Url>$(CoreSetupRootUrl)$(MicrosoftNETCoreAppPackageVersion)/$(CombinedFrameworkHostCompressedFileName)$(CoreSetupBlobAccessTokenParam)</Url>
      <DownloadFileName>$(CombinedSharedHostAndFrameworkArchive)</DownloadFileName>
      <ExtractDestination>$(SharedFrameworkPublishDirectory)</ExtractDestination>
    </_DownloadAndExtractItem>

So if the CombinedSharedHostAndFrameworkArchive file already exists, then this item isn't included. Thus, when all the files already exist locally, nothing is downloaded.

<Message Importance="High" Text="Downloading external test files..." />
<DownloadFilesFromUrl Items="@(TestFile)"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reasonable way to check the file hash? At some point the UCI site will be broken, and/or we will receive (perhaps) a partial file.

As we expand on the use of this, there'll be additional ways for it to fail.

DestinationDir="test/data/external"
TreatErrorsAsWarnings="true"/>
</Target>

<Target Name="RunTests" Condition="'$(RunTests)'=='true'">
<MSBuild Projects="test\run-tests.proj"
Targets="RunTests" />
Expand Down
8 changes: 4 additions & 4 deletions test/Microsoft.ML.Core.Tests/UnitTests/TestCSharpApi.cs
Original file line number Diff line number Diff line change
Expand Up @@ -269,10 +269,10 @@ public void TestCrossValidationBinaryMacro()
}
}

[Fact(Skip = "Missing data set. See https://github.com/dotnet/machinelearning/issues/3")]
[Fact()]
public void TestCrossValidationMacro()
{
var dataPath = GetDataPath(@"housing.txt");
var dataPath = GetDataPath(@"external/winequality-white.csv");
using (var env = new TlcEnvironment())
{
var subGraph = env.CreateExperiment();
Expand All @@ -295,7 +295,7 @@ public void TestCrossValidationMacro()
var modelCombineOutput = subGraph.Add(modelCombine);

var experiment = env.CreateExperiment();
var importInput = new ML.Data.TextLoader();
var importInput = new ML.Data.TextLoader() { CustomSchema = "col=Label:R4:11 col=Features:R4:0-10 sep=; header+" };
var importOutput = experiment.Add(importInput);

var crossValidate = new ML.Models.CrossValidator
Expand Down Expand Up @@ -324,7 +324,7 @@ public void TestCrossValidationMacro()
Assert.True(b);
double val = 0;
getter(ref val);
Assert.Equal(3.32, val, 1);
Assert.Equal(0.58, val, 1);
b = cursor.MoveNext();
Assert.False(b);
}
Expand Down
23 changes: 13 additions & 10 deletions test/Microsoft.ML.Core.Tests/UnitTests/TestEntryPoints.cs
Original file line number Diff line number Diff line change
Expand Up @@ -729,7 +729,7 @@ public void EntryPointTextToKeyToText()
}

private void RunTrainScoreEvaluate(string learner, string evaluator, string dataPath, string warningsPath, string overallMetricsPath,
string instanceMetricsPath, string confusionMatrixPath = null)
string instanceMetricsPath, string confusionMatrixPath = null, string loader = null)
{
string inputGraph = string.Format(@"
{{
Expand All @@ -738,6 +738,7 @@ private void RunTrainScoreEvaluate(string learner, string evaluator, string data
'Name': 'Data.TextLoader',
'Inputs': {{
'InputFile': '$file'
{8}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this already existed before you change, but what do you think about using string interpolation here? I think it would make this so much more readable:

'Inputs': {{
                        'InputFile': '$file'
                        {string.IsNullOrWhiteSpace(loader) ? "" : string.Format(",'CustomSchema': '{0}'", loader)}

}},
'Outputs': {{
'Data': '$AllData'
Expand Down Expand Up @@ -797,7 +798,8 @@ private void RunTrainScoreEvaluate(string learner, string evaluator, string data
}}
}}", learner, evaluator, EscapePath(dataPath), EscapePath(warningsPath), EscapePath(overallMetricsPath), EscapePath(instanceMetricsPath),
confusionMatrixPath != null ? ", 'ConfusionMatrix': '$ConfusionMatrix'" : "",
confusionMatrixPath != null ? string.Format(", 'ConfusionMatrix' : '{0}'", EscapePath(confusionMatrixPath)) : "");
confusionMatrixPath != null ? string.Format(", 'ConfusionMatrix' : '{0}'", EscapePath(confusionMatrixPath)) : "",
string.IsNullOrWhiteSpace(loader) ? "" : string.Format(",'CustomSchema': '{0}'", loader));

var jsonPath = DeleteOutputPath("graph.json");
File.WriteAllLines(jsonPath, new[] { inputGraph });
Expand Down Expand Up @@ -855,15 +857,16 @@ public void EntryPointEvaluateMultiClass()
Assert.Equal(3, CountRows(loader));
}

[Fact(Skip = "Missing data set. See https://github.com/dotnet/machinelearning/issues/3")]
[Fact()]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(nit) no need for the parens. You can remove ().

public void EntryPointEvaluateRegression()
{
var dataPath = GetDataPath("housing.txt");
var dataPath = GetDataPath(@"external/winequality-white.csv");
var warningsPath = DeleteOutputPath("warnings.idv");
var overallMetricsPath = DeleteOutputPath("overall.idv");
var instanceMetricsPath = DeleteOutputPath("instance.idv");

RunTrainScoreEvaluate("Trainers.StochasticDualCoordinateAscentRegressor", "Models.RegressionEvaluator", dataPath, warningsPath, overallMetricsPath, instanceMetricsPath);
RunTrainScoreEvaluate("Trainers.StochasticDualCoordinateAscentRegressor", "Models.RegressionEvaluator",
dataPath, warningsPath, overallMetricsPath, instanceMetricsPath, loader: "col=Label:R4:11 col=Features:R4:0-10 sep=; header+");

using (var loader = new BinaryLoader(Env, new BinaryLoader.Arguments(), warningsPath))
Assert.Equal(0, CountRows(loader));
Expand All @@ -872,7 +875,7 @@ public void EntryPointEvaluateRegression()
Assert.Equal(1, CountRows(loader));

using (var loader = new BinaryLoader(Env, new BinaryLoader.Arguments(), instanceMetricsPath))
Assert.Equal(104, CountRows(loader));
Assert.Equal(975, CountRows(loader));
}

[Fact]
Expand All @@ -887,10 +890,10 @@ public void EntryPointSDCAMultiClass()
TestEntryPointRoutine("iris.txt", "Trainers.StochasticDualCoordinateAscentClassifier");
}

[Fact(Skip = "Missing data set. See https://github.com/dotnet/machinelearning/issues/3")]
[Fact()]
public void EntryPointSDCARegression()
{
TestEntryPointRoutine("housing.txt", "Trainers.StochasticDualCoordinateAscentRegressor");
TestEntryPointRoutine(@"external/winequality-white.csv", "Trainers.StochasticDualCoordinateAscentRegressor", loader: "col=Label:R4:11 col=Features:R4:0-10 sep=; header+");
}

[Fact]
Expand Down Expand Up @@ -961,10 +964,10 @@ public void EntryPointHogwildSGD()
TestEntryPointRoutine("breast-cancer.txt", "Trainers.StochasticGradientDescentBinaryClassifier");
}

[Fact(Skip = "Missing data set. See https://github.com/dotnet/machinelearning/issues/3")]
[Fact()]
public void EntryPointPoissonRegression()
{
TestEntryPointRoutine("housing.txt", "Trainers.PoissonRegressor");
TestEntryPointRoutine(@"external/winequality-white.csv", "Trainers.PoissonRegressor", loader: "col=Label:R4:11 col=Features:R4:0-10 sep=; header+");
}

[Fact]
Expand Down