Skip to content

TrainTestSplit function #1005

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 14 commits into from
Sep 26, 2018
Merged

TrainTestSplit function #1005

merged 14 commits into from
Sep 26, 2018

Conversation

Zruty0
Copy link
Contributor

@Zruty0 Zruty0 commented Sep 24, 2018

Part of #754
Added TrainTestSplit function to training contexts.

@Zruty0 Zruty0 added the API Issues pertaining the friendly API label Sep 24, 2018
@Zruty0 Zruty0 self-assigned this Sep 24, 2018
/// train to the test set.</remarks>
/// <returns>A pair of datasets, for the train and test set.</returns>
public static (DataView<T> trainSet, DataView<T> testSet) TrainTestSplit<T>(this TrainContextBase context,
DataView<T> data, double testFraction = 0.1, Func<T, PipelineColumn> stratificationColumn = null)
Copy link
Contributor

@TomFinley TomFinley Sep 25, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Func<T, PipelineColumn> [](start = 57, length = 23)

Are there any type restrictions on this? It must be hashable, I suppose? #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea, it must be hash-able, but we should bring back the hash-join functionality for all the column types. I will write an issue about this


In reply to: 220069503 [](ancestors = 220069503)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a REVIEW comment about it in the TrainContext, line 140


In reply to: 220366016 [](ancestors = 220366016,220069503)

@@ -236,7 +236,7 @@ public static (Vector<float> score, Key<uint, TVal> predictedLabel)
{
Contracts.CheckValue(label, nameof(label));
Contracts.CheckValue(features, nameof(features));
Contracts.CheckValue(loss, nameof(loss));
Contracts.CheckValueOrNull(loss);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Err, whoops, thanks for fixing this. 👍

Copy link
Contributor

@TomFinley TomFinley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks mostly good. I think we ought to think carefully about what we feel about having the training context responsible for doing things like splitting data, but I feel this might be best done by merely observing the idioms you've proposed in action. Thanks @Zruty0 .

r.features,
maxIterations: 2)));

var results = ctx.CrossValidate(reader.Read(dataSource), est, r => r.label)
Copy link
Member

@sfilipi sfilipi Sep 25, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like how the example/test flows. #ByDesign

for (int fold = 0; fold < numFolds; fold++)
result.Add(foldFunction(fold));

return result.ToArray();
Copy link
Member

@sfilipi sfilipi Sep 25, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

result.ToArray() [](start = 19, length = 16)

curious, why not return a list? #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We tend to avoid returning mutable collections.


In reply to: 220371873 [](ancestors = 220371873)


if (stratificationColumn == null)
{
stratificationColumn = data.Schema.GetTempColumnName("StratificationColumn");
Copy link
Member

@sfilipi sfilipi Sep 25, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

StratificationColumn [](start = 70, length = 20)

is there a DefaultColumnName for this? #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked, there is none. I don't think there should be either, it's a pretty peculiar concept, unique to CV and split scenarios


In reply to: 220372046 [](ancestors = 220372046)

Copy link
Member

@sfilipi sfilipi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@Zruty0 Zruty0 closed this Sep 26, 2018
@Zruty0 Zruty0 reopened this Sep 26, 2018
@shauheen shauheen merged commit dd4320d into dotnet:master Sep 26, 2018
@ghost ghost locked as resolved and limited conversation to collaborators Mar 28, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
API Issues pertaining the friendly API
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants