-
Notifications
You must be signed in to change notification settings - Fork 1.9k
TrainTestSplit function #1005
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TrainTestSplit function #1005
Conversation
/// train to the test set.</remarks> | ||
/// <returns>A pair of datasets, for the train and test set.</returns> | ||
public static (DataView<T> trainSet, DataView<T> testSet) TrainTestSplit<T>(this TrainContextBase context, | ||
DataView<T> data, double testFraction = 0.1, Func<T, PipelineColumn> stratificationColumn = null) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Func<T, PipelineColumn> [](start = 57, length = 23)
Are there any type restrictions on this? It must be hashable, I suppose? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea, it must be hash-able, but we should bring back the hash-join functionality for all the column types. I will write an issue about this
In reply to: 220069503 [](ancestors = 220069503)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a REVIEW comment about it in the TrainContext, line 140
In reply to: 220366016 [](ancestors = 220366016,220069503)
@@ -236,7 +236,7 @@ public static (Vector<float> score, Key<uint, TVal> predictedLabel) | |||
{ | |||
Contracts.CheckValue(label, nameof(label)); | |||
Contracts.CheckValue(features, nameof(features)); | |||
Contracts.CheckValue(loss, nameof(loss)); | |||
Contracts.CheckValueOrNull(loss); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Err, whoops, thanks for fixing this. 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks mostly good. I think we ought to think carefully about what we feel about having the training context responsible for doing things like splitting data, but I feel this might be best done by merely observing the idioms you've proposed in action. Thanks @Zruty0 .
r.features, | ||
maxIterations: 2))); | ||
|
||
var results = ctx.CrossValidate(reader.Read(dataSource), est, r => r.label) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like how the example/test flows. #ByDesign
for (int fold = 0; fold < numFolds; fold++) | ||
result.Add(foldFunction(fold)); | ||
|
||
return result.ToArray(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
result.ToArray() [](start = 19, length = 16)
curious, why not return a list? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
||
if (stratificationColumn == null) | ||
{ | ||
stratificationColumn = data.Schema.GetTempColumnName("StratificationColumn"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
StratificationColumn [](start = 70, length = 20)
is there a DefaultColumnName for this? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked, there is none. I don't think there should be either, it's a pretty peculiar concept, unique to CV and split scenarios
In reply to: 220372046 [](ancestors = 220372046)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Part of #754
Added TrainTestSplit function to training contexts.