-
Notifications
You must be signed in to change notification settings - Fork 1.9k
CollectionDataSource (train on top of memory collection instead of loading data from file) #106
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CollectionDataSource (train on top of memory collection instead of loading data from file) #106
Conversation
src/Microsoft.ML/MemoryCollection.cs
Outdated
public void SetInput(IHostEnvironment env, Experiment experiment) | ||
{ | ||
if (_listCollection!=null) | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ctrl-k-d, also, maybe since these are one-liners we can omit these vebose {
s. #Closed
src/Microsoft.ML/MemoryCollection.cs
Outdated
|
||
public MemoryCollection(IList<TInput> collection) | ||
{ | ||
//need validation at some point |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
//need validation at some point [](start = 12, length = 31)
How about we at least validate to ensure that these aren't null
, with a Contracts.CheckParamValue
? Here and in the other constructor. #Closed
@@ -14,7 +14,7 @@ namespace Microsoft.ML.Runtime.Api | |||
/// <summary> | |||
/// A helper class to create data views based on the user-provided types. | |||
/// </summary> | |||
internal static class DataViewConstructionUtils | |||
public static class DataViewConstructionUtils |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
public static class DataViewConstructionUtils [](start = 4, length = 45)
This makes me nervous... If you make this public rather than internal, then it is no longer appropriate to have what are currently Assert
s on the public methods. They'll have to be Check
s, with suitably verbose error messages, properly called in the context of env
, and so on, and so on. This is more work than I think you want to have right now.
Do you need all of the functions in this class? This was originally a supporting utility class for the pre-ML.NET API. You don't need all of it... in fact I think you only need a couple of them. I might choose to either (1) introduce another public class with the methods you need being accessible through that or (2) change this class so that it remains public but everything in it has its access member changed to internal
, except the one or two facilities you actually need. #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I miss these blobs of text so much. (I'm serious right now) #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Someone already did that for me. All I had to do is to look for method reference :)
In reply to: 187203478 [](ancestors = 187203478)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -16,7 +16,7 @@ namespace Microsoft.ML.Runtime.Api | |||
/// <summary> | |||
/// An internal class that holds the (already validated) mapping between a custom type and an IDataView schema. | |||
/// </summary> | |||
internal sealed class InternalSchemaDefinition | |||
public sealed class InternalSchemaDefinition |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
public sealed class InternalSchemaDefinition [](start = 4, length = 44)
I have similar concerns about this class being public. If we can avoid it, possibly by moving some of the logic current in MemoryCollection.cs
to this assembly, then calling that from the Microsoft.ML
project, I'd be delighted. #Closed
Experiment experiment = environment.CreateExperiment(); | ||
ILearningPipelineDataStep output = collection.ApplyStep(null, experiment) as ILearningPipelineDataStep; | ||
|
||
Assert.NotNull(output.Data); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assert.NotNull(output.Data); [](start = 16, length = 28)
If you really meant the as
above, you should first Assert.NonNull
on output
. If you didn't mean the as
above, then you should do a direct ()
style cast to ILearningPipelineDataStep
. The only reason to use an as
style cast is if you are entertaining the possibility that it might not implement that interface, in which case, given that this is a test, you should test that. (And if it wasn't a test, you would Contracts.Check*
it.) #Closed
src/Microsoft.ML/MemoryCollection.cs
Outdated
private Data.DataViewReference _dataViewEntryPoint; | ||
private IDataView _dataView; | ||
|
||
public MemoryCollection(IList<TInput> collection) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MemoryCollection [](start = 15, length = 16)
public constructor required comments. #Resolved
@@ -14,7 +14,7 @@ namespace Microsoft.ML.Runtime.Api | |||
/// <summary> | |||
/// A helper class to create data views based on the user-provided types. | |||
/// </summary> | |||
internal static class DataViewConstructionUtils | |||
public static class DataViewConstructionUtils | |||
{ | |||
public static IDataView CreateFromList<TRow>(IHostEnvironment env, IList<TRow> data, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IList [](start = 75, length = 5)
Since we in this file, @[email protected] do you have any opinion about IList?
I would personally convert it to List. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason why you would prefer List
here?
So the reason I would prefer a specific implementer of an interface rather than the interface itself is to avoid having to go through the virtual function table. (Which is why it's often better to write Foo<TBar>(TBar biz) where TBar : IBar
vs. Foo(IBar biz)
.) In this specific case though, I don't think it's worth it. You're ultimately going to be dealing with an IEnumerator<TRow>
anyway for most operations on this structure, and there it's unfortunately unavoidable, so I care a bit less. But perhaps there's something to this I'm not appreciating.
In reply to: 187205023 [](ancestors = 187205023)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess its just personal. I just don't like IList. In same time, IList allow you to pass Arrays, so this solve problem with Array support.
In reply to: 187207615 [](ancestors = 187207615,187205023)
src/Microsoft.ML/MemoryCollection.cs
Outdated
|
||
public MemoryCollection(IEnumerable<TInput> collection) | ||
{ | ||
Contracts.CheckParamValue(collection != null, collection, nameof(collection), "Must be non-null"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CheckParamValue(collection != null, collection [](start = 22, length = 46)
I'm sorry, I'm silly... I didn't mean CheckParamValue
, I meant CheckValue
. That'll simplify this a bit. #Closed
src/Microsoft.ML/MemoryCollection.cs
Outdated
public MemoryCollectionPipelineStep(Var<IDataView> data) | ||
{ | ||
Data = data; | ||
Model = null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unnecessary, the default value is null
.
That's fine, but I'd go one step further... What I'd do is change the public Var<ITransformModel> Model { get; }
below to public Var<ITransformModel> Model => null;
, that way you avoid having any backing field whatsoever. (Plus you save three characters, which totally makes it worth it. :D ) #Closed
[Fact] | ||
public void CanSuccessfullyApplyATransform() | ||
{ | ||
var collection = new MemoryCollection<Input>(new List<Input>() { new Input { Number1 = 1, String1 = "1" } }); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MemoryCollection [](start = 33, length = 23)
Just an observation... if we had instead structured this as a static utility method somewhere, then we could avoid having the double-specification of the Input
class, as it would have been inferred by the compiler.
So: if we had a input type MyAwesomeInputType
, then instead of MemoryCollection<MyAwesomeInputType>(new MyAwesomeInputType[] {...})
, we would have MemoryCollection.Create(new MyAwesomeInputType[] {...})
since the compiler could have done the work of inferring the type and whatnot. #Closed
src/Microsoft.ML/MemoryCollection.cs
Outdated
public class MemoryCollection | ||
{ | ||
/// <summary> | ||
/// Creates memory collection loader. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
loader [](start = 38, length = 6)
loader from IList #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is that not obvious from the parameter type? Perhaps adding a "from ." or somesuch would be fine (added to both), but even that I'd say would be a bit too needlessly verbose.
In reply to: 187221474 [](ancestors = 187221474)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's confusing that it says "Collection Loader" while it takes a List . perhaps we should rename it to a ListLoader. and split away from the StreamingEnumerableLoader
In reply to: 187230849 [](ancestors = 187230849,187221474)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
User no longer see class, it just get interface, is it better?
In reply to: 187238467 [](ancestors = 187238467,187230849,187221474)
src/Microsoft.ML/MemoryCollection.cs
Outdated
} | ||
|
||
/// <summary> | ||
/// Creates memory collection loader. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
loader [](start = 38, length = 6)
loader from IEnumerable. #Closed
src/Microsoft.ML/MemoryCollection.cs
Outdated
using Microsoft.ML.Runtime.EntryPoints; | ||
using Microsoft.ML.Runtime.Internal.Utilities; | ||
|
||
namespace Microsoft.ML |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.ML [](start = 19, length = 3)
ML.Data #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TextLoader is part of just ML, should I change it as well?
In reply to: 187221578 [](ancestors = 187221578)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good question... seems they both should be in data. the argument went about like this - can a user has out of the box experience with just ML namespace.
I guess we can keep it in ML for now.
In reply to: 187221799 [](ancestors = 187221799,187221578)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please move to Data. My PR will move TextLoader to ML.Data. #Resolved
[assembly: LoadableClass(typeof(void), typeof(InMemoryDataView), null, typeof(SignatureEntryPointModule), "InMemoryDataView")] | ||
namespace Microsoft.ML.Runtime.EntryPoints | ||
{ | ||
public class InMemoryDataView |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
InMemoryDataView [](start = 17, length = 16)
Where is this being used? I don't see any references to this class in code or tests. #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a entrypoint :) Data.DataViewReference it get used in MemoryCollection.cs (At least it entry point wrapper)
In reply to: 187222065 [](ancestors = 187222065)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, silly me as looking for "InMemoryDataView" not "Data.DataViewReference"
In reply to: 187222365 [](ancestors = 187222365,187222065)
Sorry, indeed I meant to add a different Eric! In reply to: 388041663 [](ancestors = 388041663) |
|
||
namespace Microsoft.ML.Data | ||
{ | ||
public class CollectionLoader |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
public class [](start = 4, length = 12)
Static class #Closed
@dotnet-bot test Windows_NT Release please |
{ | ||
public sealed class Input | ||
{ | ||
[Argument(ArgumentType.Required, ShortName = "data", HelpText = "Pointer to IDataView in memory", SortOrder = 1)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ShortName = "data" [](start = 45, length = 18)
Since shortname is same as longname, you can safely omit. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
new IrisData { SepalLength = 1f, SepalWidth = 1f ,PetalLength=0.3f, PetalWidth=5.1f, Label=1}, | ||
new IrisData { SepalLength = 1.2f, SepalWidth = 0.5f ,PetalLength=0.3f, PetalWidth=5.1f, Label=0} | ||
}; | ||
var collection = CollectionDataSource.Create(data); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would we address the inconsistency in naming/usage between CollectionDataSource and TextLoader in a separate PR? We want it to be clearer these are the two (current) approaches to bringing data into a LearningPipeline.
@glebuk is this review still in blocking state? If yes, what should be process of API review? #Closed |
let me review it. Just give me 20 minutes. In reply to: 388913155 [](ancestors = 388913155) |
|
||
namespace Microsoft.ML.Data | ||
{ | ||
public static class CollectionDataSource |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CollectionDataSource [](start = 24, length = 20)
Please add a top level XML help - what is this class, where does it use, what's the purpose for its existance? #Closed
done. please look at 2 new comments In reply to: 388992391 [](ancestors = 388992391,388913155) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dotnet-bot test OSX10.13 Debug please |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"CollectionDataSource" is not recognized in my project. |
@ntaherkhani Changes in this PR would be part of 0.2 release. Currently on nuget.org you can find only 0.1 release. You can consume nuget with current master branch from myget: https://dotnet.myget.org/F/dotnet-core/api/v3/index.json |
…ading data from file) (dotnet#106) * in memory loader * add test file for memory collection * even in afterlife EntryPointCatalog will chase me down. * Address some comments. * update tests * address more comments. * remove empty param description * hide collectionloader * refactor classes a little. * pesky new lines! * slightly better comments. but only slighty * rename it * make class static * not a loader * remove alias in entrypoint * address comments
First iteration which allows user to create IDataview on top of IList or Enumerable and infrastructure to add it in pipeline.
address #10 issue