Skip to content

CollectionDataSource (train on top of memory collection instead of loading data from file) #106

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 16 commits into from
May 15, 2018

Conversation

Ivanidzo4ka
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka commented May 9, 2018

First iteration which allows user to create IDataview on top of IList or Enumerable and infrastructure to add it in pipeline.
address #10 issue

@Ivanidzo4ka Ivanidzo4ka requested review from TomFinley and sfilipi May 9, 2018 23:02
@Ivanidzo4ka
Copy link
Contributor Author

Ivanidzo4ka commented May 9, 2018

@GalOshri @glebuk I would appreciate your contribution regarding class names. #Resolved

@Ivanidzo4ka Ivanidzo4ka self-assigned this May 9, 2018
@Ivanidzo4ka Ivanidzo4ka added the enhancement New feature or request label May 9, 2018
public void SetInput(IHostEnvironment env, Experiment experiment)
{
if (_listCollection!=null)
{
Copy link
Contributor

@TomFinley TomFinley May 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ctrl-k-d, also, maybe since these are one-liners we can omit these vebose {s. #Closed


public MemoryCollection(IList<TInput> collection)
{
//need validation at some point
Copy link
Contributor

@TomFinley TomFinley May 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

//need validation at some point [](start = 12, length = 31)

How about we at least validate to ensure that these aren't null, with a Contracts.CheckParamValue? Here and in the other constructor. #Closed

@shauheen shauheen added this to the 0518 milestone May 9, 2018
@@ -14,7 +14,7 @@ namespace Microsoft.ML.Runtime.Api
/// <summary>
/// A helper class to create data views based on the user-provided types.
/// </summary>
internal static class DataViewConstructionUtils
public static class DataViewConstructionUtils
Copy link
Contributor

@TomFinley TomFinley May 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

public static class DataViewConstructionUtils [](start = 4, length = 45)

This makes me nervous... If you make this public rather than internal, then it is no longer appropriate to have what are currently Asserts on the public methods. They'll have to be Checks, with suitably verbose error messages, properly called in the context of env, and so on, and so on. This is more work than I think you want to have right now.

Do you need all of the functions in this class? This was originally a supporting utility class for the pre-ML.NET API. You don't need all of it... in fact I think you only need a couple of them. I might choose to either (1) introduce another public class with the methods you need being accessible through that or (2) change this class so that it remains public but everything in it has its access member changed to internal, except the one or two facilities you actually need. #Closed

Copy link
Contributor Author

@Ivanidzo4ka Ivanidzo4ka May 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I miss these blobs of text so much. (I'm serious right now) #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Someone already did that for me. All I had to do is to look for method reference :)


In reply to: 187203478 [](ancestors = 187203478)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Ivan, it's nice to be appreciated. :D :D


In reply to: 187203886 [](ancestors = 187203886)

@TomFinley TomFinley requested a review from eerhardt May 9, 2018 23:24
@@ -16,7 +16,7 @@ namespace Microsoft.ML.Runtime.Api
/// <summary>
/// An internal class that holds the (already validated) mapping between a custom type and an IDataView schema.
/// </summary>
internal sealed class InternalSchemaDefinition
public sealed class InternalSchemaDefinition
Copy link
Contributor

@TomFinley TomFinley May 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

public sealed class InternalSchemaDefinition [](start = 4, length = 44)

I have similar concerns about this class being public. If we can avoid it, possibly by moving some of the logic current in MemoryCollection.cs to this assembly, then calling that from the Microsoft.ML project, I'd be delighted. #Closed

@Ivanidzo4ka
Copy link
Contributor Author

Ivanidzo4ka commented May 9, 2018

using Microsoft.ML.Data;

need header #Resolved


Refers to: src/Microsoft.ML/Runtime/EntryPoints/DataViewReference.cs:1 in b166f05. [](commit_id = b166f05, deletion_comment = False)

@Ivanidzo4ka
Copy link
Contributor Author

Ivanidzo4ka commented May 9, 2018

using System;

need header #Resolved


Refers to: src/Microsoft.ML/MemoryCollection.cs:1 in b166f05. [](commit_id = b166f05, deletion_comment = False)

Experiment experiment = environment.CreateExperiment();
ILearningPipelineDataStep output = collection.ApplyStep(null, experiment) as ILearningPipelineDataStep;

Assert.NotNull(output.Data);
Copy link
Contributor

@TomFinley TomFinley May 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assert.NotNull(output.Data); [](start = 16, length = 28)

If you really meant the as above, you should first Assert.NonNull on output. If you didn't mean the as above, then you should do a direct () style cast to ILearningPipelineDataStep. The only reason to use an as style cast is if you are entertaining the possibility that it might not implement that interface, in which case, given that this is a test, you should test that. (And if it wasn't a test, you would Contracts.Check* it.) #Closed

private Data.DataViewReference _dataViewEntryPoint;
private IDataView _dataView;

public MemoryCollection(IList<TInput> collection)
Copy link
Contributor Author

@Ivanidzo4ka Ivanidzo4ka May 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MemoryCollection [](start = 15, length = 16)

public constructor required comments. #Resolved

@@ -14,7 +14,7 @@ namespace Microsoft.ML.Runtime.Api
/// <summary>
/// A helper class to create data views based on the user-provided types.
/// </summary>
internal static class DataViewConstructionUtils
public static class DataViewConstructionUtils
{
public static IDataView CreateFromList<TRow>(IHostEnvironment env, IList<TRow> data,
Copy link
Contributor Author

@Ivanidzo4ka Ivanidzo4ka May 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IList [](start = 75, length = 5)

Since we in this file, @[email protected] do you have any opinion about IList?
I would personally convert it to List. #Resolved

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason why you would prefer List here?

So the reason I would prefer a specific implementer of an interface rather than the interface itself is to avoid having to go through the virtual function table. (Which is why it's often better to write Foo<TBar>(TBar biz) where TBar : IBar vs. Foo(IBar biz).) In this specific case though, I don't think it's worth it. You're ultimately going to be dealing with an IEnumerator<TRow> anyway for most operations on this structure, and there it's unfortunately unavoidable, so I care a bit less. But perhaps there's something to this I'm not appreciating.


In reply to: 187205023 [](ancestors = 187205023)

Copy link
Contributor Author

@Ivanidzo4ka Ivanidzo4ka May 10, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess its just personal. I just don't like IList. In same time, IList allow you to pass Arrays, so this solve problem with Array support.


In reply to: 187207615 [](ancestors = 187207615,187205023)


public MemoryCollection(IEnumerable<TInput> collection)
{
Contracts.CheckParamValue(collection != null, collection, nameof(collection), "Must be non-null");
Copy link
Contributor

@TomFinley TomFinley May 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CheckParamValue(collection != null, collection [](start = 22, length = 46)

I'm sorry, I'm silly... I didn't mean CheckParamValue, I meant CheckValue. That'll simplify this a bit. #Closed

public MemoryCollectionPipelineStep(Var<IDataView> data)
{
Data = data;
Model = null;
Copy link
Contributor

@TomFinley TomFinley May 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unnecessary, the default value is null.

That's fine, but I'd go one step further... What I'd do is change the public Var<ITransformModel> Model { get; } below to public Var<ITransformModel> Model => null;, that way you avoid having any backing field whatsoever. (Plus you save three characters, which totally makes it worth it. :D ) #Closed

[Fact]
public void CanSuccessfullyApplyATransform()
{
var collection = new MemoryCollection<Input>(new List<Input>() { new Input { Number1 = 1, String1 = "1" } });
Copy link
Contributor

@TomFinley TomFinley May 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MemoryCollection [](start = 33, length = 23)

Just an observation... if we had instead structured this as a static utility method somewhere, then we could avoid having the double-specification of the Input class, as it would have been inferred by the compiler.

So: if we had a input type MyAwesomeInputType, then instead of MemoryCollection<MyAwesomeInputType>(new MyAwesomeInputType[] {...}), we would have MemoryCollection.Create(new MyAwesomeInputType[] {...}) since the compiler could have done the work of inferring the type and whatnot. #Closed

public class MemoryCollection
{
/// <summary>
/// Creates memory collection loader.
Copy link
Contributor

@glebuk glebuk May 10, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

loader [](start = 38, length = 6)

loader from IList #Closed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that not obvious from the parameter type? Perhaps adding a "from ." or somesuch would be fine (added to both), but even that I'd say would be a bit too needlessly verbose.


In reply to: 187221474 [](ancestors = 187221474)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's confusing that it says "Collection Loader" while it takes a List . perhaps we should rename it to a ListLoader. and split away from the StreamingEnumerableLoader


In reply to: 187230849 [](ancestors = 187230849,187221474)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

User no longer see class, it just get interface, is it better?


In reply to: 187238467 [](ancestors = 187238467,187230849,187221474)

}

/// <summary>
/// Creates memory collection loader.
Copy link
Contributor

@glebuk glebuk May 10, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

loader [](start = 38, length = 6)

loader from IEnumerable. #Closed

using Microsoft.ML.Runtime.EntryPoints;
using Microsoft.ML.Runtime.Internal.Utilities;

namespace Microsoft.ML
Copy link
Contributor

@glebuk glebuk May 10, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.ML [](start = 19, length = 3)

ML.Data #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TextLoader is part of just ML, should I change it as well?


In reply to: 187221578 [](ancestors = 187221578)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good question... seems they both should be in data. the argument went about like this - can a user has out of the box experience with just ML namespace.
I guess we can keep it in ML for now.


In reply to: 187221799 [](ancestors = 187221799,187221578)

Copy link
Member

@codemzs codemzs May 11, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please move to Data. My PR will move TextLoader to ML.Data. #Resolved

[assembly: LoadableClass(typeof(void), typeof(InMemoryDataView), null, typeof(SignatureEntryPointModule), "InMemoryDataView")]
namespace Microsoft.ML.Runtime.EntryPoints
{
public class InMemoryDataView
Copy link
Contributor

@glebuk glebuk May 10, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

InMemoryDataView [](start = 17, length = 16)

Where is this being used? I don't see any references to this class in code or tests. #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a entrypoint :) Data.DataViewReference it get used in MemoryCollection.cs (At least it entry point wrapper)


In reply to: 187222065 [](ancestors = 187222065)

Copy link
Contributor

@glebuk glebuk May 10, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, silly me as looking for "InMemoryDataView" not "Data.DataViewReference"


In reply to: 187222365 [](ancestors = 187222365,187222065)

@glebuk
Copy link
Contributor

glebuk commented May 11, 2018

Sorry, indeed I meant to add a different Eric!


In reply to: 388041663 [](ancestors = 388041663)


namespace Microsoft.ML.Data
{
public class CollectionLoader
Copy link
Contributor

@TomFinley TomFinley May 11, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

public class [](start = 4, length = 12)

Static class #Closed

@Ivanidzo4ka Ivanidzo4ka changed the title In memory loader (train on top of memory collection instead of loading data from file) CollectionDataSource (train on top of memory collection instead of loading data from file) May 11, 2018
@Ivanidzo4ka
Copy link
Contributor Author

Ivanidzo4ka commented May 11, 2018

@dotnet-bot test Windows_NT Release please

{
public sealed class Input
{
[Argument(ArgumentType.Required, ShortName = "data", HelpText = "Pointer to IDataView in memory", SortOrder = 1)]
Copy link
Contributor

@TomFinley TomFinley May 11, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ShortName = "data" [](start = 45, length = 18)

Since shortname is same as longname, you can safely omit. #Resolved

Copy link
Contributor

@TomFinley TomFinley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

new IrisData { SepalLength = 1f, SepalWidth = 1f ,PetalLength=0.3f, PetalWidth=5.1f, Label=1},
new IrisData { SepalLength = 1.2f, SepalWidth = 0.5f ,PetalLength=0.3f, PetalWidth=5.1f, Label=0}
};
var collection = CollectionDataSource.Create(data);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would we address the inconsistency in naming/usage between CollectionDataSource and TextLoader in a separate PR? We want it to be clearer these are the two (current) approaches to bringing data into a LearningPipeline.

@shauheen shauheen removed this from the 0518 milestone May 14, 2018
@Ivanidzo4ka
Copy link
Contributor Author

Ivanidzo4ka commented May 14, 2018

@glebuk is this review still in blocking state? If yes, what should be process of API review? #Closed

@glebuk
Copy link
Contributor

glebuk commented May 14, 2018

let me review it. Just give me 20 minutes.


In reply to: 388913155 [](ancestors = 388913155)


namespace Microsoft.ML.Data
{
public static class CollectionDataSource
Copy link
Contributor

@glebuk glebuk May 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CollectionDataSource [](start = 24, length = 20)

Please add a top level XML help - what is this class, where does it use, what's the purpose for its existance? #Closed

@glebuk
Copy link
Contributor

glebuk commented May 14, 2018

done. please look at 2 new comments


In reply to: 388992391 [](ancestors = 388992391,388913155)

Copy link
Contributor

@glebuk glebuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@Ivanidzo4ka
Copy link
Contributor Author

@dotnet-bot test OSX10.13 Debug please

Copy link
Contributor

@glebuk glebuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@Ivanidzo4ka Ivanidzo4ka merged commit 6d5a41d into dotnet:master May 15, 2018
@Ivanidzo4ka Ivanidzo4ka deleted the ivanidze/in_memory_loader branch May 15, 2018 00:26
@ntaherkhani
Copy link

"CollectionDataSource" is not recognized in my project.
"The name 'CollectionDataSource' does not exist in the current context" .
may somebody help me please.
Also, May please put some samples for other types of data loading except using "TextLoader". I need some samples to add data to pipeline from "DataTable".
Thanks.

@Ivanidzo4ka
Copy link
Contributor Author

Ivanidzo4ka commented May 23, 2018

@ntaherkhani Changes in this PR would be part of 0.2 release. Currently on nuget.org you can find only 0.1 release. You can consume nuget with current master branch from myget: https://dotnet.myget.org/F/dotnet-core/api/v3/index.json
Check "Include prerelease" in VS and search for Microsoft.ML package.

eerhardt pushed a commit to eerhardt/machinelearning that referenced this pull request Jul 27, 2018
…ading data from file) (dotnet#106)

* in memory loader

* add test file for memory collection

* even in afterlife EntryPointCatalog will chase me down.

* Address some comments.

* update tests

* address more comments.

* remove empty param description

* hide collectionloader

* refactor classes a little.

* pesky new lines!

* slightly better comments. but only slighty

* rename it

* make class static

* not a loader

* remove alias in entrypoint

* address comments
@ghost ghost locked as resolved and limited conversation to collaborators Mar 31, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants