Skip to content

Introduce VBufferEditor and hide VBuffer.Count #1580

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 14 commits into from
Nov 16, 2018

Conversation

eerhardt
Copy link
Member

@eerhardt eerhardt commented Nov 8, 2018

This PR executes on proposed changes (1) and (5) in #608 (comment).

  1. Keep Length, but hide/private Count
  1. The most drastic proposed change is how to actually mutate a VBuffer.

This introduces VBufferEditor, and uses it in almost all the places where VBuffers are mutated. There are a few stragglers that I will fix in a subsequent PR, but I thought this change was large enough, so I cut it off once I could make .Count private.

Working towards #608.

@@ -615,7 +615,7 @@ public static void ReconcileKeyValues(IHostEnvironment env, IDataView[] views, s
var keyNamesVBuffer = new VBuffer<ReadOnlyMemory<char>>(keyNames.Count, keyNames.Keys.ToArray());
ValueGetter<VBuffer<ReadOnlyMemory<char>>> keyValueGetter =
(ref VBuffer<ReadOnlyMemory<char>> dst) =>
dst = new VBuffer<ReadOnlyMemory<char>>(keyNamesVBuffer.Length, keyNamesVBuffer.Count, keyNamesVBuffer.Values, keyNamesVBuffer.Indices);
keyNamesVBuffer.CopyTo(ref dst);
Copy link
Contributor

@TomFinley TomFinley Nov 10, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

keyNamesVBuffer [](start = 24, length = 15)

Wow. This was a really bad bug. We shared the underlying output buffers among everyone that could have asked for this!!

The more I read of your PR the more enthusiastic I become. I had thought only of the effort it would save me and others going forward of having to catch all these potential misuses of VBuffer, but now I become alarmed at the sheer volume of stuff I seem to have missed, that will now get caught. :P Or more precisely, made impossible. #Closed

Float mean = Mean(src.Values, src.Count, src.Length);
Float divisor = StdDev(src.Values, src.Count, src.Length, mean);
var srcValues = src.GetValues();
Float mean = Mean(srcValues, src.Length);
Copy link
Contributor

@TomFinley TomFinley Nov 10, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Float [](start = 32, length = 5)

Hi @eerhardt, I notice here and elsewhere you've been trying to maintain this Float stuff. This was an artifact of something roughly four years ago where we tried to maintain separate "double" and "single" builds (this of course before we had IDataView and could in principle accomodate either where appropriate), and we've been opportunistically removing them, since the desired final state is we have none of this at all.

By this I mean you should choose whichever is less effort for you. If you find it's causing you effort to keep doing this Float stuff then please don't bother, it is intended to be removed in due time anyway. But if it's more effort to remove it feel free to keep using it. Just whichever you find more helpful to do, is all. #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noted. I haven't been going out of my way to change Float => float, but I have made a slightly conscious effort to keep existing code the same. My reasoning was that I didn't want unnecessary changes in my PRs.


In reply to: 232436492 [](ancestors = 232436492)

Copy link
Contributor

@TomFinley TomFinley Nov 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, probably makes merges/rebases easier. #Resolved

for (int j = 0; j < inputOpsValues.Length; j++)
inputOpsResult[j] = inputOpsValues[j].ToString();

yield return (name, opType.ToString(), type, inputOpsResult);
Copy link
Contributor

@TomFinley TomFinley Nov 10, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code appears to have a bug in it which has been carried over, that is, it appears to assume that the output vector was dense. I'm not sure you want to fix it or not, but if not, maybe file an issue? #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It appears that the place that populates this metadata always uses a dense vector. See

dst = new VBuffer<ReadOnlyMemory<char>>(op.NumInputs, inputOps);
.

I can add an assert here, if you'd like. If you want more of a "fix", I can log a bug. Up to you.


In reply to: 232436665 [](ancestors = 232436665)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about you just call DenseValues() in line 118 and fix the bug?..


In reply to: 232822331 [](ancestors = 232822331,232436665)

Copy link
Contributor

@TomFinley TomFinley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @eerhardt ! I'll take my extra ration of milk now... 😉

for (int i = 0; i < source.Length; i++)
{
if (predicate(source[i]))
{
Copy link
Contributor

@Zruty0 Zruty0 Nov 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

{ [](start = 16, length = 1)

remove curlies for 1-line clauses like this and below #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.


In reply to: 233650638 [](ancestors = 233650638)

/// with a larger physical value count than was needed
/// because the final value count was not known at creation time.
/// </param>
/// <returns></returns>
Copy link
Contributor

@Zruty0 Zruty0 Nov 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[](start = 12, length = 19)

remove empty #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixing.


In reply to: 233651783 [](ancestors = 233651783)

/// because the final value count was not known at creation time.
/// </param>
/// <returns></returns>
public VBuffer<T> CreateBuffer(int? physicalValuesCount = null)
Copy link
Contributor

@Zruty0 Zruty0 Nov 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CreateBuffer [](start = 26, length = 12)

should we assert that this only called once? We don't want it to be called multiple times, right?

Also, if we don't, should we call it something like Commit? I can see you already had the name 'Complete' before, and you left it inside a comment. #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not possible to assert it is only called once because VBufferMutationContext<T> is a ref struct (and actually I need to mark it as readonly). Since it is a ref struct, it shouldn't change state because you can get in a lot of trouble passing it into methods as parameters, or making copies of it on the stack.

I can change the name of CreateBuffer, if you'd like. Do you have a suggestion? Do you like Commit more than CreateBuffer? I didn't like Complete, but I can be persuaded.


In reply to: 233652648 [](ancestors = 233652648)

/// because the final value count was not known at creation time.
/// </param>
/// <returns></returns>
public VBuffer<T> CreateBuffer(int? physicalValuesCount = null)
Copy link
Contributor

@Zruty0 Zruty0 Nov 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

physicalValuesCount [](start = 44, length = 19)

how do we use this? #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are some cases that don't know the final "count" until they are done iterating through a list of mutations. So they allocate "the max necessary" upfront, but may change this value to a smaller count at the end. It isn't often we need this, but there are a handful of places that NEED it.

Here's an example:


In reply to: 233667829 [](ancestors = 233667829)

Array.Clear(values, 0, src.Length);
dst = new VBuffer<Float>(src.Length, values, dst.Indices);
var mutation = VBufferMutationContext.Create(ref dst, src.Length);
if (!mutation.CreatedNewValues) // We need to clear it
Copy link
Contributor

@Zruty0 Zruty0 Nov 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

t [](start = 73, length = 1)

you lost the period. #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OOPS. Will fix.


In reply to: 233669776 [](ancestors = 233669776)

@@ -211,7 +210,7 @@ public void DropSlots<TDst>(ref VBuffer<TDst> src, ref VBuffer<TDst> dst)
Contracts.Assert(index <= max);
}

dst = new VBuffer<TDst>(newLength, iiDst, values, indices);
dst = mutation.CreateBuffer(iiDst);
Copy link
Contributor

@Zruty0 Zruty0 Nov 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

iiDst [](start = 40, length = 5)

this looks like the only usage of this parameter.
Can we maybe have a separate version of CreateBuffer / Commit? Like, I don't know, CommitTruncated or something? #Resolved

Copy link
Member Author

@eerhardt eerhardt Nov 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't the only usage of the parameter. I pointed out one above in the other comment. And there are more, like in EnsembleUtils.cs. #Resolved

/// An object capable of mutation a <see cref="VBuffer{T}"/> by filling out
/// <see cref="Values"/> (and <see cref="Indices"/> if the buffer is not dense).
/// </summary>
public ref struct VBufferMutationContext<T>
Copy link
Contributor

@Zruty0 Zruty0 Nov 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

VBufferMutationContext [](start = 22, length = 22)

I'm not a big fan of this name. Is the discussion closed on naming?
Given the usage pattern, maybe we can call it VBufferBuilder or VBufferEditor? #Resolved

Copy link
Member Author

@eerhardt eerhardt Nov 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a big fan of the name either. I'm open to suggestions.

VBufferBuilder is already taken by BufferBuilder<T>. So I don't think we should go with that.

(BTW: some of these "design" discussions would have been great to have been had up front on #608, instead of in the PR 😝) #Resolved

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well, well. I didn't actually understand what you're planning to do until I saw it implemented.

BufferBuilder is internal anyway, right?


In reply to: 233672374 [](ancestors = 233672374)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would much rather rename BufferBuilder to something ridiculous like 'value accumulator', and have the good name for the more common (and public) use case.


In reply to: 233675246 [](ancestors = 233675246,233672374)

Copy link
Member Author

@eerhardt eerhardt Nov 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BufferBuilder is internal anyway, right?

It isn't right now.

public sealed class BufferBuilder<T>

I don't really view this thing as a "builder" anyway. For one there is no "Build" method on it. VBufferEditor would be more appropriate IMO. Do you like that?

  • VBufferMutation
  • VBufferMutationContext
  • VBufferEditor

Any others to consider? #Resolved

{
if (values.Count == 0)
var valueValues = values.GetValues();
Copy link
Contributor

@Zruty0 Zruty0 Nov 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

valueValues [](start = 16, length = 11)

valueValues goes against feng shui.
Maybe explicitValues or something? #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to explicitValues and explicitValuesCount


In reply to: 233672186 [](ancestors = 233672186)


// Get histogram of values
Array.Sort(valArray, 0, values.Count);
Contracts.Assert(valueBuffer.Length >= valuesCount);
valueValues.CopyTo(valueBuffer);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

valueValues.CopyTo(valueBuffer); [](start = 12, length = 32)

so, we haven't done this before. And FastTree.cs line 1463 actually assumes that bin finder is destructive. Maybe we should reconcile one way or another?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not following exactly what you want changed here.

I changed this for 2 reasons:

  1. You can no longer get an array from a VBuffer. So you can't call Array.Sort on it.
  2. Breaking the passed in VBuffer is pretty bad IMO. So I came up with a different way that doesn't destroy its usefulness. Upstream, we can get rid of the other copy, once this REVIEW comment is taken care of:

// REVIEW: Change this, as well as the bin finding code and bin upper bounds, to be Float instead of Double.

Once those are floats, then you no longer have to copy upstream. You can just pass in the values, and this method won't muck with the VBuffer.

@@ -279,8 +279,9 @@ private OlsLinearRegressionPredictor TrainCore(IChannel ch, FloatLabelCursor.Fac
ch.Check(FloatUtils.IsFinite(beta[i]), "Non-finite values detected in OLS solution");

var weights = VBufferUtils.CreateDense<float>(beta.Length - 1);
var weightsMutation = VBufferMutationContext.CreateFromBuffer(ref weights);
Copy link
Contributor

@Zruty0 Zruty0 Nov 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

weights [](start = 78, length = 7)

why create a VBuffer and immediately mutate? #Resolved

Copy link
Member Author

@eerhardt eerhardt Nov 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean, why not just create an array, mutate that, and then create the buffer from the array? I can change that. #Resolved

var vf = values as float[];
var vb = values as byte[];
Contracts.Assert(vf != null || vb != null);
Span<float> vf = typeof(TValue) == typeof(float) ? MemoryMarshal.Cast<TValue, float>(mutation.Values) : default;
Copy link
Contributor

@Zruty0 Zruty0 Nov 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

vf [](start = 36, length = 2)

I feel that a more common pattern is to have 2 getters: one for float case and one for byte case.
Maybe add a // REVIEW: split the getter into 2 specialized getters, one for float case and one for byte case. ? #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


In reply to: 233673911 [](ancestors = 233673911)


if (newCount == 0)
{
dst = new VBuffer<TDst>(0, dst.Values, dst.Indices);
dst = VBufferMutationContext.Create(ref dst, 0)
.CreateBuffer();
Copy link
Contributor

@Zruty0 Zruty0 Nov 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Create followed immediately by CreateBuffer seems to be a common pattern too. Maybe add VBufferMutationContext.Resize for that? #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That method seems like it would belong in VBufferUtils. I will add it there.


In reply to: 233679182 [](ancestors = 233679182)

@@ -4,6 +4,7 @@
<TargetFramework>netstandard2.0</TargetFramework>
<IncludeInPackage>Microsoft.ML</IncludeInPackage>
<DefineConstants>CORECLR</DefineConstants>
<AllowUnsafeBlocks>true</AllowUnsafeBlocks>
Copy link
Contributor

@Zruty0 Zruty0 Nov 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AllowUnsafeBlocks [](start = 5, length = 17)

just for one GEMV call... that's annoying. #Resolved

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually nevermind, it's going into HalLearners anyway


In reply to: 233679433 [](ancestors = 233679433)

writer.WriteLine("# Number of Ngrams terms = {0}", ngramsNames.Count);
for (int j = 0; j < ngramsNames.Count; j++)
writer.WriteLine("{0}\t{1}", j, ngramsNames.Values[j]);
var ngramNameValues = ngramsNames.GetValues();
Copy link
Contributor

@Zruty0 Zruty0 Nov 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ngramNameValues [](start = 32, length = 15)

explicitNgramNames? #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.


In reply to: 233680569 [](ancestors = 233680569)

var values = dst.Values;
if (terms.Count > 0)
var mutation = VBufferMutationContext.Create(ref dst, terms.Count);
for (int i = 0; i < terms.Count; i++)
{
Copy link
Contributor

@Zruty0 Zruty0 Nov 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

{ [](start = 24, length = 1)

remove #Resolved

Copy link
Contributor

@Zruty0 Zruty0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@eerhardt eerhardt merged commit cb9effc into dotnet:master Nov 16, 2018
@eerhardt eerhardt changed the title Introduce VBufferMutationContext and hide VBuffer.Count Introduce VBufferEditor and hide VBuffer.Count Nov 16, 2018
@eerhardt eerhardt deleted the MasterVBufferPlan branch November 16, 2018 02:43
@ghost ghost locked as resolved and limited conversation to collaborators Mar 26, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants