-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Introduce VBufferEditor and hide VBuffer.Count #1580
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@@ -615,7 +615,7 @@ public static void ReconcileKeyValues(IHostEnvironment env, IDataView[] views, s | |||
var keyNamesVBuffer = new VBuffer<ReadOnlyMemory<char>>(keyNames.Count, keyNames.Keys.ToArray()); | |||
ValueGetter<VBuffer<ReadOnlyMemory<char>>> keyValueGetter = | |||
(ref VBuffer<ReadOnlyMemory<char>> dst) => | |||
dst = new VBuffer<ReadOnlyMemory<char>>(keyNamesVBuffer.Length, keyNamesVBuffer.Count, keyNamesVBuffer.Values, keyNamesVBuffer.Indices); | |||
keyNamesVBuffer.CopyTo(ref dst); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
keyNamesVBuffer [](start = 24, length = 15)
Wow. This was a really bad bug. We shared the underlying output buffers among everyone that could have asked for this!!
The more I read of your PR the more enthusiastic I become. I had thought only of the effort it would save me and others going forward of having to catch all these potential misuses of VBuffer
, but now I become alarmed at the sheer volume of stuff I seem to have missed, that will now get caught. :P Or more precisely, made impossible. #Closed
Float mean = Mean(src.Values, src.Count, src.Length); | ||
Float divisor = StdDev(src.Values, src.Count, src.Length, mean); | ||
var srcValues = src.GetValues(); | ||
Float mean = Mean(srcValues, src.Length); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Float [](start = 32, length = 5)
Hi @eerhardt, I notice here and elsewhere you've been trying to maintain this Float
stuff. This was an artifact of something roughly four years ago where we tried to maintain separate "double" and "single" builds (this of course before we had IDataView
and could in principle accomodate either where appropriate), and we've been opportunistically removing them, since the desired final state is we have none of this at all.
By this I mean you should choose whichever is less effort for you. If you find it's causing you effort to keep doing this Float
stuff then please don't bother, it is intended to be removed in due time anyway. But if it's more effort to remove it feel free to keep using it. Just whichever you find more helpful to do, is all. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Noted. I haven't been going out of my way to change Float
=> float
, but I have made a slightly conscious effort to keep existing code the same. My reasoning was that I didn't want unnecessary changes in my PRs.
In reply to: 232436492 [](ancestors = 232436492)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, probably makes merges/rebases easier. #Resolved
for (int j = 0; j < inputOpsValues.Length; j++) | ||
inputOpsResult[j] = inputOpsValues[j].ToString(); | ||
|
||
yield return (name, opType.ToString(), type, inputOpsResult); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code appears to have a bug in it which has been carried over, that is, it appears to assume that the output vector was dense. I'm not sure you want to fix it or not, but if not, maybe file an issue? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It appears that the place that populates this metadata always uses a dense vector. See
dst = new VBuffer<ReadOnlyMemory<char>>(op.NumInputs, inputOps); |
I can add an assert here, if you'd like. If you want more of a "fix", I can log a bug. Up to you.
In reply to: 232436665 [](ancestors = 232436665)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about you just call DenseValues()
in line 118 and fix the bug?..
In reply to: 232822331 [](ancestors = 232822331,232436665)
5102889
to
33cac0c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @eerhardt ! I'll take my extra ration of milk now... 😉
for (int i = 0; i < source.Length; i++) | ||
{ | ||
if (predicate(source[i])) | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
{ [](start = 16, length = 1)
remove curlies for 1-line clauses like this and below #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// with a larger physical value count than was needed | ||
/// because the final value count was not known at creation time. | ||
/// </param> | ||
/// <returns></returns> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[](start = 12, length = 19)
remove empty #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// because the final value count was not known at creation time. | ||
/// </param> | ||
/// <returns></returns> | ||
public VBuffer<T> CreateBuffer(int? physicalValuesCount = null) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CreateBuffer [](start = 26, length = 12)
should we assert that this only called once? We don't want it to be called multiple times, right?
Also, if we don't, should we call it something like Commit
? I can see you already had the name 'Complete' before, and you left it inside a comment. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not possible to assert it is only called once because VBufferMutationContext<T>
is a ref struct
(and actually I need to mark it as readonly
). Since it is a ref struct, it shouldn't change state because you can get in a lot of trouble passing it into methods as parameters, or making copies of it on the stack.
I can change the name of CreateBuffer
, if you'd like. Do you have a suggestion? Do you like Commit
more than CreateBuffer
? I didn't like Complete
, but I can be persuaded.
In reply to: 233652648 [](ancestors = 233652648)
/// because the final value count was not known at creation time. | ||
/// </param> | ||
/// <returns></returns> | ||
public VBuffer<T> CreateBuffer(int? physicalValuesCount = null) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
physicalValuesCount [](start = 44, length = 19)
how do we use this? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are some cases that don't know the final "count" until they are done iterating through a list of mutations. So they allocate "the max necessary" upfront, but may change this value to a smaller count at the end. It isn't often we need this, but there are a handful of places that NEED it.
Here's an example:
dst = mutation.CreateBuffer(count); |
In reply to: 233667829 [](ancestors = 233667829)
Array.Clear(values, 0, src.Length); | ||
dst = new VBuffer<Float>(src.Length, values, dst.Indices); | ||
var mutation = VBufferMutationContext.Create(ref dst, src.Length); | ||
if (!mutation.CreatedNewValues) // We need to clear it |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
t [](start = 73, length = 1)
you lost the period. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -211,7 +210,7 @@ public void DropSlots<TDst>(ref VBuffer<TDst> src, ref VBuffer<TDst> dst) | |||
Contracts.Assert(index <= max); | |||
} | |||
|
|||
dst = new VBuffer<TDst>(newLength, iiDst, values, indices); | |||
dst = mutation.CreateBuffer(iiDst); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
iiDst [](start = 40, length = 5)
this looks like the only usage of this parameter.
Can we maybe have a separate version of CreateBuffer / Commit
? Like, I don't know, CommitTruncated
or something? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't the only usage of the parameter. I pointed out one above in the other comment. And there are more, like in EnsembleUtils.cs. #Resolved
/// An object capable of mutation a <see cref="VBuffer{T}"/> by filling out | ||
/// <see cref="Values"/> (and <see cref="Indices"/> if the buffer is not dense). | ||
/// </summary> | ||
public ref struct VBufferMutationContext<T> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
VBufferMutationContext [](start = 22, length = 22)
I'm not a big fan of this name. Is the discussion closed on naming?
Given the usage pattern, maybe we can call it VBufferBuilder
or VBufferEditor
? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not a big fan of the name either. I'm open to suggestions.
VBufferBuilder
is already taken by BufferBuilder<T>
. So I don't think we should go with that.
(BTW: some of these "design" discussions would have been great to have been had up front on #608, instead of in the PR 😝) #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
well, well. I didn't actually understand what you're planning to do until I saw it implemented.
BufferBuilder
is internal anyway, right?
In reply to: 233672374 [](ancestors = 233672374)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would much rather rename BufferBuilder
to something ridiculous like 'value accumulator', and have the good name for the more common (and public) use case.
In reply to: 233675246 [](ancestors = 233675246,233672374)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BufferBuilder is internal anyway, right?
It isn't right now.
public sealed class BufferBuilder<T> |
I don't really view this thing as a "builder" anyway. For one there is no "Build" method on it. VBufferEditor
would be more appropriate IMO. Do you like that?
- VBufferMutation
- VBufferMutationContext
- VBufferEditor
Any others to consider? #Resolved
{ | ||
if (values.Count == 0) | ||
var valueValues = values.GetValues(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
valueValues [](start = 16, length = 11)
valueValues
goes against feng shui.
Maybe explicitValues
or something? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
||
// Get histogram of values | ||
Array.Sort(valArray, 0, values.Count); | ||
Contracts.Assert(valueBuffer.Length >= valuesCount); | ||
valueValues.CopyTo(valueBuffer); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
valueValues.CopyTo(valueBuffer); [](start = 12, length = 32)
so, we haven't done this before. And FastTree.cs line 1463 actually assumes that bin finder is destructive. Maybe we should reconcile one way or another?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not following exactly what you want changed here.
I changed this for 2 reasons:
- You can no longer get an array from a
VBuffer
. So you can't callArray.Sort
on it. - Breaking the passed in
VBuffer
is pretty bad IMO. So I came up with a different way that doesn't destroy its usefulness. Upstream, we can get rid of the other copy, once thisREVIEW
comment is taken care of:
// REVIEW: Change this, as well as the bin finding code and bin upper bounds, to be Float instead of Double. |
Once those are floats, then you no longer have to copy upstream. You can just pass in the values, and this method won't muck with the VBuffer.
@@ -279,8 +279,9 @@ private OlsLinearRegressionPredictor TrainCore(IChannel ch, FloatLabelCursor.Fac | |||
ch.Check(FloatUtils.IsFinite(beta[i]), "Non-finite values detected in OLS solution"); | |||
|
|||
var weights = VBufferUtils.CreateDense<float>(beta.Length - 1); | |||
var weightsMutation = VBufferMutationContext.CreateFromBuffer(ref weights); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
weights [](start = 78, length = 7)
why create a VBuffer
and immediately mutate? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mean, why not just create an array, mutate that, and then create the buffer from the array? I can change that. #Resolved
var vf = values as float[]; | ||
var vb = values as byte[]; | ||
Contracts.Assert(vf != null || vb != null); | ||
Span<float> vf = typeof(TValue) == typeof(float) ? MemoryMarshal.Cast<TValue, float>(mutation.Values) : default; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
vf [](start = 36, length = 2)
I feel that a more common pattern is to have 2 getters: one for float case and one for byte case.
Maybe add a // REVIEW: split the getter into 2 specialized getters, one for float case and one for byte case.
? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
||
if (newCount == 0) | ||
{ | ||
dst = new VBuffer<TDst>(0, dst.Values, dst.Indices); | ||
dst = VBufferMutationContext.Create(ref dst, 0) | ||
.CreateBuffer(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Create
followed immediately by CreateBuffer
seems to be a common pattern too. Maybe add VBufferMutationContext.Resize
for that? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That method seems like it would belong in VBufferUtils
. I will add it there.
In reply to: 233679182 [](ancestors = 233679182)
@@ -4,6 +4,7 @@ | |||
<TargetFramework>netstandard2.0</TargetFramework> | |||
<IncludeInPackage>Microsoft.ML</IncludeInPackage> | |||
<DefineConstants>CORECLR</DefineConstants> | |||
<AllowUnsafeBlocks>true</AllowUnsafeBlocks> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AllowUnsafeBlocks [](start = 5, length = 17)
just for one GEMV call... that's annoying. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually nevermind, it's going into HalLearners anyway
In reply to: 233679433 [](ancestors = 233679433)
writer.WriteLine("# Number of Ngrams terms = {0}", ngramsNames.Count); | ||
for (int j = 0; j < ngramsNames.Count; j++) | ||
writer.WriteLine("{0}\t{1}", j, ngramsNames.Values[j]); | ||
var ngramNameValues = ngramsNames.GetValues(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ngramNameValues [](start = 32, length = 15)
explicitNgramNames
? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
var values = dst.Values; | ||
if (terms.Count > 0) | ||
var mutation = VBufferMutationContext.Create(ref dst, terms.Count); | ||
for (int i = 0; i < terms.Count; i++) | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
{ [](start = 24, length = 1)
remove #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
…he underlying arrays in the column buffer in Transposer. Also enable a Transposer test, since it passes.
0a9747b
to
1cb1411
Compare
This PR executes on proposed changes (1) and (5) in #608 (comment).
This introduces
VBufferEditor
, and uses it in almost all the places where VBuffers are mutated. There are a few stragglers that I will fix in a subsequent PR, but I thought this change was large enough, so I cut it off once I could make.Count
private.Working towards #608.