What to do with VBuffer? #608
As for the
To recap an online discussion: we should separate the two concepts of VBuffer:
For the first one, the plan is to get rid of polymorphous behavior and formally separate the two representations. For the second one, the current approach is fine.
Isn't this something where
The closest thing to that in the BCL is
Another scenario we have for this is wrapping memory that's owned by someone else. Consider the TensorFlow case. TF will allocate tensors (and unfortunately doesn't seem to provide a hook for callback allocation or pre-allocation of output tensors), and the ownership of these tensors is transferred to the caller. For Python they are able to wrap the output tensor in an ndarray, since ndarray permits wrapping memory: https://docs.scipy.org/doc/numpy-1.13.0/user/c-info.how-to-extend.html#c.PyArray_SimpleNewFromData. Unfortunately the tensor/vector type in ML.NET cannot represent data it doesn't own in the same way.

To address this scenario, consider changing the type that ML.NET exposes for tensor/vector data to something that can wrap a native pointer (like Memory/Span), or modify VBuffer itself so that it can represent data backed by a native pointer. I started experimenting with this idea a few months back with Tensor. I do think we can try to merge the ideas of VBuffer with Tensor. We may find that both have good ideas and we end up with a hybrid.
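To make the native-memory-wrapping idea concrete, here is a sketch using the BCL's MemoryManager&lt;T&gt; (System.Buffers), which allows a native pointer to surface as an ordinary Memory&lt;T&gt; without copying. The NativeTensorMemory name and the freeHandle callback are assumptions for illustration, not ML.NET or TensorFlow API:

```csharp
using System;
using System.Buffers;

// Hypothetical sketch: wrapping externally owned native memory (e.g. a
// TensorFlow-allocated tensor) as Memory<T> via a custom MemoryManager<T>.
internal unsafe sealed class NativeTensorMemory<T> : MemoryManager<T>
    where T : unmanaged
{
    private readonly T* _pointer;
    private readonly int _length;
    private readonly Action _freeHandle; // releases the native tensor

    public NativeTensorMemory(T* pointer, int length, Action freeHandle)
    {
        _pointer = pointer;
        _length = length;
        _freeHandle = freeHandle;
    }

    // Expose the native memory as a Span<T> without copying.
    public override Span<T> GetSpan() => new Span<T>(_pointer, _length);

    // Native memory is already unmovable, so pinning is a no-op.
    public override MemoryHandle Pin(int elementIndex = 0)
        => new MemoryHandle(_pointer + elementIndex);

    public override void Unpin() { }

    protected override void Dispose(bool disposing) => _freeHandle();
}
```

A consumer could then hand out the manager's inherited `Memory` property as a zero-copy view over the tensor, releasing the native allocation when the manager is disposed.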
I quickly searched through the current ML.NET APIs, and it does not seem like there are any APIs that return VBuffer (except some inconsequential factory methods). And so it seems like we could use
- Allow it to take Spans instead of arrays.
- Remove redundant overloads.
- When multiple spans are accepted, always use an explicit count parameter instead of one being chosen as having the right length.

Working towards dotnet#608
* Refactor CpuMathUtils
  - Allow it to take Spans instead of arrays.
  - Remove redundant overloads.
  - When multiple spans are accepted, always use an explicit count parameter instead of one being chosen as having the right length.
  Working towards #608
* Use MemoryMarshal.GetReference to avoid perf hit when pinning Span.
* PR feedback
Issues

Here are the problems with the current VBuffer design:
See the Golden Rules section of the Care and Feeding doc as well. The proposal below will either fix or drastically reduce 4 out of the 5 rules listed there. (The exception is the "two distinct VBuffers sharing references to their internal arrays" rule.)

Proposal

Here are the proposed design changes I've come up with. Please provide any feedback on the design/names/etc.
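As a hedged illustration of the read-only side of the proposal (the GetValues()/GetIndices() pattern used in the examples further down), here is a sketch. VBufferSketch, its field layout, and its constructor are assumptions for illustration, not the real VBuffer:

```csharp
using System;

// Sketch of span-returning read-only accessors. Callers get a view over
// exactly the valid elements and cannot capture or mutate the arrays.
public readonly struct VBufferSketch<T>
{
    private readonly T[] _values;    // backing buffer, may be larger than _count
    private readonly int[] _indices; // used only when the vector is sparse
    private readonly int _count;     // number of explicitly represented values

    public int Length { get; }       // logical vector length
    public bool IsDense => _count == Length;

    public VBufferSketch(int length, int count, T[] values, int[] indices)
    {
        Length = length;
        _count = count;
        _values = values;
        _indices = indices;
    }

    public ReadOnlySpan<T> GetValues() => _values.AsSpan(0, _count);

    // Dense vectors have no explicit indices, so return an empty span.
    public ReadOnlySpan<int> GetIndices()
        => IsDense ? default : _indices.AsSpan(0, _count);
}
```

Returning ReadOnlySpan rather than the raw arrays means the "valid count" can never be confused with the array length, which is the source of several of the pitfalls listed above.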
VBuffer mutation

There are generally 2 types of VBuffer mutation:
To solve both of these scenarios, we will introduce a new concept: a VBufferEditor (*).
(*) This is stack-only so (a) it can cache the Span properties and (b) no one can hold onto an instance of the editor outside of the stack. The full mutation must happen immediately.

The following is a listing of the proposed public API:

```csharp
public static class VBufferEditor
{
    public static VBufferEditor<T> Create<T>(
        ref VBuffer<T> destination,
        int newLogicalLength,
        int? valuesCount = null,
        bool keepOldOnResize = false);
}

public ref struct VBufferEditor<T>
{
    public readonly Span<T> Values;
    public readonly Span<int> Indices;
    public bool CreatedNewValues;
    public bool CreatedNewIndices;

    public VBuffer<T> Commit();
}
```

Here are examples of the typical usages from above:
```csharp
public static void ScaleBy(ref VBuffer<Float> dst, Float c)
{
    var count = dst.GetValues().Length;
    if (c == 1 || count == 0)
        return;
    var editor = VBufferEditor.Create(ref dst, dst.Length, count);
    if (c != 0)
        CpuMathUtils.Scale(c, editor.Values);
    else // Maintain density of dst.
        editor.Values.Clear();
}

public static void ScaleBy(in VBuffer<Float> src, ref VBuffer<Float> dst, Float c)
{
    ...
    var srcValues = src.GetValues();
    if (src.IsDense)
    {
        // Maintain the density of src to dst in order to avoid slowing down L-BFGS.
        var editor = VBufferEditor.Create(ref dst, src.Length);
        if (c == 0)
            editor.Values.Clear();
        else
            CpuMathUtils.Scale(c, srcValues, editor.Values, src.Length);
        dst = editor.Commit();
    }
    else
    {
        var editor = VBufferEditor.Create(ref dst, src.Length, srcValues.Length);
        src.GetIndices().CopyTo(editor.Indices);
        if (c == 0)
            editor.Values.Clear();
        else
            CpuMathUtils.Scale(c, srcValues, editor.Values, srcValues.Length);
        dst = editor.Commit();
    }
}
```

Notice that to create a VBufferEditor you need a `ref` to the destination VBuffer.

Alternative Approaches Considered

As part of my prototyping and investigation I considered the following approaches, but they were rejected:
Hello @eerhardt, thank you very much for writing this up. I am strongly in favor of the changes you have outlined. Let us therefore talk mostly about these parameters:

```csharp
out bool createdNewValues,
out bool createdNewIndices,
bool keepOldOnResize = false
```

I am somewhat curious about these parameters, because I'm wondering if we can somehow do away with them. It seems like keeping old on resize matters in relatively few places, the clearest being densification.
Here we have a strong difference in implementation between the case where new data is created vs. old data, since we are modifying things in place. I wonder, however, if this is the only case where the difference in implementation is so stark, since relatively few operations are in-place in this fashion. If this is really the only place it is important, I might consider just making densify in the same assembly and considering it a privileged operation.

Let us now talk about this guy:

```csharp
int? maxValuesCapacity = null,
```

I think what this is meant to accomplish is to act as a hint about capacity.
However, in this case at least, I think we can simplify the story a lot. The purpose of that argument is to distinguish between the case where we anticipate that future allocations over the same operation might grow the buffer even further, and cases where we anticipate that future allocations over the same operation will not. So for example: when tokenizing text, we would anticipate that we might have larger sentences/text in the future, so allocating a bit of extra space would not be wasted. Contrariwise, if we have a fixed-size vector type, the getter in the dense case will never request a larger size than this, so it would be pointless to ever request more; this provides a hint. This does, I suppose, represent a loss of capability, since there are places in the code, especially

machinelearning/src/Microsoft.ML.Core/Utilities/VBufferUtils.cs Lines 388 to 389 in d262171
But honestly, I'm not sure this is really worth the extra complexity. If that actually winds up making any difference I'll eat my hat. :P So I might say the only thing that is helpful to distinguish is this case of "I won't need more than this" vs. "I might need more than this," which could be accomplished with a simple flag.
I think this will address the common usage patterns. This would also allow us to get rid of this parameter, I hope?

Other Questions

I had some other questions, each rather simpler than the above. Also, how is a
When calling
Is it considered an error condition for
One thing that occurred to me: might these pieces of auxiliary information, like the
Thanks for the comments, @TomFinley. They are appreciated. I've updated the proposal above as appropriate, and am responding to your questions/comments below:
This option is important when expanding the buffer to make room for new elements. Here are two cases that do this:

machinelearning/src/Microsoft.ML.Core/Utilities/VBufferUtils.cs Lines 478 to 481 in 5e08fa1
machinelearning/src/Microsoft.ML.Data/Depricated/Vector/VBufferMathUtils.cs Lines 278 to 291 in bee7f17
That way the caller doesn't have to do the copying themselves.
Correct.

machinelearning/src/Microsoft.ML.Data/Depricated/Vector/VBufferMathUtils.cs Lines 371 to 372 in bee7f17
So I thought it would be important to preserve this option, somehow. I just looked, and I haven't gotten to a place yet where I have used
Yes, I can totally make that change. It should clean the API up.
Yes, this gets passed into the There appears to be a mix of usages here in the existing code. Some specify the
and some let it default to

machinelearning/src/Microsoft.ML.Data/Data/RowCursorUtils.cs Lines 448 to 449 in 8e77d0f
Part of me thinks keeping the parameter is worthwhile. After some thinking, though, my gut is telling me to drop it and let the array double (your proposal above) for now. If we run into problems, we can add an overload/default parameter in the future. Thoughts on that plan?
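The "let the array double" growth policy under discussion can be sketched as follows. EnsureCapacity is a hypothetical helper written for illustration; it is not the actual Utils.EnsureSize implementation:

```csharp
using System;

internal static class BufferGrowth
{
    // Returns an array with at least `required` elements, doubling the
    // current capacity so repeated growth is amortized O(1) per element.
    public static T[] EnsureCapacity<T>(T[] array, int required, bool keepOld)
    {
        int current = array?.Length ?? 0;
        if (current >= required)
            return array;

        int newSize = Math.Max(current, 4);
        while (newSize < required)
            newSize *= 2;

        var newArray = new T[newSize];
        // Preserve existing contents only when the caller asked for it,
        // mirroring the keepOldOnResize option in the proposal above.
        if (keepOld && array != null)
            Array.Copy(array, newArray, current);
        return newArray;
    }
}
```

With a policy like this, a per-call capacity hint buys little: the doubling already bounds the number of reallocations over the lifetime of a reused buffer.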
I wasn't planning on changing the existing constructors to
I think it makes sense to change this to return the buffer instead. After that, we can "poison" the mutation context, and throw an exception on any further use.
I'm still not convinced we should do that. I think it is surprising to a caller. And there are some cases where we'd like to still access the VBuffer after creating a mutation context, ex:
It also isn't fool-proof, because someone could copy the VBuffer on the stack, and even if we blanked out the original, the copy would still reference the same arrays.

Would the above plan to have the commit operation return the buffer work for you?
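The "poison after commit" idea discussed above could look roughly like the following sketch. The type, field names, and exception messages are assumptions for illustration, not the actual ML.NET implementation:

```csharp
using System;

// Sketch: a stack-only editor that becomes unusable once committed.
public ref struct PoisonableEditor<T>
{
    private T[] _values;     // backing array being edited
    private bool _committed; // set once Commit has run

    public PoisonableEditor(T[] values)
    {
        _values = values;
        _committed = false;
    }

    public Span<T> Values
    {
        get
        {
            // Any use after Commit throws, so stale edits are caught early.
            if (_committed)
                throw new InvalidOperationException("Editor already committed.");
            return _values;
        }
    }

    public T[] Commit()
    {
        if (_committed)
            throw new InvalidOperationException("Editor already committed.");
        _committed = true;
        var result = _values;
        _values = null; // drop our reference so the editor can't leak it
        return result;
    }
}
```

Because the editor is a ref struct, a copy can only live on the stack for the duration of the method, which limits (though, as noted above, does not eliminate) the damage a copied editor can do.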
Use the new methods in place of Count and Values/Indices in as many readonly situations as possible. Working towards dotnet#608
* Introduce VBuffer GetValues and GetIndices. Use the new methods in place of Count and Values/Indices in as many readonly situations as possible. Working towards #608
* Update the rest of the BinaryWriter utils to use Span.
Remove last remaining usage in VectorUtils by forking CoreCLRs ArraySort algorithm to work with Spans. Fix dotnet#608
* Remove all VBuffer.Values usages in non-core assemblies.
* Remove VBuffer.Values usages in ML.Api.
* Remove all VBuffer.Values usages in ML.Transforms.
* Remove VBuffer.Values usages in ML.StandardLearners.
* Remove the rest of the relatively easy VBuffer.Values and Indices usages.
* TypedCursor handles VBuffer<T> to T[] correctly.
* Fix null ref in my SDCA changes.
* Add back clearing out tempGrad in DifferentiableFunctionAggregator.
* Remove VBuffer Values and Indices. Remove last remaining usage in VectorUtils by forking CoreCLR's ArraySort algorithm to work with Spans.

Fix #608
VBuffer is a very critical, fundamental type in ML.NET. As such, it has some critical performance considerations and characteristics it needs to adhere to. However, it is not as intuitive of a type as it could be. We even wrote a Care and Feeding doc that describes some common pitfalls and other things to be aware of when using the type.

At its core, VBuffer has 2 responsibilities. We should do some investigation into what types of improvements we can make to this type.

- Can we use ArrayPool to take care of the "buffer"-ness? (see https://github.com/dotnet/corefx/issues/4547)
- Should we split VBuffer into two types, SparseVector and DenseVector?

/cc @TomFinley @Zruty0