RFC: Kernel C API extension for Variable ops. #387
Conversation
The RFC will be open for comment until Tuesday, May 18, 2021.

| Status        | Proposed                                              |
| :------------ | :---------------------------------------------------- |
| **RFC #**     | (RFC link)                                            |
| **Author(s)** | Kulin Seth (Apple), Charles Brissart (Apple)          |
| **Sponsor**   | Saurabh Saksena ([email protected])                  |
| **Updated**   | 2021-05-04                                            |

The proposal extends the current [Kernel C API](https://github.com/tensorflow/community/blob/master/rfcs/20190814-kernel-and-op-registration.md) to enable plugin writers to add support for:

* Optimizer operations like SGD, Adam, etc.
* Variable updates like Assign, AssignUpdate
If you are interested in attending the design review meeting for this RFC, here are the call details:

Time: Tuesday, May 18th, 2021, from 10:00–11:00am PT

Meeting participants are expected to read the RFC ahead of time, as the meeting will focus on the remaining issues/questions.
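For context on how a plugin kernel would consume the proposed API, here is a hedged usage sketch of an AssignVariableOp built on the draft `TF_AssignVariable` declaration quoted later in this review. The exact parameter list (the value index, the trailing status output) and the header location are assumptions, not the final API; `PluginCopyTensor` and `AssignVariableCompute` are illustrative plugin-side names.

#include "tensorflow/c/kernels.h"     // TF_OpKernelContext, TF_OpKernelContext_Failure
#include "tensorflow/c/tf_status.h"   // TF_Status, TF_GetCode
// Assumed location of the new declarations once the RFC lands:
// #include "tensorflow/c/kernels_experimental.h"

namespace {

// Device-specific copy supplied by the plugin (e.g. a stream-ordered
// device-to-device memcpy). The signature follows the copyFunc parameter in
// the draft TF_AssignVariable declaration.
void PluginCopyTensor(TF_OpKernelContext* ctx, TF_Tensor* source, TF_Tensor* dest) {
  // ... enqueue a copy of `source` into `dest` on the plugin device ...
}

// Compute function registered for AssignVariableOp on the pluggable device.
void AssignVariableCompute(void* kernel, TF_OpKernelContext* ctx) {
  TF_Status* status = TF_NewStatus();
  // Input 0 is the resource handle, input 1 is the value to assign.
  TF_AssignVariable(ctx, /*input_index=*/0, /*value_index=*/1,
                    &PluginCopyTensor, status);
  if (TF_GetCode(status) != TF_OK) {
    TF_OpKernelContext_Failure(ctx, status);
  }
  TF_DeleteStatus(status);
}

}  // namespace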
TF_VariableInputLockHolder** lockHolder,
TF_Status* status);

// This interface returns out tensor which is updated corresponding to the
It is great to see the RFC! I have a question here: for `TF_GetInputTensorFromVariable`, I think it needs a parameter to pass the datatype. `GetInputTensorFromVariable` is a template function with datatype T; it calls `EnsureSparseVariableAccess`, and `EnsureSparseVariableAccess` uses datatype T to check whether it is the `Variant` datatype (`TF_MaybeLockVariableInputMutexesInOrder` may also need to take the datatype for the same reason).
template <typename Device, typename T>
Status GetInputTensorFromVariable(OpKernelContext* ctx, int input,
bool lock_held, bool sparse, Tensor* out) {
if (ctx->input_dtype(input) == DT_RESOURCE) {
core::RefCountPtr<Var> var;
TF_RETURN_IF_ERROR(LookupResource(ctx, HandleFromInput(ctx, input), &var));
if (sparse) {
TF_RETURN_IF_ERROR(EnsureSparseVariableAccess<Device, T>(ctx, var.get()));
*out = *var->tensor();
return Status::OK();
}
TF_RETURN_IF_ERROR(PrepareToUpdateVariable<Device, T>(
ctx, var->tensor(), var->copy_on_read_mode.load()));
*out = *var->tensor();
return Status::OK();
}
*out = ctx->mutable_input(input, lock_held);
return Status::OK();
}
template <typename Device, typename T>
Status EnsureSparseVariableAccess(OpKernelContext* ctx, Var* var) {
if (var->copy_on_read_mode.load()) {
return Status::OK();
}
mutex_lock ml(*var->mu());
// Once copy-on-read mode is True the refcount is guaranteed to be 1. This can
// also happen if there are no concurrent reads of the variable and
// copy-on-read mode is false.
if (var->tensor()->RefCountIsOne()) {
var->copy_on_read_mode.store(true);
return Status::OK();
}
PersistentTensor unused;
Tensor* tmp;
if (std::is_same<T, Variant>::value) {
AllocatorAttributes attr;
attr.set_on_host(true);
TF_RETURN_IF_ERROR(ctx->allocate_persistent(
var->tensor()->dtype(), var->tensor()->shape(), &unused, &tmp, attr));
@jzhoulon Thanks for the review! That's a good catch; we were initially focusing on the non-sparse operations. We can add a bool `isVariantType` to check for Variant-type operations, like we did for `TF_AssignUpdateVariable`. We need that to be passed to `PrepareToUpdateVariable`.
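For concreteness, here is a sketch of how such a flag could surface in the declaration. Names, types, and parameter order are illustrative only and may differ from the final API.

// Illustrative draft only. The extra isVariantType flag would let the core
// dispatch PrepareToUpdateVariable / EnsureSparseVariableAccess with
// Variant-aware (host) allocation, as in the template code quoted above.
TF_CAPI_EXPORT extern void TF_GetInputTensorFromVariable(TF_OpKernelContext* ctx,
                                                         int input_index,
                                                         bool lock_held,
                                                         bool isVariantType,
                                                         bool sparse,
                                                         TF_Tensor** out,
                                                         TF_Status* status);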
// the input and value tensors. It also accepts the copy functor provided by
// pluggable vendor to do the copying of the tensors.
TF_CAPI_EXPORT extern void TF_AssignVariable(TF_OpKernelContext* ctx,
                                             int input_index,
Does this mean each C API corresponds to an Op implementation? If so, will this RFC also provide C APIs for other resource-related ops, such as ReadVariableOp, TensorArrayReadOp, ScatterNdUpdateOp, ...?
Hi @jzhoulon, we looked into the other ResourceVariableOps:

- `ReadVariableOp` is already part of the DEVICE_DEFAULT implementation.
- `TensorArrayReadOp`: we haven't looked at this in detail. From a first look, the only device-dependent part is the use of LockedRead, where we can memset the Tensor to 0.
- AssignVariable and AssignVariableUpdate are handled as implementations in the interface.
- ResourceGather and ResourceScatter follow the pattern of:
void Compute(OpKernelContext* c) override {
  core::RefCountPtr<Var> v;
  OP_REQUIRES_OK(c, LookupResource(c, HandleFromInput(c, 0), &v));
  OP_REQUIRES_OK(c, EnsureSparseVariableAccess<Device, T>(c, v.get()));
  if (is_non_pod_dtype || use_exclusive_lock_) {
    mutex_lock ml(*v->mu());
    // << Compute functor performing the operation >>
  } else {
    tf_shared_lock ml(*v->mu());
    // << Compute functor performing the operation >>
  }
}
We implemented these operations such that the `RefCountPtr<Var>` handling (LookupResource and EnsureSparseVariableAccess, with locking) is covered by `TF_GetInputResourceVariable` and `TF_MaybeLockVariableInputMutexesInOrder`. Having said that, if there are other helper functions you think are needed, we can definitely expose them here.
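As an illustration of that mapping, here is a hedged sketch of a plugin-side ResourceGather Compute. The prototypes are declared locally as *assumed simplifications* (the RFC's draft declarations take more parameters, e.g. copy functors and variant flags); `TF_ReleaseVariableInputLockHolder` is likewise an assumed release call paired with the lock helper.

#include "tensorflow/c/kernels.h"
#include "tensorflow/c/tf_status.h"
#include "tensorflow/c/tf_tensor.h"

// Assumed, simplified prototypes for the proposed helpers (illustrative only).
typedef struct TF_VariableInputLockHolder TF_VariableInputLockHolder;
extern void TF_MaybeLockVariableInputMutexesInOrder(
    TF_OpKernelContext* ctx, bool sparse, const int* input_ids, int num_inputs,
    TF_VariableInputLockHolder** lockHolder, TF_Status* status);
extern void TF_GetInputTensorFromVariable(TF_OpKernelContext* ctx, int input_index,
                                          bool lock_held, bool sparse,
                                          TF_Tensor** out, TF_Status* status);
extern void TF_ReleaseVariableInputLockHolder(TF_VariableInputLockHolder* lockHolder);

// Mirrors the core pattern above: LookupResource + EnsureSparseVariableAccess
// become the lock/fetch helpers; the gather functor itself stays in the plugin.
void ResourceGatherCompute(void* kernel, TF_OpKernelContext* ctx) {
  TF_Status* status = TF_NewStatus();
  TF_VariableInputLockHolder* locks = nullptr;
  const int var_input[] = {0};  // input 0 holds the resource variable

  TF_MaybeLockVariableInputMutexesInOrder(ctx, /*sparse=*/true, var_input, 1,
                                          &locks, status);

  TF_Tensor* params = nullptr;  // dense tensor backing the variable
  TF_GetInputTensorFromVariable(ctx, /*input_index=*/0, /*lock_held=*/true,
                                /*sparse=*/true, &params, status);

  // ... launch the device-specific gather of `params` at the given indices ...

  TF_ReleaseVariableInputLockHolder(locks);
  if (params != nullptr) TF_DeleteTensor(params);
  TF_DeleteStatus(status);
}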
Hi @kulinseth. We have gone through the resource-related ops and found that the ops below use device-specific functions such as concat, split, and so on, not just memset. We roughly put them in the table below. Will this RFC consider these ops, like the variable ops, and add new high-level APIs?
| Resource | Ops |
|---|---|
| TensorArray | TensorArrayWrite (V2, V3) |
| | TensorArrayRead (V2, V3) |
| | TensorArrayPack |
| | TensorArrayGather (V2, V3) |
| | TensorArrayConcat (V2, V3) |
| | TensorArrayUnpack |
| | TensorArrayScatter (V2, V3) |
| | TensorArraySplit (V2, V3) |
| UnBatchResource | UnbatchKernel |
| | UnbatchGradKernel |
Hi @yanzhang-dev, thanks for the list and feedback. We haven't looked at the TensorArray ops in detail, but it seems we would need more than device-specific functionality. We might need to expose `TF_TensorArray` along with the necessary methods, like `TF_GetTensorArray`. This might need a separate RFC. Similarly for UnbatchResource:
// A class encapsulating the state and logic for unbatching tensors.
//
// UnbatchResource keeps two data structures indexed by batch-key: one which has
// the continuations for all concurrent kernels which are waiting for tensors
// and another which has tensors which are waiting for their corresponding
// kernels to run. Whenever a kernel runs, we either grab its tensor if it's
// waiting already, or we insert it in the queue and then look at its tensor to
// see if it can be used to dispatch any stored continuations.
class UnbatchResource : public ResourceBase {
public:
explicit UnbatchResource(int32 timeout_micros)
: timeout_micros_(timeout_micros),
timeout_enforcer_(new serving::PeriodicFunction(
[this] { EnforceTimeout(); }, 1000 /* 1 ms */)) {}
~UnbatchResource() override {
// Tear down 'timeout_enforcer_' first, since it accesses other state in
// this class.
timeout_enforcer_ = nullptr;
}
I agree we can probably expose `TF_ResourceBase`, `TF_ResourceManager`, and others, and potentially recreate all this logic on the plugin side.

On a separate note, I was curious about these operations: in which networks did you encounter them in performance-critical sections?
TF_Tensor *dest),
void (*updateFunc)(TF_OpKernelContext *ctx,
                   TF_Tensor *tensor,
                   TF_Tensor *value, int Op),
I think the `int Op` argument here is not needed, since updateFunc should already carry the Op semantics, given that TF_AssignUpdateVariable is called on the plugin side.
The update functor would need to distinguish between `Add` and `Sub` operations. For instance:
enum DenseUpdateType { ADD, SUB, ASSIGN };

void updateFunc(TF_OpKernelContext *ctx, TF_Tensor *tf_var_tensor,
                TF_Tensor *tf_value, int Op) {
  ...
  if (Op == ADD) {
    ...
  } else if (Op == SUB) {
    ...
  }
}
I mean the updateFunc itself can carry the semantics, e.g. an add_func or sub_func, and when we implement the kernel we pass the corresponding functor to TF_AssignUpdateVariable. Anyway, I have no strong opinion on this.

template <DenseUpdateType Op>
void AssignUpdateVariableCompute() {
  if (Op == ADD)
    TF_AssignUpdateVariable(add_func);
  else if (Op == SUB)
    TF_AssignUpdateVariable(sub_func);
}
Sure that's doable. If you prefer, we can change it.
+1 to removing the `op` arg.
Thanks for putting this together! Looks good to me overall.
int value_index,
int Op,
int isVariantType,
void (*copyFunc)(TF_OpKernelContext * ctx,
Could you explain why we need both `copyFunc` and `updateFunc`?
You are right, we can use the `ASSIGN` case in updateFunc to implement copyFunc. I will clean it up in the PR.
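A small sketch of the simplification being agreed on here: a single update functor whose ASSIGN branch doubles as the copy path, so a separate copyFunc argument becomes unnecessary. The enum mirrors the DenseUpdateType example earlier in this thread; `PluginUpdateTensor` is an illustrative plugin-side name and the branch bodies are placeholders for the plugin's device kernels.

enum DenseUpdateType { ADD, SUB, ASSIGN };

// One functor covers both roles: ADD/SUB for AssignAdd/AssignSub updates, and
// ASSIGN for what copyFunc used to do (a plain device copy into the variable).
void PluginUpdateTensor(TF_OpKernelContext* ctx, TF_Tensor* var_tensor,
                        TF_Tensor* value, int Op) {
  switch (Op) {
    case ADD:
      // ... launch var_tensor += value on the plugin device ...
      break;
    case SUB:
      // ... launch var_tensor -= value on the plugin device ...
      break;
    case ASSIGN:
      // ... launch a plain copy of value into var_tensor ...
      break;
  }
}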
int input_index,
int value_index,
int Op,
int isVariantType,
nit: It may be worth figuring out if we can avoid the `isVariantType` arg. This can wait for the actual PR though.
@saxenasaurabh thanks for the review! It was added to pass the Variant-type boolean to `PrepareToUpdateVariable`:
template <typename Device, typename T>
Status PrepareToUpdateVariable(OpKernelContext* ctx, Tensor* tensor,
bool copy_on_read_mode) {
if (copy_on_read_mode || !tensor->RefCountIsOne()) {
// Tensor's buffer is in use by some read, so we need to copy before
// updating.
PersistentTensor unused;
Tensor* tmp;
if (std::is_same<T, Variant>::value) {
AllocatorAttributes attr;
attr.set_on_host(true);
TF_RETURN_IF_ERROR(ctx->allocate_persistent(
tensor->dtype(), tensor->shape(), &unused, &tmp, attr));
const auto elements_in = tensor->flat<Variant>();
auto elements_out = tmp->flat<Variant>();
for (int64 i = 0; i < elements_in.size(); ++i) {
elements_out(i) = elements_in(i);
}
} else {
AllocatorAttributes attr;
attr.set_gpu_compatible(true);
attr.set_nic_compatible(true);
TF_RETURN_IF_ERROR(ctx->allocate_persistent(
tensor->dtype(), tensor->shape(), &unused, &tmp, attr));
functor::DenseUpdate<Device, T, ASSIGN> copy_functor;
copy_functor(ctx->eigen_device<Device>(), tmp->flat<T>(),
const_cast<const Tensor*>(tensor)->flat<T>());
}
*tensor = *tmp;
}
return Status::OK();
}
If we can derive this type information from the `input` or `value` tensor, then we can drop this.
I believe we could achieve this using `tensor.dtype() == DT_VARIANT`.
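A one-line sketch of that suggestion, on the core side of the C API (Var, tensor(), dtype(), and DT_VARIANT all appear in the core snippets quoted above), so callers would never have to pass isVariantType; the helper name is illustrative.

// Derive the variant-ness inside the C API implementation instead of taking it
// as an argument from the plugin.
bool VariableHoldsVariant(Var* var) {
  return var->tensor()->dtype() == DT_VARIANT;
}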
structures in Tensorflow core like TF_Mutex, TF_ResourceHandle, TF_RefCountPtr
to the plugin vendors. This will allow plugin writers the flexibility to
construct higher level optimizer operations using these lower level primitives
and will be scalable for newer operations. Option #2, is to expose all the
Do you have any near-term plans to expose any other ops? If so, option #1 may be worth paying more heed to. You already need to expose `TF_GetInputTensorFromVariable` for the training ops, which does most of what you need functionality-wise. Although, I agree the current API is much cleaner.
Passing on comments from @Jianhui-Li and @jzhoulon: They prefer option #1 for scalability. (Please feel free to elaborate here.)
@saxenasaurabh, for resource variable ops:

- Some of them fall into the DEVICE_DEFAULT category, like `ReadVariableOp`, `VariableShapeOp`, and `DestroyResourceOp`.
- ResourceGather and ResourceScatter, as discussed above, we have implemented using `GetInputTensorFromVariable` and `LockMutexesInOrder`.
- @saxenasaurabh, that's a good point. AssignUpdateVariable and AssignVariable are currently implemented as ops; we can use `TF_GetInputTensorFromVariable` and `TF_GetTrainingVariableMutex` (a helper function we use for `TF_MaybeLockVariableInputMutexesInOrder`) to implement this functionality. Depending on whether the op needs an exclusive lock, we can use the `lock` or `shared_lock` to implement it. This is part of the helper function `TF_MaybeLockVariableInputMutexesInOrder`.

This covers the majority of the resource variable operations we have encountered in the training networks.
@Jianhui-Li and @jzhoulon thanks for your feedback. I agree Option 1 is more scalable. We started with the higher-level interface for a few reasons:

- Most of the common resource variable ops and training ops exhibit a pattern that we were mostly able to capture with these APIs.
- We wanted to keep a smaller, tighter interface to keep the code maintainable, which allows us to grow it organically in the future if we see the need.
- With the upcoming new stack of MLIR/TFRT (as mentioned in the StreamExecutor RFC), we wanted to keep the design of the interface simple and high-level to represent the needed ops.
@kulinseth, @jzhoulon and I discussed and agreed that Option 2 is simpler, but we need to cover the other cases we discovered (TensorArray and UnBatchResource), which use the resource manager and are dispatched to the device. Covering these two cases with extra APIs following the Option 2 approach is more manageable than we originally thought, so we are not really against Option 2. Could you please also verify whether the Option 2 approach can cover these cases?
@Jianhui-Li thanks for getting back! That's a good point. Initially we weren't looking into `TensorArray` and `UnbatchResource`, as we didn't encounter them in performance-critical sections of the networks we were looking at. I haven't dug into the details, but from a cursory glance it seems we might need to expose `TF_TensorArray` and some methods handling the `TF_TensorArray` type.

Similarly for `UnbatchResource`, as you mentioned, we would need to expose `ResourceManager` details. Even then I am not sure it will cover all the necessary functionality, due to some of the dependencies on `AsyncOpKernel` and `AdaptiveBatchScheduler`. I feel these are bigger API changes which might need an RFC of their own.

@saxenasaurabh, @Jianhui-Li, @penpornk, @jzhoulon — what do you think?
### Engineering Impact
* The impact on binary size / startup time / build time / test times is minimal.
* The TensorFlow team will maintain this code.
cc: @rjpower @manivaradarajan FYI
This was accepted at the design review meeting today.
Design review meeting notes

@kulinseth: We started by exposing each helper function for optimizer / resource variables through the C API. Based on @saxenasaurabh's comments,

@saxenasaurabh: How do you scope the set of ops? Do you have a fallback mechanism?

@kulinseth: @jzhoulon, what models do you need

@kulinseth: Are ResourceManager and TensorArray needed?

@kulinseth: In the PR, we will break up input tensor from Variable.

@kulinseth: We recently added a helper function

@Jianhui-Li:

@kulinseth: We can send out a PR today so everybody can take a closer look.
Thanks @penpornk for the detailed notes. I will update here with the PR.
Here is @kulinseth's implementation PR: tensorflow/tensorflow#49717