RFC: Kernel C API extension for Variable ops. #387


Merged

Conversation

Contributor

@kulinseth kulinseth commented May 4, 2021


The RFC will be open for comment until Tuesday, May 18, 2021

| Status        | Proposed                                     |
| :------------ | :------------------------------------------- |
| **RFC #**     | 387                                          |
| **Author(s)** | Kulin Seth (Apple), Charles Brissart (Apple) |
| **Sponsor**   | Saurabh Saksena ([email protected])            |
| **Updated**   | 2021-05-04                                   |

The proposal extends the current [Kernel C API](https://github.com/tensorflow/community/blob/master/rfcs/20190814-kernel-and-op-registration.md) to enable plugin writers using the Modular API to add support for:

* Optimizer operations such as SGD, Adam, etc.
* Variable updates such as Assign and AssignUpdate
@penpornk
Member

penpornk commented May 5, 2021

If you are interested in attending the design review meeting for this RFC, here are the call details:

Time: Tuesday, May 18th, 2021 from 10:00-11:00am PT
Meeting link: Google Meet

Meeting participants are expected to read the RFC ahead of time as the meeting will focus on the remaining issues/questions.

TF_VariableInputLockHolder** lockHolder,
TF_Status* status);

// This interface returns out tensor which is updated corresponding to the
Contributor


It is great to see the RFC! I have a question here: for TF_GetInputTensorFromVariable, I think it needs a parameter to pass the datatype. GetInputTensorFromVariable is a template function with datatype T; it calls EnsureSparseVariableAccess, which uses T to check whether the dtype is Variant. (TF_MaybeLockVariableInputMutexesInOrder may also need to take the datatype for the same reason.)

template <typename Device, typename T>
Status GetInputTensorFromVariable(OpKernelContext* ctx, int input,
                                  bool lock_held, bool sparse, Tensor* out) {
  if (ctx->input_dtype(input) == DT_RESOURCE) {
    core::RefCountPtr<Var> var;
    TF_RETURN_IF_ERROR(LookupResource(ctx, HandleFromInput(ctx, input), &var));
    if (sparse) {
      TF_RETURN_IF_ERROR(EnsureSparseVariableAccess<Device, T>(ctx, var.get()));
      *out = *var->tensor();
      return Status::OK();
    }
    TF_RETURN_IF_ERROR(PrepareToUpdateVariable<Device, T>(
        ctx, var->tensor(), var->copy_on_read_mode.load()));
    *out = *var->tensor();
    return Status::OK();
  }
  *out = ctx->mutable_input(input, lock_held);
  return Status::OK();
}
template <typename Device, typename T>
Status EnsureSparseVariableAccess(OpKernelContext* ctx, Var* var) {
  if (var->copy_on_read_mode.load()) {
    return Status::OK();
  }
  mutex_lock ml(*var->mu());
  // Once copy-on-read mode is True the refcount is guaranteed to be 1. This can
  // also happen if there are no concurrent reads of the variable and
  // copy-on-read mode is false.
  if (var->tensor()->RefCountIsOne()) {
    var->copy_on_read_mode.store(true);
    return Status::OK();
  }
  PersistentTensor unused;
  Tensor* tmp;
  if (std::is_same<T, Variant>::value) {
    AllocatorAttributes attr;
    attr.set_on_host(true);
    TF_RETURN_IF_ERROR(ctx->allocate_persistent(
        var->tensor()->dtype(), var->tensor()->shape(), &unused, &tmp, attr));

Contributor Author


@jzhoulon Thanks for the review! That's a good catch; we were initially focusing on the non-sparse operations. We can add a bool isVariantType to check for Variant-type operations, as we did for TF_AssignUpdateVariable. That flag needs to be passed through to PrepareToUpdateVariable.
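
For illustration, one possible shape of the extended entry point is sketched below. The parameter names and ordering here are assumptions for discussion, not the final RFC signature.

TF_CAPI_EXPORT extern void TF_GetInputTensorFromVariable(
    TF_OpKernelContext* ctx,
    int input_index,
    int lock_held,
    int sparse,
    int isVariantType,   // proposed addition: lets the implementation pick the
                         // Variant vs. DenseUpdate path in
                         // EnsureSparseVariableAccess / PrepareToUpdateVariable
    TF_Tensor** out,
    TF_Status* status);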

// the input and value tensors. It also accepts the copy functor provided by
// pluggable vendor to do the copying of the tensors.
TF_CAPI_EXPORT extern void TF_AssignVariable(TF_OpKernelContext* ctx,
int input_index,
Contributor


Does this mean each C API corresponds to an op implementation? If so, does this RFC also provide C APIs for other resource-related ops, such as ReadVariableOp, TensorArrayReadOp, ScatterNdUpdateOp...?

Contributor Author

@kulinseth kulinseth May 18, 2021


Hi @jzhoulon, we looked into the other ResourceVariableOps.

  1. ReadVariableOp is already part of the DEVICE_DEFAULT implementation.
  2. TensorArrayReadOp we haven't looked at in detail. At first glance, the only device-dependent piece is LockedRead, where we can memset the tensor to 0.
  3. AssignVariable and AssignVariableUpdate are handled as implementations in the interface.
  4. ResourceGather and ResourceScatter follow the pattern of:

  void Compute(OpKernelContext* c) override {
    core::RefCountPtr<Var> v;
    OP_REQUIRES_OK(c, LookupResource(c, HandleFromInput(c, 0), &v));
    OP_REQUIRES_OK(c, EnsureSparseVariableAccess<Device, T>(c, v.get()));
    if (is_non_pod_dtype || use_exclusive_lock_) {
      mutex_lock ml(*v->mu());
      << Compute Functor performing the operation >>
    } else {
      tf_shared_lock ml(*v->mu());
      << Compute Functor performing the operation >>
    }
  }

We implemented these operations such that the RefCountPtr<Var> handling (LookupResource and EnsureSparseVariableAccess, with locking) is covered by TF_GetInputResourceVariable and TF_MaybeLockVariableInputMutexesInOrder (see the sketch after this comment). Having said that, if there are helper functions you think are needed, we can definitely expose them here.
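
A rough plugin-side sketch of that pattern follows. The exact argument lists of the proposed C APIs were still under discussion in this thread, so the signatures used here are assumptions for illustration only; just the helper names come from the RFC and the comments above.

void ResourceGatherCompute(void* kernel, TF_OpKernelContext* ctx) {
  TF_Status* status = TF_NewStatus();
  TF_VariableInputLockHolder* lock_holder = nullptr;

  // Mirrors EnsureSparseVariableAccess + mutex_lock: lock the variable
  // input(s) in a canonical order (assumed argument list).
  TF_MaybeLockVariableInputMutexesInOrder(ctx, &lock_holder, status);

  // Mirrors LookupResource + "*out = *var->tensor()": fetch the variable's
  // backing tensor (assumed argument list).
  TF_Tensor* var_tensor = nullptr;
  TF_GetInputTensorFromVariable(ctx, /*input_index=*/0, &var_tensor, status);

  if (TF_GetCode(status) == TF_OK) {
    // << plugin's device gather functor runs here on var_tensor >>
  }

  // Unlock and clean up (assumed release API).
  TF_ReleaseVariableInputLockHolder(lock_holder);
  TF_DeleteStatus(status);
}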


Hi @kulinseth. We went through the resource-related ops and found that the ops below use some device-specific functions, such as concat and split, not just memset. We have roughly put them in the table below. Will this RFC cover these ops like the variable ops and add some new high-level APIs?

| Resource        | Ops |
| --------------- | --- |
| TensorArray     | TensorArrayWrite (V2, V3), TensorArrayRead (V2, V3), TensorArrayPack, TensorArrayGather (V2, V3), TensorArrayConcat (V2, V3), TensorArrayUnpack, TensorArrayScatter (V2, V3), TensorArraySplit (V2, V3) |
| UnBatchResource | UnbatchKernel, UnbatchGradKernel |

Contributor Author

@kulinseth kulinseth May 18, 2021


Hi @yanzhang-dev, thanks for the list and feedback. We haven't looked at the TensorArray ops in detail, but it seems to me we would need more than device-specific functionality. We might need to expose a TF_TensorArray along with the necessary methods like TF_GetTensorArray. This might need a separate RFC. Similarly for UnbatchResource:

// A class encapsulating the state and logic for unbatching tensors.
//
// UnbatchResource keeps two data structures indexed by batch-key: one which has
// the continuations for all concurrent kernels which are waiting for tensors
// and another which has tensors which are waiting for their corresponding
// kernels to run. Whenever a kernel runs, we either grab its tensor if it's
// waiting already, or we insert it in the queue and then look at its tensor to
// see if it can be used to dispatch any stored continuations.
class UnbatchResource : public ResourceBase {
 public:
  explicit UnbatchResource(int32 timeout_micros)
      : timeout_micros_(timeout_micros),
        timeout_enforcer_(new serving::PeriodicFunction(
            [this] { EnforceTimeout(); }, 1000 /* 1 ms */)) {}

  ~UnbatchResource() override {
    // Tear down 'timeout_enforcer_' first, since it accesses other state in
    // this class.
    timeout_enforcer_ = nullptr;
  }
 

I agree we can probably expose TF_ResourceBase, TF_ResourceManager, and others, and potentially recreate all this logic on the plugin side.

On a separate note, I was curious about these operations: in which networks did you encounter them in performance-critical sections?

TF_Tensor *dest),
void (*updateFunc)(TF_OpKernelContext *ctx,
TF_Tensor *tensor,
TF_Tensor *value, int Op),
Contributor


I think the int Op here is not needed, since updateFunc should already carry the Op semantics, given that TF_AssignUpdateVariable is called on the plugin side.

Contributor Author


The updateFunc callback would need to distinguish between Add and Sub operations. For instance:

enum DenseUpdateType { ADD, SUB, ASSIGN };

void updateFunc(TF_OpKernelContext *ctx, TF_Tensor *tf_var_tensor, TF_Tensor *tf_value, int Op) {
  ....
  if (Op == ADD) {
    // e.g. accumulate tf_value into tf_var_tensor on the plugin device
  } else if (Op == SUB) {
    // e.g. subtract tf_value from tf_var_tensor on the plugin device
  }
}

Contributor


I mean the updateFunc can carry the semantics itself, e.g. add_func, sub_func, and when we implement the kernel we pass the corresponding func to TF_AssignUpdateVariable. Anyway, I have no strong opinion on this.

template <DenseUpdateType Op>
void AssignUpdateVariableCompute() {
  if (Op == ADD)
    TF_AssignUpdateVariable(add_func);
  else if (Op == SUB)
    TF_AssignUpdateVariable(sub_func);
}

Contributor Author


Sure, that's doable. If you prefer, we can change it.

Member


+1 to removing the op arg.

Member

@saxenasaurabh saxenasaurabh left a comment


Thanks for putting this together! Looks good to me overall.

int value_index,
int Op,
int isVariantType,
void (*copyFunc)(TF_OpKernelContext * ctx,
Member


Could you explain why we need both copyFunc and updateFunc?

Contributor Author


You are right, we can use ASSIGN in updateFunc to implement copyFunc. I will clean this up during the PR.
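
For example (a sketch only, reusing the DenseUpdateType enum from the earlier comment), the ASSIGN case of updateFunc would cover what copyFunc does today:

void updateFunc(TF_OpKernelContext *ctx, TF_Tensor *var_tensor, TF_Tensor *value, int Op) {
  if (Op == ASSIGN) {
    // plain device copy of value into var_tensor -- the copyFunc behavior
  } else if (Op == ADD) {
    // var_tensor += value
  } else if (Op == SUB) {
    // var_tensor -= value
  }
}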

int input_index,
int value_index,
int Op,
int isVariantType,
Member


nit: It may be worth figuring out if we can avoid the isVariantType arg. This can wait for the actual PR though.

Contributor Author

@kulinseth kulinseth May 18, 2021


@saxenasaurabh thanks for the review! It was added to pass the Variant-type boolean through to PrepareToUpdateVariable:

template <typename Device, typename T>
Status PrepareToUpdateVariable(OpKernelContext* ctx, Tensor* tensor,
                               bool copy_on_read_mode) {
  if (copy_on_read_mode || !tensor->RefCountIsOne()) {
    // Tensor's buffer is in use by some read, so we need to copy before
    // updating.
    PersistentTensor unused;
    Tensor* tmp;
    if (std::is_same<T, Variant>::value) {
      AllocatorAttributes attr;
      attr.set_on_host(true);
      TF_RETURN_IF_ERROR(ctx->allocate_persistent(
          tensor->dtype(), tensor->shape(), &unused, &tmp, attr));

      const auto elements_in = tensor->flat<Variant>();
      auto elements_out = tmp->flat<Variant>();
      for (int64 i = 0; i < elements_in.size(); ++i) {
        elements_out(i) = elements_in(i);
      }
    } else {
      AllocatorAttributes attr;
      attr.set_gpu_compatible(true);
      attr.set_nic_compatible(true);
      TF_RETURN_IF_ERROR(ctx->allocate_persistent(
          tensor->dtype(), tensor->shape(), &unused, &tmp, attr));
      functor::DenseUpdate<Device, T, ASSIGN> copy_functor;
      copy_functor(ctx->eigen_device<Device>(), tmp->flat<T>(),
                   const_cast<const Tensor*>(tensor)->flat<T>());
    }
    *tensor = *tmp;
  }
  return Status::OK();
}

If we can derive this type information from the input or value tensor, then we can drop this.

Member


I believe we could achieve this using tensor.dtype() == DT_VARIANT.
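
A minimal sketch of that alternative (the helper placement is an assumption; Tensor::dtype() and DT_VARIANT are existing TensorFlow core symbols): the C API implementation can derive the flag itself instead of taking an isVariantType argument.

// Hypothetical helper inside the C API implementation:
inline bool IsVariantTensor(const Tensor& t) { return t.dtype() == DT_VARIANT; }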

structures in Tensorflow core like TF_Mutex, TF_ResourceHandle, TF_RefCountPtr
to the plugin vendors. This will allow plugin writers the flexibility to
construct higher level optimizer operations using these lower level primitives
and will be scalable for newer operations. Option #2, is to expose all the
Member


Do you have any near-term plans to expose any other ops? If so, option #1 may be worth paying more heed to. You already need to expose TF_GetInputTensorFromVariable for the training ops, which does most of what you need functionality-wise. Although, I agree the current API is much cleaner.

Member


Passing on comments from @Jianhui-Li and @jzhoulon: They prefer option #1 for scalability. (Please feel free to elaborate here.)

Contributor Author


@saxenasaurabh , for the resource variable ops:

  1. Some of them fall into the DEVICE_DEFAULT category, like ReadVariableOp, VariableShapeOp, DestroyResourceOp.
  2. ResourceGather and ResourceScatter, as discussed above, we have implemented using GetInputTensorFromVariable and LockMutexesInOrder.
  3. That's a good point. AssignUpdateVariable and AssignVariable are currently implemented as ops; we can use TF_GetInputTensorFromVariable and TF_GetTrainingVariableMutex (a helper function we use for TF_MaybeLockVariableInputMutexesInOrder) to implement this functionality. Depending on whether the op needs an exclusive lock, we can use the lock or shared_lock; this is part of the helper function TF_MaybeLockVariableInputMutexesInOrder.

This covers the majority of the resource variable operations we have encountered in training networks.

Contributor Author


@Jianhui-Li and @jzhoulon thanks for your feedback. I agree Option 1 is more scalable. We started with the higher-level interface for a few reasons:

  1. Most of the common resource variable ops and training ops follow an implementation pattern that we were largely able to capture with these APIs.
  2. We wanted to keep a small, tight interface to keep the code maintainable, which also lets us grow it organically in the future if we see the need.
  3. With the upcoming MLIR/TFRT stack (as mentioned in the StreamExecutor RFC), we wanted to keep the interface design simple and high-level to represent the needed ops.


@kulinseth, @jzhoulon and I discussed and agreed that Option 2 is simpler, but we need to cover the other cases we discovered (TensorArray and UnBatchResource), which use the resource manager and are dispatched to the device. Covering these two cases with extra APIs following the Option 2 approach is more manageable than we originally thought, so we are not really against Option 2. Could you please also verify whether the Option 2 approach can cover these cases?

Contributor Author


@Jianhui-Li thanks for getting back! That's a good point; initially we weren't looking into TensorArray and UnbatchResource, as we didn't encounter them in performance-critical sections of the networks we were looking at. I haven't dug into the details, but from a cursory glance it seems we might need to expose TF_TensorArray and some methods handling the TF_TensorArray type.

Similarly for UnbatchResource, as you mentioned, we would need to expose ResourceManager details. Even then, I am not sure it would cover all the necessary functionality, due to dependencies on AsyncOpKernel and AdaptiveBatchScheduler. I feel these are bigger API changes that might need an RFC of their own.

@saxenasaurabh , @Jianhui-Li , @penpornk , @jzhoulon: what do you think?


### Engineering Impact
* The impact to binary size / startup time / build time / test times is minimal
* The TensorFlow team will maintain this code.
Member


@ematejska

This was accepted at the design review meeting today.

@ematejska ematejska added the RFC: Accepted RFC Design Document: Accepted by Review label May 18, 2021
@ematejska ematejska merged commit 4fc1c6b into tensorflow:master May 18, 2021
@penpornk
Member

Design review meeting notes

@kulinseth: We started by exposing each helper function for optimizer / Resource variables through the C API. Based on @saxenasaurabh's comments, GetInputTensorFromVariable can be used for Resource variables too. We'll refine them in the actual implementation PR.

@saxenasaurabh: How do you scope the set of ops? Do you have a fallback mechanism?
@kulinseth: We looked through 14-15 networks and dumped out all the ops. For TensorArray, we fall back to CPU.

@kulinseth: @jzhoulon, what models do you need TensorArray and UnbatchResource for?
@jzhoulon / @Jianhui-Li: No specific models. Just for ops coverage / completeness. We looked at TensorArray, Variables, ResourceMgr, BatchResource-related ops. The API you exposed is cleaner.

@kulinseth: Are ResourceManager and TensorArray needed?

  • @saxenasaurabh: For Resource variable alone, the low-level API makes sense.
    • TensorArray ops are used mostly in TF 1. We use TensorList* ops in v2.
    • It depends on what models you care about.
  • @kulinseth: We have been looking at v2 networks.
    • Rely on DEVICE_DEFAULT for TensorList*.

@kulinseth: In the PR, we will break up input tensor from Variable.

  • @saxenasaurabh: Longer term exposing low level mutex / refcounting may be more beneficial.
    • Keep high-level API under c_api_experimental for now.
    • Expand set of ops -- revisit if more low-level stuff is easier to maintain.
    • Doesn't need to support TensorArray and ResourceMgr in this RFC. Can follow-up.

@kulinseth: We recently added a helper function TF_GetInputByName to the RFC.

@Jianhui-Li: TensorArray is replaced by TensorList* in v2. What about UnbatchResource?

  • @kulinseth: Also, what about async stuff in general?
  • @saxenasaurabh: Not familiar with this (UnbatchResource). Will have to check.
  • @reedwm: I don't think v2 usually uses batched ops.
  • @kulinseth: Any other ops with async op kernels?
  • @reedwm: Some GPU ops perhaps incorrectly use async op kernel to do some work / transfer.
  • @saxenasaurabh: How do you handle control flow? Functional control flow ops are async in TF for CPU.
  • @kulinseth: We rely on default TF implementation and run the lowered Switch, Merge etc ops.

@kulinseth: We can send out a PR today so everybody can take a closer look.

@kulinseth
Contributor Author

Thanks @penpornk for the detailed notes. I will update here with the PR.

@penpornk
Member

Here is @kulinseth's implementation PR: tensorflow/tensorflow#49717
