RFC: Kernel C API extension for Variable ops. #387


Merged

Conversation

Contributor

@kulinseth kulinseth commented May 4, 2021


The RFC will be open for comment until Tuesday, May 18, 2021

| Status        | Proposed                                     |
| :------------ | :------------------------------------------- |
| **RFC #**     | 387                                          |
| **Author(s)** | Kulin Seth (Apple), Charles Brissart (Apple) |
| **Sponsor**   | Saurabh Saksena ([email protected])            |
| **Updated**   | 2021-05-04                                   |

The proposal extends the current [Kernel C API](https://github.com/tensorflow/community/blob/master/rfcs/20190814-kernel-and-op-registration.md) to enable plugin writers using the Modular API to add support for:

* Optimizer operations such as SGD, Adam, etc.
* Variable updates such as Assign and AssignUpdate
@penpornk
Member

penpornk commented May 5, 2021

If you are interested in attending the design review meeting for this RFC, here are the call details:

Time: Tuesday, May 18th, 2021 from 10:00-11:00am PT
Meeting link: Google Meet

Meeting participants are expected to read the RFC ahead of time as the meeting will focus on the remaining issues/questions.

TF_VariableInputLockHolder** lockHolder,
TF_Status* status);

// This interface returns out tensor which is updated corresponding to the
Contributor


It is great to see the RFC! I have a question here: for TF_GetInputTensorFromVariable, I think it needs a parameter to pass the datatype. GetInputTensorFromVariable is a template function with datatype T; it calls EnsureSparseVariableAccess, which uses T to check whether the dtype is Variant. (TF_MaybeLockVariableInputMutexesInOrder may also need to take the datatype for the same reason.)

template <typename Device, typename T>
Status GetInputTensorFromVariable(OpKernelContext* ctx, int input,
                                  bool lock_held, bool sparse, Tensor* out) {
  if (ctx->input_dtype(input) == DT_RESOURCE) {
    core::RefCountPtr<Var> var;
    TF_RETURN_IF_ERROR(LookupResource(ctx, HandleFromInput(ctx, input), &var));
    if (sparse) {
      TF_RETURN_IF_ERROR(EnsureSparseVariableAccess<Device, T>(ctx, var.get()));
      *out = *var->tensor();
      return Status::OK();
    }
    TF_RETURN_IF_ERROR(PrepareToUpdateVariable<Device, T>(
        ctx, var->tensor(), var->copy_on_read_mode.load()));
    *out = *var->tensor();
    return Status::OK();
  }
  *out = ctx->mutable_input(input, lock_held);
  return Status::OK();
}
template <typename Device, typename T>
Status EnsureSparseVariableAccess(OpKernelContext* ctx, Var* var) {
  if (var->copy_on_read_mode.load()) {
    return Status::OK();
  }
  mutex_lock ml(*var->mu());
  // Once copy-on-read mode is True the refcount is guaranteed to be 1. This can
  // also happen if there are no concurrent reads of the variable and
  // copy-on-read mode is false.
  if (var->tensor()->RefCountIsOne()) {
    var->copy_on_read_mode.store(true);
    return Status::OK();
  }
  PersistentTensor unused;
  Tensor* tmp;
  if (std::is_same<T, Variant>::value) {
    AllocatorAttributes attr;
    attr.set_on_host(true);
    TF_RETURN_IF_ERROR(ctx->allocate_persistent(
        var->tensor()->dtype(), var->tensor()->shape(), &unused, &tmp, attr));

Contributor Author


@jzhoulon Thanks for the review! That's a good catch; we were initially focusing on the non-sparse operations. We can add a bool isVariantType to check for Variant-type operations, as we did for TF_AssignUpdateVariable. That flag needs to be passed through to PrepareToUpdateVariable.
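
For illustration, one possible shape of the extended entry point is sketched below. The parameter names and ordering here are assumptions for discussion, not the final RFC signature.

TF_CAPI_EXPORT extern void TF_GetInputTensorFromVariable(
    TF_OpKernelContext* ctx,
    int input_index,
    int lock_held,
    int sparse,
    int isVariantType,   // proposed addition: lets the implementation pick the
                         // Variant vs. DenseUpdate path in
                         // EnsureSparseVariableAccess / PrepareToUpdateVariable
    TF_Tensor** out,
    TF_Status* status);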

// the input and value tensors. It also accepts the copy functor provided by
// pluggable vendor to do the copying of the tensors.
TF_CAPI_EXPORT extern void TF_AssignVariable(TF_OpKernelContext* ctx,
int input_index,
Contributor


Does this mean each C API corresponds to an op implementation? If so, does this RFC also provide C APIs for other resource-related ops, such as ReadVariableOp, TensorArrayReadOp, ScatterNdUpdateOp...?

Contributor Author

@kulinseth kulinseth May 18, 2021


Hi @jzhoulon, we looked into the other ResourceVariableOps.

  1. ReadVariableOp is already part of the DEVICE_DEFAULT implementation.
  2. TensorArrayReadOp we haven't looked at in detail. At first glance, the only device-dependent piece is LockedRead, where we can memset the tensor to 0.
  3. AssignVariable and AssignVariableUpdate are handled as implementations in the interface.
  4. ResourceGather and ResourceScatter follow the pattern of:

  void Compute(OpKernelContext* c) override {
    core::RefCountPtr<Var> v;
    OP_REQUIRES_OK(c, LookupResource(c, HandleFromInput(c, 0), &v));
    OP_REQUIRES_OK(c, EnsureSparseVariableAccess<Device, T>(c, v.get()));
    if (is_non_pod_dtype || use_exclusive_lock_) {
      mutex_lock ml(*v->mu());
      << Compute Functor performing the operation >>
    } else {
      tf_shared_lock ml(*v->mu());
      << Compute Functor performing the operation >>
    }
  }

We implemented these operations such that the RefCountPtr<Var> handling (LookupResource and EnsureSparseVariableAccess, with locking) is covered by TF_GetInputResourceVariable and TF_MaybeLockVariableInputMutexesInOrder (see the sketch after this comment). Having said that, if there are helper functions you think are needed, we can definitely expose them here.
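
A rough plugin-side sketch of that pattern follows. The exact argument lists of the proposed C APIs were still under discussion in this thread, so the signatures used here are assumptions for illustration only; just the helper names come from the RFC and the comments above.

void ResourceGatherCompute(void* kernel, TF_OpKernelContext* ctx) {
  TF_Status* status = TF_NewStatus();
  TF_VariableInputLockHolder* lock_holder = nullptr;

  // Mirrors EnsureSparseVariableAccess + mutex_lock: lock the variable
  // input(s) in a canonical order (assumed argument list).
  TF_MaybeLockVariableInputMutexesInOrder(ctx, &lock_holder, status);

  // Mirrors LookupResource + "*out = *var->tensor()": fetch the variable's
  // backing tensor (assumed argument list).
  TF_Tensor* var_tensor = nullptr;
  TF_GetInputTensorFromVariable(ctx, /*input_index=*/0, &var_tensor, status);

  if (TF_GetCode(status) == TF_OK) {
    // << plugin's device gather functor runs here on var_tensor >>
  }

  // Unlock and clean up (assumed release API).
  TF_ReleaseVariableInputLockHolder(lock_holder);
  TF_DeleteStatus(status);
}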


Hi @kulinseth. We went through the resource-related ops and found that the ops below use some device-specific functions, such as concat and split, not just memset. We have roughly put them in the table below. Will this RFC cover these ops like the variable ops and add some new high-level APIs?

| Resource        | Ops |
| --------------- | --- |
| TensorArray     | TensorArrayWrite (V2, V3), TensorArrayRead (V2, V3), TensorArrayPack, TensorArrayGather (V2, V3), TensorArrayConcat (V2, V3), TensorArrayUnpack, TensorArrayScatter (V2, V3), TensorArraySplit (V2, V3) |
| UnBatchResource | UnbatchKernel, UnbatchGradKernel |

Contributor Author

@kulinseth kulinseth May 18, 2021


Hi @yanzhang-dev, thanks for the list and feedback. We haven't looked at the TensorArray ops in detail, but it seems to me we would need more than device-specific functionality. We might need to expose a TF_TensorArray along with the necessary methods like TF_GetTensorArray. This might need a separate RFC. Similarly for UnbatchResource:

// A class encapsulating the state and logic for unbatching tensors.
//
// UnbatchResource keeps two data structures indexed by batch-key: one which has
// the continuations for all concurrent kernels which are waiting for tensors
// and another which has tensors which are waiting for their corresponding
// kernels to run. Whenever a kernel runs, we either grab its tensor if it's
// waiting already, or we insert it in the queue and then look at its tensor to
// see if it can be used to dispatch any stored continuations.
class UnbatchResource : public ResourceBase {
 public:
  explicit UnbatchResource(int32 timeout_micros)
      : timeout_micros_(timeout_micros),
        timeout_enforcer_(new serving::PeriodicFunction(
            [this] { EnforceTimeout(); }, 1000 /* 1 ms */)) {}

  ~UnbatchResource() override {
    // Tear down 'timeout_enforcer_' first, since it accesses other state in
    // this class.
    timeout_enforcer_ = nullptr;
  }
 

I agree we can probably expose TF_ResourceBase, TF_ResourceManager, and others, and potentially recreate all this logic on the plugin side.

On a separate note, I was curious about these operations: in which networks did you encounter them in performance-critical sections?

TF_Tensor *dest),
void (*updateFunc)(TF_OpKernelContext *ctx,
TF_Tensor *tensor,
TF_Tensor *value, int Op),
Contributor


I think the int Op here is not needed, since updateFunc should already carry the Op semantics, given that TF_AssignUpdateVariable is called on the plugin side.

Contributor Author


The updateFunc callback would need to distinguish between Add and Sub operations. For instance:

enum DenseUpdateType { ADD, SUB, ASSIGN };

void updateFunc(TF_OpKernelContext *ctx, TF_Tensor *tf_var_tensor, TF_Tensor *tf_value, int Op) {
  ....
  if (Op == ADD) {
    // e.g. accumulate tf_value into tf_var_tensor on the plugin device
  } else if (Op == SUB) {
    // e.g. subtract tf_value from tf_var_tensor on the plugin device
  }
}

Contributor


I mean the updateFunc can carry the semantics itself, e.g. add_func, sub_func, and when we implement the kernel we pass the corresponding func to TF_AssignUpdateVariable. Anyway, I have no strong opinion on this.

template <DenseUpdateType Op>
void AssignUpdateVariableCompute() {
  if (Op == ADD)
    TF_AssignUpdateVariable(add_func);
  else if (Op == SUB)
    TF_AssignUpdateVariable(sub_func);
}

Contributor Author


Sure, that's doable. If you prefer, we can change it.

Member


+1 to removing the op arg.

Member

@saxenasaurabh saxenasaurabh left a comment


Thanks for putting this together! Looks good to me overall.

int value_index,
int Op,
int isVariantType,
void (*copyFunc)(TF_OpKernelContext * ctx,
Member


Could you explain why we need both copyFunc and updateFunc?

Contributor Author


You are right, we can use ASSIGN in updateFunc to implement copyFunc. I will clean this up during the PR.
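
For example (a sketch only, reusing the DenseUpdateType enum from the earlier comment), the ASSIGN case of updateFunc would cover what copyFunc does today:

void updateFunc(TF_OpKernelContext *ctx, TF_Tensor *var_tensor, TF_Tensor *value, int Op) {
  if (Op == ASSIGN) {
    // plain device copy of value into var_tensor -- the copyFunc behavior
  } else if (Op == ADD) {
    // var_tensor += value
  } else if (Op == SUB) {
    // var_tensor -= value
  }
}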

int input_index,
int value_index,
int Op,
int isVariantType,
Member


nit: It may be worth figuring out if we can avoid the isVariantType arg. This can wait for the actual PR though.

Contributor Author

@kulinseth kulinseth May 18, 2021


@saxenasaurabh thanks for the review! It was added to pass the Variant-type boolean through to PrepareToUpdateVariable:

template <typename Device, typename T>
Status PrepareToUpdateVariable(OpKernelContext* ctx, Tensor* tensor,
                               bool copy_on_read_mode) {
  if (copy_on_read_mode || !tensor->RefCountIsOne()) {
    // Tensor's buffer is in use by some read, so we need to copy before
    // updating.
    PersistentTensor unused;
    Tensor* tmp;
    if (std::is_same<T, Variant>::value) {
      AllocatorAttributes attr;
      attr.set_on_host(true);
      TF_RETURN_IF_ERROR(ctx->allocate_persistent(
          tensor->dtype(), tensor->shape(), &unused, &tmp, attr));

      const auto elements_in = tensor->flat<Variant>();
      auto elements_out = tmp->flat<Variant>();
      for (int64 i = 0; i < elements_in.size(); ++i) {
        elements_out(i) = elements_in(i);
      }
    } else {
      AllocatorAttributes attr;
      attr.set_gpu_compatible(true);
      attr.set_nic_compatible(true);
      TF_RETURN_IF_ERROR(ctx->allocate_persistent(
          tensor->dtype(), tensor->shape(), &unused, &tmp, attr));
      functor::DenseUpdate<Device, T, ASSIGN> copy_functor;
      copy_functor(ctx->eigen_device<Device>(), tmp->flat<T>(),
                   const_cast<const Tensor*>(tensor)->flat<T>());
    }
    *tensor = *tmp;
  }
  return Status::OK();
}

If we can derive this type information from the input or value tensor, then we can drop this.

Member


I believe we could achieve this using tensor.dtype() == DT_VARIANT.
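
A minimal sketch of that alternative (the helper placement is an assumption; Tensor::dtype() and DT_VARIANT are existing TensorFlow core symbols): the C API implementation can derive the flag itself instead of taking an isVariantType argument.

// Hypothetical helper inside the C API implementation:
inline bool IsVariantTensor(const Tensor& t) { return t.dtype() == DT_VARIANT; }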

structures in Tensorflow core like TF_Mutex, TF_ResourceHandle, TF_RefCountPtr
to the plugin vendors. This will allow plugin writers the flexibility to
construct higher level optimizer operations using these lower level primitives
and will be scalable for newer operations. Option #2, is to expose all the
Member


Do you have any near-term plans to expose any other ops? If so, option #1 may be worth paying more heed to. You already need to expose TF_GetInputTensorFromVariable for the training ops, which does most of what you need functionality-wise. Although, I agree the current API is much cleaner.

Member


Passing on comments from @Jianhui-Li and @jzhoulon: They prefer option #1 for scalability. (Please feel free to elaborate here.)

Contributor Author


@saxenasaurabh , for the resource variable ops:

  1. Some of them fall into the DEVICE_DEFAULT category, like ReadVariableOp, VariableShapeOp, DestroyResourceOp.
  2. ResourceGather and ResourceScatter, as discussed above, we have implemented using GetInputTensorFromVariable and LockMutexesInOrder.
  3. That's a good point. AssignUpdateVariable and AssignVariable are currently implemented as ops; we can use TF_GetInputTensorFromVariable and TF_GetTrainingVariableMutex (a helper function we use for TF_MaybeLockVariableInputMutexesInOrder) to implement this functionality. Depending on whether the op needs an exclusive lock, we can use the lock or shared_lock; this is part of the helper function TF_MaybeLockVariableInputMutexesInOrder.

This covers the majority of the resource variable operations we have encountered in training networks.

Contributor Author


@Jianhui-Li and @jzhoulon thanks for your feedback. I agree Option 1 is more scalable. We started with the higher-level interface for a few reasons:

  1. Most of the common resource variable ops and training ops follow an implementation pattern that we were largely able to capture with these APIs.
  2. We wanted to keep a small, tight interface to keep the code maintainable, which also lets us grow it organically in the future if we see the need.
  3. With the upcoming MLIR/TFRT stack (as mentioned in the StreamExecutor RFC), we wanted to keep the interface design simple and high-level to represent the needed ops.


@kulinseth, @jzhoulon and I discussed and agreed that Option 2 is simpler, but we need to cover the other cases we discovered (TensorArray and UnBatchResource), which use the resource manager and are dispatched to the device. Covering these two cases with extra APIs following the Option 2 approach is more manageable than we originally thought, so we are not really against Option 2. Could you please also verify whether the Option 2 approach can cover these cases?

Contributor Author


@Jianhui-Li thanks for getting back! That's a good point; initially we weren't looking into TensorArray and UnbatchResource, as we didn't encounter them in performance-critical sections of the networks we were looking at. I haven't dug into the details, but from a cursory glance it seems we might need to expose TF_TensorArray and some methods handling the TF_TensorArray type.

Similarly for UnbatchResource, as you mentioned, we would need to expose ResourceManager details. Even then, I am not sure it would cover all the necessary functionality, due to dependencies on AsyncOpKernel and AdaptiveBatchScheduler. I feel these are bigger API changes that might need an RFC of their own.

@saxenasaurabh , @Jianhui-Li , @penpornk , @jzhoulon: what do you think?


### Engineering Impact
* The impact to binary size / startup time / build time / test times is minimal
* The TensorFlow team will maintain this code.
Member


@ematejska

This was accepted at the design review meeting today.

@ematejska ematejska added the RFC: Accepted RFC Design Document: Accepted by Review label May 18, 2021
@ematejska ematejska merged commit 4fc1c6b into tensorflow:master May 18, 2021
@penpornk
Member

Design review meeting notes

@kulinseth: We started by exposing each helper function for optimizer / Resource variables through the C API. Based on @saxenasaurabh's comments, GetInputTensorFromVariable can be used for Resource variables too. We'll refine them in the actual implementation PR.

@saxenasaurabh: How do you scope the set of ops? Do you have a fallback mechanism?
@kulinseth: We looked through 14-15 networks and dumped out all the ops. For TensorArray, we fall back to CPU.

@kulinseth: @jzhoulon, what models do you need TensorArray and UnbatchResource for?
@jzhoulon / @Jianhui-Li: No specific models. Just for ops coverage / completeness. We looked at TensorArray, Variables, ResourceMgr, BatchResource-related ops. The API you exposed is cleaner.

@kulinseth: Are ResourceManager and TensorArray needed?

  • @saxenasaurabh: For Resource variable alone, the low-level API makes sense.
    • TensorArray ops are used mostly in TF 1. We use TensorList* ops in v2.
    • It depends on what models you care about.
  • @kulinseth: We have been looking at v2 networks.
    • Rely on DEVICE_DEFAULT for TensorList*.

@kulinseth: In the PR, we will break up input tensor from Variable.

  • @saxenasaurabh: Longer term exposing low level mutex / refcounting may be more beneficial.
    • Keep high-level API under c_api_experimental for now.
    • Expand set of ops -- revisit if more low-level stuff is easier to maintain.
    • Doesn't need to support TensorArray and ResourceMgr in this RFC. Can follow-up.

@kulinseth: We recently added a helper function TF_GetInputByName to the RFC.

@Jianhui-Li: TensorArray is replaced by TensorList* in v2. What about UnbatchResource?

  • @kulinseth: Also, what about async stuff in general?
  • @saxenasaurabh: Not familiar with this (UnbatchResource). Will have to check.
  • @reedwm: I don't think v2 usually uses batched ops.
  • @kulinseth: Any other ops with async op kernels?
  • @reedwm: Some GPU ops perhaps incorrectly use async op kernel to do some work / transfer.
  • @saxenasaurabh: How do you handle control flow? Functional control flow ops are async in TF for CPU.
  • @kulinseth: We rely on default TF implementation and run the lowered Switch, Merge etc ops.

@kulinseth: We can send out a PR today so everybody can take a closer look.

@kulinseth
Contributor Author

Thanks @penpornk for the detailed notes. I will update here with the PR.

@penpornk
Member

Here is @kulinseth's implementation PR: tensorflow/tensorflow#49717
