RFC: TensorFlow Extension Types #269


Merged: 13 commits merged into tensorflow:master on Aug 13, 2020

Conversation

edloper (Contributor) commented Jul 21, 2020

Status: Proposed
Authors: Edward Loper ([email protected])
Sponsor: Alex Passos ([email protected])
Updated: 2020-07-21

Objective

This RFC proposes a protocol that can be used to define user-defined
object-oriented Python types
that are supported by TensorFlow APIs.

* `DimensionAlignment`: Encodes a correspondence between two related dimensions
(e.g., between a `word` dimension and a `speaker` dimension).

**Tensor-like types:**

Another obvious example would be some form of distributed tensor, like those from Mesh-TensorFlow.

Comment on lines +87 to +88
* **Keras**: User-defined types can be used as inputs and outputs for Keras
`Models` and `Layers`.

How would a library like Keras support overrides of TensorFlow operations?

There are basically two alternatives:

  1. It could allow the behavior of Keras objects (e.g., Layer instances) to be overridden explicitly via some mechanism like __tf_dispatch__.
  2. It could allow the behavior of Keras objects to be overridden implicitly based on their underlying implementation in terms of TensorFlow operations.

The main challenge with (1) is that objects are (in general) stateful, so the override mechanism described here for functions would not suffice.

In contrast, (2) seems initially appealing: Keras is defined on arbitrary extension types "for free". The problem is that Keras' implementation in terms of lower-level TensorFlow functions is then exposed as part of its public API: if the underlying implementation of a Keras layer changes, it could break extension types, which might not implement a function, or worse might implement it in an incompatible way.
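To make the risk in (2) concrete, here is a minimal sketch; the `Scale` layer and the masked-tensor type it mentions are hypothetical:

```python
import tensorflow as tf

# Hypothetical layer written purely in terms of dispatchable TF ops.
class Scale(tf.keras.layers.Layer):
  def call(self, x):
    # If an extension type (e.g. a MaskedTensor) overrides tf.multiply via
    # dispatch, this layer supports it "for free". If a later refactor
    # rewrites the layer using an op the type does not override, that support
    # silently breaks -- the layer's internal ops have become de facto API.
    return tf.multiply(x, 2.0)
```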

edloper (Contributor Author) replied:

This is an excellent question. I would lean towards approach (2), but you're right to point out that this adds an element of design risk when using a Keras layer with an extension type. Perhaps we can start with approach (2), but give appropriate warnings/disclaimers about the problem; and evaluate over the next few months whether this is a significant practical problem or not.


To be clear, there are at least a few other options:

  1. Keras doesn't (yet) support TF extension types (e.g., by always explicitly converting inputs with tf.convert_to_tensor).
  2. Keras only supports TF extension types if an experimental flag is explicitly set. This flag might even be permanently required, if the Keras team doesn't want their implementation details to be considered public API.

I don't think shipping (2) and re-evaluating in a few months is a great option, because the costs of expanded API surface take a long time to become self-evident (i.e., at least a few major releases for users to start depending on a behavior and complain when it breaks). The point at which we notice that it's a problem could well be too late.

A contributor commented:

There are also examples of this object-oriented dispatch question in the core TF APIs themselves (e.g., the various tf.linalg.LinearOperator APIs).

The tricky thing is that these may not provide any canonical way to serialize/deserialize traced dispatched operations, because they can't be mapped to/from strings directly at the top-level API.


# We support ops that take tf.Tensor and SimpleMaskedTensor arguments. We
# don't support any other dispatchable argument types (such as tf.RaggedTensor).
__tf_dispatch_types__ = (tf.Tensor, SimpleMaskedTensor)

What about builtin numeric types, like int or float? Would these need to be included explicitly in __tf_dispatch_types__?

edloper (Contributor Author) replied:

In the current design, arguments that don't define __tf_dispatch__ will be converted to Tensor before dispatch is called. So ints, floats, np.arrays, etc., will get converted to Tensor.
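A minimal sketch of that conversion rule, for illustration only (the helper name is hypothetical and this is not the actual dispatch implementation):

```python
import numpy as np
import tensorflow as tf

def _prepare_for_dispatch(args):
  """Illustrates the conversion rule described above (not the real implementation)."""
  prepared = []
  for arg in args:
    if hasattr(type(arg), '__tf_dispatch__'):
      prepared.append(arg)                        # extension type: kept as-is
    else:
      prepared.append(tf.convert_to_tensor(arg))  # int, float, np.ndarray, ...
  return prepared

print(_prepare_for_dispatch([3, 2.5, np.ones(2)]))
```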

leandro-gracia-gil left a comment:

Thank you very much for this RFC. I love the ideas suggested here.

* **Tensorflow hub**: User-defined types can be used as inputs and outputs for
`tf.hub` modules.
* **SavedModel**: User-defined types can be used as inputs and outputs for
`SavedModels`.


How would this RFC affect the use of exported signatures from loaded saved models?

Currently, the behavior for structured data seems to be pretty poor. Input and output nested structures are flattened (not always automatically, sometimes you need to do it yourself) and need to be accessed through keyword arguments by guessing the automatically generated tensor names that TF assigns to them. Not only that, but the naming used is not even consistent between inputs and outputs.

I explained these issues in more detail here and proposed an improvement, though this was only for nested structures. I imagine that supporting Python-based extension types in loaded signatures is either a non-goal or needs C++ support too, but perhaps this is a good time to at least provide an interface that works nicely with tf.nest, so that it's possible to build the rest on top of it if needed.

(Note: you might want to also check my other related comment about the name attribute, since I use a very similar use case as base for my examples.)

edloper (Contributor Author) replied:

SavedModel.signatures intentionally flattens nested structures, because it needs to interoperate with non-Python interfaces (e.g. c++). But if you just call the restored function on the SavedModel directly (rather than via the signatures dictionary), then you'll get proper handling of both nested values and extension types. I.e., in tensorflow/tensorflow#40501, if you change loaded_model.signatures['test'](inputs) to loaded_model.test(inputs), then it will handle both nested values and composite tensors (aka extension types) appropriately.

I do have a preliminary design for ExtensionTypes and TypeSpecs in c++ (there's a short blurb in the "Future Work" section), but I'm not including it in this RFC since there are still some details to work out.
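For reference, a sketch of the two call paths (using the hypothetical `test` signature and `Pair` inputs from the SavedModel example discussed elsewhere in this thread; the flattened keyword names are the auto-generated ones shown there):

```python
loaded_model = tf.saved_model.load('test_model')

# Via the signatures dict: inputs must be passed as flattened, auto-named
# keyword tensors, and outputs come back flattened as well.
flat = loaded_model.signatures['test'](inputs=tf.constant(1),
                                       inputs_1=tf.constant(2))

# Calling the restored tf.function directly: nested structures and extension
# types (composite tensors) are packed/unpacked using the saved TypeSpecs.
structured = loaded_model.test(Pair(tf.constant(1), tf.constant(2)))
```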


Understood. Thanks!

type specifications are generally not named. However, feedback on this question
is welcome, especially when grounded in concrete use cases. (Does
`TensorSpec.name` get used for other things than concrete signature argument
naming?)


I also have the feeling that they should not use a name attribute. Rather, it seems to me like name is not a good solution for the problem it's trying to solve, and that perhaps we should use this RFC as inspiration to find a better one.

Let me show a few examples of why name can make things quite confusing for structured data. These examples are about exported signatures in SavedModel, how they change before being saved and after being loaded, and how name affects the result. I understand that part of the problem here is probably in how SavedModel handles structured data and that name can have other uses. Still, even if this is not necessarily generalizable, I think it can help as an example.

For this, let's define a simple named tuple named Pair and imagine a hypothetical model function called test that gets one as input.

```python
from collections import namedtuple

Pair = namedtuple('Pair', ('x', 'y'))
```

Let's start with a simple case where we export a signature with no names in its specs. We name our input pair inputs.

```python
# Define specs, export, and load.
specs = Pair(tf.TensorSpec([], tf.int32), tf.TensorSpec([], tf.int32))
signatures = {'test': model.test.get_concrete_function(inputs=specs)}
tf.saved_model.save(model, 'test_model', signatures=signatures)
loaded_model = tf.saved_model.load('test_model')

# The inputs structure we're supposed to be exporting.
print(signatures['test'].structured_input_signature)
# ((Pair(x=TensorSpec(shape=(), dtype=tf.int32, name='inputs/x'), y=TensorSpec(shape=(), dtype=tf.int32, name='inputs/y')),), {})

# The inputs structure from the loaded model.
print(loaded_model.signatures['test'].structured_input_signature)
# ((), {'inputs_1': TensorSpec(shape=(), dtype=tf.int32, name='inputs_1'), 'inputs': TensorSpec(shape=(), dtype=tf.int32, name='inputs')})
```

The first issue we get is that the use of name is inconsistent between signatures in the original and in the loaded model. Note how one uses the argument name inputs plus the structure fields x and y, while the other just uses the argument name for all its flattened contents and disambiguates repetitions. This, combined with the fact that positional arguments have been converted to keyword ones and that keyword argument dicts have no reliable order, makes recovering structured data trickier and more error-prone.

We can then try using the name attribute to help sort this out. Let's give a name to only one of the tensor specs.

```python
specs = Pair(tf.TensorSpec([], tf.int32, name='x'), tf.TensorSpec([], tf.int32))
# [same export, save and load as before] ...

print(signatures['test'].structured_input_signature)
# ((Pair(x=TensorSpec(shape=(), dtype=tf.int32, name='x'), y=TensorSpec(shape=(), dtype=tf.int32, name='inputs/y')),), {})

print(loaded_model.signatures['test'].structured_input_signature)
# ((), {'inputs': TensorSpec(shape=(), dtype=tf.int32, name='inputs'), 'x': TensorSpec(shape=(), dtype=tf.int32, name='x')})
```

Now the presence of a name prevents using the input argument name inputs for the first field. However, it also drops the structured inputs/x in favor of just x in the original model. In the loaded one it sort of helps, because automatic disambiguation is no longer needed, yet the other, non-named spec item gets the argument name. This also means that partially named specs will change the automatically generated, disambiguated names of the rest of the specs in a nested structure.

Finally, let's show what happens if we have both names.

```python
specs = Pair(tf.TensorSpec([], tf.int32, name='x'), tf.TensorSpec([], tf.int32, name='y'))
# [same export, save and load as before] ...

print(signatures['test'].structured_input_signature)
# ((Pair(x=TensorSpec(shape=(), dtype=tf.int32, name='x'), y=TensorSpec(shape=(), dtype=tf.int32, name='y')),), {})

print(loaded_model.signatures['test'].structured_input_signature)
# ((), {'x': TensorSpec(shape=(), dtype=tf.int32, name='x'), 'y': TensorSpec(shape=(), dtype=tf.int32, name='y')})
```

With both names we no longer have disambiguation. However, note that if we were to have a Pair nesting two other Pair structures, disambiguation would kick in again because x and y would repeat. This is because name is lacking structural information.

So, my point with these examples is just to show some of the cases where use of the name attribute can get quite messy. Perhaps this is a good time to think about better alternatives that could remove the need for such an argument.

* The CompositeTensor base class is replaced with an `ExtensionType` protocol.

* `CompositeTensor._type_spec` is renamed to `ExtensionType.__tf_type_spec__`,
and is changed from an abstractproperty to an abstractmethod.
leandro-gracia-gil commented Jul 23, 2020:

Will regular tensors, apart from other tensor-like and composite tensor objects, also use the ExtensionType protocol? I'm asking because it would allow doing things like:

```python
specs = tf.nest.map_structure(lambda t: t.__tf_type_spec__(), some_nested_structure)
```

This can help when doing things such as building function output signatures for tf.map_fn, and in other cases where we want to easily build structured nested specs from data.

Edit: reformulated the question to make clearer what I mean.

edloper (Contributor Author) replied:

Yes. Tensor will define __tf_type_spec__, and ExtensionType is replacing CompositeTensor.
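For illustration, a sketch of the uniform usage this enables; note that `__tf_type_spec__` is the method name proposed in this RFC, not current TF API:

```python
import tensorflow as tf

# Under the proposal, tf.Tensor and extension types (e.g. RaggedTensor) all
# expose __tf_type_spec__, so specs for an arbitrary nested structure can be
# built uniformly:
nested = {'ids': tf.constant([1, 2, 3]),
          'ragged': tf.ragged.constant([[1, 2], [3]])}
specs = tf.nest.map_structure(lambda t: t.__tf_type_spec__(), nested)
# -> {'ids': TensorSpec(...), 'ragged': RaggedTensorSpec(...)}
```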

`TypeSpecs` using `tf.register_type_spec`.

```python
def register_type_spec(type_spec_subclass, name=None):
```

This would be neat to have as a class decorator. It would link the otherwise decoupled declaration and subsequent registration of the class. (It could also be used to build a centralized list of all declarations at compile/build time in a relatively easy way, but I'm not sure about that.)

edloper (Contributor Author) replied:

Agreed -- I think this should be a decorator. (Since the common case is that name is None, the only thing we really need to do to make it a class decorator is have it return type_spec_subclass rather than None.)
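A sketch of that decorator form, assuming `register_type_spec` simply returns its argument as suggested above (the `MaskedTensorSpec` class is hypothetical):

```python
import tensorflow as tf

def register_type_spec(type_spec_subclass, name=None):
  # ... record type_spec_subclass in the registry under `name` or a default ...
  return type_spec_subclass  # returning the class makes it usable as a decorator

@register_type_spec
class MaskedTensorSpec(tf.TypeSpec):
  ...
```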

```python
  raise NotImplementedError
```

**Note:** `tf.ExtensionType` is a Python `Protocol`, so it does not need to be

Note that typing.Protocol is new in Python 3.8. I presume you'll want to preserve some degree of backwards compatibility with older versions of Python? If so, how is that handled?

edloper (Contributor Author) replied:

We will define a tf.is_extension_type method which functions appropriately in all versions of Python.
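A minimal sketch of how such a check could work on any Python version, assuming the protocol is detected structurally via the method it requires:

```python
def is_extension_type(value):
  # Works without typing.Protocol / Python 3.8: just check for the protocol
  # method proposed in this RFC.
  return hasattr(type(value), '__tf_type_spec__')
```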


**Note:** The `to_boxed_tensor` and `from_boxed_tensor` methods are typically implemented by defining new C++ kernels that encode values using tensors with `dtype=tf.variant`. The gradient for `to_boxed_tensor` typically calls
A contributor commented:

I'm trying to figure out if/how the RFC changes the SavedModel protobuf representation, and how it affects the downstream compilation stack in the future.

Does this mean all the extension types will be DT_VARIANT in the protobuf representation?

edloper (Contributor Author) replied:

Extension types (aka composite tensors) are already handled by SavedModel. E.g., you can use SavedModel to save/load models that take SparseTensor or RaggedTensor inputs, or return SparseTensor or RaggedTensor outputs.

In particular, SavedObjectGraph contains a set of SavedConcreteFunctions. Those SavedConcreteFunctions use StructuredValue to record their input & output signatures. StructuredValue (defined in tensorflow/core/protobuf/struct.proto) has a TypeSpecProto message that's used to encode TypeSpecs. Currently, it has an enum (TypeSpecClass) which indicates the actual type; we will be adding an "EXTENSION_TYPE_SPEC" value to that enum, and a string field to TypeSpecProto, which will be used to look up the appropriate TypeSpec in the registry.

When a saved concrete function is called, the TypeSpecs that were saved in the SavedObjectGraph are used to decompose the arguments into component tensors, which are then fed into the saved concrete graph. If the result is an extension type, then the TypeSpec in SavedConcreteFunction.output_signature is used to pack the component tensors generated by the saved concrete graph into an appropriate extension type.
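As a concrete illustration that composite tensors already round-trip through SavedModel (the module name and path here are arbitrary):

```python
import tensorflow as tf

class RaggedModule(tf.Module):
  @tf.function(input_signature=[tf.RaggedTensorSpec([None, None], tf.int32)])
  def double(self, rt):
    return rt * 2

tf.saved_model.save(RaggedModule(), '/tmp/ragged_model')
restored = tf.saved_model.load('/tmp/ragged_model')
# The saved RaggedTensorSpec decomposes the argument into component tensors,
# feeds them to the concrete graph, and repacks the ragged result.
print(restored.double(tf.ragged.constant([[1, 2], [3]], dtype=tf.int32)))
```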

types and designs. If general-purpose types are developed that become
sufficiently mature, we may consider bringing them into the main TensorFlow code
base.


@ematejska added the RFC: Accepted (RFC Design Document: Accepted by Review) label and removed the RFC: Proposed (RFC Design Document) label on Aug 4, 2020
@ematejska commented:

This has been accepted at the design review but is waiting for the notes to be posted before merging.

edloper (Contributor Author) commented Aug 11, 2020

Notes from design review

Which class should decompose/reconstruct values? In an alternative design, we could move the job of decomposing values into tensors & reconstructing them from tensors from TypeSpec to ExtensionType. This decision has no impact on the functionality/capabilities of the design, only on its ergonomics. apassos@google expressed a preference for the current design. edloper@ agreed that if there are no strong opinions, then we should stick with the current design. Decision: status quo design.

Registry vs. Protocol vs. Base Class - what should we use to define extension types? The current RFC proposes that the API for defining TensorFlow extension types be implemented using protocols. Two alternative APIs that we considered are base classes and a registration mechanism. Decision was to stick with a protocol (status quo).

Should it be possible to define extension types non-invasively? The current RFC does not allow types to be turned into extension types "non-invasively" -- i.e., you can't take a type that you don't own, and make it an extension type (e.g. by registering it). It was agreed that this wasn't a desirable feature.

Should TypeSpec have a name field? The current RFC does not add a "name" as part of the TypeSpec class. It was agreed that load-bearing names aren't a good idea. Decision: no names.

Should user-defined types be supported in gradients?

  • As inputs, yes. This should just work with very little effort?
  • If these are purely container types, then this should definitely just work…
  • This does mean that specific implementations used in dispatch may need to add differentiability.
  • Outputs can be made to work via tf.ones and dispatch.

Should new symbols (e.g. tf.TypeSpec) be added in an "experimental" namespace first? Decision: yes (e.g. tf.experimental.TypeSpec)

How would external libraries (Keras) support overrides/dispatch?

  • On some level, extension type owners need to own this issue.
  • Library owners need to test this as well. E.g. Keras layers can explicitly indicate that they support certain extension types.
  • Most layers can just rely on overriding ops.
  • A library like addons would need unit tests to declare what types to support.
  • What mechanism do we provide addons to support dispatch?
    • We need to allow others to use the dispatch decorator.

Questions on decorator to declare something supports dispatching:

  • If this is something that third parties can use, should we have a string addition?
    • Right now K uses tf.export’d name
    • What is the canonical string name for ‘a thing that supports dispatching’?
      • package_name.symbol?
  • What about instance methods? Like linear operators. How do these support dispatch?
    • We should call the dispatcher with all args, including ‘self’.
    • Dispatch already supports dispatch-to-multiple-things anyways.
  • What args do we want to dispatch on?
    • Currently only on args that expect Tensor, ListTensor
    • Not on attr, nest-of-tensors.
    • Currently only needed for ‘slice’, and that’s already handled via dispatch on __getitem__ with unpacking dispatcher.
      • This is due to loss of static shape information needed for TPUs.
    • We will need to fix certain code that expects to handle static values as integers.

Long-term question: In the future, does it make sense to allow type extensions all the way down to the runtime?

  • We already implement certain things as Tensor subclasses.
  • Long-term, more things should be encoded in the graph.
    • For composites, we are missing op composition. Python allows us to get parallelism, and specificity on what part(s) of the composite we want to touch.
    • As we get better with composition, it might make more sense to move that down a bit.
    • We should aim to have more of these types deeper in the stack -- the compiler is losing information right now.

What about Variant tensors?

  • Example was not Variant, but Optional.
  • Variant allows more preservation of internal data but you have to implement everything
  • You get different optimization opportunities in different contexts. Exploded representations lose RT data but allow common subexpression elimination; variants don't get this.

What about types that do not naturally map to Tensors? Have we thought about exposing other types to the runtime?

  • That would likely be useful.

Decision: RFC approved.

@ematejska merged commit 5d4f70e into tensorflow:master on Aug 13, 2020
copybara-service bot pushed a commit to tensorflow/tensorflow that referenced this pull request Oct 21, 2020
…nsorFlow Extension Types](tensorflow/community#269).

PiperOrigin-RevId: 338353976
Change-Id: I7f155854973fb496c5c0f43eb24a46aa7418c8d6