diff --git a/.github/ISSUE_TEMPLATE/custom.md b/.github/ISSUE_TEMPLATE/custom.md new file mode 100644 index 000000000..b088830a2 --- /dev/null +++ b/.github/ISSUE_TEMPLATE/custom.md @@ -0,0 +1,14 @@ +--- +name: Custom issue template +about: For community repo issues +title: '' +labels: '' +assignees: '' + +--- + +# Community repo + +The only issues you should file here concern the community's documentation or processes. + +All other bugs should be filed on the appropriate repo, questions directed to the various email groups, or if it's a "how-to" question, StackOverflow. diff --git a/CODEOWNERS b/CODEOWNERS index 4845d0ce9..011df5ac4 100644 --- a/CODEOWNERS +++ b/CODEOWNERS @@ -14,9 +14,9 @@ sigs/io/ @martinwicke @ewilderj @mrry @yongtang @dmitrievanthony sigs/jvm/ @martinwicke @ewilderj @sjamesr @karllessard @tzolov sigs/micro/ @martinwicke @ewilderj @petewarden sigs/mlir/ @ewilderj @pkanwar23 -sigs/swift/ @ewilderj @saeta @dynamicwebpaige +sigs/swift/ @ewilderj @saeta @ematejska sigs/testing/ @ewilderj @dynamicwebpaige # RFCs -rfcs/ @ewilderj @martinwicke @goldiegadde +rfcs/ @ewilderj @martinwicke @theadactyl @ematejska diff --git a/governance/api-reviews.md b/governance/api-reviews.md new file mode 100644 index 000000000..509dfd983 --- /dev/null +++ b/governance/api-reviews.md @@ -0,0 +1,268 @@ +# tensorflow/api-owners review practices + +## Overview + +This is an attempt to gather commonly discussed topics when doing API +reviews. It’ll hopefully be a useful resource to both API owners and people +proposing API changes. [TF API Owners](https://github.com/orgs/tensorflow/teams/api-owners) +meet twice weekly to discuss changes. We try to get to PRs on the next meeting, +but we don’t always make it all the way through. If your change is particularly +urgent, please ping the PR to notify us of any urgency. + +## Process + +We only look at changes which have already been approved by other reviewers. If +there are major outstanding comments, we will wait with API review until those +are resolved. If there are questions for API owners, explicitly raise this in +the comments to get an answer. + + +## High level points + +### Backward and forward compatibility +We avoid backwards-incompatible API changes. We also avoid +backwards-incompatible behavior changes, such as restricting the set of valid +inputs to a function or extending the set of valid outputs of a function. Adding +support for previously not supported behavior is okay, as are changes to +explicitly experimental APIs (see section below). When needing to provide a new +or different behavior, we strongly prefer a new version of the API over breaking +backwards compatibility. Note that we are free to deprecate APIs; we just cannot +break code which relies on their documented behavior. We need to worry about +backward compatibility both of our python APIs and of the serialized GraphDefs, +and in general breaking serialized GraphDefs is worse than breaking the python +APIs. + +Forward compatibility is more subtle: we should avoid changing the graph +produced by currently correct python code without a three weeks notice. This +comes up most frequently when adding new ops, but also applies to non-obvious +things such as the graph emitted by gradients or pfor. + + +### Docstrings + +TF APIs should have comprehensive documentation in the form of docstrings. If at +all possible these docstrings should have runnable examples, and these examples +should form a doctest so they stay correct. 
The examples should demonstrate an +end-to-end user workflow, such that it’s clear how to generate the necessary +inputs for the API and what to do with the outputs. The docstring should be +understandable by someone who is not familiar with TF. See the [guide to writing +TF docstrings](https://www.tensorflow.org/community/contribute/docs_ref) for +more information. + +Our documentation generator for classes only sees methods, so prefer defining +members as properties instead of assigning them in `__init__`. + +Docstrings should only refer to other public TF API symbols (i.e. do not refer +to other symbols defined in the same file as a function which is just now being +made public) and should refer to public API symbols by their full exported name. + +### Common names + +Prefer keepdims over keep_dims. Prefer axis over dim. Data types are called +dtype. name is a common last argument of ops but backward compatibility mandates +that new arguments are added after the last existing argument, even if that +results in name not being the last argument. + +We generally prefer spelling things out over using abbreviations except when +abbreviations are more standard than spelling things out (i.e. don’t spell out +linalg or svd). When in doubt we ask domain experts or use web search to see +what spelling is most common. + +If possible we prefer to name things in a similar way to numpy (e.g., we would +not pick einsum as a name, but numpy and others before it have, and that +precedent is very strong). + +We prefer experimental namespaces (i.e. tf.data.experimental.foobar) over +experimental-prefixed names (i.e. tf.data.experimental_foobar) except when +adding an experimental class method, or an experimental argument. Experimental +endpoints should be deprecated in a minor release before they can be removed in +the next. We would like new experimental symbols to be things which will +eventually end up in core TF as opposed to things we expect will be phased out +with no clear replacement. The best expectation to have for an experimental +endpoint is that the “experimental” will simply be removed. If you don’t believe +that’ll work, it should probably not be added in its current form. + +### Style + +Generally, follow Google style. + +Avoid redundancy. Do not write arguments of the form `function(..., +enable_feature=False, feature_config=None)` if you can also write `function(..., +feature_config=None)`, where implicitly, `enable_feature = feature_config is not +None`. + +Try to embed well with the ambient language. Think about how your API interacts +with language idioms (e.g., in Python: can it be hashed, i.e., used as a dict +key? Is it iterable? Is it a mapping? Can it be equality compared? +Ordered?). Think about how your API interacts with other pieces of the Python +ecosystem as well— is there an analogue in Numpy or PyTorch that we should +consider aligning with? + +Use language-native constructs wherever you can. In Python, a tuple should be a +tuple. The bar for custom configuration objects is relatively high, a dict or +namedtuple goes a long way. + +In particular, do not expose protobufs directly as part of an API. You can use +protobufs for serialization or to encode network traffic. Protobufs should +always be an implementation detail, and never visible on the API surface. Use +language native constructs (dicts or classes for Python, structs for C/C++) if +you need configuration objects. + +Avoid global (or any non-local) state as much as possible (this includes Python +'with' scopes). 
If you need global context, consider whether it can be +thread-local. The TF API is supposed to be thread-safe. Avoid stateful operation +(mutability) if you can. Both features make it hard to reason about code, and +make composability harder to achieve. + +We prefer strings ("auto", "never", etc) over enums (tf.namespace.AUTO, +etc). Strings are easier to type, and forces us to document all possible values +and their semantics in the docstrings of all places which accept the string, as +opposed to only in the enum definition, which is a little friendlier. + +### Orthogonality and integration with the existing APIs + +Is the new API implementable in terms of existing APIs? If so, we might want to +consider pointing users to using the existing APIs. Does the new API add enough +value over a combination of existing APIs? Does the API solve only a specific +problem (that’s usually a sign combinations of existing APIs would be +preferred)? + +If not, are existing APIs implementable in terms of the new API? If this is +simple, we might want to steer users towards the new and away from the old API +(possibly, old APIs should be deprecated along with introducing the new API). + +If neither is the case, it might be possible that there is a more general API +which makes both the existing API and the new API easy to express. We try to +keep global consistency of our API in mind when reviewing new changes. + +How will this API work together with others? Does it do something slightly +differently than others? Does it expect inputs which match what other parts of +TensorFlow produce? Does its output match what other parts of TensorFlow can +consume? + +Does it do things the same way other similar pieces in TensorFlow do it? E.g., +if a common pattern to achieve a behavior is an extra argument, don't use a +function decorator to achieve the same in a different area of the API. + +Two wrongs don’t make a right. That is, if a bad API already exists in TF, that +does not give license to new APIs to be bad in the same way. Improvement must be +balanced with consistency, however, and sometimes it’s okay to carry small +imperfections into new APIs for the sake of consistency with old APIs. + +### Optional arguments with default values + +Many APIs have optional arguments with a default value. Our recommendation is to +use `None` as the default value of any optional arguments and have the +implementation be responsible for handling it as opposed to using a default +value that directly represents the behavior (e.g. `aggregate='sum'`). The +latter prevents the implementation from distinguishing between the caller not +setting the argument vs. the caller setting the argument to the default value, +which may be needed when the default behavior is changing. + +### Does it belong in TF at all? + +As TF evolves there’s a tendency to put everything inside of it, with costs +compounding over the long term. If there is a reasonable home for a new API +outside core TF (say in addons, io, TFP, or other projects entirely) that can be +strongly preferrable. If new code can be released as independent libraries, it +should be. This is especially true for APIs that are actively evolving; core TF +imposes many restrictions, so it’s far better to trial new APIs outside of the +core library. + +## Adding new ops + +Adding new ops to TF should be done with care. We generally prefer not adding +new ops if possible, but performance, hardware compatibility, and other concerns +often do require new ops. 
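Before proposing a new op it is often worth checking whether the desired behavior can be expressed as a composition of existing ops; if the composition is adequate, no new op or kernel is needed, and differentiation and broadcasting come for free. A minimal sketch of what that looks like (the function below is purely illustrative, not an existing TF API):

```
import tensorflow as tf

@tf.function
def pairwise_sq_dist(a, b):
  """Squared Euclidean distances between rows of `a` ([m, d]) and `b` ([n, d])."""
  aa = tf.reduce_sum(tf.square(a), axis=-1, keepdims=True)   # [m, 1]
  bb = tf.reduce_sum(tf.square(b), axis=-1, keepdims=True)   # [n, 1]
  ab = tf.matmul(a, b, transpose_b=True)                     # [m, n]
  return aa - 2.0 * ab + tf.linalg.matrix_transpose(bb)      # [m, n]
```

Only when such a composition is impossible, or unacceptably slow, is a new op worth its maintenance cost.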
+ +When adding new ops, look for: + + - closure under automatic differentiation (i.e. we avoid ops which are + differentiable but not twice-differentiable, or which are technically + differentiable but not marked as such) + - performant kernels (it’s better not to have an op than to have an op with a + suboptimal kernel; we need to make sure kernel experts have reviewed the + code) + - broadcasting (all numerical ops should broadcast using numpy rules) + - does support for this op have to be added to pfor/vectorized_map? + - dtype support (in general all numerical ops should support the common + integer, floating point, and complex dtypes, if they all make sense; we need + to watch out for int32 on GPUs though) + - device support (cuda kernels should be implemented if possible, and similarly + a tf/xla bridge entry should be added if it makes sense) + - attributes versus inputs (anything which can be an input to an operation + should be an input, and attributes should only be used to parametrize the + operation in ways that affect the output dtypes or sometimes shapes) + - state management (is the op stateful? Can it instead be made stateless and + rely on optimizations like memory reuse for performance? Can it be made to + keep its state using one of the existing mechanisms like variables? If not, + its state should be encapsulated using resource handles if at all possible) + - we generally don’t like ops which are supported only on a single device (be + it CPU, GPU, XLA, TPU, etc) and prefer to have at least a plan for writing + device-agnostic code + - should the python layer for this operation support raggedtensor/sparsetensor? + +## Experimental APIs + +Experimental APIs are APIs which have the word 'experimental' somewhere in their +name; for example `tf.experimental.foo`, or `tf.foo.experimental.Bar`, or +`tf.foo(experimental_bar=True)` or `tf.Foo().experimental_bar()`. We generally +prefer experimental namespaces when possible, so prefer +`tf.foo.experimental.Bar` over `tf.foo.ExperimentalBar`. + +Experimental APIs are APIs intended to be added to TensorFlow as-is, but which +we reserve the right to change in backwards-incompatible ways if we have +to. This is different from apis in `tensorflow/addons`, many of which are not +necessarily intended to be added to core TF as they might have a more narrow use +case initially (if APIs in `tensorflow/addons` do become widely useful they can +"graduate" to core, either using experimental or not). + +No temporary APIs should be added to experimental (i.e. "we just need this until +certain bugfix or certain new feature becomes available" is not a valid reason +to add an API with experimental in the name.) + +No API with known deficiencies should be added to experimental. Experimental +APIs should, to the best of our knowledge, not be expected to change in a known +way (no argument with a known bad name, etc). Experimental can, however, be used +for APIs which are a work-in-progress: it's fine to add experimental methods to +a base class even if those methods are only implemented on some subclasses as +long as we expect all classes to eventually implement those. + +The same amount of due diligence required for a real API is required for an +experimental API: this means tests, benchmarks, documentation, end-to-end +examples, etc + +Experimental APIs are not a license to break users. This means: + 1. we do not remove experimental APIs which are widely used without an effort + to help migrate users away + 2. 
experimental APIs are not removed without warning and don't have + backwards-incompatible changes made to them without warning (the warning can be + a deprecation on version 2.x and removal on 2.x+1, but plain removal on 2.x + with no notice on 2.x-1 is not ok) + +Small changes which are mentioned in relnotes and have obvious fixes might be +made (for example if adding a new argument to a long argument list and we +believe there are few pass-by-position users we might allow the new argument to +be added to the middle and not the end of the parameter list). + +Large backwards-incompatible changes to experimental APIs still require an +`experimental_foo_v2` or similar backwards-compatible evolution plan to avoid +breaking users of the existing experimental API. + +No API endpoint should stay in experimental forever. If a particular +experimental API hasn't had major changes in two minor releases we should remove +the experimental annotation from the API name or delete it. If we do want to +delete it we need to have a deprecation plan that can migrate all users to some +other API endpoint or composition of existing APIs. In rare cases experimental +APIs can continue to be iterated on after many releases (see TPUStrategy); this +only applies for fairly large API surfaces. + +When removing the experimental annotation we should, if at all possible, allow +escape routes to not break existing code. This means toplevel symbols +`tf.experimental.foo` and methods like `tf.Class.experimental_foo` should get a +deprecation warning on 2.x before deletion on 2.x+1; we should use the +doc_controls decorators to not pollute API docs with deprecated "graduated" +experimental APIs. For experimental function arguments we should consider +catching `**kwargs` to raise the proper warnings for at least one version (note +though that `**kwargs` is generally discouraged from our APIs; we prefer +explicitly named keyword arguments if at all possible). diff --git a/governance/cpp-style.md b/governance/cpp-style.md new file mode 100644 index 000000000..4adaa1882 --- /dev/null +++ b/governance/cpp-style.md @@ -0,0 +1,64 @@ +# C++ Coding Style + +Tensorflow follows [Google C++ style](https://google.github.io/styleguide/cppguide.html), +with a few additions. + +## Status + +Functions which can produce an error should return a `tensorflow::Status`. To propagate an +error status, use the `TF_RETURN_IF_ERROR` macro. + +``` +TF_RETURN_IF_ERROR(f()); +``` + +## StatusOr + +`StatusOr` is the union of a `Status` object and a `T` object. It offers a way to use +return values instead of output parameters for functions which may fail. + +For example, consider the code: + +``` +Output out; +Status s = foo(&out); +if (!s.ok()) { + return s; +} +out.DoSomething(); +``` + +With `StatusOr`, we can write this as + +``` +StatusOr result = foo(); +if (!result.ok()) { + return result.status(); +} +result->DoSomething(); +``` + +**Pros:** + +Return values are +[easier to reason about](https://google.github.io/styleguide/cppguide.html#Output_Parameters) +than output parameters. + +The types returned through `StatusOr` don't need to support empty states. To return a type +as an output parameter, we must either use a `unique_ptr` or support an empty state for the +type so that we can initialize the type before passing it as an output parameter. `StatusOr` +reduces the number of objects we have in an "uninitialized" state. + +**Cons:** + +`StatusOr` adds complexity. 
It raises questions about what happens when `T` is null and +how `StatusOr` behaves during moves and copies. `StatusOr` also generally comes with +macros such as `ASSIGN_OR_RETURN`, which add additional complexity. + +The current Tensorflow codebase exclusively uses `Status` instead of `StatusOr`, so +switching over would require a significant amount of work. + +**Decision:** + +Tensorflow foregoes the use of `StatusOr<>` because it doesn't add enough value to justify +additional complexity. diff --git a/governance/design-reviews.md b/governance/design-reviews.md new file mode 100644 index 000000000..7fe586eff --- /dev/null +++ b/governance/design-reviews.md @@ -0,0 +1,227 @@ +# tf-design-reviews criteria + +## Overview + +The TensorFlow team has run internal and public design reviews for a while +now. This document tries to capture what type of questions get asked and +concerns get addressed in TF design reviews. It is intended to be used by design +authors as a way of spot checking whether a design review will be useful for +them and by design sponsors as a way of making sure a design proposal clears the +bar for review (ideally every topic in this document should be addressed by the +design proposal for it to be considered). + +The main goal of tf-design-reviews is to socialize big changes to TF, document +them, and ensure all stakeholders get a chance to comment on planned +improvements before they’re implemented. Any time a change is made to TF that +will affect multiple aspects of its design or user interface, we should solicit +a design review. TF design reviews themselves are not binding: final approval +rests with whoever has the authority to approve the required code changes, and +the design review is a tool to get consensus and feedback on big changes before +actual approval. + +By default TF design reviews should go through the open RFC process in the +tensorflow/community repository, but we will on rare occasions accept design +reviews of google-internal TF-related infrastructure which should be kept +private due to reasons beyond our control. + +## General considerations + +Every item in this section should be addressed by a TF design review. We do not +require a solution prior to review but we do want to see that the review author +has considered these issues. It is the design sponsor’s job to ensure that +review documents have thought through these issues. + +### Performance +Performance is the core reason why most end users use TensorFlow at all; hand +writing code with the same level of performance is prohibitively expensive, and +any other similarly-performing or better-performing solution can also be +integrated in the ecosystem in principle. In that vein, all new designs to TF +should carefully consider their performance implications. + +Performance in TF is multi-faceted: we need to worry about scaling from very +small devices (including microcontrollers) to very large devices (beyond TPU +pods); we need to worry about interactive usage (so the cost of making small +changes should be small) and about batch usage (where it’s ok to sacrifice some +startup time to improve steady-state performance); we care both about throughput +(maximizing accelerator utilization saves a lot of money) and latency (as TF is +used in all parts of Google’s software stack); we also care about performance on +many types of hardware. + +Can a particular design proposal be implemented efficiently? Does it impose any +inherent limits on the performance in any of the scenarios above? 
How will it +interact with our other tools for performance (grappler, XLA, eigen, tf.data, +etc)? + +### Scope + +Does this proposal even belong in TF? As TF itself grows, it’s becoming +substantially more expensive to develop software inside TF itself than as a +separate TF-using project. In this light we need to evaluate whether it’s at all +possible to release a broadly new API or library as its own separate project in +the TF ecosystem. + +Even separate projects in the TF ecosystem can benefit from TF’s devrel, blog, +twitter, etc for promotion. It might be possible to share resources dedicated to +CI or github triage, or share infrastructure around syncing to/from google3. + +Ideally the only things being added to core TF at this point are things which, +if they are not in core TF, they dramatically limit the usefulness of core TF +itself. General protocols and APIs which different libraries in the TF ecosystem +can implement / accept are good examples of things which undoubtedly belong in +core TF. Ops and kernels used to need to be in core TF, but this is no longer +the case as other projects have sustainable releases of their own binary blobs +and the TF team is working to make it cheaper to release ops and kernels outside +core TF. + +Note that we also encourage using the TF design review slot for reviewing +proposals which despite not living inside core TF are expected to be a part of +the broader TF ecosystem. + +### Programmability / flexibility + +TensorFlow is fundamentally a library to be programmed, and not a collection of +packaged black-box solutions. While it’s cheaper for any individual problem to +solve it with a simple one-line push-button packaged solution this tends to work +poorly in the long run, and lead to usability cliffs and undue API pressures. + +For example, let’s think about what would happen if instead of providing tools +to build neural network layers, TF only provided a function that built an entire +network for you. At first we could have very impressively short code examples +(“switch from inception to resnet50 in one line of code!”), but over time users +whose use cases are not exactly covered by this API would either have to +reimplement substantial parts of it themselves or would (most likely) file bugs +asking for small extensions to the API (“can we make resnet52? resnet36?”). Over +time, these bottleneck APIs develop large parameter lists of mutually exclusive +parameters which amount to a poorly defined configuration language for how to +use them. + +A key consideration when evaluating a TF design proposal is what would happen +for use cases that are slightly different from the use cases covered in the +proposal itself. The goal is not that the proposal should cover everything but, +rather, that it should be possible to easily reimplement parts of the proposal +using lower level APIs already in TF. If this is not the case then instead of +first implementing the end-to-end solution we need to discuss what low-level +APIs TF should have in such that this proposal could be easily implemented, and +only then reevaluate this proposal. + +We also worry about proposals which are too device-specific (be it TPU-specific +or GPU-specific or CPU-specific). While many such things seem reasonable when +first proposed, they break down over time as the set of users for different +devices overlaps quite a bit. + +### Integration + +As TF has grown, it has sprouted an ecosystem of tools and libraries both +internal and external to TF. 
New entries to this ecosystem should, as much as +possible, coexist and peacefully cooperate with other entities in the TF +ecosystem. Failing that, new entries should cleanly replace existing +ones. Awkwardly coexisting is not an option we recommend. + +The ecosystem includes both things currently inside TF such as Estimator, Keras, +tf.data, tf.distribute, tf.tpu, XLA, or tf.saved_model as well as things +developed outside TF, such as TF probability, vizier, TF serving, MLIR, TFRT, +Sonnet, among others. If existing integration points do not suffice, we should +consider developing new integration points (i.e. how the Sonnet team developed +tf.Module to integrate sonnet, which lives outside TF, with tf.train.Checkpoint, +tf.keras, tf.function, or tf.saved_model). + +It is also important that new designs don’t break existing abstractions which TF +supports, such as eager execution, functions, control flow, gradients, or +automatic vectorization. In general, libraries which use simpler TF primitives +(like tf.data) are easier to integrate into the ecosystem than libraries which +try to rewrite TF programs (like TF transform v1). Similarly, we should prefer +proposals which rely on explicit APIs to accomplish things over proposals which +want to do post-hoc graph rewriting (or make converters, or exporters) as those +tend to integrate poorly with each other and tend to be hard to directly +program. + +### Maintenance + +As many proposals for TF improvements come from outside TF or from outside the +subteams in TF which currently maintain related functionality, TF design +proposals should be upfront about the maintenance story for new functionality +and code. + +It is perfectly fine (and common) to punt maintenance on the TF team, but we +should — ahead of the design review — figure out who specifically in the TF team +is signing up to maintain this specific design. + +### User groups + +While TensorFlow cannot be everything for everyone we do try to cover a broad +swath of machine learning use cases, spanning the spectrum from research to +production, from small to large devices, and from commercial to educational +uses. + +It is important for every proposal to TF to talk about which segments of our +user community’s needs are being addressed and for which ones this is expected +to be irrelevant. Specifically, consider stereotypical pure researcher in ML, +researcher applying ML to other fields, students learning ML, industry +professionals applying ML with little to no understanding, industry applied ML +developers, mobile developers, and others. + +## Specific considerations + +Some particular subsystems of TF have their own considerations which are often +relevant for TF design reviews. It is up to individual designs’ sponsors whether +any of these topics needs to be addressed in the document before review. + +### Eager/graph mode + +In TF1.x many libraries implicitly or explicitly assume graph-based +execution. As TF 2.0 has been released, eager execution is on by default. This +means that all new TF APIs should be usable from eager execution or from graphs, +and new design proposals should be implemented so they work with both. + +In practice this means we cannot rely on per-graph global state, reference +dtypes, and graph pruning to ensure program correctness. Similarly it was +possible in some cases in TF1.x to treat a Tensor as a Promise. In TF2, however, +a Tensor is an already-computed value, and if you need a promise use instead a +function which can compute a tensor on-demand. 
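To make the last point concrete, here is a minimal sketch of the TF2 idiom (illustrative only, not tied to any particular proposal):

```
import tensorflow as tf

# In TF2 a Tensor is an already-computed value.
x = tf.constant([1.0, 2.0])
y = x * 2.0                      # `y` already holds [2.0, 4.0]

# Deferred computation is expressed as a callable rather than a
# placeholder-style Tensor; the same function runs eagerly and in graphs.
@tf.function
def scale(v, factor):
  return v * factor

z = scale(x, tf.constant(3.0))   # computed on demand
```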
+ +### Keras + +Keras has a special status existing both inside and outside TF. As such, changes +to Keras need to consider the impact on the entire Keras community. New APIs to +be added to Keras can comfortably live in tf-addons. Changes to core Keras APIs +need a review owners or sponsor from the Keras team before a TF-wide review. + +Further, changes outside the scope of Keras should address how the change will +interact with Keras users, if at all. For example, if a new CompositeTensor is +proposed, will it be a plausible input to Keras layers? If so, how will support +be implemented? + +### tf.data + +tf.data is TensorFlow recommended API for data (pre)processing and any designs +that pertain to handling data should consider how they relate to tf.data. New +designs pertaining to handling data should provide useful functionality on top +of tf.data (such as the TFDS project or a library of common transformations for +tf.data.Dataset.map) as opposed to alternatives to tf.data. + +In addition, new CompositeTensor subclasses should strongly consider +implementing the optional BatchableTypeSpec interface which is needed for +tf.data to be able to batch and unbatch instances of the subclass. + +### SavedModel + +SavedModel changes in particular need to be both forward and backwards +compatible, as SavedModel files will be written by and read from different TF +versions entirely. In general, removing things from the format is not OK but +adding things is possible if new additions are not required to correctly read +the model from older binaries. + +### “Impossible” designs + +There are many things which are possible to make work for specific cases but +impossible to make work in general. We should avoid proposing changes to TF that +look like they work in general but in practice each new use case needs to be +covered by manual work from the TF team. + +### Distribution Strategies + +tf.distribute is the recommended API for distributing computation over GPUs, +TPUs and multiple machines. It is important to consider the implications of a +new design wrt how it would work in a distributed setting. There may be explicit +changes required to ensure the new functionality works seamlessly with / without +tf.distribute. diff --git a/governance/rfc-admin-guide.md b/governance/rfc-admin-guide.md new file mode 100644 index 000000000..6fdb6ff5e --- /dev/null +++ b/governance/rfc-admin-guide.md @@ -0,0 +1,138 @@ +# Admin guide for RFCs + +## Overview + +This document describes the process for community managers administering +TensorFlow RFCs. + +|Author |Edd Wilder-James [@ewilderj](https://github.com/ewilderj) | +:------------------------------|:-----------------------------| +|Last updated |2019-10-21 | + +## RFC Submission Process + +### 1. PR is submitted to `tensorflow/community` + +When a PR is submitted containing an RFC proposal, check for basic +formatting concerns. + +* The filename should be `rfcs/YYYYMMDD-my-rfc.md` - where YYYYMMDD is the + date, and hyphens connect any naming parts. No underscores. No uppercase + letters. + +* The header block of the RFC should be filled in properly, including the + status field set to "Proposed" + +### 2. Conform the RFC title + +* In GitHub ensure the PR title is `RFC: The RFC's Title`. Check past PRs to + see how they're all consistent. + +### 3. Edit the PR description + +The description (the first comment on the PR) of every RFC should look the + same. They should contain, in order: + + * When the public review period closes. 
This is at least two weeks from the +date of publication. + + * The header table from the RFC showing author, sponsor, date. + + * A summary of what the RFC is about + +Here's an example: + +
+ +*Comment period is open until 2019-08-28* + +# Kernel and Op Implementation and Registration API + +| Status | Proposed | +:-------------- |:---------------------------------------------------- | +| **Author(s)** | James Ring (sjr@google.com), Anna Revinskaya (annarev@google.com) | +| **Sponsor** | Günhan Gülsoy (gunan@google.com) | +| **Updated** | 2019-08-14 | + +## Objective + +Tensorflow (TF) currently provides a C++ API for implementing kernels and ops. The Voltron project aims to create a modular/plugin-based TF implementation with API and ABI surfaces. Plugins will be able to create and register custom kernel and op implementations. + +In order to provide a stable ABI, the Voltron team has chosen to provide C APIs to plugin authors. This document introduces the C API for op and kernel registration. For authors who wish to continue using C++ to interface with TensorFlow, an ABI-stable C++ header-only API is provided. +
+ +### 4. Apply labels + +* Apply the `RFC: Proposed` label, and any other appropriate label for the + particular area of TensorFlow concerned, e.g. `TFX`. + +### 5. Add the PR to the `RFC Management` project + +### 6. In the `RFC Management` project, move from "Needs attention" to "Under review". + +### 7. Publicize the RFC to `developers@tensorflow.org` and any other community-relevant mailing lists + +Here's a template announcement. Check out the [many examples](https://groups.google.com/a/tensorflow.org/g/developers/search?q=RFC). + +
+To: developers@tensorflow.org
+Subject: [RFC] ACME TensorFlow API
+
+Hi folks,
+ +I'm pleased to announce the publication of a new TensorFlow RFC, +[ACME TensorFlow API](https://github.com/tensorflow/community/pull/162). + +The comment period for this RFC is open through YYYY-MM-DD. Comments are +invited to the [pull request +linked](https://github.com/tensorflow/community/pull/162). You can view the +design doc there and also leave your comments inline on the +[document source](https://github.com/tensorflow/community/pull/162/files). + +**Summary** + +The TensorFlow ACME API allows usage of all vintage cartoon characters +in an agent-based simulation. Wile E Coyote and the Road Runner are +default personas, but we also propose the addition of Yosemite Sam +and Bugs Bunny. + + +Thanks in advance for your feedback! +
+ + +## RFC Acceptance Process + +When an RFC's comment period is over, a review meeting is usually held. +(There may be occasions when one is not needed, consult with the RFC author). +It is the responsibility of the author or sponsor to post the notes from +that review into a comment on the PR, but you may need to remind them to do +this. + +You can move the RFC into the "Awaiting notes" part of the `RFC Management` +project to help keep track. + +**If the RFC is accepted**, ask the proposer to submit a final update, changing +the status to Accepted, and adding the RFC number into the header, per +the template (an RFC's number is the same as the PR number GitHub assigned it.) + +Just occasionally you might have to do this yourself: you can edit the +Markdown in the PR yourself, as a code owner for the repository. + +You can then: + +* Remove the `RFC: Proposed` label and add the `RFC: Accepted` one +* Approve and merge the PR. + +This should automatically move it to `Accepted PRs` in the `RFC Management` +project. + +**Other possible end-states** + +* If revisions are required, note that in the PR comment, keep the PR open but + move it to `In Revision` in the `RFC Management` project. + +* If the RFC is abandoned, note that in the comments, close the PR, and move + it to the `Not progressed` column in the `RFC Management` project. + + diff --git a/rfcs/20180827-api-names.md b/rfcs/20180827-api-names.md index b4dd11cd3..8964aa6a9 100644 --- a/rfcs/20180827-api-names.md +++ b/rfcs/20180827-api-names.md @@ -55,10 +55,10 @@ Furthermore, TensorFlow API has many users. Therefore, we should avoid removing We plan to add the following additional namespaces: -**tf.random** - will contain random sampling ops. -**tf.keras.layers** - will contain all symbols that are currently under `tf.layers`. Note that signatures of these symbols will likely change to match layers under tf.keras.layers better. -**tf.keras.losses** - will contain all symbols that are currently under `tf.losses`. Note that signatures of these symbols will likely change to match losses under tf.keras.losses better. -**tf.keras.metrics** - will contain all symbols that are currently under `tf.metrics`. Note that signatures of these symbols will likely change to match metrics under tf.keras.metrics better. +**tf.random** - will contain random sampling ops. +**tf.keras.layers** - will contain all symbols that are currently under `tf.layers`. Note that signatures of these symbols will likely change to match layers under tf.keras.layers better. +**tf.keras.losses** - will contain all symbols that are currently under `tf.losses`. Note that signatures of these symbols will likely change to match losses under tf.keras.losses better. +**tf.keras.metrics** - will contain all symbols that are currently under `tf.metrics`. Note that signatures of these symbols will likely change to match metrics under tf.keras.metrics better. Note that we already introduced some new namespaces earlier in June, specifically @@ -77,8 +77,8 @@ move [TensorFlow Debugger](https://www.tensorflow.org/guide/debugger) to We plan to deprecate entire contents of the following namespaces: -**tf.logging** - Python `logging` module can be used instead. -**tf.manip** - We will keep endpoints in root for symbols in `tf.manip`. `tf.manip` was added recently but most tensor manipulation ops are frequently used and it makes sense to keep them in root instead. +**tf.logging** - Python `logging` module can be used instead. 
+**tf.manip** - We will keep endpoints in root for symbols in `tf.manip`. `tf.manip` was added recently but most tensor manipulation ops are frequently used and it makes sense to keep them in root instead. ## Additional endpoints @@ -110,7 +110,7 @@ Endpoints*. Note: the list in the appendix does not include endpoints under `tf. # Impact -Browsing for symbols should become easier. For e.g. page for tf.math namespace should display all math functions that TensorFlow provides. Similarly, tf.sets namespace page should display all available set operations. +Browsing for symbols should become easier. For e.g. page for `tf.math` namespace should display all math functions that TensorFlow provides. Similarly, `tf.sets` namespace page should display all available set operations. Removing symbol endpoints would break references in user code. We plan to apply removals as a part of TensorFlow 2.0 release and provide a conversion script that would replace deprecated references with canonical ones. Initial script is at https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tools/compatibility/tf_upgrade_v2.py. It will be updated to match changes in this doc. diff --git a/rfcs/20180918-functions-not-sessions-20.md b/rfcs/20180918-functions-not-sessions-20.md index ed11a3bcd..390aab5f4 100644 --- a/rfcs/20180918-functions-not-sessions-20.md +++ b/rfcs/20180918-functions-not-sessions-20.md @@ -1,6 +1,6 @@ # TensorFlow 2.0: Functions, not Sessions. -| Status | Proposed | +| Status | Accepted | :-------------- |:---------------------------------------------------- | | **Author(s)** | ashankar@google.com, joshl@google.com | | **Sponsor** | apassos@google.com | @@ -219,7 +219,7 @@ A preview of this implemented in `tf.contrib.eager.defun` today (using [Au ### Functions that create state -In the above code, no `tf.Variable` objects are created inside a `tf.function` decorated function. This is makes it clear that the code will have the same semantics once wrapped. +In the above code, no `tf.Variable` objects are created inside a `tf.function` decorated function. This makes it clear that the code will have the same semantics once wrapped. 
Note that if the function naturally creates state only on the first trace, all is well: diff --git a/rfcs/20181016-optimizer-unification.md b/rfcs/20181016-optimizer-unification.md index e2fe174d6..cfc6ee6c8 100644 --- a/rfcs/20181016-optimizer-unification.md +++ b/rfcs/20181016-optimizer-unification.md @@ -1,6 +1,6 @@ # TensorFlow 2.0: Optimizer unification -| Status | Proposed | +| Status | Accepted | :-------------- |:---------------------------------------------------- | | **Author(s)** | Francois Chollet (fchollet@google.com) | | **Sponsor** | Martin Wicke (wicke@google.com) | @@ -313,13 +313,13 @@ FtrlOptimizer(learning_rate, Proposed signature: ```Python -FTRL(learning_rate, +FTRL(learning_rate=learning_rate, learning_rate_power=-0.5, initial_accumulator_value=0.1, l1_regularization_strength=0.0, l2_regularization_strength=0.0, - name="FTRL", - l2_shrinkage_regularization_strength=0.0) + l2_shrinkage_regularization_strength=0.0, + name="FTRL") ``` diff --git a/rfcs/20190305-modular-tensorflow.md b/rfcs/20190305-modular-tensorflow.md new file mode 100644 index 000000000..af9c17eed --- /dev/null +++ b/rfcs/20190305-modular-tensorflow.md @@ -0,0 +1,465 @@ +# Modular TensorFlow + +| Status | Accepted | +:-------------- |:---------------------------------------------------- | +| **Author(s)** | Gunhan Gulsoy (gunan@google.com) | +| **Sponsor** | Martin Wicke (wicke@google.com) | +| **Updated** | 2019-11-25 | + + +## Motivation + +TensorFlow is a very successful open source project. Since it has been open sourced, [1800+ contributors](https://github.com/tensorflow/tensorflow) have submitted code into TF from outside Google. However, as more and more developers contribute, it becomes more and more difficult to manage contributions in the single repository. + +This project aims to split the TensorFlow codebase into **smaller, more focused**, repositories that can be released and managed separately. These modules will talk to each other using **well defined APIs**. Thanks to the module APIs, these modules are now **managed/owned/released independently.** + +### Problems addressed + +#### Spaghetti dependencies + +Everything being in a single repository encourages everyone to freely use all other code in the repository, even when this does not make sense. This ensures the library is very difficult to modify later. + +#### Long build times + +Many volunteers and developers outside Google use their laptops for development. On such systems, TF development cycles require building all of tensorflow that takes around 2 hours. While there is an argument to be made that bazel caching should ensure that they build all of TF only once, without forge, bazel caching is not working as [well as expected](https://docs.bazel.build/versions/master/remote-caching.html#known-issues). + +#### Adding support for new hardware is very difficult and not scalable + +The ML ecosystem is constantly expanding. New hardware for accelerating ML applications is being worked on by many teams inside and outside Google. As the most popular machine learning framework, TF is expected to add support for many of these hardware as quickly as possible. + +Currently, this means that all such hardware developers need to check in their code into the [main tensorflow repository](http://github.com/tensorflow/tensorflow). This means that all the changes are required to go through TF team's review. This can make merging support for new hardware very very difficult. 
+ +#### Long PR review queue + +TensorFlow is a very successful opensource project. It is the [4th most forked](https://github.com/search?o=desc&q=stars:%3E1&s=forks&type=Repositories) and [5th most starred](https://github.com/search?q=stars%3A%3E0&s=stars&type=Repositories) project on github. This also means that TF receives quite a lot of opensource contributions. The TensorFlow team has to review all contributions to TensorFlow itself. This creates a bottleneck for merging the changes to the main repository. + +#### Flexibility for collaborators + +Currently, any partner or contributor that would like to work with us are subject to all the rules within of the main repository. Some of these can be relaxed through modularization, where work can happen in a separate repository. + +#### Large TF support matrix + +The TF support story is a unique beast. Our support matrix has a lot of orthogonal dimensions. Below are some of the more prominent dimensions: + +* Environment (google3, opensource) +* Operating system (Linux, windows, macos, mobile) +* Architecture (x86, arm, ppc64, …) +* Accelerator (CPU, GPU, TPU) +* Compiler (GCC, Clang, MSVC) +* Python version (2, 3.4, 3.5, 3.6, 3.7) + +More can be added to this list where we have to rebuild TensorFlow to support different network architectures, CUDA versions, SIMD instruction sets, etc. + +Having a monolithic repository means we need to rebuild all of our code for all of these different combinations. However, it makes no sense to rebuild all of our C++ code if the only difference is the Python version. Or rebuild all of our CPU kernels for different CUDA versions. Modularizing our code means we only need to rebuild and test the modules that are directly impacted by the dimensions we are changing in the support matrix. + + +## Overview + +This project aims to split the TensorFlow codebase into **smaller, more focused**, repositories that can be released and managed separately. These modules will talk to each other using **well defined APIs** that will evolve over time. Thanks to these APIs, these modules will be **managed/owned/released independently**. There will be different strategies to break apart pieces based on the languages, but below summarizes the approach for C++ and Python: + + +![alt_text](20190305-modular-tensorflow/big_picture.png "Overview of modular TensorFlow") + +A summary of the above is: + +* Core TF functionality will be implemented in C++ +* Core TF functionality can be extended using shared objects. +* On top of the core C++ libraries, we will have the language bindings (Using the C API) +* There can be more functionality built on top of the core TF bindings in different languages, which can be maintained and distributed separately. +* All different pieces need to use well defined public APIs. + +A few important points to clarify above are: + +* We will try our best to make sure the APIs will stay as close as possible to + the current APIs. +* We are aiming to avoid needing to change most existing custom op and kernel + code. +* The APIs will evolve over time. We will modify the APIs based on our and + user's needs. These modifications are expected to follow versioning guidelines + [described + here](https://github.com/tensorflow/community/blob/592221e839eb9629a9ff4c73d46ee44ccb832d97/rfcs/20190816-tf-project-versioning.md). + + + +### Definitions + +This section will briefly describe the terms we will use in the rest of this design. 
+ +**Modules:** These are the components of core TF library that will "accept" plugins to expand their capabilities. Examples of modules are networking, filesystem, (graph) optimizers. + +**Plugins:** Plugins are extensions to different modules. For example, filesystem module can have a GCP plugin, an S3 plugin, an HDFS plugin. + +**Shared objects:** These are dll/so/dylib files that can house **one or more** plugins for **one or more** modules. + +**Packages:** Python pip packages which may include Python files and/or shared objects. + + +### C++ + +This project aims to implement similar plugin architectures for multiple components of TF code. While these will be listed separately, **there will be a lot of shared pieces between these components**. The modules we would like to handle are: + +1. Networking module, with verbs, gdr plugins initially +1. Filesystems module, with GCP, AWS and HDFS support +1. Kernels module, +1. Optimizers/Graph rewrite module, +1. Accelerator backends module + +The above is not an exhaustive list of modules we would like to introduce. These are just initial ideas for good candidates for modules. + +In the initial plan, we will keep XLA as a part of core TF, because of the large API surface. Once the initial work for the above modules are complete, we can reevaluate this decision with XLA team. + +Each of these aspects require unique special treatment in addition to the common strategy defined below. These unique nuances will be discussed in separate documents. Next, we detail the module/plugin architecture we would like to implement abstracting out the specific components listed above. + +This is a high level description of a single module and multiple plugins for this module: + +![alt_text](20190305-modular-tensorflow/cpp_module.png "C++ module example") + +The big pieces of this design are: + +1. Modules: Well defined components within tensorflow that need multiple implementations, and select different code paths at runtime +1. Plugins: These will be implementations of each module. Each plug-in will implement an "interface" (i.e., the API defined as C functions rather than a pure virtual class in C++). These will be loaded dynamically at runtime, +1. TF framework/platform APIs: Shared functionality needed for implementing the plugins. Main purpose is to pass data/state between plugins and core. + + +### Python + +Below diagram provides a summary of the proposed TensorFlow pip package ecosystem. + + +![alt_text](20190305-modular-tensorflow/py_modules.png "Python module example") + + + +1. TensorFlow Base pip package: This provides the core TensorFlow functionality all of TF will share. While estimator and keras provide high level neural network primitives, base will provide basic matrix operations these two packages will use to build the high level APIs. +1. Required TensorFlow addons: These are pieces of TensorFlow that has to be included in all TF distributions. Examples to this are Estimator, Keras, TensorBoard and base. These are pieces of the public API that are promised to exist by our compatibility guarantees. +1. TensorFlow Metapackage: This will be a thin package that only defines the composition of TensorFlow. Please see the detailed design section for more details on this package. +1. Optional TF packages: These packages will include the optional TF features users may choose to load and enable after they have TF working. Without these, TF will work just fine. 
Example features we will have as optional packages are GPU support, MKL support, or cloud filesystem support. These will use the C++ modules to load the functions they provide at runtime. + + +## Detailed Design + +We will describe each key design element here in detail. To make the points clearer, trivial examples will be created. + +### Modularity in C/C++ + +This section will describe the key design points for C++ modularity. + +#### Modules + +Each module's main pieces will be a module interface in C, and a registry for plugins implementing this module. As a supporting piece, each module will also need to provide a mechanism for plugins to add themselves to the registry at runtime. Below is a toy example for the described design: + + +``` +// A simple toy module called "M" +typedef struct M_context; + +// The summary of the interface plugins need to implement +typedef struct MInterface { + void (*M_Init)(M_context*, string); + void (*M_Execute)(M_context*); +}; + +// The plugin registry for module M +// The registry here implemented uses C++ class std::map +// This is OK as it is not exposed in our example here. +// As far as module implementations are concerned, they only +// need to see and use M_RegisterPlugin method, and that method takes // care of their addition into this registry. +std::map m_plugin_registry; + +// Function to call for each plugin at load time +// to add themselves to the registry +void M_RegisterPlugin( + string plugin, + void (*M_Init)(M_context*, string), + void (*M_Execute)(M_context*)) { + // Omitting error handling. + m_plugin_registry.insert(plugin, + MInterface(M_Init, M_Execute)); +} + +// Implementation of the interface is just a thin layer to +// get the correct plugin and call it. +// Here we assume that plugin explicitly gets selected by +// the init function. Some modules can go with implicit selection, +// Such as deducing the filesystem from the file path. +void M_Init(M_context* ctx, string id) { + // Just a quick hacky way to select the plugin here + ctx->interface = m_plugin_registry[id]; + ctx->interface.M_Init(ctx, id); +} + +void M_Execute(M_context* ctx) { + ctx->interface.M_Execute(ctx); +} +``` + + +**Please note that the above is a toy example to explain the abstract ideas described above. Exact implementation can vary across different modules.** + +Interface has to be pure C at the ABI level. We can have C++ header-only libraries built on top of these C ABI/APIs. + + +#### Plugins + +Plugins need to include implementation of the interfaces declared by one module. If the module interface requires Init and Compute methods, it will need to implement these two functions, plus a TF_InitPlugin function which will be called at load time. This function will also need to register the plugin as prescribed by the module. + + +``` +// A simple plugin implementation A, for module M. +typedef struct M_context; + +// Here is the meat of the plugin: +void A_Init(M_context* ctx, string id) { + // id can be thrown away, in this example. + // Or can encode different ways plugin can be initialized. + ...... + // initialize the plugin + // Modify the context using functions exposed by core + TF_SetFooInContext(ctx, foo); + TF_InitBarInContext(ctx); +} + +void A_Execute(M_context* ctx) { + .......... +} + +void TF_InitPlugin() { + M_RegisterPlugin("A", A_Init, A_Execute); +} +``` + + +When this plugin is loaded by TF at runtime, `TF_InitPlugin` method will be called. This method will register the plugin as prescribed by the module, and exit. 
+ +Plugin shared objects will need to follow some standards: + + + +1. These cannot export any symbols into global namespace to avoid symbol collisions. +1. They need to communicate to TF through any provided C APIs. +1. They can link anything they like, anyway they like in addition to TF. +1. They can be built and distributed separately + + +#### How plugins are loaded + +TensorFlow will look for and load plugins in two different ways: + + + +1. Default TF plugin directories, such as `...python_dir.../site-packages/tensorflow-plugins` +1. User calls `tf.LoadLibrary` + +Both of these will go through the tf.LoadLibrary method, which does the following: + + + +1. Check the directory for plugins. +1. For each plugin, check if they are loadable using platform string, as defined in: [tensorflow/core/platform/platform_strings.h](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/platform/platform_strings.h) +1. dlopen the library +1. dlsym `TF_InitPlugin symbol`, and call it +1. exit. + +To ensure that correct platform strings are generated in each shared object, each plugin library is required to include the following code snippet: + + +``` +#include "tensorflow/core/platform/platform_strings.h" +TF_PLATFORM_STRINGS() +``` + + + +#### TF Framework/platform API + +While all the above components require a lot of careful planning, this piece of this design will require close to half of the total coding load. Currently, all planned components treat the union of all headers under TF as their API. More than 90% of these headers define C++ APIs. + +Our work here will cumulatively build a large API surface in C. While we can compartmentalize work for each module, with more modules implemented there will be a lot of shared API endpoints in core framework between these modules. + +Below is a 10K ft view of how the "different" APIs may look like in a simple set representation, with varying sizes, varying levels of intersection. + + +![alt_text](20190305-modular-tensorflow/api_picture.png "A simple representation of the API.") + +We expect all of this work to be cumulative and finally defining a large coherent API surface for TF. + +Once this API is ready, header only C++ APIs can be defined using this API. + + +### Modularity in Python + +This section will describe the key design points for modular Python packages for TF. + + +### TensorFlow base pip package + +Contains the base Python API, and "Core TF" C++ shared objects + +This package will be a subset of the current "tensorflow" pip package. It will include all of the core TF API except the high level API modules we will split up. It will define a public API for everything except for the required add on packages. + +### Required tensorflow addons + +These packages are planned to contain high level TF functionality that can be safely split up from TF. Examples for these are tensorboard, estimator and keras. Together with the base TF package, these packages will contain the full Python code of TF, except for top level API wiring. As like any addons, these are only allowed to use public APIs exposed by their dependencies. These packages have two constraints + +1. They are only allowed to use public APIs exposed by their dependencies. +1. They are required to provide backwards compatible public APIs. + +With the backwards compatible public APIs, we expect addons to be able to release independently as long as features they depend on are released in their dependencies. 
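For illustration, the `setup.py` of a hypothetical addon might declare its dependency on the base package through a lower bound only (the addon name and version numbers below are placeholders):

```
from setuptools import setup

setup(
    name="tensorflow_acme",              # placeholder addon name
    version="0.1.0",
    packages=["tensorflow_acme"],
    # Depend on the base package only through its public API, pinning just a
    # minimum version so new base releases do not force an addon release.
    install_requires=["tensorflow_base >= 1.14.0"],  # version is illustrative
)
```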
+ +These packages will have full control over the versions of their dependencies. We recommend they only set a minimum version for their dependencies. When they need new features, they will bump their minimum requirement to include the new API changes. + + +### TensorFlow Metapackage + +This package will reconstruct the TF public API from the base and other requirements + +Just a simple setup.py file that defines dependencies on specific versions of all required packages and base package, plus an `__init__.py` file that defines the top level tensorflow API. + +The Python code for tensorflow metapackage will be a single `__init__.py` file that will look like this: + + +``` +from tensorflow_base import * +import tensorflow_estimator as estimator +import tensorflow_keras as keras +<………… more packages as needed > +``` + + +A new tensorflow release will mean we will pick combinations of dependencies, run all our integration tests, and then release the above python file with these dependencies in its setup.py file: + + +``` +TENSORFLOW_DEPENDENCIES= [ + "tensorflow_base == 1.x.y", + "tensorflow_estimator == 1.a.b", + "tensorboard == 1.c.d", + "tensorflow_keras == 1.e.f" +] +``` + + + +### TF Public APIs + +As a part of the modularization, to be able to decouple development and releases for each of these packages, each package is required to expose a **well defined, well documented public API**. + + +### Optional TF packages + +Mostly expected to contain the C++ plugins defined in the previous section. These will be simple pip packages that will deploy the shared objects under "site-packages/tensorflow-plugins" + +These shared objects will be automatically loaded by TF core if: + +* They correctly define the compatibility strings using `TF_PLATFORM_STRINGS` +* They are compatible with the system tf core is running on +* They have been properly built and signed (unless running in developer mode) + + +## Alternatives / Potential Issues + +* **Why do we not use C++ APIs instead of C**: Compilers have no guarantees for ABIs generated for C++ code. Any C++ API used will require each shared object to be compiled with the same compiler, using the same version of the compiler, with the same compiler flags ([See github issue 23561](https://github.com/tensorflow/tensorflow/issues/23561)). +* **Why do not we statically link everything**: Single shared object for everything: Anywhere except google does not have access to the massively parallel build system we use here at google. This causes prohibitive build times, causing major developer pain for open source developers. There are many more issues, but the summary is while this is a great solution for google, outside google this is simply infeasible. +* **TF will become a suite of multiple packages, built by multiple authorities. What if the bugs get blamed on TF team**: With the modular model, we expect testing of 3rd party code to become easier. This can also be mitigated if the error messages are better, and if they can clearly point out which module the issue stems from. Finally, we can create an apple-swift like testing model, where we run a Jenkins setup that people can donate their machines to, and we can run continuous integration tests on their plugins. +* **Why not have APIs but still have a monolithic repository** When everything is in a single repository, this enables developers to bypass the APIs, and depend on internals. Moreover, we cannot grant full control over different folders on our repository to our partners in a single repository. 
As long as they are in a single repository, they are still constrained by our build system and license. Finally, in a single repository we do not provide the option of closed source plugins for contributors. +* **Why not go with the OSS federation solutions?** OSS federation requires all dependencies to be in the federation before adding a repository. This is simply not possible for tensorflow, as eigen, llvm and many other dependencies will never be a part of the federation. +* **Documentation, how/where do we document everything?** With multiple repositories, structure of the documentation will need to be rethought, based on what is a part of "TensorFlow proper" and what is an optional feature. + + +## Testing Plan + +We propose the following principles to be followed for testing in a modular world: + +* Each plugin tested separately. +* Modules can plan their own integration tests. +* Cut combinatorial explosion by divide and conquer. +* Fuzzing at the core-module interface level if possible, in the case we need data marshalling between layers. +* With this proposal, we aim to also simplify testing of tensorflow code. The biggest gain we expect in the modular world will be, we will be able to "divide and conquer" the TensorFlow support matrix. +* Following describes an early modularized tensorflow package structure: + + +![alt_text](20190305-modular-tensorflow/initial_tf_deps.png "Initial package structure") + +In the current setup, we need to test all of the above packages for different Python versions, operating systems, accelerators (CPU, GPU), compilers, and more variants combined. In the modularized world, each of these packages only need to be unit tested for the following: + + +* tensorflow-base: Operating systems, compiler versions and python versions only with CPU +* tf-gpu: With GPU only, for different operating systems. +* tf-estimator: Only for different python versions + +When testing a package that has dependencies, such as tf-estimator, or tf-gpu, tensorflow-base will be installed with its latest stable release, to ensure to avoid any flakes by this package. + +We know that with the current release cadence this is too slow to support TF's rapid development. But with modules, we expect to be able to release much more frequently. + +On top of the proposed unit testing plan above, we will need package level integration tests. We propose these to be run every night at head. We propose the following to act as the TF integration tests: + + + +* Miscellaneous pip utility tests. Build the nightly pip packages, install them. Then make sure you can import TF, run the command line utilities distributed with TF. +* Tutorials/notebook tests. Build the "nightly" pip packages for all above components. Then install this package. Finally extract the Python code from notebooks and run them as graphs +* Models tests: Build and install the nightly pip packages, then run curated models under tensorflow/models for a small amount of steps. + +The above will simply check the sanity of TF, just will check if TF can run without crashing. However, TF requires much more testing. We propose expansion and adoption of the following regression tests to nightly TF test suites: + +* Convergence tests: Run a curated set of small models until convergence. Measure the time to converge, and steps to converge +* Performance tests: Run a curated set of models for a pre-selected number of steps. Measure steps per second, and for image models images per second. 
+ + +## Releases + +In the multi package environment, we aim to have the following with the releases: + +1. Smaller packages: We expect to release multiple small packages, instead of the giant package we are building and releasing right now. +1. Faster releases: We would like the smaller packages to be able to release much faster. With the ability to pin each dependency to a known good version, we would like to be able to compartmentalize issues, and only hold certain components back when only certain code in TF has issues. +1. Independent releases: With compatibility checks, we would like to be able to independently release different packages + +Below summarizes a timeline of releases for three packages, A, B and C, where B depends on A and C depends on B. + + +![alt_text](20190305-modular-tensorflow/releases.png "Release plans") + + +To summarize the above timeline: + +* Different packages set their own release cadences +* Each package will set version boundaries for each of their dependencies. +* Each package is responsible for ensuring that all of their public APIs are working as promised. +* Packages do not need to modify the minimum version requirements unless they start using newly introduced public API symbols. +* TF metapackage releases may choose to hold back individual packages in favor of faster releases. But dependency requirements have to be respected when doing so. +* Major releases still need to be coordinated. + + +## Packaging Issue Scenarios + + +We will now go over a few failure modes in the proposed environment, and propose how things need to be resolved. We will use a simple example, where we have three packages, A, B and C, where C depends on B and A, and B depends on A. + + +#### Scenario 1: B does not work with a new release of A + +Potential issues: + +* B uses non-public symbols from A: + * B has to release a new version avoiding non-public APIs +* A changed public APIs: + * A has to revert changes to its public APIs and release a patch version replacing the bad version + + +#### Scenario 2: New release of B does not work with A + +Potential issues: + +* B depends on unreleased APIs A exposed in a newer version + * B needs to release a patch version, with just a change in the minimum required version of A + + +#### Scenario 3: C and B depend on different minimum versions of A + +As both C and B have to define a range of versions they require from A, the max version should satisfy both constraints. + + +#### Scenario 4: User installs C first, but then uninstalls A + +This is a pip issue. To help diagnose the problem, B and C should print that A is missing, and user needs to install that to use B or C. 
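A minimal sketch of the kind of diagnostic message B or C could emit at import time (package names are illustrative, following the scenario above):

```
# Hypothetical import-time check inside package B or C.
try:
    import tensorflow_base  # plays the role of "A" in this scenario
except ImportError as e:
    raise ImportError(
        "tensorflow_base is required by this package but is not installed; "
        "please `pip install tensorflow_base` and retry.") from e
```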
+ + diff --git a/rfcs/20190305-modular-tensorflow/api_picture.png b/rfcs/20190305-modular-tensorflow/api_picture.png new file mode 100644 index 000000000..2e3b6e1dc Binary files /dev/null and b/rfcs/20190305-modular-tensorflow/api_picture.png differ diff --git a/rfcs/20190305-modular-tensorflow/big_picture.png b/rfcs/20190305-modular-tensorflow/big_picture.png new file mode 100644 index 000000000..f4e47875a Binary files /dev/null and b/rfcs/20190305-modular-tensorflow/big_picture.png differ diff --git a/rfcs/20190305-modular-tensorflow/cpp_module.png b/rfcs/20190305-modular-tensorflow/cpp_module.png new file mode 100644 index 000000000..566cc2325 Binary files /dev/null and b/rfcs/20190305-modular-tensorflow/cpp_module.png differ diff --git a/rfcs/20190305-modular-tensorflow/initial_tf_deps.png b/rfcs/20190305-modular-tensorflow/initial_tf_deps.png new file mode 100644 index 000000000..aa23bd62e Binary files /dev/null and b/rfcs/20190305-modular-tensorflow/initial_tf_deps.png differ diff --git a/rfcs/20190305-modular-tensorflow/py_modules.png b/rfcs/20190305-modular-tensorflow/py_modules.png new file mode 100644 index 000000000..0f8fe9c8d Binary files /dev/null and b/rfcs/20190305-modular-tensorflow/py_modules.png differ diff --git a/rfcs/20190305-modular-tensorflow/releases.png b/rfcs/20190305-modular-tensorflow/releases.png new file mode 100644 index 000000000..62d01efc8 Binary files /dev/null and b/rfcs/20190305-modular-tensorflow/releases.png differ diff --git a/rfcs/20190305-modular-tensorflow/simple_package_deps.png b/rfcs/20190305-modular-tensorflow/simple_package_deps.png new file mode 100644 index 000000000..dacaa9a2e Binary files /dev/null and b/rfcs/20190305-modular-tensorflow/simple_package_deps.png differ diff --git a/rfcs/20190315-tflite-control-flow.md b/rfcs/20190315-tflite-control-flow.md new file mode 100644 index 000000000..3401ffc88 --- /dev/null +++ b/rfcs/20190315-tflite-control-flow.md @@ -0,0 +1,236 @@ +# Control Flow in TensorFlow Lite + +Status | Accepted +:------------ | :--------------------------------------------------------------- +**Author(s)** | Yu-Cheng Ling (ycling@google.com) +**Sponsor** | Andrew Selle (aselle@google.com), Jared Duke (jdduke@google.com) +**Updated** | 2019-03-15 + +## Objective + +Support control flow ops in TensorFlow Lite + +## Goals & Non-Goals + +Goals: + +* Discuss how control flow is defined, converted from TensorFlow, and + implemented in TensorFlow Lite + +Non-goals (these are on our radar, and will be discussed separately): + +* Tackle streaming RNN/LSTM use cases (e.g. process one time step in each + invocation and preserve states) +* Handling Tensor Lists (required for dynamic RNN and seq2seq use cases) + +## Background & Motivation + +We aim to make TensorFlow Lite easy to use. One of the ultimate goals is to be +able to **convert any TensorFlow model to TensorFlow Lite and run it +correctly**. The +[Select TensorFlow operators](https://www.tensorflow.org/lite/guide/ops_select) +project increased hundreds of supported ops by running TensorFlow kernels in +TensorFlow Lite. However, some of the TensorFlow ops are not supportible by this +approach. Control flow is one of the biggest missing features. + +Currently TensorFlow Lite doesn't support control flow. The interpreter has a +static execution plan to run all the operators in a fixed order. There's no way +to selectively or repeatedly execute some of the operators. + +Control flow is used in many important use cases, like training, dynamic RNN, +seq2seq...etc. 
We already implemented some fused RNN/LSTM kernels in TensorFlow +Lite, but these are restricted by the model architecture and conversion flow. + +Implementing control flow is required to make any TensorFlow model convertible +to TensorFlow Lite. It is also a big step towards enabling generalized +RNN/LSTM/seq2seq models. If a RNN/LSTM only uses features that TensorFlow Lite +fused kernels support, we can further fuse these ops and get even better +performance. + +## Defining control flow ops in TensorFlow Lite + +In TensorFlow, users can use `tf.cond` and `tf.while_loop` functions to define +control flow directly. These functions are also called by more advanced +functions like `tf.nn.dynamic_rnn` and `tf.contrib.seq2seq.dynamic_decode`. + +Internally there are 2 ways to represent control flow in a TensorFlow graph: + +* Control flow v2 is enabled by default in TensorFlow 2.0. It uses + **functional control flow ops** like `If` and `While`, where the branches + and loop bodies are represented with TensorFlow Functions. +* Control flow v1 is enabled by default in TensorFlow 1.x. It uses ops like + `Merge`, `Switch`, `Enter`, `Exit`...etc. The control flows are represented + in the same graph. + +We propose to define control flow ops in functional form in TensorFlow Lite. The +TensorFlow model and converted TensorFlow Lite model will be extremely similar, +and the graph structures will be essentially isomorphic. XLA also uses a similar +design for control flow. + +The detailed guidelines of defining TensorFlow Lite control flow ops are listed +as following: + +* TensorFlow Lite control flow ops should be defined like TensorFlow control + flow v2 ops, with exactly the same op semantic, inputs, and outputs + definition +* TensorFlow functions used by control flow ops should be converted to + TensorFlow Lite subgraphs +* For each function attribute on TensorFlow control flow ops, define an + subgraph index field in TensorFlow op option table + +### Example: Defining `If` condition op + +In this section, we use a simple example to explain how `If` op is defined and +how the design guidelines are applied. + +The following diagram illustrates an example which is equivalent to: +`a < b ? a + b : a * b` + + + +Notes how the guidelines are followed: + +* The graph structure is completely isomorphic between TensorFlow and + TensorFlow Lite models. +* We defined TensorFlow Lite `If` with exactly the same op semantic, inputs, + and outputs definition as TensorFlow `If`. + * In TensorFlow `If` operator, the 1st input is a boolean condition the + rest of the inputs are passed into the body (then / else branch). + `then_branch` function is called when condition is true. Otherwise + `else_branch` function is called. + * All these work exactly the same way in TensorFlow Lite, except that + Functions become Subgraphs. +* In TensorFlow Lite, each builtin op comes with a Options FlatBuffer table. + For each function attribute in the TensorFlow op, define an integer subgraph + index in TensorFlow op option table. + +The option table of TensorFlow Lite If will be defined as: + +``` +table IfOptions { + then_subgraph_index:int; + else_subgraph_index:int; +} +``` + +### Supported control flow ops + +Essentially we just need 2 ops to represent any control flow logic: `If` and +`While`, and only these 2 ops are used in TensorFlow 2.0 currently. Therefore +only these 2 ops will be implemented in TensorFlow Lite initially. 
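Following the same convention, the option table for `While` might look like the sketch below, with one subgraph index per function attribute (the exact field names are illustrative):

```
table WhileOptions {
  cond_subgraph_index:int;
  body_subgraph_index:int;
}
```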
If necessary, other control flow ops can be easily implemented following the same design guidelines:

* `Case` was introduced into TensorFlow recently (02/12/2019). We can consider supporting it, or rewriting it to multiple `If` ops in the converter.
* `For` is not required, as it is representable by rewriting to `While` ops. TensorFlow `For` was only used in the previous functional control flow implementation. We don't expect to see `For` often after the TensorFlow 2.0 release, but we can support it for legacy models.
* `StatelessIf` and `StatelessWhile` can be converted to regular `If` and `While` when converting to TensorFlow Lite. The converter may utilize the stateless information to do smart optimizations, but it doesn't matter to the TensorFlow Lite runtime.
* `Call` (simply invoking another subgraph) may be implemented in the future for other purposes. It's not required initially.

## Converting TensorFlow control flow ops to TensorFlow Lite

Since we choose to define control flow ops in functional form, it will be relatively easy to convert TensorFlow control flow ops to TensorFlow Lite:

* Convert each TensorFlow control flow op to the corresponding TensorFlow Lite builtin op
* Convert each TensorFlow Function used by control flow ops to a TensorFlow Lite Subgraph

To support legacy models which use control flow v1 ops, the converter will try to raise control flow v1 ops to v2 ops on a best-effort basis. It's not guaranteed to work, because analysing the data flow between `Switch`, `Merge`, `Enter`, and `Exit` ops can be very tricky. However, in our experience most inference graphs should work if users don't manually insert control flow ops into the graph.

## Implementation of control flow in TensorFlow Lite runtime

### Interpreter Implementation

In TensorFlow Lite, each Subgraph has a static execution plan, and ops are executed according to the order in the execution plan. This works really well with functional control flow ops:

* In each subgraph, the ops in the execution plan are executed one by one normally.
* When a control flow op kernel is executed, it may invoke another subgraph.

As of this writing, the TensorFlow Lite interpreter has already been refactored to be able to parse and invoke multiple subgraphs. Therefore the interpreter is ready to run functional control flow ops.

Currently, each subgraph has its own memory planner and preserves its own buffer. This will unnecessarily increase the memory footprint when there are a lot of subgraphs in the model, since not all subgraphs will be activated at the same time. To optimize memory usage, the memory allocator should be refactored to share allocated memory between subgraphs.

### Kernel Implementation

The logic of the 2 fundamental control flow ops isn't very complex:

* `If`: Check the condition input and invoke one of the 2 subgraphs.
* `While`:
    * Invoke the condition subgraph. Break out of the loop if the result is false.
    * Invoke the body subgraph, and use its output as the input of the next iteration.

It's not hard to implement TensorFlow Lite kernels to make these work. When invoking a subgraph, a naive implementation always copies the data between subgraphs. The complexity ranges from **optimizing by avoiding copies** to **handling dynamic tensor shapes**.

### Avoiding copy for static shape use case

In TensorFlow Lite, each subgraph has a memory allocator.
It is responsible for +allocating buffers for all tensors in the subgraph, including inputs and +outputs. When the output tensor shapes of control flow ops are static, we can +optimize the execution by avoiding copying the tensor data between subgraphs. + +In the rest of this section, the implementation of `While` will be discussed as +an example. Similar design can be easily applied to `If`, which is simpler than +`While`. + +The flow of `While` execution will be: + +* Copy the input buffer pointers from `While`'s inputs to condition subgraph's + inputs +* Copy the output buffer pointers from `While`'s outputs to body subgraph's + outputs +* Repeat the following steps in a loop: + * Invoke condition subgraph. Break out of the loop if the output is false + * Copy the buffer pointers from condition subgraph's inputs to body + subgraph's inputs + * Invoke body subgraph + * Copy the buffer points from body subgraph's outputs to condition + subgraph's inputs + * Repeat the loop + +See also the illustration below: + + + +The flow is carefully designed to avoid copying data between subgraphs. Since +the body subgraph writes to the output buffer of `While` op, no copy is required +after the loop. This is similar to RVO (return value optimization) technique in +compilers. + +### Supporting dynamic shape use cases + +Whenever a TensorFlow Lite subgraph is invoked, it will dynamically reallocate +tensor buffers if some of the tensor shapes are not static. This use case will +be supported by always propagating tensor shapes and copying tensor data +between subgraphs. + +This isn't optimal, but it isn't typical to have dynamic shapes in control flow. +Note that XLA does not support dynamic use case, and we will already get better +coverage than XLA by having the simple implementation. We can further optimize +this case in the future if necessary. diff --git a/rfcs/20190315-tflite-control-flow/if_model.png b/rfcs/20190315-tflite-control-flow/if_model.png new file mode 100644 index 000000000..814f79bc4 Binary files /dev/null and b/rfcs/20190315-tflite-control-flow/if_model.png differ diff --git a/rfcs/20190315-tflite-control-flow/while_buffer.png b/rfcs/20190315-tflite-control-flow/while_buffer.png new file mode 100644 index 000000000..ac695dfda Binary files /dev/null and b/rfcs/20190315-tflite-control-flow/while_buffer.png differ diff --git a/rfcs/20190430-tokenization-conventions.md b/rfcs/20190430-tokenization-conventions.md new file mode 100644 index 000000000..448b7e0e0 --- /dev/null +++ b/rfcs/20190430-tokenization-conventions.md @@ -0,0 +1,348 @@ +# RFC: Tokenization API & Initial Implementations + +Status | Accepted +:------------ | :----------------------------------- +**Author(s)** | Robby Neale (Google) +**Sponsor** | Mark Omernick (Google), Greg Billock (Google) +**Updated** | 2019-05-29 + +## Objective {#objective} + +Establish common interfaces for Tensorflow tokenizers, and introduce three +concrete op-level tokenizers. + +## Motivation {#motivation} + +There are a number of tokenization methods we wish to make available for +converting text runs into sequenced substrings processed by Tensorflow graphs. +In the past, these steps needed to be performed outside the graph in a data +preprocessing step, or through custom ops. The former had a chance of creating +skew if the preprocessing wasn't performed consistently, and the latter +fragmented Tensorflow NLP usage. 
+ +## User Benefit {#user-benefit} + +To prevent further fragmentation, and to the benefit of all NLP modelers, we +wish to establish two tokenizer interfaces for new tokenizers to implement that +will make them easy to use, switch between, and compose. There is not one +best tokenizer for all use cases, and it is not a goal to establish a single +best tokenizer. + +In addition to these tokenizer interfaces, we intend to discuss new concrete +subclasses - whitespace split, Unicode script split, and wordpiece. + +## Design Proposal {#design-proposal} + +We propose a base Tokenizer class that takes a Tensor or +[RaggedTensor](https://www.tensorflow.org/guide/ragged_tensors) of strings (or +optionally integer Unicode code points) as input, and outputs a RaggedTensor of +tokens. Tokens can be strings or integers (frequently as vocabulary +indices), and may differ from the originating text. By accepting strings, we +wish to make adoption and usage as easy as possible. This standardization on +both input and output formats, also allows for ease in composability between +tokenizers (see example of this in the custom_tokenizer example below). Plus, +the use of a base class allows for a single point of initialization in the +constructor, and not having to reinitialize when reusing the tokenizer. + +```python +class Tokenizer(tf.Module): + def tokenize(self, input): + """ + Args: + input: An N-dimensional UTF-8 string (or optionally integer) Tensor or + RaggedTensor. + Returns: + An N+1-dimensional UTF-8 string or integer Tensor or RaggedTensor. + """ +``` + +The number of tokens created from tokenizing a string is unknown. For this +reason, it is impossible to fully tokenize and output a normal tensor with a +uniform shape for a batch of varying strings. Thus, it is expected that each +output will be ragged (except in the vector, rank 1, case when the input is a +string scalar). + +To allow the caller to know which groups of tokens belong to each string, the +innermost ragged dimension will be tokens for the originating string. This means +that the shape of the output will have an additional dimension when compared to +the input. Example: + +```python +>>> tokenizer.tokenize(["This is great!", "Awesome!"]) +[["This", "is", "great!"], + ["Awesome!"]] +``` + +Model authors often want to know the alignment between the tokens and +the original string. For these instances, a separate class is available which +has a *tokenize_with_offsets* that returns a tuple containing the resulting +tokens plus a *best effort* of starting and ending offsets for each token into +the originating string. This is similar to the ops +`tf.strings.unicode_decode_with_offsets` and +`tf.strings.unicode_split_with_offsets`. We propose a new base class, +TokenizeWithOffsets, which extends Tokenizer and provides the added +functionality. This makes it clear whether or not the implementing Tokenizers +support the *_with_offsets* variant of tokenization. + +```python +def TokenizerWithOffsets(Tokenizer): + def tokenize_with_offsets(self, input): + """ + Args: + input: An N-dimensional UTF-8 string (or optionally integer) Tensor or + RaggedTensor. + Returns: + A tuple (tokens, start_offsets, limit_offsets): + * tokens is an N+1-dimensional UTF-8 string or integer Tensor or + RaggedTensor. + * start_offsets is an N+1-dimensional integer Tensor containing the + starting indices of each token (byte indices for input strings). 
+ * limit_offsets is an N+1-dimensional integer Tensor containing the + exclusive ending indices of each token (byte indices for input + strings). + """ +``` + +Here is a basic example of using *tokenize_with_offsets*. + +```python +>>> tokenizer.tokenize_with_offsets(["This is great!", "Awesome!"]) +([["This", "is", "great!"], ["Awesome!"]], + [[0, 5, 8], [0]], + [[4, 7, 14], [8]]) +``` + +Along with these base classes, there are three tokenizers we plan on +introducing - whitespace tokenizer, unicode script tokenizer, and a wordpiece +tokenizer. + +### WhitespaceTokenizer {#whitespace_tokenize} + +A basic tokenization method that splits on International Components for Unicode +(ICU) defined whitespace characters. + +```python +class WhitespaceTokenizer(TokenizerWithOffsets): + def tokenize(self, input): + """ + Args: + input: A `RaggedTensor` or `Tensor` of UTF-8 strings with any shape. + + Returns: + A RaggedTensor of tokenized text. The returned shape is the shape of the + input tensor with an added ragged dimension for tokens of each string. + """ + + def tokenize_with_offsets(self, input): + """ + Args: + input: A `RaggedTensor`or `Tensor` of UTF-8 strings with any shape. + + Returns: + A tuple of `RaggedTensor`s `tokens`, `start_offsets`, and `limit_offsets` + where: + * `tokens`: A `RaggedTensor` of tokenized text. + * `start_offsets`: A `RaggedTensor` of the tokens' starting byte offset. + * `limit_offsets`: A `RaggedTensor` of the tokens' ending byte offset. + """ +``` + +### UnicodeScriptTokenizer {#unicode_script_tokenize} + +Splits strings based on the script codes of the Unicode code points. Script +codes correspond to ICU UScriptCode values. This means that text may often be +split by language as well as punctuation and whitespace. Similar to the +whitespace tokenizer, whitespace is removed. + +```python +class UnicodeScriptTokenizer(TokenizerWithOffsets): + def tokenize(self, input): + """ + Args: + input: A `RaggedTensor`or `Tensor` of UTF-8 strings with any shape. + + Returns: + A RaggedTensor of tokenized text. The returned shape is the shape of the + input tensor with an added ragged dimension for tokens of each string. + """ + + def tokenize_with_offsets(self, input): + """ + Args: + input: A `RaggedTensor`or `Tensor` of UTF-8 strings with any shape. + + Returns: + A tuple of `RaggedTensor`s `tokens`, `start_offsets`, and `limit_offsets` + where: + * `tokens`: A `RaggedTensor` of tokenized text. + * `start_offsets`: A `RaggedTensor` of the tokens' starting byte offset. + * `limit_offsets`: A `RaggedTensor` of the tokens' ending byte offset. + """ +``` + +#### WordpieceTokenizer {#wordpiece_tokenize} + +Wordpiece is an unsupervised text tokenizer which requires a predetermined +vocabulary for tokenization. It normally also requires a pretokenization step +that splits text into tokens, which wordpiece then splits further into +subwords (prefixes & suffixes). + +[BERT](https://github.com/google-research/bert) currently uses Wordpiece. + +```python +class WordpieceTokenizer(TokenizerWithOffsets): + def __init__(self, vocab_lookup_table, suffix_indicator='##', + max_bytes_per_word=100, token_out_type=tf.int64, + unknown_token='[UNK]'): + """ + Args: + vocab_lookup_table: A lookup table implementing the LookupInterface + containing the vocabulary of subwords. + suffix_indicator: (optional) The characters prepended to a wordpiece to + indicate that it is a suffix to another subword. Default is '##'. + max_bytes_per_word: (optional) Max size of input token. Default is 100. 
+ token_out_type: (optional) The type of the token to return. This can be + `tf.int64` IDs, or `tf.string` subwords. The default is `tf.int64`. + unknown_token: (optional) The value to use when an unknown token is found. + Default is "[UNK]". If this is set to a string, and `token_out_type` is + `tf.int64`, the `vocab_lookup_table` is used to convert the + `unknown_token` to an integer. If this is set to `None`, + out-of-vocabulary tokens are left as is. + """ + + def tokenize(self, input): + """ + Args: + input: An N-dimensional `Tensor` or `RaggedTensor` of UTF-8 strings. + + Returns: + A `RaggedTensor`s `tokens` where `tokens[i1...iN, j]` is the string + contents, or ID in the vocab_lookup_table representing that string, + of the `j`th token in `input[i1...iN]` + """ + + def tokenize_with_offsets(self, input): + """ + Args: + input: An N-dimensional `Tensor` or `RaggedTensor` of UTF-8 strings. + + Returns: + A tuple of `RaggedTensor`s `tokens`, `start_offsets`, and `limit_offsets` + where: + * `tokens[i1...iN, j]` is the string contents, or ID in the + vocab_lookup_table representing that string, of the `j`th token in + `input[i1...iN]` + * `start_offsets[i1...iN, j]` is the byte offset for the start of the + `j`th token in `input[i1...iN]` + * `limit_offsets[i1...iN, j]` is the byte offset for the end of the + `j`th token in `input[i1...iN]` + """ +``` + +#### a CustomTokenizer example {#a-custom_tokenizer-example} + +If all tokenizers follow the same principles, it allows for flexibility in +swapping out tokenization methods, can lend itself to composability, and will be +easy for anybody already familiar with standard tokenization APIs to use. Below +is a custom tokenizer example that extends the Tokenizer base class and thus not +providing a *tokenizer_with_offsets* method. + +```python +class MyCustomTokenizer(Tokenizer): + def tokenize(self, input): + """ + A custom tokenizer for string tensors. + + Args: + input: An N-dimensional string Tensor or RaggedTensor + + Returns: + An N+1-dimensional string or integer Tensor or RaggedTensor. + """ + # normalize & strip control characters + input = tf_text.case_fold_utf8(input) + input = tf.strings.regex_replace(input, r"\p{Cc}|\p{Cf}", "") + + # tokenize based on unicode_script + script_tokenized = tf_text.unicode_script_tokenize(input) + token_codepoints = tf.strings.unicode_script( + tf.strings.unicode_decode(script_tokenized.flat_values, "UTF-8")) + + HAN_SCRIPT_ID = 17 + is_han_script = tf.equal(token_codepoints, HAN_SCRIPT_ID)[:, :1].values + is_emoji = tf_text.wordshape( + script_tokenized.flat_values, text.WordShape.HAS_EMOJI) + + # Further splitting + split_cond = is_han_script | is_emoji + unicode_char_split = tf.strings.unicode_split(script_tokenized, "UTF-8") + unicode_split_tokens = tf.where( + split_cond, + y=tf.expand_dims(script_tokenized.flat_values, 1), + x=unicode_char_split.values) + + # put back into [batch, (num_tokens), (num_unicode_chars)] + mix_tokenized = tf.RaggedTensor.from_row_lengths( + values=unicode_split_tokens, row_lengths=script_tokenized.row_lengths()) + + return mix_tokenized +``` + +## Appendix {#appendix} + +### Appendix A - TF.Data example {#appendix-a} + +A very common use case will be using Tokenizers in the [tf.data +API](https://www.tensorflow.org/guide/datasets). With the recent (in tf-nightly) +support for RaggedTensors in tf.data, this should be straight-forward for +anybody familiar with tf.data and pose no problems. A simple example is provided +below showing how this could look. 
+ +```python +docs = tf.data.Dataset.from_tensor_slices([['Never tell me the odds.'], + ["It's a trap!"]]) +tokenizer = text.WhitespaceTokenizer() +tokenized_docs = docs.map(lambda x: tokenizer.tokenize(x)) +iterator = tokenized_docs.make_one_shot_iterator() +tokenized_doc = iterator.get_next() +``` + +### Appendix B - Keras Preprocessing {#appendix-a} + +Keras provides its own set of preprocessing layers, one which tokenizes, +normalizes, and vectorizes the text. An equivalent tokenizer (most likely the +WhitespaceTokenizer described above) will be provided for anybody wanting to +duplicate the tokenization functionality. + +Because of the simplified nature of the Keras tokenization and that the +tokenizer API described above is to be included in a TensorFlow library outside +of core, these tokenizers will not be used from within the Keras preprocessing +layers to prevent the extra dependency from within Keras. However, more +full-featured Keras tokenization layers will be provided in the same library as +these tokenizers and use the API internally. + +### Appendix C - Other tokenizers {#appendix-c} + +Here we will briefly describe other tokenization methods that could extend the +same base classes despite not being Tensorflow ops. + +#### Segmentation {#segmentation} + +ML models trained to determine tokens within a given text are common solutions +for tokenizating CJKT languages, for example, +[SyntaxNet](https://ai.googleblog.com/2016/05/announcing-syntaxnet-worlds-most.html). +Since they use NN models internally for tokenizing, they do not package well as +ops, but instead could be built from TF ops and called through [Tensorflow +Serving](https://www.tensorflow.org/tfx/guide/serving) or +[TF.Hub](https://www.tensorflow.org/hub). + +#### SentencePiece {#sentencepiece} + +[SentencePiece](https://github.com/google/sentencepiece) is an unsupervised text +tokenizer and detokenizer where the vocabulary size is predetermined prior to +the neural model training. SentencePiece implements subword units (e.g. +byte-pair-encoding (BPE) +[Sennrich et al.](http://www.aclweb.org/anthology/P16-1162)) and unigram +language model [Kudo.](https://arxiv.org/abs/1804.10959)) with the extension of +direct training from raw sentences. + diff --git a/rfcs/20190501-tf-tensor-cord.md b/rfcs/20190501-tf-tensor-cord.md new file mode 100644 index 000000000..c17df244a --- /dev/null +++ b/rfcs/20190501-tf-tensor-cord.md @@ -0,0 +1,252 @@ +# TensorCord Variant Object + +| Status | Accepted | +:-------------- |:---------------------------------------------------- | +| **Author(s)** | Eugene Brevdo (ebrevdo@google.com) | +| **Sponsor** | Alex Passos (apassos@google.com) | +| **Updated** | 2019-06-05 + +## An update on `TensorCord Variant Object` RFC. + +After some internal discussion, we have decided to merge this RFC into another +planned update to how TensorFlow runtime represents and handles strings. When +that updated proposal is available, it will include a section on transparently +representing rope-like objects (without having to use `Variant` for this +behavior). + +## Objective + +This document proposes a new Variant object called TensorCord. A TensorCord +contains a list of `absl::string_view` and release functors / tensor references; and +can be used to store references to other Tensors or other memory locations. The +destructor of a TensorCord reduces the count on its referenced releasers; once +the reference count of a releaser reaches 0, it is executed. 
+ +## Motivation + +A number of ops within TensorFlow could use a rope-like object that +is propagated through the graph. Examples are: + +* `tf.data` input pipelines that perform a lot of string munging (substrings and + concats, especially); these incur a lot of copy overhead. +* `tf.contrib.rpc` and `tf.io.*_proto` (prev. `tf.contrib.proto`) ops that + handle strings and messages and submessages. For example, in proto + encode/decode ops, encoded submessages are contiguous substrings of serialized + messages. Decoding is therefore done by copying substrings on decode, and + encoding is performed by concatenating substrings on encode. +* When serializing numerical tensors, there is currently no equivalent to + `tf.io.decode_raw`. `tf.io.encode_raw` would make sense, but it would incur + the overhead of copying the tensor data into a new string. A more efficient + approach is to create a TensorCord pointed at the old tensor. +* Strings coming in from network I/O are copied out of protos and into + tensors, which also incurs a copy overhead. + +## User Benefit + +Faster input data pipelines and handling of strings and views for users (in a +transparent manner). + +## Design Proposal + +The TensorCord object itself can be implemented using RefCounted objects with a +constructor that takes an `absl::string_view` and either a releaser callback or +a pointer to a `Tensor` or `RefCounted`. + +Below is an example of use: + +```c++ +auto t = strings.flat(); +// old way via copy: +Tensor copy = tensor::DeepCopy(strings); + +// new way: a referencing view: +Tensor view(DT_VARIANT, {strings.NumElements()}); +auto t_view = view.flat(); +for (int i = 0; i < num_elem; ++i) { + t_view(i) = TensorCord(t(i), &strings); +} +``` + +## Alternatives Considered + +A new tensor type `DT_CORD`, which is a tensor of arrays of +`absl::string_view` objects, and additionally has a releaser that runs on its +unref. This implementation seems to be faster but much more invasive from an API +standpoint; e.g. it adds a `CHECK` to the `Tensor::Tensor(dtype, shape)` +constructor so users don't accidentally create a `DT_CORD` tensor without +a releaser. + +| Alternatives | TensorCord DT_VARIANT | DT_CORD | +:------------- |:--------------------- |:------- | +| Separate releasers per element | Yes | No | +| Overhead | Higher (each Tensor element keeps a reference; Variant & RunVariantDtor overhead is more costly) | Lower (can have onereleaser per tensor) | +| Intrusiveness | Lower (use DT_VARIANT) | Higher (add new TF type) | +| Flexibility | High (elements can point to locations backed by different owners) | Lower (all elements must be backed by data whose lifetime depends a shared set of releasers) + +## Detailed Design + +### Public C++ API + +The TensorCord object constructor and Append methods accept a string_view and +either a Tensor pointer or releaser callback. Its underlying string views can +be iterated over or a string can be constructed via explicit cast: + +```c++ +class TensorCord { +public: + typedef void (*Releaser)(void*); + + // At final destruction, releaser will be called as releaser(memory). + // To create a releaser for a std::function that captures objects, use: + // + // template + // TensorCord::Releaser CreateThunkFor(const T& fn) { + // return [](void* ptr) { (*static_cast(ptr))(); }; + // } + // + // auto fn = [&]() { ... }; + // auto releaser = CreateThunkFor(fn); + // auto tc = TensorCord(view, releaser, &fn); + // + // Remember that in this case, fn needs to outlast the TensorCord. 
+ // + // Creates a TensorCord from `view`, with memory releaser `releaser` and releaser + // arg `memory`. + explicit TensorCord(absl::string_view view, Releaser releaser, + void* memory = nullptr); + + // Creates a TensorCord from `view`, with memory backed by `tensor`. If it `view` + // is small enough, no reference is created on `tensor`; instead the memory is + // stored inline. + explicit TensorCord(absl::string_view view, Tensor* tensor); + explicit TensorCord(absl::string_view view, RefCounted* ref_counted); + + void Append(const TensorCord& other); + void Append(absl::string_view view, CordRep::Releaser releaser, + void* memory = nullptr); + void Append(absl::string_view view, Tensor* tensor); + void Append(absl::string_view view, RefCounted* ref_counted); + + size_t size() const; + bool empty() const; + + explicit operator string() const; + + // Usage example: + // for (absl::string_view s : cord.Chunks()) { ... } + ChunkRange Chunks() const; + ChunkIterator chunk_begin() const; + ChunkIterator chunk_end() const; + + // Copy and move constructor, copy and move assignment operators. + // And all the associated Variant-stored object methods + // (Encode, Decode, DebugString, etc). +}; +``` + +### Ops and Op extensions supporting TensorCord: + +The following ops would be extended to support TensorCord: +* basic string ops (join, concat, reduce_join, as_string) +* `tf.contrib.rpc` +* `tf.io.*proto` +* Example parsing ops (`Parse{Single,}{Sequence,}Example`) + +A new op would be added to create views into dense tensors as TensorCord +objects. + +### Python API + +We create a new TensorCord python object: + +```python +class TensorCord(composite_tensor.CompositeTensor): + + def __init__(self, variant): + self._variant = variant + self._as_string = None + if not in eager or tf.function mode: + self._as_string = tf.strings.as_string(variant) + + def as_string(self): + if self._as_string is None: + self._as_string = tf.strings.as_string(self._variant) + return self._as_string + + @property + def variant(self): + return self._variant + + def _to_components(self): + return (self.variant,) + + @classmethod + def _from_components(cls, components, metadata): + variant, = components + return cls(variant=variant) + + @property + def _is_graph_tensor(self): + return getattr(self._variant, "graph", None) is not None + + # also properties/methods like name, op, dtype, graph, shape, numpy, etc. +``` + +Additionally we add a conversion object for `convert_to_tensor(cord, dtype)` to +return `cord.as_string()` when `dtype=string` and return the variant otherwise; +and a similar conversion for `session.run()`: + +```python +def _tensor_cord_to_tensor(value, dtype=None, name=None, as_ref=False): + if as_ref: + raise ValueError + if dtype == dtypes.string: + return value.as_string() + elif dtype in (None, dtypes.variant): + return value.variant + else: + raise ValueError("Don't know how to convert TensorCord to dtype {}".format(dtype)) + +ops.register_tensor_conversion_function(TensorCord, _tensor_cord_to_tensor) + +def _tensor_cord_session_fetch(tensor_cord): + return ([tensor_cord.as_string()], lambda val: val[0]) + +session.register_session_run_conversion_functions( + TensorCord, + fetch_function=_tensor_cord_session_fetch) +``` + +## Performance Implications + +**NOTE** The statement below requires an upcoming PR that allows inlining small +values inside a `Variant` object. 
+ +TL;DR: Creating a TensorCord view of full strings of a `DT_STRING` tensor is +1-1.25x more expensive than a direct copy of `DT_STRING` unless the string lengths +are approximately 128 bytes each. Once string lengths on average are >128 +bytes, the TensorCord approach is more performant. + +We are able to match or exceed `DT_STRING` performance by using a specialized +implementation of TensorCord and modifications to the `Variant` class: + +* TensorCord performs optional inlining and selective referencing of backing + Tensors (usually for strings < 32 bytes in size). This requires a specialized + constructor that knows about `Tensor` objects. + +* The Variant object is modified to inline its underlying data if the stored + value is <= 48 bytes in size (leaving 16 bytes for alignment + additional + stored Variant data). This reduces the amount of overhead and indirection in + storing small values like TensorCord inside Variant and greatly reduces the + cost of `DT_VARIANT` tensor destruction. It keeps the Variant object <= 64 + bytes, which is the per-element aligned size inside `Tensor` + buffers. + +## Questions and Discussion Topics + +* Are there other good rope-like alternatives? +* Any python API considerations? +* Do we need additional Python code to create TensorCord objects from python + strings? +* Considerations for TF-Lite if we extend a number of string processing ops to + using `DT_VARIANT` inputs. diff --git a/rfcs/20190610-standardizing-composite_ops.md b/rfcs/20190610-standardizing-composite_ops.md new file mode 100644 index 000000000..bd24b979d --- /dev/null +++ b/rfcs/20190610-standardizing-composite_ops.md @@ -0,0 +1,532 @@ +# Standardizing composite ops in tensorflow to support efficient inference. + +Status | Accepted +:------------ | :------------------------------------ +**Author(s)** | Mark Sandler (sandler@google.com) +**Sponsor** | Alexandre Passos (apassos@google.com) +**Updated** | 2019-06-10 + +## Objective + +The goal of this proposal is to create a simple API that allows +adding new composite ops in a way that they can be automatically and robustly +processed by downstream _inference_ tooling. The developers (of the composite op) should be able to +iterate on their implementation or even replace them with standard tensorflow +op, without breaking any existing tools. + +Why? Composite ops often provide building blocks for building complex models. +However, and this is especially true for embedded and specialized hardware, +these ops when implemented naively, become unusable due to architectural +(hardware) details. It is thus preferable for the downstream tools to be able to +extract such composite ops and for instance provide specialized implementation. + +The current proposal concentrates on supporting inference-only transformations +and optimizations. However we leave the door open for follow-up gradient +optimization support. See appendix for a few possible ways forward. + +### Goals: + +* Create a standard way for tensorflow community to implement re-usable ops + that can be efficiently processed by Tensorflow core tooling (such as + TOCO/MLIR), grappler, as well as third party tooling, such as conversion to + 3-rd party engines (e.g. TFLite). +* Maximize backward and forward compatibility of such composite ops, while + allowing changes to implementation including switch to native tensorflow op. +* Provide for *future* support of composite op gradient extraction and + optimization by downstream tools. 
+* Bonus: enable *serialized* models to benefit from more efficient + composite-ops implementations as underlying platform changes. + +### Non Goals + +* Operation fusion or any graph optimizations. This should be handled by + tensorflow compilers. Though part of this proposal might simplify detecting + desirable transformations by MLIR and XLA frameworks. +* Discussion what should live in "core" tensorflow (e.g. `tf.xxx.my_cool_op` ) +* Ultra complex functions (e.g. trained models) that are unlikely to get + specialized implementations in hardware. +* Immediate support for processing gradients (and forward inference in the presense of gradients) + for composite ops. + +## Motivation + +Historically, tensorflow API contained two types of operations: “core” operators +implemented in CUDA/C++, and composite ops that are implemented as a subgraph +containing other tensorflow operations. Here we will refer to "core" operators +that have native implementation as `tf_op`. As ops mature and gain adoption, +efficiency often dictates replacing composite op with their native +implementation. + +Some examples of ops that were composite at some point (or still are today): + +* Composite non-linearities (such as swish, tanh, and sigmoid); +* many flavors of convolutions (such as atrous convolutions (expressible via + transpose/batch_to_depth and regular convolutions), depthwise-convolution + with depth multiplier); +* Normalization methods (e.g. BatchNorm, Instance Norm, etc… ), some unusual + flavors of convolutional padding, etc; +* Advanced numeric functions (e.g. matrix exponentiation); +* Combinatorial algorithms (e.g bipartite matching and nms) +* Specialized losses CTC loss, RNN layer steps +* tf.einsum + +Many of these ops have since became standard tensorflow ops with efficient +native implementations. + +It is important to note that many of the ops above precede or appeared during +the early days of Tensorflow, when compatibility with downstream tooling wasn't +that much of a concern. Nevertheless, for instance the switch from non-fused batchnorm to +fused one, caused some disruption in early tensorflow processing tools. Some of +which are still reflected in the +[comments](https://github.com/tensorflow/tensorflow/blob/84c5a4551e2e71d854932cb389c359db11cfa2b1/tensorflow/python/ops/nn_impl.py#L1241). + +Adding new operations today is much more complex due to existence of a large +eco-system of processing tools. Even within core tensorflow there are multiple +teams targeting dozens of hardware architectures with various priorities so +adding new operations (or even changing the implementation of existing composite +ops) becomes a bad case of chicken-and-egg problem. + +Why is it important? + +Today, new ML operations and composite blocks emerge regularly. They promise +improved functionality, efficiency and often both. These ops can often be +represented as +[simple](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/ops/nn_impl.py#L531), +and +[not-so-simple](https://github.com/tensorflow/tensorflow/blob/r1.13/tensorflow/python/ops/linalg/linalg_impl.py#L220) +tensorflow subgraphs. On the other hand, to get the full utility a custom +implementation on various platforms is necessary even if composite +implementation in tensorflow is sufficiently performant. For example using un-optimized +activations in mobile applications can increase latency on mobile devices by +more than 100%. 
This presents a conundrum: should implementers of such +operations create new atomic op and provide efficient implementations for their +target platform and break everyone else? Or should they provide tensorflow based +graph implementation and then rely on graph pattern recognition to extract and +match in the tooling of their choice. + +Who is affected? Tensorflow users, software and hardware vendors. + +### Why no gradients? + +Adding gradient support would dramatically widen the scope of this proposal. See +appendix for details on why it is complicated. We also have outlined several +possible options to add gradient support on top of this proposal, depending +on the needs. + +### Existing prior art. + +Tensorflow Lite have developed +[tf.lite.OpHint](https://www.tensorflow.org/api_docs/python/tf/lite/OpHint), +which solves a very similar problem. However it is positioned as tf.lite +specific extension, which doesn't provide a public api for graph consumers +(other than TOCO) to extract the hints, limiting the potential for broader +adoption by other tooling and limiting its usefulness to the users. + +`tf.lite.OpHint` also adds wrong direction to the flow of dependencies from +tensorflow to lite, should core tensorflow api choose to use to annotate +composite ops. + +Pattern matching is another approach that is currently used. For example TfLite +has pattern matching code that tries to detect various patterns that can then be +converted into specialized ops, such as +[prelu](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/toco/graph_transformations/identify_prelu.cc) +and +[lstm](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/toco/graph_transformations/identify_lstm.cc) +and others files in tf.lite directory. + +## User Benefit + +The main beneficiaries would be: + +a) ML community that get a clean way of defining of composite ops without having +to commit to particular implementation (or whether to build new `tf_op`) + +b) Tool maintainers that wouldn't need to write complicated and brittle graph +extraction code whose sole purpose is to reverse engineer tensorflow +implementations. For example here are some tf-lite transformations that +identifies composite ops like +[prelu](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/toco/graph_transformations/identify_prelu.cc), +[lstm](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/toco/graph_transformations/identify_lstm.cc), +[l2 pooling](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/toco/graph_transformations/identify_l2_pool.cc), +etc. Which essentially requires updates to lite whenever tensorflow modifies +those implementations. + +_Blog post announcement_: "How to add new functions to tensorflow" + +_Subtitle_: "So you don't regret it later". + +## Design Proposal + +Tensorflow already provides a mechanism to define functions inside a graph +called `tf.function`. In short, `tf.function` allows to define a subgraph with +dedicated inputs and outputs, that then is substituted as a custom node into +graphs. The gist of our proposal is to add a few standardized attributes to +signal to the processing/executing library that this function implements a +“standard” function. The consumer library then can choose to process it +differently should they choose to. The important part is that we define a +standard set of attributes that users and tooling can rely on. 
+ +As a start, we propose to add “implements” as a new attribute +defining a “standard” function. + +For example: + +``` +@function.function( + implements=“google.matmul_low_rank_matrix”) +def matmul_lowrank(x, y): + “Multiplies two low rank matrices. ” + # Basic implementation goes here. +``` + +Note we use a wrapper named function, that under-the-hood will create a standard +tensorflow function with standard attributes set. + +Arguments: + +* `implements`: we propose using namespaces with the top-level namespace being + company name, with a few reserved namespaces indicating built-in ops, if so + desired. This attributes indicates what function this subgraph implements, + and provides a hint for downstream parsers that they can replace this with + their own implementation should they so choose. + +This brings us the following advantages: + +1) *Backward and fork compatibility*: the tooling by default will, ignore +`implements` and just rely on function implementation available in the graph, +thus ensuring that it “just works”. Further, if users have their proprietary +implementations they can still ship models that would transparently work in open +source tensorflow. + +2) *Easy detection*: Tools can easily detect the functions attribute and +substitute with its own more efficient implementation if available. + +3) *Simplified implementations on custom hardware*: The downstream tooling can +use provided reference implementation as a source of ground truth (that can run +on custom hardware out of the box!) when developing custom implementation on +their hardware. + +4) *Reduced implementation dependencies*: Operation maintainer can change the +implementation without the fear of breaking the downstream tools pattern +recognition (since there isn’t any.) + +5) *Forward compatibility*: Operation maintainer can even add an atomic +implementation without breaking existing tools. No overhead in keeping the basic +implementation available. + +6) *Simpler automatic conversion* As tensorflow moves forward it is conceivable +that its computation description language (today: graph def, tomorrow: MLIR) +will be fully be separated from the underlying tensorflow kernel +implementations. This kind of standard attributes allow for more automatic +conversions should the need arise. + +7) Does not change op developer workflow and can be introduced incrementally to +existing code. + +### Behind the scenes changes + +When a new tf.function is created, behind the scenes tensorflow creates up-to 3 +functions that are stored in the graph. These functions are: 1) `Inference`: +x->y, a straightforward function that given composite op implements. 2) `Forward +function`: same as inference but includes extra outputs that are needed for +backprop 3) `Backprop function`: takes as an input the dL/dY, and all the side +outputs of `Forward function` and produces dL/dx + +In the current implementation the annotations will only be added to `inference` +function, but not to `Forward` or `Backprop` functions. + +### More on forward compatibility + +Suppose later, the maintainer of the op decides that certain composite op +deserves its own atomic op due to its usefulness. If we have standardized +attributes like the one above, the TensorFlow compiler can check if op with the +name that this function “implements” is available and substitute it at runtime. +The significant advantage is that even old serialized models will be able to +take advantage of improved implementations. 

As the op gains more adoption and
its support becomes more widespread, the previous composite definition can be
deprecated or retired. However, such an approach allows both implementations to
coexist for an unlimited period of time, and the same graph can be both efficient
on the platforms that support the atomic op and "not broken" on the platforms
that don't.

### Alternatives considered

#### Alternative 0: Do nothing

Everyone just continues with ad-hoc implementations.

##### Pros:

* No need for an RFC or new public API.

##### Cons:

* Downstream tools are stuck between:

  a) trying to implement ops, even if efficiency is not yet a concern, in which
  case the implementation is simply redone in the downstream tooling using its
  own language; or

  b) trying to extract certain graph patterns that can be optimized and
  following a moving target. Once such a pattern is extracted, there is
  pressure on op maintainers to freeze their implementation to avoid breaking
  the patterns used by the downstream tools; or

  c) trying to roll out their own extension, such as `OpHint` for tflite.

* New ops that could push the industry forward are stymied by the lack of an
  efficient implementation, while software vendors are not interested in
  providing an implementation until the ops become popular enough, creating a
  vicious cycle.

#### Alternative 1: Recommend each new op to be implemented as a new atomic node in graphdef

##### Pros:

* Simplified graph processing since the operation is just a single Node in the
  graph.
* Clean graph def.
* Tooling (once it supports the new ops) stays forward compatible. The
  underlying implementation is invisible.

##### Cons:

Introduces backward incompatible changes that could be avoided. Every time an
operation is added, the versions of TensorFlow that would have been capable of
processing the original graph (with core ops) will no longer be able to read
the graph def that contains the new op, even though the only thing that *has*
changed is that TensorFlow added a faster implementation of that op.

For example, consider matrix exponentiation. It is implemented as a fairly
complex TensorFlow graph that uses just regular matrix multiplication and other
standard ops. One can easily imagine that this implementation could be highly
optimized if done as a single expm node; however, if we replace it, old tools
will break and would effectively need to re-implement TensorFlow's original
implementation.

It also requires a custom version of TensorFlow to add new ops outside of
TensorFlow, which makes it out of reach for most users and introduces
incompatibilities within the ecosystem, effectively forcing large users to be
stuck at old versions of TensorFlow.

#### Alternative 2: Use name scopes for delineation of the ops

##### Pros:

* Simple and intuitive graph def.
* Backward compatible - no new operations are added.

##### Cons:

* Very brittle for the tools to optimize such ops. If they depend on scope
  names, it can easily cause conflicts with models that are doing something
  else, or accidental renames can cause scopes to become invisible to the
  graph processing.
* If tools depend on graph pattern matching, this makes it hard to change
  implementations later on.
* Tooling is not forward compatible.


## Appendix: Future support for optimizing gradient functions

This proposal is concerned with optimizing the inference pass of composite ops.
The motivation is that downstream tooling today rarely, if ever, deals with
gradients, and when it does, it can rely on the provided implementation.
However, this is eventually likely to change, and we would like to keep the
door open for extending this composite op framework to support optimization of
both the forward and backward passes. In this appendix we provide, for reference
purposes, several options for how this could potentially be supported in the
future.

### Background: why is it hard

Suppose we have implemented the function `my_function(x) =
very(complex(implementation(x)))`. Now, if some downstream library would like to
support an optimized implementation of `my_function`, all it needs to do is
replace the TensorFlow implementation with its own implementation. However, if
at any point we need to compute the gradient of `my_function(x)`, then to avoid
recomputation, the default implementation of the gradient would need all the
intermediate values produced by the TensorFlow implementation. In this case these
would be the values of `implementation(x)`, `complex(implementation(x))`, etc.

This is problematic for two reasons:

1) Downstream tools have a dependence on TensorFlow's implementation of the
composite ops, and for instance can't just choose an arbitrary implementation.

2) If the TensorFlow implementation changes, so must the downstream
implementation.

In this appendix we outline two possible paths forward to resolve these issues.

### Option 1: Stabilize Gradient Signature

Option 1 revolves around allowing the composite op to provide an explicit list
of side outputs that could possibly be required for efficient gradient
computation. The `tf.function` machinery would then validate that the signature
actually matches the implementation and respect the order provided in the
signature. The downstream tooling would need to compute the side outputs when
providing its implementation.

This option means that we would not be able to significantly change the
implementation in the future. For instance, if it is discovered that
`my_function(x)` can be computed as `simpler(transformation(x))`, we won't be
able to change the implementation without changing the signature.

Note that for a non-trivial subset of fast functions this gradient signature
could be empty. In fact, *any* gradient could be computed without any side
outputs, by recomputing the function internally. Thus, side outputs only become
important when the function involves non-trivial compute.

Thus, this option might be acceptable in the cases where there are no side
outputs, or where the side outputs are unlikely to ever change.

### Option 2: Allow dynamic substitution of side outputs

Consider the `inference` and `forward` functions. The former has signature
`x -> y`, the latter `x -> y, s`. Either y or s can be a list of multiple
tensors. By comparing the signatures, tooling can select the split point
separating the inference part from the side outputs needed by the backward part.

> Assumption: side outputs of forward_XXX are only ever used as inputs to
> inference_backward_XXX; if they are used for any other purposes, then the
> downstream tooling can't replace the underlying XXX until those uses are
> eliminated (possibly via graph transformations). This makes sense because such
> a graph depends on an implementation detail of the tf.function implementation,
> and thus the tf.function shouldn't be optimized away.

Suppose the tooling has an efficient implementation of the gradient that needs
its own side outputs; let them be t1, ...,
tk. Then it can replace all three
functions with re-implementations having the following signatures.

Function  | Original signature       | New signature
--------- | :----------------------: | ------------------------:
Inference | x -> y                   | x -> y
Forward   | x -> y, s1, ..., sk      | x -> y, t1, ..., tk
Backward  | dL/dy, s1, ..., sk -> dx | dL/dy, t1, ..., tk -> dx

Important: Implementing this today would be fairly complex in the case of nested
functions, because the gradient of the outer function requires the side outputs
of all inner functions. Thus, not only does the signature of the function that
we change change, but so do the signatures of all functions that _call_ this
function. The tooling will therefore need to do a non-trivial whole-graph
transformation to update the signatures of **all** functions that call the
optimized function. However, it doesn't seem to be insurmountable, and it is
possibly fairly straightforward with MLIR.

Another, cleaner option would be to wrap all the side outputs into a single blob
containing multiple tensors, which the implementation can then replace with its
own. There is no such structure in TensorFlow today, but there might be in the
future. We should use it if it becomes available. In that case this would
essentially give the forward and backward functions stable signatures.

## Questions and Answers

This list reflects questions raised and the consensus reached during the
TensorFlow review process.

1. We should provide namespacing for the operation names early on to avoid
   future collisions with function names. One option is to adopt java-style
   "org.opname", basically using the developing organization to disambiguate.
   Another alternative is to use semantic namespaces, e.g. `glinalg`. Yet a
   third is to use org.semantic.opname.

   Note: since many of these functions will live in user codebases we obviously
   can't and shouldn't police them; however, having a recommended style guide
   will simplify things later on.

   Should there be a `graduation` process, where ops eventually move to a
   higher tier, e.g. a dedicated top-tier op like tensorflow.opname? If so, we
   might also consider having another attribute 'aliases' to allow a history of
   names for backward compatibility.

> The consensus appears to be to follow org.semantic.opname.

1. If we go org/name, do we need a centralized registry of accepted "org" names?

> No.

1. Do we need a `reference` field? The idea is that it points to the source of
   the authoritative ground truth implementation (likely not TensorFlow based),
   which this op is faithfully trying to reproduce. Should this reference be
   structured? (For example, it could point to an existing scipy/numpy/pillow
   library, or it could point to a paper.) Should we make this optional or get
   rid of it altogether and delegate this to documentation?

> The consensus seems to be that this field is not needed.

1. Name collisions - crash, no crash, etc.

> The consensus appears to be that no-crash is the most natural outcome despite
> initial misgivings. The argument that tilted in this direction is that the
> definition that will be invoked is actually well defined by the Python code
> semantics (e.g. user code would call, say, python_lib.something.my_conv_op,
> which declares myai.my_conv_op), so the user intent is clear. What the
> downstream tooling will do is up to the downstream tooling, as long as it
> follows the contract.
> If there are two implementations available and different parts of the code
> call different ones, we might end up with two definitions in the same
> function, but every invocation is still well defined in the graph itself,
> and this is thus preferable.

1. tomhennigan: Aligning the definitions with MLIR

   I wonder if we should consider more metadata here. For example, within a
   dialect, MLIR op definitions [0] include a name [1], summary, and
   description. Maybe we can align with them? Concretely I suggest considering
   something like:

   ```
   @tf.function(
       op_def=tf.OpDef(
           dialect="google",
           name="mm_low_rank_matrix",
           summary="..",
           description="..",
       ))
   def f(..): pass
   ```

   [0] https://github.com/tensorflow/mlir/blob/master/g3doc/OpDefinitions.md#operation-definition
   [1] https://github.com/tensorflow/mlir/blob/master/g3doc/OpDefinitions.md#operation-name

> No, because MLIR dialects are not well aligned with the semantic dialects that
> we are considering. E.g. MLIR dialects follow the framework (e.g. TPU, or
> tensorflow, etc...), instead of "this is a group of related ops".

1. Should this belong to the existing `tf.function`, or is a new alias
   preferable, e.g. `tf.library_function`?

> Use the existing tf.function.

1. Should there be a specialization mechanism, where different implementations
   can be provided that are more efficient on different target hardware? (Both
   for TensorFlow and downstream tooling.)

> Yes, eventually, but it seems not critical to have it from the get-go.

1. ycling: Should function_def have dedicated fields for describing these, or
   should we just use attrs? Attrs seem like a good option.

> attrs

1. joker-eph: What about the backward pass? If downstream tooling is looking to
   support back-prop efficiently, there will need to be more hooks to implement
   the gradient functions. In particular, the back-prop pass won't be able to
   use the efficient forward implementation, because back-prop requires internal
   tensors (unless there is a matching efficient gradient implementation).

From joker-eph: "Right, replacing a function with the gradient computation seems
to me like a key use-case that we will want to support. Not having a solution
for this makes this proposal much less attractive."

> I think we ran out of time on this one, but the proposal is that this will
> probably be left unspecified in this iteration. The potential path forward is
> to automatically wrap the gradient into its own function that downstream
> tooling can identify and replace with its own implementation. If the tooling
> needs to support both the forward and backward paths, it seems that to benefit
> from this the tooling would need to provide both implementations (and do
> internal caching) or simply rely on the default implementation.

1. tomhennigan: Performance implications of wrapping too much stuff into
   tf.function: One thing to note is that there is an overhead to calling a
   @tf.function from eager mode (currently ~100us) and I suspect we will want
   to be careful about adding it around lots of TensorFlow's public API without
   careful consideration (e.g. if 100us of overhead would dominate the runtime
   of the composite op).

I did actually try this a while back (wrapping all functions in the TF public
API with tf.function) and found it only made a performance improvement for
tf.reduce_logsumexp and tf.einsum (at least for the model I tested with).
+ +> The consensus is that we will eventually make tf.function fast enough that it +> won't be an issue, and given that this is likely to have very gradual roll out +> we will have time to adapt. diff --git a/rfcs/20190612-mlir-dialect.md b/rfcs/20190612-mlir-dialect.md new file mode 100644 index 000000000..388b44e24 --- /dev/null +++ b/rfcs/20190612-mlir-dialect.md @@ -0,0 +1,335 @@ +# TensorFlow MLIR Dialects + +|Status | Accepted | +|:------------ | :-----------------------------------------| +|**Author(s)** | Mehdi Amini (aminim@google.com) | +| | Tatiana Schpeisman (shpeisman@google.com) | +| | Chris Lattner (clattner@google.com) | +|**Sponsor** | Alexandre Passos (apassos@google.com) | +| | Jacques Pienaar (jpienaar@google.com) | +|**Updated** | 2019-06-10 | + +## Objective + +[MLIR](https://medium.com/tensorflow/mlir-a-new-intermediate-representation-and-compiler-framework-beba999ed18d) +is the intermediate representation and compiler framework we are investing in to +build the compiler infrastructure for TensorFlow. The representation for +TensorFlow exposed in this document will be what future high-level +transformations will operate on. + +We make use of two different dialects to model TensorFlow graphs in MLIR: first +the `tf_executor` dialect that represents the execution model of the TensorFlow +executor (e.g. control dependencies, deadness propagation) and the `tf` dialect +which represent the regular operations in a TensorFlow graph (the ones that +don’t have special contract with the executor). + +One intent of this design is that TensorFlow 2.x features can choose to target +just the `tf` dialect, allowing us to phase out the `tf_executor` dialect in +subsequent TensorFlow releases. The combination of the two dialects allows to +represent arbitrary existing TensorFlow graphs. + +The representation in this document does not address the specific needs of +accelerators or "custom backends" for TensorFlow. We plan to provide a generic +infrastructure for replacing the TF/XLA bridge with a more flexible and reusable +system across targets. A later design proposal will address these aspects. Also +this representation does not address shape inference, an independent design +exploration is being conducted separately at the moment. + +## TensorFlow Dialect + +The TensorFlow dialect in MLIR is an open dialect (it allows operations that +MLIR doesn't know about) that can contain any TensorFlow operation that does not +have a specific handling by the executor. These operations don’t operate on dead +values, don’t have control dependencies, and execute conceptually in program +order. The form used in this dialect aligns with the direction taken by +TensorFlow 2.0 with tf.function and autograph, as well as with the needs of +other frontends. This should ease the development of analyses and +transformations: optimizations operate on a simpler semantics and local graph +transformations can be validated in a local scope. Simple patterns like folding +`x-x` into a constant 0 do not need to update any control dependencies. It +should also be easily lowerable towards multiple accelerators and heterogeneous +systems in general. + +Operations in this dialect usually operate on tensor and scalar types defined in +the standard dialect. 
The extra defined types are specific to TensorFlow: `QINT`
types like !tf.qint8 (etc), `QUINT` types like !tf.quint8, all of the `REF`
types like !tf.uint8ref, as well as !tf.string, !tf.resource, and !tf.variant,
which correspond to the TensorFlow types of the same name.

### Example:

Below is an example of a function operating on the TensorFlow dialect:

```mlir {.mlir}
/// This is a regular function, taking inputs by value and returning a new value.
/// The body is a regular CFG.
func some_function(%input : tensor<*xf32>) -> tensor<*xf32> {
  // TensorFlow operations are not variadic: this `tf.add` operation always
  // takes two inputs and returns a single output. This simplifies
  // pattern-matching, verification and rewriting.
  %added = tf.Add %input, %input : tensor<*xf32>
  // Operations have sequential execution semantics in a basic block, there are
  // no control dependencies. The compiler can reorder operations according to
  // the as-if rule ( https://en.wikipedia.org/wiki/As-if_rule ).
  %three = constant splat<tensor<f32>, 3.0>
  %mul = tf.Mul %input, %three : (tensor<*xf32>, tensor<f32>) -> tensor<*xf32>

  // Only control flow v2 is supported in TF dialect.
  // The tf.If operation takes three functions that accept the same
  // arguments: the condition returns a bool and the two branches must return
  // the same type, which is also the return of the tf.If.
  %value = "tf.If"(%added, %mul)
      {cond: @cond_func, true_branch: @func_foo, false_branch: @func_bar}
      : (tensor<*xf32>, tensor<*xf32>) -> tensor<*xf32>

  return %value : tensor<*xf32>
}
```

## TensorFlow Executor Dialect

The `tf_executor` dialect is intended to model the current TensorFlow executor
semantics and (when combined with the `tf` dialect) can represent arbitrary
TensorFlow 1.x and 2.x graphs. As such it follows the executor model, including
deadness propagation, concurrent semantics, and control dependencies. The
`tf_executor` dialect defines two dialect-specific types:

* `!tf_executor.control` to represent control dependencies.
* `!tf_executor.token` to represent the pair of operations modeling the
  NextIteration operation.

The `tf_executor` dialect is closed (operations are all known to MLIR) as there
are only 8 TensorFlow ops with specific graph executor behavior and 4 additional
operations to represent islands of predictability.

This dialect models the TensorFlow executor semantics; as such, a large part of
the defined operations mirror the
[TensorFlow Control Flow Ops](https://www.tensorflow.org/api_docs/cc/group/control-flow-ops)
and the
[implementation of Control Flow In TensorFlow](http://download.tensorflow.org/paper/white_paper_tf_control_flow_implementation_2017_11_1.pdf).
Also, almost all the operations accept a variadic number of control tokens and
return an extra control token as output. Except for `tf_executor.Merge` and
`tf_executor.ControlTrigger`, operations propagate deadness: if any of the
inputs (control or non-control) is dead, all the outputs (control and
non-control) are dead as well. For `tf_executor.Merge`, the output is dead only
when either an input control token is dead or all of the regular inputs are
dead. For `tf_executor.ControlTrigger`, a live control output is always produced
even when some control inputs are dead.

### `tf_executor.graph` Operation

The `tf_executor.graph` operation contains a region with a single block that
lists the operations in a TensorFlow graph.
The operations are topologically +sorted in-order (no cycles are allowed in the SSA values). The execution model +for operations in this block follows the TensorFlow executor semantics: + +1. Operations that don’t have any transitive dependencies through the SSA + def/use chains may be executed in parallel + (`tf_executor.NextIteration.Source` is the exception). +2. SSA values in this block can be implicitly dead. This means that every SSA + value defined in a `tf_executor.graph` can be considered implicitly wrapped + in a conceptual `dead_or` structure, and includes a runtime flag + indicating if the value is dead or present. Operations may have special case + handling of dead values. +3. Operations in this dialect return a value of type `!tf_executor.control` as + last returned value (exceptions are `tf_executor.NextIteration.sink` and + `tf_executor.fetch` which don’t return any value). + +The `tf_executor.graph` op only allows specific `tf_executor` dialect operations +in its body: the `tf_executor.graph` verifier will reject any unknown operation. +In order to execute standard `tf` dialect operations (like `tf.Add`) they must +be wrapped in the `tf_executor.island` operation. + +The `tf_executor.graph` operation does not accept any operands, inputs are +implicitly captured by the region, representing the feeds to the graph. + +The region attached to `tf_executor.graph` is terminated by a +`tf_executor.fetch` operation. The non-control operands of the terminator +correspond to the result values (or fetches) of the `tf_executor.graph` +operation. The behavior is undefined if any of the operands of the +`tf_executor.fetch` is dead. + +```mlir {.mlir} +%fetches = tf_executor.graph : tensor<*xf32> { + // Operations in the current block execute when their inputs are ready, + // possibly concurrently. + // Only operations in the tf_executor dialect are expected here. + // Ops can return multiple outputs and a control token for control + // dependencies. + // We don’t mention the control token in the return type here, it is implicit. + %0, %ctl0 = tf_executor.opA %feed#0, %feed#1 : tensor<*xf32> + %1, %ctl1 = tf_executor.opB : tensor<*xf32> + %2, %ctl2 = tf_executor.opC %1, %ctl0 : tensor<*xf32> + %3, %ctl3 = tf_executor.opD %2 : tensor<*xf32> + tf_executor.fetch %3 : tensor<*xf32> +} // end of the “tf_executor.graph" operation/region +``` + +### ‘tf_executor.island’ Operation + +The `tf_executor.graph` operation does not allow `tf` dialect operations to be +immediately nested underneath it. The `tf_executor.island` is introduced as a +wrapper for general computation (for example, all the `tf` dialect operations): +this results in a more consistent representation which makes analysis and +transformation simpler. + +The `tf_executor.island` operation has a single region with a single block +attached (only functional control flow is allowed). The block is terminated by a +`tf_executor.yield` operation. The operands of the terminator correspond to the +result values of the `tf_executor.graph` operation. An extra result of type +`!_tf_executor.control` is always produced by every `tf_executor.island`. + +Within an island, execution semantics follow standard sequential behavior +consistent with the direction of TensorFlow 2.0 and autograph, and desirable for +compiler analyses and transformations. Values in an island can’t be dead. 
Other +nested `tf_executor.graph` operations can be present in the region (or called +functions) to re-enable the TensorFlow executor behavior for a subsection of the +code. This is important for the following reasons: + +* Initially the functional control flow operations are calling functions + involving nested graphs, if `tf_executor.graph` weren’t allowed in an + island, these operations would need to have an equivalent in the + `tf_executor` dialect. +* Nesting also allows to form islands without involving inter-procedural + analyzes: any function call may involve a callee with a graph. + +The `tf_executor.island` region allows implicit capture. If any value captured +by a `tf_executor.island` is dead, the whole region does not execute and every +produced value is marked as dead as well. + +An arbitrary number of `tf_executor.control` operands are accepted by a +`tf_executor.island` operation. If any operand is dead, the region is not +executed and dead values are immediately returned for every result. + +```mlir {.mlir} +// The island is capturing implicitly %0 and %1. It is also taking a control +// dependency %ctl0 as input. It produces a tensor<*xf32> value matching the +// argument of the yield terminator, as well as an extra control token. +%2, %ctl2 = tf_executor.island (%ctl0) + : (tensor<*xf32>, !tf_executor<"control">) -> tensor<*xf32> { + %added = tf.Add %1, %0 : tensor<*xf32> + %mul = tf.Mul %added, %1 :tensor<*xf32> + + // The yield terminator operands are the result values of the island. + tf_executor.yield %mul : tensor<*xf32> +} +``` + +The case where a single operation is wrapped inside an island can even be +compressed by inferring the terminator to be the returned value of the +operation. The example above if it only contained the addition with implicit +capture would be displayed as: + +```mlir {.mlir} +%2, %ctl2 = tf_executor.island(%ctl0) wraps tf.Add %1, %0 : tensor<*xf32> +``` + +### `tf_executor.Switch` Operation + +[`tf_executor.Switch`](https://www.tensorflow.org/api_docs/cc/class/tensorflow/ops/switch): +takes two inputs,`predicate`and`data`and returns two regular +outputs,`true_output`,`false_output`. The`data`input is copied +to`true_output`if`predicate`evaluates to true otherwise it is copied +to`false_output`. The other output is marked as dead. If one of the inputs or a +control token is dead, then all of the outputs are marked as dead as well. + +### `tf_executor.SwitchN` Operation + +[`tf_executor.SwitchN`](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/ops/control_flow_ops.cc#L49-L53): +takes two inputs,`data`and`index`and an integer attribute`num_outs`indicating +the number of outputs. The`data`input is copied to output indicated by +the`index` input. The other outputs are marked as dead. If one of the inputs or +a control token is dead, then all of the outputs are marked as dead as well. + +### `tf_executor.Merge` Operation + +[`tf_executor.Merge`](https://www.tensorflow.org/api_docs/cc/class/tensorflow/ops/merge): +takes a variadic number of inputs, and returns a single output. The output is +defined as a non-dead input (selected in a non-defined way if multiple inputs +are non-dead). If all inputs are dead, the output is also dead. + +### NextIteration: `tf_executor.NextIteration.Source` and `tf_executor.NextIteration.Sink` Operation + +The TensorFlow +[`NextIteration`](https://www.tensorflow.org/api_docs/cc/class/tensorflow/ops/next-iteration) +op is modeled using these two paired operations. 
Since _NextIteration_ is +intended for modeling the loop back-edges, breaking it in two different +operations allows to keep a structural +DAG.`tf_executor.NextIteration.Source`does not take any operand and produces two +results: one regular value corresponding to the TensorFlow graph, and a second +value of type`tf_executor.loop_token`. This token is consumed by the +paired`tf_executor.NextIteration.Sink`Operation alongside the value that is +passed through the back-edge. No value is returned +by`tf_executor.NextIteration.Sink`. The type of the result of the source must +match the type of the value operand of the sink. + +`tf_executor.NextIteration.Source` is an exception in the executor model in the +sense that it executes after the paired `tf_executor.NextIteration.Sink` even +though there is no data dependency between them. + +### `tf_executor.LoopCond` Operation + +[`tf_executor.LoopCond`](https://www.tensorflow.org/api_docs/cc/class/tensorflow/ops/loop-cond): +forwards its boolean input to its output, +[it acts as`pivot` for marking the loop termination condition](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/control_flow_ops.h#L115-L118). + +### `tf_executor.Enter` Operation + +[`tf_executor.Enter`](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/control_flow_ops.h##77-L79): +takes a single input and a`name` string attribute that identifies the execution +frame. It forwards its input to its output in the new execution frame. + +### `tf_executor.Exit` Operation + +[`tf_executor.Exit`](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/control_flow_ops.h#L90-L92): +forwards its single input to its output, exiting the current execution frame. + +### `tf_executor.ControlTrigger` Operation + +[`tf_executor.ControlTrigger`](https://www.tensorflow.org/api_docs/cc/class/tensorflow/ops/control-trigger): +it is similar to +[a no-op](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/control_flow_ops.h#L23-L26) +that acts as a placeholder for control dependencies. It always produces a live +control output even when some control inputs are dead. + +### `tf_executor.Send` Operation + +[`tf_executor.Send`](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/sendrecv_ops.h#L24): +matches TensorFlow semantics. + +### `tf_executor.Recv` Operation + +[`tf_executor.Recv`](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/sendrecv_ops.h#L37): +matches TensorFlow semantics. + +## Example + +Below is an example of a loop decrementing an initial `%_count.init` integer +until it reaches 0 and returns the last value in the loop. 
+ +```mlir {.mlir} +// Loop `%count.init` times and return the last counter (always zero) +%fetches = tf_executor.graph { + + %loop.init, %ctl0 = tf_executor.Enter %count.init : i32 + + %next_count, %tok = tf_executor.NextIteration.Source : i32 + + %loop.body.init, %ctlMerge = tf_executor.Merge %loop.init, %next_count : i32 + + %dec_count, %ctlAdd = tf_executor.island + wraps tf.Add %loop.body.init, -1 : (i32, i32) -> i32 + + %loop_cond, %ctlNE = tf_executor.island + wraps tf.NotEqual %dec_count, 0 : (i32, i32) -> i1 + + %true, %false, %ctlSwitch = tf_executor.Switch %loop_cond, %dec_count : i32 + + tf_executor.NextIteration.Sink[%tok] %false : i32 + + %exit_count, %ctlExit = tf_executor.Exit %true : i32 + + tf_executor.fetch %exit_count : i32 +} // end of the "tf_executor.graph" operation/region +``` + diff --git a/rfcs/20190630-tfx-on-kfp.md b/rfcs/20190630-tfx-on-kfp.md new file mode 100644 index 000000000..6dc4bb75a --- /dev/null +++ b/rfcs/20190630-tfx-on-kfp.md @@ -0,0 +1,239 @@ +# Kubeflow Pipelines & TFX + +Status | Implemented +:------------ | :-------------------------------------------- +**Author(s)** | Ajay Gopinathan (ajaygopinathan@google.com) +**Sponsor** | Konstantinos Katsiapis (katsiapis@google.com) +**Created** | 2019-06-30 + +## Objective + +This RFC documents the design and engineering effort proposed by the +[Kubeflow Pipelines](https://github.com/kubeflow/pipelines) team to support +[TFX](https://www.tensorflow.org/tfx) with Kubeflow Pipelines (KFP). + +TFX is an open-source effort by the TensorFlow team aimed at providing users +with tools for building production grade machine-learning (ML) workflows. TFX +provides an ML pipeline authoring framework in Python which encodes +Google’s best practices for ML pipelines, including: + +* scalable and battle-tested components +* ML-focused pipeline design patterns +* strongly typed artifacts +* artifact provenance tracking through [ML Metadata](https://github.com/google/ml-metadata) + +An important value-proposition of the TFX framework is that it is agnostic to +the orchestration framework. At launch, TFX supported two orchestration engines +natively: + +* [Apache Airflow](https://airflow.apache.org/) for running locally +* [Kubeflow Pipelines](https://www.kubeflow.org/docs/pipelines/) + for running in the cloud + +This document describes how TFX pipelines are run using +Kubeflow Pipelines as its orchestration engine. It can be viewed as an extension +of the main +[TFX orchestration and configuration](https://github.com/tensorflow/community/tree/master/rfcs/20190718-tfx-orchestration.md) +design document. + +## Motivation + +### TFX on Kubeflow Requirements + +The main focus areas for running TFX on Kubeflow were: + +* **Portability:** The user-facing code in a TFX pipeline should be portable. + An early stated goal of our work was that we wanted the same pipeline to be + runnable on both Airflow and KFP with a _single-line change_ in the + pipeline construction code. +* **Scalability:** TFX on KFP should solve the use-case of large-scale + workloads, thereby showcasing the advantages of running on Google Cloud + Platform (GCP). This meant enabling the use of strongly differentiating GCP + services such as BigQuery, DataFlow and Cloud ML Engine for training and + serving in the pipeline code. + +At launch, both of these requirements were achieved. Using KFP required a +single-line change in the pipeline code, and the sample pipeline for KFP +showcased the use of GCP services for running workloads at scale. 
+ +### Overview of TFX pipelines + +A TFX pipeline is a **logical pipeline** consisting of a series of components. +Each component is defined in terms of inputs, outputs, and execution properties. +Inputs and outputs are represented as channels of ML Metadata Artifacts. +Logically, each component consists of three parts: + +* `Driver`: Responsible for resolving input artifacts from the ML Metadata + store. Determines if the execution has been previously cached and if so, + whether the call to the `Executor` can be skipped. +* `Executor`: Executes the main logic of the component, and provides a uniform + interface around TFX libraries, as well as custom logic. +* `Publisher`: Records output artifacts produced by the `Executor`, and passes + these output artifact metadata to downstream steps. + +When running a pipeline under Airflow, the logical pipeline is converted to a +series of _Airflow operators_. Each component comprises 3 operators representing +the `Driver`, `Executor` and `Publisher`: + +![TFX message passing with Apache Airflow's XCom](20190630-tfx-on-kfp/tfx-oss-xcom-passing.png) + +At runtime, each `Driver` is responsible for resolving the metadata of input +artifacts for a given component from MLMD, and for determining if any previously +cached result of the component run can be used instead. If no cached result was +found, the `Driver` invokes the `Executor` which performs the main application +logic of the component. Upon completion, the `Publisher` writes the output +metadata to MLMD. In the case of Airflow, the `Publisher` operator also +publishes the same metadata for consumption by downstream components using +Apache Airflow’s +(XCom)[https://airflow.apache.org/concepts.html?highlight=xcom#xcoms] mechanism. + +## Design proposal + +### Kubeflow Pipelines Orchestration + +KFP uses [Argo](https://argoproj.github.io/argo/) as its orchestration engine. +Argo is a Kubernetes-specific engine for orchestrating the execution of +workflows where each individual workflow step is the execution of a +containerized application. Argo employs a YAML-based specification to construct +the workflow graph, which also specifies how each container’s application should +be invoked. + +Passing data from upstream components to downstream ones is accomplished via +[Argo output parameters](https://argoproj.github.io/docs/argo/examples/readme.html#output-parameters). +The output results of a component are written to named, container-local files +after every iteration. The contents of this file can then be passed as input +parameters to subsequent steps. In particular, the contents are passed as raw +strings which can be used as command-line arguments when invoking the downstream +step using a templating mechanism in the Argo specification. + +In order to run a TFX pipeline on KFP, the user specifies `KubeflowRunner` +instead of `AirflowDAGRunner` in the pipeline definition file. The logical +pipeline definition itself remains unchanged, thus ensuring portability of +pipelines across orchestration engines. + +In contrast to Apache Airflow, using `KubeflowRunner` and running the pipeline +file does not actually launch the pipeline. Instead, the logical pipeline is +**compiled**, resulting in a pipeline definition file in YAML, which contains +the Argo specification for a workflow that can be run on Kubernetes. The user +must then manually upload this pipeline definition file to a cluster running +Kubeflow Pipelines before it can be run. 
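
To make the parameter-passing mechanism concrete, here is a minimal,
self-contained sketch of a step publishing its output artifact metadata to a
named, container-local file whose contents the orchestrator then forwards to the
next step as a plain string. The directory layout, helper name, and payload
shape below are simplified stand-ins for illustration, not the exact files
written by the TFX container.

```
# Illustrative sketch of Argo-style output-parameter passing between steps.
import json
import os


def publish_output(output_name, artifact_metadata,
                   base_dir='/tmp/output/ml_metadata'):
  """Writes serialized artifact metadata to a named, container-local file."""
  os.makedirs(base_dir, exist_ok=True)
  path = os.path.join(base_dir, output_name)
  with open(path, 'w') as f:
    json.dump(artifact_metadata, f)
  return path


# Argo reads the file contents and passes them verbatim (as a string) into the
# command line of the downstream step, which deserializes them again:
path = publish_output('transformed_examples', {'uri': '/tmp/transform/1'})
with open(path) as f:
  downstream_input = json.loads(f.read())
```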
+ +![TFX on Kubeflow](20190630-tfx-on-kfp/tfx-on-kubeflow.png) + +In the Kubeflow cluster, users use an interactive UI to select and launch their +pipeline. The KFP APIServer will then submit the uploaded pipeline definition to +the **Argo controller** to orchestrate the actual workflow. The Argo +specification specifies which container to execute and which command line +invocation to use during each step. + +KFP provides a [Python SDK](https://www.kubeflow.org/docs/pipelines/sdk/) for +constructing ML workflows on top of Argo. The main abstraction used is the +[ContainerOp](https://www.kubeflow.org/docs/pipelines/sdk/build-component/) +class, which can be viewed as a Python representation of a containerized +workflow step in Argo. During compilation, each TFX component in the pipeline is +transformed into a `ContainerOp`. There are three key elements of `ContainerOp` +which are used when constructing the individual steps in TFX pipelines: + +* **Image:** All TFX components are executed using the same pre-built + [Docker image](https://hub.docker.com/r/tensorflow/tfx) which contains the + TFX library and its dependencies. +* **Command-line arguments:** The command-line arguments specify how the image + should be invoked. In particular, they specify the exact TFX component and + executor that needs to run for a given step. Metadata representing input + artifacts are passed as arguments to a container step using Argo’s built-in + templating mechanism. +* **File outputs:** Argo can use the contents of container-local files + produced within each step as input data to be passed to downstream steps. + When the TFX container successfully completes the execution of an + `Executor`, it writes the ML Metadata representation (that is, Artifact and + ArtifactType protos) of output artifacts into named local files, which will + be passed along to downstream components by Argo. This can be viewed as the + **_publish_** step equivalent of using Airflow’s XCom mechanism. + +Consider the snippet of a TFX pipeline consisting of components `Transform`, +`SchemaGen` and `Trainer`. `Transform` produces transformed examples as well as +the transform graph itself, which are consumed by the `Trainer` component. +`Trainer` also consumes the schema produced by `SchemaGen` component. + +![TFX with Kubeflow containers](20190630-tfx-on-kfp/tfx-kfp-containers.png) + +In KFP, each component is now represented as the execution of the TFX container +image. Individual components have customized command-line invocations, which are +based on their input arguments and which TFX executor to execute. +The execution of each step is controlled by instances of the +[`ExecutorRunner`](https://github.com/tensorflow/tfx/blob/master/tfx/orchestration/kubeflow/executor_wrappers.py) +base class. This class is responsible for constructing the arguments required by +all TFX executors, namely: + +* `input_dict`: A dictionary of input artifacts. These are constructed at + runtime using the values of the Argo output-parameters that were passed in + as inputs. +* `output_dict`: A dictionary of output artifacts. These are pre-determined + for each derived class of `ExecutorRunner` and specialized per-component. +* `exec_properties`: A dictionary of runtime parameters, whose values may + either be primitive Python types, or serialized JSON representation of + protocol buffers. + +The arguments are constructed and used to call into the specified TFX `Executor` +(for example, `tfx.components.trainer.executor.Executor`). 
If execution is +successful, `ExecutorRunner` writes each output artifact (as specified in +`output_dict`) and their schema types in JSON-serialized format into a container +local file. The contents of this file are then passed as ML Metadata artifacts +for consumption by downstream steps. The KFP UI visualizes both input and output +parameters for each step. + +![TFX artifacts with the Kubeflow UI](20190630-tfx-on-kfp/tfx-kfp-ui.png) + +### ML Metadata Tracking + +In contrast to Airflow, TFX on KFP does not have drivers and publishers. +Instead, metadata is recorded passively in KFP’s APIServer, by parsing the +status of the Argo workflow custom resource definition (CRD) periodically. Each +Argo workflow CRD status contains recorded values of Argo output parameters +(that is, the contents of the named local files) upon successful completion of +the workflow step. KFP employs a custom Kubernetes controller called +PersistenceAgent, which periodically polls for the latest status of all Argo +workflow resources, and updates the state in the APIServer. + +![TFX with Kubeflow Pipelines and Argo](20190630-tfx-on-kfp/tfx-kfp-argo-workflow.png) + +The APIServer parses Argo workflows and looks for Argo output parameters that +look like serialized MLMD artifacts in specially named files (by convention, the +files are named `/output/ml_metadata/{output_name}`). These artifacts and their +types are then recorded into an MLMD instance powered by the same MySQL server +that backs KFP’s persistent data. + +## Future Work + +While TFX on KFP works, it still does not have feature parity with the Apache +Airflow version. We are exploring the following directions concurrently to close +the gap between the two orchestrators: + +* **Metadata-driven orchestration**: The current version of TFX on KFP records + artifacts in MLMD, but does so passively. This is due to the lack of drivers + and publishers in the initial implementation. Hence, lineage tracking and + caching is not currently possible. +* **Enabling arbitrary user containers with MLMD artifacts as the interface + between pipeline steps:** Currently, incorporating a custom step in a TFX + OSS pipeline requires users to implement a custom executor. Users in Cloud + frequently have an existing application, written in a non-Python language + (such as R, Java, etc), which they would like to plug into their TFX-based + pipeline. +* **Unified pipeline authoring experience:** TFX and KFP both present users + with a Python-based DSL for constructing their pipelines. The DSL constructs + look very similar from the user’s point of view, but are fundamentally very + different underneath. This has led to customer confusion. Unifying the DSL, + and presenting a single user-facing experience for constructing ML pipelines + is a goal that we’re actively exploring. +* **Pipeline-level runtime parameters:** KFP provides the possibility of + specifying pipeline-level parameters so users can run the same pipeline with + different combinations of control parameters. Since the pipeline definition + is a YAML-based file equipped with a templating mechanism, all pipeline + runtime parameters are restricted to string types. This presents a challenge + for specifying pipeline parameters at runtime that are not simple strings + (for example, the number of training steps in `Trainer` is specified in a + protocol buffer which must be serialized at runtime to be consumed by the + component). 
Contrast this to the Airflow scenario, where arbitrary code can + execute to yield runtime parameters since the pipeline definition and + runtime environment exist in the same execution scope. + diff --git a/rfcs/20190630-tfx-on-kfp/tfx-kfp-argo-workflow.png b/rfcs/20190630-tfx-on-kfp/tfx-kfp-argo-workflow.png new file mode 100644 index 000000000..7e258c68f Binary files /dev/null and b/rfcs/20190630-tfx-on-kfp/tfx-kfp-argo-workflow.png differ diff --git a/rfcs/20190630-tfx-on-kfp/tfx-kfp-containers.png b/rfcs/20190630-tfx-on-kfp/tfx-kfp-containers.png new file mode 100644 index 000000000..05ffb9f36 Binary files /dev/null and b/rfcs/20190630-tfx-on-kfp/tfx-kfp-containers.png differ diff --git a/rfcs/20190630-tfx-on-kfp/tfx-kfp-ui.png b/rfcs/20190630-tfx-on-kfp/tfx-kfp-ui.png new file mode 100644 index 000000000..5cf13fd19 Binary files /dev/null and b/rfcs/20190630-tfx-on-kfp/tfx-kfp-ui.png differ diff --git a/rfcs/20190630-tfx-on-kfp/tfx-on-kubeflow.png b/rfcs/20190630-tfx-on-kfp/tfx-on-kubeflow.png new file mode 100644 index 000000000..873138af0 Binary files /dev/null and b/rfcs/20190630-tfx-on-kfp/tfx-on-kubeflow.png differ diff --git a/rfcs/20190630-tfx-on-kfp/tfx-oss-xcom-passing.png b/rfcs/20190630-tfx-on-kfp/tfx-oss-xcom-passing.png new file mode 100644 index 000000000..d7f249466 Binary files /dev/null and b/rfcs/20190630-tfx-on-kfp/tfx-oss-xcom-passing.png differ diff --git a/rfcs/20190718-tfx-orchestration.md b/rfcs/20190718-tfx-orchestration.md new file mode 100644 index 000000000..e8e0ceb62 --- /dev/null +++ b/rfcs/20190718-tfx-orchestration.md @@ -0,0 +1,558 @@ +# TensorFlow Extended (TFX) orchestration and configuration + +| Status | Implemented | +| :------------ | :-------------------------------------------------- | +| **Author(s)** | Kevin Haas (khaas@google.com), Zhitao Li (zhitaoli@google.com), Ruoyu Liu (ruoyu@google.com) | +| **Sponsor** | Konstantinos Katsiapis (katsiapis@google.com) | +| **Created** | 2018-12-18 | + +Note: This design document captures the initial state of the TFX design as of +Q4 2018 and is being published for historical informational purposes only. It is +not a representation of the current TFX design at the time of publication, but +rather the initial design document as proposed in Q4 2018. + +## Objective + +This RFC documents the initial design of TensorFlow Extended (TFX) using open +source orchestration frameworks, and defining a Python-based configuration +language (DSL). + +TFX will use [Apache Beam](http://beam.apache.org) for data processing, +[ml-metadata](https://www.tensorflow.org/tfx/guide/mlmd) for artifact +management, [TensorFlow](http://tensorflow.org) for training, and will support +two OSS orchestrators ([Apache Airflow](http://airflow.apache.org) and +[Kubeflow Pipelines](https://github.com/kubeflow/pipelines/)). This is achieved +using a Python-based embedded DSL +([domain specific language](https://en.wikipedia.org/wiki/Domain-specific_language)) +as an abstraction layer between the user’s pipeline configuration and the +underlying orchestrators. User pipelines will be constructed as an +implementation-agnostic Python library. + +## TL;DR + +* TFX will be tightly integrated with ml-metadata for artifact tracking. +* TFX will run on two open source orchestrators, Apache Airflow and Kubeflow + Pipelines. It will run on a single machine as well as running at scale. +* TFX pipelines will be configured using Python. 
The pipelines are also
  portable, allowing a single pipeline to be moved interchangeably between all
  open source orchestrators.
* Wherever possible, internal TFX code will be reused for the open source TFX
  version.
* TFX will be extensible, allowing users to create their own components and
  executors to be used within a TFX pipeline.

## Motivation

### Architecture

![TFX orchestration](20190718-tfx-orchestration/tfx-oss-architecture.gif)

### Overview

The goal of TFX is to allow external users to configure and run TFX pipelines
which are similar to those configured and run internally at Google. Note that
these pipelines are similar but not identical:

* Internal TFX pipelines are configured with service configs (protobufs),
  while external pipelines will use Python. To achieve parity, the Python DSL
  must be serializable into internal TFX service configs.
* The internal TFX pipelines primarily use the
  [pubsub design pattern](https://en.wikipedia.org/wiki/Publish–subscribe_pattern),
  whereas the first few workflow engines targeted for orchestration are true
  orchestrators. While the pipeline DAG and the executors can be expressed
  using the same DSL, the execution of the pipeline will vary across
  orchestration and pubsub systems. Given this difference, not all
  functionality is expected to be portable across orchestrators. Ideally all
  deltas are “system internal” and do not need to be exposed to the pipeline
  authors.

Users will define their pipeline using the TFX DSL and a set of Python classes
that emulate the existing TFX protobufs. The DSL provides methods to instantiate
TFX components and link the outputs of one component to the inputs of another.
The pipeline must be a [DAG](https://en.wikipedia.org/wiki/Directed_acyclic_graph),
and cycles will cause an exception to be thrown.

### Orchestration vs choreography

TFX and Kubeflow both follow a runtime design pattern generally known as
[service orchestration](https://en.wikipedia.org/wiki/Orchestration_(computing)),
which is different from
[service choreography](https://en.wikipedia.org/wiki/Service_choreography#Service_choreography_and_service_orchestration).
While both TFX and Kubeflow have eventual plans to support both patterns, the
initial launch for each will be service orchestration.

Note: For anyone not familiar with the differences, Stack Overflow has a good
[explanation](https://stackoverflow.com/questions/4127241/orchestration-vs-choreography)
describing the differences between the two.

## User Benefit

### The TFX pipeline configuration will be written in Python

Python will be used for the user-facing
[DSL](https://en.wikipedia.org/wiki/Domain-specific_language). Of the many
options to choose from (go, protos, yaml), Python was chosen due to its
popularity within the machine learning community. The TFX pipeline configuration
language will be called “DSL” for the remainder of this document.

### The DSL must be portable

The DSL must be orchestrator-agnostic, and the implementation details of the
orchestrator must not bleed up into the configuration. This will allow users to
easily migrate their pipelines across orchestrators, primarily from a local
implementation into a managed production cluster (for example, Kubeflow).

### The DSL must be extensible

As this DSL will also be used by Kubeflow, the DSL must be able to express
non-Tensorflow pipelines as well.
The TFX components should be interoperable +with non-Tensorflow pipelines provided the input and output types match what the +TFX components expect. + +### The DSL must be declarative + +The DSL must focus on defining which operations are to be performed. The order +of operations will be determined based on data dependencies. The DSL will allow +users to configure individual components and reuse prior components’ outputs. +Some orchestrator-specific parameters may need to be configured via an external +config file. + +### The DSL must support existing TFX pipelines + +Wherever possible, the DSL must be capable of configuring Google-internal TFX +pipelines using the same APIs and data structures. + +### Component execution must be portable + +The execution of a TFX component must have the same semantics regardless of the +underlying execution environment (local, on-cloud, on-premise). This will ensure +portability of the pipeline between environments. + +### Multiple orchestrators need to be supported + +The initial launch of TFX is based on Apache Airflow and Kubeflow Pipelines, +both of which are +[centrally orchestrated](https://stackoverflow.com/questions/4127241/orchestration-vs-choreography). +Internally at Google, TFX also supports a +[pubsub](https://en.wikipedia.org/wiki/Publish%E2%80%93subscribe_pattern) design +pattern. Support for the pubsub orchestration pattern is intended for the +future. + +The DSL must be serializable into a variety of orchestrators, including Apache +Airflow (Python-based) and Kubeflow Pipelines (argo/yaml-based). While this does +add some constraints on the orchestration functionality exposed to the DSL, this +is essential for portability as we do want TFX to be extended to additional +workflow engines as needed. + +### Requirements on Orchestrator + +The underlying orchestration system should support: + +* Linux, macOS and Windows; +* Python 2.7, Python 3.5, Python 3.6, and Python 3.7; +* Running executors/drivers in any language through subprocess and/or + container image; +* Local-mode and cloud execution environments, including GCP +* Support skipping of operations, as cached artifacts from prior runs should + not be recomputed + +## Design Proposal + +### Pipeline configuration + +The pipeline configuration allows users to connect TFX components into a +pipeline with a high-level Python SDK that abstracts both the TFX protos and the +runtime environments (local/cloud, Apache Airflow/Kubeflow Pipelines, etc) from +the user, allowing the pipeline to be portable across all supported environments +with minimal changes. TFX leverages +[ml_metadata](https://github.com/google/ml-metadata) to provide a fully +typechecked system, where the inputs and outputs of a component are defined +using the existing TFX protos. The full definition used internally includes the +artifact names and expected types for each input or output artifact. The pipeline +config will also perform type checks on component interfaces to ensure that a +pipeline is well-constructed. + +Because TFX already has a configuration API for internal usage, we will annotate +these component definitions and use code-gen to generate Python classes for each +component. + +Additional helper libraries will be provided on top of components to connect +them into pipelines, which can be executed on supported orchestrators (for +example, Apache Airflow and Kubeflow). 
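
For illustration, a pipeline wired up with this kind of DSL might look roughly
like the sketch below. The component names, keyword arguments, and module paths
are assumptions modeled on early TFX releases and are not normative for this
design; they may differ from current APIs.

```
# Hedged sketch of the component-linking style described above.
from tfx.components import CsvExampleGen, StatisticsGen, SchemaGen
from tfx.orchestration import pipeline
from tfx.utils.dsl_utils import csv_input

examples = csv_input('/path/to/csv_data')          # user-provided input location
example_gen = CsvExampleGen(input_base=examples)   # ingest raw examples
statistics_gen = StatisticsGen(                    # outputs of one component...
    input_data=example_gen.outputs.examples)       # ...feed inputs of the next
schema_gen = SchemaGen(stats=statistics_gen.outputs.output)

# The connected components form a DAG; cycles would be rejected.
tfx_pipeline = pipeline.Pipeline(
    pipeline_name='example_pipeline',
    pipeline_root='/path/to/pipeline_root',
    components=[example_gen, statistics_gen, schema_gen])

# The same logical pipeline can then be handed to an orchestrator-specific
# runner (e.g. an Airflow or Kubeflow runner) without further changes.
```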
+ +## Detailed design + +### Component + +A TFX component is a wrapper around the corresponding TF/TFX libraries, and +manages existing pipeline details such as input type-checking, local/remote +worker execution, and metadata publishing. While users won’t see the internals +of a +[TFX component](https://www.tensorflow.org/tfx/guide#anatomy_of_a_component), +the components contain the following “sandwiched structure”: + +* **Driver**: The driver provides a pre-executor hook for both TFX and + user-specific logic to be executed prior to the executor. The outcome of the + driver may affect whether the executor runs. With the internal pubsub + design, the driver manages the worker scheduling and monitors the actual + work, and determines the inputs for the executor. For TFX, this task is + performed by the workflow engine, and as a result, the drivers perform a + different role. In both cases (orchestration and choreography), the driver + is responsible for checking if a cached artifact already exists. If the + artifact has already been generated using the same executor logic and input, + the executor will be skipped and the cached artifact will be reused. +* **Executor**: An executable, Docker image, or Python module; including + wrappers around other TensorFlow libraries. The executor focuses on running + a specific task. An executor performs the task with no local state + maintained across executions. Filesystem locations of outputs are managed by + TFX and passed down to the executor via command line flags. In some cases, + the executor may execute functions provided by the user (for example, a + [user-supplied preprocessing_fn](https://github.com/tensorflow/tfx/blob/master/tfx/examples/chicago_taxi_pipeline/taxi_utils.py#L106) + for tf.transform). +* **Metadata publisher**: [ml-metadata](http://github.com/google/ml-metadata) + allows TFX to register the artifacts created and executions performed by + each task to track provenance and enable further analytics on executions and + artifacts. Component inputs and outputs are passed using artifact ids, + ensuring all execution and artifact access is properly logged within + ml-metadata. + +![TFX orchestration](20190718-tfx-orchestration/tfx-oss-component.gif) + +The component interface is meant to be extensible to allow the open source +community to build additional components (for example, community-created +components for other machine learning libraries). + +All TFX components must have strongly typed inputs and outputs. It should be +possible to connect TFX components with non-TFX components provided that the +appropriate inputs and outputs are equivalent types. + +#### Driver + +Driver is an optional part of the component. The responsibility of the driver is +to resolve optional inputs to the component when some inputs are neither +explicit user inputs or outputs from upstream tasks in the same execution. This +usually happens within continuous training. 
+
+Some examples of how the driver will resolve ambiguous inputs:
+
+* The trainer component can be warm-started from a previous training run’s
+  checkpoints
+* The model validator component needs to resolve the
+  [last blessed model](https://github.com/tensorflow/tfx/blob/master/tfx/components/model_validator/executor.py#L38)
+  within the ml-metadata database and compare its evaluation metrics with the
+  current model
+* The pusher component needs to resolve the last pushed model to avoid a
+  duplicate push to TensorFlow Serving
+
+Drivers must implement the following interface:
+
+```
+def Driver(input_artifacts: Mapping[Text, metadata.ArtifactAndType],
+           output_artifacts: Mapping[Text, metadata.ArtifactAndType],
+           execution_properties: Mapping[Text, Any],
+           fetch_history: Callable[[List[ArtifactAndType]], List[ArtifactAndType]]
+           ) -> ExecutionDecision
+
+class ExecutionDecision(object):
+  __slots__ = ['input_artifacts', 'output_artifacts', 'execution_properties',
+               'execution_id']
+```
+
+A driver can use the `fetch_history` functor to obtain additional artifacts by
+type, and augment `input_artifacts`. Drivers should be stateless, idempotent and
+side-effect-free Python functions so they can be easily ported among different
+platforms.
+
+We provide a _default_ driver if the component has no unresolved inputs after
+pipeline construction. This usually satisfies simple one-off pipelines and can
+reduce the barrier to creating a custom component. In the future, users will be
+able to extend the default driver with their own custom logic.
+
+An example flow of how a driver fits in an orchestrator:
+
+1. Orchestrator resolves ml-metadata artifacts based on inputs from upstream
+   tasks or, less commonly, “external” user-provided URIs;
+1. Orchestrator invokes the component driver with `fetch_history`:
+   * The driver invokes `fetch_history` to fetch an
+     `[execution -> List[ArtifactAndType]]` mapping, `history_mapping`.
+   * Based on `history_mapping`, `input_artifacts` and `execution_properties`,
+     the driver either
+     * decides not to proceed to execution and skips to the _publishing the
+       output to ml-metadata_ step below, **OR**
+     * decides to execute, and then augments `input_artifacts` and
+       `execution_properties` accordingly.
+1. Orchestrator records a pending execution with all the inputs in ml-metadata;
+1. Orchestrator invokes the executor through Python/Docker/etc;
+1. Orchestrator publishes the outputs and executions to ml-metadata.
+1. Orchestrator records the outputs to its inter-component datastore.
+
+#### Executor
+
+The TFX executor is an executable, Docker image, or Python module that wraps
+and invokes TF libraries. TFX will use a common executor interface shared across
+all TFX instances and add a thin wrapper on top to translate ml-metadata
+artifacts into the common
+[`Do()`](https://github.com/tensorflow/tfx/blob/master/tfx/components/base/base_executor.py#L51)
+function.
+
+Docker images containing all of the TFX components will be available. The
+image(s) will have versions mapped 1:1 to the underlying PyPI packages. This
+ensures that an executor always produces the same results regardless of the
+environment, thus satisfying the “Execution is portable” requirement. Initially,
+the executors will be based on what was shipped as part of the
+[TFMA end-to-end example](https://github.com/tensorflow/tfx/tree/master/examples/chicago_taxi).
+The long term goal is to open source and reuse the Google-internal TFX executors
+as they become available.
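+As a rough sketch of what implementing an executor against that common
+interface could look like, the example below defines a stand-in `BaseExecutor`
+and a hypothetical executor that only reads input URIs and writes outputs. The
+exact signature of the real `Do()` function is defined in the linked
+`base_executor.py` and may differ from this sketch; the artifact and component
+names here are illustrative only.
+
+```python
+import collections
+from typing import Any, Dict, List, Text
+
+# Stand-in artifact type; real executors receive ml-metadata artifacts.
+Artifact = collections.namedtuple('Artifact', ['uri'])
+
+
+class BaseExecutor(object):
+  """Stand-in for the common executor interface linked above."""
+
+  def Do(self, input_dict: Dict[Text, List[Artifact]],
+         output_dict: Dict[Text, List[Artifact]],
+         exec_properties: Dict[Text, Any]) -> None:
+    raise NotImplementedError
+
+
+class CsvToExampleExecutor(BaseExecutor):
+  """Hypothetical executor: pure computation, no state across executions."""
+
+  def Do(self, input_dict, output_dict, exec_properties):
+    # Filesystem locations are managed by TFX and handed to the executor;
+    # the executor only reads its inputs and writes its outputs.
+    input_uri = input_dict['input_base'][0].uri
+    output_uri = output_dict['examples'][0].uri
+    print('Converting %s -> %s with properties %r'
+          % (input_uri, output_uri, exec_properties))
+
+
+CsvToExampleExecutor().Do(
+    input_dict={'input_base': [Artifact(uri='/tmp/taxi/csv')]},
+    output_dict={'examples': [Artifact(uri='/tmp/taxi/examples')]},
+    exec_properties={'split_ratio': 0.8})
+```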
+
+##### Runtime
+
+Common logic shared by various executors will be provided by a family of
+Runtime classes, which abstract the underlying environment so that executors
+have consistent:
+
+* Beam pipeline argument passing
+* Logging configuration initialization
+* Optional cloud-specific connection and runtime parameters
+
+#### Publishing to ml-metadata
+
+After an execution succeeds in the executor, all generated artifacts and the
+execution will be **published** to ml-metadata. Only published artifacts can be
+used in orchestration, to avoid feeding incomplete results to downstream
+executors.
+
+Generalized publishing workflow:
+
+1. Ensure the executor returns with a successful return code
+1. If execution failed, mark the execution as failed in ml-metadata and skip
+   all dependent downstream tasks
+1. Publish the generated output artifacts to ml-metadata
+1. Associate published artifacts with the underlying execution via the
+   component’s “output” attribute for consumption by dependent downstream tasks
+1. Mark the execution as **DONE** in ml-metadata
+
+#### Type Checking
+
+TFX is a strongly-typed system, and type checking the inputs/outputs of each
+component is essential to the correctness of the underlying pipeline. The
+pipeline DSL supports the following type checks:
+
+##### Consistency Validation
+
+Consistency validation ensures that the output artifact type of an upstream
+component matches the expected type of the dependent downstream component’s
+input parameter. This is implemented at the pipeline DSL level by explicitly
+declaring the expected types of inputs for each component. In addition, all
+artifacts generated by TFX are also strongly typed. The expected input and
+output types of each component are defined by the internal TFX API component
+specifications. The component specifications will not be pushed to GitHub at the
+initial launch, but will be released at a later date.
+
+##### Parameter Type Validation
+
+This ensures that the non-orchestrated parameters of each component have the
+correct type. This will be implemented by generating type hints in the codegen,
+allowing us to integrate with
+[ML Metadata’s Type System](https://github.com/google/ml-metadata) and support
+more powerful type checks and semantics.
+
+### Pipeline DSL
+
+The following Python libraries are provided to connect components into a
+pipeline and provide pipeline-level configurations (a hypothetical usage sketch
+follows this list):
+
+* PipelineDecorator: A Python decorator using a context manager to register
+  the pipeline being constructed in the function. Possible parameters include:
+
+    * pipeline_name
+    * User-defined inputs:
+
+        * inputs for
+          [ExampleGen](https://github.com/tensorflow/tfx/tree/master/tfx/components/example_gen)
+        * Hyper-parameters
+        * user-defined functions for the trainer and transform components
+
+* OrchestratorParameters: A family of placeholders to capture
+  orchestrator-specific configurations. Each supported orchestrator will
+  subclass this to allow pipeline users to configure the orchestrator in the
+  desired way. For Airflow this includes:
+
+    * ml-metadata connection config
+
+    * scheduling_interval if running continuously
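+The following self-contained sketch shows how a `PipelineDecorator` built on a
+context manager might be used together with a hypothetical Airflow-specific
+`OrchestratorParameters` subclass. All class names, parameter names and values
+here are assumptions for illustration only; the real DSL may look different.
+
+```python
+import contextlib
+
+_pipeline_registry = []
+
+
+class AirflowOrchestratorParameters(object):
+  """Hypothetical Airflow-specific subclass of OrchestratorParameters."""
+
+  def __init__(self, metadata_connection_config, scheduling_interval):
+    self.metadata_connection_config = metadata_connection_config
+    self.scheduling_interval = scheduling_interval
+
+
+@contextlib.contextmanager
+def _pipeline_scope(pipeline_name, orchestrator_params):
+  # The real DSL would collect the components constructed in this scope and
+  # connect them into a DAG based on their data dependencies.
+  _pipeline_registry.append((pipeline_name, orchestrator_params))
+  yield
+
+
+def PipelineDecorator(pipeline_name, orchestrator_params):
+  """Registers the pipeline constructed inside the decorated function."""
+  def decorator(func):
+    with _pipeline_scope(pipeline_name, orchestrator_params):
+      func()
+    return func
+  return decorator
+
+
+@PipelineDecorator(
+    pipeline_name='chicago_taxi_simple',
+    orchestrator_params=AirflowOrchestratorParameters(
+        metadata_connection_config='sqlite:///metadata.db',
+        scheduling_interval='@daily'))
+def create_pipeline():
+  # Component construction (ExampleGen, Transform, Trainer, ...) goes here.
+  pass
+```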
+
+### Orchestration
+
+User pipelines will be constructed as an implementation-agnostic Python library.
+Helper APIs will exist for the two targeted orchestrators (Airflow and
+Kubeflow). The external interface for the users will be the same, so it will be
+trivial for a pipeline to be moved from one orchestrator to another. This
+satisfies the “DSL is portable” requirement.
+
+#### Apache Airflow implementation
+
+Airflow units of work are defined by Operators, and Airflow supports a
+[variety of operators](https://github.com/apache/incubator-airflow/tree/master/airflow/operators)
+as part of the base install package. Of these operators, the TFX components on
+Apache Airflow will use the following:
+
+* PythonOperator will be used to execute TF/TFX libraries when the pipeline is
+  running in local mode, as well as any user-defined driver logic specified as
+  part of the pipeline.
+
+* The
+  [BranchPythonOperator](https://airflow.apache.org/concepts.html#branching)
+  allows branches within the DAG and for the execution path to be determined
+  at runtime. This is used for artifact caching: if the artifact to be
+  generated already exists, the executor and the computation of that artifact
+  will be skipped. Downstream operators will continue to operate, unaware that
+  the cached artifact was reused rather than recomputed.
+
+* The SubDagOperator facilitates composition of Apache Airflow operators into
+  a single template for reuse. Currently, the TFXWorker is implemented as a
+  subdag with a BranchPythonOperator (artifact caching check), an optional
+  PythonOperator (the user-provided driver), and a Python/BashOperator (the
+  executor).
+
+The TFX Apache Airflow helper APIs provide the following libraries:
+
+* tfx_airflow.py: This file provides methods to build the Apache Airflow DAG
+  from the TFX DSL as well as an Apache Airflow template for the TFX
+  components. As components are defined in the TFX DSL, an internal method in
+  this file will also build the DAG based on data dependencies between the
+  various components.
+* tfx_types.py: deprecated in 0.12; originally used for type-checking the
+  inputs. Functionality was moved to ml-metadata in 0.12.
+
+![Example of a TFX pipeline with Airflow](20190718-tfx-orchestration/tfx-oss-dag.png)
+
+##### Why no Kubernetes with TFX and Airflow?
+
+While Airflow supports Kubernetes operators, we’ve chosen to use Kubeflow as the
+preferred TFX-on-Kubernetes implementation.
+
+#### Kubeflow implementation
+
+Coming soon!
+
+### Continuous Training
+
+There are several requirements for continuous training:
+
+* There must be a mechanism for users to choose to avoid duplicated
+  computation
+* The system must be able to get the status of previous runs
+* The system must provide a way to manually execute previous runs
+* The system must provide garbage collection support
+
+We fulfill the requirements above through the combination of per-component
+drivers and ml-metadata integration:
+
+* Each driver will check for previous component runs and decide whether or not
+  a new run is necessary if caching is turned on (a sketch of this decision
+  logic follows below)
+* If caching is turned off, the pipeline will proceed without skipping any
+  component
+* Drivers in some components such as Trainer and Model Validator will fetch
+  results from previous runs that are necessary for the current run. The
+  driver of ExampleGen, the entry point of a TFX pipeline, will decide the data
+  to process in this pipeline run.
+* We will provide a Python utility function for marking artifacts as
+  'DELETABLE' when certain conditions are met, and this garbage collector can
+  be wired as an optional component into a pipeline to automate the process.
+
+A pipeline is the smallest unit to run, and each component always processes the
+output data (cached or uncached) from upstream components. While this is not an
+asynchronous system like Google-internal TFX, it is more natural to express
+continuous pipeline runs, compared to continuous component runs, in
+workflow-based orchestration systems, without losing too much flexibility.
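+A minimal sketch of the caching decision a per-component driver makes is shown
+below. The `published_executions` argument stands in for an ml-metadata query
+result, and all field names here are hypothetical; the real implementation
+works against ml-metadata artifact and execution records.
+
+```python
+def resolve_cached_outputs(component_id, input_artifact_ids, executor_version,
+                           published_executions):
+  """Returns previously published outputs if an identical execution exists."""
+  for execution in published_executions:
+    if (execution['component_id'] == component_id and
+        execution['executor_version'] == executor_version and
+        execution['input_artifact_ids'] == input_artifact_ids and
+        execution['state'] == 'PUBLISHED'):
+      return execution['output_artifact_ids']
+  return None  # No reusable artifacts; the executor must run.
+
+
+previous = [{
+    'component_id': 'StatisticsGen',
+    'executor_version': '0.12.0',
+    'input_artifact_ids': [3],
+    'output_artifact_ids': [7],
+    'state': 'PUBLISHED',
+}]
+print(resolve_cached_outputs('StatisticsGen', [3], '0.12.0', previous))  # [7]
+```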
+
+#### Internal pipeline data representation
+
+TFX organizes the internal pipeline data in the hierarchy of span, version and
+split:
+
+* Span and version are monotonically increasing integers determined by the
+  ExampleGen driver. We also allow users to define the rule for generating the
+  span number through UDF support. By default, we will associate the span
+  number with execution time.
+* Splits are currently fixed to train and eval, but we plan to support
+  customized splits through SplitConfig on split-aware components.
+
+#### Partial Success Handling
+
+Sometimes a component (especially a custom one) may wrap a large amount of
+other logic and produce multiple artifacts. If the component fails when only
+part of the artifacts have been produced, this could be problematic.
+
+The proposed solution: the orchestrator will keep all artifacts in the
+_PENDING_ state while an execution is still ongoing, and atomically move all
+artifacts to the _PUBLISHED_ state only after the execution completes
+successfully. Notice that drivers only consider _PUBLISHED_ artifacts as valid
+orchestrated outputs, so a _PENDING_ artifact would not affect cache
+calculation. The downside of this design is that the executor might waste some
+resources recomputing the partial output that had already succeeded.
+
+#### Component Hangs
+
+For various reasons, components could end up in a “hanging” state in which they
+are neither dead nor making progress.
+
+Proposed solution: Each execution of a component will be assigned a deadline,
+and the executor will terminate itself with a DEADLINE_EXCEEDED status message
+if work is not finished within this deadline. The deadline could be configured
+as the minimum of the global deadline for the current DAG execution and the
+deadline for the current component.
+
+### Extending the Pipeline
+
+(This is how to satisfy the “DSL must be extensible” requirement.)
+
+There are several levels of customization to extend a pipeline. A pipeline
+author can create new executors to override the default TFX executors, or
+create new components to run in the TFX pipeline.
+
+#### Custom Executor
+
+If an existing component (defined by the same inputs and outputs) can be
+reused, the pipeline author could choose to provide an alternative
+implementation of the executor binary when initializing the corresponding
+component. The executor could be provided either in Python as a subclass of
+BaseExecutor, or as a binary (Docker image) supporting the same command-line
+interface.
+
+#### Custom Component
+
+Occasionally, TFX pipeline authors will need to create custom components which
+are not part of the official TFX components (for example, a custom example_gen
+component for other data formats). This can be achieved by providing a custom
+component which can be hooked into TFX pipelines just like official components.
+
+To implement a custom component, the component author needs to:
+
+1. Provide a component definition as a Python class which extends
+   “BaseComponent”, or a proto definition of the component which can be fed to
+   codegen
+1. Provide an executor implementation which extends a “BaseExecutor” class
+1.
(Optionally) Provide a driver if the custom component requires one + +The TFX team will provide supporting infrastructure: + +* Build script for packaging the executor into Docker images; +* Abstract test class for testing component definition (mostly type checking), + executor implementation and driver logic. + +### Future work + +After the initial release of TFX with 0.12, we will be exploring the following +areas: + +* Additional workflow patterns beyond Airflow and Kubeflow, including pubsub + systems like Pulsar. +* Apache Beam-based orchestrator. +* Garbage collection policy. +* Incorporating additional TFX executors as they become available. +* Custom components and executors. +* Support for xgboost and pytorch. +* Composable sub-pipelines. +* Providing a symmetric experience between on-local and on-Kubeflow-on-cloud. +* Integration/support for a TFX frontend. + diff --git a/rfcs/20190718-tfx-orchestration/tfx-oss-architecture.gif b/rfcs/20190718-tfx-orchestration/tfx-oss-architecture.gif new file mode 100644 index 000000000..863c5949c Binary files /dev/null and b/rfcs/20190718-tfx-orchestration/tfx-oss-architecture.gif differ diff --git a/rfcs/20190718-tfx-orchestration/tfx-oss-component.gif b/rfcs/20190718-tfx-orchestration/tfx-oss-component.gif new file mode 100644 index 000000000..a2d266c07 Binary files /dev/null and b/rfcs/20190718-tfx-orchestration/tfx-oss-component.gif differ diff --git a/rfcs/20190718-tfx-orchestration/tfx-oss-dag.png b/rfcs/20190718-tfx-orchestration/tfx-oss-dag.png new file mode 100644 index 000000000..29f3215c5 Binary files /dev/null and b/rfcs/20190718-tfx-orchestration/tfx-oss-dag.png differ diff --git a/rfcs/20190722-tflite-training.md b/rfcs/20190722-tflite-training.md new file mode 100644 index 000000000..2fe000b6d --- /dev/null +++ b/rfcs/20190722-tflite-training.md @@ -0,0 +1,444 @@ +# On-Device Training with TensorFlow Lite + +Status | Accepted +:------------ | :--------------------------------------------------------------- +**Author(s)** | Yu-Cheng Ling (ycling@google.com) +**Sponsor** | Andrew Selle (aselle@google.com), Jared Duke (jdduke@google.com) +**Updated** | 2019-07-22 + +## Overview & Roadmap + +TensorFlow Lite is TensorFlow's recommended solution for on-device machine +learning. Initially the project focuses on inference, but more and more users +are asking for on-device training recently. + +The doc scopes a multi-quarter effort to get generalized & optimized on-device +training working with TensorFlow Lite. The project can be broken down into a few +milestones: + +**Milestone 1: Working prototype for basic training (e.g. fully connected / conv layers only)**
+Goal: Have a working prototype to train over fully connected & convolutional layers. + +**Milestone 2A: Optimized basic training**
+Goal: Make inference & training performance comparable with TFMobile. + +**Milestone 2B: Enable on-device training loop with TensorFlow functionality**
+Goal: Encode the on-device training loop inside a TensorFlow model, with TensorFlow functionality like tf.Example and tf.data.Dataset. + +**Milestone 3: Optimized & generalized training (e.g. control flow, RNN)**
+Goal: Be able to train most models that are trainable with TensorFlow. Optimize the training performance for commonly used architectures. + +The following diagram gives an overview of TensorFlow Lite training road map. +Yellow blocks are deliverables. Blue blocks are technical tasks to unblock the +deliverables. The details of technical tasks will be explained below. + + + +## Goals & Non-Goals + +Goals: + +* Illustrate a complete roadmap for TensorFlow Lite on-device training +* Describe user experience of TensorFlow Lite on-device training +* Design for functional on-device training of TensorFlow 2.x models with + TensorFlow Lite + +Non-goals: + +* Quantized training on device +* Support TensorFlow Lite training with legacy model or TensorFlow 1.x + +## User Experience + +Throughout this document, **users** means developers who use TensorFlow Lite. + +This section explains how users can author a TensorFlow graph, convert a +TensorFlow graph to TensorFlow Lite format, use TensorFlow Lite to run training +and inference, and test TensorFlow Lite model correctness with the proposed +design. + +Note: The example defines a single TensorFlow Lite model with multiple subgraphs +for training & inference. However this isn't the only way -- You can also define +separated training model & inference model. + +We're using a simplified example in this section: + +* A model with 2 convolutional layers, 2 dense layers followed by a softmax + activation +* The whole model is trained offline +* The user want to retrain the 2 dense layers using personalized data on + device + +### Authoring TensorFlow graph + +**Defining the model using Keras Layers** + +The code defines the entire model using Keras API. Since we want to be retrain +only dense layers on device, the model is defined in 2 parts (`conv_layers` and +`dense_layers`). + +```python +conv_layers = tf.keras.Sequential([ + tf.keras.layers.Conv2D(10, kernel_size=(3, 3), activation="relu"), + tf.keras.layers.Conv2D(3, kernel_size=(3, 3), activation="relu"), + tf.keras.layers.Flatten(), +]) + +dense_layers = tf.keras.Sequential([ + tf.keras.layers.Dense(50), + tf.keras.layers.Dense(10, activation='softmax'), +]) + +model = tf.keras.Sequential([conv_layers, dense_layers]) +``` + +**Creating TensorFlow Functions for training and inference** + +With TensorFlow 2.0, the recommended way to use TensorFlow Lite is: Define a +TensorFlow function for each invocable behavior in TensorFlow Lite. The +converter will convert each TensorFlow function to a TensorFlow Lite subgraph, +and users can choose subgraphs to invoke. + +The following code defines `inference` and `train` TensorFlow functions. + +```python +@tf.function(input_signature=[ + tf.TensorSpec(shape=[None, 64, 64, 3], dtype=tf.float32)]) +def inference(x): + return model(x) + +_LOSS_FN = tf.keras.losses.mean_squared_error +_OPTIMIZER = tf.optimizers.RMSprop() +@tf.function(input_signature=[ + tf.TensorSpec(shape=[None, 64, 64, 3], dtype=tf.float32), + tf.TensorSpec(shape=[None, 10], dtype=tf.float32), +]) +def train(x, y): + with tf.GradientTape() as tape: + prediction = model(x) + loss = _LOSS_FN(prediction, y) + gradients = tape.gradient(loss, model.trainable_variables) + _OPTIMIZER.apply_gradients(zip(gradients, model.trainable_variables)) +``` + +The `train` function can be used to train the model offline. 
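+For example, a minimal offline training loop over synthetic data could look
+like the sketch below. It assumes the `model` and `train` definitions above and
+a TensorFlow 2.x installation, and is for illustration only.
+
+```python
+import numpy as np
+
+# Drive the `train` function defined above with synthetic data. This is plain
+# TF2 eager execution; the same concrete function is what would later be
+# converted into a TensorFlow Lite training subgraph.
+_BATCH_SIZE = 8
+for _ in range(10):
+  x = np.random.rand(_BATCH_SIZE, 64, 64, 3).astype(np.float32)
+  y = np.random.rand(_BATCH_SIZE, 10).astype(np.float32)
+  train(x, y)
+```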
For retraining only the dense layers, define another `train_dense_layers`
+TensorFlow function:
+
+```python
+@tf.function(input_signature=[
+    tf.TensorSpec(shape=[None, 64, 64, 3], dtype=tf.float32),
+    tf.TensorSpec(shape=[None, 10], dtype=tf.float32),
+])
+def train_dense_layers(x, y):
+  activation = conv_layers(x)
+  # Note: Gradient Tape is calculated only over the dense layers.
+  with tf.GradientTape() as tape:
+    prediction = dense_layers(activation)
+    loss = _LOSS_FN(prediction, y)
+  # Note: Gradients are only applied to trainable variables in dense layers.
+  gradients = tape.gradient(loss, dense_layers.trainable_variables)
+  _OPTIMIZER.apply_gradients(zip(gradients, dense_layers.trainable_variables))
+```
+
+Note that the `tf.GradientTape` is computed over only the dense layers, and the
+gradients are applied only to `dense_layers.trainable_variables`.
+
+Though this example is simple, it’s easy to extend it to support complex use
+cases. For example:
+
+* To retrain dense layers from scratch instead of fine-tuning, define a
+  `reset_dense_weights` function to reinitialize dense layer weights to zero
+  or small random values
+* Return the loss from the training function
+* Add a function for evaluating model quality
+* Add dropout layers which only exist in the training graph, etc.
+
+### Converting the Training Model to TensorFlow Lite
+
+After training the model, users can choose a few TensorFlow functions to convert
+to TensorFlow Lite. The TensorFlow Lite Converter 2.0 will expose an API that
+looks like:
+
+```python
+tf.lite.TFLiteConverter.from_concrete_functions(
+    [func.get_concrete_function()
+     for func in [inference, train_dense_layers]])
+```
+
+In this case, we only need to run inference or train the dense layers, so only
+the `inference` and `train_dense_layers` functions are exported. If it’s
+required to train the whole model on device, add the `train` function when
+converting.
+
+### Using the Model in TensorFlow Lite runtime
+
+There’s an ongoing design effort to enable users to call different subgraphs in
+a TensorFlow Lite model. Tentatively the usage may look like:
+
+```
+Interpreter interpreter = Interpreter(...);
+{
+  auto subgraph = interpreter.SelectSubgraph("train_dense_layers");
+  subgraph.SetInputTensor(0, training_feature);
+  subgraph.SetInputTensor(1, training_label);
+  subgraph.Invoke();
+}
+{
+  auto subgraph = interpreter.SelectSubgraph("inference");
+  subgraph.SetInputTensor(0, inference_feature);
+  subgraph.Invoke();
+  auto result = subgraph.GetOutputTensor(0);
+}
+```
+
+Regardless of what the new API looks like, users should be able to choose to
+run the `train_dense_layers` or `inference` subgraph.
+
+### Testing TensorFlow Lite Model correctness
+
+On-device training is complex and it can be hard to troubleshoot. TensorFlow
+Lite should provide guidelines and tools to make this easy. The core idea is: if
+each TensorFlow function is converted to a TensorFlow Lite subgraph without
+changing semantics, it will be easy to test TensorFlow and TensorFlow Lite
+functionality side by side.
+
+In addition to the common ML testing best practices, there are a few specific
+suggestions for TensorFlow Lite on-device learning:
+
+* **Test all functions in TensorFlow** before converting to TensorFlow Lite.
+  In this use case, in addition to testing `inference` and `train`, also test
+  `train_dense_layers`, since it’s used on device. If it doesn’t work well in
+  TensorFlow, it can’t possibly work well in TensorFlow Lite.
  Check if there are any coding errors, or whether you need to tune the model
+  structure or hyperparameters.
+* **Use the model diff testing tool to test all functions**. TensorFlow Lite
+  now has a tool which feeds the same random data into TensorFlow and
+  TensorFlow Lite, and compares whether the results are close enough within a
+  threshold. The tool should be extended to handle multiple subgraphs
+  (including training and inference subgraphs) and variables. This is an
+  end-to-end test which can capture kernel bugs or TensorFlow/TensorFlow Lite
+  discrepancies.
+
+## Implementation details for basic training
+
+There is nothing magical in training. Technically, training performs a series
+of **mathematical computations** (requiring the corresponding op kernels) and
+updates the weight **variables**, which are **accessible by other (inference)
+subgraphs**. The converter also needs to be changed to support and optimize
+training use cases.
+
+### Sharing variables across subgraphs
+
+For the variable requirements of TensorFlow Lite training, the high-level goals
+are:
+
+* Convert TensorFlow Resource Variables to TensorFlow Lite, and preserve the
+  same semantics
+* Be able to share variables between multiple TensorFlow Lite subgraphs
+
+We propose to define variables in TFLite that have similar semantics to resource
+variables in TensorFlow:
+
+* Use an int32 tensor to represent the variable ID in TFLite instead of
+  defining a resource type
+* Only 2 ops are required to make it work: `AssignVariableOp` and
+  `ReadVariableOp`
+* Other ops can be added into TFLite when necessary
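+The TensorFlow 2 semantics these TFLite variables need to mirror can be seen in
+a small standard TF2 example, where a single resource variable is read and
+written by two different `tf.function`s (the analogue of two TFLite subgraphs
+sharing one variable). This is existing TensorFlow behavior, not the proposed
+TFLite API:
+
+```python
+import tensorflow as tf
+
+# One resource variable shared by two functions, analogous to one TFLite
+# variable shared by a training subgraph and an inference subgraph.
+counter = tf.Variable(0.0)
+
+@tf.function
+def update(delta):
+  # ReadVariableOp + AssignVariableOp under the hood.
+  counter.assign(counter + delta)
+
+@tf.function
+def read():
+  # ReadVariableOp plus an Add.
+  return counter + 1.0
+
+update(tf.constant(2.0))
+print(read().numpy())  # 3.0
+```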
+
+### Gradient and optimizer operations
+
+To train a model, it’s required to compute **gradients** and apply them to
+trainable variables with **optimizers** (e.g. SGD, Adam, etc.). Technically,
+gradients and optimizers are just mathematical computations and variable
+read/write operations. When constructing training graphs in TensorFlow,
+sometimes a single fused op is used (e.g. `ReluGrad`), and sometimes multiple
+regular ops are produced in unfused form (e.g. `Mul`, `MatMul`, `Add`) to
+compute the gradient of one op.
+
+Today TensorFlow Lite doesn't have these training-specific ops. The following
+approaches are considered to run training ops in TensorFlow Lite:
+
+**Add training ops into Flex runtime (initial implementation)**
+
+Note: "Flex" is the code name for the
+[Select TensorFlow operators to use in TensorFlow Lite](https://www.tensorflow.org/lite/guide/ops_select)
+project, which enables using TensorFlow kernels in TensorFlow Lite directly.
+Throughout the document, it will be referred to as "Flex" for conciseness.
+
+This is the easiest approach: just add these training ops to the Flex whitelist.
+However, this means the Flex runtime is required for training (bloating the
+binary size), and Flex kernels are not optimized for mobile.
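+For reference, this is roughly how Flex fallback is enabled in the TF 2.x
+converter today, shown here with the `inference` function defined earlier; a
+training function whose gradient and optimizer ops are whitelisted into Flex
+would rely on the same `SELECT_TF_OPS` path. This sketch only illustrates the
+existing converter flags and is not new API proposed by this document.
+
+```python
+import tensorflow as tf
+
+converter = tf.lite.TFLiteConverter.from_concrete_functions(
+    [inference.get_concrete_function()])
+# Ops without TFLite builtin kernels fall back to the Flex (Select TF ops)
+# runtime; whitelisted training ops would use this same path.
+converter.target_spec.supported_ops = [
+    tf.lite.OpsSet.TFLITE_BUILTINS,
+    tf.lite.OpsSet.SELECT_TF_OPS,
+]
+tflite_model = converter.convert()
+```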
+
+**Implement fused ops as TensorFlow Lite builtin ops (optimization)**
+
+We can implement these fused ops as TensorFlow Lite builtin ops. This further
+enables optimizing for mobile devices (e.g. writing optimized SIMD CPU kernels,
+writing a GPU delegate, etc.). This may require a significant amount of work.
+
+**Unfusing fused gradient ops (alternative considered)**
+Another interesting thought: most ops’ gradients can be expressed by combining
+simple mathematical operations. For example, `ReluGrad` can be written using
+the TensorFlow Lite builtin ops `Less` and `Where`.
+
+This can be done by writing graph transformation rules in the TensorFlow Lite
+converter. However, the performance may not be as good as with fused ops. It’s
+more worthwhile to put the effort into writing fused builtin ops.
+
+### TensorFlow Lite Converter
+
+Supporting training requires a few changes in the TensorFlow Lite Converter.
+The API change to support multiple functions is already discussed in the
+example section above.
+
+In addition, the TensorFlow Lite converter **should stop freezing variables
+which are written by exported functions**. Currently, the converter always
+freezes all variables when converting (all variables are converted to
+constants). This is fine for the inference-only use case. For training, these
+variables should not be converted to constants, so TensorFlow Lite can further
+modify the variables on device.
+
+In the example used in the “User Experience” section above:
+
+* If only the `inference` function is exported, all weights should be frozen.
+* If `train_dense_layers` is exported, only the `conv_layers` variables should
+  be frozen.
+* If `train` is exported, no variables should be frozen.
+
+Freezing has another important purpose: once a variable becomes a constant, it
+can be further optimized by constant folding. The implementation described so
+far makes on-device training work, but **inference will be slower than the
+original graph**. This leads to the next topic: how to **optimize both training
+and inference** when training is enabled.
+
+## Implementation details for related areas
+
+### Graph Transformation and Optimization
+
+For optimization purposes, TensorFlow Lite and TensorFlow sometimes have
+different op semantics. Some ops take transposed weights (e.g. `FullyConnected`
+and `Conv2D`), and some weights are split or merged. For example, a TensorFlow
+`MatMul` op will be converted to a TensorFlow Lite `FullyConnected` op, with
+the 2nd input transposed:
+
+```
+TFMatMul(lhs, rhs) == TFLiteFullyConnected(lhs, transpose(rhs))
+```
+
+In practice, the 2nd input of `MatMul` is often a trained weight variable. The
+TensorFlow Lite converter will first convert TensorFlow ops to TensorFlow Lite
+ops and add ops like `Transpose` or `ReorderAxis` in the same transformation
+rule. If the weight is converted to a constant, the TensorFlow Lite converter
+can further perform constant folding and optimize the execution:
+
+![Graph transformation with constant folding](20190722-tflite-training/trans1.png)
+
+However, the constant folding step can’t be done if the weights are not frozen,
+and the `Transpose` op will remain in the graph. This means **enabling training
+may significantly slow down inference** if we don’t optimize this case.
+
+Ideally we want to optimize the inference subgraph to make the structure the
+same as when training is disabled. At a high level, one way to achieve this is:
+
+* Allow the converter to define a new `inference_weights` variable, which may
+  be different from `weights`
+* Define a subgraph to initialize `inference_weights` from `weights`, starting
+  from a simple assignment
+* Implement specialized graph transformation rules to move complexity from the
+  inference graph to the initialization graph
+
+![Moving complexity to the initialization graph](20190722-tflite-training/trans2.png)
+
+### Control Flow
+
+Control flow is not required for basic training.
However it's useful in the +following cases: + +* **Train models with control flow**: For a model with control flow, control + flow ops are required for both inference and training. E.g. It requires `If` + to compute the gradient of `If`, and it requires `While` loop to compute the + gradient of `While`. +* **Run training loops** inside TensorFlow Lite models: To train a model, it's + usually required to invoke the training step multiple times. It's doable by + writing a loop in application code to invoke the training subgraph multiple + times. It can also be implemented by using a `While` op to iterate over the + training data. + +The detailed design of supporting control flow in TensorFlow Lite is at +[This RFC](https://github.com/tensorflow/community/pull/83). + +### Tensor List + +Tensor List is not required for basic training. However it's important for +training models with control flow loops. + +Whenever there are loops (e.g. While) in the inference graph, Tensor List is +usually required to train the model. Even if Tensor List ops aren't in the +original inference graph, when constructing the training graph, Tensor List ops +are often added automatically to memorize intermediate values, for computing +gradients over the loop. + +Tensor Lists can also be used without loops but this rarely happens. In this +case, Tensor List ops are required for inference and training. + +### Saving Format + +In TensorFlow, users can use `tf.Session` to train a model, and use `tf.train.Saver` to save the trained weights into TensorFlow checkpoint files. + +This proposal enables training with TensorFlow Lite interpreter (which is similar to `tf.Session`). In addition, we can provide utility classes (similar to `tf.train.Saver`) to save the trained format, or users can get the raw data of variables and save it by themselves. + +A few approaches to handle variable saving are described below. In the near term we will focus on the recommended approach. However, note that these approaches are not conflicting with each other, so it's possible to implement multiple of these. + +**Approach 1: Reuse TensorFlow CheckPoint format [recommended]** + +Pros: +* Interoperability: It's potentially easier to load a TensorFlow trained weights into TensorFlow Lite, and vice versa. +* Avoid designing yet another format. + +Cons: +* Currently the TensorFlow checkpoint parsing code is coupled with TensorFlow core. We may need to refactor and decouple with TensorFlow core, or rewrite the parser. + +**Approach 2: Write the variable value back to TensorFlow Lite model** + +Pros: +* Use the existing TensorFlow Lite format. Avoid introducing another format. + +Cons: +* Need to ship the FlatBuffer writer code into the binary. +* Have to write the entire model (which may contain other huge frozen constants) back to storage. +* Low interoperability with TensorFlow checkpoints + +**Approach 3: Implement TensorFlow Lite's own variable saving file format** + +Pros: +* This can be an extremely simple key-value mapping format + +Cons: +* Requires defining yet another format +* Low interoperability with TensorFlow checkpoints + +### Feeding data with tf.Example / tf.data.Dataset + +`tf.Example` and `tf.data.Dataset` are technically not training-specific +requirements. These features can also be used in inference. + +The basic training can work by feeding the raw data (e.g. float values) into +TensorFlow Lite interpreter. On top of this, we can support tf.Example and +tf.data.Dataset to make this easier to use (likely via Flex runtime). 
There may be a performance drawback when using these features.
+
+### Selective Registration
+
+Initially we will need to rely on the Flex runtime to execute some of the fused
+gradient ops. This requires adding the gradient op kernels to the Flex runtime,
+which will bloat the Flex binary size.
+
+We also plan to implement fused training ops as TensorFlow Lite builtin ops.
+These kernels are not useful if training features are not used.
+
+We aim to keep TensorFlow Lite small, and developers who only run inference
+shouldn’t get gradient kernels. This makes selective registration (being able to
+link only the required op kernels) more important.
+
+This can be done in either of two ways:
+
+* Coarse-grained: Have two build targets for each of the TensorFlow Lite / Flex
+  libraries: one inference-only, and one inference+training
+* Fine-grained: Only include the op kernels which are exactly used.
+
+The coarse-grained approach is good enough for the initial version of training.
+Fine-grained selective registration is a nice improvement for the future.
diff --git a/rfcs/20190722-tflite-training/roadmap.png b/rfcs/20190722-tflite-training/roadmap.png
new file mode 100644
index 000000000..398325bd0
Binary files /dev/null and b/rfcs/20190722-tflite-training/roadmap.png differ
diff --git a/rfcs/20190722-tflite-training/trans1.png b/rfcs/20190722-tflite-training/trans1.png
new file mode 100644
index 000000000..77d454543
Binary files /dev/null and b/rfcs/20190722-tflite-training/trans1.png differ
diff --git a/rfcs/20190722-tflite-training/trans2.png b/rfcs/20190722-tflite-training/trans2.png
new file mode 100644
index 000000000..0d5c9c047
Binary files /dev/null and b/rfcs/20190722-tflite-training/trans2.png differ
diff --git a/rfcs/20190726-custom-ops.md b/rfcs/20190726-custom-ops.md
new file mode 100644
index 000000000..fa2d543f5
--- /dev/null
+++ b/rfcs/20190726-custom-ops.md
@@ -0,0 +1,56 @@
+# Best practices for custom operations in TensorFlow
+
+| Status        | Accepted                                              |
+:-------------- |:---------------------------------------------------- |
+| **Author(s)** | Alexandre Passos (apassos@google.com)                 |
+| **Sponsor**   | Karmel Allison (karmel@google.com)                    |
+| **Updated**   | 2019-06-10                                            |
+
+For most of TF’s history, it was very expensive for third-party packages or
+libraries to release their own tf operations. This created pressure to put ops
+in tf core or in tf contrib, which created some uncertainty around support
+stories and backwards compatibility.
+
+Around (but technically not a part of) TF 2.0, however, TensorFlow supports [a
+straightforward way for third-party packages to build and deploy their own
+custom TF ops](https://github.com/tensorflow/custom-op/blob/master/README.md).
+To maintain a healthy ecosystem, we recommend the following best practices.
+
+## Experimental ops should live out of tree
+
+Unless some special considerations apply, experimental op development should not
+happen inside the core TensorFlow package. Strongly prefer adding experimental
+or new operations to libraries and packages downstream from core TensorFlow. Any
+op in core TensorFlow is subject to very strict backward and forward
+compatibility policies, as TensorFlow is very aggressive about not breaking
+existing GraphDefs, and this includes even meant-to-be experimental operations
+in the core TensorFlow package.
+ +Once things are no longer experimental, and once the TensorFlow team determines +it is ok with taking responsibility for the code, it’s fine to propose adding a +new version with the final intended interface and implementation to core +TensorFlow. The intermediate states are best explored in another package. + +This has many advantages: + - downstream packages often have a faster release cadence than core TensorFlow + - each downstream package can choose its own backward and forward compatibility + processes, allowing fine-grained trade-offs between velocity and stability + +## Out-of-tree ops must be namespaced + +Since an op’s name uniquely identifies it, different TF packages should ensure +their op names are globally unique across the entire TF ecosystem. To do so, +prepend the package’s name to the op’s name and separate with a ‘>’. An op named +“MatMul” inside the “tensorflow_addons” package should be named “Addons>MatMul”, +for example. + +The string used for a package’s component name is any valid op name, but should +be unique to the package. This allows different packages to experiment with ops +without needing a central coordinator to assign unique operation names. Failing +to use unique names will mean two packages are potentially incompatible. + +If a third-party-developed operation is to be integrated in TensorFlow core, it +should be renamed to have no prefix, creating a new op name, and removing any +risk of internal and external versions silently diverging. + + diff --git a/rfcs/20190802-model-garden-redesign.md b/rfcs/20190802-model-garden-redesign.md new file mode 100644 index 000000000..144da10d6 --- /dev/null +++ b/rfcs/20190802-model-garden-redesign.md @@ -0,0 +1,186 @@ +# TensorFlow Official Model Garden Redesign + +| Status | Accepted | +:-------------- |:---------------------------------------------------- | +| **Author(s)** | Jing Li (jingli@google.com), Hongkun Yu (hongkuny@google.com), Xiaodan Song (xiaodansong@google.com) | +| **Sponsor** | Edd Wilder-James (ewj@google.com) | +| **Updated** | 2019-08-02 | + +## Objective + +This document presents a proposal to redesign TensorFlow official model garden. +We aim to provide a central and reliable place to contain popular examples, +state-of-the-art models and tutorials to demonstrate the best practice in TF2.0 +and illustrate real-world use cases. + +## Motivation + +The current [TF official model garden](https://github.com/tensorflow/models/tree/master/official) +mainly has ad hoc support. Example models are implemented using mixed TensorFlow +APIs in different coding styles and some of them have convergence and/or +performance regression. With TensorFlow 2.0 launch, there’s a great desire to +provide tensorflow users a clear and central place to showcase reliable TF2.0 +models with the best practices to follow. + +We want to take this opportunity to substantially improve the state of the +official model garden, and provide seamlessly end-to-end training and inference +user experience on a wide range of accelerators and mobile device chips. We hope +to encourage community to contribute innovations and improve TensorFlow +efficiency and usability. + +## User Benefit + +We aim to provide the best modeling experience via this revamp effort: + +* Usability and reliability + * keep official models well-maintained and tested for both performance and + convergence. + * provide accessible model distribution via [TensorFlow Hub](https://www.tensorflow.org/hub) and share state-of-the-art research accomplishments. 
    * make training on both GPU and TPU an easy switch.
+    * provide reusable components for research and production.
+* End-to-end solutions
+    * provide seamless end-to-end training and inference solutions, where inference covers serving on TPU, GPU, mobile and edge devices.
+    * provide hyperparameter sets to tune models for various resource constraints.
+    * provide solutions with hyperparameters to scale model training to TPU pods or multi-worker GPUs.
+    * provide variants derived from standard models to tackle various practical tasks.
+
+## Design Proposal
+
+### Official model directory reorganization
+
+We are going to reorganize the official model directory to provide:
+
+* common libraries, mainly of two types:
+    * A common training utility library in TF2.0, with model configuration and
+      hyperparameter definitions in a consistent style.
+    * Model-category-related common libraries, e.g. primitives as basic
+      building blocks for NLP models, or common networks like resnet and
+      mobilenet. We will follow the fundamental design of Keras
+      layer/network/model to define and utilize model building blocks.
+
+    **NOTE:** we are still figuring out what level of building block extraction
+    would be the most useful and sharable during refactoring. Once we confirm
+    the implementation is really useful, we will move it to tensorflow/addons
+    and/or tf.text.
+
+* popular state-of-the-art (SOTA) models for end users as a product.
+* reference models for performance benchmark testing.
+    * For models provided as SOTA models, we will share the network and
+      modeling code, but have separate *main* modules. The main module for
+      benchmark testing will have additional flags and setups for performance
+      testing.
+
+The following table shows the detailed view of the proposed model directory
+structure. The SOTA model list will be updated to cover more categories.
+
+| Directory | Subdirectories | | Explanations |
+:-------------- |:---------------------|:--|:------------------------------ |
+| nlp | | | models/tasks for Natural Language Processing |
+| | modeling | | NLP modeling library |
+| | BERT | | |
+| | ALBERT | | |
+| | XLNET | | |
+| | Transformer | | |
+| | ... | | |
+| vision | | | models/tasks for Computer Vision |
+| | image_classification | | e.g. resnet, EfficientNet, ... |
+| | detection | | e.g. RetinaNet, Mask-RCNN, ... |
+| | ... | | |
+| recommendation | | | |
+| | NCF | | |
+| utils | | | Miscellaneous Utilities. |
+| | ... | | |
+| benchmarks | | | benchmark testing and reference models to validate TensorFlow |
+| staging | | | Utilities not in TF core yet, and not suitable for tf addons |
+| r1 | | | tf1.x models and utils |
+| | utils | | |
+| | resnet50 | | |
+| | transformer | | |
+| | wide_deep | | |
+| | boosted_trees | | |
+
+### Pretrained model repository
+
+We are going to provide pretrained models for research exploration and
+real-world application development. The plan is to integrate with
+[TensorFlow Hub](https://www.tensorflow.org/hub), where users can access Hub
+modules and SavedModels for pretrained checkpoints, along with links to the
+code in the model garden.
+
+### Convergence and Performance Testing
+
+We have a benchmark testing framework to execute continuous performance and
+accuracy tests for TensorFlow on different types of accelerators. All official
+TF2.0 models are required to provide accuracy tests, and these tests will be
+automatically expanded to performance tests for continuous regression testing
+and monitoring.
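+As an illustration of the kind of accuracy test every official model would be
+required to provide, a minimal sketch using `tf.test` is shown below. The
+model, dataset and threshold here are placeholders for illustration; real tests
+would target the model's published accuracy and run under the benchmark
+framework described above.
+
+```python
+import numpy as np
+import tensorflow as tf
+
+
+class MnistAccuracyTest(tf.test.TestCase):
+  """Sketch of an accuracy test; the model and threshold are placeholders."""
+
+  def test_reaches_reasonable_accuracy(self):
+    (x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
+    x_train = x_train[:5000].astype(np.float32) / 255.0
+    x_test = x_test.astype(np.float32) / 255.0
+    model = tf.keras.Sequential([
+        tf.keras.layers.Flatten(input_shape=(28, 28)),
+        tf.keras.layers.Dense(128, activation='relu'),
+        tf.keras.layers.Dense(10, activation='softmax'),
+    ])
+    model.compile(optimizer='adam',
+                  loss='sparse_categorical_crossentropy',
+                  metrics=['accuracy'])
+    model.fit(x_train, y_train[:5000], epochs=1, verbose=0)
+    _, accuracy = model.evaluate(x_test, y_test, verbose=0)
+    # A real test would pin this to the paper / SOTA number for the model.
+    self.assertGreater(accuracy, 0.85)
+
+
+if __name__ == '__main__':
+  tf.test.main()
+```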
+ +## Model Garden Sustainability + +### Model Launch Criteria +To ensure that official models are well-maintained and tested, we are going to enforce the following criteria for launching a new model in the official model garden, except for staging folder: + +* Follow the best practice guideline for each model category. +* Unit tests to verify the basics of the model. +* Integrate the model to benchmark testing to ensure model’s accuracy should be on par with the original paper / SOTA results. +* README with commands and procedures to reproduce the SOTA results, including: + * Input data generation if necessary + * Model execution, including all hyperparameters. + +### Community contribution and staging + +Due to fast ML development, we can’t possibly support all best-in-class models +up to date on our own. We highly encourage users to contribute to the official +model garden. After model garden refactoring (Phase 1), we plan to provide +a full list of wanted models to tensorflow community and encourage tensorflow +users to claim and contribute the models to the model garden. + +We have different requirements from unifying interface, supporting all the chips +and platforms and enabling benchmarks for reference models. Thus, we could have +different stages of models. As we may have immediate needs to add some quick +models for benchmark and debugging, we will provide a staging folder to host +some drafts of SOTA or popular models. Once the staging models can converge and +support major functionalities of standard official models, we can judge whether +they meet the launch standard and migrate to official models or migrate them to +benchmark references. + +### Maintenance and Deprecation + +Given the nature of this repository, old models may become less and less +useful to the community as time goes on. In order to keep the repository +sustainable, we will be performing bi-annual reviews of our models to ensure +everything still belongs to the repo. For models to be retired, the current plan +is to move them to the archive directory and these models won't run regression +tests to ensure the quality and convergence. + +The following details the policy for models in mature and staging phases: + +* Models graduated from staging subdirectory + + The models will be maintained by the model garden team. After we start to + accept community contributions, we will put the contributors as model owners. + + These models will have continuous convergence and performance testing to + make sure no regression. In general, we won’t deprecate these models unless: + * the model isn’t compatible with the TF APIs any more and have to be replaced by a new version + * a strictly better model shows up and the old model isn't needed by the community/market. + +* Models in staging: + The model garden team will do quarterly review to check the status with the + model contributors, such as: + * model convergence + * unit tests + * convergence tests + * coding style meets the TF2.0 best practice. + If there’s no further commitment to improve the status in next 90 days, we + will mark the model as deprecated, which is subject to be deleted. + +### Official Model Releases +We will do release for the model garden starting from TF 2.0. Unit tests and +regression tests need to pass against the TF release. Deprecated models will be +removed from the release branch. + +We will also create pip package per release version. 
+ +## Milestones + +| Phases | Milestones | Notes | +|:-------- |:-----------------| :----------------------| +| Phase_1 | 1. Finished directory reorganization. 2. Add common modeling library. 3. Have 2-3 SOTA models for both NLP and Vision. | Not accepting community contributions during refactorization.| +| Phase_2 | Expand repository to cover more model types| Will accept community contributions on the solicited model list.| diff --git a/rfcs/20190814-kernel-and-op-registration.md b/rfcs/20190814-kernel-and-op-registration.md new file mode 100644 index 000000000..155d39b03 --- /dev/null +++ b/rfcs/20190814-kernel-and-op-registration.md @@ -0,0 +1,295 @@ +# Kernel and Op Implementation and Registration API + +| Status | Accepted | +:-------------- |:---------------------------------------------------- | +| **Author(s)** | James Ring (sjr@google.com). | +| **Sponsor** | Günhan Gülsoy (gunan@google.com) | +| **Updated** | 2020-06-02 | + +## Objective + +Tensorflow (TF) currently provides a C++ API for implementing kernels and ops. +The Voltron project aims to create a modular/plugin-based TF implementation with +API and ABI surfaces. Plugins will be able to create and register custom kernel +and op implementations. + +In order to provide a stable ABI, the Voltron team has chosen to provide C APIs +to plugin authors. This document introduces the C API for op and kernel +registration. For authors who wish to continue using C++ to interface with +TensorFlow, an ABI-stable C++ header-only API is provided. + +## Motivation + +Presently, there is no ABI-stable API for extending TensorFlow with new kernels +and ops. There is no guarantee that a plugin written with one compiler will work +with a version of TensorFlow built with another, even on the same operating +system and architecture. This makes it difficult to distribute plugins without +also distributing the source code and requiring end-users to build the plugin +alongside TensorFlow. + +An ABI-stable API for extending TensorFlow will simplify the distribution of +plugins and allow plugin authors to distribute binary artifacts without +necessarily publishing plugin source code. + +## User Benefit + +Plugin authors will be able to publish plugins that users can use more easily. +In turn, the TensorFlow community will benefit from an increase in the number of +variety of available plugins. + +## Design Overview + +In general, the kernel and op registration C APIs aim to permit the +implementation of any kernel or op that is currently possible with the C++ API. +Where possible, existing C++ function implementations are reused from within a C +wrapper. The purpose of the wrapper is simply to provide ABI stability. + +Since plugins will be dynamically loaded (e.g. via `dlopen` on POSIX), the API +avoids relying on static initialization. + +The intention is that existing kernels should be able to be ported to the new +APIs with a minimum of reimplementation effort. This precludes a from-scratch +re-imagining of TensorFlow APIs. + +The following diagram describes the components built with the proposed C and C++ +APIs. 
+ + +----------------+ <--+ + | | | + | Plugin | | + | | | + +----------------+ | + | | | + | C++ header API | | Plugin + | | | my_plugin.so + +--> +----------------+ | + | | | | + | | C API headers | | + | | | | + | +----------------+ <--+ + | | | + | | C API impl | + Core | | | + Tensorflow | +----------------+ + libtf.so | | | + | | Core C++ APIs | + | | | + +--> +----------------+ + +In this example, there are two object files: `my_plugin.so` and +`libtensorflow.so`. `my_plugin.so` is implemented in terms of the C++ +header-only API, which is in turn implemented in terms of the C API headers. The +C API implementation is provided by TensorFlow at runtime when it loads the +plugin's shared object. + +This design addresses changes that are required to the existing C API that are +required to support op and kernel plugins. It also introduces the C++ +header-only API, which currently does not exist. + +## Ops + +This section introduces changes to the C API that are required to support ops. +An alpha version of this API is already checked in at `tensorflow/c/ops.h`. + +### Registration + +In the C++ API, ops are registered at static initialization time using the +`REGISTER_OP` macro. For example: + +```c++ +REGISTER_OP("Bitcast") + .Input("input: T") + .Output("output: type") + .Attr("T: {bfloat16, ...}") + .Attr("type: {bfloat16, ...}") + .SetShapeFn([](InferenceContext* ctx) { ... }) + .Doc("A bitcast operator"); +``` + +The equivalent C API will be a series of functions that operate on +`TF_OpDefinitionBuilder *`, a pointer to an opaque struct (i.e. a struct whose +content is not made known to the user). The functions include, but are not +limited to: + +* `TF_OpDefinitionBuilder* TF_NewOpDefinitionBuilder(const char* op_name)`: + constructs and returns a new op registration builder for an op with the given + name + +* `void TF_OpDefinitionBuilderAddAttr(TF_OpDefinitionBuilder* builder, const + char* attr)`: adds the given attribute to the builder (equivalent to `Attr` + above) + +* `void TF_OpDefinitionBuilderAddInput(TF_OpDefinitionBuilder* builder, const + char* input)`: adds the given input to the builder (equivalent to `Input` + above) + +Additional functions are provided for setting other properties of the operation +(e.g. `TF_OpDefinitionBuilderSetIsCommutative`). + +Registration is then actually performed using the `TF_RegisterOpDefinition` +function. This function populates a `TF_Status` indicating whether registration +was successful and frees the resources associated with the op definition +builder. + +The C equivalent of the bitcast op registration example above is shown below: + +```c++ + +#include "tensorflow/c/ops.h" + +void InferBitcastShape(TF_ShapeInferenceContext* ctx, // see the section below on + TF_Status* status); // shape inference + +void InitPlugin() { + TF_OpDefinitionBuilder* b = TF_NewOpDefinitionBuilder("Bitcast"); + TF_OpDefinitionBuilderAddInput(b, "input: T"); + TF_OpDefinitionBuilderAddOutput(b, "output: type"); + TF_OpDefinitionBuilderAddAttr(b, "T: {bfloat16, ...}"); + TF_OpDefinitionBuilderAddAttr(b, "type: {bfloat16, ...}"); + TF_OpDefinitionBuilderSetShapeInferenceFunction(b, &InferBitcastShape); + + TF_Status* status = TF_NewStatus(); + TF_RegisterOpDefinition(b, status); + if (TF_GetCode(status) != TF_OK) { /* handle errors */ } +} + +``` + +### Shape Inference + +A significant feature of certain ops is their ability to infer their output +shapes. 
TensorFlow will invoke the registered shape inference function (if one +is provided) when it needs to know the op's output shape. The registration +function declaration is shown below: + + +```c++ +void TF_OpDefinitionBuilderSetShapeInferenceFunction( + TF_OpDefinitionBuilder* builder, + void (*shape_inference_func)(TF_ShapeInferenceContext* ctx, TF_Status* status)); +``` + +A series of functions prefixed with `TF_ShapeInferenceContext` is provided for +the following purposes: + +* Examining operator input shapes (`TF_ShapeInferenceContextGetInput`) + +* Creating and deleting shape and dimension handles (`TF_{New,Delete}ShapeHandle`, `TF_{New,Delete}DimensionHandle`) + +* Manipulating shape and dimension handles (`TF_ShapeInferenceContextWithRank`, `TF_ShapeInferenceContextDim`) + +In general, C analogues to the C++ methods in `tensorflow::shape_inference` +(see `tensorflow/core/framework/shape_inference.h`) will be provided. + +## Kernels + +This section introduces changes to the C API that are required to support +kernels. An alpha version of this API is already checked in at +`tensorflow/c/kernels.h`. + +### Registration + +Kernel registration with the C++ API is accomplished with the +`REGISTER_KERNEL_BUILDER` macro. This macro expands to code that relies on +static initialization to register the provided kernel with the global kernel +registry. See below for an example of registering a kernel with the C++ API: + +```c++ + +#include "tensorflow/core/framework/op_kernel.h" + +class BitcastOp : public OpKernel { + explicit BitcastOp(OpKernelConstruction* context) : OpKernel(context) { … } + void Compute(OpKernelContext* context) override { … } +}; + +REGISTER_KERNEL_BUILDER(Name("Bitcast").Device(DEVICE_CPU), BitcastOp) +``` + +The equivalent C API provides a series of functions that operate on +`TF_KernelBuilder`, an opaque struct obtained with the `TF_NewKernelBuilder` call. +The kernel builder is registered with TensorFlow using the +`TF_RegisterKernelBuilder` function. See below for an example of registering +the bitcast kernel using the C API: + +```c++ +#include "tensorflow/c/kernels.h" + +typedef struct bitcast_kernel { … } bitcast_kernel; + +// Bitcast_Create, Bitcast_Compute and Bitcast_Delete actually implement the +// kernel. See the section below for discussion on kernel implementation. +static void* Bitcast_Create(TF_OpKernelConstruction* context) { + bitcast_kernel* k = (bitcast_kernel*) calloc(1, sizeof(bitcast_kernel)); + /* initialize the fields of k as needed */ + return (void*) k; +} + +static void* Bitcast_Compute(void* k, TF_OpKernelContext* context) { + bitcast_kernel* kernel = (bitcast_kernel*) k; // this is the pointer returned by + // Bitcast_Create + /* compute the result */ + TF_SetOutput(context, ...); +} + +static void Bitcast_Delete(void *k) { free(k); } + +void InitPlugin() { + TF_KernelBuilder* builder = TF_NewKernelBuilder(/*op_name*/"Bitcast", DEVICE_CPU, + &Bitcast_Create, &Bitcast_Compute, &Bitcast_Delete); + TF_Status* status = TF_NewStatus(); + TF_RegisterKernelBuilder(/*kernel_name*/"Bitcast", builder, status); + if (TF_GetCode(status) != TF_OK) { /* handle errors */ } + TF_DeleteStatus(status); +} +``` + +The registration function prototypes are provided below. Kernel authors must +provide a compute function. Creation and deletion functions are optional, but +if a creation function is provided that causes memory allocation, a deletion +function that frees the memory should also be provided, otherwise a leak will +occur. 

```c++
TF_KernelBuilder* TF_NewKernelBuilder(
    const char* op_name, const char* device_name,
    void* (*create_func)(TF_OpKernelConstruction*),
    void (*compute_func)(void*, TF_OpKernelContext*),
    void (*delete_func)(void*));

void TF_RegisterKernelBuilder(const char* name, TF_KernelBuilder* builder,
                              TF_Status* status);
```

### Implementation

The main classes for C++ kernel implementations are `OpKernelConstruction`
(provided by TensorFlow to the kernel constructor) and `OpKernelContext`
(provided to the kernel's `Compute` method). The analogues in the C API are
`TF_OpKernelConstruction` and `TF_OpKernelContext`. The aim of the C API is to
provide functions for working with these structs that match, as closely as
possible, the C++ API.

### Inputs and Outputs

Kernels must be able to retrieve their inputs and provide outputs. In the C++
API, the `tensorflow::OpKernelContext::GetInput` and `SetOutput` family of
functions provide this functionality. The equivalent C calls will be
`TF_GetInput` and `TF_SetOutput`. These functions operate on `TF_Tensor`, which
is already part of the existing TensorFlow C API.

String tensors will be supported in an ABI-stable way. This will require
changes to their binary representation described in the [tstring design
document](https://github.com/tensorflow/community/blob/master/rfcs/20190411-string-unification.md).

## C++ Header-Only API

As described above, the main motivation for providing a C API is ABI stability.
However, some programmers may find the C API less convenient than the
non-ABI-stable C++ API. To address this concern, we plan to provide a
header-only C++ API that is implemented in terms of the ABI-stable C API. This
API will contain classes such as `Tensor`, `OpKernelContext`, and
`OpKernelConstruction`, whose names will be familiar to existing C++ API users.
Ideally, this API will be as close as possible to the existing non-ABI-stable
TensorFlow C++ API, so that kernels and ops currently implemented in C++ may be
ported to the ABI-stable C++ API with as little implementation churn as possible.
diff --git a/rfcs/20190814-kernel-and-op-registration/device_api_overview.png b/rfcs/20190814-kernel-and-op-registration/device_api_overview.png
new file mode 100644
index 000000000..8803ead79
Binary files /dev/null and b/rfcs/20190814-kernel-and-op-registration/device_api_overview.png differ
diff --git a/rfcs/20190815-tfdbg-v2-callbacks.md b/rfcs/20190815-tfdbg-v2-callbacks.md
new file mode 100644
index 000000000..e9d7d5fb1
--- /dev/null
+++ b/rfcs/20190815-tfdbg-v2-callbacks.md
@@ -0,0 +1,414 @@
# TensorFlow Debugger v2: Callbacks for Eager Execution and `tf.function`s

| Status        | Accepted                                              |
:-------------- |:---------------------------------------------------- |
| **Author(s)** | Shanqing Cai (cais@google.com)                        |
| **Sponsor**   | Alexandre Passos (apassos@google.com)                 |
| **Updated**   | 2019-08-15                                            |

## Objective

This RFC presents an API-level design for how to achieve debugging
instrumentation of [eager execution](https://www.tensorflow.org/guide/eager) and
[`tf.function`](https://www.tensorflow.org/beta/guide/autograph) in TensorFlow 2
("TF2" hereafter).
This will enable users to "hook into" the following types of
events in TF2 with a unified API:

  - Eager execution of an operation ("op" hereafter) at runtime
  - Creation of a symbolic op at graph-construction time, i.e., when a
    user-defined Python function is transformed into a graph (FuncGraph) with
    the
    [tf.function](https://www.tensorflow.org/beta/tutorials/eager/tf_function)
    API.
  - Runtime execution of FuncGraphs.

## Motivation

Such "hooks" will allow both observation and overriding of TF ops' outgoing
tensors, including concrete `EagerTensor`s and symbolic `Tensor`s (see details
below). This is a foundational part of the effort to bring [TensorFlow Debugger
(tfdbg)](https://www.tensorflow.org/guide/debugger) up-to-date with TF2's
execution paradigm.

Currently, tfdbg is compatible with only the `tf.Session` API of TensorFlow 1.x.
However, users of TF2 have raised questions and issues that indicate needs for
a dedicated debugger, needs that cannot be easily met by a generic Python
debugger such as pdb. The most common examples of such needs involve
finding the source of numeric instability issues like NaNs (e.g., see this [GitHub
issue](https://github.com/tensorflow/tensorflow/issues/26543) and this [StackOverflow
question](https://stackoverflow.com/questions/55823557/converting-darknet53-gives-nan-results-in-tensorflow-2-0)).

## Design Proposal

To expand on the aims listed in the Objective section, the proposed API will enable
three key capabilities:

  - **Capability A**: The ability to intercept eagerly-executing TF ops.
    Specifically, we want a way to register one or more callbacks that are
    invoked immediately after an `EagerTensor` has been computed through the
    eager execution of an op. The callback should provide visibility into the
    input and output `EagerTensor`s of the execution. If so desired, the
    callback may _override_ the output `EagerTensor`s and thereby transparently
    affect downstream eager execution.
  - **Capability B**: The ability to intercept the creation of symbolic ops
    during function-to-graph conversion in `tf.function`, including the cases
    assisted by [AutoGraph](https://www.tensorflow.org/guide/autograph). This
    will form the basis for simulated stepping through lines of
    `tf.function`-decorated Python functions to assist debugging of graph
    construction in TF2. As in Capability A, the callbacks should be able to
    override the output symbolic tensors of the op in order to affect all
    downstream graph ops.
  - **Capability C**: Similar to Capability A above, we want the ability
    to intercept the runtime execution of FuncGraphs. Note that this
    requirement could be folded into Capability A, if FuncGraphs are
    regarded as a special type of op.

Capability B will enable the interception of TF ops executing inside FuncGraphs
at _runtime_. Although this is not listed explicitly as a desired capability, it
is critical to runtime debugging in tfdbg v2. Similar to Capability A above, we
want access to all the intermediate tensor values computed in a FuncGraph.

### Design: Debug Callbacks for Op Instrumentation

The following API meets the requirements listed above. The code below shows the
API and the detailed signature and semantics of the callbacks that can be passed
to the API.

```python
# Exported publicly as: tf.debugging.op_callback()
def op_callback(callback_fn):
  """Intercepts op execution and op creation.

  The `callback_fn` will be invoked immediately after any of the three types
  of events:
  - The execution of an op under eager mode,
  - The execution of a FuncGraph under eager mode,
  - The creation of an op during graph construction (e.g., in
    @tf.function-decorated Python functions).

  Args:
    callback_fn: A callback function with the following signature:
      def callback_fn(op_type,
                      inputs,
                      attrs,
                      outputs,
                      op_name=None,
                      graph=None):
        # op_type: The type of the op, as a string. E.g., "MatMul".
        #          For the special case of FuncGraph execution, op_type
        #          takes the name of the FuncGraph, e.g.,
        #          "__inference_my_func_24".
        # inputs: (`tuple` of `Tensor`s) Input tensors to the op or the
        #         FuncGraph.
        #         In eager execution, these are `EagerTensor`s.
        #         In graph construction, these are non-eager `Tensor`s
        #         that form the inputs to the just-created op.
        # attrs: The attributes of the op or FuncGraph of which the execution
        #        or creation caused the current invocation of the callback.
        #        This is applicable to both eager- and graph-based execution,
        #        as well as graph construction.
        #        This is a tuple of alternating attribute keys and attribute
        #        values, e.g., `('adjoint_a', False, 'adjoint_b', False)`.
        # outputs: (`tuple` of `Tensor`s) Output tensors from the op or
        #          FuncGraph.
        #          In eager execution, these are `EagerTensor`s.
        #          In graph construction, these are non-eager `Tensor`s that
        #          are the outputs of the just-created op.
        # op_name: Name of the op or FuncGraph.
        #          If the current invocation of the callback is due to the
        #          eager execution of an op, this will be `None` as op names
        #          are meaningless in eager execution.
        #          If this invocation of the callback is due to the
        #          eager execution of a FuncGraph, this will be the
        #          internally-generated name of the FuncGraph.
        #          In graph construction, this is the name of the op.
        # graph: The graph that the op belongs to (if any).
        #        In eager execution of an op, this is `None`.
        #        In eager execution of a FuncGraph, this is the FuncGraph
        #        object itself.
        #        In graph construction, this is the op's containing graph.
        #
        # Return values:
        #   This callback function is expected to return a `list` or `tuple`
        #   of `Tensor`s, with its length matching `len(outputs)`, in the order
        #   that corresponds to that of the `outputs` argument.
        #   In eager execution, these returned `Tensor`s should be
        #   `EagerTensor`s. Their values will replace the original values of
        #   `outputs` for downstream eager execution.
        #   In graph construction, these returned `Tensor`s should be
        #   non-eager `Tensor`s. Their values will replace the original
        #   `outputs` for downstream graph construction.

  Returns:
    A thread-local context manager. Within the scope of the context
    manager, all eager op/graph execution and graph op construction
    will invoke `callback_fn`.

  Raises:
    ValueError: If `callback_fn` does not return a `list` or `tuple`
      of `Tensor`s with the same length as the `outputs` argument passed to it.
  """
```

This API follows the style of TF2's op API, namely a style that **unifies eager
and graph modes**. In TF2, the same op API (e.g., `tf.matmul`) executes
differently depending on where it is called. If it is called in an eager
context, it will execute eagerly, consuming `EagerTensor`s and generating new
`EagerTensor`s as outputs. If called in a graph context (i.e., in a
`tf.function`), it will construct a new symbolic op by consuming inbound symbolic
Tensors.
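To make this duality concrete, here is a small illustrative snippet (not part of
the proposed API; `tf.matmul` simply stands in for any TF op):

```python
import tensorflow as tf

a = tf.constant([[1.0, 2.0]])
b = tf.constant([[3.0], [4.0]])

# Eager context: tf.matmul consumes EagerTensors and immediately returns a new
# EagerTensor holding the result.
eager_result = tf.matmul(a, b)  # EagerTensor with value [[11.0]]

@tf.function
def matmul_fn(x, y):
  # While this function is being traced into a FuncGraph, the same tf.matmul
  # call creates a symbolic MatMul op and returns a symbolic Tensor.
  return tf.matmul(x, y)

# Calling the tf.function traces the FuncGraph (symbolic ops) and then
# executes that FuncGraph to produce an EagerTensor.
graph_result = matmul_fn(a, b)
```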
Our proposal captures both cases with a single API. + +The proposed API achieves Capabilities A and B listed above, due to the fact +that the callback(s) registered with the API will be invoked for both eager op +execution and graph op creation. + +Capability C is also met by this API, due to the fact that the callbacks are +invoked not only during the eager execution of ops, but also the execution +of FuncGraphs. In this paradigm, when a FuncGraph has been constructed and +undergoes execution by TF2, it is simply treated as a special type of op. + +The example code snippet below shows how Capability A is met, i.e., how eager +execution of ops is intercepted by the callback mechanism. + +```python +def my_callback(op_type, inputs, attrs, outputs, op_name=None, graph=None): + # Do something with any of the arguments. The author of the callback may: + # - Log any information contained in the callback's input arguments. + # - Return the input `outputs` arg directly as the output, which means + # no change to the outputs of the op. Or + # - Return a list or tuple of output tensors different from the original + # input arg `outputs`, in which case the callback will override + # the outputs of the op, in either eager execution or graph execution. + return outputs + +with tf.debugging.op_callback(my_callback): + x = tf.constant(3.0) + + y = tf.math.log(1 + x) + # ↑ During the execution of the line above, `my_callback` is invoked twice: + # 1. With the `op_type` arg being 'AddV2' and the `outputs` arg being an + # EagerTensor with value 4.0. + # 2. With the `op_type` arg being 'Log' and the `outputs` arg being an + # EagerTensor with value 1.386 (≈log(4.0)). +``` + +The code below illustrates how Capability B is met, i.e., how creation of +ops inside a user-defined Python function decorated by `tf-function`: + +```python +with tf.debugging.op_callback(my_callback): + + @tf.function + def log_1plusp(p): + return tf.math.log(1 + p) + + x = tf.constant(3.0) + + y = log_1plusp(x) + # ↑ During the execution of the line above, `my_callback` is invoked *three* + # times. The first two invocations happen during AutoGraph's transformation + # of the function `log_1plusp` into a FuncGraph: + # 1. With the `op_type` arg being 'AddV2' and the `outputs` arg being the + # symbolic Tensor output by by AddV2 op. + # 2. With the `op_type` arg being 'Log' and the `outputs` arg being the + # symbolic Tensor output by by Log op. + # (In reality, tf.function and AutoGraph may create additional ops such as + # constant ops for Python constants present in the Python function and + # Identity ops to marshal the FuncGraph's input and output values. Those + # extra ops will be captured by `my_callback` as well.) + # + # The third (last) invocation of the callback is due to the eager execution + # of the FuncGraph, with the `op_type` arg being `tf.Graph`, the `op_name` + # arg being something like `_inference_log_1plusp_30`, and the `outputs` arg + # being an EagerTensor of value 1.386 (≈log(4.0)). +``` + +The example above dealt with a relatively simple FuncGraph. But the proposed API +applies to more complex FuncGraphs, including those that involve control flow +such as if-else and while loops. For instance, see the code example below. 
+ +```python +with tf.debugging.op_callback(my_callback): + + @tf.function + def collatz(x): + n = tf.convert_to_tensor((0,)) + while x != 1: + n += 1 + if x % 2 == 0: + x = x // 2 + else: + x = 3 * x + 1 + return n + + y = collatz(tf.constant(42)) + # ↑ During the execution of the line above, `my_callback` is invoked + # for the creation of all ops in the TF While loop's body and condition, + # and the TF Cond op's branches. In addition, `my_callback` will be + # invoked for the runtime execution of the FuncGraph converted from + # `collatz`. +``` + +### Runtime Instrumentation of Ops in FuncGraphs + +This API also supports the runtime instrumentation of ops in FuncGraphs. To see +why, realize the fact that the return values of the callback will override the +original Tensors for downstream graph construction. Hence, new ops can be +created inside the body of the callback. (The ops created inside the calbacks +themselves will be skipped the callback, which avoids infinite loops.) Such new +ops can consume the original output Tensors of the op and generate new output +Tensors for the callback function to return. This workflow results in debugging +nodes being inserted in the op’s graph. + +The two code snippets listed below (labeled "Listing 1" and "Listing 2") provide +a concrete example of the workflow involved. The code in Listing 1 uses +@tf.function to construct a simple TF graph. In particular, the Python code in +my_func() is converted into a FuncGraph. The code in Listing 2 performs the same +action, but with the code placed inside an `tf.debugging.op_callback()` scope. +The callback function passed to the context manager constructs a debug op +(DebugIdentityV2) for each of the op’s output Tensor and returns the output +tensors of these debug ops, the result of which is shown in Panel B of the +figure below: each tensor now has a DebugIdentityV2 op attached to it. + +```python +# Listing 1. Define a FuncGraph without `tf.debugging.op_callback()`. + +@tf.function +def my_func(x): + return tf.math.log(1 + x) +``` + +```python +# Listing 2. Define a FuncGraph with `tf.debugging.op_callback()` and +# by passing a callback that overrides the symbolic output +# tensors of the just-created graph ops. + +def debugger_callback(op_type, inputs, attrs, outputs, + op_name=None, graph=None): + instrumented_outputs = [] + for output_slot, output in enumerate(outputs): + # Construct overriding output tensor. + # Note: The `debug_identity_v2` below is not a public TensorFlow API + # that users can use directly. This is the workflow that will be used + # internally by tfdbg v2. However, TF users can emulate this pattern + # by using any TF built-in or user-defined ops to override the op's output. + # For instance, using + # `instrumented_outputs.append(tf.math.negative(output))` will cause + # all output tensors in the graph to be negated (i.e., sign-flipped). + instrumented_outputs.append(gen_debug_ops.debug_identity_v2( + output, + tensor_name="%s:%d" % (op_name, output_slot), + debug_urls=debug_urls)) + return instrumented_outputs + +with tf.debugging.op_callback(debugger_callback): + @tf.function + def my_func(x): + return tf.math.log(1 + x) +``` + +![FuncGraph instrumentation](./20190815-tfdbg-v2-callbacks/graph-instrumentation.png) + +The debug op’s output is identical to its input, therefore the semantics of the +FuncGraph is preserved by including the callback. But the side effect of the +debug ops will support runtime debugging. 
Examples of such side effects include: + + - Dumping the tensor’s entire value to disk (similar to [tfdbg v1’s session + wrappers](https://www.tensorflow.org/guide/debugger#debugging_model_training_with_tfdbg)) + - Establishing two-way communication with a gRPC debug server and await + signals from the server before resuming execution, thereby achieving a + “breakpoint” in the graph (similar to [tfdbg v1’s TensorBoard Debugger + Plugin](https://github.com/tensorflow/tensorboard/blob/master/tensorboard/plugins/debugger/README.md)) + + +### What This Proposal Offers Beyond pdb + +This design is not a replacement for pdb. Instead, it is a proposal for +TensorFlow-specific debugging instrumentation that may supplement pdb. The +proposed callback API will enable the following workflows beyond what the usual +pdb-based interactive debugging can achieve. + +1. It allows registration of a callback that can error out with a helpful + error message with proper stack traces when any of the ops' output tensors + contains NaNs or Infinities. This will catch NaN/Infinity issues in both + eagerly computed tensors and the ones that are computed inside FuncGraphs. +2. It allows registration of a callback that can dump the full history of + eager execution, graph building and in-graph op execution to the filesystem. + This will facilitate post hoc analysis of crashed TF2 programs. + Admittedly, dumping full tensor values is very expensive and not realistic + in general. However, there are ways to reduce the cost of this history + dumping (e.g., by dumping only concise numeric summaries of tensors, + sample only a subset of the execution steps, or dumping only from a subset + of the ops.) +3. It allows real-time transmisson of debug information to a gRPC server, in + order to enable interactive debugging in the style of tfdbg v1's [TensorBoard + Debugger + Plugin](https://github.com/tensorflow/tensorboard/blob/master/tensorboard/plugins/debugger/README.md). + +### Considerations for Various Cases + +The proposed approach here will work for the following special cases of graph +construction: + +- The proposed callback API will work for + [user-defined ops](https://www.tensorflow.org/guide/extend/op) as it'll work + for TensorFlow's built-in ops. +- The proposad callback API will also work for TensorFlow's composite tensors, + which currently include + [SparseTensors](https://www.tensorflow.org/api_docs/python/tf/sparse/SparseTensor) + and [RaggedTensors](https://www.tensorflow.org/guide/ragged_tensors). + Eager execution and graph-based execution on such tensors will trigger the + registered op callbacks. In fact, given this is a low-level API, the callback + mechanism does not treat composite tensors in any special way. The op types + received by the callbacks will be low-level ops such as "SparseTensorDenseMatMul". +- Gradient graphs (i.e., graphs generated with + [tf.GradientTape](https://www.tensorflow.org/api_docs/python/tf/GradientTape)). +- “Nested” invocation of FuncGraphs, i.e., a FuncGraph invoking another + FuncGraph inside. As mentioned above, this includes control flow v2. In While + v2 and Cond v2, the body and conditions of the control-flow constructs are + themselves FuncGraphs. +- In addition, this approach of instrumenting the graph will work for + DistributionStrategy. For instance, + [MirroredStrategy](https://www.tensorflow.org/api_docs/python/tf/contrib/distribute/MirroredStrategy) + replicates the computation (eager or `tf.function`) across a number of + processors (CPU and GPUs). 
The proposed callback API will capture all the + eager execution and FuncGraph construction replicated on all the involved + processors. +- [tf.data pipeline](https://www.tensorflow.org/api_docs/python/tf/data) + constructs, including built-in data functions such as + [Dataset.batch()](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#batch), + as well as the graph construction and runtime execution of user-defined + mapping functions supplied to + [Dataset.map()](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#map). +- Furthermore, the proposed API can support TPU debugging as well. In + particular, if the inserted DebugIdentityV2 ops are placed inside + [tpu.outside_compilation](https://www.tensorflow.org/api_docs/python/tf/tpu/outside_compilation), + they will be properly compiled by XLA and run on TPU clusters. For + performance, we will likely need to consolidate a large number of outside + compilations into a smaller number. But the basic principle of instrumentation + remains the same. + +### Alternatives Considered + +1. Instrumenting the graphs at C++ level. This is in fact the current approach + of tfdbg v1. An advantage of the C++-based approach is that it can reflect + the graph rewriting performed internally by TF (e.g., by Grappler.) However, + compared with the Python-based approach proposed above, it is harder to make + the instrumentation work correctly for all the special cases, including TPU + if we were to pursue that route. The C++-level instrumentation may be pursued + in other parts of the overall tfdbg v2 effort. But as a goal, it is outside + the scope of this design. +2. Instrumenting the Python-level graph in a place such as + [`_EagerDefinedFunction.__init__()`](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/eager/function.py#L345), + i.e., a common pathway where all `tf.function`s are created in TF2. To this + end, a graph-rewriting function can be implemented. Compared with the + proposed approach, this approach has the following disadvantages. It + doesn’t result in an elegant API that unifies graph instrumentation with + eager instrumentation (see Capability A above). In addition, it doesn’t + support step-by-step tracing a FuncGraph’s construction phase, which is + supported by the proposed API. + +## Acknowledgments + +Thanks are due to Dan Moldovan (mdan@google.com) for extensive discussion in +the process that led to this design. diff --git a/rfcs/20190815-tfdbg-v2-callbacks/graph-instrumentation.png b/rfcs/20190815-tfdbg-v2-callbacks/graph-instrumentation.png new file mode 100644 index 000000000..a1d45d21b Binary files /dev/null and b/rfcs/20190815-tfdbg-v2-callbacks/graph-instrumentation.png differ diff --git a/rfcs/20190815-tfx-notebook.md b/rfcs/20190815-tfx-notebook.md new file mode 100644 index 000000000..607f34bba --- /dev/null +++ b/rfcs/20190815-tfx-notebook.md @@ -0,0 +1,547 @@ +# TFX Iterative Notebook Proposal + +Status | Approved +:------------ | :------- +**Author(s)** | Charles Chen (ccy@google.com), Joe Lee (joeyounglee@google.com), Kenny Song (kennysong@google.com), Kevin Haas (khaas@google.com), Pushkar Joshi (pushkarj@google.com) +**Sponsor** | Konstantinos Katsiapis (katsiapis@google.com) +**Updated** | 2019-09-17 + +## Objective + +We want to build a notebook user experience for modeling / iterative development +using TFX Components. 
This will provide a fast, familiar environment for +developing model and pipeline code with standard TensorFlow + TFX utilities, +plus automatic notebook → pipeline export: + +* Imperative, + [define-by-run](https://ai.googleblog.com/2017/10/eager-execution-imperative-define-by.html), + cell-by-cell workflow + * Start directly from Notebook/Colab – no running pipeline needed + * Run TFX components as you need them, in separate cells + * No explicit DAG definitions or continuous execution +* Simple Python API per TFX component + * ExampleGen, StatsGen, SchemaGen, Transform, Trainer, Evaluator + * 100% TFX compatible for automatic notebook → pipeline export +* Analyze artifacts natively in Notebook/Colab + * Built-in TensorBoard, Facets, TFMA visualizations + * Dataset, stats, eval metrics available in notebook for custom analysis +* Zero-setup, interactive onboarding tool for new TFX users on + [tensorflow.org](http://tensorflow.org) + +## Motivation + +The benefits of using a notebook include rapidly editing and running code, +immediately seeing the execution and outputs of commands, and running quick +one-off analyses in Python. It’s a simple, no-mental-overhead REPL environment +for iterating on ideas. + +By combining the notebook experience + TFX components, users can easily run + +* ExampleGen to generate the initial dataset used for training +* StatsGen to generate and visualize a statistical report of your data +* SchemaGen to generate a schema of your data (required input of Transform) +* Transform to write feature engineering strategies +* Trainer that wraps standard TF.Estimator or Keras code +* Evaluator to generate, slice, and visualize evaluation metrics +* Custom analyses on the output of any of these components with standard + Python + +To close the loop, the notebook will be automatically exported as a pipeline +configuration that users can directly deploy as a scalable TFX pipeline. There +is no additional modification required. + +## Target Users + +We target users who want to manually iterate on their models & components, and +prefer a notebook environment for the benefits outlined above. This is a wide +range of potential users, and from our user research, spans software engineers +and data scientists within and outside of Google. + +## Design Proposal + +This proposal proposes a set of primitives that match concepts in the current +TFX SDK. + +### Component definition; inputs and outputs + +#### Proposal: components should take inputs, produce outputs (instead of taking predefined upstream components) + +This proposal proposes a set of primitives that match concepts in the current +TFX SDK. We propose to follow the current TFX style of having components +explicitly take input channels (i.e. streams of artifacts of a specific type) +and produce output channels (of another specific type). This could look like +this: + +``` +# Here, with an input_base as an execution parameter with a given +# file path. +example_gen = CsvExampleGen(input_base=examples) + +# Next, we use the 'examples' named output of ExampleGen as the +# input to StatisticsGen. +statistics_gen = StatisticsGen(input_data=example_gen.outputs['examples']) + +# We then similarly use the statsgen output in SchemaGen. +infer_schema = SchemaGen(statistics=statistics_gen.outputs['statistics']) + +# Next, we do example validation. 
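# ExampleValidator checks the generated statistics against the inferred schema,
# so it takes the outputs of both StatisticsGen and SchemaGen as its inputs.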
+validate_stats = ExampleValidator( + statistics=statistics_gen.outputs['statistics'], + schema=infer_schema.outputs['schema']) +``` + +### Component execution, execution result objects, visualization + +#### Proposal: InteractiveContext.run(component) returns an ExecutionResult, whose output artifacts can be visualized using InteractiveContext.show(artifacts) + +##### Part 1 (Option 1): add InteractiveContext.run() **[recommended]** + +We propose to add a new `InteractiveContext` class. Distinct from a pipeline +runner which takes in an entire TFX pipeline, an instance of this class allows +interactive execution of individual components. Here, a user would construct +components with appropriate parameters and execution properties, and the +`InteractiveContext.run(component)` method would execute that component, thereby +materializing any output artifacts of that component. + +An advantage of this style is that it does not bifurcate the TFX pipeline runner +concept into "pipeline runners" and "component runners", and it is very clear +that this API is only meant for interactive usage (as opposed to the two +alternatives below). A disadvantage is that we may not want to introduce +interactive usage as a first class citizen, preferring to merge it with the +runner concept. + +(A prototype for this is implemented in +[taxi_pipeline_interactive.ipynb](https://github.com/tensorflow/tfx/blob/master/tfx/examples/chicago_taxi_pipeline/taxi_pipeline_interactive.ipynb)). +See the "Example notebook usage" section below. + +##### Part 1 (Option 2): add Component.run() + +In this alternative, we propose to add a run() method to the +[BaseComponent](https://github.com/tensorflow/tfx/blob/master/tfx/components/base/base_component.py) +class. Given the appropriate parameters and execution properties, this will run +that component of your pipeline. This will be in-process and not involve any +external TFX orchestrators (like Airflow or Kubeflow) and is suitable only for +small development datasets. + +An advantage of the Component.run() style is that it is simple and intuitive in +the notebook setting. A disadvantage is that this does not encourage the best +practice for production pipeline definition (i.e. defining all pipeline +components and subsequently calling something like Pipeline.run()). To mitigate +this, we can emit a warning when this is called outside a Jupyter notebook +environment. + +An advantage of returning an explicit ExecutionResult is that we now separate +component definition (configuration) from results for a specific run +(execution). + +##### Part 1 (Option 3): don't add Component.run(); have separate run_component() + +Alternatively, we don't have to put the run() method on the Component class. We +can factor out a utility method `run_component(component)` that does the same +thing. This style is less intuitive for the notebook use case but may better +encourage best practices during production. + +##### Part 2 (Option 1): a user can visualize outputs of an ExecutionResult by using Jupyter visualizer for artifact class, or by using InteractiveContext.show(artifact) **[recommended]** + +Here, after a `InteractiveContext.run(component)` call, we get an +ExecutionResult, on which we can retrieve artifacts with +`result.outputs[output_name]`. This will return the Artifact pointers emitted by +that specific component execution. Next, the user may return +`component.output[output_name]` as the return value from a notebook cell. 
+Alternatively, a user may call +`InteractiveContext.show(component.output[output_name])` which hooks into +artifact-specific logic to visualize each artifact type (see Part 3 below). + +##### Part 2 (Option 2): Artifact execution via show(artifact) + +In this alternative, instead of running components and retrieving artifacts +after they are run, artifacts are "run" implicitly when show(artifact) is +called. This will implicitly execute the component necessary to generate the +artifact. + +Pros: One show() call rather than separate run() and show(). Dependencies can be +handled under the hood, and we can avoid visualizing stale results. + +Cons: Not intuitive as this is not what “show” means. Not the simplest mental +model and potentially confusing. If a user wants to always show artifacts after +running, it is very natural to put run() and show() in the same notebook cell. +Running code and components that are not part of the current executed cell is +also not a notebook-friendly pattern. + +##### Part 3: Notebook visualizations for well-known artifact types can be registered + +We introduce a `NotebookArtifactVisualizationRegistry` class on which we may +register visualizations (e.g. HTML renderings for Colab / Jupyter notebooks), +which are to be returned from ExecutionResult.read() when run in the notebook +environment. For specific artifact types, we allow registration of handlers to +return visualizations for those types. We will write visualizations for +well-known artifact types we use. For example, the `ExampleStatistics` Artifact +type output by StatisticsGen could be visualized by producing an interactive +display of the resulting statistics +[using Facets](https://pair-code.github.io/facets/). + +##### Example notebook usage + +Here is an example of what notebook execution may look like in this scheme. + +**Input[0]:** + +```python +# To begin, we initialize an interactive context. Here, by not passing +# in a base directory or metadata configuration, we create an ephemeral +# context whose outputs will be in a temporary directory. +context = InteractiveContext() + +# Alternatively, we may pass in these directories for a context using a +# persistent store: +# +# context = InteractiveContext(base_dir=my_base_dir, +# metadata_connection_config=my_config) +``` + +**Input[1]:** + +```python +# First, ExampleGen with a run / read. +example_gen = CsvExampleGen(input_base=examples) + +# Note that the output will be of type 'ExamplesPath', for which we +# may have registered a notebook visualization handler. +example_gen_result = context.run(example_gen) + +example_gen.outputs['examples'] + +# alternative style: explicit context.show() method +context.show(example_gen.outputs['examples']) +``` + +**Output[1]:** + +_(notebook visualization indicating we have N examples at some temp path)_ + +**Input[2]:** + +```python +# Next, StatisticsGen with a run / read. +statistics_gen = StatisticsGen(input_data=example_gen.outputs['examples']) + +context.run(statistics_gen).outputs['statistics'] + +# alternative styles: +# context.show(context.run(statistics_gen).outputs['statistics']) +# context.run().read('output', visualization_handler=blah) +# context.run().show('output', visualization_handler=blah, visualization_args=) +``` + +**Output[2]:** + +_(notebook visualization for statsgen output)_ + +**Input[3]:** + +```python +# Next, SchemaGen without a run / read. +infer_schema = SchemaGen(statistics=statistics_gen.outputs['statistics']) + +# Finally, ExampleValidator with a run / read. 
Note that SchemaGen +# will be implicitly run (see Note 2 below). +validate_stats = ExampleValidator( + statistics=statistics_gen.outputs['statistics'], + schema=infer_schema.outputs['schema']) + +context.run(validate_stats) +``` + +**Output[3]:** + +_(ExecutionResult object for ExampleValidator)_ + +Note that the user may have forgotten to run InteractiveContext.run() on +upstream components in the dependency graph. Instead of implicitly running these +upstream components, we remind the user to explicitly run upstream notebook +cells (with a readable error message). We think this explicit mode of component +execution is more notebook-friendly, and is easy to use with common notebook +actions such as “Run All”, “Run Cells Before”, and “Run Cells After”. + +### Export to a selected orchestration engine (v0) + +#### Filter out InteractiveContext() objects + +##### Option 1: Replace InteractiveContext instances with dummy versions. + +1. Search for possible import alias, e.g. `from + tfx.orchestration.interactive.interactive_context import InteractiveContext + as FooBar` + +* Search for all instances of string ".*InteractiveContext", or the alias name + if found from prior step. +* Replace each instance with `DummyInteractiveContext`, which inherits from + InteractiveContext and basically does nothing / returns empty + ExecutionResult on .run(). + + ``` + class DummyInteractiveContext(InteractiveContext): + def run(self, + component: base_component.BaseComponent, + enable_cache: bool = True): + return None + ``` + + 1. This should cover the case where the class definition is aliased. + + ``` + aliased_class = interactive_context.InteractiveContext + context = aliased_class() + ``` + + 1. This should cover subsequent aliases of InteractiveContext instances. + + ``` + a = InteractiveContext() + b = a + c = InteractiveContext() + d = c + ``` + +Cons: + +* DummyInteractiveContext is now present/clutters the production pipeline code + (it's a no-op so mainly affects readability, not execution). +* Down the line, converting back to a notebook (replacing + DummyInteractiveContext with InteractiveContext) could be fragile. + +##### Option 2: Ensure InteractiveContext only runs in notebook context. **[recommended]** + +* If InteractiveContext is run outside of a notebook context, just log a + warning and return. +* Bi-directional import to notebook from pipeline would "just work". + +Cons: + +* InteractiveContext is still present in the production pipeline as a no-op / + affects readability. +* Puts the burden on user to scrub out calls to InteractiveContext. + +##### Option 3: Mark lines/cells to be skipped during export. + +Add custom magic to mark lines/cells as skip_for_export, can also be used by the +user to skip scratch work in cells. + +Example line magic: + +``` +%skip_for_export context = InteractiveContext() +... +%skip_for_export context.run(example_gen) +``` + +Example cell magic: + +``` +%%skip_for_export +# Cell contains scratch work that doesn't need to be exported. +... +``` + +Cons: + +* Puts burden on the user to filter out the InteractiveContext objects. User + may forget to mark some usages of InteractiveContext, meaning + InteractiveContext instances can get leaked to the final pipeline. + +##### Option 4: Delete the lines containing InteractiveContext variables. + +Cons: + +* Not robust to duplicate references. 
+* We can find the references to InteractiveContext by either keeping track of + them weakly within the class on __init__, or we can use gc module to + dynamically find the references. But then finding and deleting all + associated lines with each instance seems hard. + * What if user makes a helper function and passes in a context variable? + (not likely, but possible) + +Note each of these options only filters the InteractiveContext usage in the +exported python script, and does not prevent the user from adding it back +afterwards. + +#### Export notebook contents to pipeline + +1. Present the user with a Beam pipeline export template cell. Airflow/Kubeflow + template code can be linked to in documentation, or populated in additional + cells with code commented. + 1. User fills out any globals/configuration code specific to + Beam/Airflow/Kubeflow. + 1. User fills out a `pipeline.Pipeline()` instance to export. + 1. We can alternatively have the user wrap the pipeline.Pipeline() + instance in a function, like `_create_pipeline(...)` in the existing + [pipeline examples](https://github.com/tensorflow/tfx/blob/master/tfx/examples/chicago_taxi_pipeline/taxi_pipeline_simple.py), + but this could make pipeline export more cumbersome for users who + have not organized their notebook in such a way. We could also + potentially update the notebook example to push users into a + particular notebook organization. +1. When the user runs the cell, or more specifically, when + `context.export_to_pipeline()` is executed, export the notebook code to .py + file. + 1. It seems beneficial to keep the export_to_pipeline() line in the same + cell as the pipeline.Pipeline() declaration so the user can fix any + errors before the export happens. + 1. As a first pass, we can export the entire notebook. + 1. We may consider using IPython magics to filter out specific + lines/cells in the future. + 1. This step requires the user to fill out the notebook filename as + there does not seem to be a robust way for us to programmatically + retrieve this (see comment in examples below). + 1. We can try to hide away parts of the template cells in the notebook and + move them into Jinja template files, but if the user has to fill in + pipeline-specific config, it might be more straightforward for them to + see everything in one place. + +##### Airflow template cell + +``` +# Airflow template cell. + +from tfx.orchestration import metadata +from tfx.orchestration import pipeline +from tfx.orchestration.airflow.airflow_runner import AirflowDAGRunner + +############################################################################## +# TODO(USER): Configs + +# Directory and data locations. This example assumes all of the chicago taxi +# example code and metadata library is relative to $HOME, but you can store +# these files anywhere on your local filesystem. +_tfx_root = os.path.join(os.environ['HOME'], 'tfx') +_pipeline_root = os.path.join(_tfx_root, 'pipelines', _pipeline_name) +# Sqlite ML-metadata db path. +_metadata_path = os.path.join(_tfx_root, 'metadata', _pipeline_name, + 'metadata.db') + +# Airflow-specific configs; these will be passed directly to Airflow. 
+_airflow_config = { + 'schedule_interval': None, + 'start_date': datetime.datetime(2019, 1, 1), +} +############################################################################## + +# TODO(USER) +p = pipeline.Pipeline( + pipeline_name=, + pipeline_root=, + components=[ + example_gen, statistics_gen, infer_schema, validate_stats, transform, + trainer, model_analyzer, model_validator, pusher + ], + enable_cache=True, + metadata_connection_config=metadata.sqlite_metadata_connection_config( + metadata_path)) + +airflow_pipeline = AirflowDAGRunner(_airflow_config).run(p) + +# Export notebook contents. +context = InteractiveContext() +# TODO(USER): Name of the notebook file to be used for retrieving +# notebook contents. IPython kernels are agnostic to notebook metadata by design, +# and it seems that existing workarounds to retrieve the notebook filename are not +# universally robust (https://github.com/jupyter/notebook/issues/1000). +context.export_to_pipeline(notebook_filename='taxi_pipeline_interactive.ipynb', + pipeline_name='') +``` + +##### Kubeflow template cell + +``` +# Kubeflow template cell. + +from tfx.orchestration import pipeline +from tfx.orchestration.kubeflow.runner import KubeflowRunner + + +############################################################################## +# TODO(USER): Configs + +# Directory and data locations (uses Google Cloud Storage). +_input_bucket = 'gs://my-bucket' +_output_bucket = 'gs://my-bucket' +_tfx_root = os.path.join(_output_bucket, 'tfx') +_pipeline_root = os.path.join(_tfx_root, _pipeline_name) + +# Google Cloud Platform project id to use when deploying this pipeline. +_project_id = 'my-gcp-project' + +# Python module file to inject customized logic into the TFX components. The +# Transform and Trainer both require user-defined functions to run successfully. +# Copy this from the current directory to a GCS bucket and update the location +# below. +_module_file = os.path.join(_input_bucket, 'taxi_utils.py') + +# Path which can be listened to by the model server. Pusher will output the +# trained model here. +_serving_model_dir = os.path.join(_output_bucket, 'serving_model', + _pipeline_name) + +# Region to use for Dataflow jobs and AI Platform training jobs. +# Dataflow: https://cloud.google.com/dataflow/docs/concepts/regional-endpoints +# AI Platform: https://cloud.google.com/ml-engine/docs/tensorflow/regions +_gcp_region = 'us-central1' + +# A dict which contains the training job parameters to be passed to Google +# Cloud AI Platform. For the full set of parameters supported by Google Cloud AI +# Platform, refer to +# https://cloud.google.com/ml-engine/reference/rest/v1/projects.jobs#Job +_ai_platform_training_args = {...} + +# A dict which contains the serving job parameters to be passed to Google +# Cloud AI Platform. For the full set of parameters supported by Google Cloud AI +# Platform, refer to +# https://cloud.google.com/ml-engine/reference/rest/v1/projects.models +_ai_platform_serving_args = {...} + +# Beam args to run data processing on DataflowRunner. +_beam_pipeline_args = [...] + +# The rate at which to sample rows from the Chicago Taxi dataset using BigQuery. +# The full taxi dataset is > 120M record. In the interest of resource +# savings and time, we've set the default for this example to be much smaller. +# Feel free to crank it up and process the full dataset! +_query_sample_rate = 0.001 # Generate a 0.1% random sample. + +# This is the upper bound of FARM_FINGERPRINT in Bigquery (ie the max value of +# signed int64). 
+_max_int64 = '0x7FFFFFFFFFFFFFFF' + +# The query that extracts the examples from BigQuery. The Chicago Taxi dataset +# used for this example is a public dataset available on Google AI Platform. +_query = ... +############################################################################## + +# TODO(USER) +p = pipeline.Pipeline( + pipeline_name=, + pipeline_root=, + components=[ + example_gen, statistics_gen, infer_schema, validate_stats, transform, + trainer, model_analyzer, model_validator, pusher + ], + additional_pipeline_args={ + 'beam_pipeline_args': beam_pipeline_args, + # Optional args: + # 'tfx_image': custom docker image to use for components. + # This is needed if TFX package is not installed from an RC + # or released version. + }, + log_root='/var/tmp/tfx/logs') + +kubeflow_pipeline = KubeflowRunner().run(p) + +# Export notebook contents. +context = InteractiveContext() +# TODO(USER): Name of the notebook file to be used for retrieving +# notebook contents. IPython kernels are agnostic to notebook metadata by design, +# and it seems that existing workarounds to retrieve the notebook filename are not +# universally robust (https://github.com/jupyter/notebook/issues/1000). +context.export_to_pipeline(notebook_filename='taxi_pipeline_interactive.ipynb', + type='kubeflow') +``` + diff --git a/rfcs/20190816-tf-project-versioning.md b/rfcs/20190816-tf-project-versioning.md new file mode 100644 index 000000000..ba59bf194 --- /dev/null +++ b/rfcs/20190816-tf-project-versioning.md @@ -0,0 +1,44 @@ +# Project versioning in the TensorFlow organization + +| Status | Accepted | +:-------------- |:---------------------------------------------------- | +| **Author(s)** | Edd Wilder-James (ewj@google.com), Martin Wicke (wicke@google.com) | +| **Sponsor** | Kevin Haas (khaas@google.com) | +| **Updated** | 2019-08-16 | + +## Objective + +This document describes best practices for numbering versions of projects +that form part of the TensorFlow suite of projects. This practice is required for dependent +projects hosted under the [TensorFlow organization](https://github.com/tensorflow) on +GitHub, and advisory for dependent projects hosted elsewhere. + +## Definitions + +"TensorFlow" in this document refers to the core TensorFlow project, as developed in +GitHub `tensorflow/tensorflow`. + +## Motivation + +As the number of projects dependent on TensorFlow increases, such as those shipped by +SIG Addons or IO, it is helpful to maintainers to understand the constraints on how +to number their releases. + +## Versioning Policy + +All projects must follow [semantic versioning](https://semver.org/). + +Until a project reaches 1.0, it does not have to make any backward compatibility guarantees. + +Projects should not try to track major TensorFlow versioning to indicate compatibility +with particular TensorFlow releases. Instead, compatibility must be signalled +by the use of dependencies in `pip`, or whichever package manager is being used by the project. + +Within the constraints of semantic versioning, project maintainers should feel free to do +whatever is best for their projects and users. + +## Review Feedback + +Included as advisory but not binding. + +* Jason Zaman: It might be a good idea to also mention stability guarantees, and things that are excluded from them. eg TensorFlow itself says anything that's `tf.foo.experimental.bar` is not stable and is allowed to change at anytime, and other projects should think about having a similar mechanism if needed. 
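As an illustration of the dependency-based compatibility signalling described
above (the package name and version range here are hypothetical), a dependent
project would express which TensorFlow releases it supports in its packaging
metadata rather than in its own version number:

```python
# setup.py of a hypothetical project that depends on TensorFlow.
from setuptools import setup

setup(
    name="tensorflow-example-addon",  # hypothetical package name
    version="0.4.1",                  # semantic versioning, on its own timeline
    install_requires=[
        # Compatibility with TensorFlow is signalled via the pip dependency,
        # not by mirroring TensorFlow's major/minor version.
        "tensorflow>=2.0.0,<2.2.0",
    ],
)
```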
diff --git a/rfcs/20190821-nodejs-saved-model.md b/rfcs/20190821-nodejs-saved-model.md new file mode 100644 index 000000000..a815da923 --- /dev/null +++ b/rfcs/20190821-nodejs-saved-model.md @@ -0,0 +1,359 @@ +# Native SavedModel execution in Node.js + +| Status | Accepted | +:-------------- |:---------------------------------------------------- | +| **Author(s)** | kangyizhang@google.com | +| **Sponsor** | smilkov@google.com, nsthorat@google.com, piyu@google.com | +| **Updated** | 2019-09-27 | + +## Objective + +This project is aiming to enable native TF SavedModel execution for inference in Node.js environment without conversion. + +### Goals + +* Implement an API to load and execute TF SavedModel Signature for inference only in Node.js. +* This API should works for SavedModel exported in both TF 1.x and 2.0 +* Wrap the loaded SavedModel as a new subtype implementing [tf.InferenceModel](https://github.com/tensorflow/tfjs/blob/81225adc2fcf6fcf633b4119e4b89a3bf55be824/tfjs-core/src/model_types.ts#L36) +* Enable the ability to inspect the SavedModel metaGraph and signature in Node.js with protocol buffers in JavaScript. + +### Non-goals + +* Enable execution tf.function in Node.js +* Enable support for training a SavedModel +* Enable support for exporting a SavedModel + +## **Motivation** + +TensorFlow.js brings TensorFlow into the JavaScript world. It provides APIs to develop and train models, and also tools to convert models trained in other languages. + +Currently users could use [tfjs-converter](https://github.com/tensorflow/tfjs-converter) to convert TensorFlow SavedModel and TensorFlow Hub module to js friendly format and run inference through TensorFlow.js through the following steps: + + +1. Install tf-nightly-2.0-preview and tensorflowjs pip packages +2. Run the converter script to convert the model to js friendly format + +``` +tensorflowjs_converter \ + --input_format=tf_saved_model \ + --output_format=tfjs_graph_model \ + --signature_name=serving_default \ + --saved_model_tags=serve \ + /mobilenet/saved_model \ + /mobilenet/web_model +``` + +3. Load and run the converted model in javascript through [tf.loadGraphModel()](https://js.tensorflow.org/api/latest/#loadGraphModel) or [tf.loadLayersModel()](https://js.tensorflow.org/api/latest/#loadLayersModel) API based on the model type + +``` +const model = await tf.loadGraphModel(MODEL_URL); +const cat = document.getElementById('cat'); +model.predict(tf.browser.fromPixels(cat)); +``` + +The above steps require developers to install Python TensorFlow package and some parameters/configurations, which we have noticed users are struggling with. + +The tfjs-node repository provides native TensorFlow execution in Node.js environment through TensorFlow C library under the hood. It provides the same API (190+ ops) as [TensorFlow.js](https://js.tensorflow.org/api/latest/), which is a subset of the TensorFlow ops (900+). + +Here there is an opportunity to support native SavedModel execution in Node.js with TensorFlow C library so that 1) tfjs-node can support models which contain ops that are not supported in TensorFlow.js yet, and 2) users do not need to go through the model conversion process. + +This project uses the non-eager APIs in TensorFlow C library to enable loading and executing TF SavedModel for inference in Node.js environment without conversion. 
+ + +## **Design Proposal** + + +### User-facing code + +This project will provide a new API `tf.node.loadSavedModel` to load a Signature in SavedModel as a new class `TFSavedModel` in Node.js, which can be used to execute the SavedModel Signature for inference. + +The loadSavedModel API takes a `path`, which is the absolute path to the SavedModel directory, a `tag_set` to identify which MetaGraph to load, and `signature` name as params. It returns a `TFSavedModel` object, implementing [tf.InferenceModel](https://github.com/tensorflow/tfjs/blob/81225adc2fcf6fcf633b4119e4b89a3bf55be824/tfjs-core/src/model_types.ts#L36) class. + + +``` +const savedModel = tf.node.loadSavedModel(__dirname + 'saved_model_dir', 'tag_set', 'signature_def_name'); +``` + + +The returned TFSavedModel object has a `predict()` function to execute the SavedModel signature for inference. The param of this predict() function would be a single tensor if there is single input for the model or an array of tensors if the model has multiple inputs. + +The TFSaveModel object also has an `execute()` function to execute the inference for the input tensors and return activation values for specified output node names. + + +``` +const input = tensor1d([123], 'int32'); +// Execute the loaded signatureDef of the SavedModel +const output = savedModel.predict([input_tensors]); +``` + + +The TFSavedModel object also has a `delete()` function to free the SavedModel related memory. + + +``` +savedModel.delete() +// The following line will throw an exception saying the SavedModel has been deleted. +const output = savedModel.predict([input_tensors]); +``` + + + +### Internal Change + + +#### Load SavedModel + +A [SavedModel](https://www.tensorflow.org/beta/guide/saved_model) is a directory containing serialized signatures and the states needed to run them. + + +``` +assets saved_model.pb variables +``` + + +The directory has a saved_model.pb (or saved_model.pbtxt) file containing a set of named signatures, each identifying a function. + +SavedModels may contain multiple sets of signatures (multiple MetaGraphs, identified with the tag-sets). When serving a model for inference, usually only one signature is used. + + +#### Designate the MetaGraph and Signature to execute + +Though the C API supports loading multiple MetaGraph, and one loaded MetaGraph may have several SignatureDefs, this project only supports loading one MetaGraph and executing one SignatureDef through the JavaScript API, so that it’s clear to users that the loaded SavedModel is only using the specified Signature for inference. This also aligns with the current TensorFlow.js [models API](https://js.tensorflow.org/api/latest/#class:GraphModel), and the current workflow with tfjs-converter. + +Users are able to load multiple signature from the same SavedModel by calling JavaScript API multiple times. The detailed discussion is provided later in SavedModel management section. + +#### Deserialize saved_model.pb with protobuf in javascript to get MetaGraph and Signature info + +For JavaScript developers, who do not have a lot of machine learning experience, their use case might be that they find an open source model and they want to use it in their Node.js project. The MetaGraph and Signatures are unclear to them and they don’t know how to get the model metadata in saved_model.pb file. 
+ +While TensorFlow provide the [SavedModel CLI tool](https://www.tensorflow.org/beta/guide/saved_model#details_of_the_savedmodel_command_line_interface) to inspect and execute a SavedModel, this project will make it convenient for JS developers to do all the work in JavaScript. + +A new API will be added in TensorFlow.js node environment to allow users to inspect SavedModel, similar to [saved_model_cli show](https://www.tensorflow.org/beta/guide/saved_model#show_command), so that users can know what value to provide as MetaGraph and Signature. + + +``` +const modelInfo = tf.node.inspectSavedModel(__dirname + 'saved_model_dir'); + +console.log(modelInfo); +/* The modelInfo should include the following information: +{ + tags: ['serve'], + signatureDef: { + serving_default: { + 'inputs': { + x: { + 'name': 'serving_default_x:0', + 'dtype': ..., + 'tensorShape': ... + } + }, + 'outputs': { + output_0: { + 'name': 'StatefulPartitionedCall:0', + 'dtype': ..., + 'tensorShape': ... + } + }, + 'methodName': 'tensorflow/serving/predict' + } + } +} +*/ +``` + + +Google’s Protocol Buffers is also available in javascript. It provides a [Protocol compiler](https://github.com/protocolbuffers/protobuf/releases) to translate the xxx.proto file to js file, and a JavaScript Protocol Buffers runtime library [google-protobuf](https://www.npmjs.com/package/google-protobuf) to construct and parse the messages. + +To use Protocol Buffers in javascript, first the saved_model.proto file need to be translated: + + +``` +$ protoc --js_out=import_style=commonjs,binary:. saved_model.proto +``` + + +The above command will translate the [saved_model.proto](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/protobuf/saved_model.proto) file to saved_model_pb.js file. Then in js code, the saved_model.pb file can be parsed as SavedModel object through the translated js file. + + +``` +var messages = require('./tensorflow/core/protobuf/saved_model_pb'); +var fs = require('fs'); + +var SavedModel = new messages.SavedModel(); +const mobileModel = fs.readFileSync('./saved_model.pb'); +const array = new Uint8Array(mobileModel); + +const model = messages.SavedModel.deserializeBinary(array); + +console.log(model.getSavedModelSchemaVersion()); +console.log(model.getMetaGraphsList()); +``` + + +With protobuf in JavaScript, the MetaGraphDef tag-sets and SignatureDef keys in SavedModel are available to be retrieved in JavaScript. + + +#### Use TF C API TF_LoadSessionFromSavedModel to load SavedModel + +The TensorFlow C library has a [TF_LoadSessionFromSavedModel](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/c/c_api.h#L1211) API, which creates a new TF_Session and then initializes states. This API supports both TF 1.x and TF 2.0 models. So with this API, the same code in tfjs-node works for both TF 1.x and 2.0. The `export_dir` and `tag` parameters are the `path` and `tag_set` value provided by users in javascript API. + + +``` +TF_Session *session = TF_LoadSessionFromSavedModel( + session_options, run_options, export_dir, tags, tags_leng, graph, + metagraph, tf_status.status); +``` + + +The returned TF_Session can be run with [TF_SessionRun](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/c/c_api.h#L1254) API to execute the graph associated with the session. 
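For context, a slightly fuller sketch of this call, including the setup of the
options, graph, and status objects, might look as follows (the variable names
and error handling here are illustrative only and are not the actual binding
code):

```
TF_Status* status = TF_NewStatus();
TF_SessionOptions* session_options = TF_NewSessionOptions();
TF_Graph* graph = TF_NewGraph();
TF_Buffer* metagraph = TF_NewBuffer();

// `export_dir` and `tags` come from the `path` and `tag_set` values that the
// user passed to the JavaScript API.
const char* export_dir = "/path/to/saved_model";
const char* tags[] = {"serve"};

TF_Session* session = TF_LoadSessionFromSavedModel(
    session_options, /*run_options=*/nullptr, export_dir, tags, /*tags_len=*/1,
    graph, metagraph, status);

if (TF_GetCode(status) != TF_OK) {
  // Surface TF_Message(status) to JavaScript as an exception.
}
```

The resulting TF_Session and TF_Graph pair is what the backend keeps track of
(see the tf_savedmodel_map_ discussion below).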
+


### Do inference through running the loaded Session

The TF C API provides a [TF_SessionRun](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/c/c_api.h#L1254) function to execute the graph associated with the input session, which is loaded from the SavedModel through TF_LoadSessionFromSavedModel, as discussed above.


```
TF_CAPI_EXPORT extern void TF_SessionRun(
    TF_Session* session,
    // RunOptions, may be NULL
    const TF_Buffer* run_options,
    // Input tensors
    const TF_Output* inputs, TF_Tensor* const* input_values, int ninputs,
    // Output tensors
    const TF_Output* outputs, TF_Tensor** output_values, int noutputs,
    // Target operations
    const TF_Operation* const* target_opers, int ntargets,
    // RunMetadata, may be NULL
    TF_Buffer* run_metadata,
    // Output status
    TF_Status*);
```


If the session is successfully executed, the tensors corresponding to the output ops are placed in output_values, which are of type TF_Tensor. They will be converted to TFE_TensorHandle and registered in the tfjs-node backend, which is the same way tensors are currently managed in the tfjs-node addon.

When running the session, the input and output op names are the input/output names of the Signature that was provided when loading the SavedModel.


### New functions in node C++ addon (TFJSBackend)

Several new functions and members are added to [TFJSBackend](https://github.com/tensorflow/tfjs/blob/master/tfjs-node/binding/tfjs_backend.h#L29) to support SavedModel execution in Node.js.


#### tf_savedmodel_map_ and InsertSavedModel()

A map is added in the TFJSBackend to manage the sessions loaded from SavedModels. Similar to tfe_handle_map, the key of this map is a numeric savedmodel_id. The value of this map is a pair of the TF_Session and TF_Graph loaded from the SavedModel.


```
std::map<int32_t, std::pair<TF_Session*, TF_Graph*>> tf_savedmodel_map_;
```


#### LoadSavedModel

A LoadSavedModel function is added to load a SavedModel from a path. It gets the TF_Session from the SavedModel and inserts the session into tf_savedmodel_map_.


```
  // Load a SavedModel from a path:
  // - export_dir (string)
  // - tag_set (string)
  napi_value LoadSavedModel(napi_env env, napi_value export_dir, napi_value tag_set);
```


#### RunSavedModel

The backend needs the savedmodel id, the input tensor IDs, and the input/output op names to execute the TF_Session.


```
  // Execute a session with the provided input/output names:
  // - savedmodel_id (number)
  // - input_tensor_ids (array of input tensor IDs)
  // - input_op_names (string)
  // - output_op_names (string)
  napi_value RunSavedModel(napi_env env, napi_value savedmodel_id,
                           napi_value input_tensor_ids, napi_value input_op_names,
                           napi_value output_op_names);
```


#### DeleteSavedModel

When the user no longer needs the SavedModel, DeleteSavedModel needs to be called to delete the corresponding TF_Session and release the memory.


```
  // Delete the corresponding TF_Session and TF_Graph
  // - savedmodel_id (number)
  void DeleteSavedModel(napi_env env, napi_value savedmodel_id);
```



#### TFJSBinding API

The [TFJSBinding](https://github.com/tensorflow/tfjs/blob/master/tfjs-node/src/tfjs_binding.ts#L32) interface will have corresponding functions to load, run and delete the SavedModel in JavaScript.



### Manage SavedModel in JavaScript

To manage and execute the session loaded from a SavedModel, a new TFSavedModel JavaScript class is added to [nodejs_kernel_backend](https://github.com/tensorflow/tfjs/blob/master/tfjs-node/src/nodejs_kernel_backend.ts#L38).
+


```
class TFSavedModel implements InferenceModel {
  private readonly id: number;
  private deleted: boolean;
  private readonly inputOpName: string[];
  private readonly outputOpName: string[];

  constructor(id: number, backend: NodeJSKernelBackend) {}

  predict(inputs: Tensor|Tensor[]|NamedTensorMap, config: ModelPredictConfig):
      Tensor|Tensor[]|NamedTensorMap;

  execute(inputs: Tensor|Tensor[]|NamedTensorMap, outputs: string|string[]):
      Tensor|Tensor[];

  delete() {}
}
```


An instance of TFSavedModel can only be created by the nodejs_kernel_backend object when calling TFJSBinding’s LoadSavedModel function, and its id value is the number returned from the TFJSBackend.

The following shows how a user would use a SavedModel in tfjs-node:


```
const model = tf.node.loadSavedModel(__dirname + '/saved_model', 'serve', 'serving_default');

const input = tensor1d([123], 'int32');

// Execute the loaded signature with its default input/output names.
const output = model.predict([input]);

// Or provide explicit input/output op names.
const outputs = model.execute({'input_op_names': input}, ['output_op_names']);

model.delete();
```



#### Load multiple signatures from the same SavedModel

If users want to use multiple signatures from the same SavedModel, they can call the tf.node.loadSavedModel() API several times to get multiple instances. The node backend will keep track of the SavedModel paths that have been loaded. When a new load is requested for a path that has already been loaded, the node backend will reuse the existing Session in the addon module instead of loading it again through the TF C API.


# Test Plan

Several different types of SavedModels will be added as test artifacts, and tests will run against these real SavedModels. These tests will also cover memory leak checks to make sure the corresponding memory is released when a SavedModel is deleted in the Node.js runtime.


# Benchmarking

A job to benchmark executing a SavedModel (probably MobileNet) in tfjs-node vs. TF Python will be added to the current [benchmark infrastructure](https://github.com/tensorflow/tfjs/tree/master/tfjs/integration_tests#running-tfjs-node-benchmarks).
diff --git a/rfcs/20190828-tfx-resolver.md b/rfcs/20190828-tfx-resolver.md
new file mode 100644
index 000000000..5995fef3d
--- /dev/null
+++ b/rfcs/20190828-tfx-resolver.md
@@ -0,0 +1,156 @@
+# TFX Resolver

Status | Accepted
:------------ | :--------------------------------------------
**Author(s)** | Ruoyu Liu (ruoyu@google.com)
**Sponsor** | Konstantinos Katsiapis (katsiapis@google.com)
**Updated** | 2019-08-28

## Objective

This RFC proposes the design of Resolver, which serves as an optional plugin in
TFX DSL to handle artifact resolution before a component execution. The
following can be achieved with this design:

*   Enable artifact resolution across pipeline run boundaries
*   Make artifact resolution easy to customize directly through pipeline
    definition

## Motivation

In the original design of TFX orchestration, Driver is used to prepare all
artifacts needed for a component execution and feed the result artifacts into
Executor for execution. The default behavior of input artifact resolution is to
take the outputs of upstream components in the same pipeline run. Any behavior
other than that requires a customized driver. While Driver is sufficient in
terms of functionality, it is essentially a black box for TFX end users that is
hard to reason about. Customization and maintenance are also hard, since a Driver
also contains other logic such as execution decision making.
+ +To address the aforementioned problem, we propose to extract the artifact +resolution part into a separate unit, named Resolver. It has the following +attributes: + +* It is an optional plugin. Users that do not need this feature do not need to + understand Resolver so that simple use cases still remain simple +* It is easy to understand. Pipeline authors and users no longer need to dig + into hidden driver code to reason about the artifacts' flow into a component +* It is easy to write and test. A Resolver definition is no more than a lambda + expression + +## Detailed Design + +### API + +A Resolver contains the definition of how to query back artifacts given source +Channels, an optional configuration and the access to historical context of +previous artifacts and executions. The API is similar as below: + +```python +class BaseResolver(object): + def __init__(self, configuration: Optional[Any] = None): + self._configuration = configuration + + @abstractmethod + def resolve( + self, + metadata_handler: metadata.Metadata, + input_dict: Dict[Text, Channel]) -> Dict[Text, Iterable[Artifact]]: + raise NotImplementedError +``` + +The parameter `metadata_handler` passed into `resolve()` is read-only since no +write should be allowed during artifact resolution stage. The other parameter +input is a mapping from tags to Channels. Each Channel provides the type +information that will be used when querying ML metadata. + +### DSL integration + +There are two options to integrate *Resolver* into TFX DSL: + +1. Make *Resolver* an optional parameter for component interface +2. Build a special node *ResolverNode* as the wrapper of *Resolver* logic and + make it independent of existing component interface. The definition of + *ResolverNode* is shown below + +```python +class ResolverNode(BaseNode): + def __init__(self, + name: Text, + resolver: Type[BaseResolver], + **kwargs: Channel): + ... +``` + +We choose to adopt option (2) for the following reasons: + +* It keeps simple cases simple. Users do not need to care about *Resolver* if + there is no need for cross-pipeline-run artifact resolution +* It has cleaner and clearer interface than Option (1), especially when + cross-pipeline-run artifact resolution is needed only for some of the inputs + to a component +* It allows not only resolution logic sharing but also resolution results + sharing. Instead of repeating the same *Resolver* multiple times, + ResolverNode allows reusing artifact resolution results with little work + +### Example + +The following example demonstrate our design. There are a couple of requirements +in this scenario: + +* Train with the latest n pieces of Example artifacts, including the one + produced within the pipeline run +* Transform and Trainer should operate on the same set of Example artifacts + +First, create a new resolver that implements the desired artifact resolution +logic: + +```python +# This class implements an artifact resolution logic that will return the latest +# n artifacts for each given Channel. +class LatestRollingWindowResolver(BaseResolver): + def resolve( + self, + metadata_handler: MetadataStore, + Input_dict: Dict[Text, Channel]) -> Dict[Text, Iterable[Artifact]]: + result = {} + for key, source_channel in input_dict.items(): + result[key] = self._get_artifacts_from_channel( + metadata=metadata_handler, + channel=source_channel, + sort_fn=_latest_first, + maximum_count=self._configuration.window) + return result +``` + +Next, create a new ResolverNode instance in the pipeline definition. 
An instance +of `LatestRollingWindowResolver` is passed in to serve as the resolution logic +unit. Since `transform` and `trainer` all use the output of the same +ResolverNode instance, they will share the same artifact resolution results. + +```python +def create_pipeline(): + ... + + example_gen = CsvExampleGen(input_base=...) + + resolver_node = ResolverNode( + examples=example_gen.outputs['examples'], + resolver=LatestRollingWindowResolver(generate_config(window=5))) + + transform = Transform( + examples=resolver_node.outputs['examples'], + ...) + + trainer = Trainer( + examples=resolver_node.outputs['examples'], + transform_output=transform.outputs['transform_output'], + ...) + ... + +``` + +## Future work + +With the ability to resolve artifacts from past runs, continuous training can be +enabled to take us one step further in ML production automation. diff --git a/rfcs/20190829-tfx-container-component-execution.md b/rfcs/20190829-tfx-container-component-execution.md new file mode 100644 index 000000000..9f7dd96da --- /dev/null +++ b/rfcs/20190829-tfx-container-component-execution.md @@ -0,0 +1,570 @@ +# TFX Container Component Execution Proposal + +Status | Accepted +:------------- | :------- +**Author(s)** | Ajay Gopinathan (ajaygopinathan@google.com), Hongye Sun (hongyes@google.com), Makoto Uchida (muchida@google.com), Ruoyu Liu (ruoyu@google.com) +**Sponsor(s)** | Konstantinos Katsiapis (katsiapis@google.com), Pavel Dournov (dournov@google.com), Ramesh Chandra (rameshuc@google.com) +**Updated** | 2019-08-29 + +## Objective + +This RFC proposes an orchestrator agnostic way to reliably execute a user’s +container in the TFX pipeline. The proposal can support: + +* Running an arbitrary container in either a local Docker environment or a remote + Kubernetes cluster. +* Passing data into the container +* Passing output data from the container +* Capturing logs from the container +* Handling errors and retries +* Cancelling the container execution if the pipeline is terminated + +## Motivation + +Currently, the execution of a generic container as a step in a TFX pipeline is +not supported. Without this feature, users cannot bring their own containers +into the pipeline. This blocks the following use cases: + +* User already has a docker image and wants to run the image as one of the + steps in a TFX pipeline. +* User wants to use non-Python code (R, for example) as one of the steps in a TFX + pipeline. +* User wants to have an isolated Python environment for their component code. + +This RFC is a follow-up design for +[Container-based Component RFC](https://github.com/tensorflow/community/pull/146). +This design defines how to execute the container spec as part of a TFX pipeline. +The execution may occurs in local Docker container or in a remote Kubernetes cluster. + +### Existing solutions + +#### Kubeflow Pipeline (KFP) ContainerOp + +Today, KFP’s ContainerOp leverages +[Argo container template API](https://github.com/argoproj/argo/blob/master/pkg/apis/workflow/v1alpha1/workflow_types.go) +to launch user’s container in a Kubernetes pod. Argo, as the orchestrator, controls when +to launch the POD and it uses a sidecar container to report output files back +and wait for user’s container to complete. We are not proposing to use Argo API +because of the following reasons: + +* Argo’s API is orchestrator-specific and cannot be ported to Airflow or local + runners. 
+* Argo’s API doesn’t provide an extensible way to run custom code before and + after POD API, which is critical to support metadata tracking and caching + features. +* Argo doesn’t provide an easy way to recover from user’s transient errors, + which is critical in production workload. + +#### Airflow Kubernetes pod operator + +Airflow supports launching a Kubernetes pod by an +[operator](https://github.com/apache/airflow/blob/master/airflow/contrib/operators/kubernetes_pod_operator.py). +This approach is closer to what we are proposing in the document. However, we +cannot directly use the operator because: + +* Airflow operator requires to be run inside an Airflow pipeline which is not + the case for local and KF runners. +* Airflow operator exposes a subset of POD’s API, where we want to expose the + full pod spec to the user. +* Airflow operator doesn’t provide a reliable way to retry user’s container + and recover from transient errors. +* Airflow does not support initializing an operator inside another operator. + Going back to using multiple Airflow operators for a component is a regression + now that we have `BaseComponentLauncher` ready. + +## Proposed Design + +### TLDR + +We propose to solve the above problems with the following design: + +* Define a container as an executor spec. +* Launch a container via a component launcher in either a local docker or Kubernetes pod. +* Use a platform config to specify a platform-specific settings config. + +The proposed solution has the following parts: + +* Extensible `ExecutorSpec` concept which can support a container as an + executor. +* Extensible `BaseComponentLauncher` concept to support pluggable component + launchers in a TFX runner. + * `DockerComponentLauncher` which launches `ExecutorContainerSpec` in + a Docker environment. + * `KubernetesPodComponentLauncher` which launches `ExecutorContainerSpec` + in a Kubernetes environment. +* Extensible `PlatformConfig` framework. + * `KubernetesPodPlatformConfig` to support Kubernetes pod spec as a config. + * `DockerPlatformConfig` to support docker run configs. + +### Architecture + +Architecture that allows local container execution. + +![TFX local container execution](20190829-tfx-container-component-execution/tfx-local-container-execution.png) + +Architecture that allows Kubernetes container execution. + +![TFX Kubernetes container execution](20190829-tfx-container-component-execution/tfx-k8s-container-execution.png) + +Class diagram that allows container execution + +![TFX container execution_classes](20190829-tfx-container-component-execution/tfx-container-execution-classes.png) + +### Python DSL experience + +In order to use container base component in TFX DSL, user needs follow these +steps. Step 1 and Step 2 follow the DSL extension proposed by [TFX Generic Container-based Component](https://github.com/tensorflow/community/pull/146). 
+ +#### Step 1: Define the container based component by `ExecutorContainerSpec` + +```python +class MyContainerBasedExampleGen(BaseComponent): + + SPEC_CLASS = types.make_spec_class( + inputs={ +      "raw_input": ChannelParameter(type=standard_artifacts.ExternalArtifact), +  } + outputs={ +      "examples": ChannelParameter(type=standard_artifacts.Examples), +  } + parameters={ +      "num_csv_columns": ExecutionParameter(type=int), +  } + ) + + EXECUTOR_SPEC = ExecutorContainerSpec( + container_image='gcr.io/my_project/my_example_gen:stable', + command=['python'], + args=['my_example_gen.py', + '--input_csv_file', '{{ inputs.raw_input.uri }}', + '--output_examples', '{{ outputs.examples.uri }}', + '--num_csv_columns', '{{ exec_props.num_csv_columns }}' ], + ) +``` + +#### Step 2: Create pipeline from container based component + +```python +def create_pipeline(): +  my_csv_file = Channel('CSVFile', uri="/path/to/csv_file") + +  my_example_gen = MyContainerBasedExampleGen( +        raw_input=my_csv_file, num_csv_columns=20) + +  return pipeline.Pipeline( +    pipeline_name = 'my_container_based_pipeline', +    pipeline_root = 'gs://path/to/root', +    components = [my_example_gen], +    ... +  ) +``` + +#### Step 3(a): Set docker config via runner’s config + +```python +_ = BeamRunner(platform_configs={ + 'MyContainerBasedExampleGen': [DockerPlatformConfig(volumes={...})] +}).run(create_pipeline()) +``` + +#### Step 3(b): Set Kubernetes platform config via runner’s config + +```python +_ = KubeflowDagRunner(platform_configs={ + 'default': [KubernetesPodPlatformConfig(Pod().use_gcp_secret().spec()] + 'MyContainerBasedExampleGen': [ + KubernetesPodPlatformConfig(Pod(cpu=2, memory='1GB').spec())]} +).run(create_pipeline()) +``` + +### Component launcher + +A component launcher launches a component by invoking a driver, an executor and +a publisher. It understands how to launch a component executor from an +`ExecutorSpec`. The `BaseComponentLauncher` is an abstract base class with two +abstract methods: + +* `can_launch`: public method to check whether the launcher can launch an + instance of `ExecutorSpec` with a specified `PlatformConfig` instance. The + method will be used by `TfxRunner` to choose launcher for a component. +* `_run_executor`: a protected method to launch an `ExecutorSpec` instance. + The method is invoked by `BaseComponentLauncher.launch` method. + +Subclasses of the base component launcher can support launching executors in +different target platforms. For example: + +* `InProcessComponentLauncher` can launch an executor class in the same Python + process. +* `DockerComponentLauncher` can launch a container executor in a Docker + environment. +* `KubernetesPodComponentLauncher` can launch a container executor in a Kubernetes + environment. +* A Dataflow launcher can launch a beam executor in Dataflow service. + +Pseudo implementation: + +```python +class BaseComponentLauncher(with_metaclass(abc.ABCMeta, object)): + @abc.abstractmethod + @classmethod + def can_launch(cls, executor_spec: ExecutorSpec, + platform_spec: Optional[PlatformConfig]) -> bool: + return False + + @abc.abstractmethod + def _run_executor(execution_id: int, + input_dict: Dict[Text, List[types.Artifact]], + output_dict: Dict[Text, List[types.Artifact]], + exec_properties: Dict[Text, Any]) -> Any: + pass + +class InProcessComponentLauncher(BaseComponentLauncher): + # InProcessComponentLauncher implements default launcher for python executor. + # It doesn't support platform_spec. 
+ @classmethod + def can_launch(cls, executor_spec: ExecutorSpec, + platform_spec: Optional[PlatformConfig]) -> bool: + if platform_spec: + return False + return isinstance(executor_spec, ExecutorClassSpec) + + def _run_executor(execution_id: int, + input_dict: Dict[Text, List[types.Artifact]], + output_dict: Dict[Text, List[types.Artifact]], + exec_properties: Dict[Text, Any]) -> Any: + # Python in process launcher implementation. + # Subclass should override this method to implement platform launcher + … + +class DockerComponentLauncher(BaseComponentLauncher): + + @classmethod + def can_launch(cls, executor_spec: ExecutorSpec, + platform_spec: Optional[PlatformConfig]) -> bool: + if not isinstance(executor_spec, ExecutorContainerSpec): + return false + + if not platform_spec: + return true + + return isinstance(platform_spec, DockerPlatformConfig): + + def _run_executor(execution_id: int, + input_dict: Dict[Text, List[types.Artifact]], + output_dict: Dict[Text, List[types.Artifact]], + exec_properties: Dict[Text, Any]) -> None: + # Docker launcher implementation + ... + +class KubernetesPodComponentLauncher(BaseComponentLauncher): + @classmethod + def can_launch(cls, executor_spec: ExecutorSpec, + platform_spec: Optional[PlatformConfig]) -> bool: + if not isinstance(executor_spec, ExecutorContainerSpec): + return false + + if not platform_spec: + return true + + return isinstance(platform_spec, DockerPlatformConfig): + + def _run_executor(execution_id: int, + input_dict: Dict[Text, List[types.Artifact]], + output_dict: Dict[Text, List[types.Artifact]], + exec_properties: Dict[Text, Any]) -> None: + # Kubernetes pod launcher implementation + … +``` + +### Platform config + +Platform config carries platform specific configs. Usually, one platform config +type maps to one component launcher type. For example, +`DockerPlatformConfig` can only be used by `DockerComponentLauncher` and +`KubernetesPodPlatformConfig` can only be used by +`KubernetesPodComponentLauncher`. + +Each platform config can be merged with another config with the same type. This +capacity is needed to support a layered configuration system in runner’s config: + +* User can define a default platform config list which will be applied to all + components in the pipeline. +* User can define component specific config by using component’s name as a + selector. +* Component specific config should override the default config. + +Pseudo implementation: + +```python +class PlatformConfig(with_metaclass(abc.ABCMeta, object)): + def merge(self, platform_config: PlatformConfig) -> PlatformConfig: + """Merge the current config with a new config. + Usually, it only happens when component config is merged with default config. + """ + # Simple recursive dictionary merge logic + +class DockerPlatformConfig(PlatformConfig): + def __init__(self, **kwargs): + # The kwargs is the same as the list defined in + # https://docker-py.readthedocs.io/en/stable/containers.html#docker.models.containers.ContainerCollection.run + self.run_kwargs = kwargs + +class KubernetesPodPlatformConfig(PlatformConfig): + def __init__(self, pod_spec: V1PodSpec): + self.pod_spec = pod_spec +``` + +#### Pod spec layers + +A final pod spec is merged by 3 layers of pod specs. 
They are: + +* Base pod spec layer +* Default config spec layer +* Component specific config spec layer + +The merge logic follows +[strategic merge patch](https://kubernetes.io/docs/tasks/run-application/update-api-object-kubectl-patch/#use-a-strategic-merge-patch-to-update-a-deployment) +to merge layers in order: base -> default -> component config. + +Strategic merge patch is different from JSON patch by merging lists and maps +instead of replacing them entirely. In this way, the patch layer doesn’t have to +specify the full content of a list or map. + +The base pod spec layer is created from user’s container spec. The pod spec +includes a main container spec with image path and entrypoint of the container. + +Default and component platform configs are configured by runner’s constructor. + +For example: + +```yaml +# base pod spec +apiVersion: v1 +kind: Pod +spec: + containers: + - name: main + image: tensorflow/tensorflow:v1.13 + command: ["python", "-c", "ml/app.py"] + +# pipeline pod spec +spec: + serviceAccountName: PipelineRunner + containers: + - name: main + resources: + limits: + memory: "128Mi" + cpu: "500m" + +# component config pod spec +spec: + containers: + - name: main + env: + - name: MYSQL_ROOT_PASSWORD + value: "password" + +# final pod spec +apiVersion: v1 +kind: Pod +spec: + serviceAccountName: PipelineRunner + containers: + - name: main + image: tensorflow/tensorflow:v1.13 + command: ["python", "-c", "ml/app.py"] + resources: + limits: + memory: "128Mi" + cpu: "500m" + env: + - name: MYSQL_ROOT_PASSWORD + value: "password" +``` + +### TFX runner + +A `TFXRunner` compiles a logical pipeline into the underlying orchestrator’s DSL. In +this proposal, the base runner should accept launchers and `platform_configs` +and provide a default strategy to choose launcher for each component. + +The default choosing logic is: + +* If `platform_configs` is set, use it along with executor spec to find the + first launcher which can support them. +* Otherwise, find the first launcher which can support the executor spec + without `platform_configs`. +* `platform_configs` has higher priority than `default_platform_configs`. + +Pseudo implementation: + +```python +class TfxRunner(with_metaclass(abc.ABCMeta, object)): + def __init__(self, launchers: List[BaseComponentLauncher], + platform_configs: Dict[Text, List[PlatformConfig]]): + self.launchers = launchers + self.default_platform_configs = platform_configs.get('default') + self.platform_configs = platform_configs + + def _get_component_launch_info( + self, component: BaseComponent) -> ComponentLaunchInfo: + component_platform_configs = self.platform_configs.get(component.name) + # Use PlatformConfig.merge to merge configs with the same type. + component_platform_configs = self._merge_platform_configs( + component_platform_configs, self.default_platform_configs) + # Select launcher by platform config. 
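    # A launcher that matches both the executor spec and a platform config takes
    # priority; if none matches, fall back to a launcher that can handle the
    # executor spec without any platform config.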
+ for platform_config in component_platform_configs: + for launcher in self.launchers: + if launcher.can_launch(component.executor_spec, platform_config): + return ComponentLaunchInfo(component, launcher, platform_config) + for launcher in self.launchers: + if launcher.can_launch(component.executor_spec): + return ComponentLaunchInfo(component, launcher) + + def run(self, pipeline) -> Optional[Any]: + component_launcher_infos = {c.name: self._get_component_launch_info(c) + for c in pipeline.components)} + return self._run(self, pipeline, component_launcher_infos) + + @abc.abstractmethod + def _run(self, pipeline, component_launcher_infos) -> Optional[Any]: + pass +``` + +### Output interface + +User container can receive a +[tmp directory path from default artifact store](https://github.com/tensorflow/community/blob/2c0b009ef955975b15a3cc18b1378e0ed38f394e/rfcs/20190904-tfx-generic-container-based-component.md#artifact-properties-after-execution-is-complete) +to write output data. The directory parameter will be called +`exec_properties.tmp_path`, which can be passed in as a command line argument. +The executor will look for `output.json` file under `exec_properties.tmp_path` +to get the outputs from the component. The output file follows the following +schema: + +```yaml +"$id": https://pipeline.mlx.org/output.schema.json" +"$schema": http://json-schema.org/draft-07/schema#" +type: object +title: Output +properties: + error_status: { "$ref": "#/definitions/OutputErrorStatus" } + outputs: + type: object + exec_properties: + type: object +definitions: + OutputErrorStatus: + type: object + properties: + code: + type: string + enum: [PERMANENT_ERROR, RETRYABLE_ERROR] + message: + type: string +``` + +The output.json file is optional, but if the user’s container writes to the file. It +overrides the default handling of the Kubernetes pod launcher. The output fields are: + +* error_status: tells the executor whether it should retry or fail +* outputs and exec_properties: used to override the execution and + output artifact metadata in MLMD. + +The output interfaces rely on `BaseComponentLauncher` to update states back to +MLMD from executor. + +### Auth context resolution + +The Kubernetes pod launcher internally uses the Kubernetes Python client. The auth context resolution +logic is as follows: + +1. If the current env is in a cluster, use `load_incluster_config` to load k8s + context. +1. If not, use default Kubernetes active context to connect to remote cluster. + +### Pod launcher resiliency + +In this design section, we focused more on the launcher resiliency under +`KubeflowDAGRunner`. In `AirflowDAGRunner`, the launcher code is running in the +same process of Airflow orchestrator, and we rely on Airflow to ensure the +resiliency of the process. `BeamDAGRunner`, however, is considered mainly for local testing +purpose and we won't add support for it to be resilient. + +In `KubeflowDAGRunner`, a pipeline step will create two pods in order to execute +user’s container: + +* A launcher pod which contains the driver, Kubernetes pod launcher, and publisher code. +* A user pod with user’s container. + +A pod in Kubernetes is not resilient by itself. We will use Argo’s retry feature to make +the launcher pod partially resilient. The details are as follows: + +* Each Argo launcher step will be configured with a default retry count. +* Argo will retry the step in case of failure, no matter what type of error. +* The launcher container will create a tmp workdir in `pipeline_root`. 
+* It will keep intermediate results (for example, the ID of the created pod) in the tmp workdir. +* The Kubernetes pod launcher will be implemented in a way that it will resume the + operation based on the intermediate results in the tmp workdir. +* The launcher will also record a permanent failure data in the tmp workdir so + it won’t resume the operation in case of non-retriable failures. + +### Default retry strategy + +K8s pod launcher supports exponential backoff retry. This strategy applies to +all runners which can support Kubernetes pod launcher. Docker launchers are not in the +scope of the design as it is mainly for local development use case. + +The retry only happens if the error is retriable. An error is retriable only +when: + +* It’s a transient error code from Kubernetes pod API. +* Or, the output.json file from artifact store indicates it’s a retriable error. +* Or, the pod get deleted (For example: GKE preemptible pod feature). + +### Log streaming + +The container launcher streams the log from user’s docker container or Kubernetes pod through the +API. It will start a thread which constantly pulls new logs and outputs them to +local stdout. + +## Discussions + +* In the Argo runner, each step requires 2 pods with total 3 containers (launcher + main container + launcher argo wait container + user main container) to run. + Although each launcher container requires minimal Kubernetes resources, + resource usage is still a concern. + + * With an additional pod, it gives launcher more control over execution and reduce the discrepancy between different orchestrators. We decided to go with the platform launcher approach and the additional container resource can be ignored. + +* For executor container spec, will environment variables be supported? + + * It’s not added right now to the container spec. Most things could be passed down using command line and arguments. So there is a workaround right now. Also, environment variables can be platform specific. Kubernetes for example, has certain conventions that don’t apply in other platforms. Hence, this could be part of platform_config instead of spec. + +* Step 3 indicates we first create a DAG, then use node identity to apply platform configuration. Another possibility is to do it directly during DAG construction. For example, if user goes back and changes the DAG, the platform configuration may stop mapping well, and will need to be revisited by the user. Did we consider the second option? + + * We don’t want users to add platform configuration at the pipeline level since it won’t be portable. We want the same pipeline to be compatible with local run and say running on k8s. Right now, we’d like to keep the pipeline specification itself clean from platform-specific abstractions. + + * The current proposal uses the component name as the key for binding. Different instantiations may have late binding for their names. For example, if I have 3 ExampleGen, should we be binding to the name instead of the type? + + * The names need to be unique else compilation fails. The instance names are actually component id, which is enforced to be unique at compile time. + + * How is caching affected by container tags? We should be careful with using container tags, since these are mutable. We should be relying on digest instead. If we cannot safely get digest, we should disable caching so we don’t fail due to the inability to obtain the digest at runtime. E.g. 
‘latest’ and ‘nightly’ tags are not good candidates + + * By default, if we provide a tag name, the image will be cached in the cluster. We should log an error if caching requirement cannot be met at runtime. Note that currently, executor code changes don’t affect caching behaviour. We should change the above and ensure caching takes the above into account as well. + + * Will we ensure we have adequate test coverage? + + * We will add e2e tests for both Docker container and Kubernetes container launchers. The former using Beam as orchestrator, and the latter using Kubeflow orchestrator. + + + * What are the major user-facing differences between using TFX DSL with these extensions compared with KFP’s SDK today? + + * here is a difference in how users will specify platform-specific configuration in the pipeline. In KFP’s SDK, the user specifies this in-place when writing the logical pipeline. In TFX DSL, the need to ensure the logical pipeline is portable necessarily means the platform configuration needs to be specified outside the logical pipeline, which may be slightly more cumbersome than the KFP experience today. Note that the separation of configuration sometimes appears in KFP SDK too, when users want to apply global settings. + + * We don’t support or provide mechanisms for users to control container lifetime, e.g. container cancellation. + + * A lot of cancellation operations are best effort anyway. Failures on cancel operations are hard to handle. Users need to understand from the document that we are not aiming to enable such operations. + * If user starts a long-running job from a container, and the pipeline is canceled, users may want the container to receive this message and cancel gracefully. + * Can we guarantee that workflows will not stop until we get confirmation of cancellation of long-running operations? + * This seems difficult, and best effort may be enough, given that this is all Kubernetes itself does today. 
diff --git a/rfcs/20190829-tfx-container-component-execution/tfx-container-execution-classes.png b/rfcs/20190829-tfx-container-component-execution/tfx-container-execution-classes.png new file mode 100644 index 000000000..3018b37b1 Binary files /dev/null and b/rfcs/20190829-tfx-container-component-execution/tfx-container-execution-classes.png differ diff --git a/rfcs/20190829-tfx-container-component-execution/tfx-k8s-container-execution.png b/rfcs/20190829-tfx-container-component-execution/tfx-k8s-container-execution.png new file mode 100644 index 000000000..708b623c3 Binary files /dev/null and b/rfcs/20190829-tfx-container-component-execution/tfx-k8s-container-execution.png differ diff --git a/rfcs/20190829-tfx-container-component-execution/tfx-local-container-execution.png b/rfcs/20190829-tfx-container-component-execution/tfx-local-container-execution.png new file mode 100644 index 000000000..bca49f159 Binary files /dev/null and b/rfcs/20190829-tfx-container-component-execution/tfx-local-container-execution.png differ diff --git a/rfcs/20190904-tfx-generic-container-based-component.md b/rfcs/20190904-tfx-generic-container-based-component.md new file mode 100644 index 000000000..fed4091b7 --- /dev/null +++ b/rfcs/20190904-tfx-generic-container-based-component.md @@ -0,0 +1,889 @@ +# TFX Generic Container-based Component Proposal + +Status | Accepted +:------------ | :------- +**RFC #** | [146](https://github.com/tensorflow/community/pull/146) +**Author(s)** | Ajay Gopinathan (ajaygopinathan@google.com), Hongye Sun (hongyes@google.com), Makoto Uchida (muchida@google.com) +**Sponsor** | Konstantinos Katsiapis (katsiapis@google.com) +**Updated** | 2019-09-04 + +## Objective + +This document proposes a design to enable users to attach an arbitrary +containerized program as a component to a pipeline authored using the TFX DSL, +in a way that inter-operates with other components. + +This RFC assumes some clarification on the TFX’s use of artifacts and metadata +as explained in [this section of TFX user guide](https://www.tensorflow.org/tfx/guide#artifacts). + +## Motivation + +A key value proposition provided by +[Kubeflow Pipelines (KFP)](https://www.kubeflow.org/docs/pipelines/overview/pipelines-overview/) +is letting users orchestrate arbitrary containers as part of a Machine Learning +(ML) pipeline. Many users already have custom ML applications written in +languages other than Python (e.g. in R, Java, C++ etc), and the ability to chain +existing containerized application programs with other pipeline steps through a +Python DSL is valuable. As of 2019/09 +([tfx 0.14](https://github.com/tensorflow/tfx/blob/0.14.0/RELEASE.md#version-0140), +even though TFX supports `KubeflowDagRunner` as an orchestrator, the +[TFX DSL](https://github.com/tensorflow/tfx/tree/0.14.0/tfx) does not provide +a mechanism to accomplish this. + +## User Benefit + +This document proposes a way to define a proper component based on any +containerized program as a solution to address this problem. The proposed +**container-based component** is realized by a simple DSL extension for custom +containers to the TFX DSL. It enables users to easily declare input and output +artifacts of a pipeline step implemented as a custom container-based program, as +well as a prescription for how to invoke the container’s entrypoint while +passing in the relevant artifact metadata and execution parameters. 
+ +In doing so, we will not only enable custom containerized application programs +to be used in TFX, but also augment KFP-based pipelines with the following +capabilities: + +* **Metadata-centric interface**: The proposed container-based component + provides a framework to clearly declare input- and output- signatures of the + container-based component in terms of + [strongly typed](https://github.com/google/ml-metadata/blob/ba69ae039bd2205ec2d7b982b3bfdda4718bf8df/ml_metadata/proto/metadata_store.proto#L55) + [artifact metadata](https://github.com/google/ml-metadata/blob/ba69ae039bd2205ec2d7b982b3bfdda4718bf8df/ml_metadata/proto/metadata_store.proto#L29). + This is a key value proposition of TFX DSL. + + * As of [kfp 0.1.26](https://github.com/kubeflow/pipelines/tree/0.1.26), + the + [KFP DSL](https://github.com/kubeflow/pipelines/tree/0.1.26/sdk/python/kfp/dsl) + has type system to input- and output- of a component. However, it is in + a way that it doesn’t attach semantic meaning to the input- and output- + variables in a way compatible to ML Metadata, and the way how TFX + uitilizes it to realize its functionalies. + +* **Metadata-driven component execution**: Metadata-centric interface of + components enables TFX Driver and Publisher logic to be applied + consistently, thus enabling caching of component execution as well + component-specific (meta)data-driven execution decisions + + * *Example*: ModelValidator can validate models trained with *any* + component, so long as the produced artifact is of the *type* Model. + +* **Inter-Component communication via ML Metadata store**: This enables higher + order component-specific driver logic that depends on another component’s + behavior. + + * *Example*: Pusher can choose to proceed or halt pushing depending on + output artifact from ModelValidator. + +* **Ability to share and reuse a wrapped component as drop-in replacement with + another**: As a result of the artifact centric, strongly typed input- and + output- signatures, it enables robust sharing and drop-in replacement of + components, so long as signatures (the list of input- and output- artifact + *types*) are the same, and payload is compatible (see also appendix). + +Additionally, the proposed container-based component will enable the following +new features in TFX DSL, which already exist In Kubeflow Pipelines: + +* **Ability to use any containerized application program in a pipeline**: The + proposed container-based component does not preclude any containerized + application programs from being used as a pipeline step. + +* **Ability to have fine-grained control on the underlying k8s execution + environment**: The proposed container-based component preserves the user’s + ability to control underlying Kubernetes runtime configuration for + containerized program execution. + +## Design Proposal + +As stated previously, custom containers may be written in arbitrary languages. +The input interface to containers is restricted to the command-line used to +invoke the application, while the output interface is through file I/O (either +STDOUT, STDOUT, or container-local files). + +Since container applications may not have access to TFX libraries, they are not +able to (or don’t even wish to) serialize/de-serialize +[ML Metadata](https://github.com/google/ml-metadata/blob/master/g3doc/get_started.md) +representing input- and output- artifacts, which is what defines the interface +to all TFX components[^1]. 
Hence, this design proposes DSL extensions that allow +users to declare the types of the input and output artifacts, and directly +reference the URIs of these artifacts when invoking their containers, while +retaining the ability to invoke the containerized program in exactly the way it +expects with regards to flags, arguments and environmental variables. The TFX +framework itself will generate the necessary driver-level code so that metadata +is logged and used for caching executions. + +[^1]: ML Metadata is also what enables inter-component communication to realize + artifact driven component-specific behavior (such as ExampleValidator and + ModelValidator) + +In order to make this proposal concrete, let’s consider a motivating example of +creating a simple 2-step pipeline. The first step generates examples, and the +second trains a model based on the previously produced examples. The following +shows some example code for these two steps: + +* `my_example_gen.py` + +```python +import argparse +import pandas as pd + +def convert_and_save(df, file): + ... + +if __name__ == '__main__': + + parser = argparse.ArgumentParser() + + # URI location of input CSV file + parser.add_argument('--input_csv_file', ...) + # URI location of output CSV file + parser.add_argument('--output_csv_file', ...) + # Number of CSV columns in the input file + parser.add_argument('--num_csv_columns', ...) + + args = parser.parse_args() + arguments = args.__dict__ + + df = pd.read_csv(arguments['input_csv_file'], + usecols=range(0, int(arguments['num_csv_columns']))) + + # implementation of business logic to ingest dataset + convert_and_save(df, arguments['output_csv_file']) + +``` + +* `my_trainer.py` + +```python +import argparse +import pandas as pd +import sklearn.linear_model import Logisticregression + +def load_dataset(file): + ... + +if __name__ == '__main__': + parser = argparse.ArgumentParser() + + # URI location of input file with pre-processed examples. + parser.add_argument('--input_examples', ...) + # URI location to output the trained model + parser.add_argument('--output_model', ...) + # Number of training steps to use. + parser.add_argument('--train_steps', ...) + + args = parser.parse_args() + arguments = args.__dict__ + + x_train, y_train, x_eval, y_eval = load_datset(arguments['input_examples']) + + model = LogisticRegression() + model.fit(x_train, y_train, args.train_steps) + + # + # ... perform grid search, etc, ... + # + + # Write the trained model. + joblib.dump(model, argument['output_model']) +``` + +Given the two (containerized) applications above, our goal is to chain them into +a 2-step pipeline which will automatically track metadata related to: + +* **Artifacts**: These include the input CSV file, the output CSV file with + training and eval examples, and the final Model file. +* **Executions**: These include execution steps `MyExampleGen`, with runtime + execution property `num_csv_columns`, and `MyTrainer`, with runtime + execution property `train_steps`. + +Notably, since `my_example_gen.py` produces *Examples*, we make the pipeline +understand that it is *Examples*, and record it as such in ML Metadata. Doing so +will enable downstream components, such as Trainer, to understand that it is +receiving *Examples*, and also realize higher-level functionality such as +ExampleValidation. Similarly, since `my_trainer.py` produces Model, other parts +of the pipeline should understand it to enable higher-level functionality such +as ModelValidation and Pusher to serving. 
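The Detailed Design below realizes this by letting the user declare typed input and output artifacts, and by rendering Jinja-templated command lines against the resolved artifact metadata at run time. As a rough illustration of that substitution step only (a sketch using the `jinja2` package with hypothetical values; the actual rendering is performed inside the TFX launcher and may differ):

```python
from jinja2 import Template

# Command-line template as it would appear in an ExecutorContainerSpec.
args_template = [
    'my_example_gen.py',
    '--input_csv_file', '{{ inputs.raw_input.uri }}',
    '--output_examples', '{{ outputs.examples.uri }}',
    '--num_csv_columns', '{{ exec_props.num_csv_columns }}',
]

# Hypothetical runtime context: artifact URIs resolved from ML Metadata and
# execution properties taken from the pipeline definition.
context = {
    'inputs': {'raw_input': {'uri': '/path/to/csv_file'}},
    'outputs': {'examples': {'uri': 'gs://path/to/root/examples/1'}},
    'exec_props': {'num_csv_columns': 20},
}

resolved_args = [Template(arg).render(**context) for arg in args_template]
# ['my_example_gen.py', '--input_csv_file', '/path/to/csv_file',
#  '--output_examples', 'gs://path/to/root/examples/1',
#  '--num_csv_columns', '20']
```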
+ +Hence, we propose to provide an extension to the DSL that allows users to +declare inputs and outputs in terms of ML Metadata of artifacts during pipeline +construction. The proposed DSL extension would translate the input and output +artifacts at pipeline runtime, abstracting reading and writing artifact metadata +from and to storage on behalf of the wrapped containerized application program. + +## Detailed Design + +We propose the following syntax for wrapping user’s containers in the DSL by +introducing `ExecutorSpec`. This syntax complements and extends the +[ComponentSpec](https://github.com/tensorflow/tfx/blob/0.14.0/tfx/types/component_spec.py#L73) +design previously implemented in TFX, and generalize `EXECUTOR_CLASS` to +[EXECUTOR_SPEC](https://github.com/tensorflow/tfx/blob/0.14.0/tfx/components/base/base_component.py#L63) +attribute of components. We propose to use the same `ComponentSpec` base class +to describe the custom container-based component’s input and output artifacts, +and its parameters. + +* Base `ExecutorSpec` class for container based component. + +```python +class ExecutorSpec(with_metaclass(abc.ABCMeta, object)): + """Base class that specifies 'what' to execute for the component.""" + pass + + +class ExecutorClassSpec(ExecutorSpec): + """Execution spec for a Python class-based Executor.""" + + # An in-process Python Executor class, derived from TFX + # base_executor.Executor class. + executor_class = ... + + def __init__(self, executor_class): + ... + + +class ExecutorContainerSpec(ExecutorSpec): + """Execution specification of the container-based component.""" + + # Container image that has executor application. + # Assumption is that this container image is separately release-managed, + # and tagged/versioned accordingly. + container_image = ... + + # Container entrypoint command line program and args. + # The Jinja templating mechanism is used for constructing a user-specified + # command-line invocation based on input and output metadata at runtime. + # The entry point can be as generic as `/bin/sh -c "..."`, which retains + # the ability to control inputs and/or exec_props with environment + # variables. + command = ... + args = ... + + # Additional specifications of execution specific to Runner's environment. + # For example, k8s pod configuration for launching the containerized + # application would be included here. + # Note that each Runner also has a way to specify it’s orchestrator-specific + # configuration, such as KubeflowDagRunnerConfig for KubeflowDagRunner. + # Details and relationship between platform_config and Runner’s config + # are subject to change. Detailed design document particularly on + # this point to follow. + platform_config = ... + + def __init__(self, container_image, command, args, platform_config): + ... +``` + +* `ExecutorSpec` and `ComponentSpec` for `my_example_gen.py` + +```python +# Container-based ExampleGen's execution spec. +my_example_gen_exec_spec = ExecutorContainerSpec( + # Container image of my_example_gen.py. + # Assumption is that this container image is separately release-managed, + # and tagged accordingly. This example demonstrates the ':stable' tag. + container_image='gcr.io/my_project/my_example_gen:stable', + command=['python'], + args=['my_example_gen.py', + '--input_csv_file', '{{ inputs.raw_input.uri }}', + '--output_examples', ' {{ outputs.examples.uri }}', + '--num_csv_columns', ' {{ exec_props.num_csv_columns }}' ], + platform_config=... 
+) + + +# One can choose to define a dev-instance of MyExampleGen container-based +# component, based on a different (newer) version of the container image. +# +# Alternatively, by using continuous integration tools, it is possible to +# dynamically build the docker image, and inject its image SHA id to +# this code via a flag. +my_dev_example_gen_exec_spec = ExecutorContainerSpec( + container_image='gcr.io/my_project/my_example_gen:dev', + command=['python'], + args=['my_example_gen.py', + '--input_csv_file', '{{ inputs.raw_input.uri }}', + '--output_examples', ' {{ outputs.examples.uri }}', + '--num_csv_columns', ' {{ exec_props.num_csv_columns }}' ], + platform_config=... +) + + +# Container-based ExampleGen's component spec. +# Notice that this is similar to FileBasedExampleGenSpec, +# but with a different set of PARAMETERS. +class MyContainerBasedExampleGenSpec(ComponentSpec): + """ComponentSpec to drive my_example_gen.py as a Component.""" + # Input artifacts. + INPUTS = { + "raw_input": ChannelParameter(type=standard_artifacts.ExternalArtifact), + } + + # Output artifacts. + OUTPUTS = { + "examples": ChannelParameter(type=standard_artifacts.Examples), + } + + # Parameters. + PARAMETERS = { + "num_csv_columns": ExecutionParameter(type=int), + } +``` + +* `ExecutorSpec` and `ComponentSpec` for `my_trainer.py` + +```python +# Container-based trainer's executor spec +my_trainer_exec_spec = ExecutorContainerSpec( + container_image='gcr.io/my_project/my_trainer:stable', + command=['python'] + args=['my_trainer.py', + '--input_examples', '{{ inputs.my_inputs.uri }}', + '--output_model', '{{ outputs.my_model.uri }}', + '--train_steps', '{{ exec_props.train_steps }}',] + # Platform config would specify use of GPU node pool for k8s, for example. + platform_config = ... +) + +# Container-based trainer's component spec. +# Notice that this is quite different from TrainerSpec, because of +# the command line flags that my_trainer.py takes are different from what +# TFX stock trainer takes. Nevertheless, it does produce an instance of +# Model artifacts, which can then be consumed by downstream components. +class MyContainerBasedTrainerSpec(ComponentSpec): + """ComponentSpec to drive my_trainer.py as a component.""" + # Input artifacts. + INPUTS = { + "my_input": ChannelParameter(type_name=standard_artifacts.Examples), + } + + # Output artifacts. + OUTPUTS = { + "my_model": ChannelParameter(type_name=standard.Artifacts.Model), + } + + # Parameters + PARAMETERS = { + # Execution properties. + "train_steps": ExecutionParameter(type=int), + } +``` + +### Component definition based on a generic containerized program + +With the introduction of `ExecutorContainerSpec`, the way to define a component +based on a containerized application is no different from any other custom +component. Below are illustrations to define the components from the previous +section in full, and their use in an end-to-end pipeline. + +* Component definitions + +```python + +class MyContainerBasedExampleGen(BaseComponent): + """Wraps my_example_gen.py.""" + SPEC_CLASS = MyContainerBasedExampleGenSpec + + EXECUTOR_SPEC = my_example_gen_exec_sepc + + # Optionally, if custom driver behavior is desired, such as checking + # mtime for file updates, one can define a custom Driver class to control + # the behavior of my_exmaple_gen.py inside the container. + DRIVER_CLASS = ... 
+ + +class MyContainerBasedTrainer(BaseComponent): + """Wraps my_trainer.py.""" + SPEC_CLASS = MyContainerBasedTrainerSpec + + EXECUTOR_SPEC = my_trainer_exec_spec + +``` + +* `pipeline.py` + +```python + +def create_pipeline(): + my_csv_file = dsl_utils.external_input(uri="/path/to/csv_file") + + my_example_gen = MyContainerBasedExampleGen( + raw_input=my_csv_file, num_csv_column=20) + my_trainer = MyContainerBasedTrainer( + my_input=example_gen.outputs.examples, train_steps=200) + + return pipeline.Pipeline( + pipeline_name = 'my_container_based_pipeline', + pipeline_root = 'gs://path/to/root', + components = [my_example_gen, my_trainer], + ... + ) + + +# It may be the case that some TfxRunner implementation (the ComponentLauncher +# thereof) does not have the ability to run a container-based component, +# in which case, an Exception is raised at the time when the logical pipeline +# is compiled for execution by the TfxRunner. +# See the next section of this doc about ComponentLauncher. +_ = KubeflowDagRunner().run(create_pipeline()) + +``` + +### ComponentLauncher to launch the container-based application + +With the introduction of `ExecutorContainerSpec` which does not specify +`executor_class`, the default implementation of `BaseComponentLauncher` may +not be able to execute the container-based component. Furthermore, different +orchestrator (i.e. an instance of `TfxRunner`) may have different ways to launch +the containerized application program. + +We propose to extend the `BaseComponentLauncher` to define orchestrator-specific +ways to execute the containerized program. It includes the ways to translate +input artifacts to the complete command line, by filling the [Jinja template](https://jinja.palletsprojects.com/en/2.10.x/) +for `ExecutorContainerSpec.command` and `ExecutorContainerSpec.args`, and to +translate output from the containerized application to keep track of metadata of +it and write back to Metadata storage. + +Below is one possible implementation of `BaseComponentLauncher` that implements +a way to launch container-based components with `KubeflowDagRunner`, with +ability to configure low level k8s configurations. This +`KubeflowComponentLauncher` would use the k8s Pod API to launch the container +through underlying Kubeflow Pipelines SDK implementation [ref](https://github.com/tensorflow/tfx/blob/0.14.0/tfx/orchestration/kubeflow/base_component.py#L118). +On top of this, `KubeflowDagRunner` allows to apply [additional k8s APIs](https://github.com/tensorflow/tfx/blob/0.14.0/tfx/orchestration/kubeflow/kubeflow_dag_runner.py#L188), +such as volume mount and secret management to pods. + +```python +# Demonstration of a ComponentLauncher that has the ability to launch +# container-based component, in addition to executor-class based component, +# with KubeflowDagRunner. +class KubeflowComponentLauncher(BaseComponentLauncher): + """Demonstration of a ComponentLauncher specific to KubeflwoRunner.""" + + def __init__(self, ..., platform_config=...): + # platform_config delines any Runner-specific, for example k8s-specific + # configurations for launching containerized application programs. + ... + + def _run_driver(self, ...): + # runs driver, which may be custom to each container-based component + ... + + def _run_executor(self, ...): + spec = self._executor_spec + if isinstance(spec, ExecutorContainerSpec): + # Launch the container entrypoint with the specified image, + # by the Runner-specific way to execute the container application. 
+ # In KubeflowDagRunner's case, it is with Argo on k8s. + # The platform_config is applied here. + ... + else: + # Launch executor_class as usual. + + def _run_publisher(self, ...): + # runs publisher. When launching container-based executor, this method + # is responsible for capturing output from the containerized program, + # and write back to ML Metadata. + .... + +``` + +Another example of `_run_executor()` to the above illustration may be to execute +`docker run` locally. + +The Runner should implement a suitable subclass of `BaseComponentLauncher` +accordingly. A pipeline may have different `ExecutorSpec`s for different +components. In case the Runner, and corresponding `BaseComponentLauncher` +subclasses, does not have a way to execute a containerized program with +`ExecutorContainerSpec`, a runtime error would be raised. If a Runner's `run()` +method has a compilation step from logical pipeline to orchestrator-specific +representation of the pipeline, such error could be caught at compile time. + +### Artifact Properties after Execution is complete + +It is worth noting that some (custom) properties of output artifacts can only be +determined after executor completed. For instance, `is_blessed` property of +`ModelBlessing` artifact can only be determined after execution finishes. + +When a custom image is used by the proposed `ExecutorContainerSpec`, we need to +capture the output of the component, decode value of these properties and send +them to Publisher so that published artifacts have correct final outcome. This +must be done before we transition the output artifact to Published state, so +immutability of published artifacts is preserved. + +There are few choices to realize this. + +#### Option 1: Disallow artifacts with such custom properties as output + +Simplest option is not to support such properties in output artifact from the +proposed container-based component. It is too limiting, and loses the main value +of it such that arbitrary business logic can be implemented in the +container-based application in a way that controls downstream component’s +behavior via output artifacts. + +#### Option 2: Capture container STDOUT + +Containerize application may indicate the result of execution to STDOUT. The +proposed container-based component could capture it and implement the logic to +translate into the (custom) property of the output artifact. This is also what +the previous Airflow based operators for TFX were doing, before the work to +combine driver/executor/publisher into a single Airflow operator was complete. +We do not see a reason to not generalize STDOUT to any file I/O interface. + +#### Option 3 (preferred): Use Files in shared temp directory to capture output artifacts + +This is a generalized version of Option 2, in that to capture output from the +containerized program via file I/O, and have the proposed container-based +component to capture it as properties of output artifacts. File I/O is +consistent with the way how `KubeflowDagRunner` passively logs output artifact +metadata as of `tfx` 0.13, hence natural extension to it. + +### Binding between ComponentSpec and ExecutorContainerSpec + +(A subclass of) `ComponentSpec` defines input and output artifact specification, +and execution property of the component, but does not define ‘what’ to execute +and ‘how’. (An instance of) `(Container)ExecutorSpec` defines ‘what’ and ‘how’ +to execute for this component. 
(A subclass of) `BaseComponent` defines the +binding of `ComponentSpec` and `ExecutorContainerSpec`, in order to be launched +by (a subclass of) `ComponentLauncher`. + +There are few possible design options as to where and how to define those +specifications and their bindings. + +#### Option 1 (illustrated above): Complete Separation between ExecutorContainerSpec, ComponentSpec and Component + +This is as illustrated as the code snippet in the previous sections. + +* **Pros**: + * It achieves clear separation between `ComponentSpec`, which is meant to + define executor-agnostic specification of a component, from + specification of execution which may be tied to a particular Runner’s + implementation, as illustrated in extension to `ComponentLauncher`. +* **Cons**: + * Component specifications are defined separately, and developer needs to + make sure to keep them consistent. + * Using `my_example_gen.py` in the above example, all of the below + needs to be defined in different places and kept in tightly + consistent. + 1. Command line flags to `my_example_gen.py`. + 1. Jinja template defined in `my_example_gen_exec_spec.command` and + `my_example_gen_exec_spec.args`. + 1. `INPUT`, `OUTPUT` and `PARAMETERS` in + `MyExampleGenComponentSpec`. + 1. The binding between `my_example_gen_exec_spec` and + `MyExampleGenComponentSpec`, which is done in + `MyContainerBasedExampleGen` class. + * If any of the above is inconsistent, the containerized + `my_example_gen.py` won’t be invoked properly, or output artifact is + not logged to ML Metadata thus not usable by downstream components. + * Such consistency check needs to be implemented outside of the component + class, possible as a part of `Runner` or `ComponentLauncher`. + +#### Option 2: ExecutorContainerSpec as a part of (subclass of) ComponentSpec + +Define a special subclass of `ComponentSpec`, that specifically holds +`ExecutorContainerSpec` as its member. + +```python + +class ContainerComponentSpec(ComponentSpec): + ... + # An instance of ExecutorContainerSpec + executor_spec = ... + + def _validate_spec(self): + ... + assert(instanceof(self.executor_spec, ExecutorContainerSpec)) + + +class MyContainerBasedExampleGenSpec(ContainerComponentSpec): + ... + executor_spec = ExecutorContainerSpec(...) + + +class MyContainerBasedExampleGen(BaseComponent): + SPEC_CLASS = MyContainerBasedExampleGenSpec + + EXECUTOR_SPEC = MyContainerBasedExampleGenSpec.executor_spec + +``` + +* **Pros**: + * Component’s `INPUT`, `OUTPUT` and `PARAMETER` definitions are co-located + with `ExecutorContainerSpec.command` and `ExecutorContainerSpec.args`’s + Jinja template in one place. + * It reduces cognitive load to properly define a container-based + component. + * It also makes it possible to place a static validation between them. +* **Cons**: + * `ContainerComponentSpec` defines not only specification of `INPUT`, + `OUTPUT` and `PARAMETERS` of the component, but also defines ‘how’ and + ‘what’ to execute, which violates the original design intention of + `ComponentSpec`. + +#### Option 3: Extend BaseComponent specifically for ContainerBasedComponent + +Provide a base class of `ContainerBasedComponent`, which defines all the specs +in one place as nested members. `ComponentLauncher` specific to a `Runner` +defines its behavior for subclasses of `ContainerBasedComponent`. 
+`ContainerBasedComponent` can be thought of as a convenience wrapper that puts
+together `ComponentSpec` and `ExecutorContainerSpec` in one place, and provides
+an additional validation check on the integrity between the two.
+
+```python
+
+
+# Abstract base class that has extra facility to support ExecutorContainerSpec
+class ContainerBasedComponent(BaseComponent):
+
+  EXECUTOR_SPEC = _abstract_property()
+
+  @classmethod
+  def dynamic_spec_class(cls, inputs, outputs, parameters):
+    class _ComponentSpec(ComponentSpec):
+      INPUTS=inputs
+      OUTPUTS=outputs
+      PARAMETERS=parameters
+
+    return _ComponentSpec
+
+  def _validate_entrypoint(self):
+    # Make sure SPEC_CLASS, executor_spec.command and executor_spec.args
+    # are consistent.
+    ...
+
+  def __init__(self, ...):
+    # It must execute a containerized program.
+    assert isinstance(self.executor_spec, ExecutorContainerSpec)
+
+    # SPEC_CLASS and EXECUTOR_SPEC must be consistent.
+    self._validate_entrypoint()
+
+    # Instantiate the component with the given ComponentSpec and ExecutorSpec
+    # and other auxiliary configurations.
+    super(ContainerBasedComponent, self).__init__(...)
+
+
+# Implementation of a component based on my_example_gen.py
+class MyContainerBasedExampleGen(ContainerBasedComponent):
+
+  # dynamic_spec_class() is syntactic sugar to be able to inline the
+  # SPEC_CLASS definition at declaration of a ContainerBasedComponent subclass.
+  # In case the same ComponentSpec may be shared with another component but
+  # with a different EXECUTOR_SPEC (and DRIVER_CLASS, etc.), this class should
+  # be defined explicitly and shared.
+  SPEC_CLASS = ContainerBasedComponent.dynamic_spec_class(
+      # Implementation of ComponentSpec specific to MyExampleGen. This is
+      # exactly the same as `MyContainerBasedExampleGenSpec` illustrated above.
+      INPUTS=...,
+      OUTPUTS=...,
+      PARAMETERS=...,
+  )
+
+  # This is the same as my_example_gen_exec_spec illustrated above.
+  EXECUTOR_SPEC = ExecutorContainerSpec(...)
+
+```
+
+* **Pros**:
+    * All specifications of a container-based component are co-located in one
+      place, making it possible to perform a static validation check for
+      consistency between the specs there.
+    * `ComponentSpec` remains purely about `INPUTS`, `OUTPUTS` and
+      `PARAMETERS` definitions, detached from ‘what’ and ‘how’ to execute the
+      component.
+* **Cons**:
+    * The nested `ComponentSpec` class style may be cumbersome.
+    * Porting a pipeline to a new runner would involve changing all components
+      to derive from a new base class, if the `ComponentLauncher` of the new
+      runner doesn’t know how to launch `ExecutorContainerSpec`.
+
+#### Option 4 (preferred): Utility to create inline specs and do static validation check hook
+
+This is built on top of Option 1.
+
+This option is similar to Option 3 and generalizes it to all executor types. The
+same pattern can also be applied to Python class executors. The proposal is to:
+
+* Create a `types.dynamic_spec_class()` method to facilitate creating an
+  inline `ComponentSpec`.
+* Define an abstract `validate_component_spec()` method in the `ExecutorSpec`
+  base class to perform executor-specific static validation.
+
+```python
+
+
+class BaseComponent:
+  def __init__(self, spec, ...):
+    …
+    # Call ExecutorSpec.validate_component_spec to validate the component spec.
+    # A subclass of ExecutorSpec should implement this validation hook to
+    # validate the component spec at compile time.
+ self.executor_spec.validate_component_spec(spec) + + +class ExecutorContainerSpec(ExecutorSpec): + def validate_component_spec(self, component_spec): + # Call Jinja parser to validate the entry-points with component_spec data. + … + +# Implementation of a component based on my_example_gen.py +class MyContainerBasedExampleGen(BaseComponent): + + # dynamic_spec_class() is a syntactic sugar to be able to inline + # SPEC_CLASS definition at declaration of BasedComponent subclass. + # In case the same ComponentSpec may be shared with another component but + # with different EXECUTOR_SPEC (and DRIVER_CLASS, etc), this class should + # be defined explicitly and shared. + SPEC_CLASS = types.dynamic_spec_class( + # Implementation of ComponentSpec specific to MyExampleGen. This is + # exactly the same as `MyContainerBasedExampleGenSpec` illustrated above. + inputs=... + outputs=... + parameters=... + ) + + # This is the same as my_example_gen_exec_spec illustrated above. + EXECUTOR_SPEC = ExecutorContainerSpec(...) + +``` + +* **Pros**: + * It generalizes to all executor types and keeps current component class + model unchanged. + * All specifications of a component is co-located in one place. + * Make it possible to perform executor specific static validation check + for consistency between specs. + * `ComponentSpec` remains purely about `INPUTS`, `OUTPUTS` and + `PARAMETERS` definitions, detached from ‘what’ and ‘how’ to execute the + component. + * Potentially, `BaseExecutor` can extend this model to support a class + method `validate_component_spec()` to support user executor static + validation of any logic. +* **Cons** + * `dynamic_spec_class()` style may be cumbersome. + * The dynamic class cannot be shared like static class. + +## Appendix + +### Pipeline Compilation and Release + +The proposed `ExecutorContainerSpec` and any related extension of DSL APIs +will reside in the TFX repository. Pending code completion, we may choose +to place some or all of the new APIs under `experimental` namespace until +we admit it to core APIs. + +If run with `KubeflowDagRunner`, it will be executed by `run()` method to +compile into Argo pipeline spec. As a result, there is no need to have any +additional code to be included inside the user’s container image. Other +orchestrators, such as `AirflowDAGRunner`, may have to have a newer version of +TFX SDK with the new `ExecutorContainerSpec` and implementations of +corresponding `ComponentSpec` subclasses installed in the environment in which +the component is executed. Nevertheless, it is no different than any other TFX +component’s execution in which it needs to have the TFX SDKs for components +installed on the Airflow execution environment. + +### Componentize a Python function, as opposed to a container image + +Kubeflow Pipelines SDK helps users to define a Python function and convert them +to a container-based application as a part of the pipeline (by the +`kfp.compiler.build_python_component()` API). In order for this to become fully +metadata-aware component as proposed in this document, a gap still remains that +it doesn’t help defining input- and output- of the Python function in terms of +the typed Artifacts to be tracked in ML Metadata. + +This proposed container-based component could further help filling the gap to +help declaratively configure `INPUTS`, `OUTPUTS` and `PARAMETERS` for the given +naked Python function and componentize it. 
Furthermore, there is an opportunity
+to create a helper that builds an image, configures a `python` command
+entrypoint from a naked Python function, and constructs the command line
+arguments under the hood, as a specialized subclass of it. Such a helper shall
+eventually converge with the other way of implementing a
+[custom component](https://github.com/tensorflow/tfx/tree/0.14.0/tfx/examples/custom_components)
+for TFX, namely writing a custom `Executor` class in Python and packaging it in
+a container image for release.
+
+A detailed RFC on this particular point will follow.
+
+### Component Archetypes
+
+As of `tfx` 0.14, there are
+[10 known artifact types](https://github.com/tensorflow/tfx/blob/0.14.0/tfx/types/standard_artifacts.py)
+defined and used.
+
+* ExternalArtifact
+* Examples
+* Schema
+* ExampleStatistics
+* ExampleAnomalies
+* TransformGraph
+* Model
+* ModelEvaluation
+* ModelBlessing
+* PushedModel
+
+Based on the above known artifact types, TFX defines the following
+[9 component archetypes](https://github.com/tensorflow/tfx/tree/0.14.0/tfx/components).
+
+| **Component**    | **Inputs**                                     | **Outputs**                             |
+| :--------------- | :--------------------------------------------- | :-------------------------------------- |
+| ExampleGen       | ExternalArtifact (optional)                    | Examples                                |
+| StatisticsGen    | Examples                                       | ExampleStatistics                       |
+| SchemaGen        | ExampleStatistics                              | Schema                                  |
+| ExampleValidator | ExampleStatistics, Schema                      | ExampleAnomalies                        |
+| Transform        | Examples (raw), Schema                         | TransformGraph, Examples (transformed)  |
+| Trainer          | Examples, TransformGraph (optional), Schema    | Model                                   |
+| Evaluator        | Examples, Model                                | ModelEvaluation                         |
+| ModelValidator   | Examples, Model                                | ModelBlessing                           |
+| Pusher           | Model, ModelBlessing                           | PushedModel                             |
+
+The proposed generic container-based component will enable scenarios where, so
+long as the wrapped component adheres to one of the above input and output
+archetypes, it can serve as a drop-in replacement, retaining its interactions
+with the rest of the components in the pipeline, without the pipeline having to
+know the actual business logic inside the container application.
+
+### Specification of Artifacts
+
+As of tfx 0.14, the schema (list of properties) of metadata for each artifact
+type is defined implicitly when it is created and used
+([example](https://github.com/tensorflow/tfx/blob/0.14.0/tfx/components/transform/component.py#L118)).
+
+In order for the proposed generic container-based component to utilize artifacts
+and their metadata in a standardized way, such metadata schema definitions need
+to be made explicit, possibly as a Python class (in TFX 0.14.0, the base
+`Artifact` class defines the common set of
+[properties](https://github.com/tensorflow/tfx/blob/0.14.0/tfx/types/artifact.py#L103),
+with the option for each sub-classed type to extend it). In other words, unless
+the known artifact types are explicitly defined and accessible in a common
+repository, a custom-built container-based component would not be able to
+interact with other components via such artifact types, and in turn would not
+be able to benefit from the interoperability and shareability with other
+components in a pipeline.
+
+We anticipate that
+[standard_artifacts.py](https://github.com/tensorflow/tfx/blob/0.14.0/tfx/types/standard_artifacts.py)
+will serve as the catalog of known artifact types. We also anticipate that this
+catalog might evolve with more properties of a type, or more types themselves.
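+
+As a purely illustrative sketch (the `Artifact` base class usage, `TYPE_NAME`,
+and `PROPERTIES` declarations below are hypothetical and not necessarily the
+`tfx` 0.14 API), an explicitly declared artifact type could look like:
+
+```python
+# Hypothetical sketch only: an artifact type whose metadata schema is spelled
+# out explicitly, so a container-based component can rely on its properties.
+class ModelBlessing(Artifact):
+  # Canonical type name registered in ML Metadata.
+  TYPE_NAME = 'ModelBlessingPath'
+  # Custom properties, in addition to the common properties defined on the
+  # base Artifact class; e.g. is_blessed is only known after execution.
+  PROPERTIES = {
+      'is_blessed': int,
+  }
+```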
+
+### Interoperability of Artifact Payload
+
+In order for a custom component to be interoperable with other parts of the
+TFX system, the payload of artifacts must be compatible with what the metadata
+(via properties of artifacts) describes, so that downstream components can
+properly consume the artifacts. In fact, in TFX 0.14.0, there are implicit
+assumptions about the payload of artifacts. For example, the payload of a
+*Model* artifact is always a TensorFlow SavedModel with certain signatures that
+downstream components, such as Pusher (and the serving system it pushes to),
+can consume. Likewise, the payload of an *Examples* artifact is a GZipped
+TFRecord of tensorflow.Example records.
+
+For any custom component, regardless of whether it is implemented as the
+proposed container-based component or as a
+[Python class](https://www.tensorflow.org/tfx/guide/custom_component), a
+mismatch in the assumed payload would cause a runtime error.
+This is analogous to the fact that Pandas [`DataFrame.to_csv()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html)
+and a subsequent [`read_csv()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) must agree on the same format options
+(such as delimiter, quote, and header).
+
+We believe that the best possible way to handle such ambiguity is to enforce
+consistency within user projects by convention on artifact properties. This
+approach retains the ability to implement logic in custom components that
+enforces payload compatibility between components at DAG compilation time. Once
+this has proven sufficiently and generally useful, some of these conventions
+would be admitted to the central artifact type/property repository mentioned in
+the previous section, and the compile-time payload compatibility check logic
+would be admitted to TFX's core library.
diff --git a/rfcs/20190910-struct-tensor.md b/rfcs/20190910-struct-tensor.md
new file mode 100644
index 000000000..160722491
--- /dev/null
+++ b/rfcs/20190910-struct-tensor.md
@@ -0,0 +1,952 @@
+# StructuredTensor
+
+| Status        | Accepted       |
+:-------------- |:---------------------------------------------------- |
+| **Authors** | Edward Loper (edloper@google.com), Martin Zinkevich (martinz@google.com), Zhuo Peng (zhuo@google.com) |
+| **Sponsor**   | Alex Passos (apassos@google.com)                 |
+| **Updated**   | 2019-09-10                                           |
+
+## Objective
+
+This RFC proposes a new Python tensor type **`tf.StructuredTensor`**, which
+provides a flexible and Tensorflow-native way to encode structured data such
+as Protocol Buffers or Pandas DataFrames. A ***StructuredTensor*** is a
+multi-dimensional collection of ***structures*** with the same ***schema***,
+where:
+
+* A ***schema*** is a collection of fields, each of which has a name and a type.
+* A ***structure*** maps each field in the schema to a tensor value
+  (which could be a nested `StructuredTensor`).
+
+As an important special case, a 1D `tf.StructuredTensor` encodes a 2D table,
+where columns are heterogeneous `Tensor`s, and rows are the aligned elements
+in each of those `Tensor`s. This special case maps cleanly to a Pandas
+DataFrame or an Arrow RecordBatch.
+
+## Motivation
+
+Structured data types are widely recognized as a useful abstraction for making
+systems easier to understand and more likely to be correct. But in TensorFlow,
+it is currently up to the user to keep track of how individual tensors relate to
+one another; and integrating Tensors with external data formats must be done by
+hand.
This section illustrates the benefits of `StructuredTensor`s with a +motivating example. The benefits are summarized in the next section ("User +Benefit"). Alternative solutions are discussed in the "Related Work" section. + +### Example: Reading Structured Data + +Tensorflow is a very powerful tool for learning deep networks to solve problems. +However, it requires that users represent data using vectors and tensors. Most +problems can be solved by transforming inputs into tensors and operating on +those tensors, but usually the inputs to a model start out in some structured +native format. Currently, it is common practice to use "feature extraction" +code that is outside the main TensorFlow graph to transform the input into a set +of tensors. However, there are several advantages to performing this feature +extraction as part of the graph, and `StructuredTensor` enables this by giving +us a way to represent the original structured input in a TensorFlow-native +format. + +For example, consider the problem of predicting how much a user will enjoy a +recipe, based on a structured representation of the recipe. The structured +representation might include a pre-generated user embedding (where users with +similar tastes have similar embeddings), a recipe title, an estimated +preparation time, a list of ingredients (with amounts and types), a list of +steps, and information about what other users enjoyed the recipe. E.g.: + +![StructuredTensor schema for recipe task](20190910-struct-tensor/recipe_schema.png +"StructuredTensor schema for recipe task") + +In a traditional model, we might transform this structured representation into a +set of tensors using external code. For example, the recipe step descriptions +might be normalized, tokenized, and combined into a single bag-of-words tensor. +Those tensors would then be fed as inputs into a model to predict the user's +score for a recipe. But using `StructuredTensor`, the native representation of +the structured input can be fed directly into the model graph, and any +transformation of that structured encoding into flat tensors can be performed +inside the model. The following diagram illustrates the difference between +these approaches: + +![Separate vs Integrated Feature +Extraction](20190910-struct-tensor/feature_extraction_in_model.png +"Separate vs Integrated Feature Extraction") + +There are several advantages to including feature extraction as part of the +model: + +* **Encapsulation**: When the feature extraction code is not packaged with the + model, it is easier for them to get out of sync. For example, if the feature + extraction is updated but the model doesn't get retrained, then the model will + not perform as expected. + +* **Modularity**: Packaging the feature extraction and model as a single unit + makes it easy to experiment with new models, and to swap new models into a + production environment, even if the new models processes the original input in + a different way. + +* **Easy integration**: `StructuredTensor`s can be constructed from a wide variety + of popular formats, including protocol buffers, JSON, Apache Arrow and + Parquet, and many others. + +* **Training/Serving skew**: Often when feature extraction is performed + separately, separate functions are used to process training and inference + inputs. For example, the training input needs to be batched, and may be + stored in a separate format from the inference format. 
This divergence + increases the risk of "training/serving skew," where the two feature + extraction procedures differ in subtle ways, leading to suboptimal model + performance. Integrating feature extraction into the model minimizes the + chance of train/test mismatch. + + +* **Efficiency**: `StructuredTensor` can be read several several popular formats + using zero-copy reads, for lightning-fast data access with minimal + serialization overhead. In contrast, external feature extraction usually + involves serializing to `tf.example`, which can add substantial overhead. + + +### Example: Writing Structured Data + +`StructuredTensor` gives us a principled way to generate and store structured +output; without `StructuredTensor`, we would need to encode the predicted structure +in some ad-hoc format. To continue the example above, perhaps recipes are only +available in a semi-structured format, such as a webpage or a text file. We +might therefore wish to build a model that can parse a "raw" recipe, and predict +its structure. Using `StructuredTensor` as an output format has several +advantages: + +* **Standardization**: `StructuredTensor` provides us with a standard output format + for structured models, which can be easily converted to other popular formats, + including protocol buffers, JSON, Apache Arrow, Parquet, and many others. + +* **Efficiency**: `StructuredTensor`s can be converted to many standard output + formats using zero-copy writes with minimal serialization overhead. + + +* **Joint modeling**: We can combine a model that outputs `StructuredTensor`s (such + as recipe parsing) and a model that inputs `StructuredTensor`s (such as recipe + rating) into a single model. This makes it easy to do joint modeling, where + the final loss function can be used to improve both sub-models. + +### Example: Processing Structured Data + +`StructuredTensor` is useful even for models that don't have structured inputs or +outputs at their boundaries, by providing a way to manage structured data, and +to track and process collections of related tensors. Consider the joint model +constructed by combining the recipe parsing model with the recipe rating model +as a single end-to-end recipe rating model. This model has a simple input (a +single string) and a simple output (a single score), but the two internal +components of the model share a structured intermediate representation. +Advantages of using `StructuredTensor` internally to a model include: + +* **Self-documenting Tensors**: In models that use many different tensors, it + can be difficult (and error-prone) to keep track of the meanings of the + different dimensions for each tensor. By wrapping tensors in `StructuredTensor`s, + we can effectively provide a label for each dimension. + + +* **Shared Outer Dimensions**: Often, several different tensors will have + "shared dimensions." For example, we might have several different tensors + corresponding to a recipe's ingredient list, where the outer dimension of each + tensor indicates which ingredient is described. Using `StructuredTensor`, we can + combine all of these tensors into a single structure, which encodes, + documents, and enforces the fact that this dimension is shared. + + +* **Zero Runtime Overhead**: As discussed below, `StructuredTensor`'s internal + encoding is a collection of parallel tensors; and any operations on the + `StructuredTensor` reduce to operations on those individual tensors. 
Therefore, + `StructuredTensor` adds minimal overhead at graph-construction time, and zero + overhead at training and inference time. + +## User Benefit + +* **Integration*. `StructuredTensor`s can be constructed from a wide variety of + popular formats, often using zero-copy reads and writes. This will make it + much easier to integrate `StructuredTensor`s with existing systems, which may use + a wide variety of data formats. The following diagram gives examples of the + formats that for which we intend to provide direct conversion: + + ![Integration with Other + Formats](20190910-struct-tensor/integration_with_other_formats.png + "Integration with Other Formats") + +* **Encapsulation**. TensorFlow models are tightly coupled with input + transformation and feature engineering. Being able to package the model and + the feature engineering in a single graph has a variety of advantages: + + * It is easier to keep feature engineering and modeling in sync. Since the + same subgraph is used for training and inference, there is less code + duplication in feature engineering, which means that there are fewer places + to introduce training/serving skew. + * It is easier to try out different feature engineering techniques within a + model, without changing the overall infrastructure. + * It is simpler to swap in a new model with different input processing in a + production environment, since the input processing and the model can be + packaged as a single SavedModel. + * By directly operating on the native data, there is a possibility of trying + completely different modeling techniques without changing the API of the + model's TensorFlow graph. + +* **Efficiency.** + + * `StructuredTensor` can be read from and written to many popular formats using + zero-copy reads and writes, for lightning-fast data access with minimal + serialization overhead. This avoids the overhead of reading from and + writing to `tf.Example` or similar formats (which can sometimes be + substantial). + * `StructuredTensor` does not introduce any additional runtime overhead to + TensorFlow graphs, because all operations on `StructuredTensor`s reduce to + operations on the individual contained tensors. This also ensures that all + applicable graph optimizations can be applied to the graph. + * Graph optimizations can be used to ensure that the data ingestion + operations skip over any structured input fields that are not used by the + model. + * When multiple structured models are chained together, the structured output + of one can be fed directly in as the input of the next, with no need for + any serialization. + +* **Understandability.** It is difficult to represent structured records in + tensorflow, and even harder to represent nested structures. By having a + standard way to do this: + + * It is easier to get a holistic view of feature engineering and machine + learning. Often, decisions about feature engineering and modeling are + intertwined. By having them in one graph, this makes it clearer ex post + what feature engineering and modeling are useful together. + * We can provide clearer interfaces between stages of the graph. + * We can wrap tensors to document the meanings of dimensions at graph + construction time, with no run-time cost. + * We can encode, document, and enforce the relationship between related + tensors, including parallel tensors with shared dimensions. + * We can more easily build shareable feature engineering components. 
+ +* **Joint Modeling.** + + * We can combine multiple structured models into a single joint model, where + loss functions can propagate through all the submodels. + +### Limitations and Drawbacks + +As with any generic system, there are some compromises in the design that +deserve mention: + +* **Efficiency vs custom feature engineering**. It may be possible to perform + input transformation and feature engineering more efficiently outside the + graph, depending on the types of input transformation that are performed. + +* **Efficiency vs custom data ingestion**. For some input formats, a custom + data ingestion operation could be made more efficient than the generic data + ingestion operations that will be used for `StructuredTensor`s. For example, if a + model is parsing a custom protocol buffer, then a custom op that reads that + protocol buffer could use tricks and assumptions that are not available to a + generic input parser. However, for several columnar input formats such as + Apache Arrow, where zero-copy reads and writes are possible, the difference + between a custom input parser and the generic input parser should be small. + +* **Requires a shared schema.** `StructuredTensor` requires that all structures in + a given tensor have the same schema. Thus, `StructuredTensor` cannot be used to + encode arbitrary schema-free "dictionary-like" data. + +* **Shared dimensions must be leading dimensions.** A collection of tensors can + only be combined into a single `StructuredTensor` if their shared dimensions are + the leading dimensions. E.g., two tensors `Tensor` with shape `[batch, + ingredient]` could be combined into a single `StructuredTensor`; but a `Tensor` + with shape `[batch, ingredient]` could not be combined with a `Tensor` with + shape `[batch, recipe_step, ingredient]`. + +* **Trees, not graphs.** `StructuredTensor` encodes tree-like nested structure, + consisting of records and lists. It does not provide any direct support for + encoding graph-like structured objects. + +Thus, as with any generic system, we have made some sacrifices of speed for +generality and safety, but we believe that the benefits in terms of integration, +collaboration, and model understandability outweigh the limitations. + +## Design Proposal + +We are proposing the addition of a new Python tensor type **`tf.StructuredTensor`**, +which can be used to encode structures and struct tensors. + +* A ***scalar `StructuredTensor`*** contains a single structure (whose fields may + have any type and shape). +* A ***vector `StructuredTensor`*** or ***higher-dimensional `StructuredTensor`*** + contains a collection of structures with the same schema. 
+ +For example, we can use a `StructuredTensor` to encode the following structured +information about a user and a recipe: + +```python +struct = tf.struct.constant({ + 'user_embedding': [0.8, 2.1, 0.3, 0.1, 9.2, 1.8], + 'recipe': { + 'title': 'Snickerdoodle cookies', + 'est_time': 55.0, + 'ingredients': [ + {'amount': 3.0, 'unit': 'cup', 'name': 'flour'}, + {'amount': 1.0, 'unit': 'cup', 'name': 'white sugar'}, + {'amount': 0.5, 'unit': 'cup', 'name': 'brown sugar'}, + {'amount': 1.0, 'unit': 'cup', 'name': 'butter'}, + {'amount': 1.0, 'unit': 'teaspoon', 'name': 'cinnamon'}, + {'amount': 2.0, 'unit': 'teaspoon', 'name': 'cream of tartar'}, + ...], + 'step': [ + {'description': 'Preheat oven to 350 degrees F.'} + {'description': 'Whisk together the flour, cinnamon, baking soda, salt, ...'} + {'description': 'Using an electric mixer, cream together the butter and sugar ...'} + {'description': 'Add the dry ingredients to the wet ingredients and mix until ...'} + ...], + 'user_rating': [ + {'user_embedding': [0.7, 2.0, 0.3, 0.3, 5.2, 2.2], score: 0.8}, + {'user_embedding': [1.4, 0.0, 3.1, 1.1, 1.2, 0.3], score: 0.4}, + ...], + }) +``` + +In this example: + +* The root object `struct` is a scalar `StructuredTensor` (i.e., a single + structure). +* The `recipe` field contains a scalar `StructuredTensor` (a single nested + structure). +* The nested fields `ingredients`, `step`, and `user_rating` contain vector + `StructuredTensor`s (i.e. 1D collections of structures with the same schema). +* The nested fields `title`, `est_time`, `amount`, `unit`, `name`, + `description`, and `score` contain scalar `Tensor`s. +* The nested `user_embedding` field contains a vector `Tensor`. + +In the initial implementation, the value for a field may be a `Tensor`, a +`RaggedTensor`, or a nested `StructuredTensor`, where nested `StructuredTensor`s could +contain single structures (scalar `StructuredTensor`) or collections of structures +(non-scalar `StructuredTensor`). In the future, we may add support for additional +value types, such as `SparseTensor`s. + +A `StructuredTensor`'s schema constrains the type for each field: + +* **For fields with `Tensor` values**, the schema specifies the field value's + `dtype` and `shape`. +* **For fields with `RaggedTensor` values**, the schema specifies the field + value's `dtype` and `ragged_rank`. +* **For fields with `StructuredTensor` values**, the schema specifies the nested + `StructuredTensor` value's `shape` (e.g., a single structure vs a list of + structures), and the name and type for each nested field. + +For multidimensional `StructuredTensor` (such as a vector of structures), the +individual structures must all share the same schema. + +### Usage + +#### Construction + +`StructuredTensor`s can be constructed from a variety of sources, including: + +* Nested Python dictionaries and lists. +* Serialized protocol buffer messages. +* JSON-encoded protocol buffers. +* Apache Arrow record batches. +* Apache Parquet. +* Pandas DataFrames. +* Spanner. +* Firebase. + +For more details, see the section below on "Integration with Other Formats." +`StructuredTensor`s may also be constructed directly from a shape and a field +dictionary; see the "Struct Tensor Encoding" section below for more details. 
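+
+As a rough sketch of the two construction paths (using the `tf.struct.constant`
+and `DenseStructuredTensor` names proposed elsewhere in this document; the exact
+signatures are illustrative):
+
+```python
+# 1) Construction from a nested Python value, with the schema inferred.
+st1 = tf.struct.constant({'title': 'Snickerdoodle cookies', 'est_time': 55.0})
+
+# 2) Direct construction from a shape and a field dictionary. Each field
+#    tensor's leading dimensions must match the StructuredTensor's shape.
+st2 = DenseStructuredTensor(
+    shape=[2],
+    fields={'x': tf.constant(['foo', 'bar']),
+            'y': tf.ragged.constant([[1, 2], [3]])})
+```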
+ + + +#### Indexing + +For scalar `StructuredTensor`s, the Python indexing operator selects a field's +value: + +```python +>>> print(struct['user_embedding']) +tf.Tensor([0.8, ...], shape=(6,), dtype=float) # result is a Tensor vector +>>> print(struct['recipe']) + # result is a StructuredTensor scalar +>>> print(struct['recipe']['title']) +tf.Tensor(Snicker..., shape=(), dtype=string) # result is a string scalar +>>> print(struct['recipe']['ingredients']) + # result is a StructuredTensor vector +``` + +For non-scalar `StructuredTensor`s, the indexing operator selects elements (as with +any other tensor): + +```python +>>> print(struct['recipe']['ingredients'][0]) + # result is a StructuredTensor scalar +``` + +Multi-dimensional indexing is supported: + +```python +>>> print(struct['recipe', 'ingredients', 0, 'name']) +tf.Tensor(flour, shape=(), dtype=string) # result is a string scalar +``` + +This includes support for slicing ranges of values in the tensor dimensions: + +```python +>>> print(struct['recipe', 'ingredients', :, 'name']) +tf.Tensor(['flour' 'white sugar' ...], shape=(10), dtype=string) + +>>> print(struct['recipe', 'user_rating', :, 'user_embedding', ::2]) + +# result is a 2D float tensor with shape [num_user_ratings, 3]. +``` + +#### Path-Based Manipulation + +We will define a collection of ops that can be used to move, transform, and +combine fields within a `StructuredTensor`. These include: + +* **`broadcast`**: copies values from a source path to a target path, where the + target path is a descendant of the source path. +* **`promote`**: copies values from a source path to a target path, where the + target path is an ancestor of the source path. +* **`apply`**: applies a given op or function to one or more source paths, and + writes the result to a specified target path. + +#### Updating Fields + +The `StructuredTensor` object is immutable. However, several methods are provided +which can be used to construct new `StructuredTensor`s with modified field values. + + +`StructuredTensor.with_updates(self, **updates)` +: Returns a copy of this `StructuredTensor` with one or more fields modified or added. + +`StructuredTensor.without(self, *field_names)` +: Returns a copy of this `StructuredTensor` with one or more fields removed. + +`StructuredTensor.with_only(self, *field_names)` +: Returns a copy of this `StructuredTensor` with only the specified fields retained. + +The tensors encoding any unmodified fields are shared (not copied). + + +### Struct Tensor Encoding +`StructuredTensor` will be an abstract base class with two concrete subclasses: + +* `DenseStructuredTensor`: A dense collection of structures. +* `RaggedStructuredTensor`: A ragged collection of structures. + +#### `DenseStructuredTensor` +Internally, each `DenseStructuredTensor` is encoded using two objects: + +* **shape**: A TensorShape specifying the overall shape of the `StructuredTensor`. + For example, a `StructuredTensor` with shape `[5, 3]` contains `15` structures, in + `5` rows of `3`. The `shape`'s rank must be statically known -- i.e., + `shape.ndims` may not be `None`. (Note: this is the shape for a collection of + structures; it does not describe the shapes of any individual fields.) For a + scalar `StructuredTensor` (i.e, a single structure), `shape=()`. + +* **fields**: A python dictionary mapping each field name to a `Tensor`, + + `RaggedTensor`, or `StructuredTensor` that encodes that field's values. 
If `st` + is an `N`-dimensional `StructuredTensor`, then for each field `(k, v)` in + `st.fields.items`: + + ``` + v.shape[:N] = st.shape + st[D1...DN][k] = v[D1...DN] + ``` + + Note that for scalar `StructuredTensor`s (where `N=0` and `s.shape=()`), this + simplifies to just: + + ``` + st[k] = v + ``` + +The following example shows the encoding for a scalar `StructuredTensor` with two +fields: `x` (whose values are string scalars), and `y` (whose values are 2- +dimensional `RaggedTensor`s). + +```python +>>> st_scalar = struct_constant({"x": "foo", "y": [[1, 2], [3]]}) +>>> st_scalar.shape +TensorShape([]) +>>> st_scalar._fields +{'x': , + 'y': } +``` + +The following example shows the encoding for a vector `DenseStructuredTensor` with +the same schema as `st_scalar`. + +```python +>>> st_vector = struct_constant([{"x": "foo", "y": [[1, 2], [3]]}, + {"x": "bar", "y": [[4], [5, 6]]}, + {"x": "baz", "y": [[7, 8, 9]]}]) +>>> st_vector.shape +TensorShape([Dimension(3)]) +>>> st_vector._fields +{'x': , + 'y': } +``` + +The following example shows the encoding for a 2x2 matrix `DenseStructuredTensor` with the same schema: + +```python +>>> st_matrix = struct_constant( + [[{"x": "foo", "y": [[1, 2], [3]]}, {"x": "bar", "y": [[4], [5, 6]]}], + [{"x": "baz", "y": [[7, 8, 9]] }, {"x": "raz", "y": [] }]]) +>>> st_vector.shape +TensorShape([Dimension(2), Dimension(2)]) +>>> st_vector._fields +{'x': , + 'y': } +``` + +This last example (of a 2x2 matrix `StructuredTensor`) is illustrated below: + +![st_matrix](20190910-struct-tensor/pydict_to_struct_tensor.png "st_matrix") + +The `DenseStructuredTensor` encoding effectively takes the outer dimensions of the +`StructuredTensor`, and adds them into the tensors that encodes each individual +field. This is illustrated by the following diagram: + +![Python vs StructuredTensor Encoding +](20190910-struct-tensor/python_vs_struct_tensor_encoding.png +"Python vs StructuredTensor Encoding") + +#### `RaggedStructuredTensor` + +`RaggedStructuredTensor` is used to encode ragged collections of `StructuredTensor`s. +Internally, each `RaggedStructuredTensor` consists of a values tensor (which is a +`StructuredTensor`) and one or more row-partitioning tensors. See the +[`RaggedTensor` guide](https://www.tensorflow.org/guide/ragged_tensors) for more +information about this encoding. + +### Limitations + +`StructuredTensor` requires that all structures in a given tensor have the same +schema. Thus, `StructuredTensor` cannot be used to encode arbitrary schema-free +"dictionary-like" data. A few examples of structured values that could not be +encoded with `StructuredTensor` include: + +| Structured Value | Reason Why It Can't be Encoded | +:-------------------------------------------- |:--------------------------------------- | +|`[{"a": 1}, {"a": "hello"}]` | Field `"a"` has different dtypes | +|`[{"b": [1, 2, 3]}, {"b": [[1, 2], [3, 4]]}` | Field `"b"` has different ranks | +|`[{"c": {"x": 1}}, {"c": {"y": 1}}]` | Field `"c"` has different nested fields | + +Many existing struct-like encodings have provisions for "optional" values. For +example, protobuf fields may be "`optional`" and Apache Arrow fields may be +"nullable." In this proposal, we are not proposing to add support for optional +or nullable tensors. However, if support for optional or nullable tensors is +added in the future, then `StructuredTensor` could take advantage of that to relax +the restriction that nested structures must have the same fields. 
+ +Similarly, many existing struct-like encodings have provisions for "union" +values. For example, protobuf fields may use "`anyof`" and Apache Arrow defines +a `Union` type. In this proposal, we are not proposing to add support for +union-value tensors. However, if support for union-value tensors is added in +the future, then `StructuredTensor` could take advantage of that to to relax the +restriction that fields must have the same `dtype`. + +### Integration with Other Formats + +#### Row-Based Formats + +##### Nested Python Dictionaries + +Nested Python dictionaries can be converted into `StructuredTensor`s using +`tf.struct.constant`. The schema for the `StructuredTensor` may be specified +explicitly; or if it is not specified, then it can be inferred from the value +itself: + +* Nested `list`s of scalar values are `Tensor`s or `RaggedTensor`s. (All scalar + values must be at the same nesting depth, and have compatible `dtype`.) + +* `Dict`s are converted into `StructuredTensor` scalars. + +* Nested `list`s of `dict`s are converted into multidimensional `StructuredTensor`s. + (All dictionaries must be at the same nesting depth, and must have the same + schema.) + +##### Protocol Buffers + +A protobuf definition can easily be mapped to a `StructuredTensor` schema. In +particular, each protobuf message type corresponds with a `StructuredTensor`, and +its protobuf field names are used as `StructuredTensor` field names. The shape of +each `StructuredTensor` field depends on the cardinality of the corresponding +protobuf field: + +* `required` protobuf fields are represented with 0D (scalar) values. +* `repeated` protobuf fields are represented with 1D (vector) values. +* `optional` protobuf fields are represented with 1D (vector) values, where each + value has length zero or one. + +The tensor type of each `StructuredTensor` field depends on the type of the +corresponding protobuf field: + +* Scalar fields are represented with `Tensor` values (if the underlying tensor + is 0D or 1D) or `RaggedTensor` values (if the underlying tensor is >1D). +* Message fields are represented with `StructuredTensor` values. + +A single proto is represented as a scalar (0D) `StructuredTensor`, and a batch of +protos is represented as a vector (1D) `StructuredTensor`. + +An open-source library, struct2tensor, is provided to construct a TF graph that +parses a batch of serialized protos into a `StructuredTensor`. + +##### JSON + +As long as a collection of JSON objects are subject to the same schema, they can +be represented similarly to Protocol Buffers. A JSON → `StructuredTensor` converter +implementation could choose to infer the schema (like what we do for nested +python dicts) or require a schema. + +#### Column-Based Formats + +At its core, `StructuredTensor` uses a column-based encoding for structured data. +As such, it is highly compatible with other column-based formats. In many +cases, the underlying data buffer used by `StructuredTensor` is identical to the +data buffer used by other formats, which enables zero-copy reads and writes from +these formas. + +##### Apache Arrow + +The `StructuredTensor` encoding is directly compatible with the Apache Arrow +encoding for structs. In particular, the format and contents of all data +buffers is identical, allowing zero-copy transfers between `StructuredTensor` and +Arrow. 
The following diagrams illustrate how `StructuredTensor`s relate to Arrow: + +![StrucTensor vs Apache Arrow +(Example 1)](20190910-struct-tensor/arrow_example_1.png +"StrucTensor vs Apache Arrow (Example 1)") +![alt StrucTensor vs Apache Arrow +(Example 2)](20190910-struct-tensor/arrow_example_2.png " +StrucTensor vs Apache Arrow (Example 2)") + +#### Pandas DataFrames + +Pandas `DataFrame`s can be represented using 1D `StructuredTensor`s, where each +column corresponds to a `StructuredTensor` field. For example, we can encode the +same set of 1D fields in both a DataFrame and a `StructuredTensor`: + +```python +>>> fields = {'col1': [1, 2], 'col2': [3, 4]} +>>> df = pd.DataFrame(data=fields) +>>> st = tf.`DenseStructuredTensor`(shape=[2], fields=fields) +``` + +If the default `RangeIndex` is not used for the row index, then the +`StructuredTensor` must contain an additional "_index_" field. DataFrames that use +hierarchical indexing (`MultiIndex`) can be represented as higher-dimensional +`StructuredTensor`s (for hierarchical row indices) or nested `StructuredTensor`s (for +hierarchical column indices). + + + +## Related Work +### tensorflow.io + +We plan to work with the tensorflow/io team to update their data sets to support +StructuredTensor. For example, we will extend tensorflow_io.arrow (which currently +only supports flat unstructured Arrow files), to generate `StructuredTensor`s for +structured Arrow files. + +### tf.nest + +Many TensorFlow APIs support nested structures of dictionaries, lists, and +tuples. We can use this facility to store structured data. For example, we +could encode a recipe as a nested Python structure whose leaf elements are +scalars: + +```python +recipe = { + 'title': tf.constant('Sugar cookies'), + 'ingredient': [ + {'name': tf.constant('sugar'), + 'amount': tf.constant(1), + 'unit': 'cup'}, + {'name': tf.constant('egg'), + 'amount': tf.constant(1)}, + ...], + ...} +``` + +But this encoding differs from the StructuredTensor encoding in two important ways: + +* Nested Python structures are only supported if the structure is entirely + static -- and in particular, if the *length of any nested list never changes*. + If we built a model to process recipes, then it could only process recipes + with a fixed and predetermined number of ingredients, steps, and user_ratings. + +* StructuredTensor encodes structured data using a column-based format, where all of + the values for a given nested field (such as recipe.ingredient.amount) are + stored in a single tensor. This makes it possible to process all values of a + given field in parallel, using a single op. Using his column-based format can + be critical for efficiently processing structured data. + +## Detailed Design + +### `class StructuredTensor(CompositeTensor)` + +The `StructuredTensor` base class defines the following properties and methods: + +#### Accessors + +```python +@property +StructuredTensor.shape + """The static shape of this StructuredTensor. + + The returned `TensorShape` is guaranteed to have a known rank, but the + individual dimension sizes may be unknown. + """ +``` + +```python +@property +StructuredTensor.rank + """The rank of this StructuredTensor (`ndims`). Guaranteed not to be `None`.""" +``` + +```python +StructuredTensor.field_names(self): + """Returns the string field names for this `StructuredTensor`. + + Returns: + tuple of string. + """ +``` + +```python +StructuredTensor.field_value(self, field_name): + """Returns the tensor value for the specified field. 
+ + If this `StructuredTensor` has shape `[D1...DN]`, then the returned tensor + will have shape `[D1...DN, V1...VM]`, where the slice `result[d1...dN]` + contains the value for the scalar `StructuredTensor` `self[d1...dN]`. + + If this is a scalar `StructuredTensor` (`N=0`), then the returned value is + field value for this structure. + + Returns: + `Tensor`, `StructuredTensor`, or `RaggedTensor`. + """ +``` + +#### Factories + +```python +StructuredTensor.with_updates(self, **updates): + """Returns a copy of this StructuredTensor with one or more fields modified. + + Args: + **updates: A mapping from string field names to new field values. Fields + may be added or modified. If `self.rank > 0`, then each new field + value's shape must be prefixed with `self.shape`. E.g., if `self` is a + vector of structures with `n` elements, then the update values must have + an outer dimension of size `n`, and the `i`th element will become the + field value for the `i`th struct. + + Returns: + A `StructuredTensor`. + """ +``` + +```python +StructuredTensor.without(self, *field_names): + """Returns a copy of this StructuredTensor with one or more fields removed. + + Args: + *field_names: The names of the fields to remove. + + Raises: + KeyError: If a specified field is not present in this StructuredTensor. + """ +``` + +```python +StructuredTensor.with_only(self, *field_names): + """Returns a copy of this StructuredTensor with only specified fields retained. + + Args: + *field_names: The names of the fields to keep. + + Raises: + KeyError: If a specified field is not present in this StructuredTensor. + """ +``` + +#### Conversion + +```python +StructuredTensor.to_py(self): + """Returns this StructuredTensor as a nested Python dict or list of dicts. + + Requires that `self` was constructed in eager execution mode. (In graph mode, + evaluate the `StructuredTensor` first, and then use `StructuredTensorValue.to_py()`.) + + Returns: + A nested structure of `dict` and `list`. + """ +``` + +#### Operators + +```python +StructuredTensor.__getitem__(self, key): + """Returns the specified piece of this StructuredTensor.""" +``` + +```python +StructuredTensor.__repr__(self): + """Returns a string representation for this StructuredTensor.""" +``` + +### `class DenseStructuredTensor(StructuredTensor)` + +#### Constructor + +```python +DenseStructuredTensor(self, shape, fields): + """Creates a dense `StructuredTensor`. + + Args: + shape: Static information about the shape of the StructuredTensor. Must have + a known rank (`ndims`). + fields: A dictionary mapping from string to tensor, providing the values + for individual fields in the struct. If `ndims > 0`, then every tensor + in `fields` must have the same shape in the first `ndims` dimensions; + and that shape must be compatible with `shape`. + """ +``` + +### `class RaggedStructuredTensor(StructuredTensor)` + +#### Factory Methods + +```python +@classmethod +from_row_splits(cls, values, row_splits): + """Creates a ragged `StructuredTensor` from row_splits. + + Args: + values: A StructuredTensor with shape `[nvals, ...]`. + row_splits: An integer vector with shape `[nrows+1]`. + Returns: + A `RaggedStructuredTensor` with shape `[nrows, None, ...]`. + """ +``` + +### `class DenseStructuredTensorSpec` + +The `DenseStructuredTensorSpec` class is a `tf.TypeSpec` subclass that specifies the +shape and schema for a DenseStructuredTensor. 
+ +#### Constructor + +```python +DenseStructuredTensorSpec(shape, field_specs): + """Build a type specification for a StructuredTensor. + + Args: + shape: The shape of the StructuredTensor. shape.ndims must not be None. + field_specs: A dictionary mapping from field name to TypeSpec, specifying + the tensor type used to encode each field. + + These TypeSpecs should specify the type of the entire field + (including outer dimensions which correspond to `shape`). For + example, if `shape=[2, 3]`, and field 'x' contains an int32 vector + of size `10` for each structure, then `field_specs['x']` should be + `tf.TensorSpec([2, 3, 10], tf.int32)`. + """ +``` + +### `class RaggedStructuredTensorSpec` + +The `RaggedStructuredTensorSpec` class is a `tf.TypeSpec` subclass that specifies +the shape and schema for a `RaggedStructuredTensor`. + +#### Constructor + +```python +RaggedStructuredTensorSpec(shape, values_spec, row_splits_dtype): + """Build a type specification for a RaggedStructuredTensor. + + Args: + shape: The shape of the StructuredTensor. shape.ndims must not be None. + values_spec: A StructuredTensorSpec for the values of the RaggedStructuredTensor. + row_splits_dtype: `dtype` for the `row_splits` tensor. + """ +``` + +### `class StructuredTensorValue` (TF-v1 only) + +When the `StructuredTensor` class is used in graph mode, the `StructuredTensorValue` +class can be used to store a concrete input or output value for a given +`StructuredTensor`. Its structure is analogous to that of `StructuredTensor`, except +that field values are numpy arrays (or `RaggedTensorValues` or +`StructuredTensorValues`). + + + +### Variables + +The fields in a `tf.StructuredTensor` may be `tf.Variable`s; but a +`tf.StructuredTensor` is a python-level object, and can't be stored in a +`tf.Variable` itself. 
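+
+For example (a sketch under the proposed API, reusing the
+`DenseStructuredTensor` constructor described above):
+
+```python
+# A tf.Variable may appear as a field value, e.g. a trainable embedding table:
+embeddings = tf.Variable(tf.zeros([2, 6]))
+st = DenseStructuredTensor(shape=[2], fields={'user_embedding': embeddings})
+
+# But the StructuredTensor itself is a Python-level composite object, so
+# wrapping it in a variable is not supported:
+# v = tf.Variable(st)  # not supported
+```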
+ + + diff --git a/rfcs/20190910-struct-tensor/arrow_example_1.png b/rfcs/20190910-struct-tensor/arrow_example_1.png new file mode 100644 index 000000000..a6469be75 Binary files /dev/null and b/rfcs/20190910-struct-tensor/arrow_example_1.png differ diff --git a/rfcs/20190910-struct-tensor/arrow_example_2.png b/rfcs/20190910-struct-tensor/arrow_example_2.png new file mode 100644 index 000000000..ef86f0e3c Binary files /dev/null and b/rfcs/20190910-struct-tensor/arrow_example_2.png differ diff --git a/rfcs/20190910-struct-tensor/feature_extraction_in_model.png b/rfcs/20190910-struct-tensor/feature_extraction_in_model.png new file mode 100644 index 000000000..464b00446 Binary files /dev/null and b/rfcs/20190910-struct-tensor/feature_extraction_in_model.png differ diff --git a/rfcs/20190910-struct-tensor/integration_with_other_formats.png b/rfcs/20190910-struct-tensor/integration_with_other_formats.png new file mode 100644 index 000000000..78c160d22 Binary files /dev/null and b/rfcs/20190910-struct-tensor/integration_with_other_formats.png differ diff --git a/rfcs/20190910-struct-tensor/pydict_to_struct_tensor.png b/rfcs/20190910-struct-tensor/pydict_to_struct_tensor.png new file mode 100644 index 000000000..f17e9e3ac Binary files /dev/null and b/rfcs/20190910-struct-tensor/pydict_to_struct_tensor.png differ diff --git a/rfcs/20190910-struct-tensor/python_vs_struct_tensor_encoding.png b/rfcs/20190910-struct-tensor/python_vs_struct_tensor_encoding.png new file mode 100644 index 000000000..6ce7550ac Binary files /dev/null and b/rfcs/20190910-struct-tensor/python_vs_struct_tensor_encoding.png differ diff --git a/rfcs/20190910-struct-tensor/recipe_schema.png b/rfcs/20190910-struct-tensor/recipe_schema.png new file mode 100644 index 000000000..b057d02a8 Binary files /dev/null and b/rfcs/20190910-struct-tensor/recipe_schema.png differ diff --git a/rfcs/20190910-struct-tensor/struct_5.png b/rfcs/20190910-struct-tensor/struct_5.png new file mode 100644 index 000000000..c3855e0f7 Binary files /dev/null and b/rfcs/20190910-struct-tensor/struct_5.png differ diff --git a/rfcs/20191016-dlpack-support.md b/rfcs/20191016-dlpack-support.md new file mode 100644 index 000000000..0fa43e6b3 --- /dev/null +++ b/rfcs/20191016-dlpack-support.md @@ -0,0 +1,123 @@ +# dlpack support for interoperability with other GPU frameworks + +| Status | Accepted | +:-------------- |:---------------------------------------------------- | +| **RFC #** | 180 (https://github.com/tensorflow/community/pull/180) (update when you have community PR #)| +| **Author(s)** | eoldridge@nvidia.com, wmjlyjemaine@gmail.com, zhoujinjing09@gmail.com | +| **Sponsor** | apassos@google.com, sanjoy@google.com | +| **Updated** | 2019-11-26 | + +## Objective + +This document proposes the adoption of dlpack (https://github.com/dmlc/dlpack) as way of passing tensor data to other frameworks without leaving the GPU and without a copy per [24453](https://github.com/tensorflow/tensorflow/issues/24453). dlpack is a community effort to define a common tensor data structure that can be shared by different frameworks. dlpack is currently supported by cuPy, cuDF, DGL, TGL, PyTorch, and MxNet. + +The interoperability of dlpack would allow for fast on-GPU communication between TensorFlow and these frameworks opening up a wide range of use cases outlined below. 
It would further enable \_\_cuda_array_interface\_\_ interoperability through cuPy/cuDF which support both methods providing a way to transfer data to Numba, PyArrow and other frameworks that have adopted that method, although [a similar request has been made to support that method of interoperability](https://github.com/tensorflow/tensorflow/issues/29039) and ideally both would be supported. + +A solution has already been developed by @VoVAllen and @jermainewang (coauthored above) as an external python package. This RFC would see the concepts from the package integrated into Tensorflow Core, and reviewed and enhanced by the TF team so that dlpack support is native. + +## Motivation + +DLPack is a community effort to define a common tensor data structure that can be shared by different frameworks allowing data to be quickly shared often with zero or minimal copy. One of the main bottlenecks when trying to achieve GPU performance when operating across different frameworks is I/O and data formatting. The transfer of data between GPU and CPU or between formats is costly to the point where many operations become faster to simply run on the CPU because of the additional costs associated with moving/transforming the data. Even when mechanisms exist to copy data without leaving the GPU, memory constraints limit the application because two copies of the data are required. By implementing dlpack within TensorFlow there would be a way to transfer data directly between frameworks, enabling the development of a range of applications that weren't previously possible. + +Existing applications that take advantage of dlpack include: + - Inline on-gpu preprocessing of tabular data using cuDF to prepare it for deep learning models (continuous normalization, categorical encoding, etc) improving preprocessing performance by 10x over pandas and CPU + - Larger than cpu memory dataloader that iterates over parquet files and batch loads tensors, providing a significant speedup over traditional dataloaders for tabular data + - [End to end acceleration of training on GPU](https://medium.com/rapids-ai/accelerating-deep-learning-recommender-systems-by-15x-using-rapids-fastai-and-pytorch-b50b4d8568d1); + - Use of Tensorflow in conjunction with [tvm](https://github.com/dmlc/tvm); [TF custom op implementation of TVM](https://github.com/tobegit3hub/tftvm) + - Use of Tensorflow in conjunction with [dgl](https://github.com/dmlc/dgl) + - Zero copy transfer of data in [DALI](https://github.com/NVIDIA/DALI) reducing memory requirements. + - [thinc.ai](https://thinc.ai/docs/usage-frameworks) framework interoperability. + +Beyond the benefit of specific applications, Tensorflow's adoption of dlpack would further incentivize other frameworks considering its adoption as all three major DL frameworks would now be supporting it. Finally, it would also make the development of applications that operate upstream and downstream of deep learning frameworks easier to develop as a single framework agnostic method could be used in conjunction all DL frameworks. + +## User Benefit + +Users who wish to utilize other GPU accelerated frameworks like cuDF, cuPy, etc would be able to do so without expensive copy operations. By doing direct dataloading, feature engineering and preprocessing on GPU we see 10-15x speedups over traditional workflows involving CPUs to prepare the data for model readiness in other frameworks and they would be immediately available in tensorflow. 
+ +More generally, users would be able to develop preprocessing or other GPU based functionality and be able to support integration with all dl frameworks simplifying development efforts when creating solutions that are upstream or downstream from deep learning models. + +A blog post or release notes headline could read "Tensorflow now supports dlpack enabling interoperability with other GPU powered frameworks like cuPy, cuDF, DGL, TGL, PyTorch, and MxNet." + +## Design Proposal + +A working version of dlpack integration has been released as a package by coauthors @jermainewang and @VoVAllen here: +https://github.com/VoVAllen/tf-dlpack/issues/3 + +This proposal would leverage that solution and integrate it into TF so that the operations could be performed natively. + +User experience +We plan to release a python package tfdlpack, containing two APIs: +``` +to_dlpack: Given a tensorflow tensor, return a DLPack tensor contain. +from_dlpack: Given a DLPack-compatible python capsule, return a tensorflow tensor. +``` + +Example code of converting a Tensorflow tensor to Torch tensor using DLPack using the package: +```python +import numpy as np +import tensorflow as tf +import torch.utils.dlpack as thdlpack +import tfdlpack + +t1 = tf.constant([1, 2, 3], dtype=np.float32) +dlpack = tfdlpack.to_dlpack(t1) # tf tensor -> dlpack +t2 = thdlpack.from_dlpack(dlpack) # dlpack -> th tensor +print(t2) +dlpack = thdlpack.to_dlpack(t2) # th tensor -> dlpack +t3 = tfdlpack.from_dlpack(dlpack) # dlpack -> tf tensor +print(t3) +``` +You will find that t1, t2 and t3 all have the same values, shape, and device contexts. +Package dependency: tensorflow>=2.0 + +Proposed code of converting a Tensorflow tensor to Torch tensor using DLPack natively: +```python +import numpy as np +import tensorflow as tf +import tensorflow.experimental.dlpack as tfdlpack +import torch.utils.dlpack as thdlpack + + +t1 = tf.constant([1, 2, 3], dtype=np.float32) +dlpack = tfdlpack.to_dlpack(t1) # tf tensor -> dlpack +t2 = thdlpack.from_dlpack(dlpack) # dlpack -> th tensor +print(t2) +dlpack = thdlpack.to_dlpack(t2) # th tensor -> dlpack +t3 = tfdlpack.from_dlpack(dlpack) # dlpack -> tf tensor +print(t3) +``` + +Potential technical problems for this API: +1. Memory usability on async device (to_dlpack) +As mentioned by @alextp +> TF does not use cudamalloc to allocate memory but its own allocator whose internal state is stored on the CPU and matches the head of TF's compute stream, so we need to sync TF's stream before the memory is usable from dlpack and similarly sync other cuda streams before memory is made usable by TF tensors (and similarly we need to sync the streams when trying to free the buffers). +Here we decide to manunally sync the device when exporting TF tensor to dlpack. The sync behavior is done in the `TFE_TensorHandleDevicePointer` API, which returns the pointer to the underlying memory. + +2. Memory management (avoid leak) (to_dlpack/from_dlpack) +As the design of dlpack, the framework constructing tensor from dlpack is responsible to call the dlpack's deleter, which is usually dereferencing the underlying buffer, when destructing the constructed tensor. +For `from_dlpack`, a deleter function is registered when constructing the TF tensor, and would be called upon destruction. +For `to_dlpack`, the dlpack data structure will hold a reference (by `TensorReference`) to the underlying buffer, and `unref` it in the dlpack's deleter function. 
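+The ownership contract above can be summarized with a small framework-neutral
+sketch (the class names below are made up for illustration only and are not
+part of the proposed implementation): the exporter takes an extra reference on
+the underlying buffer when producing the dlpack structure, and the consumer is
+responsible for invoking the deleter, which drops that reference.
+
+```python
+class FakeBuffer:
+  """Stand-in for a TensorBuffer with manual reference counting."""
+
+  def __init__(self):
+    self.refcount = 1           # the exporting framework's own reference
+
+  def unref(self):
+    self.refcount -= 1
+
+
+class FakeDLPackTensor:
+  """Stand-in for the DLManagedTensor produced by to_dlpack."""
+
+  def __init__(self, buffer):
+    self.buffer = buffer
+    buffer.refcount += 1        # to_dlpack: hold a reference (TensorReference)
+
+  def deleter(self):
+    self.buffer.unref()         # called by the consuming framework when done
+
+
+buf = FakeBuffer()
+dlpack = FakeDLPackTensor(buf)  # to_dlpack
+# ... the consumer wraps the same memory in its own tensor type ...
+dlpack.deleter()                # the consumer's tensor is destroyed
+assert buf.refcount == 1        # only the exporter's reference remains
+```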
+ +Proposed API implementation details: +- to_dlpack + - Implementing `TFE_HandleToDLPack`, which converts tf's eager tensor handle to dlpack tensor's pointer(`DLManagedTensor*`). And wrap it into PyCapsule to adapt to the Python interface in ffi binding file. For the underlying memory liveness, `TensorReference` is used to maintain the reference counting over the underlying `TensorBuffer`, which increases when creating dlpack tensor, and decreases in the deleter of dlpack tensor. +- from_dlpack + - Implementing `TFE_HandleFromDLPack`, which converts dlpack tensor's pointer(`DLManagedTensor*`) to tf's eager tensor handle. `TFE_TensorHandleDevicePointer` is used to get the data pointer of underlying buffer, and synchronize the related device to ensures the memory readiness. + + +## Questions and Discussion Topics + +https://github.com/tensorflow/tensorflow/issues/29039#issuecomment-527520270 outlines the key issues that need to be addressed, namely that a synch is required to ensure the tensor information is valid. Supporting [\_\_cuda_array_interface\_\_](https://github.com/tensorflow/tensorflow/issues/29039) is another option as well, although cuPy and cuDF have opted to support both and ideally Tensorflow would as well. + +## Reference + +### tfdlpack package implementation detail + +The first design consideration is that we want to avoid any modification to the main Tensorflow library, so to get around the potential long delay of PR, code review, and release cycle of Tensorflow main package. Inspired by the solution from https://github.com/tobegit3hub/tftvm, we decide to implement the functionality as two custom tensor ops: to_dlpack and from_dlpack. + +Besides, we want this feature to be plugged into other projects quite easily. For example, any project that relies on this feature is able to run without compiling against Tensorflow's header files. Not only that an extra dependency usually means extra effort, but also that such maintenance is repetitive and should be handled by the feature developer (i.e., us) alone. To this end, we have an idea of releasing it as a python package. However, the question is how to invoke the two custom tensor ops in python? The challenge is that Tensorflow's custom op interface has a limited support of argument and return types, while to_dlpack and from_dlpack should have an argument/return type of DLPack object. We work around this by encoding the address of an DLPack object as an integer, so it can be accepted/returned by the custom op interface. Then, we decode it in python or C depending on whether we return it (to_dlpack) or consume it (from_dlpack). + +Finally, to achieve the maximal efficiency, we want the conversion happens without memory copy. + +For to_dlpack, the returned DLPack tensor shares the same memory address of the input Tensorflow tensor and holds a reference to it. Upon the destruction of the DLPack tensor, it will dereference the Tensorflow tensor, so it can be collected by Tensorflow's memory management. (inspired by PyTorch's DLPack implementation). +For from_dlpack, it first creates an allocator object (subclass Tensorflow's allocator interface) that holds the reference to the DLPack tensor. The AllocateRaw function directly returns the memory it holds without creating any new buffer. Upon destruction, the DeallocateRaw function just calls the deletor of the DLPack tensor. (inspired by Tensorflow's immutable_constant_op). 
diff --git a/rfcs/20191017-tfx-standardized-inputs.md b/rfcs/20191017-tfx-standardized-inputs.md new file mode 100644 index 000000000..716e6fb60 --- /dev/null +++ b/rfcs/20191017-tfx-standardized-inputs.md @@ -0,0 +1,753 @@ + + +# Standardized TFX Inputs + +Status | Accepted +:------------ | :------------------------------------------------------------ +**RFC #** | [162](https://github.com/tensorflow/community/pull/162) +**Author(s)** | Zhuo Peng (zhuo@google.com), Kester Tong (kestert@google.com) +**Sponsor** | Konstantinos Katsiapis (katsiapis@google.com) +**Updated** | 2019-10-03 + +# Objective + +* To define a common in-memory data representation that: + * is powerful enough to encode the following logical training data format: + flat + ([`tf.Example`](https://github.com/tensorflow/tensorflow/blob/abfba15cd9734cec7ecd3d0661b146fc251c842d/tensorflow/core/example/example.proto#L88)), + sequence + ([`tf.SequenceExample`](https://github.com/tensorflow/tensorflow/blob/abfba15cd9734cec7ecd3d0661b146fc251c842d/tensorflow/core/example/example.proto#L298)) + or structured data (e.g. + [Protocol Buffers](https://developers.google.com/protocol-buffers) or + [Apache Avro](https://avro.apache.org/)). + * all TFX components can understand and can support their own unique use + cases with. +* To define an I/O abstraction layer that produces the above in-memory + representation from supported physical storage formats, while hiding TFX’s + choice of such storage formats from TFX users. +* To define a bridge from the above in-memory representation to TF feedables + (i.e. Tensors and certain + [CompositeTensors](https://github.com/tensorflow/tensorflow/blob/abfba15cd9734cec7ecd3d0661b146fc251c842d/tensorflow/python/framework/composite_tensor.py#L1)). + +# Motivation + +## Interoperability of TFX libraries +TFX offers a portfolio of libraries, including +[TFT](https://github.com/tensorflow/transform), +[TFDV](https://github.com/tensorflow/data-validation) and +[TFMA](https://github.com/tensorflow/model-analysis). These libraries can be +used in a standalone manner (i.e. a user can run TFDV without/outside TFX) +while on the other hand, one TFX component (a program orchestrated by TFX) may +use multiple TFX libraries under the hood. For example, there could be a +TFX component that transforms the training data by invoking TFT and +collects pre-transform and post-transform statistics of the data by invoking +TFDV. + +Currently, each TFX library accepts different in-memory data representation: + +| | TFDV | TFT | TFMA | BulkInference | +|:---| :--- | :--- | :--- | :------------ | +| In-memory data representation | Arrow RecordBatches | Dict[str, np.ndarray] | str (raw data records), Dict[str, np.ndarray] | str (raw data records) | +| Understand the data and conduct analysis | input data is encoded losslessly as RecordBatches. | the in-mem representation may be lossy. | Relies on the model’s input layer, and the format is Dict[str, np.ndarray]. | N/A | +| Feed TF | N/A | the in-mem representation is TF feedable. | Feed “raw data” to the model. | Feed “raw data” to the model | + +When a TFX component needs to invoke multiple TFX libraries, it may need to +decode the data into one of the libraries’ in-memory representations, +and translate that into the other library’s. For example, in the “Transform” +component mentioned above: + +![alt_text](20191017-tfx-standardized-inputs/double-translation.png) + +Note that TFDV is invoked twice, with different data. Thus the translation needs +to happen twice. 
+ +This has created several issues: + +* The translation is computationally expensive. In a real world set-up, such + translation could take as many CPU cycles as the core TFT and TFDV logic + takes in total. +* More such translation logic may need to be implemented to support expanding + TFX use cases -- imagine that TFMA needs to invoke TFDV to compute + statistics over slices of data identified by model evaluation. +* The complexity of adding new logical data representations scales with the + number of components. For example, to add SequenceExample support in TFX, + one may need to come up with an in-memory representation for each of the + components, to keep the consistency within the component. +* TFX library users whose data format is not supported natively by TFX would + have to implement the decoding logic for each of the libraries + they want to use. For example, had TFX not supported the CSV format, a user + would have to implement + [one CSV decoder for TFT](https://github.com/tensorflow/transform/blob/master/tensorflow_transform/coders/csv_coder.py), + and [another CSV decoder for TFDV](https://github.com/tensorflow/data-validation/blob/master/tensorflow_data_validation/coders/csv_decoder.py) + +A common in-memory data representation would address the issues. + +## The need for supporting new physical storage formats in TFX + +Currently TFX (mostly) assumes tf.Example on TFRecord. However because TFX is a +managed environment, it is desired that its choice of physical format of data is +an implementation detail and is opaque to the components and the users. TFX +would like to explore switching to a columnar physical format like Apache +Parquet and it can be imagined that there will be a migration at some point. +Such a migration must happen in either of the following two ways: + +* Change every single TFX library and component that needs to read data to + add support for reading the new physical format (into each library’s own + in-memory representation) +* Rely on an indirection through tf.Example and give up some performance + because of the translation. + +Beyond easier migrations (which could arguably be one-time efforts), a good I/O +abstraction would allow TFX to choose the optimal storage format based on user’s +workload, in a user-transparent manner. + +# User Benefit + +## TFX End Users + +While this change is transparent to end users, it will facilitate the design and +implementation of many user-facing features, for example: + +* Columnar storage format in TFX. +* Structured training examples. + +## Individual TFX component users + +We use TFXIO to refer to the proposed I/O abstraction layer. All TFX components +will start using TFXIO to ingest the data and have a unified way of representing +the data. Individual TFX component users would be able to implement TFXIO for +their own data formats / storage formats that are not supported by TFX. By +design, any such implementation will be readily accessible by all TFX +components. + +## TFX developers + +Developers working on TFX infrastructure will not have to understand the +internals of each component any more in order to make changes to I/O and parsing +(for example, adding support for a new storage format for the training +examples). + +Developers working on TFX components would benefit from sharing common +operations against the unified in-memory representation, or even higher-level +computations. 
For instance, suppose that we implement a sketch-based algorithm +to compute approximate heavy hitters over this in-memory representation. We can +now share this implementation inside both TFDV and TFT for their top-K feature +value computation. + +# Requirements + +The in-memory representation should: + +* Be columnar. + + A columnar in-memory representation works better than a row based on under + typical workload of TFX libraries: + + Compute statistics over a column. + + Feed a batch of rows to TensorFlow. + +* Be able to losslessly encode the logical data format. Specifically, it + should be able to distinguish a null (unpopulated) value from an empty list. + + TFDV produces distinct statistics for nulls and empty lists. + +* Provide for efficient integration with TF computation. Ideally, using the + in-memory representation with TensorFlow should not require a data copy. + + Feeding TF is a significant TFX workload, both in terms of CPU cycles and + number of examples processed. + +* Have efficient Python APIs to slice, filter and concatenate. Or more + generally, have efficient Python APIs for data analysis type of workload. + + For example, TFMA may group the examples by “geo_location” and “language” + and evaluate the model for each of the slices. This operation would require + efficient slicing a batch of examples, and concatenation of slices that + belong to the same slice key (because the slices could be very small, and + inefficient for feeding TF). TFDV has similar use cases where statistics of + a certain slice of data needs to be collected. + + This type of workload is also significant in TFX, both in terms of CPU + cycles and number of examples processed. + + Note that TFX libraries don't always need to run TF graphs. For example, + TFDV, despite of its name, only analyzes the training data and (almost) does + not call any TF API. Another example, TFMA, will support "blackbox" + evaluation where the model being evaluated does not have to be a TF model. + Therefore a TF-neutral in-memory representation that works well with plain + Python code is desirable. + +* Be interoperable with the rest of the world. + + The OSS world should be able to use TFX components with little effort on + data conversion. This aligns with TFX’s long term vision. + + +# Design Proposal + +This design proposes **a common in-memory data representation**, **a way to +translate that into TF feedables** (np.ndarray or EagerTensors) and **a set of +APIs** each component can use to get both. + +![alt_text](20191017-tfx-standardized-inputs/overview.png) + +## Common in-memory data representation + +[Apache Arrow](https://arrow.apache.org/) will be used as the common in-memory +data representation. Beam-based TFX components will accept +PCollection[pyarrow.[RecordBatch](https://arrow.apache.org/docs/python/data.html#record-batches)]. + +Each logical data format will have its own encoding convention, +[discussed](#logical-data-encoding-in-arrow) in the detailed design. + +We chose Apache Arrow because: + +* It’s Expressive enough. + * Lossless encoding of (conformant) tf.Example, tf.SequenceExample + * Can encode structured data (proto) +* It’s a columnar format. It works well with common TFX workloads: + * Column (feature)-wise analysis + * Feed a batch of columns (features) to TensorFlow. +* It’s OSS friendly. + * Community support for more storage format I/O (e.g. Apache Parquet) + * Friendly to other OSS data formats, both in-memory and on disk (e.g. 
+ Pandas) + * Friendly to numpy / TF: many Arrow array types share the same memory + layout with numpy ndarrays and certain type of TF (composite) Tensors. +* TF neutral. + * Leaves the possibility of supporting other ML libraries open. + +## Translation from Arrow to TF feedables + +The analogy to this is parsing tf.Examples into TF feedables -- extra +information is needed in this translation because a +[`Feature`](https://github.com/tensorflow/tensorflow/blob/abfba15cd9734cec7ecd3d0661b146fc251c842d/tensorflow/core/example/feature.proto#L76) +can be converted to a Tensor, a SparseTensor or a +[RaggedTensor](https://www.tensorflow.org/guide/ragged_tensor) depending on the +[feature specs](https://github.com/tensorflow/tensorflow/blob/635e23a774936b5fe6fa3ef3cb6e54b55d93f324/tensorflow/python/ops/parsing_ops.py#L46-L49). +Currently this extra information is implicitly contained in the pipeline schema +(an instance of the +[TFMD Schema](https://github.com/tensorflow/metadata/blob/master/tensorflow_metadata/proto/v0/schema.proto)) +proto. + +Similarly, an Arrow column can be translated to various TF feedables. +[An extension to the pipeline schema](#tensorrepresentation) is proposed to for +a user to express the intention for conversion. + +The conversion can be efficient (zero-copy) in certain cases. It is +[discussed](#efficient-arrow-tensor-conversion) in the detailed design. + +## Standardized Inputs APIs + +We propose a set of APIs that TFX components will call, and need to be +implemented for each of the supported combination of {physical, logical} format. + +```py +class TFXIO(object): + """Abstract basic class of all Standardized TFX inputs API implementations.""" + def __init__( + self, + schema: Optional[tfmd.Schema]=None + ): + pass + + @abc.abstractmethod + def BeamSource(self, + projections: Optional[List[Text]]=None + ) -> beam.PTransform: + """Returns a beam PTransform that produces PCollection[pa.RecordBatch]. + + May NOT raise an error if the TFMD schema was not provided at construction time. + + Args: + projections: if not None, only the specified subset of columns will be + read. + """ + + @abc.abstractmethod + def TensorAdapter(self) -> TensorAdapter: + """Returns a TensorAdapter that converts pa.RecordBatch to TF inputs. + + May raise an error if the TFMD schema was not provided at construction time. + """ + + @abc.abstractmethod + def ArrowSchema(self) -> pyarrow.Schema: + """Returns the schema of the Arrow RecordBatch generated by BeamSource(). + + May raise an error if the TFMD schema was not provided at construction time. + """ + + @abc.abstractmethod + def TFDataset(self, ...) -> tf.data.Dataset: + """Returns a Dataset of TF inputs. + + May raise an error if the TFMD schema was not provided at construction time. + """ +``` + +Where `TensorAdapter` is: + +```py +class TensorAdapter(object): + + def __init__( + self, + arrow_schema: pyarrow.Schema, + tensor_representations: Dict[str, TensorRepresentation]): + """Initializer. + + Args: + arrow_schema: the schema of the RecordBatches this adapter is to receive + in ToBatchTensors(). + tensor_representations: keys are the names of the output tensors; values + describe how an output tensor should be derived from a RecordBatch. + """ + + + def TypeSpecs(self) -> Dict[str, tf.TypeSpec]: + """Returns tf.TypeSpec for each tensor in `tensor_representation`. + + TypeSpecs can be used to construct placeholders or tf.function signatures. 
+ """ + + def ToBatchTensors( + self, record_batch: pyarrow.RecordBatch, + projections: Optional[List[TensorName]]=None + ) -> Dict[str, TFFeedable]: # TFFeedable: np.ndarrays or tf.EagerTensor + # (or compositions of them, i.e. + # CompositeTensors). + """Converts a RecordBatch to batched TFFeedables per `tensor_representation` + + Each will conform to the corresponding TypeSpec / TensorRepresentation. + + Args: + projections: a set of names of TFFeedables (mentioned in + `tensor_representation`). If not None, only TFFeedables of those names + will be converted. + """ +``` + +Note that we will provide a default implementation of `TensorAdapter`, but TFXIO +implementations can implement their own `TensorAdapter`. A custom +`TensorAdapter` would allow a `TFXIO` implmentation to rely on a TF graph to +do parsing -- the same graph can be used in both `BeamSource` and +`TensorAdapter`. + +The default `TensorAdapter` can be constructed out of the Arrow schema (which +is required for any TFXIO implementation) and `TensorRepresentations`. The +latter is part of the TFMD schema. See [this section](#tensorrepresentation) +for details. + +# Detailed Design + +## Logical data encoding in Arrow + +On a high level, a batch of logical entities (“examples”) is encoded into a +[`pyarrow.RecordBatch`](https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatch.html#pyarrow.RecordBatch). +Features or fields (from structured records) are encoded as columns in the +RecordBatch. + +Note that +[`pyarrow.Table`](https://arrow.apache.org/docs/python/data.html#tables) offers +an abstraction similar to RecordBatch with the key difference being that a +column in a Table might contain multiple chunks of contiguous memory regions +while a column in a RecordBatch contains only one chunk. RecordBatch is chosen +because we want to enforce that TFXIO implementations produce batched data in +the most efficient way (one chunk per batch). Users of TFXIO may construct a +Table from one or more RecordBatches since easy conversion from one to the other +is supported by Apache Arrow. + +This design aims to support the logical structure of tf.Example, +tf.SequenceExample or structured data like Protocol Buffers. Thus only a subset +of Arrow array types are needed. All TFX components will guarantee to understand +those types, but no more. Below is a summary of supported encodings: + +| Logical representation | Arrow encoding | +| :--------------------- | :------------- | +| Feature with no value | `NullArray` | +| Univalent feature (one value per example) | `FixedSizeListArray` (list_size = 1) | +| Multivalent feature (multiple values per example) | `[FixedSize]ListArray` | +| Sequence feature (list of lists of values per example) | `[FixedSize]ListArray<[FixedSize]ListArray>` | +| Proto-like structured data | `ListArray}>>` | + +However the design is flexible to support more complicated logical structures, +for example, k-nested sequences (tf.SequenceExample is 2-nested). + +Next we show that these encodings cover the logical data formats we aim to +support: + +### tf.Example + +[Conformant](https://github.com/tensorflow/tensorflow/blob/abfba15cd9734cec7ecd3d0661b146fc251c842d/tensorflow/core/example/example.proto#L78) +tf.Examples are assumed. I/O + parsing should throw an error upon non-conformant +instances. 
+ +A key requirement derived from the conformant-ness is for the encoding to be +able to distinguish the following two cases: + +* a feature is present, but it’s value list is empty + + ``` + { + features { + "my_feature": { + bytes_list { + } + } + } + ``` + +* a feature is not present + + ``` + { + features { + } + } + ``` + + or + + ``` + { + features { + "my_feature": {} # none of the oneof is set + } + } + ``` + +Each feature can be encoded as: + +``` +[FixedSize]ListArray +``` + +Then, the feature value in case a) is encoded as an empty sub-list, while the +feature value in case b) is encoded as null. + +If we know that all the lists in a `ListArray` are of equal length (from the +schema of the data, see below sections), `FixedSizeListArray` can be used to +obviate the `O(N)` space overhead for lengths of lists. + +### tf.SequenceExample + +[Conformant](https://github.com/tensorflow/tensorflow/blob/abfba15cd9734cec7ecd3d0661b146fc251c842d/tensorflow/core/example/example.proto#L184) +tf.SequenceExamples are assumed. I/O + parsing should throw an error upon +non-conformant instances. + +A context feature will be encoded similarly to a feature in tf.Example. A +sequence feature will be encoded as: + +``` +[FixedSize]ListArray<[FixedSize]ListArray> +``` + +To avoid name conflicts with context features, all the sequence features can be +grouped into one `StructArray`: + +``` +StructArray<{'sequence_feature1': ListArray>, ...}> +``` + +### Structured data (e.g. Protocol Buffers / Apache Avro) + +A batch of structured records can be encoded as follows: + +* Each direct leaf field of the structure can be encoded similarly to + tf.Example. (`ListArray` of primitive types). +* Each sub-message can be encoded as: + + ``` + ListArray>> + ``` + +## Arrow to TF Feedable conversion + +### TensorRepresentation + +One or more Arrow columns can potentially be converted to multiple types of TF +feedables. + +For example, a `ListArray` can be converted to: + +* a Tensor, if given a default value to pad +* a SparseTensor to represent a ragged array +* a RaggedTensor + +The choice depends on user’s intents, which currently is +[implicitly](https://github.com/tensorflow/transform/blob/11afcff467f779ba6163686395582e69603987d1/tensorflow_transform/tf_metadata/schema_utils.py#L172) +expressed in the pipeline schema. + +We propose to create a new [TFMD](https://github.com/tensorflow/metadata) +(TensorFlow MetaData) Proto, `TensorRepresentation` to carry those intents implicitly: + +```protobuf +message TensorRepresentation { + oneof { + DenseTensor { … } // column_name, dtype, shape, default_value + VarLenSparseTensor { … } // column_name, dtype + SparseTensor { } // dtype, value_column_name, indice_column_names + VarLenRaggedTensor { … } // dtype + RaggedTensor { } // dtype, value_column_name, row_partition_column_names, ... + StructuredTensor { } // column_names + } +} +``` + +This proto is used in two places: + +* It’s part of TFMD schema: + + ```protobuf + message TensorRepresentationGroup { + map tensor_representation = 2; + }; + + message Schema { + repeated Feature feature = 1; + // … + map tensor_representation_group = 42; + } + ``` + + Note : + + * `TensorRepresentationGroup` allows different instances of one TFX + component to use different sets of `TensorRepresentation`s. + * `tensor_representation_group` is **optional**. If the user does not + specify any, a default representation will be derived from + schema.feature to keep backwards compatibility. 
+ * this field is **not** a sub-message of Schema::Feature, because a TF + feedable may comprise multiple columns + + Being part of the schema makes it possible to serialize and materialize the + intents for other components to use, which allows TFT’s materialization + functionality to have its own TFXIO implementation that hides the + data/physical format from the user. + + When generating the initial schema from the statistics of the data, TFDV can + propose a default set of `TensorRepresentationGroup`. The user may revise + the proposal and TFDV can validate `TensorRepresentationGroup`s in a + continuous manner. + +* The default implementation of TensorAdapter takes an optional `Dict[str, + TensorRepresentation]` at construction time. If a TFXIO implementation + choose to use the default TensorAdapter, it needs to provide them (may come + directly from the Schema). + +### Efficient Arrow->Tensor conversion + +The key to efficient conversions is to avoid copying of data. The prerequisites +to do so are: + +* Same memory alignment +* Same memory layout + +Currently 64-byte alignment is the standard in both Tensorflow's `TensorBuffer` +and Apache Arrow's `Buffer`. Forthermore, it can be guaranteed by implementing +our own version of `arrow::MemoryPool` that is backed by a +`tensorflow::Allocator`. + +The memory layout will be the same if right types are chosen at both ends thus +zero-copy conversion can be done, for example: + +* `FixedLengthListArray` (or `ListArray` of equal-length lists) -> dense + Tensors. +* `ListArray>` -> + [RaggedTensors](https://github.com/tensorflow/tensorflow/blob/3c2dabf53dd085c21e38a28b467e52c566c0dfaf/tensorflow/python/ops/ragged/ragged_tensor.py#L1). +* `ListArray>` -> + [StructuredTensors](https://github.com/tensorflow/community/blob/master/rfcs/20190910-struct-tensor.md) + +In other cases, copies can be avoided for the values, but some computation is +needed: + +* `ListArray>` -> `tf.SparseTensor` + * Need to compute the sparse indices from `ListArray`'s list offsets. + +The remaining cases require a copy: + +* `ListArray>`(of non-equal-length lists) -> dense Tensors + +With TensorRepresentation available in the Schema, a TFXIO implementation may +optimize its decoder to choose the most efficient Arrow type. + +#### Conversion of string features + +Arrow’s string arrays (`BinaryArray`) have a different memory layout than +TensorFlow’s string Tensors, even with +[`tensorflow::tstring`](https://github.com/tensorflow/community/blob/master/rfcs/20190411-string-unification.md). +There is always some overhead in conversion, but with `tensorflow::tstring` a +Tensor of `string_view`s is possible, thus the overhead will be a function of +the number of strings being converted, instead of the lengths of the strings. + +#### TF APIs for conversions + +In TF 1.x we will use np.ndarray as a bridge as Arrow has zero-copy conversion +to numpy’s ndarrays. (not for string arrays). + +Starting from TF 2.x, we will be able to create EagerTensors from Python +memoryview(s) so that strings can be covered. + +## TFMD Schema + +[The TFMD Schema](https://github.com/tensorflow/metadata/blob/master/tensorflow_metadata/proto/v0/schema.proto) +is a pipeline-level artifact and in the scope of this proposal, it may serve two +purposes: + +* To provide optional inputs to the parsing logic for optimizations. +* To carry user’s intents of converting data to TF feedables. 
+ +The two purposes don’t have to be served in the following cases: + +* TFDV should not require a schema to work and it does not need TF feedables. +* Some TFXIO implementation may not need the schema for either purposes. + +Therefore the TFMD schema is optional, and a TFXIO implementation: + +* should guarantee that the `BeamSource()`can return a valid + `PCollection[RecordBatch]` without a schema. + * Other interfaces may raise an error when a schema was not provided. +* does not have to require a TFMD schema for all its interfaces to work. + +## (TensorFlow) Trainer integration + +For TFX to freely choose the storage format for training examples for a user, we +**cannot** expose file-based or record-based interface to that user in the TF +trainer, because: + +* the user might not know how to open those files. +* there might not be an efficient representation of a “record” (this is true + for columnar storage formats like Apache Parquet) but only an efficient + representation of a batch of records. + +Thus we propose that to most users, the TF Trainer only exposes a handle to a +`tf.data.Dataset` of parsed (composite) Tensors. + +Each `TFXIO` implementation will implement a `TFDataset()` interface to return +such a `tf.data.Dataset`. This dataset contains logically a set of batched +(composite) Tensors that are of the same type as the corresponding +`TensorAdapter()` would return for a `RecordBatch`. See +[this section](#recommended-way-of-implementing-a-tfxio) about how to minimize +the code needs to be written for a new `TFXIO` implementation. + +The `TFDataset()` interface will accept common knobs that a user may need to +tweak: + +* Batch size +* Random shuffle + +## Code organization and OSS + +### `tfx_bsl` package + +TFXIO will be used by all TFX components as well as the TFX framework, making it +be almost at the bottom of the dependency chain. Moreover, a lot of +implementations details will be in C++, with python wrapping around, and we want +to make sure our TFX components pip packages remain pure Python for easy +maintenance. Therefore we propose a new python package +[tfx_bsl](https://github.com/tensorflow/tfx-bsl) (TFX Shared Basic Libraries) to +contain the implementations of `TFXIO` and other libraries shared across TFX +components. + +![alt_text](20191017-tfx-standardized-inputs/oss_lib_org.png) + +## Recommended way of implementing a TFXIO + +To maximize code sharing, the following way of implementing a `TFXIO` is +suggested: + +![alt_text](20191017-tfx-standardized-inputs/impl_tfxio.png) + +One would only need to implement the IO+Parsing-to-arrow in C++ once, and reuse +it in the BeamSource() and a format-specific Dataset Op that produces a +DT_VARIANT tensor that points to the parsed Arrow RecordBatch. Then we provide +one C++ library that translates the Arrow RecordBatch to Tensors, which can also +be reused in a TF op (as the downstream of the Dataset, or in a Python wrapper). 
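+As a small, self-contained illustration of the zero-copy claims in the
+"Efficient Arrow->Tensor conversion" section above (pyarrow and numpy assumed
+installed; the values are made up), a primitive Arrow array can expose its
+values buffer to numpy, and through it to TF, without copying:
+
+```python
+import numpy as np
+import pyarrow as pa
+
+values = pa.array([1.0, 2.0, 3.0, 4.0], type=pa.float32())
+
+# to_numpy refuses to silently copy by default (e.g. for arrays with nulls or
+# string arrays it raises), so this checks that the conversion is zero-copy.
+as_np = values.to_numpy(zero_copy_only=True)
+assert as_np.dtype == np.float32
+```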
+ +# Alternatives Considered + +## StructuredTensor + +We’ve considered an alternative where [StructuredTensor](https://github.com/tensorflow/community/blob/master/rfcs/20190910-struct-tensor.md) +is the unified in-memory representation, but it does not meet all of the +requirements: + +| | Arrow RecordBatch | StructuredTensor | +| :-------------------------------- | :----------------- | :---------------- | +| Columnar | Yes | Yes | +| Lossless encoding (nullity) | Yes | No (see remark 1) | +| Efficient translatable to Tensors | Yes (see remark 2) | Yes | +| Efficient slicing, filtering and concatenation | Yes | No (see remark 3) | +| Interoperability with the rest of the world | Good through Apache Arrow | Needs adaptation (see remark 4) | + +Remarks: + +1. We could revise the design of StructuredTensor to include the nullibility + support. +2. Only when the backing buffers are aligned correctly. Currently both TF + and Apache Arrow has 64-byte alignment. And this can be enforced by + implementing our own Arrow MemoryPool wrapping a TF allocator. + [This colab notebook](https://colab.research.google.com/drive/1bM8gso7c8x4UXx5htDM4N1KUSTuRvIFL) + shows that as long as the memory alignment is the same, feeding TF with an + Arrow Array has very little overhead. +3. See the comparison in [this colab notebook](https://colab.research.google.com/drive/1CvDjZCH3GQE8iojCmRHPuSqLTw8KgNf3). + * It’s worth calling out that Arrow is meant to be a data analysis library + and better data analysis support (for example, support for a “group-by” + clause) will be added over time. +4. We may gain similar interoperability by creating an Arrow to + StructuredTensor adapter. + * Beyond the technical aspects, we believe by having all the TFX libraries + directly adopting a popular OSS in-memory format will send a positive + message that TFX is meant to work well with the rest of the world. + +## tf.Data + +We’ve also considered tf.Data as the unified I/O abstraction. tf.Data has +support for a good number of data formats through +[tensorflow-io](https://github.com/tensorflow/io/tree/master/tensorflow_io). + +It's faily straightforward to implement a Beam PSource that wraps a file-backed +tf.data DatasetSource, and have that PSource produce Arrow RecordBatches, +but due to lack of support for [dynamic work rebalancing](https://cloud.google.com/blog/products/gcp/no-shard-left-behind-dynamic-work-rebalancing-in-google-cloud-dataflow) in tf.data, such an implementation would +not match the performance of existing beam PSources. + +While we cannot rely solely on tf.data as the I/O abstraction, the proposed +TFXIO interface does not disallow such a tf.data based implementation. So we can +still gain support for many other formats through tensorflow-io while still +using existing beam PSources for formats that have native support. + +# Questions and Discussion Topics + +## OSS build / release issues + +### We don’t have a strong representation in the Arrow community + +This has led to some issues. For example, the PyPI/Wheel packaging for pyarrow +currently is unfunded and lacks volunteers, and the pyarrow wheel sometimes had +issues with TensorFlow ([example](https://github.com/apache/arrow/issues/4472)). + +### ABI compatibility with libarrow + +The OSS library, tfx_bsl will depend on Arrow and TensorFlow’s DSOs (dynamic +shared objects). 
Because both libraries currently expose C++ APIs, there are +always risks of incompatible ABIs as TensorFlow and Arrow are likely to be built +using different toolchains, we cannot completely eliminate the risks. + +With [Modular TensorFlow](https://github.com/tensorflow/community/pull/77), +which replaced all the C++ APIs with C-APIs, we will be able to eliminate the +risk by using the same toolchain that builds Arrow. + +Furthermore, the Apache Arrow community is discussing about an +[ABI-stable C-struct](https://github.com/apache/arrow/pull/5442) that describes +Arrow Arrays. This will allow to build Apache Arrow from source and link +statically with our code, and only talk with pyarrow through that ABI-stable +interface. + +Since in Google we build everything from HEAD, using the same toolchain, there +are no risks. + +## Performance + +We would have to convert from Apache Arrow dataframes to TF feedables / Tensors. +Sometimes this conversion cannot happen efficiently (requires copying out the +data or other computation). diff --git a/rfcs/20191017-tfx-standardized-inputs/double-translation.png b/rfcs/20191017-tfx-standardized-inputs/double-translation.png new file mode 100644 index 000000000..a65d1d87f Binary files /dev/null and b/rfcs/20191017-tfx-standardized-inputs/double-translation.png differ diff --git a/rfcs/20191017-tfx-standardized-inputs/impl_tfxio.png b/rfcs/20191017-tfx-standardized-inputs/impl_tfxio.png new file mode 100644 index 000000000..7309b56cc Binary files /dev/null and b/rfcs/20191017-tfx-standardized-inputs/impl_tfxio.png differ diff --git a/rfcs/20191017-tfx-standardized-inputs/oss_lib_org.png b/rfcs/20191017-tfx-standardized-inputs/oss_lib_org.png new file mode 100644 index 000000000..1e706c8da Binary files /dev/null and b/rfcs/20191017-tfx-standardized-inputs/oss_lib_org.png differ diff --git a/rfcs/20191017-tfx-standardized-inputs/overview.png b/rfcs/20191017-tfx-standardized-inputs/overview.png new file mode 100644 index 000000000..55e32a3f0 Binary files /dev/null and b/rfcs/20191017-tfx-standardized-inputs/overview.png differ diff --git a/rfcs/20191106-tf2-tpu-savedmodel.md b/rfcs/20191106-tf2-tpu-savedmodel.md new file mode 100644 index 000000000..097f655ef --- /dev/null +++ b/rfcs/20191106-tf2-tpu-savedmodel.md @@ -0,0 +1,413 @@ +# TPU SavedModel Export API for TF2.x + +Status | Accepted +:------------ | :----------------------------------------------------------- +**RFC #** | [171](https://github.com/tensorflow/community/pull/171) +**Author(s)** | Zhuoran Liu (lzr@google.com), Youlong Cheng (ylc@google.com) +**Sponsor** | Jonathan Hseu (jhseu@google.com) +**Updated** | 2020-02-04 + +## Objective + +Provide an API to allow TF2 users to export TPU saved models for +inference, which: + ++ Provide a user-friendly way to specify which function to run on TPU; ++ Hides Graph construction and TPU inference specific logic (multi-core + support, etc) from users; ++ Allows specifying tags in SavedModel. + +## Motivation + +### TPU Serving Requirement + +Serving a model on TPU is not as straightforward as serving on CPU and GPU, +because TPU serving has special requirements, listed as follows: + ++ Contract between TensorFlow graph and TF2XLA Bridge. The new bridge will + still respect this contract. The information of “which part of computation + should run on TPU” is conveyed from Graph to Bridge by tagging a special + Node attribute `_tpu_replicate`. 
Because of this, we need to provide + information during Function object instantiation in order for this attribute + to be correctly attached to Nodes during Graph building; + ++ Multi-core TPU serving. TPU has various deployment configurations, for + example 1x1 Dragonfish chip has 2 cores, 2x2 Dragonfish chip has 8 cores. + The exported saved model should be able to run on different configurations + and can leverage all the TPU cores. + + - When users write their model code, they likely don’t have information + about how many TPU they have for serving / which core they can use. + Therefore we need a Graph level abstraction to express graph + partitioning information. tf.device() cannot serve this purpose, because + it requires users to have knowledge about the physical device they have + during serving; + - To make efficient usage of multicore TPUs, we need to encapsulate TPU + computations as FunctionDef, and construct TPUPartitionedCall / + TPUOrdinalSelector to perform round-robin core selection; + ++ Tagging system of SavedModel. Users rely on a tagging system to load their + models for serving. E.g. CPU MetaGraphs have one tag ‘serve’, while TPU + MetaGraphs have two tags ‘serve’ and ‘tpu’. Only with correct tags can + SavedModels be loaded correctly. + +Below is an intuitive example of how a TPU graph is different from a CPU one: + +![Original CPU Graph](20191106-tf2-tpu-savedmodel/cpu_graph.png) +
Original CPU Graph.
+ +![TPU Graph](20191106-tf2-tpu-savedmodel/tpu_graph.png) +
TPU Graph.
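+As noted in the tagging requirement above, a consumer of the exported model
+addresses the TPU MetaGraph by both tags when loading it (a minimal sketch;
+the export path is a placeholder):
+
+```python
+import tensorflow as tf
+
+# A TPU SavedModel is only found if both of its tags are specified.
+loaded = tf.saved_model.load("/path/to/tpu_saved_model", tags=["serve", "tpu"])
+```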
+ +### Limitation of current `tf.saved_model.save()` + +MetaGraphDef allows saving customized tags. Current downstream components like +TPU model-server, TFX infra-validator use the tags to load the specific +MetaGraph. However tf.saved_model.save() does not allow users to specify the set +of tags in MetaGraphDef, but hard-coded the MetaGraph to have only one ‘serve’ +tag. + +### User Control of Device Placement + +There has to be a way for users to specify which part of computation should be +placed on TPU, because there’s no perfect device placement policy that can work +for every use case. For example even though dense embedding ops are allowed on +TPU, serving models might still want to run embedding lookups on CPU because the +embeddings are too big to fit on TPU. + +![Customized Embeddings](20191106-tf2-tpu-savedmodel/customized_embeddings.png) +
Example of user control. In this graph, both ‘custom_embedding’ and +‘dense’ can run on TPU. But users want ‘custom_embedding’ to run on CPU for +whatever reason, e.g. CPU computations can be parallelized, users don’t have +enough TPU resources, etc. In this case, there has to be a way for them to tell +SavedModel that only ‘dense’ is to run on TPU.
+ +## Design + +The general idea is to allow users to store a function-alias mapping during +model saving, so that they can refer to the function they want to rewrite for +TPU inference when they use downstream graph transformation tools to rewrite +their models for TPU serving. + +This alias mapping mechanism is because a single tf.function can generate many +ConcreteFunctions. If a downstream tool wants to refer to all concrete functions +generated by a single tf.function, it can use the `function_aliases` argument to +store a map from the alias name to all concrete function names. + +### Major changes + ++ For `tf.saved_model.SaveOptions`: A new slot `function_aliases` is added, to + allow users specify alias of functions they potentially wish to be rewritten + by external graph transformation tools (TPU Inference Converter in this + case); ++ For `MetaInfoDef` in `MetaGraphDef` in `SavedModel`: A new field + `functions_aliases` is added, to store names of FunctionDef mapping to their + aliases. + +### User facing API + +Users can give `FunctionDef`s they potentially want to rewrite for TPU inference +an alias when saving model: + +```python +class MyModel: + @tf.function + def func(): + ... + @tf.function + def serve(): + ... + func() + +model = MyModel() +signatures = { + 'serving_default': model.serve.get_concrete_function(), +} +options = tf.saved_model.SaveOptions(function_aliases={ + 'my_func': model.func, +}) +tf.saved_model.save(model, export_dir, signatures, options) +``` + +And leverage some model conversion tool to convert their CPU model for TPU +serving: + +```python +MyModelConversionTool(input_saved_model, output_saved_model, function_alias='my_func') +``` + +## Alternative Design Considered + +### Caveat + +`@tf.tpu.function` should only be used for serving. It should never appear in +training code. + +### User Facing API + +For General TF2 Users + +Under the proposed design, users will need to do the following things to export +a TPU SavedModel in TF2.x: + +1. Replace @tf.function with @tf.tpu.function for functions they wish to run on + TPU; + + ```python + # `model` can be any Python Callable. E.g. A Keras Model. + @tf.tpu.function + def predict_step(image_tensors): + return model(image_tensors) + ``` + +2. Create main serving function and call the tpu function above. The main + function might have additional TF ops which can’t run on TPU (e.g. + `tf.decode_image`: + + ```python + @tf.function + def serve(images): + image_tensors = tf.decode_image(images) + return predict_step(image_tensors) + ``` + + And then create a signature: + + ```python + signatures = { + 'serving_default': + serve.get_concrete_function(...), + } + tags = [tag_constants.SERVING, tag_constants.TPU] + ``` + +3. Pass the both signatures to `tf.saved_model.save()`: + + ```python + tf.saved_model.save( + model, + export_dir='...', + signatures=signatures, + tags=tags) + ``` + +The resulting TPU inference graph looks like this: + +![Resulting TPU Graph](20191106-tf2-tpu-savedmodel/tpu_result.png) +
Resulting TPU Graph.
+ +For Advanced Users who need customized Ops + +In such cases, we provide the flexibility for users to tweak `@tf.tpu.function`. + +1. If users wish not to use TPUPartitionedCall, they can disable using + TPUPartitionedCall: + + ```python + @tf.tpu.function(use_tpu_partitioned_call=False) + def predict_step(images): + ... + ``` + +2. Users can also nest TPU functions within BatchFunction: + + ```python + @tf.tpu.function(use_batch_function=True, + # Below arguments for BatchFunction + # are optional + max_batch_size=..., + allowed_batch_sizes=... + ...) + def predict_step(images): + ... + ``` + +3. User can also customize their TPUPartitionedCallOp: + + ```python + @tf.tpu.function(use_tpu_partitioned_call=True, + device_ordinal=0) + def predict_step(images): + ... + ``` + +For Keras Users + +Option 1: + +Introduce argument `export_to_tpu`. For Keras users, they will only need to pass +`export_to_tpu=True` to save to TPU SavedModel. (Currently, we require the graph +defined by `model` to be completely TPU-compatible.) + +```python +tf.keras.models.save_model( + model, + filepath='...', + export_to_tpu=True) +``` + +Option 2: + +Keep tf.keras.models.save_model() unchanged. Users use a keras model as if they +were using a TF2 Function. + +```python +# isinstance(model, (tf.keras.Model, tf.keras.layers.Layer)) == True +@tf.tpu.function +def predict_step(image_tensors): + return model(image_tensors) +``` + +### Changes to TF2.x API + +1. `tf.saved_model.save()` will take an optional argument `tags`. + + `tags` is an optional argument which represents a list of tags. This allows + users to specify customized tags. For example, Servomatic or model server + requires both ‘tpu’ and ‘serve’ tags to load TPU saved model. + +2. Implement an additional `@tf.tpu.function` decorator in + `tensorflow/python/tpu/tpu.py`. This decorator handles TPU rewriting under + the hood. + + `tf.tpu.function()` takes the following optional arguments: + + - `func`: A Python function. If not set, will return a wrapper that takes + a Python function. This allows @tf.tpu.function to be called w/ or w/o + arguments; + - `use_tpu_partitioned_call`: boolean. Controls whether TPUPartitionedCall + will be used; + - `device_ordinal`: Used in conjunction with `use_tpu_partitioned_call`. A + tensor or a TF Function object that returns a tensor, designating the + device ordinal. Default to tpu_ordinal_selector(); + - `use_batch_function`: boolean. Controls whether BatchFunction will be + used; + - `num_batch_threads`, `max_batch_size`, `batch_timeout_micros`, + `allowed_batch_sizes`, `max_enqueued_batches`: arguments used to + configure BatchFunction. + - `preserve_cpu_fn`: boolean. With this set to true, users avoid having to + copy-paste the same block of code for CPU inference. + +### Changes to Keras API + +Option 1 + +If Keras users would like `tf.keras.models.save_model()` to work directly for +exporting TPU SavedModel, without having knowledge of tf.function / tags / +signatures. The only way to achieve this is to hide those logics under +`tf.keras.models.save_model()`. + +After the change, `tf.keras.models.save_model()` will have two additional +arguments: + +1. `export_to_tpu`: Simply setting this to `True` will export TPU model; +2. `tags`: Optionally for advanced users, if they want to have more control of + what tags they are using, they can use this argument as if they are using + TF2.x saving API. + +Option 2 + +No change. Users can save a keras model for TPU inference with +tf.saved_model.save(). 
+ +## Detailed Design + +### TF2.x API + +Under the hood, exporter API is doing the following things: + ++ The @tf.tpu.function wraps user-specified function; ++ Tag the MetaGraph with user-defined tags. + +Step 1: Use a new decorator to wrap TPU version of the user-specified TPU +function. It calls tpu.rewrite inside the original function to generate a TPU +version of graph. By default, this will create a tpu function. If users wish to +preserve both CPU and TPU function, they can set ‘preserve_cpu_fn=True’. +Optionally, they can use `use_tpu_partitioned_call` and `use_batch_function` to +customize the Function object they get. + +```python +# tensorflow/python/tpu/tpu.py + +def _tpu_partitioned_call_wrapper(tf_func, device_ordinal): + ... + +def _batch_function_wrapper(tf_func, + num_batch_threads, + max_batch_size, + batch_timeout_micros, + allowed_batch_sizes, + max_enqueued_batches): + ... + +def _rewrite_func_wrapper(func): + ... + +@tf_export("tpu.function") +def tpu_function(func=None, *args, **kwargs): + ... + tpu_func = _rewrite_func_wrapper(func) + ... + if use_tpu_partitioned_call: + tpu_fn = _tpu_partitioned_call_wrapper(tpu_fn, device_ordinal) + ... + if use_batch_function: + tpu_fn = _batch_function_wrapper(tpu_fn, + num_batch_threads, + max_batch_size, + batch_timeout_micros, + allowed_batch_sizes, + max_enqueued_batches) + ... +``` + +Step 2: Create a MetaGraph with designated tags for the SavedModel. + +```python +# tensorflow/python/saved_model/save.py + +saved_model = saved_model_pb2.SavedModel() +... +meta_graph_def = saved_model.meta_graphs.add() +asset_info, exported_graph = _fill_meta_graph_def( + meta_graph_def, saveable_view, signatures, + options.namespace_whitelist, + tags=list(tags)) +... +``` + +### Support for Keras saving API (Under option 1 for Keras) + +Adding an argument `export_to_tpu` for `tf.keras.models.save_model()`, which if +set to true will rewrite the model for TPU inference. + +Adding an argument `tags` for `tf.keras.models.save_model()` which has the same +semantics as that in `tf.saved_model.save()`. + +```python +# tensorflow/python/keras/saving/save.py + +@keras_export('keras.models.save_model') +def save_model(model, + filepath, + overwrite=True, + include_optimizer=True, + save_format=None, + signatures=None, + tags=None, + export_to_tpu=False, + options=None): + ... 
+ if (export_to_tpu and + (not tags + or tag_constants.TPU not in tags)): + checkpoint_graph_view = save_lib._AugmentedGraphView(model) + signatures = find_function_to_export_tpu(checkpoint_graph_view) + tags = [tag_constants.SERVING, tag_constants.TPU] + + saved_model_save.save(model, filepath, overwrite, + include_optimizer, + signatures, + tags, + options) +``` diff --git a/rfcs/20191106-tf2-tpu-savedmodel/cpu_graph.png b/rfcs/20191106-tf2-tpu-savedmodel/cpu_graph.png new file mode 100644 index 000000000..9014e1113 Binary files /dev/null and b/rfcs/20191106-tf2-tpu-savedmodel/cpu_graph.png differ diff --git a/rfcs/20191106-tf2-tpu-savedmodel/customized_embeddings.png b/rfcs/20191106-tf2-tpu-savedmodel/customized_embeddings.png new file mode 100644 index 000000000..3d9265c75 Binary files /dev/null and b/rfcs/20191106-tf2-tpu-savedmodel/customized_embeddings.png differ diff --git a/rfcs/20191106-tf2-tpu-savedmodel/tpu_graph.png b/rfcs/20191106-tf2-tpu-savedmodel/tpu_graph.png new file mode 100644 index 000000000..5928ad730 Binary files /dev/null and b/rfcs/20191106-tf2-tpu-savedmodel/tpu_graph.png differ diff --git a/rfcs/20191106-tf2-tpu-savedmodel/tpu_result.png b/rfcs/20191106-tf2-tpu-savedmodel/tpu_result.png new file mode 100644 index 000000000..9b6b41baf Binary files /dev/null and b/rfcs/20191106-tf2-tpu-savedmodel/tpu_result.png differ diff --git a/rfcs/20191127-pip-structure.md b/rfcs/20191127-pip-structure.md new file mode 100644 index 000000000..ac0a4f84f --- /dev/null +++ b/rfcs/20191127-pip-structure.md @@ -0,0 +1,204 @@ +# Improved pip package structure + +| Status | [Implemented](https://github.com/tensorflow/tensorflow/commit/5c00e793c61860bbf26778cd4704313e867645be) | +:-------------- |:---------------------------------------------------- | +| **RFC #** | [182](https://github.com/tensorflow/community/pull/182)| +| **Author(s)** | Anna Revinskaya (annarev@google.com) | +| **Sponsor** | Alex Passos (apassos@tensorflow.org) | +| **Updated** | 2020-02-04 | + +## Objective + +We propose to simplify TensorFlow pip package structure to enable IDE features such as autocomplete, jump-to-definition and quick-documentation. + +## Motivation + +### Current package structure +TensorFlow package structure has grown quite complex over time as we started to support multiple versions (1.x and 2.x) and import external sub-packages (such as tensorflow\_estimator and tensorboard). This complexity is expected to grow if we split out more components into separate pip packages. + +Sources of complexity: + +* Versioning: tensorflow\_core API lives under *_api/v1* or *_api/v2* directory depending on the version. +* Virtual pip package: Installing TensorFlow actually installs 2 directories: *tensorflow/* and *tensorflow\_core/* under *site-packages/*. TensorFlow code lives under *tensorflow\_core/*. TensorFlow uses lazy loading to import everything from *tensorflow\_core/* to *tensorflow/*. Two-directory structure helps work-around circular imports caused by tensorflow\_estimator. + +Outline of the current structure: +``` +tensorflow + __init__.py (contains "from tensorflow_core import *") + +tensorflow_core + python/... + lite/... + _api/v2 + __init__.py + audio/__init__.py + autograph/__init__.py + ... +``` + +### Rationale behind current package structure +#### Multiple version support +To prepare for TensorFlow 2.0 launch, we added a way to build two versions: 1.x and 2.x. Each version has its own respective genrule that outputs file for 1.x or 2.x since API modules are different (for e.g. 
*tensorflow/manip/\_\_init\_\_.py* only exists in 1.x and not 2.x API). Now, bazel does not allow two genrules to output files to the same directory. Therefore, we have *_api/v1/* and *_api/v2/* subdirectories. + +Note that we could still place the API directly under *tensorflow/* in the pip package since a pip package contains a single version of TensorFlow. This option became out of reach when *tensorflow/contrib/lite/* was migrated to *tensorflow/lite/*. Now *tensorflow/lite/* API directory would conflict with *tensorflow/lite/* source directory if the API was under *tensorflow/* instead of *_api/vN/*. + +#### Circular dependencies +Estimator depends on TensorFlow. At the same time, TensorFlow includes estimator as a part of its API. This creates a cycle. + +![alt_text](https://github.com/annarev/community/blob/pip_structure_rfc/rfcs/20191127-pip-structure/circular_dependency.png "Circular dependency +between TensorFlow and Estimator.") + +#### Metapackage vs base package plans +[Modular TensorFlow +RFC](https://github.com/tensorflow/community/blob/master/rfcs/20190305-modular-tensorflow.md) proposes to keep two pip packages: +tensorflow-base would only contain core TensorFlow (for e.g. no estimator). +TensorFlow Metapackage would be a thin package defining composition of TensorFlow which includes base, estimator, keras and tensorboard. +Note that this 2-package approach is not implemented yet. However, its proposal demonstrates how keeping a virtual pip package could be beneficial in the future. + +![alt_text](https://github.com/annarev/community/blob/pip_structure_rfc/rfcs/20191127-pip-structure/modular_structure.png "Proposed modular TensorFlow structure.") + +Current structure looks more like this (except *tensorflow/* and *tensorflow\_core/* are directories as opposed to separate pip packages) and meant to be the first step towards structure above: + +![alt_text](https://github.com/annarev/community/blob/pip_structure_rfc/rfcs/20191127-pip-structure/current_structure.png "Current TensorFlow structure.") + +### Current state of IDE code features + +#### PyCharm 2019.1.1 + +* Autocomplete: + * Works in most cases after switching to use relative imports. + * Doesn’t work for tf.compat.v1.keras and tf.compat.v2.keras. + * Doesn’t work for keras if importing it using from import (i.e. `from tensorflow import keras`). +* Jump-to-definition doesn’t work. +* Quick documentation doesn’t work. + +#### PyCharms with 2019.3 EAP build 193.3793.14 +Latest version of PyCharms added [custom handling for tensorflow](https://github.com/JetBrains/intellij-community/blob/0a08f8212351ee84d602cdc5547f038ce0df79fd/python/src/com/jetbrains/tensorFlow/PyTensorFlow.kt) +* Autocomplete works in most cases. +* Doesn’t work for keras if importing it using from import (i.e. `from tensorflow import keras`). +* Jump-to-definition works. +* Quick documentation works. + +#### VS Code 1.40 (October 2019 release) +* Autocomplete: + * Works in most cases. + * Doesn’t work for `tf.estimator` or `tf.keras`. + * Doesn’t work for `tf.compat.v1.keras` and `tf.compat.v2.keras`. + * Doesn’t work for keras if importing it using from import (i.e. `from tensorflow import keras`). +* Jump-to-definition doesn’t work. +* Quick documentation doesn’t work. + + +## User Benefit + +TensorFlow package structure creates difficulties for those who use IDEs. +Autocomplete, quick documentation and jump-to-definition features often rely on +module structure matching directory structure. 
For example, TensorFlow code uses
+`from tensorflow.foo` imports but lives under the tensorflow\_core package. Simplifying
+the package structure would improve productivity for TensorFlow users.
+
+## Design Proposal
+
+The best way I can think of to fix the autocomplete issues is to make our package structure as clean as possible. In this case, autocomplete will work out of the box.
+
+### Short term: Remove virtual pip package
+
+The primary purpose of keeping the virtual pip package is to work around circular
+estimator imports. Alternatively, we can resolve this issue by lazy loading
+estimator.
+
+Estimator import in the root *\_\_init\_\_.py* file:
+```python
+from tensorflow.python.util.lazy_loader import LazyLoader as _LazyLoader
+estimator = _LazyLoader(
+    "estimator", globals(),
+    "tensorflow_estimator.python.estimator.api._v2.estimator")
+setattr(_current_module, "estimator", estimator)
+```
+
+Lazy loading by itself would mean that we no longer have autocomplete for estimator. As a workaround, we can import estimator without lazy loading if `typing.TYPE_CHECKING` is `True`.
+
+After building a pip package with this change, all of the following work in PyCharm (both the released and EAP versions) and VS Code:
+
+* jump-to-definition
+* quick documentation
+* autocomplete for `compat.v1.keras` and `compat.v2.keras`
+* autocomplete for keras when using `from tensorflow import keras`
+* ...basically every import I tested works with autocompletion
+
+To support the TensorFlow Metapackage plans, we could add a new pip package that specifies dependencies on tensorflow, tensorflow\_estimator, tensorboard, etc. Its sole purpose would be to get all dependencies installed.
+
+![alt_text](https://github.com/annarev/community/blob/pip_structure_rfc/rfcs/20191127-pip-structure/new_modular_structure.png "New proposed modular TensorFlow structure.")
+
+### Long term (optional): Import from external packages directly
+The short-term proposal would fix the IDE issues, but the package structure is still not as clean as it could be. We resolve cycles with lazy loading, but it would be even better not to have this circular structure at all.
+
+Therefore, I propose that we don’t import external packages into TensorFlow 3.0. Users who want to use estimator, tensorboard summaries or keras could import them separately:
+
+Current code that looks like:
+```python
+import tensorflow as tf
+
+tf.estimator
+tf.keras
+tf.summary
+```
+
+Would be changed to:
+```python
+import tensorflow as tf
+import tensorflow_estimator as estimator
+import keras
+from tensorboard import summaries
+```
+
+Rationale for this change:
+
+* One-way dependencies (estimator depends on tensorflow and not vice versa).
+* Minimal overhead for users. Adding an extra import is easy.
+
+Note that this change cannot be done in TensorFlow 2.x due to API guarantees. Also, accessing these packages from `tf.` would match familiar workflows. Therefore, we can keep `tf.estimator`, `tf.keras` (once it is moved out of TensorFlow) and `tf.summary` available as an alternative to importing the pip packages directly. This would require some work to make sure these packages contain the right API (e.g. tensorflow\_estimator.estimator currently always contains the V1 API).
+
+
+### Alternatives Considered
+Alternatively, we could solve the IDE autocomplete issues by changing all imports in
+TensorFlow to import from `tensorflow_core` instead of `tensorflow`.
+
+#### Advantages:
+
+* Keep supporting external libraries included as a sub-namespace, e.g.
+`tf.estimator`. 
+ +#### Disadvantages: + +* This is a more invasive change since it requires updating every Python file in TensorFlow. +It would also mean that external packages such as `tensorflow_estimator` need to +use imports of the form `from tensorflow_core` instead of `from tensorflow`. + +The main proposal in this document seems simpler to me (it removes complexity +instead of adding it) and therefore preferred. + +### Performance Implications +I am not expecting major performance changes since this is just a package +structure proposal. + +### Dependencies +This proposal does not add new dependencies. The rest of the proposal largely +describes how we plan to handle dependencies. + +### Engineering Impact +We don't expect changes to binary size / startup time / build time / test time. + +### Platforms and Environments +This should work on all platforms and we will test it to make sure. + +### Best Practices, Tutorials and Examples +There are no user-visible changes other than fixes to enable IDE features. + +### Compatibility +Short term proposal does not have any compatibility concerns. Long term, +however, proposes to remove `tf.estimator`, etc.. which is not a backwards +compatible change. We can only make this change at the next major release. + +### User Impact +There are no user-visible changes other than fixes to enable IDE features. diff --git a/rfcs/20191127-pip-structure/circular_dependency.png b/rfcs/20191127-pip-structure/circular_dependency.png new file mode 100644 index 000000000..fd5f571fb Binary files /dev/null and b/rfcs/20191127-pip-structure/circular_dependency.png differ diff --git a/rfcs/20191127-pip-structure/current_structure.png b/rfcs/20191127-pip-structure/current_structure.png new file mode 100644 index 000000000..9e3fd3cbd Binary files /dev/null and b/rfcs/20191127-pip-structure/current_structure.png differ diff --git a/rfcs/20191127-pip-structure/modular_structure.png b/rfcs/20191127-pip-structure/modular_structure.png new file mode 100644 index 000000000..fc590ae86 Binary files /dev/null and b/rfcs/20191127-pip-structure/modular_structure.png differ diff --git a/rfcs/20191127-pip-structure/new_modular_structure.png b/rfcs/20191127-pip-structure/new_modular_structure.png new file mode 100644 index 000000000..4f716ad3e Binary files /dev/null and b/rfcs/20191127-pip-structure/new_modular_structure.png differ diff --git a/rfcs/20191203-single-eager-graph-path.md b/rfcs/20191203-single-eager-graph-path.md new file mode 100644 index 000000000..f8dce0512 --- /dev/null +++ b/rfcs/20191203-single-eager-graph-path.md @@ -0,0 +1,329 @@ +# Single python code path for eager and graph + +| Status | Accepted | +:-------------- |:---------------------------------------------------- | +| **RFC #** | [184](https://github.com/tensorflow/community/pull/184) | +| **Author** | Saurabh Saxena (srbs@google.com) | +| **Sponsors** | Alex Passos, Gaurav Jain | +| **Updated** | 2019-12-03 | + + +## Objective + +This proposal discusses merging the graph building and eager op-dispatch code-paths in python and moving the FuncGraph capturing logic and gradient tape bookkeeping into C++. + +## Motivation + +### Graph building performance + +Graph-building time performance has been a key bottleneck in enabling implementation of large models in TF2. + +* Capturing external tensors: In analysis of graph-building time for [BERT](https://github.com/tensorflow/models/tree/master/official/nlp/bert) we found that ~20% time of building the body graph of a tf.while_loop is spent in `FuncGraph.capture`. 
We also extensively perform capturing when building gradients of functional ops since the backward function requires access to intermediate tensors of the forward function. This includes 2 parts, both of which we could potentially perform in C++. + * [Creating](https://github.com/tensorflow/tensorflow/blob/6a70aa6d438259cabd23c09808db4cf51a2e5377/tensorflow/python/framework/func_graph.py#L1118) the placeholders. These can be many (154630 in BERT’s while loop). Building these in python means we incur the python op building overheads, Python->C SWIG costs and maintain the captures mapping in python. + * [Copying](https://github.com/tensorflow/tensorflow/blob/6a70aa6d438259cabd23c09808db4cf51a2e5377/tensorflow/python/framework/func_graph.py#L1120) the handle data (for resources and variants). Handle data contains information about the shape and type of the _contained_ entity of a `DT_RESOURCE`/`DT_VARIANT` type. Copying handle data requires a python-c round-trip since the handle data is contained in either `EagerTensor._handle_data` (for EagerTensors) or `InferenceContext.output_handle_shapes_and_types` (for Graph tensors). +* Automatic control deps: We add control dependencies to a `tf.function`’s nodes as a post-processing step to make sure that any side-effects occur in program order. This can easily be done in C/C++. +* Gradient Tape: The tape needs to keep track of the forward ops to build gradients (or actually compute the gradient in the case of forward-mode diff). This is currently triggered in gen_xyz_ops.py. We can move this to C++ as well. + + +### Cross-language support + +There have been various [requests](https://github.com/tensorflow/tensorflow/issues/28195) for providing APIs for building `tf.function` and v2 control flow in non-python frontends. Moving capturing logic to the C/C++ layer is the first step towards enabling this. The full details for this will be fleshed out in follow-up proposals, however, we do analyse how this proposal addresses use-cases of `FuncGraph` later in this doc. + + +### Shape Inference + +C++ shape inference in FuncGraphs fails if a shape tensor relies on the constant value of a captured placeholder because we do not have information about graph nesting available there. We currently work around this c6c1f2ff3bc979f420d8fffa2b6e02268f711bf6 by explicitly calling [maybe_set_static_shape](https://github.com/tensorflow/tensorflow/blob/15715cb2c8e877c18f8d969cc51a37ff26e8397b/tensorflow/python/ops/random_ops.py#L78) in Python because we have the graph hierarchy available there. One alternative @allenlavoie suggested was to replace the placeholders with their constant value tensors if possible, guarded by a size threshold but it was unclear what this threshold should be. Having information about the nested graphs and captures etc during shape inference could help avoid this problem. + + +### Consistent execution environments + +(Contributed by @allenlavoie) We currently rely on Python exporting SavedModels which are compatible with Session-based execution, where the Session owns variable memory and it is retrieved by executing variable nodes with fixed names. TensorFlow Serving for example still uses Sessions. This compatibility mode is quite different than the 2.x Python eager execution memory model where the language bindings associate memory with variable objects, and is likely going to be a source of confusion and bugs. 
This effort lays necessary groundwork for implementing FuncGraph in C/C++ and hence brings us closer to executing SavedModels the same way during serving (from C++) that we execute them during development (TF 2.x Python). + +References: + +1. TODO(saxenasaurabh): Link to FuncGraph CUJs doc. + +## Design Proposal + +Basically we want to get rid of the graph-building part in gen_*_ops.py and get rid of gradient tape bookkeeping in both graph and eager modes. For example: + + +```diff +def batch_matrix_band_part(input, num_lower, num_upper, name=None): + _ctx = _context._context or _context.context() + tld = _ctx._thread_local_data +- if tld.is_eager: + try: + _result = _pywrap_tensorflow.TFE_Py_FastPathExecute( + _ctx._context_handle, tld.device_name, "BatchMatrixBandPart", name, + tld.op_callbacks, input, num_lower, num_upper) + return _result + except _core._FallbackException: + try: + return batch_matrix_band_part_eager_fallback( + input, num_lower, num_upper, name=name, ctx=_ctx) + except _core._SymbolicException: + pass # Add nodes to the TensorFlow graph. + except _core._NotOkStatusException as e: + _ops.raise_from_not_ok_status(e, name) +- # Add nodes to the TensorFlow graph. +- _, _, _op, _outputs = _op_def_library._apply_op_helper( +- "BatchMatrixBandPart", input=input, num_lower=num_lower, +- num_upper=num_upper, name=name) +- _result = _outputs[:] +- if _execute.must_record_gradient(): +- _attrs = ("T", _op._get_attr_type("T")) +- _inputs_flat = _op.inputs +- _execute.record_gradient( +- "BatchMatrixBandPart", _inputs_flat, _attrs, _result) +- _result, = _result +- return _result + +def batch_matrix_band_part_eager_fallback(input, num_lower, num_upper, name, ctx): + _attr_T, (input,) = _execute.args_to_matching_eager([input], ctx) + num_lower = _ops.convert_to_tensor(num_lower, _dtypes.int64) + num_upper = _ops.convert_to_tensor(num_upper, _dtypes.int64) + _inputs_flat = [input, num_lower, num_upper] + _attrs = ("T", _attr_T) + _result = _execute.execute(b"BatchMatrixBandPart", 1, inputs=_inputs_flat, + attrs=_attrs, ctx=ctx, name=name) +- if _execute.must_record_gradient(): +- _execute.record_gradient( +- "BatchMatrixBandPart", _inputs_flat, _attrs, _result) + _result, = _result + return _result +``` + + +1. Graph building will implicitly happen in `TFE_Py_Execute` which is called from `xyz_eager_fallback`. +1. `TF_EagerContext` makes the call to `Tape.RecordGradient` so we no longer need to call it from Python. +1. The Graph stack will be maintained in `TF_EagerContext` (see below) which includes the graph hierarchy and captures made from each graph. + + +## Detailed Design + + +### API +A high-level overview of the anticipated API. + +**C API** + + +``` +// TODO: control dependencies, auto control dependencies, callbacks + +// A TF_EagerContext knows whether we're in eager mode or in graph mode, keeps +// track of gradient tapes, etc. +typedef struct TF_EagerContext TF_EagerContext; + +TF_EagerContext* TF_NewEagerContext(TFE_Context* ctx); +void TF_DeleteEagerContext(TF_EagerContext* c); + +// The context is executing eagerly if there are no graphs in the stack. We +// check when popping a graph from the stack that it is indeed the one we +// expected to avoid bugs. +int TF_EagerContextIsExecutingEagerly(TF_EagerContext* c); +void TF_EagerContextEnterGraph(TF_EagerContext* c, TF_Graph* g); +void TF_EagerContextExitGraph(TF_EagerContext* c, TF_Graph* g, TF_Status* s); +// Cleans up captures and other graph metadata in the eager context. 
+void TF_EagerContextDeleteGraph(TF_EagerContext* c, TF_Graph* g, TF_Status* s); + +// A TF_TensorHandle is a union type of TFE_TensorHandle (eager tensor) and +// TF_Output (graph tensor). +typedef struct TF_TensorHandle TF_TensorHandle; + +// Note: takes ownership of t. +TF_TensorHandle* TF_TensorHandleFromTensor(TFE_TensorHandle* t); +void TF_TensorHandleDecRef(TF_TensorHandle* t); +void TF_TensorHandleIncRef(TF_TensorHandle* t); +TFE_TensorHandle* TF_TensorHandleToTensor(TF_TensorHandle* t, TF_Status* s); +TF_Output TF_TensorHandleToGraphTensor(TF_TensorHandle* t, TF_Status* s); +int TF_TensorHandleHasValue(TF_TensorHandle* t); +TF_DataType TF_TensorHandleDataType(TF_TensorHandle* t); + +// When in graph mode accessing a tensor from outside the graph will trigger +// capturing logic similar to what we have in FuncGraph. These methods let you +// inspect the capturing metadata before popping the graph from the graph stack. +int TF_EagerContextNumCaptures(TF_EagerContext* c, TF_Graph* g, TF_Status* s); +void TF_EagerContextCapturedValues(TF_EagerContext* c, TF_Graph* g, + TF_TensorHandle** captures, TF_Status* s); +void TF_EagerContextCapturedPlaceholders(TF_EagerContext* c, TF_Graph* g, + TF_Output* captures, + TF_Status* s); + +// Allows specifying a custom capturing function. To be use to implement +// custom capturing logic for tf.while_loop. `captured` must be in the current +// context graph. +typedef void(*CaptureCallback)(TF_EagerContext* c, + TF_Graph* source_graph, + TF_TensorHandle* source, + TF_TensorHandle** captured, + TF_Status* s); +void TF_EagerContextPushCaptureCallback(TF_EagerContext* c, + CaptureCallback* callback, + TF_Graph* graph, TF_Status* s); +void TF_EagerContextPopCaptureCallback(TF_EagerContext* c, + TF_Graph* graph, TF_Status* s); + +// Needed for updating the captured tensor in tf.function, tf.cond grad func, VariableTensor. +void TF_EagerContextUpdateCaptureForPlaceholder(TF_EagerContext* c, TF_Graph* g, + TF_TensorHandle* placeholder, + TF_TensorHandle* new_capture, + TF_Status* s); + +// TF_OutputList just lets us not specify the number of outputs of an operation +// beforehand. This forces a memory allocation in the runtime, which is bad, but +// it allows for generic code. +typedef struct TF_OutputList TF_OutputList; +TF_OutputList* TF_NewOutputList(); +void TF_DeleteOutputList(TF_OutputList* o); +int TF_OutputListNumOutputs(TF_OutputList* o); +TF_TensorHandle* TF_OutputListOutput(TF_OutputList* o, int i); + +// A TF_AbstractOp is the metadata we need to execute an operation in either +// eager or graph mode. +typedef struct TF_AbstractOp TF_AbstractOp; +TF_AbstractOp* TF_NewAbstractOp(TF_EagerContext* c, const char* const op_type, + const char* const op_name, TF_Status* s); +void TF_DeleteAbstractOp(TF_AbstractOp* op); + +// TODO: we need a way of specifying attrs + +// TF_ExecuteOperation will, if in eager mode, execute, if in graph mode, maybe +// capture some inputs and then add a node in the graph, and after +// execution/node creation it'll go and record things that happened in any tape +// which happens to be active. +void TF_ExecuteOperation(TF_AbstractOp* op, int num_inputs, + TF_TensorHandle* const * inputs, TF_Status* s, + TF_OutputList* o); + +// TF_Tape is just a specialization of tensorflow::eager::Tape on +// TF_TensorHandle values and gradients. 
+typedef struct TF_Tape TF_Tape; +TF_Tape* TF_NewTape(TF_EagerContext* c, int persistent); +void TF_DeleteTape(TF_Tape* t); + +void TF_ContextPushTape(TF_EagerContext* ctx, TF_Tape* tape); +void TF_ContextPopTape(TF_EagerContext* ctx, TF_Tape* tape, TF_Status* s); +void TF_TapeWatch(TF_EagerContext* ctx, TF_TensorHandle* t); +void TF_TapeGradient(TF_Tape* t, int num_sources, TF_TensorHandle** sources, + int num_targets, TF_TensorHandle** targets, + TF_OutputList* gradients, TF_Status* s); + +// A GradientFunction is what we execute at runtime when computing a gradient; +// it takes some closure-captured values from the forward pass and the output +// gradients of the op and produces the input gradients of the op. +typedef void (*GradientFunction)(int num_output_gradients, + TF_TensorHandle* const * output_gradients, + TF_TensorHandle** input_gradients, + TF_Status* s, void* closure); +typedef void (*GradientFunctionDeleter)(GradientFunction function, + void* closure); + +// A GradientFunctionRegisterer is the code that will run during the forward +// pass to find out which gradient function should be pushed into the tape. It +// has access to all inputs and outputs of an operation and gets to choose which +// ones to pack into the closure which will be available to the GradientFunction +// at runtime. +typedef void (*GradientFunctionRegisterer)( + TF_EagerContext* c, int num_inputs, TF_TensorHandle* const* inputs, + TF_OutputList* outputs, GradientFunction* gradient_function, + GradientFunctionDeleter* gradient_function_deleter, + void* registerer_closure, void** gradient_function_closure); + +void TF_TapeCustomGradient(TF_EagerContext* ctx, + int num_inputs, + TF_TensorHandle** inputs, + int num_outputs, + TF_TensorHandle** outputs, + GradientFunctionRegisterer registerer, + void* registerer_closure); + +// Registers a gradient function to run given an op name. +void TF_ContextRegisterGradientFunction(TF_EagerContext* ctx, + const char* op_name, + GradientFunctionRegisterer registerer, + void* registerer_closure); +``` + + +**Python API** + + +``` +class EagerContextManager(object): + def __init__(self, c_graph): + self._c_graph = c_graph + def __enter__(self): + c_api.TF_EagerContextEnterGraph(ctx, self._c_graph) + def __exit__(self): + c_api.TF_EagerContextExitGraph(ctx, self._c_graph) + +class _FuncGraphBase(object): + def __init__(): + self._c_graph = c_api.TF_NewGraph() + @contextmanager + def as_default(): + # Note: This means that the graph hierarchy is no longer maintained in python. + return EagerContextManager(self._c_graph) +``` + + +We will implement a new subclass for `FuncGraph` that will replace `Graph`. We will try to keep as much of the logic as possible in C++ and expose that using pybind or somesuch. Here’s a discussion of some of the features that `FuncGraph` inherits from `Graph` which we will need to support. This list may not be exhaustive and we are hoping to add support for other things as needed. + + + +1. `apply_op_helper` + `create_op_internal` contain a lot of op _preparation_ logic which will need to be moved to C++. For example: + 1. [Uniquifying op names](https://github.com/tensorflow/tensorflow/blob/41228d7f14496ff661e7c22361a987b0255cf812/tensorflow/python/framework/ops.py#L3297). + 1. [Checking](https://github.com/tensorflow/tensorflow/blob/41228d7f14496ff661e7c22361a987b0255cf812/tensorflow/python/framework/op_def_library.py#L319-L327) deprecated op versions. 
[Graph version](https://github.com/tensorflow/tensorflow/blob/41228d7f14496ff661e7c22361a987b0255cf812/tensorflow/python/framework/ops.py#L2946) is already maintained in C++ so this should be fine. + 1. [Type checking](https://github.com/tensorflow/tensorflow/blob/41228d7f14496ff661e7c22361a987b0255cf812/tensorflow/python/framework/op_def_library.py#L53). + 1. There is a lot of logic for building attrs in python. For this we could possibly re-use the existing implementation in the pywrap C++ eager layer ([1](https://github.com/tensorflow/tensorflow/blob/41228d7f14496ff661e7c22361a987b0255cf812/tensorflow/python/eager/pywrap_tfe_src.cc#L755), [2](https://github.com/tensorflow/tensorflow/blob/41228d7f14496ff661e7c22361a987b0255cf812/tensorflow/python/eager/pywrap_tfe_src.cc#L850)) + 1. apply_op_helper calls [convert_to_tensor](https://github.com/tensorflow/tensorflow/blob/6d7926bb87c1a91ffd110aa3407c003b2ae54009/tensorflow/python/framework/op_def_library.py#L421) to convert python scalars to Tensors. This will happen in python for now and may move to a python specific C++ layer in the future. +1. We need some form of context management to handle a variety of context managers we have in Graph e.g. [control dependencies](https://github.com/tensorflow/tensorflow/blob/6d7926bb87c1a91ffd110aa3407c003b2ae54009/tensorflow/python/framework/ops.py#L4345), control flow contexts (for XlaControlFlowContext), [colocate_with](https://github.com/tensorflow/tensorflow/blob/23275fb35cf17482d147f88ce7d8f4ce9c2376f3/tensorflow/python/framework/ops.py#L4115), [name_scope](https://github.com/tensorflow/tensorflow/blob/6d7926bb87c1a91ffd110aa3407c003b2ae54009/tensorflow/python/framework/ops.py#L3918), [_attr_scope_map](https://github.com/tensorflow/tensorflow/blob/6d7926bb87c1a91ffd110aa3407c003b2ae54009/tensorflow/python/framework/ops.py#L4587), [_kernel_label_map](https://github.com/tensorflow/tensorflow/blob/6d7926bb87c1a91ffd110aa3407c003b2ae54009/tensorflow/python/framework/ops.py#L4653) etc. We will look into whether this can be implemented using a generic op callback mechanism. The same mechanism can be used for implementing op callbacks as well. +1. We will perform a case-by-case analysis of APIs of `Graph` to decide which of those should be supported in `_FuncGraphBase`. + 1. Certain APIs related to [feeding and fetching](https://github.com/tensorflow/tensorflow/blob/6d7926bb87c1a91ffd110aa3407c003b2ae54009/tensorflow/python/framework/ops.py#L4788-L4805) probably don’t make sense for FuncGraph. + 1. APIs for fetching Operations and Tensors: These APIs rely on a [dict of Operations](https://github.com/tensorflow/tensorflow/blob/6d7926bb87c1a91ffd110aa3407c003b2ae54009/tensorflow/python/framework/ops.py#L2721) maintained in Graph. Currently this dict is built _actively_ as operations are created in the graph. We could choose to populate this cache lazily as well. + 1. In each Graph we maintain a [dict](https://github.com/tensorflow/tensorflow/blob/6d7926bb87c1a91ffd110aa3407c003b2ae54009/tensorflow/python/framework/ops.py#L2757) of EagerDefinedFunction/DefinedFunction used in the graph directly or in a sub-function. In nested functions we probably spend quadratic time in [copying](https://github.com/tensorflow/tensorflow/blob/23275fb35cf17482d147f88ce7d8f4ce9c2376f3/tensorflow/python/eager/function.py#L488-L497) the inner functions all the way to the eager context and use quadratic (in the number of functions) memory. 
Storing `_EagerDefinedFunction` references in the global graph has been a common source of memory leaks which @kkimdev has been valiantly fighting with. I think we should try to register functions directly in the global eager context. We can just keep weakrefs to the _EagerDefinedFunction so that we don’t interfere with memory management. @kkimdev pointed out that we still need to maintain some reference to the list of functions used inside a ConcreteFunction so that we can add those to the [SavedModel](https://github.com/tensorflow/tensorflow/blob/23275fb35cf17482d147f88ce7d8f4ce9c2376f3/tensorflow/python/saved_model/save.py#L593). + +Some implementation notes: + +1. Need to add RefCounting for eager tensor handles. + 1. If a graph captures an EagerTensor, the code creating the EagerTensor should not delete it. + 1. How do you write the gradient function of add, which just wants to forward the output gradient to the two inputs + + +### Analysing some FuncGraph CUJs + + +**tf.function** + +When building the gradient (Stateful)PartitionedCall op, a captured tensor in the forward graph needs to be resolved to a forward call op’s output. This will still be possible to do in python. + +**tf.cond/tf.switch_case** + +Similar to tf.function, during gradient computation forward graph intermediates need to be mapped to forward op’s outputs. This currently updates the FuncGraph.captures map which can be done using `TF_EagerContextUpdateCaptureForPlaceholder`. Note however that tf.function does not actually update FuncGraph.captures and simply uses the new captures for building the gradient op. We may be able to avoid calling the API to update captures here if we do the same. Not sure if there any behavior relying on this though. Higher order derivatives using tf.gradients maybe? + +**tf.while_loop** + +tf.while_loop intercepts the default capturing mechanism of FuncGraph with custom behavior. In tf.while_loop, when a forward pass tensor needs to be captured we have to add an [accumulator](https://github.com/tensorflow/tensorflow/blob/6d7926bb87c1a91ffd110aa3407c003b2ae54009/tensorflow/python/ops/while_v2.py#L1012) and then capture the output of the While op corresponding to that accumulator. + +To support this we will provide a `TF_EagerContext{Push|Pop}CaptureCallback` API which will register a callback function to perform the logic in [_WhileBodyGradFuncGraph._capture_helper](https://github.com/tensorflow/tensorflow/blob/6d7926bb87c1a91ffd110aa3407c003b2ae54009/tensorflow/python/ops/while_v2.py#L933). + +We could leverage this to unify the gradient graph captures resolving behavior of `tf.function`/`tf.cond`/`tf.while_loop` all of which have their own recipes right now. + +### Automatic Control Dependencies + +Automatic control dependencies (ACD) will move to C++ as well. However instead of being post-hoc it will now be performed _during_ graph building. The current design has certain limitations e.g. control dependencies across function boundaries are performed at the function level which is prohibitive for performance. There are ongoing discussions on ways to improve this. Other issues have come up in `tf.data` and `tf.distribute` for example because ACD only tracks direct dependencies. Ideally we should use this opportunity to address these shortcomings. However the details of this redesign are left to another doc to avoid diluting this doc. + +### Open questions + +1. 
Keras seems to be using [non-public APIs](https://github.com/tensorflow/tensorflow/blob/6d7926bb87c1a91ffd110aa3407c003b2ae54009/tensorflow/python/keras/engine/base_layer.py#L2511) for directly building NodeDef and adding that to the graph. This is necessary for supporting Keras's Functional API (Model.add_loss, Model.add_metric, and auto-Lambda layers). We need to figure out if/how to support that. There are ongoing efforts to use just the public API of TF in tf.keras but the timelines for that are unclear. + 1. In the design review it was concluded that we should either be able to change Keras to use public python APIs or replace the internal python API calls with C API calls. + + +## Appendix + +**Definition: Capturing** + +Capturing is the process used to allow users to write functions which can reference tensors that are not directly passed as function inputs or are not passed as loop_vars in a call to tf.while_loop. In FuncGraph, when an external tensor is captured we create a [placeholder](https://github.com/tensorflow/tensorflow/blob/23275fb35cf17482d147f88ce7d8f4ce9c2376f3/tensorflow/python/framework/func_graph.py#L649) just like any other input and add that placeholder to the list of [FuncGraph.inputs](https://github.com/tensorflow/tensorflow/blob/23275fb35cf17482d147f88ce7d8f4ce9c2376f3/tensorflow/python/framework/func_graph.py#L672) and store the mapping from the external tensor to the placeholder in [FuncGraph._captures](https://github.com/tensorflow/tensorflow/blob/23275fb35cf17482d147f88ce7d8f4ce9c2376f3/tensorflow/python/framework/func_graph.py#L671).Capturing is triggered in `_create_op_internal` which is overridden in FuncGraph. + diff --git a/rfcs/20191203-single-eager-graph-path/20191203-func-graph-cujs.md b/rfcs/20191203-single-eager-graph-path/20191203-func-graph-cujs.md new file mode 100644 index 000000000..4586bdef9 --- /dev/null +++ b/rfcs/20191203-single-eager-graph-path/20191203-func-graph-cujs.md @@ -0,0 +1,112 @@ +# CUJs for FuncGraph + +| **Author** | Saurabh Saxena (srbs@google.com) | +:-------------- |:---------------------------------------------------- | +| **Updated** | 2019-12-03 | + + +### tf.function + + +### **Forward** + + + +1. An empty FuncGraph is created. +1. [Placeholders](https://github.com/tensorflow/tensorflow/blob/6a70aa6d438259cabd23c09808db4cf51a2e5377/tensorflow/python/framework/func_graph.py#L1205) are created in it corresponding to the input_signature. Note the signature can contain CompositeTensors which are flattened. The input structure is maintained in [structured_input_signature](https://github.com/tensorflow/tensorflow/blob/6a70aa6d438259cabd23c09808db4cf51a2e5377/tensorflow/python/framework/func_graph.py#L906). + 1. We seem to be [always](https://github.com/tensorflow/tensorflow/blob/6a70aa6d438259cabd23c09808db4cf51a2e5377/tensorflow/python/framework/func_graph.py#L1237) capturing variables even though they are unused. Can that be avoided? +1. The python_func is called with the above input placeholders as args. This can trigger creation of new placeholders by capturing. The captured tensors can be symbolic tensors from outer graphs or eager tensors. +1. FuncGraph.structured_outputs is populated with the structured tensors(containing CompositeTensors, IndexedSlices etc.). FuncGraph.outputs is built by flattening the structure and CompositeTensors in structured_outputs and by removing any Nones. + 1. 
We call [capture](https://github.com/tensorflow/tensorflow/blob/6a70aa6d438259cabd23c09808db4cf51a2e5377/tensorflow/python/framework/func_graph.py#L1015) on the tensors in the list of outputs to handle the case when the function is simply returning an external tensor. Solutions: + 1. We could replace this with creating an Identity node in the forward graph which would implicitly capture the external tensor. However, these Identity nodes are not necessary and might cause performance problems later. + 1. Can we avoid doing the capturing in func_graph_from_py_func? Idea: We keep Nones in the list of structured_outputs and not in the list of outputs. We could do the same for external outputs. These can get repackaged just like we [repackage](https://github.com/tensorflow/tensorflow/blob/6a70aa6d438259cabd23c09808db4cf51a2e5377/tensorflow/python/eager/function.py#L1911-L1913) Nones. + +**Backward** + + + +1. An empty FuncGraph is created. +1. input_signature is [constructed](https://github.com/tensorflow/tensorflow/blob/6a70aa6d438259cabd23c09808db4cf51a2e5377/tensorflow/python/eager/function.py#L644) from the incoming grads and [placeholders](https://github.com/tensorflow/tensorflow/blob/6a70aa6d438259cabd23c09808db4cf51a2e5377/tensorflow/python/framework/func_graph.py#L1205) are created in it corresponding to the input_signature. +1. The gradient [function](https://github.com/tensorflow/tensorflow/blob/6a70aa6d438259cabd23c09808db4cf51a2e5377/tensorflow/python/eager/function.py#L649) is called in this FuncGraph. This triggers capturing of intermediate tensors in the forward FuncGraph or one of its outer graphs in case custom_gradients are involved. Note that we already created placeholders for incoming grads so those are not captured. When building the gradient PartitionedCall op, this external capture will be replaced with a Placeholder in the current graph if the capture is not already in the current graph. The external capture is now a capture in the current graph (graph containing the gradient PartitionedCall). There are a few cases in the resolution: + 1. The external tensor is in one of the outer graphs of the current graph. In this case nothing needs to be done. + 1. The external tensor is not in the current hierarchy. + 1. If it is in the forward graph it gets [added](https://github.com/tensorflow/tensorflow/blob/6a70aa6d438259cabd23c09808db4cf51a2e5377/tensorflow/python/eager/function.py#L688) to the list of outputs and the forward op is [updated](https://github.com/tensorflow/tensorflow/blob/6a70aa6d438259cabd23c09808db4cf51a2e5377/tensorflow/python/eager/function.py#L715) with new outputs and this tensor is [resolved](https://github.com/tensorflow/tensorflow/blob/6a70aa6d438259cabd23c09808db4cf51a2e5377/tensorflow/python/eager/function.py#L723-L728) to an op output. + 1. If it is in an outer graph of the forward graph, nothing needs to be done (yet). + 1. If it is in an inner graph of the forward graph, an error is raised (this should never happen). +1. FuncGraph.structured_outputs is populated with the structured tensors(containing CompositeTensors, IndexedSlices etc.). FuncGraph.outputs is built by flattening the structure and CompositeTensors in structured_outputs and by removing any Nones. + + +### tf.cond/tf.switch_case + +**Forward** + + + +1. Build graphs for branch functions. +1. 
Find the superset of input tensors needed by all branch functions and update signatures of all branch functions so that they [match](https://github.com/tensorflow/tensorflow/blob/f540109342f8b7cb9b96163dae455013249c3128/tensorflow/python/ops/cond_v2.py#L494) by creating dummy placeholders. This requires resetting FuncGraph.inputs and FuncGraph.captures. + 1. Supporting this would require either ResetInputs, ResetCaptures APIs or adding new If/Case ops that don’t need this signature matching (b/143286622). + 1. Another option is to not support resetting inputs and captures at all and let the consumers take care of this when generating the FunctionDef. However this would mean that the FunctionDef would not match the FuncGraph which may cause problems in [gradient computation](https://github.com/tensorflow/tensorflow/blob/f540109342f8b7cb9b96163dae455013249c3128/tensorflow/python/ops/cond_v2.py#L109) which use the forward cached FuncGraph and expects the forward op’s FunctionDef to be generated 1-1 from the forward FuncGraph. + +**Backward** + + + +1. Build the grad func for each branch using tf.gradients. +1. Similar to forward pass, add dummy inputs to make input signatures match. +1. Any needed intermediates in the forward graph are wrapped in Optionals and are added to the list of forward graph [outputs](https://github.com/tensorflow/tensorflow/blob/f540109342f8b7cb9b96163dae455013249c3128/tensorflow/python/ops/cond_v2.py#L151-L152). +1. Similar to tf.function, we resolve any external captures to the forward op’s outputs. + + +### tf.while_loop + +**Forward** + + + +1. Build the [cond](https://github.com/tensorflow/tensorflow/blob/c29529aa7d55bc66b040917a36acdb5722231043/tensorflow/python/ops/while_v2.py#L141) FuncGraph using a signature built from the input loop vars. Cond function can capture external tensors which show up in cond_graph.external_captures. +1. Build the [body](https://github.com/tensorflow/tensorflow/blob/c29529aa7d55bc66b040917a36acdb5722231043/tensorflow/python/ops/while_v2.py#L186) FuncGraph using the same signature as the cond. However in the body function [capture](https://github.com/tensorflow/tensorflow/blob/c29529aa7d55bc66b040917a36acdb5722231043/tensorflow/python/ops/while_v2.py#L162-L165) the external captures of cond first. At this point the full signature, i.e. original input signature with loop vars + captures, matches in cond and body. + 1. The explicit capture is needed here to make the signatures of cond and body to match. This can be avoided if we allow signatures of cond and body to diverge. +1. Now body_graph has some extra external captures. These are captured in the [cond_graph](https://github.com/tensorflow/tensorflow/blob/c29529aa7d55bc66b040917a36acdb5722231043/tensorflow/python/ops/while_v2.py#L206-L213). So in effect the external captures of body cond_graph and body_graph are effectively cond-graph-captures + body-graph-captures. + +**Backward** + + + +1. Build the gradient graph for the forward graph just like for other functional ops. +1. Since a while loop can run for multiple iterations, if the backwards pass needs to capture a forward tensor there are two cases: + 1. If the tensor’s value potentially varies across iterations, in the forward graph the tensor is [accumulated](https://github.com/tensorflow/tensorflow/blob/c29529aa7d55bc66b040917a36acdb5722231043/tensorflow/python/ops/while_v2.py#L1012) in a TensorList (think: stack). 
Note: now the forward op has an extra input, the empty stack, and an extra output which contains the list of values of the tensor in multiple iterations. The forward graph stack is captured in the backward graph and a value is popped from it to use as the intermediate value for that tensor. + 1. If the tensor’s value is invariant across loop iterations, we directly [capture](https://github.com/tensorflow/tensorflow/blob/c29529aa7d55bc66b040917a36acdb5722231043/tensorflow/python/ops/while_v2.py#L978) the forward tensor in the backward graph. + + +### Autograph + +FuncGraph is used as a temporary graph to evaluate the type of a while loop’s conditional expression. See [while_stmt](https://github.com/tensorflow/tensorflow/blob/6a70aa6d438259cabd23c09808db4cf51a2e5377/tensorflow/python/autograph/operators/control_flow.py#L739). Created ops, if any, are discarded immediately - we only need to test whether the expression evaluates to a Tensor or not, and if a tf.while_loop is created, they will be created again by the while_loop itself. + +This might not require a FuncGraph - any regular graph is suitable for this purpose. + + +### Serialization/SavedModel + +Serialization + + + +1. The Trackable object graph is crawled to find all functions. An error is raised if trying to save an unsaveable FuncGraph. + 1. FuncGraph has a `_saveable` property which is used to denote whether a FuncGraph can be saved to a SavedModel. This seems to have only [one usage](https://github.com/tensorflow/tensorflow/blob/99f0e90b384cfb255103a8965bec0d11a7995e49/tensorflow/python/keras/backend.py#L311) right now in Keras to mark functions that capture the symbolic learning phase to be unsaveable. +1. For every ConcreteFunction + 1. Its captured non-resource non-variant tensors are [converted](https://github.com/tensorflow/tensorflow/blob/23275fb35cf17482d147f88ce7d8f4ce9c2376f3/tensorflow/python/saved_model/save.py#L280-L298) into graph constants. + 1. The graph is converted to a [FunctionDef](https://github.com/tensorflow/tensorflow/blob/23275fb35cf17482d147f88ce7d8f4ce9c2376f3/tensorflow/python/saved_model/save.py#L593) and is written to the MetaGraphDef graph’s function library. + 1. An [entry](https://github.com/tensorflow/tensorflow/blob/99f0e90b384cfb255103a8965bec0d11a7995e49/tensorflow/core/protobuf/saved_object_graph.proto#L32) is added to the object graph proto which stores the node ids of the captured inputs in the object graph and the input/output structures. +1. To enable loading the SavedModel with Sessions, placeholders are [created](https://github.com/tensorflow/tensorflow/blob/23275fb35cf17482d147f88ce7d8f4ce9c2376f3/tensorflow/python/saved_model/save.py#L341) in the graph for non-captured inputs. Then a (Stateful)PartitionedCall op is created in the graph, by feeding the placeholders + constants as inputs to the call op. A SignatureDef is then created for the call op and added to the MetaGraphDef. + 1. This requires access to FuncGraph.inputs, captures and external_captures and assumes that placeholders for captures are at the rear of FuncGraph.inputs. + +Deserialization + + + +1. Concrete functions are [created](https://github.com/tensorflow/tensorflow/blob/99f0e90b384cfb255103a8965bec0d11a7995e49/tensorflow/python/saved_model/load.py#L113-L115) for all graph library functions. + 1. This probably instantiates ConcreteFunctions for non-top-level functions as well. Is that necessary? +1. 
The captures map is initialized by using the [bound_inputs](https://github.com/tensorflow/tensorflow/blob/99f0e90b384cfb255103a8965bec0d11a7995e49/tensorflow/core/protobuf/saved_object_graph.proto#L107) field of the SavedConcreteFunction proto. + 1. This makes a call to [replace_capture](https://github.com/tensorflow/tensorflow/blob/99f0e90b384cfb255103a8965bec0d11a7995e49/tensorflow/python/saved_model/load.py#L184) and then a separate call to [capture](https://github.com/tensorflow/tensorflow/blob/99f0e90b384cfb255103a8965bec0d11a7995e49/tensorflow/python/saved_model/load.py#L200). This is done because we already have the internal placeholders created and we just need to update the captures map. The call to FuncGraph.capture records the capture on the tape. +1. Input/output structures are [initialized](https://github.com/tensorflow/tensorflow/blob/99f0e90b384cfb255103a8965bec0d11a7995e49/tensorflow/python/saved_model/load.py#L155-L157). + 1. Seems like structured_outputs only contains the structure but not really the tensors e.g. in the original FuncGraph.structured_outputs. diff --git a/rfcs/20191206-tensorflow-lattice-v2.md b/rfcs/20191206-tensorflow-lattice-v2.md new file mode 100644 index 000000000..64a160af1 --- /dev/null +++ b/rfcs/20191206-tensorflow-lattice-v2.md @@ -0,0 +1,440 @@ +# TensorFlow Lattice 2.0 + +| Status | Accepted | +| :------------ | :------------------------------------------------------ | +| **RFC #** | [186](https://github.com/tensorflow/community/pull/186) +| **Author(s)** | Mahdi Milani Fard (mmilanifard@google.com), Oleksandr Mangylov (amangy@google.com) | +| **Sponsor** | Zhenyu Tan (tanzheny@google.com), Karmel Allison (karmel@google.com) | +| **Updated** | 2020-04-15 | + +## Objective + +TensorFlow Lattice (TFL) is an implementation of +[Deep Lattice Networks](https://arxiv.org/abs/1709.06680) in TensorFlow. Using +TFL, one can create models with guaranteed shape constraints such as +monotonicity with respect to a set of features. TFL was open sourced in 2017 +(https://github.com/tensorflow/lattice) and was based on TF 1.x. This RFC covers +the goals and design details of TFL 2.0. + +The main objectives of TFL 2.0 include: + +* TF library with support for both TF 2.x eager mode and TF 1.x compatibility + and graph mode +* Keras Layer API for lattice and calibration functions +* Easy to construct canned estimators for typical model architectures +* No degradation in functionality, accuracy, training speed or evaluation + speed compared to TFL 1.x +* Support for new + [shape constraints](http://proceedings.mlr.press/v97/cotter19a.html) + including convexity, concavity, unimodality and pair-wise trust +* Easy plotting and inspection +* Clear and useful API docs and examples + +Stretch goals and future work: + +* Faster training and evaluation compared to TFL 1.x +* More accurate models using better projection algorithms +* Support for GPU/TPU and tf.distribute.Strategy +* Premade Keras models with multi-phase training +* Exposing visualization tools in TensorBoard + +## Motivation + +TensorFlow Lattice is a library that implements fast-to-evaluate and +interpretable (optionally monotonic) lattice based models. The library includes +layers and canned estimators that can enforce shape constraints such as +monotonicity on the model function. Such constraints can encode policy +considerations as well as common-sense and domain specific knowledge about the +underlying problem. 
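+
+As an illustration only (plain NumPy, not the TFL API), a monotonicity shape constraint for a one-dimensional piecewise-linear calibrator simply means that non-decreasing output keypoints yield a function that is non-decreasing in its input:
+
+```python
+import numpy as np
+
+# Fixed input keypoints and (hypothetical) learned output keypoints.
+input_keypoints = np.array([0.0, 1.0, 2.0, 3.0])
+output_keypoints = np.array([0.0, 0.2, 0.7, 1.0])  # non-decreasing
+
+def calibrate(x):
+    # Piecewise-linear interpolation between the keypoints.
+    return np.interp(x, input_keypoints, output_keypoints)
+
+xs = np.linspace(0.0, 3.0, num=50)
+assert np.all(np.diff(calibrate(xs)) >= 0.0)  # monotonic in the input
+```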
+ +The currently open sourced TF implementation of these layers is based on TF 1.x +and lacks eager support. It also does not provide a Keras layers API and can be +difficult to use when building custom models. The reliance on custom ops makes +the library difficult to maintain as it requires compiling the native code for +several platforms. Users outside supported platforms (e.g. Windows users) have +to compile the library on their own, which in effect limits the library's user +base. + +As part of the transition to TF 2.x, we plan to move TFL under +[TF libraries & extensions](https://www.tensorflow.org/resources/libraries-extensions) +for better visibility and integration with the rest of the TF ecosystem. + +## User Benefit + +There have been several requests from the open source community to add eager and +Keras support to TFL. We aim to address these requests and help avoid issues +commonly encountered with TFL 1.x. In particular: + +* Switching away from custom ops (which require compiling the library for + several platforms) makes it possible to ship the OSS releases more + frequently. It also helps with many of the compatibility issues with + non-supported platforms. +* Using Keras native constraint handling avoids the need to manually apply + projections. This has been the source of a lot of headache and reported + issues for TFL users. +* Implementation as Keras layers opens up the opportunity to mix and match + with other Keras layers and use TFL within Keras models. +* Reworked canned estimators provide easier setup and better control over the + model structure. +* Additional shape constraint types offer more power and control to users. + +## Design Proposal + +We construct TFL 2.0 using three levels of abstraction: + +* A low-level TF-only library that implements the basics of interpolation and + projection logic. +* Keras layers that wrap the low-level library. +* Canned estimators and premade Keras models that use TFL Keras layers. + +### Low-level TF library + +The low-level TF library implements: + +* Lattice interpolation, constraint projections and regularizers. +* Piecewise-linear interpolation and constraint projections. +* Partial monotonicity projection for categorical calibration. +* Projections for monotonic linear functions. + +The interpolation code for lattices and calibrators will be implemented in core +TF. It uses basic XLA compatible TF ops. In contrast TFL 1.x uses custom TF ops +with C++ implementation for interpolation. Open source binary release of the +library with custom ops requires compiling on various platforms with frequent +hard-to-fix breakage caused by slight changes in infrastructure used for the +release (TF core, Bazel, release infrastructure, etc). The ops are also +difficult to extend and optimize for newer hardware (e.g. TPUs). We thus want to +avoid using custom ops in TFL 2.0 while maintaining similar or better +performance. + +The low-level TF library is 1.x and 2.x compatible, works both in graph and +eager mode and can be wrapped by higher-level APIs such as Keras. + +### Keras Layer API + +The Keras Layer API for TFL 2.0 mimics that of other Keras Layers, with the +addition of the shape constraints and layer-specific regularization. In +particular we want to: + +* In the layer constructor, use the same parameter names as standard Keras + layers. +* Use standard Keras initializer objects, with short-hand string valued + alternatives. +* Use standard Keras regularizer objects, with short-hand string-value pair + alternatives. 
+* Use standard Keras constraint handling for both strict and partial + projections. + * Strict projections are handled by Keras standard constraints. + * Partial projections are handled by Keras standard constraints, followed + by a final projection through explicit calls in Keras model callbacks or + with estimator training hooks. + +#### Linear Layer + +This layer applies a linear transformation to the input tensor with an optional +bias term. It supports monotonicity and fixed-norm constraints. + +```python +linear = tfl.linear_layer.Linear( + num_input_dims=8, + # Monotonicity constraints can be defined per dimension or for all dims. + monotonicities='increasing', + use_bias=True, + # You can force the L1 norm to be 1. Since this is a monotonic layer, + # the coefficients will sum to 1, making this a “weighted average”. + normalization_order=1, +) +``` + +#### Piecewise-Linear Calibration Layer + +This layer applies a piecewise-linear (PWL) function to the input tensor. PWL +keypoint inputs are fixed and passed to the layer constructor. They are +typically set to the quantiles of the input, or are uniformly spaced in the +input range. + +![PWL Calibration](20191206-tensorflow-lattice-v2/pwl.png) + +This layer supports monotonicity, convexity, concavity and bound constraints. + +```python +calibrator = tfl.pwl_calibration_layer.PWLCalibration( + # Key-points of piecewise-linear function. + input_keypoints=np.linspace(1., 4., num=4), + # Output can be bounded, e.g. when this layer feeds into a lattice. + output_min=0.0, + output_max=2.0, + # You can specify monotonicity and other shape constraints for the layer. + monotonicity='increasing', + # You can specify TFL regularizers as tuple ('regularizer name', l1, l2). + # You can also pass any keras Regularizer object. + kernel_regularizer=('hessian', 0.0, 1e-4), +) +``` + +#### Categorical Calibration Layer + +This layer maps integer-valued input categories to float output. + +![Categorical Calibration](20191206-tensorflow-lattice-v2/cat.png) + +This layer supports partial ordering and bound constraints. + +```python +calibrator = tfl.categorical_calibration_layer.CategoricalCalibration( + # Number of categories, including oov buckets and default values. + num_buckets=3, + # Output can be bounded, e.g. when this layer feeds into a lattice. + output_min=0.0, + output_max=2.0, + # Categorical monotonicity can be a partial order. + # output(0) <= output(1) and output(0) <= output(2). + monotonicities=[(0, 1), (0, 2)], +) +``` + +#### Lattice Layer + +A lattice is an interpolated look-up table that can approximate arbitrary +input-output relationships in the data. It overlaps a regular grid onto the +input space and learns values for the output in the vertices of the grid. For a +test point *x*, *f(x)* is linearly interpolated from the lattice values +surrounding *x*. + +![Lattice](20191206-tensorflow-lattice-v2/lattice.png) + +This layer support monotonicity, unimodality, +[trust](http://proceedings.mlr.press/v97/cotter19a.html) and bound constraints. + +```python +lattice = tfl.lattice_layer.Lattice( + # Number of vertices along each dimension. + lattice_sizes=[2, 2, 3, 4], + # You can specify monotonicity constraints. + monotonicities=['increasing', 'none', 'increasing', 'increasing'], + # You can specify trust constraints between pairs of features. Here we + # constrain the function to be more responsive to a main feature (index 4) + # as a secondary conditional feature (index 3) increases (direction 1). 
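+    # (That is, each trust constraint packs (main feature index, conditional
+    # feature index, direction) into one tuple, matching the description in
+    # the comment above.)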
+ edgeworth_trusts=(3, 2, 1), + # Output can be bounded. + output_min=0.0, + output_max=1.0 +) +``` + +#### MultiCalibration Layer + +This layer concatenates multiple calibrators to act on a single +multi-dimensional input. This helps with construction of sequential models. + +```python +model = tf.keras.models.Sequential() + +all_calibrators = tfl.lattice_layer.MultiCalibration() +all_calibrators.append(tfl.pwl_calibration_layer.PWLCalibration(...)) +all_calibrators.append(tfl.pwl_calibration_layer.PWLCalibration(...)) +all_calibrators.append(tfl.pwl_calibration_layer.CategoricalCalibration(...)) + +lattice = tfl.lattice_layer.Lattice(...) + +model.add(all_calibrators) +model.add(lattice) +model.compile(...) +model.fit(...) +``` + +#### Projection Handling + +By default, TFL applies a full projection into constraints after every gradient +update. This makes sure that all the specified constraints are satisfied after +every update to the model parameters. Alternatively you can apply faster partial +projections for each batch and a final strict projection at the end of training. + +```python +lattice = tfl.lattice_layer.Lattice( + ..., + monotonic_at_every_step=False, +) +... (train) ... +lattice.finalize_constraints() +``` + +The final projection can be automatically handled by high level APIs (e.g. +callbacks in model.fit or training hooks in estimators) or manually in a custom +training setup. + +### High Level Canned Estimator API + +TFL 2.0 provides v2 canned estimators with several model structures. These +include: + +* Calibrated linear (generalized additive) +* Calibrated lattice +* Ensemble of calibrated lattices + +To allow the user to define various shape constraints, regularizers and model +structures, the library provides a **configs** API. To construct a canned model, +the user first creates a model config that specifies the model structure and +various constraints about the model shape for each input feature. + +```python +# Configuration for a lattice ensemble with output calibration. +model_config = tfl.configs.CalibratedLatticeEnsembleConfig( + num_lattices=6, # number of lattices + lattice_rank=5, # number of features in each lattice + output_calibration=True, + + # Optional per feature configuration. + feature_configs=[ + # Numeric feature with PWL calibration. + # Feature type is inferred from the corresponding feature column. + tfl.configs.FeatureConfig( + name='age', + lattice_size=3, + # Model output must be monotonically increasing w.r.t. this feature. + monotonicity='increasing', + # Per feature regularization. + regularizer_configs=[ + tfl.configs.RegularizerConfig(name='calib_hessian', l2=1e-4), + ], + ), + # Categorical feature. + # Feature type and vocab list is inferred from the input feature column. + tfl.configs.FeatureConfig( + name='thal', + # Partial monotonicity: + # output(normal) <= output(fixed) + # output(normal) <= output(reversible) + monotonicity=[('normal', 'fixed'), ('normal', 'reversible')], + ), + ], + + # Global regularizers + regularizer_configs=[ + # Regularizer applied to all calibrators. + tfl.configs.RegularizerConfig(name='calib_wrinkle', l2=1e-4), + # Regularizer applied to the lattice. + tfl.configs.RegularizerConfig(name='torsion', l2=1e-4), + # Regularizer for the output calibrator. + tfl.configs.RegularizerConfig(name='output_calib_hessian', l2=1e-4), + ], +) +``` + +PWL calibration requires a list of input keypoint values. 
If explicit keypoints +are not provided, keypoints are set to be the quantiles of the features and are +calculated using an auxiliary input_fn (with 1 epoch or a subsample of the data) +passed to the estimator constructor. Feature types and categorical vocabulary +list can be inferred from the feature columns passed to the estimator. + +```python +estimator = tfl.estimators.CannedClassifier( + feature_columns=feature_columns, # same as any other estimator + model_config=model_config, # defines model and feature configs + feature_analysis_input_fn=make_input_fn(num_epochs=1, ...)) +estimator.train(input_fn=make_input_fn(num_epochs=100, ...)) +``` + +Premade Keras models will also be provided with the library, either in the +initial release or in future updates. + +### Visualization + +To help better analyze and debug TFL canned models, we implement model specific +visualization tools not already supported by +[TFMA](https://www.tensorflow.org/tfx/model_analysis/get_started) or other +standard TF analysis tools. These include calibrator plots and model graphs for +our specific canned estimators. + +#### Calibrator plots + +TFL 2.0 supports extracting calibrator parameters from saved models and plotting +them either individually or altogether. + +![Calibration Plots](20191206-tensorflow-lattice-v2/calib.png) + +#### Plotting model structure + +The model structure and all layer parameters can be extracted from a saved model +and plotted in a schematic graph. This is similar to Keras model plotting. + +![Model Graph](20191206-tensorflow-lattice-v2/graph.png) + +We plan to expose these in a TFL TensorBoard extension in future launches. + +### Alternatives Considered + +* **Custom interpolation kernels in C++:** although they might provide better + performance on specific backends, we decided that the potential gains are + not enough to counter the maintenance difficulties of binary packages. They + also make XLA compilation difficult. +* **Supporting v1 Estimators:** As suggested by the TF/Keras teams, we only + support estimator v2 versions as v1 support is not adding much value either + for new users or those migrating from TFL 1.x. +* **Using hparams instead of a custom configs library:** hparams is going away + in TF 2.x and each library is expected to implement its own version. hparams + by design has a flat structure, hence making it cumbersome to represent + hierarchies required for model configuration. + +### Performance Implications + +There are several end-to-end benchmarking examples that help us measure and +optimize the performance both in training and evaluation. Our current estimate +suggests that compared to TFL 1.x there is no significant regression in the +training or evaluation speed on the standard TF runtime. A separate tf/compile +benchmark shows significant improvement in evaluation time compared to the TFL +1.x. Full XLA compilation is not possible with TFL 1.x as it uses custom ops. + +### Dependencies + +The core library depends on TF and a handful of commonly used open source python +libraries (numpy, six, etc). The estimator has a minor dependency on the +feature_column module, and the rest of the library is based on the public TF +API. + +### Engineering Impact + +Since the new library is using core TF ops, the library size should be smaller +than TFL 1.x. However, since the model ends up having a lot more ops in the +graph mode, the startup time can be longer and can take more memory. We are +investigating various options to improve the performance over time. 
+ +### Platforms and Environments + +The new library works on all platforms supported by TensorFlow. + +**Evaluation:** We use basic and simple TF ops in the evaluation path, making +the serving saved model fully XLA compatible. We also have a tf/compile test +bench for AOT compilation. + +**Training:** The training is mostly XLA compatible. Training on TPU is +possible, but can be slow to converge. Further updates and optimization to the +library for better TPU support is planned for future launches. + +### Best Practices, Tutorials and Examples + +Several examples and tutorials on public datasets will be available with the +library. Colabs will also be provided for a quick overview of the library. + +### Compatibility + +TFL 2.0: + +* Not backwards compatible with TFL 1.x, but a migration should be + straightforward. +* XLA compatible and can run on TPU/GPU/CPU. +* Has some convergence issues when training with TPU distribution strategies + due to heavy use of constraints. We plan to improve this in future launches. +* Can be used with tf/compile AOT. +* Supports Estimator v2 and SavedModel format. +* Layers are eager compatible. + +### User Impact + +We will release migration guides to help current TFL users switch to the new +library. + +## Questions and Discussion Topics + +* What other functionalities or shape constraints would be good additions to + new library? +* Which computational platforms should the library be optimized for? diff --git a/rfcs/20191206-tensorflow-lattice-v2/calib.png b/rfcs/20191206-tensorflow-lattice-v2/calib.png new file mode 100644 index 000000000..c11c0f27c Binary files /dev/null and b/rfcs/20191206-tensorflow-lattice-v2/calib.png differ diff --git a/rfcs/20191206-tensorflow-lattice-v2/cat.png b/rfcs/20191206-tensorflow-lattice-v2/cat.png new file mode 100644 index 000000000..c355eb5ca Binary files /dev/null and b/rfcs/20191206-tensorflow-lattice-v2/cat.png differ diff --git a/rfcs/20191206-tensorflow-lattice-v2/graph.png b/rfcs/20191206-tensorflow-lattice-v2/graph.png new file mode 100644 index 000000000..757ba636b Binary files /dev/null and b/rfcs/20191206-tensorflow-lattice-v2/graph.png differ diff --git a/rfcs/20191206-tensorflow-lattice-v2/lattice.png b/rfcs/20191206-tensorflow-lattice-v2/lattice.png new file mode 100644 index 000000000..9d9915ac3 Binary files /dev/null and b/rfcs/20191206-tensorflow-lattice-v2/lattice.png differ diff --git a/rfcs/20191206-tensorflow-lattice-v2/pwl.png b/rfcs/20191206-tensorflow-lattice-v2/pwl.png new file mode 100644 index 000000000..0c8750fa5 Binary files /dev/null and b/rfcs/20191206-tensorflow-lattice-v2/pwl.png differ diff --git a/rfcs/20191212-keras-categorical-inputs.md b/rfcs/20191212-keras-categorical-inputs.md new file mode 100644 index 000000000..f0eeb948f --- /dev/null +++ b/rfcs/20191212-keras-categorical-inputs.md @@ -0,0 +1,542 @@ +# Keras categorical inputs + +| Status | Implemented (https://github.com/tensorflow/community/pull/209) | +:-------------- |:---------------------------------------------------- | +| **Author(s)** | Zhenyu Tan (tanzheny@google.com), Francois Chollet (fchollet@google.com)| +| **Sponsor** | Karmel Allison (karmel@google.com), Martin Wicke (wicke@google.com) | +| **Updated** | 2019-02-22 | + +## Objective + +This document proposes 5 new Keras preprocessing layers (KPL) (`StringLookup`, `CategoryCrossing`, `CategoryEncoding`, `Hashing`, `IntegerLookup`) and allow users to: +* Perform basic feature engineering for categorical inputs +* Replace feature columns and 
`tf.keras.layers.DenseFeatures` with proposed layers +* Introduce sparse inputs that work with Keras linear models and other layers that support sparsity + +Other proposed layers for replacement of feature columns such as `tf.feature_column.bucketized_column` and `tf.feature_column.numeric_column` has been discussed [here](https://github.com/keras-team/governance/blob/master/rfcs/20190502-preprocessing-layers.md). + +The proposed layers should support ragged tensors. + +## Motivation + +Specifically, by introducing the 5 layers, we aim to address these pain points: +* Users have to define both feature columns and Keras Inputs for the model, resulting in code duplication and deviation from DRY (Do not repeat yourself) principle. See this [Github issue](https://github.com/tensorflow/tensorflow/issues/27416). +* Users with large dimension categorical inputs will incur large memory footprint and computation cost, if wrapped with indicator column through `tf.keras.layers.DenseFeatures`. +* Currently there is no way to correctly feed Keras linear model or dense layer with multivalent categorical inputs or weighted categorical inputs, or shared embedding inputs. +* Feature columns offer black-box implementations, mix feature engineering with trainable objects, and lead to + unintended coding pattern. + +## User Benefit + +We expect to get rid of the user painpoints once migrating off feature columns. + +## Example Workflows + +Two example workflows are presented below. These workflows can be found at this [colab](https://colab.sandbox.google.com/drive/1cEJhSYLcc2MKH7itwcDvue4PfvrLN-OR). + +### Workflow 1 -- Official guide on how to replace feature columns with KPL + +Refer to [tf.feature_column](https://www.tensorflow.org/api_docs/python/tf/feature_column) for a complete list of feature columns. + +1. Replacing `tf.feature_column.categorical_column_with_hash_bucket` with `Hashing` +from +```python +tf.feature_column.categorical_column_with_hash_bucket(key, hash_bucket_size) +``` +to +```python +keras_input = tf.keras.Input(shape=(1,), name=key, dtype=dtype) +hashed_input = tf.keras.experimental.preprocessing.Hashing(num_bins=hash_bucket_size)(keras_input) +``` + +Note the hashed output from KPL will be different than the hashed output from feature column, given how seed is choosen. `Hashing` also supports customized `salt`. + +2. `tf.feature_column.categorical_column_with_identity` +This feature column is merely for having identical inputs and outputs except mapping out-of-range value into `default_value`, thus can easily be done at data cleaning stage, +not be part of feature engineering, and hence dropped in this proposal. + +3. Replacing `tf.feature_column.categorical_column_with_vocabulary_file` and `tf.feature_column.categorical_column_with_vocabulary_list` with `StringLookup` or `IntegerLookup`. 
+for string inputs, +from +```python +tf.feature_column.categorical_column_with_vocabulary_file(key, vocabulary_file, vocabulary_size, tf.dtypes.string, default_value, num_oov_buckets) +``` +to +```python +keras_input = tf.keras.Input(shape=(1,), name=key, dtype=tf.dtypes.string) +id_input = tf.keras.experimental.preprocessing.StringLookup(max_tokens=vocabulary_size + num_oov_buckets, + num_oov_indices=num_oov_buckets, mask_token=None, vocabulary=vocabulary_file)(keras_input) +``` + +Similarly, from +```python +tf.feature_column.categorical_column_with_vocabulary_list(key, vocabulary_list, tf.dtypes.string, default_value, num_oov_buckets) +``` +to +```python +keras_input = tf.keras.Input(shape=(1,), name=key, dtype=tf.dtypes.string) +id_input = tf.keras.experimental.preprocessing.StringLookup(max_tokens=len(vocabulary_list) + num_oov_buckets, num_oov_indices=num_oov_buckets, + mask_token=None, vocabulary=vocabulary_list)(keras_input) +``` + + +Note that `default_value` is mutually exclusive with `num_oov_buckets`, in the case of `num_oov_buckets=0` and `default_value=-1`, simply set `num_oov_indices=0`. We do not support +any values other than `default_value=-1`. + +Note the out-of-range values for `StringLookup` is prepended, i.e., [0,..., num_oov_tokens) for out-of-range values, whereas for `categorical_colulmn_with_vocabulary_file` is +appended, i.e., [vocabulary_size, vocabulary_size + num_oov_tokens) for out-of-range values. The former can give you more flexibility when reloading and adding vocab. + +For integer inputs, +from +```python +tf.feature_column.categorical_column_with_vocabulary_file(key, vocabulary_file, vocabulary_size, tf.dtypes.int64, default_value, num_oov_buckets) +``` +to +```python +keras_input = tf.keras.Input(shape=(1,), name=key, dtype=tf.dtypes.int64) +id_input = tf.keras.experimental.preprocessing.IntegerLookup(max_values=vocabulary_size + num_oov_buckets, num_oov_indices=num_oov_buckets, mask_value=None, vocabulary=vocabulary_file)(keras_input) +``` + +Similarly, from +```python +tf.feature_column.categorical_column_with_vocabulary_list(key, vocabulary_list, tf.dtypes.int64, default_value, num_oov_buckets) +``` +to +```python +keras_input = tf.keras.Input(shape=(1,), name=key, dtype=tf.dtypes.int64) +id_input = tf.keras.experimental.preprocessing.IntegerLookup(max_values=len(vocabulary_list) + num_oov_buckets, num_oov_indices=num_oov_buckets, mask_value=None, vocabulary=vocabulary_list)(keras_input) +``` + + +4. Replacing `tf.feature_column.crossed_column` with `CategoryCrossing` or `Hashing` +from +```python +tf.feature_column.crossed_column(keys, hash_bucket_size, hash_key) +``` +to +```python +keras_inputs = [] +for key in keys: + keras_inputs.append(tf.keras.Input(shape=(1,), name=key, dtype=tf.dtypes.string)) +hashed_input = tf.keras.layers.experimental.preprocessing.Hashing(num_bins=hash_bucket_size)(keras_inputs) +``` + +Note when `hash_bucket_size=0`, no hashing is performed, in this case it should be replaced with: +```python +keras_inputs = [] +for key in keys: + keras_inputs.append(tf.keras.Input(shape=(1,), name=key, dtype=tf.dtypes.string)) +crossed_input = tf.keras.layers.experimental.preprocessing.CategoryCrossing()(keras_inputs) +``` + +5. Replacing `tf.feature_column.embedding_column` with `tf.keras.layers.Embedding` +Note that `combiner=sum` can be replaced with `tf.reduce_sum` and `combiner=mean` with `tf.reduce_mean` after +the embedding output. `sqrtn` can also be implemented using tf operations. 
For example: +```python +categorical_column = tf.feature_column.categorical_column_with_vocabulary_list(key, vocabulary_list) +tf.feature_column.embedding_column(categorical_column, dimension=dimension, combiner="sum", initializer=initializer, + max_norm=max_norm) +``` +can be replaced with: +```python +categorical_input = tf.keras.Input(name=key, dtype=tf.string) +id_input = tf.keras.layers.experimental.preprocessing.StringLookup(vocabulary=vocabulary_list)(categorical_input) +embedding_input = tf.keras.layers.Embedding(input_dim=len(vocabulary_list), output_dim=dimension, + embeddings_initializer=initializer, embeddings_constraint=tf.keras.constraints.MaxNorm(max_norm))(id_input) +embedding_input = tf.reduce_sum(embedding_input, axis=-2) +``` + +6. Replacing `tf.feature_column.indicator_column` with `CategoryEncoding` +from +```python +categorical_column = tf.feature_column.categorical_column_with_vocabulary_list(key, vocabulary_list) +tf.feature_column.indicator_column(categorical_column) +``` +to +```python +categorical_input = tf.keras.Input(name=key, dtype=tf.string) +id_input = tf.keras.layers.experimental.preprocessing.StringLookup(vocabulary=vocabulary_list)(categorical_input) +encoded_input = tf.keras.layers.experimental.preprocessing.CateogoryEncoding( + max_tokens=categorical_column.num_buckets, output_mode="count", sparse=True)(id_input) +``` + +Note that `CategoryEncoding` supports one-hot through `output_mode="binary"` as well. This is a much more +efficient approach than `tf.one_hot` + `tf.reduce_sum(axis=-2)` to reduce the multivalent categorical inputs. + +Note that by specifing `sparse` flag, the output can be either a `tf.Tensor` or `tf.SparseTensor`. + +7. Replacing `tf.feature_column.weighted_categorical_column` with `CategoryEncoding` +from +```python +categorical_column = tf.feature_column.categorical_column_with_vocabulary_list(key, vocabulary_list) +tf.feature_column.weighted_categorical_column(categorical_column, weight_feature_key) +``` +to +```python +categorical_input = tf.keras.Input(name=key, dtype=tf.string) +lookup_output = tf.keras.layers.experimental.preprocessing.StringLookup(vocabulary=vocabulary_list)(categorical_input) +weight_input = tf.keras.Input(shape=(1,), dtype=tf.float32, name=weight_feature_key) +weighted_output = tf.keras.layers.experimental.preprocessing.CategoryEncoding( + max_tokens=categorical_column.num_buckets)(lookup_output, weight_input) +``` + +8. Replacing `tf.feature_column.shared_embeddings` with a single `tf.keras.layers.Embedding`. +Similar to 5, but with multiple categorical inputs: +from +```python +watched_video_id = tf.feature_column.categorical_column_with_vocabulary_list('watched_video_id', video_vocab_list) +impression_video_id = tf.feature_column.categorical_column_with_vocabulary_list('impression_video_id', video_vocab_list) +tf.feature_column.shared_embeddings([watched_video_id, impression_video_id], dimension) +``` +to +```python +watched_video_input = tf.keras.Input(shape=(1,), name='watched_video_id', dtype=tf.int64) +impression_video_input = tf.keras.Input(shape=(1,), name='impression_video_id', dtype=tf.int64) +embed_layer = tf.keras.layers.Embedding(input_dim=len(video_vocab_list), output_dim=dimension) +embedded_watched_video_input = embed_layer(watched_video_input) +embedded_impression_video_input = embed_layer(impression_video_input) +``` + +9. Replacing `tf.estimator.LinearXXX` with `CategoryEncoding` and `tf.keras.experimental.LinearModel`. 
+LinearClassifier or LinearRegressor treats categorical columns by multi-hot, this can be replaced by encoding layer and Keras linear model, see Workflow 2 for details. + +10. Replacing `tf.feature_column.numeric_column` and `tf.feature_column.sequence_numeric_column` with `tf.keras.Input` and `Normalization`. +`tf.keras.layers.experimental.preprocessing.Normalization` with `set_weights` on mean and standard deviation. + +11. Replacing `tf.feature_column.sequence_categorical_xxx`. +Replacing `tf.feature_column.sequence_categorical_xxx` is similar to `tf.feature_column.categorical_xxx` except `tf.keras.Input` should take time dimension into +`input_shape` as well. + +12. Replacing `tf.feature_column.bucketized_column` with `Discretization`. +from +```python +source_column = tf.feature_column.numeric_column(key) +tf.feature_column.bucketized_column(source_column, boundaries) +``` +to +```python +keras_input = tf.keras.Input(shape=(1,), name=key, dtype=tf.float32) +bucketized_input = tf.keras.experimental.preprocessing.Discretization(bins=boundaries)(keras_input) +``` + + +### Workflow 2 -- Complete Example + +This example gives an equivalent code snippet to canned `LinearEstimator` [tutorial](https://www.tensorflow.org/tutorials/estimator/linear) on the Titanic dataset: + +Refer to this [colab](https://colab.sandbox.google.com/drive/1cEJhSYLcc2MKH7itwcDvue4PfvrLN-OR) to reproduce. + +```python +dftrain = pd.read_csv('https://storage.googleapis.com/tf-datasets/titanic/train.csv') +y_train = dftrain.pop('survived') + +STRING_CATEGORICAL_COLUMNS = ['sex', 'class', 'deck', 'embark_town', 'alone'] +INT_CATEGORICAL_COLUMNS = ['n_siblings_spouses', 'parch'] +NUMERIC_COLUMNS = ['age', 'fare'] + +keras_inputs = {} +keras_preproc_inputs = [] +for key in STRING_CATEGORICAL_COLUMNS: + keras_input = tf.keras.Input(shape=(1,), dtype=tf.string, name=key) + keras_inputs[key] = keras_input + vocab = dftrain[key].unique() + keras_preproc_input = tf.keras.layers.experimental.preprocessing.StringLookup(vocabulary=vocab, num_oov_indices=0, mask_token=None, name='lookup' + key)(keras_input) + keras_preproc_input = tf.keras.layers.experimental.preprocessing.CategoryEncoding(max_tokens=len(vocab), output_mode='count', sparse=True, name='encode' + key)(keras_preproc_input) + keras_preproc_inputs.append(keras_preproc_input) + +for key in INT_CATEGORICAL_COLUMNS: + keras_input = tf.keras.Input(shape=(1,), dtype=tf.int64, name=key) + keras_inputs[key] = keras_input + vocab = dftrain[key].unique() + keras_preproc_input = tf.keras.layers.experimental.preprocessing.IntegerLookup(vocabulary=vocab, num_oov_indices=0, mask_value=None, name='lookup' + key)(keras_input) + keras_preproc_input = tf.keras.layers.experimental.preprocessing.CategoryEncoding(max_tokens=len(vocab), output_mode='count', sparse=True, name='encode' + key)(keras_preproc_input) + keras_preproc_inputs.append(keras_preproc_input) + +for key in NUMERIC_COLUMNS: + keras_input = tf.keras.Input(shape=(1,), dtype=tf.float32, name=key) + keras_inputs[key] = keras_input + keras_preproc_inputs.append(keras_preproc_input) + +age_x_sex = tf.keras.layers.experimental.preprocessing.CategoryCrossing(name='age_x_sex_crossing')([keras_inputs['age'], keras_inputs['sex']]) +age_x_sex = tf.keras.layers.experimental.preprocessing.Hashing(num_bins=100, name='age_x_sex_hashing')(age_x_sex) +keras_output_age_x_sex = tf.keras.layers.experimental.preprocessing.CategoryEncoding(max_tokens=100, output_mode='count', sparse=True, name='age_x_sex_encoding')(age_x_sex) 
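+# The crossed age/sex feature above is hashed into 100 bins and then
+# count-encoded as a sparse tensor so that it can be fed to the linear model
+# together with the other sparse categorical inputs.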
+keras_preproc_inputs.append(keras_output_age_x_sex) + + +linear_model = tf.keras.experimental.LinearModel(units=1, kernel_initializer='zeros', activation='sigmoid') +linear_logits = linear_model(keras_preproc_inputs) +sorted_keras_inputs = tuple(keras_inputs[key] for key in sorted(keras_inputs.keys())) +model = tf.keras.Model(sorted_keras_inputs, linear_logits) + +model.compile('ftrl', 'binary_crossentropy', metrics=['accuracy']) + +df_dataset = tf.data.Dataset.from_tensor_slices((dict(dftrain), y_train)) +def encode_map(features, labels): + encoded_features = tuple(tf.expand_dims(features[key], axis=1) for key in sorted(features.keys())) + return (encoded_features, labels) +encoded_dataset = df_dataset.batch(32).map(encode_map) + +model.fit(encoded_dataset) +``` + +## Design Proposal + +```python +`tf.keras.layers.StringLookup` +StringLookup(PreprocessingLayer): +"""This layer transforms categorical inputs to index space. + If input is dense/sparse, then output is dense/sparse.""" + + def __init__(self, max_tokens=None, num_oov_indices=1, mask_token="", + oov_token="[UNK]", vocabulary=None, encoding=None, + invert=False, name=None, **kwargs): + """Constructs a IndexLookup layer. + + Args: + max_tokens: The maximum size of the vocabulary for this layer. If None, + there is no cap on the size of the vocabulary. Note that this vocabulary + includes the OOV and mask tokens, so the effective number of tokens is + (max_tokens - num_oov_indices - (1 if mask_token else 0)) + num_oov_indices: The number of out-of-vocabulary tokens to use; defaults to + 1. If this value is more than 1, OOV inputs are hashed to determine their + OOV value; if this value is 0, passing an OOV input will result in a '-1' + being returned for that value in the output tensor. (Note that, because + the value is -1 and not 0, this will allow you to effectively drop OOV + values from categorical encodings.) + mask_token: A token that represents masked values, and which is mapped to + index 0. Defaults to the empty string "". If set to None, no mask term + will be added and the OOV tokens, if any, will be indexed from + (0...num_oov_indices) instead of (1...num_oov_indices+1). + oov_token: The token representing an out-of-vocabulary value. Defaults to + "[UNK]". + vocabulary: An optional list of vocabulary terms, or a path to a text file + containing a vocabulary to load into this layer. The file should contain + one token per line. If the list or file contains the same token multiple + times, an error will be thrown. + encoding: The Python string encoding to use. Defaults to `'utf-8'`. + invert: If true, this layer will map indices to vocabulary items instead + of mapping vocabulary items to indices. + name: Name of the layer. + **kwargs: Keyword arguments to construct a layer. + + Input shape: + a string or int tensor of shape `[batch_size, d1, ..., dm]` + Output shape: + an int tensor of shape `[batch_size, d1, ..., dm]` + + Example: + >>> vocab = ["a", "b", "c", "d"] + >>> data = tf.constant([["a", "c", "d"], ["d", "z", "b"]]) + >>> layer = StringLookup(vocabulary=vocab) + >>> layer(data) + + """ + pass + + +`tf.keras.layers.IntegerLookup` +IntegerLookup(PreprocessingLayer): +"""This layer transforms categorical inputs to index space. + If input is dense/sparse, then output is dense/sparse.""" + + def __init__(self, max_values=None, num_oov_indices=1, mask_value=0, + oov_value=-1, vocabulary=None, invert=False, name=None, **kwargs): + """Constructs a IndexLookup layer. 
+ + Args: + max_values: The maximum size of the vocabulary for this layer. If None, + there is no cap on the size of the vocabulary. Note that this vocabulary + includes the OOV and mask values, so the effective number of values is + (max_values - num_oov_values - (1 if mask_token else 0)) + num_oov_indices: The number of out-of-vocabulary values to use; defaults to + 1. If this value is more than 1, OOV inputs are modulated to determine + their OOV value; if this value is 0, passing an OOV input will result in + a '-1' being returned for that value in the output tensor. (Note that, + because the value is -1 and not 0, this will allow you to effectively drop + OOV values from categorical encodings.) + mask_value: A value that represents masked inputs, and which is mapped to + index 0. Defaults to 0. If set to None, no mask term will be added and the + OOV values, if any, will be indexed from (0...num_oov_values) instead of + (1...num_oov_values+1). + oov_value: The value representing an out-of-vocabulary value. Defaults to -1. + vocabulary: An optional list of values, or a path to a text file containing + a vocabulary to load into this layer. The file should contain one value + per line. If the list or file contains the same token multiple times, an + error will be thrown. + invert: If true, this layer will map indices to vocabulary items instead + of mapping vocabulary items to indices. + name: Name of the layer. + **kwargs: Keyword arguments to construct a layer. + + Input shape: + a string or int tensor of shape `[batch_size, d1, ..., dm]` + Output shape: + an int tensor of shape `[batch_size, d1, ..., dm]` + + Example: + >>> vocab = [12, 36, 1138, 42] + >>> data = tf.constant([[12, 1138, 42], [42, 1000, 36]]) + >>> layer = IntegerLookup(vocabulary=vocab) + >>> layer(data) + + """ + pass + + +`tf.keras.layers.CategoryCrossing` +CategoryCrossing(PreprocessingLayer): +"""This layer transforms multiple categorical inputs to categorical outputs + by Cartesian product, and hash the output if necessary. + If any of the inputs is sparse, then all outputs will be sparse. Otherwise, all outputs will be dense.""" + + def __init__(self, depth=None, separator=None, name=None, **kwargs): + """Constructs a CategoryCrossing layer. + Args: + depth: depth of input crossing. By default None, all inputs are crossed into + one output. It can also be an int or tuple/list of ints. Passing an + integer will create combinations of crossed outputs with depth up to that + integer, i.e., [1, 2, ..., `depth`), and passing a tuple of integers will + create crossed outputs with depth for the specified values in the tuple, + i.e., `depth`=(N1, N2) will create all possible crossed outputs with depth + equal to N1 or N2. Passing `None` means a single crossed output with all + inputs. For example, with inputs `a`, `b` and `c`, `depth=2` means the + output will be [a;b;c;cross(a, b);cross(bc);cross(ca)]. + separator: A string added between each input being joined. Defaults to '_X_'. + name: Name to give to the layer. + **kwargs: Keyword arguments to construct a layer. 
+ + Input shape: a list of string or int tensors or sparse tensors of shape + `[batch_size, d1, ..., dm]` + + Output shape: a single string or int tensor or sparse tensor of shape + `[batch_size, d1, ..., dm]` + + Example: (`depth`=None) + If the layer receives three inputs: + `a=[[1], [4]]`, `b=[[2], [5]]`, `c=[[3], [6]]` + the output will be a string tensor: + `[[b'1_X_2_X_3'], [b'4_X_5_X_6']]` + """ + pass + +`tf.keras.layers.CategoryEncoding` +CategoryEncoding(PreprocessingLayer): +"""This layer transforms categorical inputs from index space to category space. + If input is dense/sparse, then output is dense/sparse.""" + + def __init__(self, max_tokens=None, output_mode="binary", sparse=False, name=None, **kwargs): + """Constructs a CategoryEncoding layer. + Args: + max_tokens: The maximum size of the vocabulary for this layer. If None, + there is no cap on the size of the vocabulary. + output_mode: Specification for the output of the layer. + Defaults to "binary". Values can be "binary", "count" or "tf-idf", + configuring the layer as follows: + "binary": Outputs a single int array per batch, of either vocab_size or + max_tokens size, containing 1s in all elements where the token mapped + to that index exists at least once in the batch item. + "count": As "binary", but the int array contains a count of the number + of times the token at that index appeared in the batch item. + "tf-idf": As "binary", but the TF-IDF algorithm is applied to find the + value in each token slot. + sparse: Boolean. If true, returns a `SparseTensor` instead of a dense + `Tensor`. Defaults to `False`. + name: Name to give to the layer. + **kwargs: Keyword arguments to construct a layer. + + Input shape: A int tensor of shape `[batch_size, d1, ..., dm-1, dm]` + Output shape: a float tensor of shape `[batch_size, d1, ..., dm-1, num_categories]` + + Example: + >>> layer = tf.keras.layers.experimental.preprocessing.CategoryEncoding( + ... max_tokens=4, output_mode="count") + >>> layer([[0, 1], [0, 0], [1, 2], [3, 1]]) + + """ + pass + +`tf.keras.layers.Hashing` +Hashing(PreprocessingLayer): +"""This layer transforms categorical inputs to hashed output. + If input is dense/sparse, then output is dense/sparse.""" + def __init__(self, num_bins, salt=None, name=None, **kwargs): + """Constructs a Hashing layer. + + Args: + num_bins: Number of hash bins. + salt: A single unsigned integer or None. + If passed, the hash function used will be SipHash64, with these values + used as an additional input (known as a "salt" in cryptography). + These should be non-zero. Defaults to `None` (in that + case, the FarmHash64 hash function is used). It also supports + tuple/list of 2 unsigned integer numbers, see reference paper for details. + name: Name to give to the layer. + **kwargs: Keyword arguments to construct a layer. + + Input shape: A single or list of string, int32 or int64 `Tensor`, + `SparseTensor` or `RaggedTensor` of shape `[batch_size, ...,]` + + Output shape: An int64 `Tensor`, `SparseTensor` or `RaggedTensor` of shape + `[batch_size, ...]`. If any input is `RaggedTensor` then output is + `RaggedTensor`, otherwise if any input is `SparseTensor` then output is + `SparseTensor`, otherwise the output is `Tensor`. + + Example: + >>> layer = tf.keras.layers.experimental.preprocessing.Hashing(num_bins=3) + >>> inp = [['A'], ['B'], ['C'], ['D'], ['E']] + >>> layer(inp) + + """ + pass + +``` + +### Alternatives Considered +An alternative is to provide solutions on top of feature columns. 
This will make user code slightly cleaner but far less flexible.
+
+### Performance Implications
+End-to-end benchmarks should be the same or faster than the feature column implementations.
+
+### Dependencies
+This proposal does not add any new dependencies.
+
+### Engineering Impact
+These changes add more layers and thus increase binary size and build time. They will not impact startup time.
+This code can be tested and maintained in its own buildable unit.
+
+### Platforms and Environments
+This proposal should work on all platforms and environments.
+
+### Best Practices, Tutorials and Examples
+This proposal does not change best engineering practices.
+
+### Compatibility
+No backward compatibility issues.
+
+### User Impact
+Users will migrate feature-column-based Keras modeling to preprocessing-layer-based Keras modeling, as the example workflows suggest.
+
+## Questions and Meeting Notes
+We'd like to gather feedback on `IndexLookup`; specifically, we propose migrating off the mutually exclusive `num_oov_buckets` and `default_value` and replacing them with `num_oov_tokens`.
+1. Naming for encoding vs. vectorize: encoding can mean many things, and vectorize seems too general. We will go with "CategoryEncoding".
+2. "mode" should be "count" or "avg_count", instead of "sum" and "mean".
+3. Rename "sparse_combiner" to "mode", which aligns with scikit-learn.
+4. Have a 'sparse_out' flag for the "CategoryEncoding" layer.
+5. Hashing -- we refer to hashing when we mean fingerprinting. Keep using "Hashing" for the layer name, but document how it relies on tf.fingerprint, and also provide an option for salt.
+6. Rename "CategoryLookup" to "IndexLookup".
+
+## Updates on 07/14/20
+Mark the RFC as completed; update the layer naming and arguments.
diff --git a/rfcs/20200107-tf-data-snapshot.md b/rfcs/20200107-tf-data-snapshot.md
new file mode 100644
index 000000000..a0f6c3358
--- /dev/null
+++ b/rfcs/20200107-tf-data-snapshot.md
@@ -0,0 +1,396 @@
+# tf.data Snapshot
+
+| Status        | Accepted                                                 |
+| :------------ | :------------------------------------------------------ |
+| **RFC #**     | [193](https://github.com/tensorflow/community/pull/193)  |
+| **Author(s)** | Frank Chen (frankchn@google.com), Rohan Jain             |
+|               | (rohanj@google.com)                                      |
+| **Sponsor**   | Jiri Simsa (jsimsa@google.com)                           |
+| **Updated**   | 2020-02-10                                               |
+
+## Objective
+
+With ever faster accelerators available in Cloud and hyperparameter tuning
+consuming larger chunks of accelerator time, TensorFlow users are increasingly
+finding that they don’t have enough CPU resources to keep up with these
+accelerators, leaving valuable accelerator resources idle.
+
+To alleviate this problem, we are proposing a `snapshot` API within `tf.data`,
+to allow users to transparently persist the output of their preprocessing
+pipeline to disk, and materialize the pre-processed data on a different training
+run.
+
+This API enables repeated preprocessing steps to be consolidated, and allows
+re-use of already processed data, trading off disk storage and network bandwidth
+for freeing up more valuable CPU resources and accelerator compute time.
+
+## Motivation
+
+Large TensorFlow users have indicated that they have complicated input
+processing pipelines which saturate their CPUs before saturating their
+accelerators (TPUs in particular).
Since they often experiment with +hyperparameter tuning or tweaks to existing model without affecting their input +pipeline, they are asking for ways to avoid similar repeated preprocessing of +data by either saving a dataset or caching it to disk. + +## User Benefit + +Users will be able to transparently persist partially or fully processed data +from `tf.data` input pipelines to disk or Cloud storage systems, and materialize +the pre-processed data during subsequent runs from the same pipeline. This will +cut down on the input pipeline processing overheads during second and subsequent +runs. + +## Design Proposal + +We propose that we add a new `snapshot` transformation to tf.data. To illustrate +the usage of the transformation, we can start with some sample code: + +```python +dataset = Dataset.list_files("/raw/data/*").shard(num_workers, i) +dataset = dataset.parallel_interleave(TFRecordDataset) +dataset = dataset.map(my_preprocessing_fn) +dataset = dataset.apply(tf.data.snapshot("/saved/data", options...)) +dataset = dataset.repeat() + +model = ... +model.fit(dataset) +``` + +As we can see, the end user simply has to add this transformation in order to +use this functionality. In essence, the transformation is similar to the +existing `tf.data.Dataset.cache`, with the key difference is being that, unlike +`cache`, `snapshot` is intended to re-used across different executions of the +same input pipelines. + +### Proposed API + +We are proposing the following API for the snapshot transformation. + +```python +def snapshot(path, + compression=None, + reader_fn=None, + writer_fn=None, + pending_snapshot_expiry_seconds=None): + pass # Implementation goes here. +``` + +1. `path`: Required. A directory where we want to save our snapshots and/or + read from a previously saved snapshot. + +1. `compression`: Optional. The type of compression to apply to the snapshot + written to disk. This will support `GZIP`, `SNAPPY` or None. Defaults to + AUTO. + +1. `reader_fn`: Optional. The input pipeline transformation specified by + `reader_fn` is executed when the snapshot detects that there is an existing, + valid snapshot available. + + `reader_fn` is a user specified function that accepts a single argument: + (1) a Dataset of Datasets, each representing a "splits" of elements of the + original dataset. The cardinality of the input dataset matches the + cardinality of the output of `writer_fn` (see below). The function should + return a Dataset of elements of the original dataset. + + A default `reader_fn` will look like the following: + + ```python + def default_reader_fn(datasets): + # shuffle the datasets splits + datasets = datasets.shuffle(NUM_DATASETS) + # read datasets in parallel and interleave their elements + return dataset.interleave(lambda x: x, num_parallel_calls=AUTOTUNE) + ``` + +1. `writer_fn`: Optional. The input pipeline specified by `writer_fn` is + executed when the snapshot op detects that there are no valid snapshots + and no other threads are currently attempting to write a snapshot. + + `writer_fn` is a user specified function that accepts a single argument: + (1) a Dataset of elements to be written out. The function should return + a Dataset of Datasets, each representing "splits" of elements of the + original dataset. The tf.data snapshot implementation will then persist + splits in parallel. 
+ + A default writer_fn will look like the following: + + ```python + def default_writer_fn(dataset): + # add a component with element index + dataset = dataset.enumerate() + # split input dataset in a round-robin fashion + return dataset.split(num_splits=NUM_CORES, key_fn=lambda i, _: i % NUM_CORE + ``` + +1. `pending_snapshot_expiry_seconds`: Optional. How long to wait (in seconds) + before the snapshot op considers a previously unfinished snapshot to be + stale and starts writing a snapshot from scratch again. Defaults to 86400 + seconds (1 day). + +#### Achieving Parallelism + +`reader_fn` and `writer_fn` will default to passing the dataset through unchanged +by default. In other words, the default implementation will result in +single-threaded reads and writes on snapshots. Parallelism can be achieved in +`writer_fn` by splitting up the dataset into multiple datasets, and using +`num_parallel_calls` in the `interleave` function of the `reader_fn`. + +#### Computing Graph Fingerprints + +Snapshot attempts to determine whether a run of an input pipeline is the same +as a previous run by computing the fingerprint of the nodes within the pipeline. + +However, some input pipelines might vary in insignificant ways from run to run +that causes the fingerprinting of them to differ. For instance, consider the +following preprocessing function: + +```python +features_to_multiply = {"feature1", "feature2", "feature3", "feature4"} + +def preprocessing_fn(value): + keys_to_features = { + "feature1": tf.FixedLenFeature([], tf.float32, 0.0), + "feature2": tf.FixedLenFeature([], tf.float32, 0.0), + "feature3": tf.FixedLenFeature([], tf.float32, 0.0), + "feature4": tf.FixedLenFeature([], tf.float32, 0.0) + } + + parsed = tf.parse_single_example(value, keys_to_features) + combined_feature = 1.0 + for item in features_to_multiply: + combined_feature *= parsed[item] + + return combined_feature + +dataset = ... +dataset = dataset.map(preprocessing_fn) +``` + +In the above example, our `features_to_multiply` variable uses a `set`, which is +not guaranteed to be ordered in Python. When we iterate over the set in the +for loop within `preprocessing_fn`, we may get a different graph on each +run (i.e. one run could have us multiplying `feature2` first, then `feature4`, +etc..., while another run may have us multiplying `feature1`, then `feature3`, +and so on). + +In cases like these, we can ask fingerprinting to use a fixed value for the +fingerprint of the map function with a new `with_snapshot_fingerprint` +transformation, which asks the fingerprinting function to not compute the +fingerprint of the previous node but to use a user-specified value instead: + +```python +dataset = ... +dataset = dataset.map(preprocessing_fn) +dataset = tf.data.experimental.with_snapshot_fingerprint( + dataset, fingerprint="my_fixed_fp") +``` + +### External API Guarantees + +Externally, we guarantee that snapshots written by a particular version of +TensorFlow will be readable by that specific version of TensorFlow. + +We are not currently handling the case where workers do not go through the +entire training set at least once. + +### Alternatives Considered + +An alternative proposal for an API would be `save()` and `load()`, where the +saving and loading of the input pipeline would be made more explicit, avoiding +some of the logic needed in determining whether to snapshot or read from a +snapshot of a model. 
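+
+As a rough illustration only (the `save_dataset` and `load_dataset` names below
+are hypothetical, not part of this proposal), such an explicit API might be used
+as follows:
+
+```python
+# Preprocessing job: the user explicitly materializes the processed data.
+dataset = make_preprocessed_dataset()  # user-defined tf.data pipeline
+save_dataset(dataset, "/saved/data")   # hypothetical explicit write
+
+# Training job: the user explicitly reads the materialized data back.
+dataset = load_dataset("/saved/data")  # hypothetical explicit read
+model.fit(dataset.repeat())
+```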
+ +The downside here would be that the user would have to split the preprocessing +and training into potentially different files, and users would be forced to +select whether to train or preprocess on their own, which is not good. + +### Performance Implications + +Benchmarks for this feature will be included as part of Dataset microbenchmarks. + +### Dependencies + +No new dependencies will be introduced as part of this project to TensorFlow. +Dependent projects may be able to use this additional op, but there should be no +significant changes otherwise. + +### Engineering Impact + +Binary sizes increases slightly with the inclusion of this new op, and this code +will be maintained by the `tf.data` team. + +### Platforms and Environments + +This op will work on all TensorFlow-supported platforms. We do not anticipate +this to work on embedded systems as it is not useful in resource-constrained +environments. + +### Best Practices, Tutorials and Examples + +A user guide for snapshot will be published to guide new users in using this +feature. + +### Compatibility + +This introduces a new op, which will impact future backwards compatibility. + +### User Impact + +A new python function and a new op are the only user-facing changes visible. + +## Detailed Design + +### Implementation Assumptions + +The following implementation is based on the following assumptions that define +the MVP this is designed for: + +1. We assume that at least for one pipeline run, you can go through the entire + training dataset and be able to store that data on disk. Otherwise, a + snapshot will never get created. + +2. In the cases where there are multiple workers and the dataset is sharded with + `Dataset.shard`, we assume that the number of workers remains the same from + the initial (writing) run through to the reading runs. + + If the number of workers change, then the `num_shards` parameter to + `Dataset.shard` will change, and this will result in a different graph + fingerprint and another snapshot write will be triggered. + + If all workers use the exact same input pipeline with no sharding (e.g. all + workers will read from all the files), then snapshot will still be able to + read from previous snapshots even if the number of workers is different. + +3. Any `repeat`s in the dataset should be moved to after the `snapshot` op, to + avoid writing large (or infinite) amounts of data during a snapshot writing + run. + +### New `SnapshotDatasetOp` + +To implement the transformation, we are introducing a new `SnapshotDatasetOp` +dataset kernel that will implement all of the functionality in TensorFlow C++. +Python code is mostly glue code to pass relevant parameters into the op kernel. + +### Internal Directory / File Structure + +Given a user directory path (e.g. `/path/to/snapshot`), the directory will look +like: + +* /path/to/snapshot + * `fingerprint`/ + * snapshot.metadata + * `run-id`/ + * 0000000.snapshot + * 0000001.snapshot + +The `fingerprint` is a hash of the input processing graph. The `run-id` is +unique training run ID generated. + +### Standard Kernel Workflow + +_Note: This is an implementation detail, and may change in the future. This +should not be relied upon except as a reference to the current implementation._ + +By default, the `snapshot` operation will, upon startup, make a determination +using the following algorithm as to whether the operation should be in the +WRITE, PASSTHROUGH, or READ state. + +1. 
We will compute a graph fingerprint containing all the information from the + Dataset preprocessing graph before the `snapshot` op. We’ll use the + `AsGraphDefInternal` method on DatasetBase for this. + +1. We will attempt to enter the corresponding fingerprint directory. For + instance, if the computed fingerprint is `f-abc123` and the base snapshot + directory is `/saved/data`, then we will attempt to enter + `/saved/data/f-abc123`. + +1. If the snapshot directory is non-existent, empty or it doesn’t contain a + `metadata` file, we will enter the **WRITE** state. + +1. If the snapshot directory contains a `metadata.final` file, we will read + the final metadata file and proceed to the **READ** state. + + 1. The file contains the following fields: + 1. A training run ID, + 1. A boolean indicating if the snapshot is complete. + 1. A training run start-time. + +1. If the snapshot directory contains a `metadata` file but not a + `metadata.final` file, we will read the metadata file. + +1. If the training run start-time is more than the (configurable) training run + timeout (set with the `pending_snapshot_expiry_seconds` parameter), we will + enter the **WRITE** state. + +1. If the training run start-time is less than the training run timeout, but + the snapshot is not complete, then we will enter the **PASSTHROUGH** state. + +1. If the snapshot is complete, we will enter the **READ** state. + +#### WRITE State + +1. We generate a random training run ID. + +1. We write (possibly overwriting) the `snapshot.metadata` file. + +1. We proceed to create a subdirectory containing the training run ID, and + start writing data asynchronously in chunks. + +1. At the end of the dataset (when `end_of_sequence == true`), we will check + the snapshot.metadata file to determine whether it contains the same + training run ID. + + 1. If it does, we write a `metadata.final` file containing the + same information as the `metadata` file but with the complete + bit set to true. + 1. If it does not, it means that someone else is concurrently writing the + snapshot and we lost the race to them. We delete all data in the + training run directory. + +For the current implementation, we will store the data in chunked TFRecord +files. Eventually we may move to other more higher performance data stores or +support additional storage systems such as Cloud BigTable. + +#### PASSTHROUGH State + +1. This is a no-op, where we simply pass through the tensors to the downstream + operations. + +#### READ State + +1. We will read from the snapshots contained within the subfolder with the + correct graph fingerprint and specified training run ID. + +1. Optionally, the user may choose to tell us to specify that the snapshots + should be read back in shuffled order. + +### Concurrency: Handling Multiple Input Workers + +If input workers are sharded, then they will generate different graph +fingerprints as their shard indexes will be different. This will result in each +worker writing to a different subdirectory. + +If input workers are not sharded, then this will result in a race and +potentially multiple workers writing data (still with different training run +IDs). Eventually, if each worker finishes, we will be left with one copy of the +data as all the other workers will determine that they have lost the race and +delete their own copy of the snapshot data. 
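+
+The WRITE / PASSTHROUGH / READ decision described in the kernel workflow above
+can be summarized by the following non-normative pseudocode (helper names and
+metadata fields are illustrative; the actual logic lives in the C++
+`SnapshotDatasetOp` kernel):
+
+```python
+import os
+import time
+
+def determine_snapshot_state(snapshot_dir, fingerprint,
+                             pending_snapshot_expiry_seconds):
+  """Illustrative sketch of the state selection, not the real implementation."""
+  fp_dir = os.path.join(snapshot_dir, fingerprint)
+  metadata_path = os.path.join(fp_dir, "snapshot.metadata")
+  # No fingerprint directory or no metadata file yet: write a new snapshot.
+  if not os.path.isdir(fp_dir) or not os.path.exists(metadata_path):
+    return "WRITE"
+  # A finalized metadata file means a complete snapshot is available.
+  if os.path.exists(os.path.join(fp_dir, "metadata.final")):
+    return "READ"
+  run_id, complete, start_time = read_metadata(metadata_path)  # hypothetical helper
+  if time.time() - start_time > pending_snapshot_expiry_seconds:
+    return "WRITE"        # the previous writer is considered stale
+  if not complete:
+    return "PASSTHROUGH"  # another run is still writing; don't block training
+  return "READ"
+```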
+ +## Questions and Discussion Topics + +* Should we implement this as three ops (a control opt o determine whether a + snapshot is to be read from/written to) and a write and read op to do the + respective operations? + * Pros include: + * Modularizes the implementation into smaller chunks + * Allows someone else to do the "control" + * Challenges include: + * Where/how the "control" runs? + * How do we construct the dataset graph properly? +* How should autotuning be integrated into the snapshot transformation? +* Are the configuration options well named? Is it possible to consolidate some + of these options? +* What other compression/decompression options would you like to see + supported? +* Any other performance / feature tuning knobs we should make available? diff --git a/rfcs/20200113-tf-data-service.md b/rfcs/20200113-tf-data-service.md new file mode 100644 index 000000000..8e80c6cdc --- /dev/null +++ b/rfcs/20200113-tf-data-service.md @@ -0,0 +1,648 @@ +# Distributed tf.data service + +| Status | Accepted | +| :------------ | :------------------------------------------------------ | +| **RFC #** | [195](https://github.com/tensorflow/community/pull/195) | +| **Author(s)** | Andrew Audibert (aaudibert@google.com) Rohan Jain (rohanj@google.com) | +| **Sponsor** | Jiri Simsa (jsimsa@google.com) | +| **Updated** | 2019-01-30 | + +## Objective + +Provide an API and implementation of a tf.data service which can process tf.data +datasets in a distributed manner. The service can be run outside the TensorFlow +cluster or be exported as a gRPC service by TensorFlow servers. + +Goals: + +- Enable horizontal scaling of dataset computation to improve performance of + input-bound dataset pipelines. +- Improve tf.data integration with the tf.distribute API. In particular, + support dynamic sharding of data across multiple processes. +- Provide visitation guarantees for distributed training jobs. + +Non-goals: + +- Process non-dataset data. +- Distribute datasets that rely on external / non-serializable state. +- Support non-graph computation (e.g. py_function). + +## Motivation + +### Host machine input pipelines can't always keep up with accelerators. + +Some input pipelines require significant resources to produce their data, e.g. +due to image transformations. When the host machine isn't powerful enough to +generate input data at the rate the attached accelerator(s) consume the data, +the accelerator(s) will idle. This slows down training time, and also wastes +valuable accelerator resources. The tf.data service solves this problem by using +N input workers to feed M accelerators. The number of input workers can be +scaled up or down as needed to keep up with the accelerators. + +### Distributed training requires a distribution-aware input pipeline. + +Today tf.data supports the tf.distribute API by providing mechanisms for +sharding, cloning, and re-batching. The tf.distribute API uses these primitives +to implement their own version of a distributed dataset. If distributed datasets +become a core feature of tf.data, tf.data can provide a public API for +tf.distribute (and users who wish to implement their own distribution) to use +instead. This will also allow us to support feature requests that require +cross-worker coordination, such as dynamic sharding. + +## User Benefit + +### Input-bound models + +Users with input-bound models can leverage the tf.data service to distribute +input processing across horizontally-scaling compute resources. 
This can improve +utilization for valuable accelerator resources, reducing total cost. + +### Dynamic load balancing + +Today, the tf.distribute API statically shards data across accelerators. This +can lead to suboptimal utilization because some shards may contain more data +than others. The tf.data service provides a mechanism for dynamically sharding, +reducing the data imbalance across accelerators. Note that dynamic load +balancing and deterministic output are mutually exclusive; if users require +deterministic output, they must trade off dynamic load balancing. + +### Visitation guarantees + +Model accuracy can often be improved when each training sample is trained on +exactly once per epoch. The tf.data service can coordinate across workers to +provide this guarantee. + +## Design Proposal + +The tf.data service is a master-worker system which iterates through datasets, +producing outputs to be consumed by accelerators. The service is comprised of a +few components: + +* User-facing Python API for interacting with the tf.data service. +* Dataset splitting API for determining how to split up datasets for parallel + processing. +* Master and worker gRPC services. + +### Architecture + +The tf.data service is comprised of master and worker gRPC services which could +be run in a couple of different configurations: + +#### Glossary + +**Master**: The single master coordinating the tf.data service. + +**Worker**: A tf.data service worker which performs dataset processing and +provides dataset elements to consumers over RPC. + +**Consumer**: A machine which consumes data from the tf.data service. The +consumer may be attached to a GPU or TPU, or use data for on-CPU training. + +#### Option 1: Separate Cluster Architecture + +Each server is run on a separate host from the TensorFlow cluster. This +configuration gives users a way to provide horizontally scaling CPU for +processing their input pipelines and quickly feeding data to accelerators. + +#### Option 2: Embedded Cluster Architecture + +Each TensorFlow server runs the tf.data worker gRPC service, and one server also +runs the master gRPC service. This lets users leverage the tf.data service +without needing to provision additional compute resources. and gives all the +benefits of the tf.data service except for horizontal scaling. + +#### Option 3: Hybrid Architecture + +Users could run tf.data workers embedded in their TensorFlow cluster, and also +run additional tf.data workers (and potentially the tf.data master) outside the +cluster. This allows for horizontal worker scaling, while still leveraging the +compute resources of the TensorFlow cluster for input processing. + +### User-facing Python API + +This API is how users will interact with the tf.data service from their Python +code. The steps for distributed iteration over a dataset are + +1. Create a dataset like usual. +2. Apply the `distribute` transformation to indicate that the dataset should be + processed by the tf.data service. +3. Begin an *iteration* by calling `create_iteration`. An *iteration* is a + single pass through the dataset. Multiple consumers can read from the same + iteration, resulting in each consumer receiving a partition of the original + dataset. We represent an iteration with an iteration id, which is generated + by the tf.data service when you call `create_iteration`. +4. Share the iteration id with all consumer processes which are participating + in the iteration. +5. 
Create per-consumer iterators using `make_iterator`, and use these iterators + to read data from the tf.data service. + +We move away from the idiomatic `for element in dataset` control flow because +there is now an extra step when going from dataset to iterator: creating an +iteration. A higher layer API such as tf.distribute may use the API presented +here to implement datasets which produce per-replica elements, enabling +idiomatic control flow. + +```python +def tf.data.experimental.service.distribute(address_or_resolver): + """Marks that a dataset should be processed by the tf.data service. + + ds = ... # dataset to distribute + ds = ds.apply( + tf.data.experimental.service.distribute(address_or_resolver)) + + Args: + address_or_resolver: The address of the tf.data service master, or a + cluster resolver that can be used to determine the master address. + + Returns: + A function that can be passed to `dataset.apply()`. + """ + +def tf.data.experimental.service.create_iteration( + dataset, num_consumers=1, num_tasks=None, deterministic=False): + """Begins distributed iteration over a dataset. + + It is expected that the dataset contains at least one `.distribute(address)` + transformation, otherwise this method will print a warning and do nothing. + + `create_iteration` will first register the dataset with the tf.data service + if it isn't already registered. It will then request the creation of + `num_consumers` dataset iterators which divide the dataset `num_consumers` + ways. The returned object can be used to read from one of the + iterators using + `tf.data.experimental.service.make_iterator(ds, obj, consumer_index)`. + + ds = ... # dataset to distribute + ds = ds.apply(tf.data.experimental.service.distribute(address)) + if consumer_index == 0: + # The iteration object is a byte array which needs to be shared among all + # consumers. Here we suppose there are broadcast_send and broadcast_recv + # methods available. + iteration_id = tf.data.experimental.service.create_iteration(ds, 3) + broadcast_send(iteration_id) + else: + iteration_id = broadcast_recv() + it = tf.data.experimental.service.make_iterator( + ds, iteration_id, consumer_index) + for element in it: + # process element + + Args: + dataset: The dataset to begin iteration over. + num_consumers: The number of consumers to divide the dataset between. Set + this if you require determinism. + num_tasks: The number of tasks to use for processing. Tasks run for + the duration of an epoch, and each worker should typically process a single + task. Normally it is best to leave this as None so that the master can + choose a reasonable number of tasks. Setting `num_tasks` is useful for + producing deterministic results. + deterministic: Whether the iteration should be performed + deterministically. Fully deterministic output also requires setting + `num_tasks` to a fixed number, and that the input dataset is itself + deterministic. + + Returns: + An iteration_id which can be used to created iterators via + `tf.data.experimental.service.make_iterator` + """ + +def tf.data.experimental.service.make_iterator( + dataset, iteration, consumer_index=0): + """Creates an iterator for reading from the specified dataset. + + Args: + dataset: The dataset to read from. + iteration: An iteration_id object generated by + `tf.data.experimental.service.create_iteration`. + consumer_index: The consumer index within the iteration to read from. If + the iteration was created with `n` consumers, `consumers_index` must be + less than `n`. 
+ + Returns: + A Python iterator which iterates over the dataset elements. + """ +``` + +### Dataset splitting API + +To parallelize dataset processing, the tf.data service needs a way to split up +datasets. We will achieve this by adding a splitting API that allows source +datasets to express how they can be split. + +Our goals for the API are + +* Performance: The splitting API can be used to performantly split and process + datasets. +* Extensibility: User-defined datasets can be split as long as they implement + the splitting API. +* Minimize Surprises: Users write their datasets as though they will not be + split, so introducing splitting can easily lead to unexpected outcomes. To + mitigate this, we will be conservative about which dataset transformations + support splitting. + +The API will be used internally by the tf.data service to distribute datasets. +It will be entirely in C++, and we don't currently have any plans to expose +splitting through Python. + +The API focuses on producing and consuming `Split`s. A `Split` is a variant +Tensor that can be subclassed to represent arbitrary types of splitting. The +`Split` base class is intentionally general so that subclasses have the +flexibility to define splits however they like. + +```cpp +class Split { + public: + virtual std::string DebugString() const = 0; + // Methods to support being used as a Variant tensor. + virtual std::string TypeName() const = 0; + virtual void Encode(VariantTensorData* data) const = 0; + virtual bool Decode(const VariantTensorData& data) = 0; +}; +``` + +To iterate over splits for a dataset, we will use a new +`DatasetBase::MakeSplitGenerator()` method. This method creates a +`SplitGenerator`, which is responsible for generating all of the splits for the +dataset. We use an intermediate `SplitGenerator` object instead of generating +splits directly because there could be a large number of splits, and the +`SplitGenerator` gives us as way to tune split size in response to pipeline +performance. + +```cpp +class SplitGenerator { + public: + virtual Status GetNext(std::unique_ptr* split, + bool* end_of_splits) = 0; + // Instructs the SplitGenerator to adjust the size of future splits by the + // specified percent. 100% means no change, 50% means half-sized splits, and + // 200% means double-sized splits. The SplitGenerator will make a best effort + // to incorporate the feedback when creating splits. + virtual void AdjustSplitSize(int percent) = 0; +}; +``` + +It is tempting to process each split independently, but this would cause issues +when splits are small. tf.data pipelines need to populate internal buffers for +shuffling, prefetching, and batching. If we use a separate pipeline to process +each split, our shuffling will be lower quality, we will have performance jitter +as we keep needing to refill prefetch buffers from scratching, and we will +produce many more partial batches (each split might not even have enough data to +fill a full batch). To avoid these issues, we use a small number of tasks, where +each task processes many splits as a single pipeline. + +To enable processing of multiple splits in a dataset, we will add an optional +`SplitProvider` field to the `IteratorContext` passed to +`IteratorBase::Initialize`. The `SplitProvider` produces splits which tell the +iterator what source data to iterate over. 
+For example, if splits are represented by filenames, and a `SplitProvider`
+produces `["file1", "file6", "file11"]`, an iterator initialized by that
+`SplitProvider` should process those three files only.
+
+```cpp
+class SplitProvider {
+ public:
+  virtual Status GetNext(std::unique_ptr<Split>* split,
+                         bool* end_of_splits) = 0;
+};
+```
+
+When processing datasets, tf.data service workers will use `SplitProvider`s
+which provide splits by querying the tf.data service master for which splits to
+process. A few splits will be prefetched to hide the latency of needing to
+request a new split from the master.
+
+#### Supported Datasets
+
+Not all dataset sources and transformations are easily splittable. For example,
+`take`, `skip`, and `scan` require a global view of the dataset to produce
+correct results. Datasets which require multiple input datasets, such as `zip`,
+are also difficult to support, since we don't have a good way of aligning the
+splits of multiple input datasets. Users who rely on these unsupported datasets
+will need to move those datasets to come after the distributed part of their
+pipeline.
+
+Initially, we will support splitting for the following dataset sources and
+transformations:
+
+*   `batch`, `CsvDataset`, `dense_to_sparse_batch`, `filter`,
+    `FixedLengthRecordDataset`, `flat_map`, `from_tensor_slices`,
+    `group_by_window`, `ignore_errors`, `interleave`, `list_files`, `map`,
+    `range`, `repeat`, `padded_batch`, `prefetch`, `shuffle`, `SSTableDataset`,
+    `TextLineDataset`, `TFRecordDataset`, `unbatch`, `window`.
+
+### Master and worker services
+
+This section discusses the design for the master and worker services. These
+services are used by the Python API to provide distributed dataset processing,
+and these services use the splitting API as a part of their implementation.
+
+#### Master API
+
+The master is responsible for registering datasets, generating and tracking
+iteration and worker ids, and generating dataset splits for processing on
+workers.
+
+Below is a sketch of the Master API. This API is not public and is subject to
+change.
+
+```cpp
+/// ---- Methods called by consumers ----
+
+// Registers a dataset and returns an id for the dataset. If the dataset is
+// already registered, its dataset id is returned.
+int GetOrRegisterDataset(GraphDef dataset);
+
+// Creates and returns `num_consumers` iterator ids which partition the
+// specified dataset. This also creates an internal `iteration_id` used to
+// track the overall dataset iteration. `num_tasks` defines how many tasks to
+// create. If `num_tasks` is -1, it is up to the master to determine how many
+// tasks to create.
+list<int> CreateIterators(int dataset_id, int num_consumers,
+                          int num_tasks);
+
+// Returns the list of tasks processing data for `iterator_id`. Consumers query
+// this to find which worker addresses to read data from.
+list<WorkerInfo> GetWorkersForIterator(int iterator_id);
+
+/// ---- Methods called by input workers ----
+
+// Registers a worker and returns its worker id.
+int RegisterWorker(WorkerInfo worker_info);
+
+// Requests the next splits to process on the given worker for the given
+// iteration_id.
+list<Split> GetSplits(int worker_id, int iteration_id);
+```
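+
+Before turning to the Worker API, the following Python-style pseudocode is a
+rough illustration (not part of the proposal) of how the
+`tf.data.experimental.service.create_iteration` call described earlier might map
+onto the master methods above. The `master_stub` parameter is a placeholder for
+whatever RPC client the implementation ends up using, and how the returned ids
+are packaged into the opaque iteration id handed back to Python users is an
+implementation detail.
+
+```python
+def create_iteration_sketch(master_stub, dataset_graph_def, num_consumers,
+                            num_tasks=None):
+  """Illustrative pseudocode mapping create_iteration onto the Master API."""
+  # Registration is idempotent: if the dataset is already registered, the
+  # existing dataset id is returned.
+  dataset_id = master_stub.GetOrRegisterDataset(dataset_graph_def)
+  # Ask the master to partition the dataset among `num_consumers` iterators.
+  # Passing -1 lets the master choose a reasonable number of tasks.
+  iterator_ids = master_stub.CreateIterators(
+      dataset_id, num_consumers, num_tasks if num_tasks is not None else -1)
+  return iterator_ids
+```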
+
+#### Worker API
+
+The worker is responsible for processing datasets and providing dataset elements
+to consumers.
+
+Below is a sketch of the Worker API. This API is not public and is subject to
+change.
+
+```cpp
+/// ---- Methods called by consumers ----
+
+// Gets the next element for the specified iterator_id.
+list<Tensor> GetElement(int iterator_id);
+
+/// ---- Methods called by master ----
+
+// Requests that the worker process the specified dataset. This will trigger the
+// worker to start requesting splits from the master using the `iteration_id`.
+void ProcessDataset(int dataset_id, int iteration_id, list<int> iterator_ids);
+```
+
+#### Visitation Guarantees
+
+When iterating over a deterministic dataset, the tf.data service will process
+all input data exactly once, even in the presence of master or worker failures.
+We achieve exactly-once by having consumers keep track of their index within
+each task, and having restored tasks skip elements to reach the requested index.
+For the skipping to give exactly-once semantics, the dataset must produce
+outputs deterministically.
+
+If the dataset is not deterministic, the user can choose either an at-least-once
+or a close-to-exactly-once visitation guarantee. We can achieve
+close-to-exactly-once by using the same skipping technique that we use to
+achieve exactly-once for deterministic datasets. If users prefer an
+at-least-once guarantee, we can instead start restored tasks from their latest
+checkpoint.
+
+In some cases, we can provide an exactly-once visitation guarantee to
+non-deterministic pipelines. If input workers are brought down gracefully, they
+can first write checkpoints of their tasks. This way, tasks can begin exactly
+where they left off.
+
+#### Determinism
+
+Deterministic processing is a cornerstone of tf.data. Determinism is valuable
+for debugging and experimentation. This section discusses how the tf.data
+service will provide determinism.
+
+To get deterministic behavior, the tf.data service will require four things:
+
+1.  The dataset being distributed has deterministic output.
+1.  The user sets `num_consumers`, `num_tasks`, and `deterministic=True` when
+    calling `tf.data.experimental.service.create_iteration`.
+1.  Each consumer uses a unique `consumer_index` when calling `make_iterator`.
+1.  The consumers do not fail.
+
+In the absence of failures, determinism is achieved by distributing splits
+round-robin among `N` input workers and having input workers earmark every
+`i`th element for consumer `i`.
+
+To provide determinism even when servers fail, consumers can keep track of which
+element index they have processed up to for each task. Input workers would
+attach per-task element indices when they produce elements, so consumers can
+ignore duplicate elements caused by worker restarts.
+
+#### Failure Recovery
+
+The tf.data service can recover from master and worker failures while preserving
+determinism and its at-least-once visitation guarantee. The master achieves this
+by writing its unrecoverable state to a persistent journal, and taking
+checkpoints of its recoverable state to improve recovery time. When workers
+reconnect to a restarted master, they update the master with their state so that
+the master can recover its knowledge of its workers.
+
+The unrecoverable state includes:
+
+* **Registered datasets**
+* **ID generators** for iteration ids, iterator ids, dataset ids, and worker
+  ids.
+* **In-progress iteration state**:
+  * **dataset id** for the iterated dataset so that we can recover the
+    iteration's split generator
+  * **iteration id**
+  * **assignments from splits to tasks**, so that we can restart failed
+    tasks on new workers.
+
+Recoverable state includes:
+
+* **Split generators**: Recoverable from our information about in-progress
+  iterations.
+* **Worker addresses**: Recoverable when workers reconnect.
+* **Worker loads**: Recoverable when workers reconnect. +* **Assignment from tasks to workers**: Recoverable when workers reconnect. + +To improve recovery time, the master will periodically write checkpoints of its +split generators and outstanding splits, so that split generators don't need to +be run from the beginning during master recovery. + +Workers have no unrecoverable state. If a worker crashes, a new worker can take +its place. It is up to the master to reassign splits from the crashed worker to +the new worker. + +To improve worker recovery time, workers will periodically write checkpoints of +their iterators to directories named using their worker ids. When the restarted +worker connects, the master will tell it which iterator checkpoints to recover +from. + +We will read and write this state through a MasterState interface which can be +implemented using various storage backends. For use cases that require fault +tolerance, the user must configure a fault-tolerant MasterState, e.g. Cloud +Spanner or etcd. If fault tolerance isn't required, the user could configure +state to be held in memory only. + +#### Leadership Transfer + +The master writes state to journal files so that the state can be recovered on +restart. It is possible that a new master could be brought up while the old +master is still running. If we aren't careful, this could result in corruption +of the journal as both masters try to write to it. + +Ideally we could rely on a distributed coordination service such as ZooKeeper. +However, this would add a significant burden to users who don't have access to a +ZooKeeper cluster, and it would also require adding a new dependency on a +ZooKeeper client. + +What TensorFlow does have is a FileSystem API. We will leverage this API to +perform leadership transfer by creating empty files and inspecting file +modification times. + +``` +files = list_directory(leadership_directory) +if all_files_older_than(files, leadership_transfer_interval): + file = create_unique_file(leadership_directory); + if file_is_strictly_newest(file, leadership_directory): + become_leader() +# Another master may be leader. Wait for some time before trying again. +wait_random_interval() +``` + +The leader master will periodically write files to the leadership directory to +indicate that it is still leading. + +The above scheme relies on the filesystem's create_file() and list() operations +being strongly consistent . Users may opt to use a filesystem that doesn't +support strong consistency, but they do so at the risk of two concurrently +running masters thinking they are leader. Common filesystems such as POSIX, +HDFS, and GCS support such strong consistency, but S3 does not. + +#### Caveats + +This section calls out caveats that users will need to be aware of when using +the tf.data service. + +- Due to the nature of dataset splitting, elements will not be processed in + the same order as they were in the pre-distributed dataset. If a dataset + relies on the order of the input files, the user's assumptions will be + violated when splitting causes each input worker to process only a subset of + the input files. +- If a particular dataset operation doesn't support splitting, it must be + moved after the part of the dataset which is distributed. Alternately, the + user could set num_tasks=1 to avoid the need for splitting, but this will + have a heavy performance cost since it only allows a single worker to + generate dataset elements. 
The most commonly used but unsupported datasets + are `from_generator` and `zip`. + +#### Framework Integration + +Many users interact with TensorFlow through a framework such as +[TFX](https://www.tensorflow.org/tfx). A framework can make leveraging the +tf.data service as simple as toggling a configuration boolean, triggering the +framework to bring up tf.data service servers and add a +`tf.data.experimental.service.distribute` transformation at the end of the +users' data pipeline. By inspecting the amount of time blocked on the input +pipeline, the framework could dynamically scale the number of input workers up +and down to find the minimum number of workers needed so that the input pipeline +can keep up with the model. + +### Alternatives Considered + +#### Use Beam for distributed dataset processing. + +Beam is an open-source data processing framework capable of large-scale parallel +computation. Instead of implementing distributed computation ourselves, we could +execute Beam jobs to perform dataset processing. + +We chose not to follow this direction to avoid creating a dependency on Beam. +Many users don't depend on Beam, and it would be a limitation to require that +dependency. If we depend on Beam, it will not be possible to use the tf.data +service with out-of-the-box TensorFlow. This is especially important as tf.data +service is expected to be used by the tf.distribute API. + +### Performance Implications + +With tf.data workers running in a separate cluster, we expect to be able to +horizontally scale until the input pipeline is no longer the bottleneck, +improving performance for input-bound pipelines. + +If a pipeline input-bound or close to input-bound, tf.distribute could see +performance regressions when it uses the tf.data service to serve elements +across replicas. The issue is that the tf.data service will incur the cost of +transferring elements over the network to feed replicas, instead of having each +replica perform its input processing locally. On the other hand, if the input +pipeline is not the bottleneck, tf.distribute could see training speedups as +dynamic sharding mitigates the time spent waiting for stragglers. + +### Dependencies + +This proposal does not add any new dependencies to TensorFlow. + +### Engineering Impact + +The tf.data service will be maintained by the tf.data team. + +### Platforms and Environments + +The tf.data service is compatible with all platforms supported by TensorFlow. + +### Best Practices, Tutorials and Examples + +The tf.data performance guide will be updated to explain when to use the tf.data +service. We will also provide a tutorial for using the tf.data service. + +### Compatibility + +* Does the design conform to the backwards & forwards compatibility + requirements? + - Yes, this design only adds new functionality, so it doesn't break any + backwards or forwards compatibility guarantees. +* How will this proposal interact with other parts of the TensorFlow + Ecosystem? + - How will it work with TFLite? + * We aren't planning any integration with TFLite, where we haven't + seen a need for distributed input processing. Traditionally TFLite + is used for inference, while tf.data is used for training. + - How will it work with distribution strategies? + * Distribution strategies will be able to leverage the tf.data service + to replace its static sharding with dynamic sharding, and to support + efficient splitting for a wider range of datasets. + - How will it interact with tf.function? 
+ * The tf.data service APIs will work both inside and outside of + tf.functions. + - Will this work on GPU/TPU? + * This proposal does not change the status quo of support for + executing tf.data pipelines on GPU/TPU. + +## Questions and Discussion Topics + +* How should we communicate that distributing a dataset will change the order + in which elements are processed? If users' datasets rely on elements being + processed in a certain order, they could face unpleasant surprises. + - Final decision: Address this through documentation. +* Should we support splitting `skip`, `take`, and `scan` by having them + operate at a per-task level (e.g. skip or take the first `N` elements within + each task)? + - Final decision: Prohibit distributing these transformations, and tell + users to instead use these transformations *after* applying the + `distribute` transformation. +* Is there a more user-friendly way to share iteration ids across consumers? + Distribution strategy is well-equipped with collective ops to share the + iteration ids, but sharing the iteration id could be a heavy burden for + some users. + - Final decision: It is a reasonable expectation for users to either use + distribution strategies, or distribute their own iteration ids. + TensorFlow will soon have public APIs for collective operations that + would make it easy to broadcast iteration ids. +* Can `service.distribute` take a `ClusterResolver` so that the master + hostname isn't baked into the dataset definition? + - Final decision: Accept `master_address_or_resolver`, and wait to resolve + the master address until iteration begins. The `ClusterResolver` will be + stored in the Python `Dataset` object. In the future, we may want C++ + implementations of `ClusterResolver` so that we can represent the + resolver within the dataset graph. diff --git a/rfcs/20200117-tfx-combining-model-validator-with-evaluator.md b/rfcs/20200117-tfx-combining-model-validator-with-evaluator.md new file mode 100644 index 000000000..9a19a34f5 --- /dev/null +++ b/rfcs/20200117-tfx-combining-model-validator-with-evaluator.md @@ -0,0 +1,674 @@ +# Combining ModelValidator with Evaluator + +| Status | Accepted | +| :------------ | :----------------------------------------------------- | +| **RFC #** | 200 | +| **Author(s)** | Gene Huang (jinhuang@google.com), Mike Dreves (mdreves@google.com), Neoklis Polyzotis (npolyzotis@google.com) | +| **Sponsor** | Konstantinos Katsiapis (katsiapis@google.com), Neoklis Polyzotis (npolyzotis@google.com) | +| **Updated** | 2020-01-17 | + +## Objective + +This document discusses proposals for the next iteration of TFX Evaluator +design. The design has the following goals: + +1. Fuse the Evaluator and the ModelValidator into a single component, + eliminating redundant evaluation runs. +2. Adding the ModelValidator related configs into Evaluator, streamlining the + configurations for model evaluations and model validations. +3. Enabling new functionality, namely, model-diff metrics, confidence + intervals, more flexible model/data selection, which was difficult in the + previous split Evaluator/Validator design. +4. Backwards compatibility for existing uses of the two components in TFX. + +**Note**: the new Evaluator component will built on the newly released TFMA >= +v0.21. 
+
+## Motivation
+
+Under the current TFX setup,
+[ModelValidator](https://www.tensorflow.org/tfx/guide/modelval) is completely
+separated from [Evaluator](https://www.tensorflow.org/tfx/guide/evaluator): they
+run in separate binaries, have separate specifications, and do not communicate
+with each other at all. This has several drawbacks:
+
+1.  Computational redundancies: a typical use case of Evaluator is to run
+    evaluation on the latest data with the newly trained model
+    (Data<sub>latest</sub> on Models<sub>latest</sub>). A typical use case of
+    ModelValidator is to compare the aforementioned evaluation
+    (Data<sub>latest</sub> on Models<sub>latest</sub>) with the evaluation on
+    the same data with the already blessed model (Data<sub>latest</sub>,
+    Model<sub>blessed</sub>). Since Evaluator and ModelValidator run
+    separately, one evaluation (Data<sub>latest</sub> on
+    Models<sub>latest</sub>) runs twice.
+2.  Code redundancies: the separation makes it harder to support a consistent
+    set of behaviors across the two components. For instance, today
+    ModelValidator supports only a subset of the metrics computed by Evaluator,
+    and making them consistent would require code duplication between the two
+    components. This also raises the bar for implementing newly requested
+    functionality such as computing model diff-metrics (with confidence
+    intervals) for both evaluation and validation.
+3.  Config redundancies: the configurations for Evaluator and ModelValidator
+    are partially redundant and confusing.
+
+## User Benefits
+
+We first introduce some terms we will use in the rest of the proposal.
+
+*   A **model** is the conceptual output of a
+    [Trainer](https://www.tensorflow.org/tfx/guide/trainer) component in the TFX
+    pipeline.
+*   A **model artifact** represents a single instance of a model and corresponds
+    to the output of an execution of the corresponding
+    [Trainer](https://www.tensorflow.org/tfx/guide/trainer) component.
+
+Hence, if a TFX pipeline has two distinct trainers we will talk about two
+models. Each execution of a Trainer will result in a model artifact. Another way
+to think of this is that each model represents the "stream" of model artifacts
+that result from the Trainer's successive executions.
+
+Henceforth we assume that each model is identified by some unique name (e.g.,
+the name of the corresponding trainer component, or a trainer-generated handle)
+and each model artifact is further identified by a unique id within that name
+(e.g., the MLMD-generated artifact id, or a trainer-generated version id for the
+model).
+
+We introduce several ways to select model artifacts within the pipeline:
+
+*   Latest output of a model: the last model artifact in the corresponding
+    "stream".
+*   Latest blessed: the last model artifact that was blessed by the TFX pipeline
+    (i.e., validated by ModelValidator and InfraValidator).
+*   Latest pushed: the last model artifact that was pushed by the pipeline's
+    Pusher component.
+
+Correspondingly, we also leverage existing ways to select data artifacts within
+the pipeline:
+
+*   [Span, Version, and Split](https://www.tensorflow.org/tfx/guide/examplegen#span_version_and_split):
+    a Span may correspond to a day of data, an hour of data, and so on. There
+    might be multiple versions of a Span, but in this proposed work, we always
+    pin on the latest version per Span. Each Span of data can be subdivided into
+    multiple Splits. A typical use case would be to split a span (and a version)
+    of data into training, evaluation, and validation data.
+ +As part of the unification we plan to extend the behaviors supported by the new +component. Here are a few notable examples: + +* Within a TFX pipeline run, evaluation and validation of several models + compared to a baseline (e.g., latest blessed model artifact from the + pipeline). A typical case is evaluation and validation of each new artifact + of a single model compared to the latest blessed model artifact. +* Single component run of the validation of several models to a baseline. This + can be used to: + * Unblock a previously unsuccessful validation + * Experiment or debug two related models. +* More flexible data selection + * Include multiple selections of data (e.g., selecting rolling window of + data and a fixed golden dataset) + * Exclude selections of data (e.g., selecting a rolling window of data + with exclusion of certain spans) + +In general, the new supported behaviors can be described in terms of the +following orthogonal dimensions: + +* Entire pipeline run vs single component run operation. +* Single-model vs multi-model metrics (e.g., AUC of a single model artifact vs + diff of AUCs between a model artifact and a baseline). +* Single-model vs multi-model validation constraints. + +This results in functionality that significantly expands on the available +functionality in TFX. For instance, (one-off, multi-model metrics, multi-model +constraints) is a new behavior that is not possible before. + +### Model comparison and model validation + +With a given evaluation run, we can now import two models: one candidate model +and one baseline model. The evaluation will not only calculate normal metrics +(see +[supported metrics](https://github.com/tensorflow/model-analysis/blob/master/g3doc/metrics.md)), +but also calculate corresponding diff metrics between the two models, optionally +with a confidence interval. + +With the diff metrics, users can then gate the candidate model blessing using +thresholds on the diff metrics. We introduce two ways of thresholding: value +thresholding and diff thresholding. Please see +[this section](#model-validation-config) for more details. + +## Design Proposal + +We propose to merge the Evaluator and ModelValidator component as a single +component. Here is a diagram of what is being changed in terms of data/model +flow: + +
+
+### Executor signature
+
+**Input**: Input of the Evaluator, which includes data, model, and optional data
+validation artifacts.
+
+*   Eval dataset artifact(s)
+    *   We assume that the contents of these artifacts will be accessed through
+        the
+        [tfx-io](https://github.com/tensorflow/community/blob/master/rfcs/20191017-tfx-standardized-inputs.md)
+        layer and so do not make assumptions about formats or example payloads.
+    *   Proposed label in the input dict: "examples"
+*   A list of model artifacts and an optional baseline model artifact for
+    evaluation/validation
+    *   The component will support the following formats out of the box:
+        SavedModel, EvalSavedModel (what is used by TFMA v0.15.x), Keras. There
+        will also be customization hooks for other model formats, as supported
+        in TFMA v0.21.
+    *   Proposed label in the input dict: "model" (same as the current
+        evaluator) and/or "baseline_model" if there is a baseline model. The
+        model artifacts are auto-inferred from the topology of the TFX pipeline,
+        and the baseline model artifact is linked to the
+        [ModelSpec](https://github.com/tensorflow/model-analysis/blob/1301797060a0e0d099d05eb4994f8879bce400ff/tensorflow_model_analysis/proto/config.proto)
+        with "is_baseline" being True (see also the section below on
+        *Configuration*).
+    *   A baseline model is needed to compare different models to a baseline
+        model. It is specified in the
+        [ModelSpec](https://github.com/tensorflow/model-analysis/blob/1301797060a0e0d099d05eb4994f8879bce400ff/tensorflow_model_analysis/proto/config.proto)
+        with a boolean "is_baseline" turned on.
+*   (optional) A data-validation artifact for each of the eval datasets
+    *   The payload of such artifacts is assumed to be an
+        [Anomalies](https://github.com/tensorflow/metadata/blob/master/tensorflow_metadata/proto/v0/anomalies.proto)
+        message.
+    *   Proposed label in the input dict: "data_validation". Note that in the
+        current setup of the executor we need to assume that the order of data
+        validation results matches the order of dataset artifacts.
+
+**Configuration**: controls which evaluation metrics are computed, how results
+are sliced, and how models are validated. The former two parts (metrics and
+slicing) are built on the existing TFMA codebase (and v0.21 adds native support
+for Keras and TF2 signatures). In this document, we propose to add the
+configurations for model validation. We discuss the details of the validation
+logic in this [section](#model-validation-config).
+
+*   Model specifications: Specify a
+    [ModelSpec](https://github.com/tensorflow/model-analysis/blob/1301797060a0e0d099d05eb4994f8879bce400ff/tensorflow_model_analysis/proto/config.proto#L32)
+    for each model that is linked to a model artifact in the inputs. The spec
+    also identifies the saved model and specifies how to load and run inference.
+    Please see
+    [supported model types](https://github.com/tensorflow/model-analysis/blob/master/g3doc/get_started.md#model-types-supported)
+    for the most up-to-date details.
+*   Evaluation specifications: Specify the
+    [MetricsConfig](https://github.com/tensorflow/model-analysis/blob/1301797060a0e0d099d05eb4994f8879bce400ff/tensorflow_model_analysis/proto/config.proto#L179),
+    [SlicingSpec](https://github.com/tensorflow/model-analysis/blob/1301797060a0e0d099d05eb4994f8879bce400ff/tensorflow_model_analysis/proto/config.proto#L75), and
+    [AggregationOptions](https://github.com/tensorflow/model-analysis/blob/1301797060a0e0d099d05eb4994f8879bce400ff/tensorflow_model_analysis/proto/config.proto#L106).
Please see the + [TFMA metrics guide](https://github.com/tensorflow/model-analysis/blob/master/g3doc/metrics.md) + for guidance on how to set up model evaluations. Please see + [this](https://github.com/tensorflow/model-analysis/blob/master/g3doc/setup.md#slicing-specs) + for a brief explanation of slicing config. +* **Model validation specifications**: Specify validation thresholds, + including value and diff based thresholding method with or without + confidence interval. +* [Options](https://github.com/tensorflow/model-analysis/blob/1301797060a0e0d099d05eb4994f8879bce400ff/tensorflow_model_analysis/proto/config.proto#L201): + miscellaneous configuration metrics. + +**Output**: model evaluation and validation results. For each model artifact: + +* Evaluation results + * These results comprise the values of configured metrics on the eval + dataset. + * If a baseline is provided, the results will also contain a comparison of + the model to the baseline. +* Validation results + * These include validations on the configured metrics. + * If the data-validation artifact is present, it is also taken into + account for the result of data validation. + +In what follows we describe in more detail the strategies to resolve the inputs, +the model validation configuration (i.e., what validations are possible), and +the information stored in the output artifacts (i.e., the payload for evaluation +and validation). + +### Inputs + +We assume that both pipeline and single component operation is feasible through +TFX’s orchestration subsystem. In the pipeline mode, the driver can trigger the +executor when any of the following conditions hold: + +* new artifact(s) for evaluation data +* new artifact of a model under evaluation (or, latest artifact if several + have been generated since the last component execution) +* new baseline (if configured) +* new data-validation results (if configured) + +Note that users can exclude any of them from the triggering logic, e.g, when +both models and data are specified, user can configure the driver logic so that +only a new model export triggers a new evaluation run, while a new span of data +does not trigger a new evaluation run. + +We now describe possible ways to resolve each one of the inputs, motivated by +existing and upcoming use cases for model analysis and validation. + +#### Data Selection + +We identify the following ways to resolve the eval dataset artifact: + +* a rolling range of input spans (e.g., last N-spans). +* a fixed set or range of input spans. +* use the same span(s) used to train a specific model. + +Different +[Resolvers](https://github.com/tensorflow/community/blob/master/rfcs/20190828-tfx-resolver.md) +will be created to support the use cases above. Users will need to link +different resolvers to the evaluator component for different data selection +strategies. + +#### Model Selection + +We assume that the baseline model is resolved using some identifier and then one +of the following options: + +* latest pushed +* latest blessed +* latest model output +* fixed selection (e.g., by pinning a specific model artifact) + +Each evaluated model is similarly resolved using an identifier and one of the +following options: + +* fixed artifact +* latest artifact + +Similar to data selection, +[Resolvers](https://github.com/tensorflow/community/blob/master/rfcs/20190828-tfx-resolver.md) +of different functionalities should be used to specify model selection +strategies. 
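+
+As a concrete illustration of the strategies above, the sketch below shows one
+way a "latest blessed" baseline could be resolved and wired into the Evaluator.
+It follows the `ResolverNode` pattern used in the Deployment Examples section
+later in this document; `LatestBlessedModelResolver` is a placeholder name for a
+resolver implementing the "latest blessed" strategy, and `example_gen` and
+`trainer` refer to the upstream components of the surrounding pipeline.
+
+```python
+# Illustrative sketch only; the resolver class name is a placeholder for
+# whichever resolver implements the chosen model selection strategy.
+baseline_model_resolver = ResolverNode(
+    instance_name='baseline_model_resolver',
+    resolver_class=LatestBlessedModelResolver,
+    baseline_model=Channel(type=Model))
+
+evaluator = Evaluator(
+    examples=example_gen.outputs['examples'],   # data selection
+    model=trainer.outputs['model'],             # candidate: latest artifact
+    baseline_model=baseline_model_resolver.outputs['baseline_model'])
+```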
+
+### Model Validation Config
+
+The configuration primarily controls which metrics to compute per model and
+which thresholds over these metrics control validation. The structure of the
+proposed configuration protos follows largely from the
+[existing config](https://github.com/tensorflow/model-analysis/blob/master/tensorflow_model_analysis/proto/config.proto).
+Here, we propose to add model-validation-related options to
+[MetricConfig](https://github.com/tensorflow/model-analysis/blob/1301797060a0e0d099d05eb4994f8879bce400ff/tensorflow_model_analysis/proto/config.proto#L162)
+as follows:
+
+```proto
+message MetricConfig {
+  // Name of a class derived from either tf.keras.metrics.Metric or
+  // tfma.metrics.Metric.
+  string class_name = 1;
+  // Optional name of module associated with class_name. If not set
+  // then class will be searched for under tfma.metrics followed
+  // by tf.keras.metrics.
+  string module = 2;
+  // Optional JSON encoded config settings associated with the class.
+  //
+  // The config settings will be passed as **kwarg values to
+  // the __init__ method for the class.
+  //
+  // Example: '"name": "my_metric", "thresholds": [0.5]'
+  string config = 3;
+
+  // If validate_absolute is configured then the metric is used
+  // for validation based on a threshold.
+  oneof validate_absolute {
+    GenericValueThreshold value_threshold = 4;
+    GenericValueCIThreshold value_ci_threshold = 5;
+  }
+
+  // If validate_relative is configured then validation uses a comparison
+  // of the metric between the model and the baseline.
+  oneof validate_relative {
+    GenericChangeThreshold change_threshold = 6;
+    GenericChangeCIThreshold change_ci_threshold = 7;
+  }
+}
+```
+
+The validation constraints are embedded in `metrics_specs` and are defined per
+metric. If there are no constraints (i.e., an empty validate_absolute and
+validate_relative), then the metrics will be computed only for evaluation.
+
+We also propose to add a model validation option based on the data validation
+result: if there are anomalies detected by the
+[ExampleValidator](https://www.tensorflow.org/tfx/guide/exampleval) component,
+the Evaluator will not bless the model.
+
+Model validation succeeds if all of the following conditions are satisfied:
+
+*   All configured validation constraints are true.
+*   If provided, the data-validation artifacts indicate that there are no data
+    errors.
+
+#### Confidence Interval thresholds
+
+The use of
+[confidence intervals](http://www.stat.yale.edu/Courses/1997-98/101/confint.htm)
+for validation is an important part of the supported functionality. Compared to
+a plain value threshold, a confidence interval adds statistical rigor to the
+estimate against which the model is validated. Confidence intervals are a common
+method for model validation. Next, we describe some concepts behind this feature.
+There are four types of thresholds:
+
+|                         | Absolute                | Change                   |
+| :---------------------- | :---------------------- | :----------------------- |
+| w/o Confidence Interval | GenericValueThreshold   | GenericChangeThreshold   |
+| w/ Confidence Interval  | GenericValueCIThreshold | GenericChangeCIThreshold |
+
+```proto
+message GenericValueThreshold {
+  double lower_bound = 1;
+  double higher_bound = 2;
+}
+
+enum Direction {
+  UNKNOWN = 0;
+  LOWER_IS_BETTER = 1;
+  HIGHER_IS_BETTER = 2;
+}
+
+message GenericValueCIThreshold {
+  double significance_level = 1;
+  double minimum_lower_end = 2;
+  double maximum_upper_end = 3;
+  double maximum_spread_when_insignificant = 4;
+  Direction direction = 5;
+}
+
+message GenericChangeThreshold {
+  double absolute = 1;
+  double relative = 2;
+  Direction direction = 3;
+}
+
+message GenericChangeCIThreshold {
+  // The significance level used for hypothesis testing. Verification
+  // will fail if the probability that the candidate model metric
+  // equals the baseline model metric is less than the significance
+  // level.
+  double significance_level = 1;
+  // The maximum width of the confidence interval (on the difference)
+  // for a verification to succeed. Set this to avoid verifying
+  // models based on unreliable metrics.
+  double maximum_spread_when_insignificant = 2;
+  // How to use the confidence interval on the relative difference
+  // between new and old metrics ((new - old) / old) in verification.
+  // For a CI on the relative diff with bounds diff_upper and
+  // diff_lower:
+  //   - ABSOLUTE => fail if (diff_lower > 0) or (diff_upper < 0)
+  //   - HIGHER_IS_BETTER => fail if diff_upper < 0
+  //   - LOWER_IS_BETTER => fail if diff_lower > 0
+  Direction direction = 3;
+}
+```
+
+##### Value CI thresholds
+
+Consider a metric that is validated based on an absolute value using confidence
+intervals (see `GenericValueCIThreshold` above). There are two ways to set the
+value CI thresholds, `minimum_lower_end` and `maximum_upper_end`, which can
+coexist.
+
+*   If the lower end of the interval is larger than `minimum_lower_end`, we
+    consider this metric to be significantly above the threshold; this is useful
+    for uptrend-favored metrics like accuracy.
+
+*   On the other hand, if the upper end of the interval is smaller than
+    `maximum_upper_end`, we consider the metric to be significantly below the
+    threshold; this is useful for downtrend-favored metrics like loss. A
+    minimal configuration sketch for both cases follows this list.
+
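+To make the two cases above concrete, below is a minimal sketch of how value CI
+thresholds might be attached to metrics. It assumes the proposed proto messages
+are exposed as Python classes under the `tfma` namespace, mirroring the style of
+the migration example in the Compatibility section below; the metric names,
+bounds, and significance levels are purely illustrative.
+
+```python
+import tensorflow_model_analysis as tfma
+
+# Illustrative only: assumes tfma.GenericValueCIThreshold and tfma.Direction
+# expose the proposed GenericValueCIThreshold message and Direction enum.
+metrics_with_value_ci = [
+    # Uptrend-favored metric: bless only if the lower end of the 95% CI on AUC
+    # is above 0.9.
+    tfma.MetricConfig(
+        class_name='AUC',
+        value_ci_threshold=tfma.GenericValueCIThreshold(
+            significance_level=0.05,
+            minimum_lower_end=0.9,
+            direction=tfma.Direction.HIGHER_IS_BETTER)),
+    # Downtrend-favored metric: bless only if the upper end of the CI on the
+    # loss stays below 0.3.
+    tfma.MetricConfig(
+        class_name='MeanSquaredError',
+        value_ci_threshold=tfma.GenericValueCIThreshold(
+            significance_level=0.05,
+            maximum_upper_end=0.3,
+            direction=tfma.Direction.LOWER_IS_BETTER)),
+]
+```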
+ +##### Change CI thresholds + +In the case of relative validation constraints, the user can still set a +significance level for the confidence intervals on the difference between old +and new metrics. There are three ways to use the change CI thresholds, +corresponding to the direction enum in the `GenericChangeCIThreshold` message: + +* `HIGHER_IS_BETTER` +* `LOWER_IS_BETTER` +* `ABSOLUTE` + +For each metric, a validation will compute the value on the two models, and a +confidence interval on the difference between the two metrics (computed as new - +old). + +* HIGHER_IS_BETTER + + * PASS: If the lower end of the interval is larger than 0.0, the + validation will pass, regardless of whether the CI width is larger than + the maximum_spread_when_insignificant. This is useful for + uptrend-favored metrics like accuracy. A passing validation result might + look like the result below. + +
+
+    *   FAIL: **If the CI upper endpoint is _below_ zero**, this means that the
+        new model is significantly worse than the previous model, and a
+        validation error will be raised, regardless of whether the CI width is
+        larger than the `maximum_spread_when_insignificant`.
+    *   PASS: If the CI contains zero and the CI width is less than
+        `maximum_spread_when_insignificant`, then the validation will also
+        pass.
+    *   FAIL: If the CI contains zero and the CI width is larger than
+        `maximum_spread_when_insignificant`, a validation error will be raised.
+
+*   LOWER_IS_BETTER
+
+    *   PASS: If the upper end of the interval is smaller than 0.0, we consider
+        the change in the metric to be significantly negative, and the
+        validation will pass regardless of whether the CI width is larger than
+        the `maximum_spread_when_insignificant`; this is useful for
+        downtrend-favored metrics like loss.
+
+
+    *   FAIL: **If the CI lower endpoint is _above_ zero**, this means that the
+        new model metric is significantly higher than the previous model, and
+        thus this represents a regression. A validation error will be raised
+        regardless of whether the CI width is larger than the
+        `maximum_spread_when_insignificant`.
+    *   PASS: If the CI contains zero and the CI width is less than
+        `maximum_spread_when_insignificant`, then the validation will also
+        pass.
+    *   FAIL: If the CI contains zero and the CI width is larger than
+        `maximum_spread_when_insignificant`, then a validation error will be
+        raised.
+
+*   ABSOLUTE
+
+    *   FAIL: If the CI lower endpoint is above zero, or the upper endpoint is
+        below zero, a validation error will be raised, regardless of whether the
+        CI width is larger than the `maximum_spread_when_insignificant`.
+
+    *   PASS: If the CI contains zero and the CI width is less than
+        `maximum_spread_when_insignificant`, then the validation will also pass.
+
+    *   FAIL: If the CI contains zero and the CI width is larger than
+        `maximum_spread_when_insignificant`, then a validation error will be
+        raised.
+
+### Output
+
+The component will output the following artifacts per model:
+
+*   A validation artifact with a [VerifierResult](#VerifierResult) payload that
+    explains which metric is blocking the blessing of the model.
+*   A metrics artifact for all slices with a payload that allows indexing per
+    slice key into a
+    [MetricsForSlice](https://github.com/tensorflow/model-analysis/blob/1301797060a0e0d099d05eb4994f8879bce400ff/tensorflow_model_analysis/proto/metrics_for_slice.proto#L240)
+    payload.
+*   A BLESSED artifact when the model passes all the specified thresholds.
+
+The last two outputs could be conceptually merged into a single artifact.
+However, they remain separate in this proposal to help with backward
+compatibility. Each output artifact is expected to carry a "model" property that
+links it back to the input model.
+
+**Note**: an alternative to the "model" property would be to record fine-grained
+lineage relationships between the output artifacts and the input models. This
+can be done in two ways in MLMD: by breaking up a single component execution
+into multiple "smaller" executions per model; or, by using special Contexts to
+associate outputs with specific inputs. One disadvantage of this approach is
+that it may dramatically increase the number of paths to be tracked by MLMD,
+making the lineage queries for TFMA a special case. Currently, we do not plan to
+follow this approach, so that the Evaluator can stay consistent with other TFX
+components.
+
+### VerifierResult
+
+We propose a VerifierResult payload that contains the result of a model
+verification run on a pair of models and reports the following:
+
+*   Any runtime error during the Evaluator run (runtime_status)
+*   Passing of verification (verification_ok)
+*   The specific model anomaly if verification is not passed
+    (per_head_verifier_results)
+*   The specific data anomaly if verification is not passed (data_anomaly). Note
+    that anomalies detected by the ExampleValidator are delivered in a separate
+    payload provided by that component.
+
+```proto
+message VerifierResult {
+
+  message MetricsValidationForSlice {
+    message Failure {
+      MetricKey metric_key = 1;
+      // Textual error message about which threshold is failing.
+      string message = 2;
+    }
+    SliceKey slice_key = 2;
+    repeated Failure failures = 3;
+  }
+
+  message DataAnomaly {
+    // True if there is no input for Model Validator.
This is mostly likely caused + // by empty example files. + bool input_is_empty = 1; + } + + // Any metrics validation failure or data anomaly will fail overall verifcation. + bool verificaton_ok = 1; + + // Details about which threshold is blocking which metric. + repeated MetricsValidationForSlice metric_validation_failures = 2; + // All data related anomaly will be here. + DataAnomaly data_anomaly = 3; +} +``` + +### Compatibility + +With the proposed Evaluator that has combined functionalities of Model +Validator, the current Model Validator will be deprecated. The current Model +Validator takes the latest exported model, compares it against the latest +blessed model on the eval split of the examples. The only metric being used for +gating the blessing is overall accuracy. To migrate it to the proposed +Evaluator, we can deploy the Evaluator as in +[Deployment Example](#Deployment-Example), and with the following setup: + +```python + model_analyzer_with_diff = Evaluator( + examples=example_gen.outputs['examples'], + model=trainer.outputs['model'], + baseline_model=latest_blessed_model_resolver.outputs['latest_blessed_model'], + tfma_config=tfma.Config( + model_specs=[ + tfma.ModelSpec(name="candidate", ...), + tfma.ModelSpec(name="baseline", ..., baseline=True) + ], + metric_specs=[ + tfma.MetricSpec( + model_name="candidate", + metrics=[tfma.MetricConfig( + class_name="tf.keras.metrics.Accuracy", + value_threshold=tfma.GenericValueThreshold(lower_bound=0))]) + ], + ...) + ) +``` + +## Deployment Examples + +The following shows a proposed way to configure the new component in a TFX +pipeline: + +```python +def _create_pipeline(pipeline_name: Text, pipeline_root: Text, data_root: Text, + module_file: Text, serving_model_dir: Text, + metadata_path: Text, + direct_num_workers: int) -> pipeline.Pipeline: + + """Implements the chicago taxi pipeline with TFX.""" + examples = external_input(data_root) + + # Brings data into the pipeline or otherwise joins/converts training data. + example_gen = CsvExampleGen(input=examples) + + # Computes statistics over data for visualization and example validation. + statistics_gen = StatisticsGen(examples=example_gen.outputs['examples']) + + # Generates schema based on statistics files. + infer_schema = SchemaGen( + statistics=statistics_gen.outputs['statistics'], + infer_feature_shape=False) + + # Performs anomaly detection based on statistics and data schema. + validate_stats = ExampleValidator( + statistics=statistics_gen.outputs['statistics'], + schema=infer_schema.outputs['schema']) + + # Performs transformations and feature engineering in training and serving. + transform = Transform( + examples=example_gen.outputs['examples'], + schema=infer_schema.outputs['schema'], + module_file=module_file) + + # Get the latest model so that we can warm start from the model. + latest_model_resolver = ResolverNode( + instance_name='latest_model_resolver', + resolver_class=latest_artifacts_resolver.LatestArtifactsResolver, + latest_model=Channel(type=Model)) + + # Uses user-provided Python function that implements a model using TF-Learn. + trainer = Trainer( + module_file=module_file, + transformed_examples=transform.outputs['transformed_examples'], + schema=infer_schema.outputs['schema'], + base_model=latest_model_resolver.outputs['latest_model'], + transform_graph=transform.outputs['transform_graph'], + train_args=trainer_pb2.TrainArgs(num_steps=10000), + eval_args=trainer_pb2.EvalArgs(num_steps=5000)) + + # Get the latest blessed model. 
+ latest_blessed_model_resolver = ResolverNode( + instance_name='latest_blessed_model_resolver', + resolver_class=latest_artifacts_resolver.LatestArtifactsResolver, + latest_model=Channel(type=Model)) + + # Performs model evaluations and model validations. + model_analyzer_with_diff = Evaluator( + examples=example_gen.outputs['examples'], + model=trainer.outputs['model'], + baseline_model=latest_blessed_model_resolver.outputs['latest_blessed_model'], + tfma_config=tfma.Config( + model_specs=[ + tfma.ModelSpec(name="candidate", ...), + tfma.ModelSpec(name="baseline", ..., baseline=True) + ] + ...) + ) + + return pipeline.Pipeline( + pipeline_name=pipeline_name, + pipeline_root=pipeline_root, + components=[ + example_gen, statistics_gen, infer_schema, validate_stats, transform, + latest_model_resolver, latest_blessed_model_resolver, trainer, model_analyzer, + model_validator, pusher + ], + enable_cache=True, + metadata_connection_config=metadata.sqlite_metadata_connection_config( + metadata_path), + beam_pipeline_args=['--direct_num_workers=%d' % direct_num_workers]) +``` diff --git a/rfcs/20200117-tfx-combining-model-validator-with-evaluator/before-after.png b/rfcs/20200117-tfx-combining-model-validator-with-evaluator/before-after.png new file mode 100644 index 000000000..a4578ad42 Binary files /dev/null and b/rfcs/20200117-tfx-combining-model-validator-with-evaluator/before-after.png differ diff --git a/rfcs/20200117-tfx-combining-model-validator-with-evaluator/change-CI-threshold-a.png b/rfcs/20200117-tfx-combining-model-validator-with-evaluator/change-CI-threshold-a.png new file mode 100644 index 000000000..901f9b18b Binary files /dev/null and b/rfcs/20200117-tfx-combining-model-validator-with-evaluator/change-CI-threshold-a.png differ diff --git a/rfcs/20200117-tfx-combining-model-validator-with-evaluator/change-CI-threshold-b.png b/rfcs/20200117-tfx-combining-model-validator-with-evaluator/change-CI-threshold-b.png new file mode 100644 index 000000000..159340df6 Binary files /dev/null and b/rfcs/20200117-tfx-combining-model-validator-with-evaluator/change-CI-threshold-b.png differ diff --git a/rfcs/20200117-tfx-combining-model-validator-with-evaluator/value-CI-threshold-a.png b/rfcs/20200117-tfx-combining-model-validator-with-evaluator/value-CI-threshold-a.png new file mode 100644 index 000000000..f6a0b7eca Binary files /dev/null and b/rfcs/20200117-tfx-combining-model-validator-with-evaluator/value-CI-threshold-a.png differ diff --git a/rfcs/20200117-tfx-combining-model-validator-with-evaluator/value-CI-threshold-b.png b/rfcs/20200117-tfx-combining-model-validator-with-evaluator/value-CI-threshold-b.png new file mode 100644 index 000000000..c746f35fe Binary files /dev/null and b/rfcs/20200117-tfx-combining-model-validator-with-evaluator/value-CI-threshold-b.png differ diff --git a/rfcs/20200117-tfx-generic-trainer.md b/rfcs/20200117-tfx-generic-trainer.md new file mode 100644 index 000000000..20098a623 --- /dev/null +++ b/rfcs/20200117-tfx-generic-trainer.md @@ -0,0 +1,253 @@ +# TFX Generic Trainer + +| Status | Accepted | +| :------------ | :-------------------------------------------------------- | +| **Author(s)** | Jiayi Zhao (jyzhao@google.com) | +| **Sponsor** | Konstantinos Katsiapis (katsiapis@google.com), Zhitao Li (zhitaoli@google.com), Karmel Allison (karmel@google.com) | +| **Updated** | 2020-01-17 | + +## Objective + +### Goal + +* Support any TensorFlow Training loop in TFX Trainer in addition to + tf.estimator, primarily focused on native Keras model. 
+ +### Non Goal + +* Natively support multi-worker distributed training by the system. +* Non-TF training that generates savedmodel. + +## Background and Motivation + +In current TFX Trainer component, only tf.estimator is supported for training +and generating models. User provides a module file which contains a +`trainer_fn`, trainer will call the function to get the estimator model and +related spec for training, and generate a saved model by +`tf.estimator.train_and_evaluate`. + +[tf.keras](https://www.tensorflow.org/guide/keras) is TensorFlow's high-level +API for building and training models. It’s currently supported in TFX by using +`tf.keras.estimator.model_to_estimator` in module file. User can create keras +model in their `trainer_fn` but need to convert it to estimator for return (for +example, +[cifar10](https://github.com/tensorflow/tfx/blob/r0.15/tfx/examples/cifar10/cifar10_utils.py)). + +This doc will focus on native Keras support (without model_to_estimator) in TFX. +We propose changing the user facing API to be more generic so that users can do +(single node) native Keras model training within TFX. + +## User Benefit + +* Allows non estimator based training, especially Keras as TensorFlow is + establishing Keras as the + [Standardized high-level API](https://medium.com/tensorflow/standardizing-on-keras-guidance-on-high-level-apis-in-tensorflow-2-0-bad2b04c819a). +* Allows + [custom training](https://www.tensorflow.org/tutorials/customization/custom_training) + for customization of training loop. + +## Detailed Design + +Below shows the pseudo code for current TFX Trainer’s executor: + +```python +class Executor(base_executor.BaseExecutor): + + def Do(self, input_dict: Dict[Text, List[types.Artifact]], + output_dict: Dict[Text, List[types.Artifact]], + exec_properties: Dict[Text, Any]) -> None: + """Uses a user-supplied tf.estimator to train a tf model locally.""" + trainer_fn = self._GetFn(exec_properties) # load from module file + trainer_fn_args = self._GetFnArgs( + input_dict, output_dict, exec_properties) + + training_spec = trainer_fn(trainer_fn_args) + tf.estimator.train_and_evaluate(training_spec['estimator'], ...) + # For TFMA (downstream evaluator and model validator component). + tfma.export.export_eval_savedmodel(training_spec['estimator'], ...) +``` + +And the user supplied module file contains a function called `trainer_fn` which +returns an estimator: + +```python +def _build_keras_model() -> tf.keras.Model: + model = keras.XXX + model.compile(...) + return model + +def trainer_fn( + trainer_fn_args: trainer.executor.TrainerFnArgs) -> Dict[Text, Any]: + """Build the estimator using the high level API. + + Args: + trainer_fn_args: Holds args used to train the model as name/value pairs. + + Returns: + A dict of the following: + - estimator: The estimator that will be used for training and eval. + - train_spec: Spec for training. + - eval_spec: Spec for eval. + - eval_input_receiver_fn: Input function for eval. + """ + ... + + estimator = tf.keras.estimator.model_to_estimator( + keras_model=_build_keras_model(), ...) + + return { + 'estimator': estimator, + 'train_spec': ..., + 'eval_spec': ..., + 'eval_input_receiver_fn': ... 
+ } + +``` + +We propose that in generic trainer's module file, user not only need to provide +the model, but also control how the model is trained (`train_and_evaluate` for +estimator and `model.fit` for keras will be in user module file instead of in +executor), thus executor can be generic to model, and users can customize the +[training loop](https://www.tensorflow.org/tutorials/customization/custom_training_walkthrough#training_loop). +The executor pseudo code would look like below: + +```python +class Executor(base_executor.BaseExecutor): + + def Do(self, input_dict: Dict[Text, List[types.Artifact]], + output_dict: Dict[Text, List[types.Artifact]], + exec_properties: Dict[Text, Any]) -> None: + """Train a user-supplied tf model.""" + run_fn = self._GetRunFn(exec_properties) # load from module file + + # run_fn_args contains + # 1. input train and eval data path. + # 2. desired output model path for the trained savedmodel. + # 3. training args, e.g., train/eval steps. + # 4. optional base model. + # 5. optional tuning result (kerastuner.HyperParameters config). + # 6. optional custom config for passing params from component. + run_fn_args = self._GetRunFnArgs( + input_dict, output_dict, exec_properties) + + run_fn(run_fn_args) + # Validates the existence of run_fn's output savedmodel. + ... +``` + +In module file, user needs to provide `run_fn` instead of previous `trainer_fn`. +The `trainer_fn` was responsible for creating the model, in addition to that, +`run_fn` also needs to handle training part and output the trained model to a +desired location given by run args: + +```python +def run_fn(args: trainer.executor.TrainerFnArgs) -> None: + """Build the TF model and train it.""" + model = _build_keras_model() + model.fit(...) + # Save model to args.serving_model_dir. + model.save(...) +``` + +In generic trainer, executor is mainly for handling the +[artifact](https://github.com/tensorflow/tfx/blob/r0.21/docs/guide/index.md#artifacts) +(a unit of data that is passed between components), all model related logic is +user supplied. + +A separate GenericExecutor will be created, and the existing trainer executor +will be sunsetted. We plan to keep estimator based executor for one more version +and then deprecate it. + +### How to convert current estimator based module file + +To convert the current estimator based module file (e.g., +[iris](https://github.com/tensorflow/tfx/blob/r0.15/tfx/examples/iris/iris_utils.py)) +for generic trainer, simply add a run_fn that calls the trainer_fn and train the +returned model (code that used to be in the trainer.executor.Do). + +```python +def run_fn(fn_args: executor.TrainerFnArgs): + """Train the model based on given args. + + Args: + fn_args: Holds args used to train the model as name/value pairs. + """ + schema = io_utils.parse_pbtxt_file(fn_args.schema_file, schema_pb2.Schema()) + + # Reuse the trainer_fn. + training_spec = trainer_fn(fn_args, schema) + + # Train the model + absl.logging.info('Training model.') + tf.estimator.train_and_evaluate(training_spec['estimator'], + training_spec['train_spec'], + training_spec['eval_spec']) + absl.logging.info('Training complete. Model written to %s', + fn_args.serving_model_dir) + + # Export an eval savedmodel for TFMA, note that for keras, eval savedmodel is + # not needed as TFMA2 can use serving model for evaluation. 
+ absl.logging.info('Exporting eval_savedmodel for TFMA.') + tfma.export.export_eval_savedmodel( + estimator=training_spec['estimator'], + export_dir_base=fn_args.eval_model_dir, + eval_input_receiver_fn=training_spec['eval_input_receiver_fn']) + + absl.logging.info('Exported eval_savedmodel to %s.', fn_args.eval_model_dir) +``` + +### tf.distribute.Strategy + +Distribution strategy will be user module's responsibilty with the new generic +trainer interface. To use it, user needs to modify the `run_fn()` in the module +file, below shows the pseudo code example for single worker and multi-worker +distribute strategy. + +For single worker distribute strategy, you need to create an appropriate +[tf.distribute.Strategy](https://www.tensorflow.org/api_docs/python/tf/distribute/Strategy), +and move the creation and compiling of Keras model inside `strategy.scope`: + +```python +def run_fn(args: trainer.executor.TrainerFnArgs) -> None: + """Build the TF model and train it.""" + mirrored_strategy = tf.distribute.MirroredStrategy() + with mirrored_strategy.scope(): + model = _build_keras_model() + model.fit(...) + model.save(...) +``` + +For multi-worker distribution strategy, the TFX Trainer does not have ability to +spawn multi-worker cluster by +[current executor](https://github.com/tensorflow/tfx/blob/r0.21/tfx/components/trainer/executor.py), +hence not covered in the scope of this RFC. If the execution environment of an +implementation of TFX Trainer has the ability to bring up the cluster of worker +machines, and execute user funtion in the workers with correct +[TF_CONFIG setup](https://www.tensorflow.org/guide/distributed_training#setting_up_tf_config_environment_variable), +such as GCP AI Platform Training service via +[extensions/google_cloud_ai_platform/trainer/executor.py](https://github.com/tensorflow/tfx/blob/r0.21/tfx/extensions/google_cloud_ai_platform/trainer/executor.py), +the `run_fn()` would look like below: + +```python +def _is_chief() -> bool: + """Decide whether the current worker's role is chief.""" + # Check TF_CONFIG (set by TFX when bring up the worker) in execution env. + ... + +def run_fn(args: trainer.executor.TrainerFnArgs) -> None: + """Build the TF model and train it.""" + ps_strategy = tf.distribute.experimental.ParameterServerStrategy() + with ps_strategy.scope(): + model = _build_keras_model() + model.fit(...) + if _is_chief(): + model.save(...) +``` + +For details about `tf.distribute.Strategy`, please refer to +[here](https://www.tensorflow.org/guide/distributed_training). + +## Future work + +* Examples for custom training loop. +* Native support for multi-worker distribution. diff --git a/rfcs/20200205-standalone-keras-repository.md b/rfcs/20200205-standalone-keras-repository.md new file mode 100644 index 000000000..46f174ca3 --- /dev/null +++ b/rfcs/20200205-standalone-keras-repository.md @@ -0,0 +1,503 @@ +# Standalone Keras Repository + +| Status | Accepted | +:-------------- |:---------------------------------------------------- | +| **RFC #** | [202](https://github.com/tensorflow/community/pull/202) | +| **Author(s)** | Qianli Zhu (scottzhu@google.com), Francois Chollet (fchollet@google.com) | +| **Sponsor** | Karmel Allison (karmel@google.com) | +| **Updated** | 2020-02-05 | + +## Objective + +Move the Keras code from the TensorFlow main GitHub repository to its own +repository, with TensorFlow as a dependency. + +## Motivation + +### TensorFlow API modularity + +Currently, Keras has to rely on a number of private TensorFlow APIs. 
However, a +litmus test of the quality of the public TensorFlow low-level APIs is that they +should be strictly sufficient to a higher-level API like Keras. +After splitting the repository, Keras will have to import TensorFlow and +rely exclusively on public APIs. If Keras still ends up using TensorFlow +private features, it might be an indication of tight coupling of +implementation details. If certain private features are extensively used, +we might want to consider exposing them as public low level API. + +This design is also aligned with the design for +[Modular TensorFlow](https://github.com/tensorflow/community/blob/master/rfcs/20190305-modular-tensorflow.md), +which splits the TensorFlow project into smaller components that are not +tightly coupled together. + +### Build times + +Building the open-source TensorFlow project end-to-end is an extensive exercise. +With a standard GCP instance, it might take more than one hour to finish the +whole build process (it might take longer with a Mac laptop). Although the local +build cache might help speed up the follow-up builds, the initial time cost is +too high for regular software development workflows. Internally, Google has a +distributed build and caching service, which Googlers heavily rely on, +that can build TensorFlow and run all Keras tests within 5 mins. Sadly, +we can't expose this to external contributors. + +Currently, any contribution to Keras code will require building all of +TensorFlow c++ binary, which is quite expensive to do for average users. +Having a separate repository will allow the Keras package to be built +without building TensorFlow. This should greatly improve the +velocity of open-source developers when they contribute to Keras code. + +### Community Benefit + +The difficulty of building TensorFlow from scratch in order to make a PR +to Keras code has been a significant source of issues: + +* It discouraged contributions, since many external developers couldn't test +their changes and make sure they were correct. +* External developers would send unverified PRs, and Google reviewers spend time +back and forth, fixing the PR. Sometimes PR is just not moving forward because +of the lengthy feedback loop. + +With the new standalone Keras repository, external contributors should +experience much shorter turn-around time when building/testing Keras, since they +don't need to build TensorFlow anymore. +This should have a positive impact on building a vibrant open-source +developer community. + +In addition, by getting the Keras team at Google to start developing Keras +using the same public tools and infrastructure as third-party developers, +we make the development process more transparent and more community-oriented. +In the meantime, some of the workload for repository management can be shared +with community so that Keras team member within Google won't be the bottleneck +for all the issues. + + +## Design Proposal + +### New location of the code + +GitHub: the code will live at [keras-team/keras](https://github.com/keras-team/keras), +joining the other Keras SIG projects and replacing the current external Keras +codebase. `tf.Keras` will also replace Keras on PyPI. + +Also considered: `tensorflow/keras`. + +Pros: +1. Under the umbrella of Keras SIG, which hosts all other Keras related projects +like keras-application, KerasTuner etc. +1. Lots of existing followers on keras-team, who may not be easily migrated to +TF project. +1. 
Can't easily delete the keras project, which already has tons of stars and
+incoming reference links. Continued existence of the external Keras code would create
+confusion ("why is there tensorflow/keras AND keras-team/keras?").
+
+Cons:
+1. The repo isn't under the same organization as tensorflow, which makes it hard
+to manage issues/PRs and references across the organization.
+1. Existing issues/PRs under the same org can be transferred easily, but not
+across different orgs. See [here](https://help.github.com/en/github/managing-your-work-on-github/transferring-an-issue-to-another-repository).
+
+### Source of Truth
+
+TensorFlow uses a Google-internal code repository as its source of truth. Every PR
+submitted through GitHub is converted to a Google-internal change first,
+submitted through the internal system, and then copied to GitHub as commits.
+At the same time, the PR is marked as merged with the corresponding commit hash.
+
+Likewise, issue tracking and code review take place through Google-internal tools.
+
+For Keras, since we are trying to promote community engagement, we hope to use
+GitHub as the source of truth. This will have the following implications:
+
+* We expect the majority of code development/contributions to come from GitHub,
+and the dev tools / tests / scripts should focus on the GitHub development use
+case. See below for more details.
+* The Keras CI/presubmit build for the GitHub repo should target a stable pip
+version of the tensorflow package as a dependency. It could either be (preferably in
+this order):
+  * a stable version
+  * a release candidate version
+  * a `tf-nightly` with an explicit version.
+Using a nightly version for testing should be motivated by the usage of an API
+feature not present in the stable or pre-release version.
+Depending on a floating `tf-nightly` could make the CI build unstable, which has
+been observed in other repositories
+[like tf-addons](https://github.com/tensorflow/addons/pull/912).
+* The Keras code will be mirrored to a Google-internal code repository via
+Google-internal tools within a very short time window after each change.
+The Google-internal CI tests will run on HEAD for both Keras and TF code.
+* The CI build for the repository on GitHub might break when it points to a
+new version of `tf-nightly`, if certain behavior has been changed and wasn't
+caught by unit tests. We have observed a few similar cases with
+[tf/addons](https://github.com/tensorflow/addons).
+We hope this can be reduced by stronger unit test coverage in Google-internal
+systems, where both TF and Keras code are tested at HEAD.
+* Pip package management. Keras will now follow the `tf-estimator` approach:
+"pip install tensorflow" should also install Keras (from PyPI).
+There are more details for the pip package in the
+[Improved pip package structure](https://github.com/tensorflow/community/pull/182) RFC.
+
+### Dependency Cleanup
+
+As the high-level API of TensorFlow, Keras should have a direct dependency on
+TF low-level APIs, but not the other way around. Unfortunately, there is some
+existing reverse logic in the TF code that relies on Keras, which we should
+update/remove when we split the repository.
+
+So far there are about 120 usages of Keras within TensorFlow; the current usages
+are:
+* Unit tests, which rely on Keras to verify certain behavior of TF, like
+distribution strategy, tf.function, and eager context. They should either be
+converted to integration tests, or be ported to the Keras repository.
+* `feature_column`, which uses the Keras base layer and model.
+* Legacy `tf.layers` in the v1 API, which uses the Keras base layer as its base class.
+* Legacy RNN cells, which use Keras serialization and deserialization.
+* TPU support code, which does an isinstance() check for `optimizer_v2`.
+* TF Lite, for Keras model saving utils.
+* Aliases from tf.losses/metrics/initializers/optimizers in tf.compat.v1.
+* The Keras symbolic tensor check in the ops library for tf.function.
+
+The conclusions from the design meetings are:
+1. We prefer to have a clear cut, which means any backwards dependency from TF to
+Keras is not accepted. LazyLoading should be used in such cases.
+2. The `feature_column` package will be moved to the TensorFlow Estimator project,
+and will still be exported under the tf.feature_column namespace.
+3. Legacy `tf.layers` and RNN cell code will move to Keras and still be exported
+under the same namespace.
+4. TPU support code will change to use the new type spec, which will be proposed
+in a new RFC from Dan.
+5. The saving util will either be moved to TF, or copied to TF Lite.
+6. Any unit test in TF that relies on Keras should either be moved to Keras, or,
+if it is an integration test, we will create a new package for it, which will
+do verification e2e.
+7. For the Keras symbolic tensor check in the ops library, we will do a rewrite with
+composite tensor. Since this is an implementation detail, not a hard code-level
+dependency, this shouldn't be a blocking issue.
+
+**Note that this is a key point to prevent Keras from accidentally breaking TensorFlow.**
+
+
+### Update Keras to only use public TF APIs
+
+The current Keras code will still work if we do e.g.:
+```python
+from tensorflow.python.ops import array_ops
+
+ones = array_ops.ones([2, 3])
+```
+
+However, since Keras is a separate repository, having it only use TF
+public APIs will heavily reduce the chance of breakage caused by relying
+on private methods or implementation details. We think this point is
+critical to the health of the project. This also allows TF to change internal
+implementation details without worrying about breaking Keras.
+
+The converted code should look like e.g.:
+
+```python
+import tensorflow as tf
+
+ones = tf.ones([2, 3])
+```
+
+During this conversion, we might notice that certain TF features used in Keras
+are not public. A decision should be made on a case-by-case basis:
+
+* Copy the functionality from TF to Keras.
+* Replace the usage with another alternative TF public API.
+* Make the functionality a new TF public API.
+
+So far there is a long tail of private API usages, but they shouldn't block
+the repo split, as long as the majority of the usages have been addressed.
+
+**Note that the open-source community is encouraged to contribute to this effort.**
+
+### Two-stage change process
+
+For any change that affects both TensorFlow and Keras, the change
+will need to be split into two: one PR to the TF repo,
+and another PR to the Keras repo. This will introduce overhead and slow
+down changes for areas like distribution strategy and other areas that
+might be under active development.
+
+With the internal change history between 2019-01-01 and 2020-01-01:
+1. There were 6756 changes submitted to tensorflow/python.
+2. There were 5115 changes submitted to tensorflow/python but not
+tensorflow/python/keras.
+3. Among the 1641 changes submitted to tensorflow/keras, 1338 of
+them changed Keras only without touching TensorFlow, and 303 of them changed both
+Keras and TF.
+ +This means about 18.5% change that change Keras will change TF, and +4.4% change that change TF will touch Keras in the meantime. + +Here are some common scenarios: + +1. Adding a new feature to TensorFlow, and having Keras rely on it. Note that +the TF change needs to be submitted first, and the Keras PR needs to wait for +the new TF nightly to become available on PyPI. + +Also note that any rollback of the TF PR will cause Keras to break, the +rollback sequence should be PR 33333 and then PR 22222 (see example below). +The Google-internal test for TF should catch the error if the rollback sequence +is not correct. + +```python +# Existing scenario. +# PR 11111 (2 files updated) +# +++ tensorflow/python/ops/array_ops.py +def some_new_function(inputs): + ... + +# +++ tensorflow/python/keras/layers/core.py + +class new_layer(Layer): + + def call(inputs): + array_ops.some_new_function(inputs) + ... +``` + +```python +# New scenario. +# PR 22222 (1 file updated) +# +++ tensorflow/python/ops/array_ops.py +@tf.export('some_new_function') +def some_new_function(inputs): + ... + +================================== +# PR 33333 (1 file updated) +# +++ tensorflow/python/keras/layers/core.py + +class new_layer(Layer): + + def call(inputs): + tf.some_new_function(inputs) + ... +``` + +2. Changing the behavior of an existing TF API. + +Note that the PR 22222 needs to be submitted with both the new and old +function since Google internal CI is still testing from HEAD. +The previous function can be +deleted after PR 33333 is submitted. Also note that this issue is caused by +Keras not using exclusively public TF API, but relying on TF implementation details. +Moving towards only using public APIs should reduce the likelihood of this kind of issue. + +```python +# Existing scenario. +# PR 11111 (2 files updated) +# tensorflow/python/ops/array_ops.py +<<< +def existing_function(inputs): + ... +>>> +def new_function(inputs, knob1=False, knob2=1): + ... +# tensorflow/python/keras/layers/core.py + +class existing_layer(Layer): + + def call(inputs): +<<< + array_ops.existing_function(inputs) +>>> + array_ops.new_function( + inputs, + knob1=True, + knob2=3) +``` + +```python +# New scenario. +# PR 22222 (1 file updated) +# tensorflow/python/ops/array_ops.py +<<< +def existing_function(inputs): + ... +>>> +def existing_function(inputs): + return new_function( + inputs, + knob1=False, + knob2=1) + +def new_function(inputs, knob1, knob2=1): + ... + +================================== +# PR 33333 (1 file updated) +# tensorflow/python/keras/layers/core.py +class existing_layer(Layer): + + def call(inputs): +<<< + array_ops.existing_function(inputs) + ... +>>> + array_ops.new_function( + inputs, + knob1=True, + knob2=3) +``` + +### Continuous integration and presubmit test +Due to the fact that Keras code is also being used within Google, apart from +the normal Github CI (action) tests, We will also run the same tests internally +against HEAD. +1. Github CI and presubmit test will use a stable version of TF binary during +test. +2. Google CI and presubmit test will run against HEAD for both TF and Keras +code. Note that we won't allow submiting Keras code directly to Google +internal code repo, engineers within Google are still allowed to create changes +internally and run test for it. + +The gap between the HEAD version and TF used by Keras should be +as close as possible. Large gap is expect to cause issue for debugging and code +tracing. + +There are a few common cases that either CI could break: +1. 
Github CI could break when the version of TF it depends on is changed. We
+think this can be mitigated by pinning Keras to an explicit version of TF, rather
+than a floating version like `tf-nightly`. The presubmit test run when changing the
+version number should catch this. In the case that a new stable version
+breaks some Keras tests, we should:
+
+   * Disable the failing tests and move forward, to minimize the gap between
+   TF HEAD and the version Keras uses. Report the issue to the TF team for a fix.
+
+   * In the case of major breakage, Keras will stay with the old version, report
+   the issue to the TF team, and get it fixed.
+
+   We hope the second case will be rare, since the same tests also run on
+   Google CI. Any change that might break Keras should be caught
+   by internal presubmits.
+
+2. Google CI could break when a submitted PR for Keras is mirrored into the Google
+code base. We can't foresee these breakages since we don't run a global presubmit
+internally for every CL. In the case of breakage, since external contributors
+won't notice this, the Keras team in Google will:
+
+   * Roll back the original Keras PR if the fault is on the Keras side (missing test
+   coverage, or a bad code interface).
+
+   * Update the internal tests to correctly rely on the Keras public contract, or
+   disable the failing test for the moment.
+
+   We hope both cases can be minimized by the internal dependency cleanup, as
+   well as by only relying on public TF APIs as described above.
+
+
+### Github Repository Migration
+
+* Any open Github PR/issue in keras-team/keras needs to be copied to
+TensorFlow if the content is still relevant in TensorFlow. Otherwise it will
+be closed as obsolete. We intend to have a clean keras-team/keras repository
+before we copy any issue or PR from the TF side.
+* For any open PR in TensorFlow that touches Keras, the team will try to merge it
+before the migration as much as possible. For any open PR that hasn't been merged,
+we will check whether it is still relevant/active, and if so it will be copied to
+keras-team/keras.
+* The permissions of keras-team/keras need to be updated as the codebase is new.
+The access levels for the repository need to be reestablished.
+From least access to most access, the permission levels for an organization
+repository are:
+
+   * Read: Recommended for non-code contributors who want to view or discuss the
+   project.
+   * Triage: Recommended for contributors who need to proactively manage issues
+   and pull requests without write access.
+   * Write: Recommended for contributors who actively push to your project.
+   * Maintain: Recommended for project managers who need to manage the repository
+   without access to sensitive or destructive actions.
+   * Admin: Recommended for people who need full access to the project, including
+   sensitive and destructive actions like managing security or deleting a
+   repository.
+
+Any existing active keras-team member should get the `Triage` level for now, and
+more permissions will be granted once we have identified active contributors. In the
+meantime, the Keras team in Google will manage the repository initially, and will
+share more permissions with community members.
+
+See more details about repository permissions in https://help.github.com/en/github/setting-up-and-managing-organizations-and-teams/repository-permission-levels-for-an-organization.
+
+### Alternative Considered
+
+Split the TensorFlow Python and C++ code into separate pip packages, e.g.
+tf-core and tf-python, where tf-python uses a stable version of the tf-core
+package to build TensorFlow Python.
+We would have to maintain the compatibility between the C++ layer and the Python
+layer, which is currently quite stable.
+
+  Pros:
+  * This should allow us to enjoy a build-time speedup for the OSS build, since
+  building/testing TF won't require building all the C++ kernels, which account
+  for the majority of the build time. Internal CI won't be affected since it will
+  always run against HEAD.
+  * All the Python code still lives in one repository, so we don't need to
+  split a change into two if it changes Keras and TF Python at the same time.
+
+  Cons:
+  * A change that touches both the C++ kernels and TF Python code will need the
+  two-stage commit process, if the Python change relies on the kernel change.
+  * There is less motivation to clean up the cross dependencies between TF and
+  Keras, since it is no longer a required task.
+  * With the Google-internal code repo as the source of truth, most of the
+  workflows/tools will still be Google-centric instead of GitHub-centric.
+  * The keras-team/keras code base will still be there if we don't move the new TF
+  code to it. Having a stale version out there is not ideal, and we should
+  really merge them (code/issues/community members) together.
+
+### Performance Implications
+
+There may be some performance implications as we move towards only using
+public TF APIs. We need to maintain a benchmark to ensure that there
+is no performance regression.
+
+### Dependencies
+
+The TensorFlow pip package will auto-install the Keras package, which shouldn't
+make any difference on the end-user side. Under the hood, Keras will be a
+different package imported by `tf_core`, like what we do for TF Estimator.
+
+### Developer Experience Impact
+
+* The local build and test times should be greatly reduced, since compiling TF
+is no longer needed, and Keras is so far a pure-Python project (this might
+change in the future when custom C ops are added to Keras).
+* Cross-boundary changes will require some extra handling, since such changes
+need to be split into two or more PRs. The same applies to rollbacks.
+* Tooling on the GitHub side (for code review, etc.) is not as good as
+Google-internal tools. This may impact the development velocity of
+Keras team members at Google.
+
+### Best Practices, Tutorials and Examples
+
+* The new Keras repository should have a new contribution guide about how to
+set up a local test environment and iterate based on that. A similar guide in
+tf/addons can be used as an example.
+* The existing TF docs need to be updated to highlight that Keras code now lives
+in a different repository, with a new process for sending PRs, etc.
+* When filing an issue, people might need to consider where to send it,
+e.g. is it a Keras issue, or an issue caused by TF but surfaced by Keras? The
+different ownership of the repositories will also make transferring issues more
+difficult.
+
+### User Impact
+
+* No end-user facing change for current TF users; the split would only affect
+developers, e.g. in-flight PRs during the transition period.
+* Current Keras pip package users will get the new TF Keras package
+when they update their pip package, which should have more features than the
+current keras-team version.
+
+## Questions and Discussion Topics
+
+1. Tools for issue tracking: we can't rely on the Google-internal bug tracking tool
+since it's not publicly visible; also, if managing GitHub issues across the orgs
+is hard, we might need to find some alternatives for tracking bugs/features, etc.
+2. OSS tests for TPU-related code.
Since TPUs are not available during local +testing, the verification will have to be done when the PR is mirrored to Google's +internal systems. +3. Transition period for moving the Keras code from `tensorflow/tensorflow` to +`keras-team/keras`. All in-flight PRs / issues will be affected: they need +to be copied to `keras-team/keras`, or if they also touch TensorFlow, then they +need to split into two. diff --git a/rfcs/20200211-tf-types.md b/rfcs/20200211-tf-types.md new file mode 100644 index 000000000..ba027eeb2 --- /dev/null +++ b/rfcs/20200211-tf-types.md @@ -0,0 +1,343 @@ +# TensorFlow Canonical Type System + +| Status | Accepted | +:-------------- |:---------------------------------------------------- | +| **RFC #** | [208](https://github.com/tensorflow/community/pull/208) +| **Author(s)** | Dan Moldovan (mdan@google.com) | +| **Sponsor** | Gaurav Jain (gjn@google.com) | +| **Updated** | 2020-03-21 | + +## Objective + +This RFC proposes a new TensorFlow module and namespace (`tf.types`) dedicated to storing implementation-free type definitions, similar to Java interfaces or C++ forward declarations. This module has no other dependencies inside TensorFlow, so any other internal module can depend on it to ensure interoperability without the risk of creating circular dependencies. These definitions can also be used by external users, for example in pytype annotations. +The RFC focuses on the Python API, however the design should be reviewed with cross-language consistency in mind. + +## Motivation + +**Interoperability and composability**. A set of standard types that formalize an interface and decouples it from implementation ensures composability between components, especially when multiple implementations are involved. + +**Supports the [acyclic dependencies principle](https://en.wikipedia.org/wiki/Acyclic_dependencies_principle)**. In many instances, circular dependencies are caused between low-level complex components that need to compose (in this [example](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/autograph/operators/control_flow.py#L361), AutoGraph needs to recognize datasets, and datasets need to use AutoGraph). Interface extraction is a common pattern for breaking such cycles. + +**Supports pytype**. A set of static types that is consistent under Python’s `isinstance`/`issubclass` is required to support [PEP-484 type annotations](https://www.python.org/dev/peps/pep-0484/) in TensorFlow. This module can serve as the basis for that. + +**Helps formalize requirements for new APIs**. Having a formal, implementation-independent definition for things such as tensors, variables, iterables, iterators makes it easy to document and test compatibility between APIs. + +## User Benefit + +Application developers may use these canonical definitions for pytype annotations. + +Library developers can more easily define their API interfaces by referring to this namespace. + +Developers of modules internal to TensorFlow can use this module to avoid creating circular dependencies. + +## Design Proposal + +### The `tf.types` Namespace / Module + +All the declarations exposed under the `tf.types` namespace reside in the `python/types/*.py` module. These are [abstract base classes](https://docs.python.org/3.7/library/abc.html) with a bare minimum of method definitions and minimal or no implementation, which serve to formalize and document the contract of common types such as `Tensor`, `Variable`, etc. 
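+
+For illustration, a minimal sketch of what one of these implementation-free
+declarations might look like is shown below. This is a hypothetical example,
+not the exact code that will live in `python/types/`:
+
+```python
+import abc
+
+
+class Tensor(abc.ABC):
+  """Implementation-free contract for a dense tensor."""
+
+  @property
+  @abc.abstractmethod
+  def dtype(self):
+    """Returns the element type of this tensor."""
+
+  @property
+  @abc.abstractmethod
+  def shape(self):
+    """Returns the (possibly partially-known) static shape of this tensor."""
+
+
+# A concrete class elsewhere in TensorFlow (e.g. the existing tf.Tensor) would
+# subclass this declaration, so that isinstance(x, tf.types.Tensor) checks work
+# without tf.types depending on any implementation module.
+```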
+ +These definitions may be used as PEP 484 type hints, although in some cases they may be type- or shape- erased (for example, `tf.types.Tensor` may not necessarily be parametrized by `dtype` or `shape`). Note however that designs which parametrize on shape do exist, see for instance [tensorflow#31579](https://github.com/tensorflow/tensorflow/issues/31579). + +The type definitions are consistent with standard Python [subtyping mechanics](https://docs.python.org/3.8/library/typing.html#nominal-vs-structural-subtyping) such as instance checks or protocols (in versions prior to Python 3.8, it is difficult to simultaneously support both). + +### General Principles + +This module should not contain any implementation code. An advantage of that is that users exploring the implementation of specific types will not need to inspect this module. However, users who do not wish to inspect the code may visit the documentation of these generic types to better understand specifically what are the concrete subclasses of this type expected to do. + +The `tf.types` module may depend on external packages (such as `numpy`) _strictly for the purpose of defining type annotations and documentation_. No dependencies to other TensorFlow interfaces are allowed. Any dependencies on external packages which themselves depend on TensorFlow are expressly forbidden. + +Changes to definitions inside `tf.types` must be approved by TensorFlow leads, and typically should be accompanied by an RFC. + +All type declarations are based on PEP-484 and related specifications, and defined using [typing](https://docs.python.org/3/library/typing.html), with the aim of being compatible with static type checkers like [pytype](https://github.com/google/pytype), [mypy](http://mypy-lang.org/), [pyre](https://pyre-check.org/). + +It is recommended that internal and external type annotations, `isinstance` and `issubclass` checks use these types, eventually deprecating helpers like `tf.is_tensor`. However, concrete types continue to exist - for example, variables are instances of `tf.Variable`, which is now a subclass of `tf.types.Variable`. + +Class type definitions define a minimum of abstract methods and properties which are required for pytype compatibility. + +### Custom `Tensor` types and `tf.register_tensor_conversion_function` + +Custom objects can be used in standard TF operations using [tf.register_tensor_conversion_function](https://www.tensorflow.org/api_docs/python/tf/register_tensor_conversion_function). This dependency injection mechanism allows implicit casting from existing types such as list and tuple without modifying the type definition of these objects. + +``` +>>> class MyClass: +... pass +>>> def conversion_func(value, dtype=None, name=None, as_ref=False): +... return tf.constant(1) +>>> tf.register_tensor_conversion_function(MyClass, conversion_func) +>>> obj = MyClass() +>>> tf.convert_to_tensor(obj) + +``` + +However, `register_tensor_conversion_function` is not compatible with static type checking. + +As a side note, types which that can be converted to a [NumPy array](https://docs.scipy.org/doc/numpy/user/basics.dispatch.html#basics-dispatch) can leverage that mechanism instead, because TensorFlow supports implicit conversion from `ndarray`: + +``` +>>> class MyClass: +... def __array__(self): +... 
return np.array(1) +>>> obj = MyClass() +>>> tf.convert_to_tensor(obj) + +``` + +For true custom `Tensor` objects, we propose a protocol approach similar to NumPy’s, as alternative to `register_tensor_conversion_function`: + +``` +>>> class MyClass: +... def __tf_tensor__(self): +... return tf.constant(1) +>>> obj = MyClass() +>>> tf.convert_to_tensor(obj) + +``` + +Note that the mechanism above can be made compatible with static type checks using [typing.Protocol](https://www.python.org/dev/peps/pep-0544/#defining-a-protocol): + +``` +>>> class SupportsTensor(Protocol): +... def __tf_tensor__(self): +... pass +>>> def f(x: SupportsTensor): +... pass +>>> obj = MyClass() +>>> f(obj) # passes static type checks +``` + +Ultimately, `TensorLike` is to become a union of the protocol type along with any other types supported through legacy mechanisms: + +``` +TensorLike = Union[List, Tuple, tf.Tensor, SupportsTensor, ...] +``` + +The `Protocol` type is only standardized in Python 3.8. Backports exist through [typing_extensions](https://github.com/python/typing/tree/master/typing_extensions), although they still don’t support Python 3.5. Therefore, typing annotations will only be supported in 3.6+, and complete support is only available in 3.8+. + +Note that `Protocol` subtypes require [@runtime_checkable](https://www.python.org/dev/peps/pep-0544/#runtime-checkable-decorator-and-narrowing-types-by-isinstance) in order to be compatible with `isinstance`. However, that degrades the performance of `isinstance` in a way similar to `abc.ABCMeta`. For that reason, TensorFlow internal logic is encouraged to use the the more direct `hasattr` test for structural type checks of this kind. + +Although this RFC proposes the deprecation of `register_tensor_conversion_function`, it does not set a timeline for removing it. It remains an open question whether interim support for type annotations should be added to `register_tensor_conversion_function`. + +### Support for `tf.function`'s `input_signature` + +Note: this section is non-normative, and only establishes direction for future work. + +The type system listed here can be expanded to allow input signatures using type annotations, see for instance [this thread](https://github.com/tensorflow/tensorflow/issues/31579). + +Presently, the [input_signature](https://www.tensorflow.org/api_docs/python/tf/function) optional mechanism uses [tf.TensorSpec](https://www.tensorflow.org/api_docs/python/tf/TensorSpec) to describe the function arguments: + +``` +>>> @function(input_signature=[TensorSpec([3], dtype=int32)]) +... def f(x): +... tf.print(x) +>>> f(constant([1, 2, 3])) +[1 2 3] +>>> f(constant([1, 2])) # Shape mismatch +ValueError: Python inputs incompatible with input_signature +>>> f(constant([1.0, 2.0, 3.0])) # DType mismatch +ValueError: Python inputs incompatible with input_signature +``` + +It is expected however that some or all of this information will be repeated by the function's type annotations. Type annotations may be generic, for example by only specifying a dtype: + +``` +>>> @function(input_signature=[TensorSpec([3], dtype=int32)]) +... def f(x: Tensor[int32]): +... ... +``` + +In such cases, `tf.function` is expected to verify that the type annotation matches the `input_signature`. + +In the long term, this RFC recommends that type annotations fully replace the `input_signature`, so far as the Python type annotation system allows it. 
This RFC does not prescribe a scheme for such type annotations; the implementation should make best use of the available Python capabilities and standards. + +An example of such annotations could be: + +``` +>>> class BatchSize(Dimension): +... value = 32 +>>> @function +... def f(x: Tensor[Int32, Shape2D[BatchSize, DynamicSize, DynamicSize]]): +... ... +``` + +Internally, such type annotations should still be represented as `tf.TypeSpec` objects, ensuring backward compatbility. + +### Initial Type Hierarchy + +TensorFlow generally adopts an incremental development method. This RFC aims to remain consistent with that. + +Below are listed the major types presently used in TensorFlow. All types included in this list are subject to [normal compatibility rules](https://www.tensorflow.org/guide/versions), so they are unlikely to change in the future. It is therefore preferable to maintain a strict minimum of orthogonal declarations and carefully vet any additions. + +Most of these symbols will not be initially exported as public symbols. Only internal submodules will be able to use unexported types. The unexported types may be gradually exposed under `tf.types` or under `tf.types.experimental`. + +The initial type hierarchy is focused on V2 symbols. We expect to encounter places where these symbols would not be compatible with V1 code; in such cases, the V1 symbols will not be affected. + +#### Types created by this RFC + +These types will be added with the initial creation of the `tf.types` namespace. + +* Core tensor types + + * `DType` + * `Shape` + * `Tensor` - generic dense tensor + * `TensorLike` - any type that can be implicitly converted to `Tensor` (see for example https://github.com/tensorflow/addons/blob/master/tensorflow_addons/utils/types.py) + +#### Potential types for subsequent implementation + +These types are raised for discussion by this RFC, but are not part of the original implementation, unless they are strictly required for consistency (to be determined during the initial submission). + +Many of these are expected to be required when breaking the cyclic dependencies that currently exist between submodules. However, it is hoped that opening them up for discussion early can help create a more coherent type system. + +* Core types + + * Tensor specializations + * `Symbol` - the regular graph tensor + * `Value` - eager tensors + * `Variable` + +* Container types + + * `Composite` - low-level static structure (opaque to GraphDef/IR) + * `Module` - builder for structures of `Variables` (invisible to GraphDef/IR) + * `Optional` - basic programming construct, currently prototyped in `tf.data.experimental.Optional`; unlike `typing.Optional`, it doesn't include `None` + * `List` - superclass for `TensorArray`, `Queue`, etc. (opaque to GraphDef/IR) + +* Higher-level types + * `Dataset` - ETL pipeline + * `Iterator` - basic stateful programming construct + * `Iterable` - basic stateless programming construct + * `Function` - basic programming construct + * `Error` - superclass of all TF-specific errors + + * Distributed types + * `DistributedDataset` - collective ETL + * `DistributedIterator` - collective iterator + + * Low-level execution primitives + * `Graph` - GraphDef/IR program + * `FunctionGraph` - IR of a single concrete function + +#### Adding new types + +This module may contain public symbols, exported using `@tf_export`, and private (unexported) symbols. Private symbols are reserved exclusively for internal submodules. 
Only public types are subject to the normal compatibility guarantees. + +Private types should only be added here if more than one submodule requires them. + +Public types represent established, well-known types that are critical to the TensorFlow API. They may only be added with API owners approval. In general, a type should be thoroughly scrutinized before being made public. Prefer to err on the side of keeping it private, when in doubt. Ideally, new public types should be introduced using the RFC process. Using `experimental` to pilot new types before a complete implementation is encouraged. + +A good candidate for a public `tf.types` definition meets the following criteria: + * has at least two concrete implementations, and at least one is part of the core TensorFlow API + * represents a well-established programming abstraction (e.g. list, map, object) + * does not overlap with existing definitions + * is consistent with existing definitions + * is compatible with all applicable core APIs + +#### Detailed notes + +##### Optional + +`tf.types.Optional` is the [nullable type](https://en.wikipedia.org/wiki/Nullable_type) in TensorFlow. + +Example graph code: + +``` + >>> ds = tf.data.Dataset.range(3) + >>> itr = iter(ds) + >>> opt_next = tf.data.experimental.get_next_as_optional(itr) + >>> tf.print(opt_next.has_value(), opt_next.get_value()) + 1 0 +``` + +It is not fully equivalent with `typing.Optional`, as TensorFlow has no explicit type or value for `None`. + +### Alternatives Considered + +* N/A + +### Performance Implications + +* There is a potential performance concern if using `abc` for the abstract base types, which are about an order of magnitude slower for `isinstance` checks. The cost of `isinstance` may be non-negligible for eager execution or scaling to large graphs. In such cases, we may want to avoid using `abc`. See https://github.com/tensorflow/community/pull/208#discussion_r382494902. + +### Dependencies + +* None, by definition. + +### Engineering Impact + +* Engineering impact: Separate interfaces allow for faster loading times by reducing coupling between modules. +* Maintenance: Minimal maintenance overhead since there is no functionality involved. The TensorFlow team and contributors will maintain the documentation up to date. Changes should be reviewed and approved by the TensorFlow team leads. + +### Platforms and Environments + +* Platforms: Python only, in the first stage. However, the type system should be aligned as much as possible with the core types in the TensorFlow runtime, and be language-independent as much as possible. +* Execution environments: The type system is independent of platform. This also implies that no platform-specific types (such as `TPUTensor`) exist. + +### Best Practices + +* This set of type definitions support the acyclic dependencies principle, by requiring that implementations avoid lateral dependencies (e.g. with a linter rule). + +### Tutorials and Examples + +* As the design matures, we plan to showcase libraries that leverage this pattern. +* Type annotations will be included in existing tutorials as definitions become final. + +### Compatibility + +* New minor version. Existing classes (`tf.Tensor`) will become subclasses of the new type interfaces. +* Most subcomponents of TF (Lite, distributed, function, SavedModel) will depend on this new module, although their functionality is not impacted. 
+
+* Libraries which depend on TensorFlow are encouraged to refer to the `tf.types`
+definitions, rather than the concrete implementations, for better future
+compatibility.
+
+### User Impact
+
+* Users will see a new `tf.types` module that may be referenced from
+documentation and type annotations.
+
+## Design Review Notes
+
+Changes since initial PR:
+ * New section discusses overlap of the type system with the existing register
+   tensor conversion function:
+   * TLDR: The Python protocol mechanism is much more compatible with type
+     annotations than register_tensor_conversion_function.
+   * The Python Protocol works with type annotations.
+   * We should be able to deprecate `register_tensor_conversion_function`.
+   * Protocol support for type annotations was introduced with typing in 3.8 and
+     backported to 3.x except 3.5, so type annotations will not work correctly in 3.5.
+ * New section discussing the relationship with TensorSpec/TypeSpec/input signatures:
+   * TLDR: We will continue to have input_signature in conjunction with function
+     type annotations. A complete design for annotation-based input_signature is
+     future work and currently out of scope.
+   * We will need to verify that input_signature agrees with type annotations when
+     both are present.
+   * There is concern about having two ways of expressing signatures (type
+     annotations and input_signature).
+   * There were many questions around the best way to represent known and unknown
+     shapes, interaction with concrete functions, the concrete type of the result
+     of tf.function, etc. A few points raised:
+   * `input_signature` cannot be recognized as a type annotation.
+   * Supporting parameterized types with values, as required for shapes, is tricky.
+   * Dimension values can be known or not known in today’s implementation; `Literal`
+     may be able to specify dimension sizes, but that needs to be verified.
+   * Supporting meta functions where the signature is an argument still requires
+     the input signature mechanism (e.g. get_concrete_function).
+   * If we add this to `tf.function`, should the function take a signature argument
+     that looks like the type annotation? Unclear whether that’s possible or makes sense.
+   * We will need to support both at the same time, at least for a while.
+   * If both are present then they will likely need to be merged to the most
+     specific common signature.
+   * There is a concern about whether to choose specific or generic shapes when the
+     type annotation is incomplete.
+   * Union types will also need to instantiate multiple concrete functions (like
+     `tf.function` does today).
+   * A proposal was to allow a function with types to be passed to tf.function as
+     long as you don’t pass an `input_signature`. This also applies to `tf.map_fn`
+     and other APIs, not just `tf.function`.
+   * Another question was about using type annotations to parametrize structures of
+     types, e.g. where the output has the same structure but different types. This
+     may be beyond the expressivity of the current type annotation system.
+ * New section containing guidance on adding new public generic types:
+   * TLDR: The RFC only adds a very small number of types; a complete type system
+     is out of scope.
+   * There is some concern about the graph-vs-eager distinction when composing
+     `Tensor` objects, e.g. with `CompositeTensor`.
+   * For now we’ll try to avoid exporting specific types for Graph vs. Eager until
+     the `CompositeTensor` issue is resolved. Most likely we’ll need some type erasure.
+   * RFC proposes guidance for adding new types in the future.
+ * Other changes: + * Added mention on `Tensor`-like types + * List of lists can be converted to tensor but only if they are not ragged. The type annotation system will not enforce that. + * Clarified difference between `tf.Optional` and `typing.Optional`. + * Added clarifications that the spirit of RFC is to be compatible with Python standards and support all python static checkers, not just pytype. + * `isinstance` is not necessary the recommended mechanism, but the type system should be compatible with it and other Python mechanisms + * Discussion on open questions + * Having multiple type definitions (e.g. `tf.Tensor`, `tf.types.Tensor`) can be confusing to users: + * We want to try to expose only one type. + * The preference is to only expose the generic type, but in some cases the specialized one is already exported. In some cases, like `tf.Tensor` we can make the switch. + * In other cases, like `tf.Variable`, we can’t because `Variable` objects are instantiated directly. + * So we’ll punt on `tf.Variable` for now. + * Flat vs. hierarchical namespace: + * Does not need to be answered now, since the initial number of types is small. + * A good default is to mimic the existing namespace structure (e.g. for datasets use `tf.types.data.Dataset`) + * Should we include highly specialized types? e.g. `FuncGraph` classes for cond/while branches: + * They may be needed for Keras. + * In general, if an external library needs a type that’s an indication that it might be useful to export that type. + * `isinstance` support for `register_tensor_conversion_function` using e.g. ABC magics. + * Decision made to accelerate the deprecation of register_tensor_conversion_function instead. diff --git a/rfcs/20200218-tf-c-saved-model.md b/rfcs/20200218-tf-c-saved-model.md new file mode 100644 index 000000000..a17aaae5d --- /dev/null +++ b/rfcs/20200218-tf-c-saved-model.md @@ -0,0 +1,539 @@ +# TF SavedModel C/C++ API + +| Status | Accepted | +| :---------- | :------------------------------------------------------------------------------------------------- | +| **RFC #** | [207](https://github.com/tensorflow/community/pull/207) | +| **Authors** | Brian Zhao (bmzhao@google.com), Hye Soo Yang (hyey@google.com), Paige Bailey (webpaige@google.com) | +| **Sponsor** | Gunhan Gulsoy (gunan@google.com) | +| **Updated** | 2020-02-19 | + +## Motivation + +Many [developers would like to use Tensorflow’s C++ API](https://stackoverflow.com/questions/33620794/how-to-build-and-use-google-tensorflow-c-api), but the current developer user journey has a few pain points. This is a result of many factors, including: insufficient documentation, no readily-available build artifact (we do not distribute libtensorflow\_cc, forcing users to compile from source), and the general low-level nature of the APIs themselves. Since the most critical C++ API use case is production inference, we will focus on the [Saved Model API](https://github.com/tensorflow/tensorflow/blob/c347ded23c5fa658bcd315b4fdaa5e09ed4e3ef4/tensorflow/python/saved_model/README.md). + +Currently, using the API requires significant setup, including [manually copying header directories from bazel outputs](https://github.com/tensorflow/tensorflow/blob/b391cb55c2861f1cf57311f85b4a893604fea3af/tensorflow/cc/BUILD#L813). For a full example loading and running inference in C++, please refer to [here](https://github.com/bmzhao/saved-model-example/blob/bee5d5a8be80eeee3134ec8f4dc8fc2a75eebf97/load_example.cc). 
+ + +## Objectives + +We would like to revamp the C++ saved model user journey by + +1. Examining the set of use cases we intend to support +2. Identifying API constraints, requirements, and pain points for each use case +3. Designing a new C++ saved model API, while being thoughtful of evolution in the surrounding space (notably TF2 and MLIR) +4. Creating detailed documentation for C++ API and examples. + +# Background + +This RFC builds on top of TF2 Python SavedModel APIs, which are described in previous RFCs such as [SavedModel Save/Load in 2.x](https://github.com/tensorflow/community/blob/d066269dd0f231b8804c016c27ecfd2e809fa613/rfcs/20181116-saved-model.md) and [Keras SavedModel saving/loading](https://github.com/tensorflow/community/blob/d066269dd0f231b8804c016c27ecfd2e809fa613/rfcs/20190509-keras-saved-model.md). We assume the reader is familiar with concepts presented in these RFCs. + +## Use Cases + +Tooling across the TF ecosystem relies on saved model’s interface (eg: [TF.js](https://github.com/tensorflow/community/blob/d066269dd0f231b8804c016c27ecfd2e809fa613/rfcs/20190821-nodejs-saved-model.md), [TF Hub](https://www.tensorflow.org/hub/tf2_saved_model), XLA, MLIR, [Servo](https://www.tensorflow.org/tfx/serving/serving_basic#train_and_export_tensorflow_model), [TF Lite](https://www.tensorflow.org/lite/convert), etc). + +![tf_status_internal_actual_dependency_graph](20200218-tf-c-saved-model/saved_model_diagram.png) + +Defining a unified, modern SavedModel C/C++ API: + +1. Enables features across multiple languages (Java, Go, JS, etc) through C bindings +2. Raises the abstraction level for users in the ecosystem closer to TF Python APIs + +The current set of use cases we’ve considered are documented below: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+| Use Case | Priority | Supported by Current C++ SavedModel API |
+| :--- | :--- | :--- |
+| Inference | High | Yes, but V1 style (with session) |
+| Data-only retraining (fine-tuning weights or “resuming training after interruption”) | Medium | Inconvenient (possible through running session.run calls + SaverDef signature on an exported training function) |
+| From scratch training | Low | Inconvenient (effectively same as above, but with no initial weights) |
+| Model Composability (eg: transfer learning) | Low | Inconvenient (currently limited to session.run on whatever subgraphs are saved) |
+| Limited Introspection/Modification of the model | Low | Yes (SignatureDefs are exposed in the loaded model, or ahead of time usage of CLI tool saved_model_cli) |
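+
+As background for the use cases above, the SavedModels this API targets are
+typically produced from TF2 Python along the following lines. This is a
+minimal, hypothetical sketch; the module, function, and path names are
+illustrative only:
+
+```python
+import tensorflow as tf
+
+
+class Adder(tf.Module):
+
+  @tf.function(input_signature=[tf.TensorSpec(shape=[None], dtype=tf.float32)])
+  def add_one(self, x):
+    # The ConcreteFunction traced from this tf.function is what a C++ client
+    # would later look up (e.g. by signature key or function path) and invoke.
+    return x + 1.0
+
+
+# Writes saved_model.pb (including the SavedObjectGraph) and a variables
+# directory to disk.
+tf.saved_model.save(Adder(), "/tmp/adder")
+```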
+ +### Inference + +The highest prioritized use case for the C++ API is serving-time inference. Although it works today, there are a few areas for improvement. + +#### TF 1 Session Coupling + +The current API is still tightly coupled to TF1’s graph + session representation of the world. + +Loading a savedmodel requires a [SessionOptions](https://github.com/bmzhao/saved-model-example/blob/bee5d5a8be80eeee3134ec8f4dc8fc2a75eebf97/load_example.cc#L20) object. Running the model uses a [Session object](https://github.com/bmzhao/saved-model-example/blob/bee5d5a8be80eeee3134ec8f4dc8fc2a75eebf97/load_example.cc#L54) with [“feeds” and “fetches”](https://github.com/bmzhao/saved-model-example/blob/bee5d5a8be80eeee3134ec8f4dc8fc2a75eebf97/load_example.cc#L58-L59). For newcomers to the Tensorflow community, this breaks the higher level abstraction boundary that tf.function introduces. + +**Design goal: Align C++ APIs with TF 2.x concepts and paradigms** by decoupling session from SavedModel API. + +#### Opaque Magic Strings + +The [names of the input and output tensors required to use the API](https://github.com/tensorflow/tensorflow/blob/59840cf101741aac00070a066259bf0b6d4d17ec/tensorflow/core/public/session.h#L134-L136) are opaque implementation details outside of user control. For example, the [Keras MNIST tutorial’s](https://www.tensorflow.org/tutorials/quickstart/beginner) saved tensor names are [“serving\_default\_flatten\_input” and “StatefulPartitionedCall”](https://github.com/bmzhao/saved-model-example/blob/bee5d5a8be80eeee3134ec8f4dc8fc2a75eebf97/load_example.cc#L58-L59). + +Finding these strings involves: + +1. Manually running the [saved\_model\_cli](https://www.tensorflow.org/guide/saved_model#overview_of_commands) command line tool ahead of time on the model, then hardcoding the appropriate constants, or... +2. Dynamically iterating through the SignatureDef map, [expecting some predefined structure](https://github.com/tensorflow/serving/blob/b9602b3820dddea0d9fa9423b1ae9eaaf2aec977/tensorflow_serving/g3doc/signature_defs.md#classification-signaturedef) + +**Design goal: Make the API simple and intuitive**. Do not force users to understand implementation details. + +#### AOT Compilation + +Another inference consideration is AOT (ahead of time) compilation of Tensorflow graphs for optimal performance. Our definition of AOT compilation is any preprocessing that occurs on a saved model to “lower” the computation into a more machine friendly representation (like flatbuffers or C++ object files). This preprocessing is typically done by a compiler which can perform additional optimizations (like variable freezing, constant folding, common subexpression elimination, etc), and may impose additional constraints (like device placement). + +For example, [tfcompile](https://www.tensorflow.org/xla/tfcompile#what_is_tfcompile) allows users to convert an [XLAConfigProto](https://www.tensorflow.org/xla/tfcompile#step_1_configure_the_subgraph_to_compile) describing a TF subgraph into an [autogenerated C++ library](https://www.tensorflow.org/xla/tfcompile#step_3_write_code_to_invoke_the_subgraph) that a user can compile and link against. + +We expect that an AOT workflow may be required for performance-sensitive environments (mobile, edge, etc), and that different AOT formats may be needed. + +Our current thoughts are: + +1. Saved Model’s design shouldn’t preclude AOT workflow +2. 
We shouldn't add AOT compilation directly in SavedModel’s API + +**Design Goal:** **Decouple “serialization format” from “runtime”** **by removing the “Run” method from SavedModel**. SavedModel should be divided into a “SerializedModel” type that represents access to on-disk data only and allow “other types” to perform “running/compilation”. EG: data-only SavedModel can be used to construct another type which we call “.run()” or “.compile()” on. + +### Training From Scratch/Retraining + +Training support requires the ability to: + +1. Express tensor computation +2. Compute gradients +3. Save weights + +There are a few limitations to the current C++ API that make this difficult. + +First, there are _no C++ APIs today to express computation beyond “[executing ops](https://github.com/tensorflow/tensorflow/blob/59840cf101741aac00070a066259bf0b6d4d17ec/tensorflow/c/eager/c_api.h#L379)” or “[executing a session](https://github.com/tensorflow/tensorflow/blob/59840cf101741aac00070a066259bf0b6d4d17ec/tensorflow/core/public/session.h#L134)”_. This means that users must: + +1. Manually build a graph of all computation and invoke “session.run” or +2. Write boilerplate code to call “TFE\_Execute” on each op computation. + +This is rather inconvenient. An ideal solution involves designing a full-fledged C++ “eager” API like the following: + +```c++ +tensorflow::Tensor w({ + {1, 2}, + {3, 4} +}); +tensorflow::Tensor x({100, 100}); +tensorflow::Tensor b({5, 6}); + +// Underneath the hood this calls TFE_Execute( matmul op …), TFE_Execute( add op …) +tensorflow::Tensor y = w * x + b; +``` + +Second, gradient support is required for training. Today, gradients are registered in python via [tf.RegisterGradient](https://github.com/tensorflow/tensorflow/blob/59840cf101741aac00070a066259bf0b6d4d17ec/tensorflow/python/framework/ops.py#L2399), and [gradients\_util.py](https://github.com/tensorflow/tensorflow/blob/59840cf101741aac00070a066259bf0b6d4d17ec/tensorflow/python/ops/gradients_util.py#L479) manipulates the graph to include necessary gradient logic. Gradient construction/registration from C++ is limited to a [small subset of ops](https://github.com/tensorflow/tensorflow/tree/59840cf101741aac00070a066259bf0b6d4d17ec/tensorflow/cc/gradients) using the [REGISTER\_OP\_GRADIENT](https://github.com/tensorflow/tensorflow/blob/59840cf101741aac00070a066259bf0b6d4d17ec/tensorflow/core/framework/function.h#L888) macro. A significant amount of per-op python gradient implementation and python logic porting to C++ would be necessary for a pure C++ training API. + +For the SavedModel use case, TF python code can export a tf.function whose graph contains these gradients already. This means that training is possible in C++ today as long as TF python code first generates the saved graph. + +Finally, C++ support for saving variable weights is necessary. Fortunately, this is possible today by invoking the functions referenced in a TF2 Saved Model’s [SaverDef](https://github.com/tensorflow/tensorflow/blob/31679b0d8440d2f119a2dc060b7d04fe77111bda/tensorflow/core/protobuf/meta_graph.proto#L81) proto. 
_Ideally_, a C++ saving API would be similar to the current [TF2 Python saving API](https://github.com/tensorflow/tensorflow/blob/59840cf101741aac00070a066259bf0b6d4d17ec/tensorflow/python/saved_model/save.py#L767) ([RFC](https://github.com/tensorflow/community/blob/master/rfcs/20181116-saved-model.md)), offering users a way to serialize an arbitrary “object hierarchy” composed of functions and variables, and would be transparently interoperable with an object hierarchy saved from python. + +Building a fleshed out C++ training API will require significant investment, especially since most of Tensorflow’s high level logic is implemented only in python. + +**Design Goal: Ensure current design does not preclude the creation of higher level C++ APIs described above.** We should aim for incremental delivery of small portions of functionality. Future work should be guided by a northstar of a high-level C++ API. + +### Introspection vs Composability + +In general, we don’t want to expose APIs that make assumptions on TF’s serialized representation. + +For example, we could offer more advanced savedmodel “introspection/modification” APIs that manipulate the saved graph to add additional nodes, remove nodes, or fuse nodes in a saved function. These modifications might be nontrivial, or require a different API surface depending on the internal serialized representation (graph vs IR). + +We want Savedmodel C++ APIs to expose the [same primitives](https://github.com/tensorflow/community/blob/d066269dd0f231b8804c016c27ecfd2e809fa613/rfcs/20181116-saved-model.md#serialization-primitives) as TF2 Python SavedModel’s object hierarchy. + +**Design goal: Do not allow arbitrary introspection. Hide internal representation.** Users should only compose whatever primitives our API provides. Initially this will be limited to invoking serialized tf.functions. + +### Additional Design Considerations + +#### Modular Tensorflow + +In accordance with [Modular Tensorflow](https://github.com/tensorflow/community/blob/master/rfcs/20190305-modular-tensorflow.md), we would like our APIs to have ABI stability, so that users can simply link against a provided shared object without worrying about compiling Tensorflow from source, or ABI compatibility issues (eg: due to mismatching compiler flags, C++ std type layout changes, etc). + +To achieve this, we will implement a C++/C/C++ sandwich. A user-facing C++ header only API will internally call into a C API (exposed by [libtensorflow](https://www.tensorflow.org/install/lang_c)). This C API will wrap a C++ implementation. + + +## Summary of Goals + +Given the above use cases and API constraints, here is a summary of our design goals. + +Goals: + +* Align C++ APIs with TF 2.x concepts and paradigms (e.g. decouple session from SavedModel API) +* Make the API simple and intuitive +* Decouple “serialization format” from “runtime” +* Work towards higher level C++ APIs as a northstar +* Favor Composition over Introspection +* Use a C layer to maintain ABI stability + +Non-goal: + +* Changing current core TF runtime. + +## Detailed Design + +We introduce a new C++ type `tensorflow::cc::SavedModelAPI`. + +`tensorflow::cc::SavedModelAPI::Load` is a static factory function that loads a SavedModel from a directory path passed as a string. It also takes two additional parameters: `tensorflow::cc::Context` (a C++ wrapper of `TFE_Context)` and an optional `std::unordered_set` of tags. + +1. 
The `Context` argument can be used to provide global runtime configuration options (analogous to the existing SavedModel’s [SessionOptions](https://github.com/tensorflow/tensorflow/blob/20c1ba21a9bf0ef413c83a6bcc4e79c6f65eb868/tensorflow/core/public/session_options.h#L28) struct). This also fits TF2’s general API direction, which uses [TFE\_ContextOptions](https://github.com/tensorflow/tensorflow/blob/ee4a891f34d6f634a38eb889759f3ad49a17a22d/tensorflow/c/eager/c_api_internal.h#L53-L54) to wrap SessionOptions.
+2. `Context` provides the SavedModel with a hook into the TensorFlow runtime, which decouples SavedModel from the runtime implementation.
+3. To support loading TF1 SavedModels (which may contain multiple MetaGraphs), we will add an optional “tags” argument for loading a particular MetaGraph.
+
+`tensorflow::cc::SavedModelAPI` has a method `GetFunction`, which _effectively_ takes a JSON-like path to a serialized `tf.function` in the SavedObjectGraph proto and returns a `StatusOr`. (The actual C++ types are slightly different, as explained further below.) This function path is the set of object names (separated by dots) that the equivalent Python code would have to access in order to obtain a handle to the function.
+
+This method:
+
+4. Removes Session from SavedModel’s interface.
+5. Raises the abstraction level to TF2-style functions.
+6. Offers a way to traverse TF2’s SavedObjectGraph hierarchy.
+
+Additionally, `tensorflow::cc::SavedModelAPI` has a method `GetSignatureDefFunction`, which takes the string key of a [SignatureDef map](https://github.com/tensorflow/tensorflow/blob/69b08900b1e991d84bce31f3b404f5ed768f339f/tensorflow/core/protobuf/meta_graph.proto#L89).
+
+7. This allows users to continue loading TF1 SavedModels.
+
+`tensorflow::cc::ConcreteFunction` becomes a first-class C++ type that users can independently manage.
+
+8. Users can compose `tensorflow::cc::ConcreteFunction`s in their own “Module”-like C++ classes by storing them as member variables and invoking them.
+9. This plays nicely with a future potential `tensorflow::Module` C++ type.
+
+`tensorflow::cc::ConcreteFunction` has a `Run()` method which takes vectors of input and output tensors and returns a Status.
+
+10. This decouples SavedModel from “Run”.
+
+`ConcreteFunction` has a method to retrieve `FunctionMetadata`, which will include additional, optional metadata objects such as SignatureDefs (for TF1 functions), InputSignatures (for TF2 tf.functions), and a “funcpath” string (for TF2 functions).
+
+11. `FunctionMetadata` provides minimal runtime introspection of `ConcreteFunction`s, allowing users to dynamically choose which functions to run.
+
+In the existing API, per-`session.run()` configuration options are exposed via [RunOptions](https://github.com/tensorflow/tensorflow/blob/20c1ba21a9bf0ef413c83a6bcc4e79c6f65eb868/tensorflow/core/protobuf/config.proto#L593). In the new API, we propose a new type `tensorflow::cc::FunctionRunOptions`. `FunctionRunOptions` is semantically equivalent to the existing `RunOptions`, but is implemented as a C++ type instead of a protobuf message.
+
+This allows TF to:
+
+12. Move away from serialized protos on the API surface, which is currently the case for [RunOptions](https://github.com/tensorflow/tensorflow/blob/20c1ba21a9bf0ef413c83a6bcc4e79c6f65eb868/tensorflow/c/c_api.h#L1221).
+13. Improve runtime performance by removing a layer of proto marshalling and unmarshalling.
+
+`tensorflow::cc::ConcreteFunction::Run()` will have a default parameter for `tensorflow::cc::FunctionRunOptions`, with suitable defaults set.
+
+14. This makes the default API simple to use, but gives power users (like TensorFlow Serving) the ability to tweak fine-grained knobs.
+
+### C++ API
+
+`tensorflow/cc/saved_model/experimental/saved_model_api.h`
+
+```c++
+namespace tensorflow {
+namespace cc {
+
+class SavedModelAPI {
+ public:
+  ConcreteFunction* GetFunction(
+      const std::string& function_path,
+      Status* status);
+
+  ConcreteFunction* GetSignatureDefFunction(
+      const std::string& signature_def_key,
+      Status* status);
+
+  std::vector<ConcreteFunction*> ListFunctions();
+
+  static std::unique_ptr<SavedModelAPI> Load(
+      const std::string& saved_model_path,
+      const Context& context,
+      Status* status,
+      const std::unordered_set<std::string>* tags = nullptr);
+
+ private:
+  explicit SavedModelAPI(TF_SavedModel* model) : saved_model_(model) {}
+
+  struct TFSavedModelDeleter {
+    void operator()(TF_SavedModel* p) const { TF_DeleteSavedModel(p); }
+  };
+  std::unique_ptr<TF_SavedModel, TFSavedModelDeleter> saved_model_;
+};
+
+}  // namespace cc
+}  // namespace tensorflow
+```
+
+`tensorflow/cc/saved_model/experimental/concrete_function.h`
+
+```c++
+namespace tensorflow {
+namespace cc {
+
+class ConcreteFunction {
+ public:
+  std::vector<Tensor> Run(
+      const std::vector<Tensor>& inputs,
+      Status* status,
+      const FunctionRunOptions& options = FunctionRunOptions::Defaults());
+
+  FunctionMetadata* GetFunctionMetadata();
+};
+
+class FunctionMetadata {
+  // TBD; let's start with something similar to tf.function input
+  // signatures.
+};
+
+// Wraps options we can pass to the runtime.
+class FunctionRunOptions {
+ public:
+  // Fields TBD. Most likely some of these will be similar to
+  // RunOptions.
+
+  // Alternatively, we could have the default constructor do this by
+  // setting fields via default member initializers.
+  static FunctionRunOptions& Defaults();
+};
+
+}  // namespace cc
+}  // namespace tensorflow
+```
+
+### C API
+
+`tensorflow/c/experimental/saved_model/public/saved_model_api.h`
+
+```c++
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+typedef struct TF_SavedModel TF_SavedModel;
+
+// Load a SavedModel from `dirname`.
+//
+// Params:
+//  dirname - A directory filepath that the SavedModel is at.
+//  ctx - A TFE_Context containing optional load/TF runtime options.
+//        `ctx` must outlive the returned TF_SavedModel pointer.
+//  tags - Pointer to char* array of SavedModel tags. Conceptually,
+//         this is a std::optional<std::vector<const char*>>. The first pointer
+//         represents the "optional" part. If tags = nullptr, we expect the
+//         SavedModel to contain a single MetaGraph (as for those exported from
+//         `tf.saved_model.save`). If tags != nullptr, we expect
+//         *tags = char*[tags_len], and load the MetaGraph matching the tags.
+//  tags_len - number of elements in the `tags` array.
+//  status - Set to OK on success and an appropriate error on failure.
+// Returns:
+//  If status is not OK, returns nullptr. Otherwise, returns a newly created
+//  TF_SavedModel instance. It must be deleted by calling TF_DeleteSavedModel.
+// TODO(bmzhao): Before this API leaves experimental, consider introducing a
+// new C API Symbol TF_LoadSavedModel that doesn't take `tags`, so that this
+// function can take a `tags` double pointer instead.
+TF_CAPI_EXPORT extern TF_SavedModel* TF_LoadSavedModel(
+    const char* dirname, TFE_Context* ctx, const char* const* const* tags,
+    int tags_len, TF_Status* status);
+
+// Deletes a TF_SavedModel, and frees any resources owned by it.
+TF_CAPI_EXPORT extern void TF_DeleteSavedModel(TF_SavedModel* model); + +// Retrieve a function from the TF2 SavedModel via function path. +// +// Params: +// model - The TF2 SavedModel to load a function from. +// function_path - A string containing the path from the root saved python +// object to a tf.function method. +// TODO(bmzhao): Add a detailed example of this with a +// python tf.module before moving this out of experimental. +// status - Set to OK on success and an appropriate error on failure. +// Returns: +// If status is not OK, returns nullptr. Otherwise, returns a +// TF_ConcreteFunction instance. The lifetime of this instance is +// "conceptually" bound to `model`. Once `model` is deleted, all +// `TF_ConcreteFunctions` retrieved from it are invalid, and have been deleted. +TF_CAPI_EXPORT extern TF_ConcreteFunction* TF_GetSavedModelConcreteFunction( + TF_SavedModel* model, char* function_path, TF_Status* status); + +// Retrieve a function from the TF SavedModel via a SignatureDef key. +// +// Params: +// model - The SavedModel to load a function from. +// signature_def_key - The string key of the SignatureDef map of a SavedModel: +// https://github.com/tensorflow/tensorflow/blob/69b08900b1e991d84bce31f3b404f5ed768f339f/tensorflow/core/protobuf/meta_graph.proto#L89 +// status - Set to OK on success and an appropriate error on failure. +// Returns: +// If status is not OK, returns nullptr. Otherwise, returns a +// TF_ConcreteFunction instance. Once `model` is deleted, all +// `TF_ConcreteFunctions` retrieved from it are invalid, and have been deleted. +TF_CAPI_EXPORT extern TF_ConcreteFunction* TF_GetSavedModelSignatureDefFunction( + TF_SavedModel* model, char* signature_def_key, TF_Status* status); + +// Returns a list of all ConcreteFunctions stored in this SavedModel. +TF_CAPI_EXPORT extern TF_ConcreteFunctionList* TF_ListSavedModelFunctions( + TF_SavedModel* model); + +#ifdef __cplusplus +} +#endif +``` + +`tensorflow/c/experimental/saved_model/public/concrete_function.h` + +```c++ +#ifdef __cplusplus +extern "C" { +#endif + +typedef struct TF_ConcreteFunction TF_ConcreteFunction; + +// Returns FunctionMetadata associated with `func`. Metadata's lifetime is +// bound to `func`, which is bound to the TF_SavedModel it was loaded from. +TF_CAPI_EXPORT extern TF_FunctionMetadata* TF_ConcreteFunctionGetMetadata( + TF_ConcreteFunction* func); + + +#ifdef __cplusplus +} +#endif +``` + +### Additional Implementation Considerations + +#### ABI Stability + +Finally, due to our ABI stability goal, all types exposed in the C++ header only API’s surface (function input and output types) must be either + +1. "`std::`" types + +2. C++ types implemented in a header only library (there must not be a .cc file) + +3. types that wrap a C type. + +For category 3 of "types that wrap a C type", we propose putting them under a new "`tensorflow::cc`" namespace. These types will have an ABI stability guarantee. + +Note that this will mean the Tensorflow codebase will have the same "conceptual type" under two parallel namespaces, but the one under the "`tensorflow::cc`" namespace has the ABI stability guarantee. 
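+
+To make category 3 concrete, such a wrapper could look roughly like the following header-only sketch (illustrative only; the RFC does not prescribe this exact class or its method names):
+
+```c++
+#include <memory>
+#include <string>
+
+#include "tensorflow/c/tf_status.h"
+
+namespace tensorflow {
+namespace cc {
+
+// Header-only wrapper around the C API's TF_Status. Because it only touches
+// the C ABI, it can live entirely in a public header.
+class Status {
+ public:
+  Status() : status_(TF_NewStatus(), TF_DeleteStatus) {}
+
+  bool ok() const { return TF_GetCode(status_.get()) == TF_OK; }
+  std::string error_message() const { return TF_Message(status_.get()); }
+
+  // Lets other tensorflow::cc wrappers pass the underlying object to C calls.
+  TF_Status* c_status() const { return status_.get(); }
+
+ private:
+  std::unique_ptr<TF_Status, decltype(&TF_DeleteStatus)> status_;
+};
+
+}  // namespace cc
+}  // namespace tensorflow
+```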
+ +For example, `SavedModelAPI`'s `GetFunction` method will have to return a `tensorflow::cc::Status`, that wraps the C [TF_Status](https://github.com/tensorflow/tensorflow/blob/c347ded23c5fa658bcd315b4fdaa5e09ed4e3ef4/tensorflow/c/tf_status_internal.h#L24), instead of directly being a [tensorflow::Status](https://github.com/tensorflow/tensorflow/blob/c347ded23c5fa658bcd315b4fdaa5e09ed4e3ef4/tensorflow/core/platform/status.h#L39). + +#### TFE\_Execute Integration + +As a result of [Single python code path for eager and graph (RFC)](https://github.com/tensorflow/community/blob/d066269dd0f231b8804c016c27ecfd2e809fa613/rfcs/20191203-single-eager-graph-path.md), Python’s FuncGraph representation is moving to C++. We expect this C++ FuncGraph to be the underlying representation of the opaque C type `TF_ConcreteFunction`. Our work aligns in building a common way to execute the FuncGraph (perhaps by dispatching a PartitionedCallOp to `TFE_Execute/TF_Execute`). + +Some important caveats: + +1. Variable lifetime - In TF1, variable lifetime was managed by the VarHandleOp tied to a Session. This is no longer the case with op-by-op dispatch in TF2 style execution. We will need to implement our own variable lifetime management. +2. `PartitionedCallOp` - For graphs that are missing `(Stateful)PartitionedCallOp`, we will need a mechanism to build the `PartitionedCallOp` for running (`TFE_Execute`) the TF function. + +#### **V2 SavedModel Loader** + +The [MLIR](https://mlir.llvm.org/)/[IREE](https://github.com/google/iree#project-goals) team’s work has laid an excellent foundation for the C++ TF2 SavedModel API through [bundle\_v2.h](https://github.com/tensorflow/tensorflow/blob/c347ded23c5fa658bcd315b4fdaa5e09ed4e3ef4/tensorflow/cc/saved_model/bundle_v2.h#L37). It traverses and loads the object hierarchy of a V2 saved model. + +However, it comes with a few limitations such as: + +1. [requiring input\_signature specifications for each exported function](https://docs.google.com/presentation/d/1R6H_Eax6sXT2-ffpmF5zjHwS1F22D2DF2ty7EvdAUUw/edit#slide=id.g758020ea42_0_37), and +2. requires a 1:1:1:1 mapping from [tf.function <=> SavedFunction <=> SavedConcreteFunction <=> FunctionDef](https://github.com/tensorflow/tensorflow/blob/c347ded23c5fa658bcd315b4fdaa5e09ed4e3ef4/tensorflow/compiler/mlir/tensorflow/translate/import_model.cc#L2357-L2359) + +We should try to leverage this existing work to implement a V2 SavedModel loader. + +#### Utility Functions + +`saved_model_cli` is a Python binary tool for inspecting and running a SavedModel (.pb). There isn’t a utility tool like such in C++. We will provide basic model inspection functionalities as part of the API so that the users do not need to jump between Python and C++. This is fulfilled by the `tensorflow::cc::SavedModel::ListFunctions` `tensorflow::cc::FunctionMetadata` APIs. + +#### API Usage Examples + +A model can be exported and loaded in the manner described below. 
+
+Export model from Python:
+
+```python
+import tensorflow as tf
+import numpy as np
+
+class Dense(tf.Module):
+  def __init__(self, in_features, output_features, name=None):
+    super(Dense, self).__init__(name=name)
+    self.w = tf.Variable(
+        tf.random.normal([in_features, output_features]), name='w')
+    self.b = tf.Variable(tf.zeros([output_features]), name='b')
+
+  @tf.function
+  def apply(self, x):
+    y = tf.matmul(x, self.w) + self.b
+    return tf.nn.relu(y)
+
+class MyModel(tf.Module):
+  def __init__(self):
+    self.a = Dense(10, 3)
+    self.b = Dense(10, 3)
+
+  @tf.function
+  def run(self, x):
+    return self.a.apply(x) + self.b.apply(x)
+
+x = MyModel()
+input = tf.random.uniform((2, 10), minval=0, maxval=100)
+print(x.run(input))
+
+tf.saved_model.save(x, './saved')
+y = tf.saved_model.load('./saved')
+
+print(y.a.apply(tf.random.uniform((2, 10), minval=0, maxval=100)))
+```
+
+Loading the model in C++:
+
+```c++
+using tensorflow::cc::Status;
+using tensorflow::cc::SavedModelAPI;
+using tensorflow::cc::ConcreteFunction;
+
+// Load a model.
+Status status;
+std::unique_ptr<SavedModelAPI> saved_model =
+    SavedModelAPI::Load("saved", context, &status);
+
+if (!status.ok()) {
+  LOG(FATAL) << "Failed to load model.";
+}
+
+// Get a function.
+ConcreteFunction* func =
+    saved_model->GetFunction("a.apply", &status);
+if (!status.ok()) {
+  LOG(FATAL) << "Failed to get function.";
+}
+
+// Run the function.
+func->Run(...)
+```
diff --git a/rfcs/20200218-tf-c-saved-model/saved_model_diagram.png b/rfcs/20200218-tf-c-saved-model/saved_model_diagram.png
new file mode 100644
index 000000000..0a2ed2d6b
Binary files /dev/null and b/rfcs/20200218-tf-c-saved-model/saved_model_diagram.png differ
diff --git a/rfcs/20200306-single-client-parameter-server.md b/rfcs/20200306-single-client-parameter-server.md
new file mode 100644
index 000000000..a57ae4231
--- /dev/null
+++ b/rfcs/20200306-single-client-parameter-server.md
@@ -0,0 +1,647 @@
+# Single-client Parameter Server Training
+
+| Status        | Accepted      |
+:-------------- |:---------------------------------------------------- |
+| **Author(s)** | Yuefeng Zhou (yuefengz@google.com), Rick Chao (rchao@google.com) |
+| **Sponsor**   | Priya Gupta (priyag@google.com)                 |
+| **Updated**   | 2020-03-06                                           |
+
+
+## Background
+
+Parameter server training is a very commonly used distributed training architecture. It is especially relevant to models with large embeddings, for training on large clusters of machines with CPUs, or when scalability is preferred over determinism. Its high-level idea is to create variables on parameter servers and, in each step, let workers take different training inputs, pull variable values from the parameter servers, compute gradients, and send them to the parameter servers.
+
+
+### Distribution Strategy
+
+Distribution Strategy (`tf.distribute.Strategy`) is a library that aims to allow users to write simple model code and scale up their models automatically with decent out-of-the-box performance. We will design parameter server training under the umbrella of `tf.distribute.Strategy` in TensorFlow 2 with a single-client architecture, in contrast to the multi-client approach that is traditionally used in TensorFlow distributed training, such as `tf.estimator.Estimator` or `tf.distribute.experimental.MultiWorkerMirroredStrategy`.
+
+Distribution Strategy’s [custom training loop](https://www.tensorflow.org/tutorials/distribute/custom_training) (CTL) API has been popular among users who want more control in writing their training loops. The user community of this API is large.
We would like to focus on supporting CTL API first and later abstract out a commonly used pattern for Keras `compile`/`fit` API. + + +### Single-Client Distributed Training + +We recommend a single client architecture for parameter server training in TensorFlow 2. This means there is only one client in a training cluster that coordinates the training of all workers in contrast to the multi-client setup in TensorFlow 1.x where each worker has its own coordinator. + +We believe that a single-client architecture can provide a simpler programming model than multi-client setup. A single source of truth can avoid bugs due to inconsistencies in multi-client setup. Furthermore, a single source of control can enable more determinism. In extreme cases, it can launch long-running tasks and turn into multi-client effectively. + + +## Goal + +The goal of this project is to support multi-worker asynchronous training with `ParameterServerStrategy` and CTL API, and in the long term also Keras `model.fit()`. In the first stage of this project, we focus more on design ideas rather than the APIs. + +The goal of this document is to discuss the high-level design and challenges of various pieces needed to support single-client parameter server training. Detailed designs for some pieces may be directed to other documents. + + +## Overview + + +### Programming Model + +With a single-client architecture, the programming model will be different than the multi-client architecture. All workers and parameter servers are standard TensorFlow servers, and the user-defined program will run on the client only. Generally, no matter what high-level APIs users will use, the workflow for running a step function distributedly with single-client approach includes the following steps: + + +1. Connect to all remote workers and parameter servers. +2. Create variables on parameter servers and hold references to them. +3. Create datasets and iterators on workers. +4. Create the replica function that takes an iterator as input, trace it and register it on all workers. Note: a function may create variables as well. If not specified, they will be created on parameter servers at the time the function is traced. +5. Dispatch the step function on one available worker. +6. Repeat 5 until the end of epoch. +7. Repeat 5 - 6 until the stop criteria is reached. + + +### Interfaces + +One of our goals is to make `ParameterServerStrategy`’s API consistent with other strategies so that users can easily switch between them. This may be challenging due to the fundamental difference between synchronous and asynchronous training. Therefore, we try to use most of the Distribution Strategy’ APIs but occasionally we will create APIs that still make sense in other strategies as well. + +**Note: all proposed APIs in this document are tentative and only valid for our first version. We will revisit them once we get enough feedback from users and the community.** + + +#### Constraints + +Function is first-class citizen. Users should only schedule functions instead of running individual ops, in addition to creating variables. We will only support `tf.function`s. Scheduling arbitrary Python functions will not be supported in the first cut. + +Users can occasionally run individual ops on the client, only for reporting purposes such as printing a metric’s value. + + +#### Schedule/Join Primitives + +The `strategy.run` API was initially developed for synchronous training. 
We propose a new pair of primitives to + +* hide the details of load-balancing, fault tolerance and dynamic scheduling +* expose the non-blocking semantics to users. + +```python +class ParameterServerStrategyV2: + + def schedule(self, replica_fn, args=(), kwargs=(), schedule_options=None): + """Schedule the `replica_fn` on a worker. + + Schedule the `replica_fn` on a worker that is available, returns a future + object immediately. + + By default, it implements at-least-once semantics for function execution. If + client gets a retryable error, e.g. worker preemption, it will reschedule the + function on another worker. So this method assumes that function execution can + be out of order. + + If `args` or `kwargs` contains distributed values such as a distributed dataset + returned from `strategy.distribute_dataset` or + `strategy.distribute_dataset_from_function`, the slice of the dataset + corresponding to the scheduled worker will be substituted for the original + distributed value. + + If some element in `args` or `kwargs` is bound to a specific worker, the + execution of the function may fail if the worker fails. We will consider + rebuilding the inputs to achieve at-least-once in all cases. + + The `schedule_options` will give users flexibility to specify which worker to + schedule on. We will support more options in the future. + + If there are barriers in `replica_fn`, it is users' responsibility to make + sure they won't cause deadlock. If `replica_fn` has collective ops that are + bound to specific devices, we recommend users use the run method instead. + """ + pass + + def join(self, futures=None): + """Wait until all given futures are ready. + + Raises an error if any of the functions fails to execute. In this case, + there is no guarantee that non-failing functions will complete. + + When join() is being called, it is not allowed to call `schedule`. + """ + pass + + def done(self): + """Returns True if there are no pending functions to be executed.""" + pass + + def local_results(self, futures): + """Get concrete values of the futures. + + Poisoned future objects will give `None`. + """ + pass + + +class Future(object): + + def wait(self): + """Block until the corresponding function is executed.""" + pass + + def result(self): + """Materialize the future. + + This is a blocking call. An exception will be thrown if the corresponding + function fails to execute or schedule. + """ + pass + + +class ScheduleOption(object): + + def __init__(assigned_worker=None): # More options to be added. + pass +``` + + +#### Dataset Interface + +The traditional training loop of `tf.distribute` passes the `get_next` results of a distributed iterator to `replica_fn`: + +``` +for x, y in distributed_iter: + loss = strategy.schedule(replica_fn, x, y) +``` + +If we do the same thing with the `strategy.schedule` API, there are several challenges. + +The first challenge is we don’t know which worker the `get_next` should return to since where the `replica_fn` will be executed will be decided later. Some later-binding mechanism can be explored. + +The second challenge is calling `get_next` on an iterator is synchronous. This means that the training loop is not truly asynchronous. It is tricky to make `get_next` asynchronous because the client doesn’t know how many items will be in the iterator and thus doesn’t know how many functions to schedule. 
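+
+One possible shape of the later-binding mechanism mentioned above, purely as a sketch (all class and method names here are hypothetical):
+
+```python
+# Hypothetical sketch of a later-binding per-worker value: `schedule` would
+# resolve it to the copy that lives on whichever worker it eventually picks.
+class PerWorkerValues(object):
+
+  def __init__(self, values_by_worker):
+    # e.g. {"/job:worker/task:0": iterator_0, "/job:worker/task:1": iterator_1}
+    self._values_by_worker = values_by_worker
+
+  def resolve(self, worker_device):
+    return self._values_by_worker[worker_device]
+
+
+# Inside a hypothetical `schedule` implementation, once a worker is chosen:
+#   resolved_args = tuple(
+#       arg.resolve(worker) if isinstance(arg, PerWorkerValues) else arg
+#       for arg in args)
+```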
+ + +##### Alternative: passing iterators to `strategy.schedule` + +The following training loop is less consistent with other `tf.distribute` examples but is easier to implement in the short term. It requires users to explicitly set a number of steps. + +```python +# … omitted +with strategy.scope(): + # … omitted + distributed_iter = iter(distributed_dataset) + for i in range(total_steps): + strategy.schedule(replica_fn, args=(distributed_iter,)) +# … omitted +``` + +**We will start with this kind of training loop in our first version. We hope to get rid of this restriction in the future.** + + +#### Example: Estimator-style Training with Custom Training Loop + +In Estimator, workers independently run training steps. Datasets created on each worker are usually identical but shuffled differently. The termination of training is decided based on the global step. Since workers are independent and stateless, workers can come and go freely. We can achieve similar behavior with our proposed interfaces. + +To construct a custom training loop for Estimator-style training, users need to + +* use `strategy.experimental_distribute_datasets_from_function` to create one dataset per worker. The dataset should be the same but shuffled differently across workers. +* create models under `strategy.scope` so variables will be assigned to parameter servers. +* likewise, create a Keras metric object under `strategy.scope`. Each worker, within their `replica_fn`, updates the metric states. +* use `strategy.schedule` to schedule the `replica_fn` into the cluster, which will end up scheduled on one remote worker. This `replica_fn` should take an iterator and perform forward and backward computation. This `strategy.schedule` returns one or several `Future` objects immediately. +* use `strategy.local_results` to get concrete values of results returned by `strategy.schedule`. This may be a blocking call if the result is not yet ready. With any failure that cannot be handled will be ignored and as a result some of the results may be `None`. +* call `strategy.join` to wait until all scheduled functions are executed. + +```Python +# Connect to remote servers with a user-provided `ClusterResolver` object. +strategy = ParameterServerStrategyV2(cluster_resolver) + +dataset_fn = # a function that returns a dataset + +# Clone the dataset on all workers, shuffled with different seeds. +distributed_dataset = strategy.experimental_distribute_datasets_from_function( + dataset_fn) + +with strategy.scope(): + # Create variables on parameter servers in a round-robin fashion. + model = create_model() + optimizer = tf.keras.optimizers.Adam() + accuracy = tf.keras.metrics.CategoricalAccuracy(name="train_accuracy") + checkpoint_manager = tf.train.CheckpointManager( + tf.train.Checkpoint(model=model), checkpoint_dir, max_to_keep=2) + + @tf.function + def replica_fn(iterator): + x, y = next(iterator) + with tf.GradientTape() as tape: + predictions = model(x, table, training=True) + loss = compute_loss(y, predictions) + gradients = tape.gradient(loss, model.trainable_variables) + optimizer.apply_gradients(zip(gradients, model.trainable_variables)) + accuracy.update_state(y, predictions) + return loss + + for _ in range(num_epoches): + distributed_iter = iter(distributed_dataset) + for i in range(steps_per_epoch): + # strategy.schedule pushes a closure in the scheduling queue and + # returns a list of future objects immediately. 
+ loss = strategy.schedule(replica_fn, + args=(distributed_iter,)) + strategy.join() + checkpoint_manager.save() # save checkpoint/summary... + print ("Loss = %f, accuracy = %f" % ( + strategy.local_results(loss) or float('nan'), accuracy.result())) +``` + + +### Fault Tolerance + +This section talks about the failure model and how we will support it. It has limitations and we will consider exposing APIs for users to define custom failure recovery policies in the future. + + +#### Task Failure + + +##### Worker failure + + +###### When scheduling + +When a worker fails, our training will continue without this failed worker. Functions scheduled on a failed worker will be rescheduled on other workers. + +For functions that bound to a specific worker, e.g. resource creation function, they will be queued until the worker is back. + +When the failed worker is back, we will update the cluster configuration with `context.update_server_def` which would also reset all the states. After resources on the restarted worker are built, we can resume scheduling functions on the worker. + + +###### When materializing a `Future` object + +It is possible that a function is executed but its corresponding worker fails when users try to consume its output. In this case, we will give users a `None` value and set an error in the `Future` object. + +We can mitigate the problem by eagerly materializing function outputs when they are passed to `local_results`. + +We can explore mechanisms to recover these objects in the future. In the short-term, users can choose to write the results to variables on parameter servers, just like a Keras metric. + + +##### Parameter server failure + +When a parameter server fails, the error will be propagated to the client via workers. Since the latest values of variables on the failed parameter servers are gone, there is no way for the client to recover them. Therefore the training will pause until the failed parameter server is back. The client then needs to clean up other variables on other parameter servers, rebuild all the variables and load variable values from a checkpoint. To trigger this process, the simplest method is to restart the client as well. This would require the cluster management to start the program again, once it receives an error from the client program due to parameter server failures. + + +##### Client failure + +When a client fails, some scheduled functions will continue to run on workers. No new functions will be scheduled. When the client comes back, it will create variables, load from a checkpoint, schedule functions with a new context id. All the old variables will be garbage-collected when we reset their eager contexts. + + +#### Resource Management for Workers + +When a worker has recovered from failure, we will need to rebuild iterators, worker-local variables, lookup tables and other resources on that worker that don’t need to be read from a checkpoint. This means that the client will have to keep track of these iterators, worker-local variables and other resources. + +Keeping track of resources and rebuilding them will be achieved depending how users create their resources: + +* we will record iterators created via `tf.distribute`’s API; The state of a rebuilt iterator will be lost. We can recover their states as future work. +* In the future we will provide users an API to create worker-local resources. We will capture these resources in the API. 
+ +If users create iterators or other resources inside a function but don’t expose them as outputs, we don’t need to rebuild them. + + +#### The Unknown of Scheduled Functions + +For functions that have been scheduled, it is difficult for the client to know whether they have actually been executed or not when the client detects their corresponding worker failure. Therefore, in addition to inform users of this uncertainty in the case of worker failure, we should do the following to reduce this uncertainty: + +* keep the number of scheduled but not executed functions small. This may be difficult to achieve since there is not an easy way for the client to know whether a function is executed or not. The only way is to synchronize the executor. Therefore, as a workaround we will have to periodically synchronize the executor to make sure functions are actually executed, before the client schedules more functions. In the long run, we should get acknowledgement from runtime about how many functions have been executed. +* eagerly fetch the outputs of remote functions once the outputs are passed to `strategy.local_result`. In this way, we can know the status of function execution earlier. +* recommend users schedule only small functions. Large functions are more expensive to retry. + + +#### Schedule Affinity + +When there is schedule affinity, specified by `ScheduleOptions` or inferred from input affinity, the aforementioned failure handling mechanism of rescheduling a function on other workers will not work. In this case, the default behavior is the client waits for the failing worker to come back until timeout and returns a schedule error to users. + + +### Evaluation + +Historically, `tf.estimator.Estimator` uses a dedicated evaluator that periodically loads from a checkpoint, and performs evaluation with evaluation data. On the other hand, `tf.keras` typically evaluates in an alternating manner after every epoch of training, and this is also the case with `tf.keras` + `MultiWorkerMirroredStrategy`. + +With `ParameterServerStrategyV2`, we will start with two schemes: 1) evaluation done by a dedicated **** evaluator that runs alongside the training cluster, aka “sidecar evaluation”, with a supporting utility function, and 2) evaluation done by a function executed on a single worker or functions executed on multiple workers, aka “inline evaluation”, where evaluation takes place in an alternating manner with training. + +Sidecar evaluation is especially useful for those users who prefer the settings where evaluation does not interrupt training progress, if saving/loading checkpoints are not considered expensive. + +Inline evaluation is especially useful for those users who would like to avoid checkpoint saving/loading, and those who feel performing evaluation isn’t too expensive so that it’s fine training is stopped for a short period of time. + + +#### Sidecar evaluation + +In this scheme, the training client is required to generate checkpoints periodically, and the evaluator reads the latest checkpoint as it becomes available. The evaluation is asynchronous to the training progress. With our recommendation[^1], users should create a separate evaluation client that runs the same python binary as the training client. 
This python binary will contain the if-else clause as it bifurcates into two paths: + +```Python +if cluster_resolver.task_type == "chief": + run_training_loop() +elif cluster_resolver.task_type == "evaluator": + run_evaluation_loop() +``` + +For user’s convenience, we will provide an `EvaluationLoop` API where the user provides key components for evaluation: + +```Python +def run_evaluation_loop(...): + """Run the example custom evaluation loop.""" + + model, eval_dataset, checkpoint_dir, eval_metrics = ... + + utils.EvaluationLoop( + model, + eval_dataset, + checkpoint_dir, + eval_metrics).start() + +class EvaluationLoop(object): + + def __init__(self, model, eval_dataset, checkpoint_dir, eval_metrics, + eval_steps=None): + """Initializes an EvaluationLoop object.""" + + @tf.function + def eval_fn(dataset): + """Evaluation function to compute metrics given a dataset. + + This creates a tf.function'ed evaluation function, where the dataset is + iterated over until exhaustion, or until eval_steps is met, whichever comes + earlier. If `eval_steps` is None, it exhausts the dataset. If dataset is + repeated, `eval_steps` must be provided or evaluation will be performed + indefinitely. + """ + pass + + self._eval_fn = eval_fn + # Other self attributes. + + def start(self): + """Starts an evaluation loop. + + This will start an evaluation loop which attempts to read the latest + checkpoint file. If a checkpoint file exists, and it has not been + evaluated, it loads it into the model, and executes the `eval_fn` locally. + After each evaluation run, it logs the metrics requested by the user, + writes to summary file for TensorBoard visualization, and possibly outputs + files for chief to read for further actions such as early stopping or + adjusting learning rate. + """ + pass +``` + +As illustrated above, evaluation loads into the model the checkpoints that were periodically saved (by the training client), does evaluation over a full pass of the eval dataset, and outputs the eval results. It may also export results to files which can be read by the training client for actions (such as reducing learning rate, early stopping, etc.) + +At evaluator’s failures or preemptions, we expect the evaluator job to be restarted, pick up the latest checkpoint, and continue with the next round of evaluation. + + +#### Inline evaluation + +In this scheme, there’s no checkpoint needed (although the training/evaluation can still involve one at user’s choice), and the same set of workers is used for evaluation after some amount of training (usually an epoch of training) has completed. No dedicated evaluator job is needed. As illustrated below, this would require users to write their `eval_fn` and schedule it to workers. + +```Python +strategy = ParameterServerStrategyV2(cluster_resolver=...) + +with strategy.scope(): + model, train_metric, train_dataset = ... + @tf.function + def train_fn(): + ... + + eval_metric = tf.keras.metrics.CategoricalAccuracy(name="eval_accuracy") + @tf.function + def eval_fn(shard_id, num_shards): + eval_dataset = ... + for x, y in eval_dataset.shard(shard_id, total_shard): + eval_metric.update_state(y, model(x, training=False)) + + for _ in range(num_epochs): + for _ in range(num_steps): + strategy.schedule(train_fn, args=...) # Training for num_steps steps. + strategy.join() # Make sure training ends and nobody is updating PS. + + # NUM_SHARDS' some sensible number, needs to be at least the number of workers, + # preferably much larger than that. 
+ for shard_id in range(NUM_SHARDS): + strategy.schedule(eval_fn, args=(shard_id, NUM_SHARDS)) + strategy.join() + print("Eval result is %f." % eval_metric.result()) + + # Optionally save checkpoint/summary, adjust learning rate or early stop, + # based on the evaluation result. + checkpoint_manager.save() +``` + + +If the worker that’s actively performing the evaluation encounters failures or preemptions, it is expected that `eval_fn` with a specific `shard_id` will be taken over by another available worker. This may result in duplicated evaluation on some input examples. This can be solved by having metrics as worker local resources, and returning the metric results as the return value of `eval_fn`. The user would then aggregate on the results of those `eval_fn`s. + + +## Implementation + + +### Low-level Primitives + +We can potentially expose them in the future when they are more stable and when we want to allow more advanced use cases. + +We will have `Cluster` and `Worker` classes to encapsulate logic related to remote function scheduling. + +```Python +class Cluster(object): + + def __init__(self, cluster_resolver, failure_handler=None): + """Create the cluster instance and connect to the remote cluster.""" + pass + + @property + def workers(self): + """Return all available workers.""" + return self._workers + + def schedule(self, function, args=None, kwargs=None): + """Schedule the function on one worker. + + It adds the function to the global scheduling queue and returns future + objects immediately. + """ + pass + + def join(self): + """Block until all scheduled functions are complete.""" + pass +``` + +We will probably merge this `Worker` with executors. + +```Python +class Worker(object): + + def __init__(self, + worker_job_name, + cluster, + max_scheduled_functions=100): + """Create a scheduling queue and a thread that processes the queue.""" + pass + + def schedule(self, function, args=None, kwargs=None): + """Schedule the function on the worker. + + It adds the function to the scheduling queue. It returns Future object + immediately. + """ + pass + + def healthy(self): + """Return a boolean indicating whether the worker is health or not.""" + pass + + def _set_dead(self): + """Declare the worker is dead and poison all future objects.""" + pass + + def _rebuild_resources(self): + """Rebuild worker-local resources when it is recovered from failure.""" + pass +``` + +As we mentioned the return value of `schedule` will be `Future` objects. The `Future` works as a container and will be later-binded with states of either success or complete failure. Overall, this `Future` class has the following benefits: + +* It allows the `schedule` method to return immediately after pushing functions to its scheduling queue. It allows these methods to return without needing to wait for acknowledgement from workers. +* It serves as the container for values or errors. It would be binded with a value or an error later. When it is rebuilt, we can replace its underlying value silently. +* When being passed to `local_result`, we flag it to indicate that this value needs to be fetched eagerly. +* It provides a handle for user to wait for and get the error of a particular function. +* (Future work) It captures the lineage between functions and return values so that we can rebuild any poisoned objects. 
+ +```Python +class Future(object): + + def __init__(self, closure): + pass + + def wait(self): + """Block until the corresponding function is executed.""" + pass + + def result(self): + """Materialize the future. + + An exception will be thrown if the corresponding function fails to + schedule/execute. + """ + pass + + def _set_value(self, value): + pass + + def _set_error(self, error): + pass + + def _set_eagerly_fetch(self): + pass +``` + +We can potentially merge this `Future` class with our `Tensor` class. + + +## Future Work + +The following are features we have been considering to support in the future although this is not an exhaustive list. We don’t promise to support all of them. We’ll prioritize according to the feedback from the community and users. + + +### Dynamic Membership + +Workers can come and go. To support this, we’ll probably need a mechanism to discover and remove workers and make our implementation of `tf.distribute` reactive. + + +### Automated Worker Pool Resizing + +Once dynamic membership is supported, it would be useful that there is automation built on top of dynamic membership, where the number of workers increases or decreases automatically based on the usage. + + +### Caching Variables/Resources + +Some variables or resources can be cached on workers to achieve faster read and update. They can have a global copy on parameter servers and local copies on all workers. We should allow users to define policies to use cached local copies to update the global copy whenever the latest value is needed. + +These variables include loss scales in mixed precision training and batchnorm statistics. These are similar to sync-on-read variables in other distribution strategies. A possible way to update the global copy using a local copy is: `global_value += (local_value - global_value) / num_workers`. + +Hash tables for embedding lookup can also be cached on workers. + + +### Worker-local Resources + +Lookup tables, replay buffers or any other worker-local resources that need to be elastic to work with the `schedule` API. The `distribute_dataset` method can also call this method to create elastic datasets for training. + +```Python +class ParameterServerStrategyV2(BaseStrategy): + + def create_worker_resource(self, resource_creation_fn): + """Create one resource per worker. + + If workers are added, the `resource_creation_fn` will be called to create + resources on new workers. + """ + pass + +class ElasticResource(object): + + def __init__(self, resource_dict): + pass + + def add_resource(self, worker_resource_pair): + pass + + def remove_resource(self, worker): + pass + + def get(self, worker): + """Return the concrete resource on the given `worker`. + + If an scheduled function takes `ElasticResource` as input, the scheduler, after + deciding which worker to schedule the function on, will call this method to + get the underlying resource on the corresponding worker. + """ + pass +``` + + +### Integration with tf.data Service + +In our design, we assume that `replica_fn` can be scheduled on any worker with some constraints. For example, datasets can not be sharded across workers; rebuilding iterators will lose their states. With the help of `tf.data` service, we can get rid of these constraints. + + +### Keras Integration + +Integrating with Keras `model.fit()` will largely be reusing previous work done when synchronous distribution strategies were integrated with Keras. 
We hope from the end-user’s perspective, they will notice minimal changes when they switch from other strategies. + +Most important implication of integrating with Keras `model.fit()` is that we will need support for `strategy.join()` and/or `strategy.local_results()` for callbacks. This would have performance implications but that would be the trade off for fitting the synchronous `model.fit()` semantics. + + +### More ScheduleOptions + +More schedule options can be added such as how many times of reschedules before returning an error to users if a function gets interrupted because of worker preemption. + + +### Versioning + +The client and standard server binaries may be in different versions. There is no backward or forward compatibility guarantee. For now, we recommend users run the same binary which will run standard TensorFlow servers if it is not the client. + + +### Better Preemption Handling + +We can leverage features of container orchestration frameworks to improve preemption handling. For example, if we can get notifications about a worker or a parameter server about to be preempted, we can save some of its state and recover much faster with this state. + + +### Advanced Fault Tolerance + + +#### Reschedule Functions with Input Affinity + +Our proposed `schedule` method supports at-least-once semantics only when functions don't have input affinity. Functions that depend on inputs that only exist on one worker can not be rescheduled. We can think of ways to rebuild these inputs to achieve at-least-once in more cases. + +With input-affinity, there may be futures that are bound to a worker and the worker can die and don’t come up within a reasonable timeout. We should poison these futures in this case. + + +#### Rebuild Arbitrary Resources and Future Objects + +Any poisoned future can be rebuilt according to the lineage relationship between functions and futures. For example, in the following diagram, to rebuild `future3`, we can rerun function `A` and function `B`, likely on a different worker if the original worker is dead. + + +![Rebuild Arbitrary Futures](20200306-single-client-parameter-server/rebuild_arbitrary_future.png) + + +### Multi-GPU Support + +We can expand the `replica_fn` into a multi-GPU function before we schedule it. + + +### Wrap `schedule`/`join` into a `tf.function` + +It is possible to implement ops for `schedule` and `join` and make the training loop wrappable by a `tf.function`. + +When a `Cluster` object is created, we use an op to create a scheduling queue and launch a background thread in runtime to periodically check the scheduling queue and schedule items on one available worker. + +The `schedule` op could just push a closure into the scheduling queue. Note any control dependency added between `schedule` ops won’t make the execution deterministic. + +The `join` op could wait until the scheduling queue is empty. 
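+
+As an illustration of the intended effect only (the ops described above do not exist yet), the per-epoch loop could then be traced as a whole:
+
+```python
+# Hypothetical sketch: if `schedule`/`join` were backed by ops, the whole
+# per-epoch loop could be traced into a single tf.function.
+@tf.function
+def train_epoch(distributed_iter, steps_per_epoch):
+  for _ in tf.range(steps_per_epoch):
+    # Assumed to lower to an op that pushes the closure into the scheduling
+    # queue; execution of the closures is still load-balanced and unordered.
+    strategy.schedule(replica_fn, args=(distributed_iter,))
+  strategy.join()  # Assumed to lower to an op that waits for the queue to drain.
+```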
diff --git a/rfcs/20200306-single-client-parameter-server/rebuild_arbitrary_future.png b/rfcs/20200306-single-client-parameter-server/rebuild_arbitrary_future.png new file mode 100644 index 000000000..8ac18991e Binary files /dev/null and b/rfcs/20200306-single-client-parameter-server/rebuild_arbitrary_future.png differ diff --git a/rfcs/20200411-fuse_recv.md b/rfcs/20200411-fuse_recv.md new file mode 100644 index 000000000..ec73aefe8 --- /dev/null +++ b/rfcs/20200411-fuse_recv.md @@ -0,0 +1,232 @@ +# FuseRecv + +| Status | Proposed | +:-------------- |:---------------------------------------------------- | +| **Author(s)** | Tongxuan Liu(tongxuan.ltx@alibaba-inc.com) Peng Tao(jiankeng.pt@alibaba-inc.com) Langshi Chen (langshi.cls@alibaba-inc.com) | +| **Reviewers(s)** | Ayush Dubey(ayushd@google.com) Jeroen Bédorf(jeroen@minds.ai) Derek Murray(mrry@google.com) Bairen Yi(yibairen.byron@bytedance.com) Paul Tucker(paul.tucker@gmail.com) | +| **Sponsor** | Ayush Dubey(ayushd@google.com) | +| **Updated** | 2020-04-11 | + +## Objective +This RFC proposes a new FuseRecv Op which would receive multiple tensors with +different types through one Remote Procedure Call (RPC). This feature could +significantly reduce the number of RPC calls in most rank or match models +such as Search, Recommend or Ad systems. + +## Motivation +When very many small tensors are being transferred around the same time, +it's more efficient to transfer multiple tensors in a single RPC rather than +using a separate RPC for each of them. + +In the case the neural network graph is complicated, each iteration through +the graph may introduce tens or even hundreds of RPC operations between the running +nodes. In general, there are a large number of small tensors, such as multiple +feature columns that gather data from the same Parameter Server. These tensors +have no dependence on each other, and each feature column results in at least +one RPC operation in the forward stage. In CTR (Click Through Rate) models or +models that are mostly sparse (such as Match or Rank models that are widely +used in Recommender and Ad systems), there would be hundreds of feature columns. +In our scenario, each sample includes at least hundreds of features. +One training job normally uses thousands of workers and tens of parameter servers. +One worker generally has to get variables from all the parameter servers, and each +feature column, at least in the forward stage, receives at least one request from +the parameter server. There could be hundreds of RPC operations for these feature columns, +and even more for some of the big feature columns (such as ids). These would be partitioned +into dozens of RPCs per feature column. In summary there would be +at least hundreds of RPC per worker for these feature columns only, and +hundreds of thousands of RPCs per step, for each parameter server in the forward stage. +Most feature columns only gather very small tensors from the parameter +server, usually less than 100KB. Logically these small tensors could be +sent together (e.g. fused). Furthermore, tensors that belong to the same layer can also +be fused before transfer, which would significantly reduce the number of RPC operations. + +As we know, each RPC operations introduces some satellite overhead besides the +actual tensor data transfer, which includes: +* Serialization/Deserialization which introduces additional overhead for each RPC operation. 
+* The execution engine overhead for executing a Recv node operation, and the corresponding thread pool + action required to execute the RPC callback function. + +## User Benefit + +Performance improvement: From performance benchmarking of the feature during large +(end-user) training jobs (> 400 workers), we normally see that the training speed would +be 1.5-2x timer faster in the parameter-server/worker setup. + +## Design Proposal + +![Figure 1: Current graph partition strategy](20200411-fuse_recv/current_graph_partition_strategy.png "Current graph partition strategy") +![Figure 2: Graph partition strategy with FuseRecv](20200411-fuse_recv/graph_partition_strategy_with_fuse_recv.png "Graph partition strategy with FuseRecv") + +In the original Recv/Send design, each Recv node only receives one tensor +even if there are Recv Ops that output to the same destination Op. Moreover each +Recv node would trigger one RPC operation even if the received tensor is a scalar. + +In the proposed design, we traverse (partitioned) graphs according to +its topology and iteratively replace Recv nodes with the new FuseRecv nodes. +Please refer to the details in Section [FuseRecv Optimizer in Grappler](#FuseRecv Optimizer in Grappler) + +As shown in Figures 1 and 2, instead of adding a Recv node for each tensor +‘a’ and ‘x’, we use only one FuseRecv node to replace the two Recv nodes which +fetches two tensors together. The FuseRecv node will have two output +‘slots’ (‘ports’): slot 0 feeds input ‘b’ and ‘c’ and slot 1 feeds ‘y’. +Notice that, because the RPC operation is Recv driven, there is no need +to fuse the send node. + +A new RPC method ‘FuseRecvTensorAsync’ and its Handler (FuseRecvTensorHandlerRaw) +is added into WorkInterface and WorkerService. FuseRecvTensor follows similar +optimization steps as RecvTensor to avoid copying the response buffer. + +### Alternatives Considered +#### Fuse the tensors into a single Send/Recv Solution 1 (Derek Murray) +Pack the N tensors to be sent into a length-N DT_VARIANT vector. + +Pros: Reuse currently RPC, avoid potential intricate changes in zero-copy +response buffer code. + +Cons: Introduce memcopy overhead. + +#### Fuse the tensors into a single Send/Recv Solution 2 (Derek Murray) +Pack the tensor contents into a single flattened buffer. This would be very +similar to the ScopedAllocator optimization that +ayushd@google.com and ++tucker@google.com implemented for collectives, and it might be possible +to reuse some of the graph analysis code + +Pros: Reuse currently RPC, avoid potential intricate changes in zero-copy +response buffer code. + +Cons: The fused tensors could be of different types and dynamic shapes, +which couldn't be handled by this solution. + +#### Dynamic Fusion in runtime (Paul Tucker) +Instead of adding a new FuseRecvTensor method to the Worker interface, +we add a slightly different RecvSomeTensors method. The client sends a +list of keys for which it's ready to receive values to the server and the +server streams back one or more when it's ready. It's the responsibility of +the client to retry any key that was not included in the response. + +To make this work well there needs to be some dynamic bundling on each side. +For example, on the client side a call to RecvTensor on the local Rendezvous +for a remote value does not necessarily result in an immediate RPC. It might +if the value is expected to be large, but it might also just add the key to +a ready set associated with the remote host. 
An RPC may not be sent until +the ready set reaches a certain size, or a minimum time has elapsed since the +last RPC against that host was started. When the response is received any +missing keys go back in the ready set. + +On the server side there could be some logic to decide for a RecvSomeTensors +method whether to wait for more of the requested values to be ready or just +immediately send what's available now and let the client re-request anything +missing. + +Pros: Dynamic fusion in runtime seems get better result, and also brings +ability to control priority of tensors (which Recv is more important). + +Cons: Potential bottleneck of the solution is the time window of ready set. +For different models it would be much different, manually setting the value +would be hard. This solution is another good candidate of FuseRecv. + +### Performance Implications +With a wide and deep model, the number of RPCs calls per step has been reduced +by 55%, and the overall training throughput has increased by 40%. +![Figure 3: performance_result](20200411-fuse_recv/performance_result.png "Performance result") + +### Dependencies +* None + +### Engineering Impact +* Engineering impact: Once the feature is (manually) enabled (in ConfigProto.GraphOptions.do_fuse_recv), the test times would be longer because the FuseRecv post-partitioned optimizer would traverse and update the graph. +* Maintenance: Minimal maintenance overhead. The TensorFlow team and contributors will maintain the documentation and keep it up to date. Changes should be reviewed and approved by the TensorFlow team leads. + +### Platforms and Environments +* Platforms: The feature is independent of platforms. +* Execution environments (Cloud services, accelerator hardware): The first stage would support CPU & GPU device. We consider supporting +additional devices as much as possible. + +### Best Practices +* We strongly suggest to enable FuseRecv in rank or match models such as [W&DL](https://arxiv.org/abs/1606.07792), [Dien](https://arxiv.org/abs/1809.03672). + +### Tutorials and Examples +Example of how to enable the FuseRecv feature: + +``` + >>> tf.config.optimizer.set_experimental_options({"do_fuse_recv": True}) +``` + +### Compatibility +* This feature works with the ParameterServerStrategy. +* This feature considers tensors on difference devices such as CPU, GPU and TPU. +* Independent of SavedModel or checkpoint. + +### User Impact +* None + +## Detailed Design + +### FuseRecv Op +We introduce the _RecvV2 Op and an RPC operation named FuseRecvTensorAsync in +RemoteWorker and WorkerService. The _RecvV2 Op definition is as follows: + +``` + >>> REGISTER_OP("_RecvV2") + >>> .Output("tensor: tensor_type") + >>> .Attr("tensor_type: list(type)") + >>> .Attr("tensor_name: list(string)") + >>> .Attr("send_device: string") + >>> .Attr("send_device_incarnation: int") + >>> .Attr("recv_device: string") + >>> .Attr("client_terminated: bool = false") + >>> .SetIsStateful() + >>> .SetShapeFn(shape_inference::UnknownShape); +``` + +FuseRecv requests a list of tensors with different types from remote devices, generally +we only fuse the Recv ops in the same recv device and on the same send device. + +### FuseRecv Optimizer in Grappler +During the post partition phase, we add a new pass to the post-partitioning optimizer +called “FuseRecv” to fuse Recv ops together. 
+The pass traverses both the whole graph and the
+partitioned graphs according to their topology, iteratively searching for Recv
+ops that can be fused and replacing them with FuseRecv ops in the partitioned
+graphs. See Figure 4 for the formal algorithm definition.
+
+![Figure 4: fuse_recv_procedure](20200411-fuse_recv/fuse_recv_procedure.png "Fuse Recv Procedure")
+
+The procedure RECVFUSE takes two input arguments: 1) the TF computation
+graph g, and 2) a partitioned graph. Note that the iteration over nodes
+starts from the `root` nodes, which have no incoming (source) edges. The
+steps between lines 17 and 37 of the procedure are executed iteratively and
+output key-value pairs, where each value is a group of edges that can be fused
+into one FuseRecv node. Based on the grouped edges, we then find the Recv
+nodes in the partitioned graph that can be replaced by FuseRecv nodes.
+RECVFUSE also makes sure that no deadlock exists after the change to the
+original graph. In addition, the FuseRecvTensor RPC operation is able to
+overlap computation and communication by using the graph topology.
+
+### FuseRecv RPC Method and Handler
+A new RPC method ‘FuseRecvTensorAsync’ is added to the WorkerInterface.
+Unlike the existing Recv path, ‘FuseRecvTensorAsync’ handles multiple
+rendezvous keys and fetches the tensors for all of those keys in one call.
+
+On the server side, we add a ‘FuseRecvTensorHandlerRaw’, which handles
+the multiple rendezvous keys for the ‘local recv’ operations instantiated by
+the local tensor operations. As mentioned before, the Send nodes are not
+fused, so we must perform multiple local recvs, one for each of the
+corresponding Send nodes.
+
+Because the ‘FuseRecvTensorAsync’ handler might be executed before the Send
+operations happen, a callback wrapper is required. We use a counter
+initialized to the fuse count; each Send action triggers the callback wrapper
+and atomically decrements the counter. When the counter reaches 0, the real
+callback is executed and the tensors are sent back for the FuseRecv node.
+
+### Dead Tensor Handling
+We treat the output of the FuseRecv node as dead if and only if all the
+fused tensors are dead.
+
+### FuseRecv Error Handling
+The status of the FuseRecv node is similar to that of the Recv node, but it
+includes additional information for every received tensor.
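+As an informal illustration of the grouping step described in the FuseRecv
+Optimizer section above, the following self-contained Python sketch buckets
+cross-device edges by their (send device, recv device) pair; each bucket would
+correspond to one FuseRecv (_RecvV2) node with one output slot per fused
+tensor. The edge list and device names below are made up for illustration, and
+this is not the actual Grappler implementation.
+
+```python
+from collections import defaultdict
+
+# Each entry is (tensor_name, send_device, recv_device); illustrative only.
+cross_device_edges = [
+    ("a", "/job:ps/task:0", "/job:worker/task:0"),
+    ("x", "/job:ps/task:0", "/job:worker/task:0"),
+    ("w", "/job:ps/task:1", "/job:worker/task:0"),
+]
+
+# Group edges that share the same send/recv device pair.
+fuse_groups = defaultdict(list)
+for tensor_name, send_dev, recv_dev in cross_device_edges:
+    fuse_groups[(send_dev, recv_dev)].append(tensor_name)
+
+# Each group becomes a single FuseRecv with one output slot per tensor.
+for (send_dev, recv_dev), tensors in fuse_groups.items():
+    print("FuseRecv %s -> %s fetches %s" % (send_dev, recv_dev, tensors))
+```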
+ +## Questions and Discussion Topics + diff --git a/rfcs/20200411-fuse_recv/current_graph_partition_strategy.png b/rfcs/20200411-fuse_recv/current_graph_partition_strategy.png new file mode 100644 index 000000000..1d882cd96 Binary files /dev/null and b/rfcs/20200411-fuse_recv/current_graph_partition_strategy.png differ diff --git a/rfcs/20200411-fuse_recv/fuse_recv_procedure.png b/rfcs/20200411-fuse_recv/fuse_recv_procedure.png new file mode 100644 index 000000000..359b006ee Binary files /dev/null and b/rfcs/20200411-fuse_recv/fuse_recv_procedure.png differ diff --git a/rfcs/20200411-fuse_recv/graph_partition_strategy_with_fuse_recv.png b/rfcs/20200411-fuse_recv/graph_partition_strategy_with_fuse_recv.png new file mode 100644 index 000000000..58c7887eb Binary files /dev/null and b/rfcs/20200411-fuse_recv/graph_partition_strategy_with_fuse_recv.png differ diff --git a/rfcs/20200411-fuse_recv/performance_result.png b/rfcs/20200411-fuse_recv/performance_result.png new file mode 100644 index 000000000..54100ba5a Binary files /dev/null and b/rfcs/20200411-fuse_recv/performance_result.png differ diff --git a/rfcs/20200420-tfx-tuner-component.md b/rfcs/20200420-tfx-tuner-component.md new file mode 100644 index 000000000..1196f26ec --- /dev/null +++ b/rfcs/20200420-tfx-tuner-component.md @@ -0,0 +1,383 @@ +# TFX Tuner Component + +| Status | Approved | +| :------------ | :-------------------------------------------------------- | +| **Author(s)** | Jiayi Zhao (jyzhao@google.com), Amy Wu (wuamy@google.com) | +| **Sponsor** | Zhitao Li (zhitaoli@google.com), Tom O'Malley (omalleyt@google.com), Matthieu Monsch (mtth@google.com), Makoto Uchida (muchida@google.com), Goutham Bhat (goutham@google.com) | +| **Updated** | 2020-04-20 | + +## Objective + +### Goal + +* A new Tuner component in TFX for automated hyper-parameter tuning, which is + based on abstractions from + [KerasTuner library](https://github.com/keras-team/keras-tuner), in order to + reuse abstractions and algorithms from latter. + +### Non Goal + +* Natively support multi-worker tuning by the system. As TFX doesn't have + ability to manage multi-worker clusters, running multiple trials in parallel + (parallel tuning) and running each trial in distributed env (distributed + training) are not supported natively. Parallel tuning may instead be + realized by a particular implementation of TFX Tuner (custom Executor), + e.g., in Google Cloud environment. +* Implementation of custom tuner for + [KerasTuner library](https://github.com/keras-team/keras-tuner) is out of + scope of this design discussion, e.g., a built-in EstimatorTuner support. + However, user project can still implement a tuner that inherits from + [`kerastuner.BaseTuner`](https://github.com/keras-team/keras-tuner/blob/1.0.0/kerastuner/engine/base_tuner.py) + and provide it to the proposed TFX Tuner component. + +## Background and Motivation + +A hyperparameter is a parameter whose value is used to control the learning +process of a model or the model itself (e.g., layers and number of nodes). By +contrast, the values of other parameters (typically node weights) are learned. + +Hyperparameter optimization is a critical part of many machine learning +pipelines. Thus we propose a new TFX component, with the given search space +which specifies the hyperparameter configuration (name, type, range etc.). TFX +will optimize the hyperparameters based on the tuning algorithm. 
+
+## User Benefit
+
+This document proposes a built-in TFX Tuner component, which works seamlessly
+with Trainer and other TFX components. As the Tuner component will utilize the
+[KerasTuner library](https://github.com/keras-team/keras-tuner), all supported
+tuning methods will be available to TFX, including custom implementations of
+KerasTuner tuners.
+
+## Design Proposal
+
+The TFX Tuner component will be built on the
+[KerasTuner library](https://github.com/keras-team/keras-tuner). In the
+following sections, we will first briefly go over the KerasTuner library and
+several concepts in hyperparameter optimization. Then we will focus on the Tuner
+component interface and how we utilize the KerasTuner library. After that, we
+will discuss parallel tuning and our plan for Google Cloud integration.
+
+### KerasTuner Library
+
+The following graph shows a typical workflow of hyperparameter tuning under the
+KerasTuner framework:
+
+![Hyperparameter tuning workflow with KerasTuner](20200420-tfx-tuner-component/workflow.png)
+ +Given the user provided model which accepts a hyperparameter container, tuner +can search optimization through trials created by the tuning algortihm. For each +trial, values within search spaces will be assigned to hyperparameter +containers, and the user model will be trained with these hyperparameter values +and evaluated based on the objective provided to the tuner. The evaluation +results will be reported back to tuner and the tuning algorithm will decide the +hyperparameter values for the next trial. After reaching certain conditions, +e.g., max trials, the tuner will stop iteration and return the optimal +hyperparameters. + +KerasTuner library provides above tuning functionality, here are some +abstractions in KerasTuner: + +* [`HyperParameters`](https://github.com/keras-team/keras-tuner/blob/1.0.0/kerastuner/engine/hyperparameters.py): + Hyperparameter container for both search space, and current values. +* [`Oracle`](https://github.com/keras-team/keras-tuner/blob/1.0.0/kerastuner/engine/oracle.py): + Implementation of a hyperparameter tuning algorithm, e.g., random search, + including state management of the algorithm’s progress. +* [`Trial`](https://github.com/keras-team/keras-tuner/blob/1.0.0/kerastuner/engine/trial.py): + Provided by the Oracle, contains information about Hyperparameter values for + the current iteration. +* [`BaseTuner`](https://github.com/keras-team/keras-tuner/blob/1.0.0/kerastuner/engine/base_tuner.py): + a base tuner interface for above tuning workflow, responsible for the + iteration of trial execution: + * Generates Trial using Oracle. + * Trains user model with the HyperParameters in the current Trial. + * Evaluates metrics and reports back to Oracle for next Trial. +* [`Tuner`](https://github.com/keras-team/keras-tuner/blob/1.0.0/kerastuner/engine/tuner.py): + An implementation of BaseTuner, for Keras model tuning. + +Note: Other than the Tuner, abstractions defined by `HyperParameters`, `Oracle`, +`Trial` and `BaseTuner` are not restricted to Keras models, although the library +is called KerasTuner. + +For more details and code examples, please refer to +[here](https://github.com/keras-team/keras-tuner). + +### Component Interface + +Tuner component takes raw or transformed examples as input, along with schema or +transform_graph for the feature specification, and outputs the hyperparameter +tuning results, below shows the specification of Tuner component: + +```python +class TunerSpec(ComponentSpec): + """ComponentSpec for TFX Tuner Component.""" + + PARAMETERS = { + # Specify a python module file which contains a UDF `tuner_fn`. + 'module_file': ExecutionParameter(type=(str, Text), optional=True), + # Specify the steps for the training stage of each trial’s execution. + 'train_args': ExecutionParameter(type=trainer_pb2.TrainArgs), + 'eval_args': ExecutionParameter(type=trainer_pb2.EvalArgs), + } + + INPUTS = { + 'examples': ChannelParameter(type=standard_artifacts.Examples), + 'schema': ChannelParameter( + type=standard_artifacts.Schema, optional=True), + 'transform_graph': + ChannelParameter( + type=standard_artifacts.TransformGraph, optional=True), + } + + OUTPUTS = { + 'best_hyperparameters': + ChannelParameter(type=standard_artifacts.HyperParameters), + } +``` + +Trainer has an optional hyperparameters input; tuning result can be fed into it +so that Trainer can utilize best hyperparameters to construct the model. Below +shows an example about how tuner and trainer are chained in the pipeline: + +```python +# TrainerSpec: + INPUTS = { + ... 
+ 'hyperparameters': + ChannelParameter( + type=standard_artifacts.HyperParameters, optional=True), + } + +# Pipeline DSL Example: + tuner = Tuner( + examples=example_gen.outputs['examples'], + schema=schema_gen.outputs['schema'], + module_file=module_file, + train_args=trainer_pb2.TrainArgs(num_steps=1000), + eval_args=trainer_pb2.EvalArgs(num_steps=500)) + + trainer = Trainer( + module_file=module_file, + examples=example_gen.outputs['examples'], + schema=schema_gen.outputs['schema'], + hyperparameters=tuner.outputs['best_hyperparameters'], + train_args=trainer_pb2.TrainArgs(num_steps=10000), + eval_args=trainer_pb2.EvalArgs(num_steps=5000)) +``` + +For Trainer, users need to define model code and training logic +([Generic Trainer](https://github.com/tensorflow/tfx/blob/r0.21.2/docs/guide/trainer.md#generic-trainer)) +in the module_file. For Tuner, in addition to model code, users also need to +define hyperparameters, search space and a tuning algorithm in the module_file. +A `tuner_fn` with the following signature is required for Tuner: + +```python +from kerastuner.engine import base_tuner +import tensorflow as tf +from tfx.components.trainer.executor import TrainerFnArgs + +# Current TrainerFnArgs will be renamed to FnArgs as a util class. +FnArgs = TrainerFnArgs +TunerFnResult = NamedTuple('TunerFnResult', + [('tuner', base_tuner.BaseTuner), + ('fit_kwargs', Dict[Text, Any])]) + +def tuner_fn(fn_args: FnArgs) -> TunerFnResult: + """Build the tuner using the KerasTuner API. + + Args: + fn_args: Holds args as name/value pairs. + working_dir: working dir for tuning. Automatically set by Executor. + train_files: List of file paths containing training tf.Example data. + eval_files: List of file paths containing eval tf.Example data. + train_steps: number of train steps. + eval_steps: number of eval steps. + schema: optional schema file of the input data. + transform_graph: optional transform graph produced by TFT. + + Returns: + A namedtuple contains the following: + - tuner: A BaseTuner that will be used for tuning. + - fit_kwargs: Args to pass to tuner’s run_trial function for fitting the + model , e.g., the training and validation dataset. Required + args depend on the above tuner’s implementation. + """ +``` + +The TunerFnResult returned by the above tuner_fn contains an instance that +implements the +[`BaseTuner`](https://github.com/keras-team/keras-tuner/blob/1.0.0/kerastuner/engine/base_tuner.py) +interface, that’s the contract required by Tuner for tuning. The model code, +hyperparameters, search space and tuning algorithm are hidden under the +BaseTuner abstraction so the Tuner itself is generic and agnostic to the model +framework and tuning logic. Below shows an example module file with Keras model: + +```python +import kerastuner +import tensorflow as tf +... + +def _input_fn(file_pattern: Text, ...) -> tf.data.Dataset: + ... + +# Model code for Trainer and Tuner. +def _build_keras_model(hp: kerastuner.HyperParameters) -> tf.keras.Model: + ... + for _ in range(hp.get('num_layers')): + ... + ... + model = tf.keras.Model(...) + model.compile( + optimizer=tf.keras.optimizers.Adam(hp.get('learning_rate')), + loss='sparse_categorical_crossentropy', + metrics=[tf.keras.metrics.Accuracy()]) + return model + +# This will be called by TFX Tuner. +def tuner_fn(fn_args: FnArgs) -> TunerFnResult: + hp = kerastuner.HyperParameters() + # Defines search space. + hp.Choice('learning_rate', [1e-1, 1e-3]) + hp.Int('num_layers', 1, 5) + + # RandomSearch is a subclass of Keras model Tuner. 
+ tuner = kerastuner.RandomSearch( + _build_keras_model, + max_trials=5, + hyperparameters=hp, + allow_new_entries=False, + objective='val_accuracy', + directory=fn_args.working_dir, + project_name='test') + + train_dataset=_input_fn(fn_args.train_files, ...) + eval_dataset=_input_fn(fn_args.eval_files, ...) + + return TunerFnResult( + tuner=tuner, + fit_kwargs={'x': train_dataset, + 'validation_data': eval_dataset, + 'steps_per_epoch': fn_args.train_steps, + 'validation_steps': fn_args.eval_steps}) + +# This will be called by TFX Generic Trainer. +def run_fn(fn_args: FnArgs) -> None: + hp = kerastuner.HyperParameters.from_config(fn_args.hyperparameters) + model = _build_keras_model(hp) + model.fit(...) + model.save(...) +``` + +In Tuner’s executor, `tuner_fn` will be called with information resolved from +component inputs, then we call the `search` function of the returned tuner with +`fit_kwargs` to launch trials for tuning, and finally emit the best trial’s +hyperparameters: + +```python +# Executor of Tuner Component: +class Executor(base_executor.BaseExecutor): + + def Do(self, + input_dict: Dict[Text, List[types.Artifact]], + output_dict: Dict[Text, List[types.Artifact]], + exec_properties: Dict[Text, Any]) -> None: + ... + tuner_spec = tuner_fn(self._create_fn_args(input_dict, exec_properties)) + tuner_spec.tuner.search(**tuner_spec.fit_kwargs) + # Output file contains json format string of hyperparameters.get_config(). + self._emit_best_hyperparameters( + output_dict, tuner_spec.tuner.get_best_hyperparameters()[0]) +``` + +### Parallel Tuning + +In parallel tuning, multiple trials are executed in parallel. In this section, +we will discuss how distribution works for KerasTuner library and the status of +TFX. + +In the `search` function of tuner, trials will be run in sequence instead of in +parallel. To support parallel tuning, we need to launch multiple tuners (the +tuner here refers to the one in KerasTuner library, not TFX Tuner component), +and have an optimization service for managing the state of the tuning algorithm, +with which oracle of each tuner communicates, and retrieves the trials for each +tuner. + +
+ +The above graph shows a parallel tuning of three tuners. Each tuner runs as a +different worker, and it retrieves trials from its own oracle, which talks to +optimization service. Trials of different tuners can run in parallel but trials +within the same tuner will still execute in sequence. When launching tuners, the +same identifier will be assigned to each oracle, thus the optimization service +knows they are in the same tuning job group and will assign hyperparameter +values for their trials based on the algorithm. + +The number of parallel tuners can be passed to component by the `TuneArgs` as +shown below: + +```python +# Args specific to tuning. +message TuneArgs { + # Number of trials to run in parallel. + # Each trial will be trained and evaluated by separate worker jobs. + int32 num_parallel_trials = 1; +} + +class TunerSpec(ComponentSpec): + + PARAMETERS = { + ... + 'tune_args': ExecutionParameter(type=tuner_pb2.TuneArgs), + } +``` + +The KerasTuner library allows users to config +[`tf.distribute.Strategy`](https://www.tensorflow.org/tutorials/distribute/kerass) +if they are using +[`kerastuner.Tuner`](https://github.com/keras-team/keras-tuner/blob/1.0.0/kerastuner/engine/tuner.py) +class (or subclasses of it). In above parallel tuning, each trial (each model +training) is executed in a single worker, as such only single machine strategy +is allowed. To support multi-worker distributed training, we need to be able to +execute the trial (training) on different workers. + +At the time of writing, KerasTuner library can be used for parallel tuning with +single machine `tf.distribute.Strategy`, e.g., +[`MirroredStrategy`](https://www.tensorflow.org/api_docs/python/tf/distribute/MirroredStrategy) +, multi-worker strategy (distributed training for trial) support is on the +roadmap (note that cluster managing is not part of the library). + +At the time of writing, TFX doesn’t have the ability to manage the multi-worker +cluster and the centralized optimization service, so parallel tuning or +distributed training is not supported natively in TFX (local or on-prem), but in +the next section, we will discuss the integration for Google Cloud. Similar +parallel tuning support can be built for other execution environments. + +### Google Cloud Integration + +In this section, we discuss the Tuner component with +[Google Cloud AI Platform](https://cloud.google.com/ai-platform) (CAIP), +specifically, an implementation of KerasTuner Oracle that talks to the +[AI Platform Optimizer](https://cloud.google.com/ai-platform/optimizer/docs/overview) +as the centralized optimization service, and a custom Tuner executor +implementation that makes use of the Cloud Optimizer-based Oracle (symbol names +are subject to change). + +As mentioned above in the parallel tuning section, KerasTuner uses a centralized +optimization service that manages states of a tuning study and trials. In +addition to that, we will create a `CloudOracle` as a client to the AI Platform +Optimizer service, and a `CloudTuner` which inherits from Keras +[Tuner](https://github.com/keras-team/keras-tuner/blob/1.0.0/kerastuner/engine/tuner.py). +In the module file, users create the `tuner_fn` with `CloudTuner`, and then +users configure the TFX Tuner component to use the a custom Tuner executor +(`CloudExecutor`), which launches multiple `CloudTuner`s on a Google Cloud AI +Platform Training job with possibly multiple worker machines running various +trials concurrently. Below shows the workflow for in process tuning and Cloud +tuning. 
+ +
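+As a usage sketch for the parallel-tuning knob introduced above, a pipeline
+could request several parallel trials through the proposed `TuneArgs` proto.
+The snippet below mirrors the earlier pipeline DSL example; the `tuner_pb2`
+module name is an assumption of this sketch, and the `tune_args` parameter and
+`TuneArgs` message are part of this proposal rather than existing APIs.
+
+```python
+  # Sketch only: `tune_args` and `tuner_pb2.TuneArgs` follow the proposal above.
+  tuner = Tuner(
+      examples=example_gen.outputs['examples'],
+      schema=schema_gen.outputs['schema'],
+      module_file=module_file,
+      train_args=trainer_pb2.TrainArgs(num_steps=1000),
+      eval_args=trainer_pb2.EvalArgs(num_steps=500),
+      tune_args=tuner_pb2.TuneArgs(num_parallel_trials=3))
+```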
+
+## Future work
+
+* Native support for multi-worker parallel tuning.
+* Custom Tuner (inherits from BaseTuner) examples, e.g., for Estimator support
+  or Keras custom training loop support.
diff --git a/rfcs/20200420-tfx-tuner-component/cloud.png b/rfcs/20200420-tfx-tuner-component/cloud.png
new file mode 100644
index 000000000..09559da71
Binary files /dev/null and b/rfcs/20200420-tfx-tuner-component/cloud.png differ
diff --git a/rfcs/20200420-tfx-tuner-component/parallel_tuning.png b/rfcs/20200420-tfx-tuner-component/parallel_tuning.png
new file mode 100644
index 000000000..efd62b113
Binary files /dev/null and b/rfcs/20200420-tfx-tuner-component/parallel_tuning.png differ
diff --git a/rfcs/20200420-tfx-tuner-component/workflow.png b/rfcs/20200420-tfx-tuner-component/workflow.png
new file mode 100644
index 000000000..4f8bd89da
Binary files /dev/null and b/rfcs/20200420-tfx-tuner-component/workflow.png differ
diff --git a/rfcs/20200505-transactional-fs.md b/rfcs/20200505-transactional-fs.md
new file mode 100644
index 000000000..9ff1b7378
--- /dev/null
+++ b/rfcs/20200505-transactional-fs.md
@@ -0,0 +1,850 @@
+# Transactional File Systems Support
+
+| Status        | Accepted                                                 |
+| :------------ | :------------------------------------------------------ |
+| **RFC #**     | [945](https://github.com/tensorflow/community/pull/945) |
+| **Author(s)** | Sami Kama (kamsami@amazon.com)                           |
+| **Sponsor**   | Mihai Maruseac (mihaimaruseac@google.com)                |
+| **Updated**   | 2020-05-23                                               |
+
+## Objective
+
+The aim of this RFC is to extend filesystem access support to persistent storage that
+provides transactional access and eventual consistency.
+
+## Motivation
+
+The current persistent storage implementation in TensorFlow relies on certain guarantees that
+existing local file systems provide. But in the age of big data, local filesystems are not always
+sufficient, and prefetching the data to the local system and then uploading it after processing can be error-prone
+and hard to implement for end users. Direct access to persistent storage that provides different guarantees is desirable.
+Cloud storage solutions offered by Amazon, Google and others, as well as databases, are examples of such persistent storage.
+Moreover, even though local file systems provide certain atomic-like transactions, these are at the file level.
+For use cases like checkpointing, transactions are emulated by creating a temporary directory, adding files to it, and then
+renaming/moving the temporary directory to the final location. Such operations would also benefit from the enhancements proposed in this RFC.
+
+Transactions can also help with some filesystem access inconsistencies that can happen while reading and writing checkpoints. For example, while one thread is reading files from a directory, another may be modifying the underlying files. This could lead to the reader seeing an inconsistent or corrupt set of files. With transactions, each thread can have a different transaction token, and the underlying file system can choose to postpone modification of files by redirecting them to a temporary location and then moving them into place when the transaction ends.
+
+## User Benefit
+
+With this extension proposal, users will have more stable access to cloud storage systems, and checkpointing can become more robust.
+
+## Design Proposal
+
+This RFC proposes to extend the [filesystem plugin RFC][filesystem_plugin] API with transaction markers. There can be different levels of transactions. The first level is global transactions: the user starts a transaction scope.
+Any operations done within this scope will be grouped into this transaction. This is the easiest to implement but has drawbacks: only one transaction can be active at a time, and different types of transactions may not always be reordered.
+An alternative to this approach is having multiple transaction scopes. The user can create a transaction token and pass it to filesystem operations so that the plugin can differentiate among independent transactions. This token can have per-file or per-directory granularity. Even though per-file granularity would give the most flexibility, per-directory transactions are most likely sufficient.
+
+Filesystem plugins may choose to ignore the transaction scopes, or they can delay the operations until the termination of the transaction scope.
+
+### Extension to existing filesystem implementation
+
+The existing filesystem C++ API can easily be expanded with three methods, an opaque structure, and possibly a helper class to support transactions.
+
+```cpp
+struct TransactionToken {
+  FileSystem* owner;
+  void* id;
+};
+
+// C++ helper class for transaction scoping.
+template <typename T>
+class TokenScope {
+ public:
+  // Transaction name can be a file name or a directory name.
+  TokenScope(T* fs, const string& transaction_name) {
+    auto status = fs->StartTransaction(transaction_name, &token);
+  }
+  ~TokenScope() { token.owner->EndTransaction(&token); }
+  TokenScope(const TokenScope&) = delete;
+  const TransactionToken* GetToken() const { return &token; }
+
+ private:
+  TransactionToken token;
+};
+```
+
+For coarse granularity, adding `StartTransaction`, `EndTransaction` and `GetTransactionTokenForFile` methods would be sufficient. However, this would prevent having multiple simultaneous transactions per file system and limit flexibility. Thus we propose extending the signature of each method with a pointer to a `TransactionToken` structure, defaulting to `nullptr`, to minimize the impact on existing code and allow incremental migration to an implementation of transactions.
+ +```cpp +class Filesystem { + // Transaction Token API extensions + virtual Status GetTransactionTokenForFile(const string& file_name,TransactionToken* token) = 0; + virtual Status StartTransaction(const string& transaction_name, TransactionToken* token) = 0; + virtual Status EndTransaction(TransactionToken* token) = 0; + + // File creation + virtual Status NewRandomAccessFile(const string& fname, std::unique_ptr* result, TransactionToken* token=nullptr) = 0; + virtual Status NewWritableFile(const string& fname, std::unique_ptr* result, TransactionToken* token=nullptr) = 0; + virtual Status NewAppendableFile(const string& fname, std::unique_ptr* result, TransactionToken* token=nullptr) = 0; + virtual Status NewReadOnlyMemoryRegionFromFile(const string& fname, std::unique_ptr* result, TransactionToken* token=nullptr) = 0; + + // Creating directories + virtual Status CreateDir(const string& dirname, TransactionToken* token=nullptr) = 0; + virtual Status RecursivelyCreateDir(const string& dirname, TransactionToken* token=nullptr); + + // Deleting + virtual Status DeleteFile(const string& fname, TransactionToken* token=nullptr) = 0; + virtual Status DeleteDir(const string& dirname, TransactionToken* token=nullptr) = 0; + virtual Status DeleteRecursively(const string& dirname, int64* undeleted_files, int64* undeleted_dirs, TransactionToken* token=nullptr); + + // Changing directory contents + virtual Status RenameFile(const string& src, const string& target, TransactionToken* token=nullptr) = 0; + virtual Status CopyFile(const string& src, const string& target, TransactionToken* token=nullptr); + + // Filesystem information + virtual Status FileExists(const string& fname, TransactionToken* token=nullptr) = 0; + virtual bool FilesExist(const std::vector& files, std::vector* status,TransactionToken* token=nullptr); + virtual Status GetChildren(const string& dir, std::vector* result, TransactionToken* token=nullptr) = 0; + virtual Status Stat(const string& fname, FileStatistics* stat, TransactionToken* token=nullptr) = 0; + virtual Status IsDirectory(const string& fname, TransactionToken* token=nullptr); + virtual Status GetFileSize(const string& fname, uint64* file_size, TransactionToken* token=nullptr) = 0; + + // Globbing + virtual Status GetMatchingPaths(const string& pattern, std::vector* results, TransactionToken* token=nullptr) = 0; + + // Misc + virtual void FlushCaches(); + virtual string TranslateName(const string& name) const; +}; +``` + +Transaction token will be owned by the Filesystem and use of it after `EndTransaction` will be an invalid operation. + +File classes can be modified to keep TransactionToken, assigned by the filesystem on their construction using given scope, or default scope if not given. Filesystems may ignore it if transaction at that level doesn't make sense. 
+ +```cpp +class RandomAccessFile { + virtual Status Name(StringPiece* result) const; + virtual Status Read(uint64 offset, size_t n, StringPiece* result, char* scratch) const = 0; + private: + TransactionToken token; +}; + +class WritableFile { + virtual Status Name(StringPiece* result) const; + virtual Status Append(StringPiece data) = 0; + virtual Status Append(const absl::Cord& cord); + virtual Status Tell(int64* position); + virtual Status Close() = 0; + virtual Status Flush() = 0; + virtual Status Sync() = 0; + private: + TransactionToken token; +}; + +class ReadOnlyMemoryRegion { + virtual const void* data() = 0; + virtual uint64 length() = 0; + private: + TransactionToken token; +}; +``` + +Then respective `Env` class methods needs to receive transaction tokens to relay on the file system. Arguments are defaulted to nullptr, indicating use of default transaction. +Transaction tokens should be taken from respective filesystems. Alternatively, they can be constructed with an `UNINITIALIZED` token and then respective filesystem can populate it. + +```cpp +class Env { + // Filesystem registration + virtual Status GetFileSystemForFile(const string& fname, FileSystem** result); + virtual Status GetRegisteredFileSystemSchemes(std::vector* schemes); + virtual Status RegisterFileSystem(const string& scheme, FileSystemRegistry::Factory factory); + + // Transaction Token related + virtual Status GetTransactionTokenForFile(const string& file_name, TransactionToken** token) = 0 + virtual Status StartTransaction(const string& transaction_name, TransactionToken** token) = 0; + virtual Status EndTransaction(TransactionToken* token) = 0; + + // Creating files, including memory mapped + Status NewRandomAccessFile(const string& fname, std::unique_ptr* result, TransactionToken* token=nullptr); + Status NewWritableFile(const string& fname, std::unique_ptr* result, TransactionToken* token=nullptr); + Status NewAppendableFile(const string& fname, std::unique_ptr* result, TransactionToken* token=nullptr); + Status NewReadOnlyMemoryRegionFromFile(const string& fname, std::unique_ptr* result, TransactionToken* token=nullptr); + + // Creating directories + Status CreateDir(const string& dirname, TransactionToken* token=nullptr); + Status RecursivelyCreateDir(const string& dirname, TransactionToken* token=nullptr); + + // Deleting + Status DeleteFile(const string& fname, TransactionToken* token=nullptr); + Status DeleteDir(const string& dirname, TransactionToken* token=nullptr); + Status DeleteRecursively(const string& dirname, int64* undeleted_files, int64* undeleted_dirs,TransactionToken* token=nullptr); + + // Changing directory contents + Status RenameFile(const string& src, const string& target, TransactionToken* token=nullptr); + Status CopyFile(const string& src, const string& target, TransactionToken* token=nullptr); + + // Filesystem information + Status FileExists(const string& fname, TransactionToken* token=nullptr); + bool FilesExist(const std::vector& files, std::vector* status); + Status GetChildren(const string& dir, std::vector* result, TransactionToken* token=nullptr); + Status Stat(const string& fname, FileStatistics* stat, TransactionToken* token=nullptr); + Status IsDirectory(const string& fname, TransactionToken* token=nullptr); + Status GetFileSize(const string& fname, uint64* file_size, TransactionToken* token=nullptr); + + // Globbing + virtual bool MatchPath(const string& path, const string& pattern, TransactionToken* token=nullptr) = 0; + virtual Status GetMatchingPaths(const string& 
pattern, std::vector* results, TransactionToken* token=nullptr); + + // Misc + Status FlushFileSystemCaches(); + string GetExecutablePath(); + virtual string GetRunfilesDir() = 0; + bool LocalTempFilename(string* filename); + bool CreateUniqueFileName(string* prefix, const string& suffix); + virtual void GetLocalTempDirectories(std::vector* list) = 0; + static Env* Default(); + + // Other methods of the class, not relevant here +}; +``` + +Since `Env` resolves underlying filesytem from the URI, `StartTransaction` requires its argument to be similar to a URI that could be parsed to identify underlying file system. + +For the new proposed filesystem plugin mechanism, two possible approaches exists. For `TF_RandomAccessFile`, `TF_WritableFile`, and `TF_ReadOnlyMemoryRegion` structures, + +- Opaque pointers stay as is, thus no changes is needed in structures. Then each filesystem attach tokens to their own internal structures pointed by `void*`. +- Structures are extended to keep a pointer `TransactionToken` structure. + +Second method is more explicit but constrains all filesystems to use same token type, which is most likely not useful for any filesystem other than the one created it. Thus first solution may allow for more complicated data structures and flexibility to filesystems. Similar to `Env` class, FilesystemOps signatures need to be expanded with `TransactionToken` pointers. + +Also in order to help debug transaction related issues, an optional `DecodeTransactionToken` function is proposed. Filesystem plugins can optionally implement this function to decode TransactionToken to human readable format for printing debug log messages. + +```cpp +// Operations on a TF_Filesystem +typedef struct TF_FilesystemOps { + // versioning information elided for now + // ... 
+ // API information below + void (*const NewRandomAccessFile)(const TF_Filesystem*, const char*, TF_RandomAccessFile*, TransactionToken*, TF_Status*); + void (*const NewWritableFile)(const TF_Filesystem*, const char*, TF_WritableFile*, TransactionToken*, TF_Status*); + void (*const NewAppendableFile)(const TF_Filesystem*, const char*, TF_WritableFile*, TransactionToken*, TF_Status*); + void (*const NewReadOnlyMemoryRegionFromFile)(const TF_Filesystem*, const char*, TF_ReadOnlyMemoryRegion*, TransactionToken*, TF_Status*); + + void (*const CreateDir)(const TF_Filesystem*, const char*, TransactionToken*, TF_Status*); + void (*const RecursivelyCreateDir)(const TF_Filesystem*, const char*, TransactionToken*, TF_Status*); + + void (*const DeleteFile)(const TF_Filesystem*, const char*, TransactionToken*, TF_Status*); + void (*const DeleteDir)(const TF_Filesystem*, const char*, TransactionToken*, TF_Status*); + void (*const DeleteRecursively)(const TF_Filesystem*, const char*, int64*, int64*, TransactionToken*, TF_Status*); + + void (*const RenameFile)(const TF_Filesystem*, const char*, const char*, TransactionToken*, TF_Status*); + void (*const CopyFile)(const TF_Filesystem*, const char*, const char*, TransactionToken*, TF_Status*); + + void (*const FileExists)(const TF_Filesystem*, const char*, TransactionToken*, TF_Status*); + bool (*const FilesExist)(const TF_Filesystem*, const char**, TransactionToken*, int, TF_Status**); + int (*const GetChildren)(const TF_Filesystem*, const char*, TransactionToken*, char***, TF_Status*); + + void (*const Stat)(const TF_Filesystem*, const char*, TransactionToken*, TF_FileStatistics*, TF_Status*); + void (*const IsDirectory)(const TF_Filesystem*, const char*, TransactionToken*, TF_Status*); + uint64 (*const GetFileSize)(const TF_Filesystem*, const char*, TransactionToken*, TF_Status*); + int (*const GetMatchingPaths)(const TF_Filesystem*, const char*, TransactionToken*, char***, TF_Status*); + + void (*const FlushCaches)(const TF_Filesystem*); + const char* (*const TranslateName)(const TF_Filesystem*, const char*); + + // Transaction management + void (*const StartTransaction)(TF_Filesystem*, TransactionToken**); + void (*const EndTransaction)(TF_Filesystem*, TransactionToken*); + void (*const GetTransactionTokenForFile)(TF_Filesystem*, const char* file_name, TransactionToken** token); + + // Optional Transaction Debugging + void (*const DecodeTransactionToken)(const TF_Filesystem*, const TransactionToken*, char**); + + // misc + void (*const Init)(TF_Filesystem*); + void (*const Cleanup)(TF_Filesystem*); +} TF_FilesystemOps; +``` + +The exposure of this api to the Python layer will be through file_io module. Similar to C++ API changes, file_io module will be exended to contain transaction related api. Proposed change involves addition of two new methods namely `StartTransaction(URI)` and `EndTransaction(token)`. First method will return an opaque python object that is holding transaction token from respective Filesystem. Furthermore, this token needs to be passed to respective methods. Python API will be extended with the `transaction_token` arguments, that defaults to `None` such that existing code will function as before. Additionally a helper scope will be added to `file_io` module to simplify use of transactions. Like C++ implementations of File structures, `FileIO` class will take an optional token in its constructor. +A listing of proposed changes is below. 
+ +```python +# +@tf_contextlib.contextmanager +def transaction_scope(URI): + token = StartTransaction(name) + try: + yield token + finally: + EndTransaction(token) + +class FileIO(object): + def __init__(self, name, mode, transaction_token=None): + self._token = transaction_token + #..... rest left out for brevity +def file_exists(filename, transaction_token=None) +def file_exists_v2(path, transaction_token=None) +def delete_file(filename, transaction_token=None) +def delete_file_v2(path, transaction_token=None) +def read_file_to_string(filename, binary_mode=False, transaction_token=None) +def write_string_to_file(filename, file_content, transaction_token=None) +def get_matching_files(filename, transaction_token=None) +def get_matching_files_v2(pattern, transaction_token=None) +def create_dir(dirname, transaction_token=None) +def create_dir_v2(path, transaction_token=None) +def recursive_create_dir(dirname, transaction_token=None) +def recursive_create_dir_v2(path, transaction_token=None) +def copy(oldpath, newpath, overwrite=False, transaction_token=None) +def copy_v2(src, dst, overwrite=False, transaction_token=None) +def rename(oldname, newname, overwrite=False, transaction_token=None) +def rename_v2(src, dst, overwrite=False, transaction_token=None) +def atomic_write_string_to_file(filename, contents, overwrite=True, transaction_token=None) +def delete_recursively(dirname, transaction_token=None) +def delete_recursively_v2(path, transaction_token=None) +def is_directory(dirname, transaction_token=None) +def is_directory_v2(path, transaction_token=None) +def has_atomic_move(path, transaction_token=None) +def list_directory(dirname, transaction_token=None) +def list_directory_v2(path, transaction_token=None) +def walk(top, in_order=True, transaction_token=None) +def StartTransaction(dir_name="") +def EndTransaction(token) +def walk_v2(top, topdown=True, onerror=None, transaction_token=None) +def stat(filename, transaction_token=None) +def stat_v2(path, transaction_token=None) +def filecmp(filename_a, filename_b, transaction_token=None) +def file_crc32(filename, block_size=_DEFAULT_BLOCK_SIZE, transaction_token=None) + +``` + +### Alternatives Considered + +It is also possible to limit transactions one active transaction per filesystem at a given time. Advantage of this approach is it can simplify the API to just addition of `StartTransaction` and `EndTransaction` calls. Disadvantage is that it will not be possible to use overlapping transactions which may limit the efficiency of the transactions and prevent approaches similar to one that is discussed in motivation section. + +Another approach is to drop transactions and rely on `Flush()` and `Close()` methods/functions to mimic transactions. Although this seems simpler, it doesn't cover all possible benefits of transactions. For example for some Filesystem implementations append operation is actually involves creation of a new record and can potentially lead to many copy operations. Another shortcoming of aforementioned method is some files can stay open for a long while and lead to inconsistencies for reader due to delayed action or incomplete state. With transactions, Filesystem implmentations can make more informed decisions on how to handle cases like these. + +### Performance Implications + +- This will allow filesystem plugin implementations to optimize access to non-local file systems and likely improve performance. For filesystems that will not implement transactions, it will have no effect on performance or operation. 
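+For illustration, the kind of grouping that lets a filesystem batch or defer
+work could look like the following sketch, which writes a set of
+checkpoint-style files under a single transaction using the `file_io`
+extensions proposed above. The API shown (`transaction_scope` and the
+`transaction_token` arguments) is part of this proposal and may change; the
+paths and helper name are illustrative only.
+
+```python
+import os
+
+from tensorflow.python.lib.io import file_io
+
+
+def write_sharded_state(base_dir, shards):
+  """Writes all shards plus an index file under one transaction."""
+  with file_io.transaction_scope(base_dir) as token:
+    for i, contents in enumerate(shards):
+      path = os.path.join(base_dir, "shard-%05d" % i)
+      file_io.write_string_to_file(path, contents, transaction_token=token)
+    file_io.write_string_to_file(
+        os.path.join(base_dir, "index"),
+        "num_shards=%d" % len(shards),
+        transaction_token=token)
+  # The transaction ends when the scope exits; a transactional filesystem can
+  # choose to make all files visible together at this point.
+```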
+ +### Dependencies + +- This proposal do not require any additional dependencies, but may lead to implementation of more persistent storage access plugins. + +### Engineering Impact + +- The expected engineering impact is minimal. Required changes involve grouping filesystem i/o operations in to transaction groups that will likely be no-ops for traditional file systems. It is estimated that making API changes would not take more than an hour. Implementation of transactions in plugins will depend on the complexity of the plugin. + +### Platforms and Environments + +- Proposed changes are platform independent and should not affect code generation or execution environments. + +### Best Practices + +Transactions provide a means to inform the filesystem about the intent of the user and group similar transactions for filesystems to implement expected behavior. Grouping similar transactions would be beneficial. Since not many components access the filesystems directly, improving the documentation about relevant API should be sufficient for most cases. + +### Tutorials and Examples + +Since new arguments have default values, existing code will work without any change in behavior. However it would be rather easy to make use of the transactions. For example FileIO python tests can be modified make use new api and transactions such as below. + +```python +#... other imports +from tensorflow.python.lib.io import file_io +from tensorflow.python.platform import test +from tensorflow.python.framework import errors + + +class FileIoTest(test.TestCase): + + def setUp(self): + self._base_dir = os.path.join(self.get_temp_dir(), "base_dir") + # Start a new transaction + self._token = file_io.StartTransaction(self._base_dir) + file_io.create_dir(self._base_dir, self._token) + + def tearDown(self): + file_io.delete_recursively(self._base_dir, self._token) + # finalize the transaction. + file_io.EndTransaction(self._token) + + def testEmptyFilename(self): + # Files created with transaction will belong to transaction + # they will use same token for all operations on them + f = file_io.FileIO("", mode="r", transaction_token=self._token) + with self.assertRaises(errors.NotFoundError): + _ = f.read() + + def testFileDoesntExist(self): + file_path = os.path.join(self._base_dir, "temp_file") + self.assertFalse(file_io.file_exists(file_path, self._token)) + with self.assertRaises(errors.NotFoundError): + _ = file_io.read_file_to_string(file_path, self._token) + + def testWriteToString(self): + # Use transaction toke created at setup time. Shared between many tests + file_path = os.path.join(self._base_dir, "temp_file") + file_io.write_string_to_file(file_path, "testing", self._token) + self.assertTrue(file_io.file_exists(file_path, self._token)) + file_contents = file_io.read_file_to_string(file_path, self._token) + self.assertEqual("testing", file_contents) + + def testWriteToStringNoTransaction(self): + file_path = os.path.join(self._base_dir, "temp_file") + # if no transaction token passed, operations default to pre-transaction behavior + file_io.write_string_to_file(file_path, "testing") + self.assertTrue(file_io.file_exists(file_path)) + file_contents = file_io.read_file_to_string(file_path) + self.assertEqual("testing", file_contents) + + def testScopedTransaction(self): + file_path = os.path.join(self._base_dir, "temp_file") + # Transactions can be used with scopes. + # Below a new transaction will be started on file path. 
+ with file_io.transaction_scope(file_path) as token: + file_io.write_string_to_file(file_path, "testing", token) + self.assertTrue(file_io.file_exists(file_path, token)) + file_contents = file_io.read_file_to_string(file_path, token) + self.assertEqual("testing", file_contents) + + +# ... omitted for brevity + +if __name__ == "__main__": + test.main() +``` + + + +### Compatibility + +Since the changes are in the framework level, there is no compatibility issue foreseen. Existing code would work as is as it was. Users can augment their code with transaction scopes to improve the performance or solve the issues they are having with non-local file systems. + +## Questions and Discussion Topics + +## Alternatives brought up during RFC review + +- @alextp suggested wrapping token operations into file system object wrappers, that would be returned from start transaction like operations. This is approach minimizes the change to the api and hides transactions from users. However, it requires users doing file system operations through `Env` api modify their code to use `Filesystem` classes directly. Another implication of this is that all filesystem plugins are accessed through C++ wrappers, and direct access through C layer should be limited if not forbidden. Regardless, Filesystem wrapper idea might be a good approach. This would imply following changes. + +```cpp +class WrappedFileSystem : public Filesystem { + WrappedFileSystem(Filesystem* base_fs, TransactionToken* token) + : fs_(base_fs), token_(token) {} + virtual Status NewRandomAccessFile(const string& fname, + std::unique_ptr* result, + TransactionToken* token = nullptr) { + return fs_->NewRandomAccessFile(fname, result, (token ? token : token_)); + } + virtual Status NewWritableFile(const string& fname, + std::unique_ptr* result, + TransactionToken* token = nullptr) { + return fs_->NewWritableFile(fname, result, (token ? token : token_)); + } + virtual Status NewAppendableFile(const string& fname, + std::unique_ptr* result, + TransactionToken* token = nullptr) { + return fs_->NewAppendableFile(fname, result, (token ? token : token_)); + } + virtual Status NewReadOnlyMemoryRegionFromFile( + const string& fname, std::unique_ptr* result, + TransactionToken* token = nullptr) { + return fs_->NewReadOnlyMemoryRegionFromFile(fname, result, + (token ? token : token_)); + } + + // Creating directories + virtual Status CreateDir(const string& dirname, + TransactionToken* token = nullptr) { + return fs_->CreateDir(dirname, (token ? token : token_)); + } + virtual Status RecursivelyCreateDir(const string& dirname, + TransactionToken* token = nullptr) { + return fs_->RecursivelyCreateDir(dirname, (token ? token : token_)); + } + + // Deleting + virtual Status DeleteFile(const string& fname, + TransactionToken* token = nullptr) { + return fs_->DeleteFile(fname, (token ? token : token_)); + } + virtual Status DeleteDir(const string& dirname, + TransactionToken* token = nullptr) { + return fs_->DeleteDir(dirname, (token ? token : token_)); + } + virtual Status DeleteRecursively(const string& dirname, + int64* undeleted_files, + int64* undeleted_dirs, + TransactionToken* token = nullptr) { + return fs_->DeleteRecursively(dirname, undeleted_files, undeleted_dirs, + (token ? token : token_)); + } + + // Changing directory contents + virtual Status RenameFile(const string& src, const string& target, + TransactionToken* token = nullptr) { + return fs_->RenameFile(src, target, (token ? 
token : token_)); + } + virtual Status CopyFile(const string& src, const string& target, + TransactionToken* token = nullptr) { + return fs_->CopyFile(src, target, (token ? token : token_)); + } + + // Filesystem information + virtual Status FileExists(const string& fname, + TransactionToken* token = nullptr) { + return fs_->FileExists(fname, (token ? token : token_)); + }; + virtual bool FilesExist(const std::vector& files, + std::vector* status, + TransactionToken* token = nullptr) { + return fs_->FilesExist(files, status, (token ? token : token_)); + } + virtual Status GetChildren(const string& dir, std::vector* result, + TransactionToken* token = nullptr) { + return fs_->GetChildren(dir, result, (token ? token : token_)); + } + virtual Status Stat(const string& fname, FileStatistics* stat, + TransactionToken* token = nullptr) { + return fs_->Stat(fname, stat, (token ? token : token_)); + } + virtual Status IsDirectory(const string& fname, + TransactionToken* token = nullptr) { + return fs_->IsDirectory(fname, (token ? token : token_)); + } + virtual Status GetFileSize(const string& fname, uint65* file_size, + TransactionToken* token = nullptr) { + return fs_->GetFileSize(fname, file_size, (token ? token : token_)); + } + + // Globbing + virtual Status GetMatchingPaths(const string& pattern, + std::vector* results, + TransactionToken* token = nullptr){ + return fs_->GetMatchingPaths(pattern, results, (token ? token : token_))}; + + // Misc + virtual void FlushCaches(TransactionToken* token = nullptr){ + return fs_->FlushCaches((token ? token : token_))}; + virtual string TranslateName(const string& name, + TransactionToken* token = nullptr) const { + fs_->TranslateName(name, (token ? token : token_)); + }; + virtual Status EndTransaction(TransactionToken* token=nullptr){ + return fs_->EndTransaction((token ? token : token_)); + } +}; + +class Env{ + // Other methods as described above + Status StartTransactionForURI(const string& fname,std::unique_ptr* TransactionalFS){ + Filesystem* fs; + auto status=GetFileSystemForFile(fname,&fs); + if(status.ok()){ + TransactionToken* token; + status=fs->StartTransaction(fname,&token); + *TransactionalFS=std::make_unique(fs,token) + } + return status; + } +} +``` + +- @mihaimarueac proposed filesystems to keep the state and adding files to tokens manually. This would require less changes in the filesystem API but introduces more bookkeeping requirements to filesystems. Also it excludes a file being part of multiple transactions. On the other hand, may simplify some patterns. So for this proposal, existing API is just extended with few methods. For brevity, unchanged API is ignored. 
+ +```cpp +class Env { + Status GetTokenForURI(const string& uri, TransactionToken** token); + Status AddToTransaction(TransactionToken* token, const string& object_name); + Status EndTransaction(TransactionToken* token); +} + +class Filesystem { + Status GetTokenForURI(const string& uri, TransactionToken** token); + Status AddToTransaction(TransactionToken* token, const string& object_name); + Status EndTransaction(TransactionToken* token); +} + +typedef struct TF_FilesystemOps { + void (*const GetTokenForURI)(const TF_Filesystem*, const char*, + TransactionToken**, TF_Status*); + void (*const AddToTransaction)(const TF_Filesystem*, TransactionToken*, + const char*, TF_Status*); + void (*const EndTransaction)(const TF_Filesystem*, TransactionToken*, + TF_Status*); +} + +``` + +## Example Uses + +This section contains a possible use example and potential modifications by each proposal. A typical use pattern for filesystem access could be as follows + +```cpp +Status MergeFiles(const string& fname, const string& dirname, + const vector input_files) { + auto status = Env::Default()->IsDirectory(dirname); + if (!status.ok()) { + status = Env::Default()->CreateDirectory(dirname); + if (!status.ok()) { + return status; + } + } + std::unique_ptr output; + status = Env::Default()->NewAppendableFile(fname, &output); + if (!status.ok()) return status; + for (const auto& inp : input_files) { + status = AppendToFile(output, inp); + if (!status.ok()) return status; + } + return status; +} +``` + +This example would be modified as below to work with transactions as described in this proposal. + +```cpp +Status MergeFiles(const string& fname, const string& dirname, + const vector input_files) { + TransactionToken* token = nullptr; + auto status = Env::Default()->StartTransaction(fname, &token); + if (!status.ok()) { + LOG(WARNING) << "Starting transaction for " << fname << " failed with \"" + << status << "\". Continuing without transactions"; + } + status = Env::Default()->IsDirectory(dirname, token); + if (!status.ok()) { + status = Env::Default()->CreateDirectory(dirname, token); + if (!status.ok()) { + return status; + } + } + std::unique_ptr output; + status = Env::Default()->NewAppendableFile(fname, &output, token); + if (!status.ok()) return status; + for (const auto& inp : input_files) { + // read file inp and append to output after processing it + status = AppendToFile(output, inp); + if (!status.ok()) return status; + } + return status; +} +``` + +With the Wrapped filesytems proposal it would be like + +```cpp +Status MergeFiles(const string& fname, const string& dirname, + const vector input_files) { + std::unique_ptr FS; + auto status = Env::Default()->StartTransactionForURI(fname, &FS); + if (!status.ok()) { + LOG(WARNING) << "Starting transaction for " << fname << " failed with \"" + << status << "\". 
Continuing without transactions"; + } + status = FS->IsDirectory(dirname); + if (!status.ok()) { + status = FS->CreateDirectory(dirname); + if (!status.ok()) { + return status; + } + } + std::unique_ptr output; + status = FS->NewAppendableFile(fname, &output); + if (!status.ok()) return status; + for (const auto& inp : input_files) { + status = AppendToFile(output, inp); + if (!status.ok()) return status; + } + return status; +} +``` + +And with the stateful tokens proposal would be + +```cpp +Status MergeFiles(const string& fname, const string& dirname, + const vector input_files) { + TransactionToken* token = nullptr; + auto status = Env::Default()->GetTokenForURI(fname, &token); + if (!status.ok()) { + LOG(WARNING) << "Starting transaction for " << fname << " failed with \"" + << status << "\". Continuing without transactions"; + } + Env::Default()->AddToTransaction(token, dirname); + status = Env::Default()->IsDirectory(dirname); + if (!status.ok()) { + status = Env::Default()->CreateDirectory(dirname); + if (!status.ok()) { + return status; + } + } + Env::Default()->AddToTransaction(token, fname); + std::unique_ptr output; + status = Env::Default()->NewAppendableFile(fname, &output; + if (!status.ok()) return status; + for (const auto& inp : input_files) { + Env::Default()->AddToTransaction(token, inp); + status = AppendToFile(output, inp); + if (!status.ok()) return status; + } + return status; +} +``` + +### Changes during review process + +- Changed `std::unique_ptr*` arguments to `TransactionToken*` +- Added Alternatives and Changes during review sections +- Added optional `DecodeTransactionToken` method + +## Final design proposal + +During the review meeting it has been decided to merge wrapped filesystem approach and stateful token approach. Then the final proposal is as shown below. The `WrappedFileSystem` and `Filesystem` classes described above are to be extended with four new methods, + +```cpp +class WrappedFileSystem : public Filesystem { + // Other methods are ommited for brevity + virtual Status AddToTransaction(const string& uri, + TransactionToken* token = nullptr) { + return fs_->AddToTransaction(uri, (token ? token : token_)); + } + + virtual Status GetTransactionForPath(const string& uri, + TransactionToken*& token) { + return fs_->GetTransactionForPath(uri, token); + } + + virtual Status GetTokenOrStartTransaction(const string& uri, + TransactionToken*& token) { + return fs_->GetTokenOrStartTransaction(uri, token); + } + + virtual Status DecodeTransaction(const TransactionToken* token = nullptr, + string* decoded_string) { + return fs_->DecodeTransaction((token ?: token : token_), decoded_string); + } + //... +}; +``` + +Then the current C API's `TF_FilesystemOps` table is to be extended by 6 new function pointers. 
+ +```cpp +struct TF_FilesystemOps{ + // Existing members are not modified + // Transaction management + void (*start_transaction)(TF_Filesystem*, TF_TransactionToken**, TF_Status ); + void (*end_transaction)(TF_Filesystem*, TF_TransactionToken*); + void (*add_to_transaction)(TF_Filesystem* fs, const char* path, TF_TransactionToken* token); + void (*get_transaction_for_path)(TF_Filesystem* fs, const char* path, TF_TransactionToken** token); + void (*get_or_start_transaction_for_path)(TF_Filesystem* fs, const char* path, TF_TransactionToken** token); + // Optional Transaction Debugging + char* (*decode_transaction_token)(const TF_Filesystem*, const TF_TransactionToken*); +}; +``` + +The new functions will be null pointers until respective plugins implement them. `ModularFilesystem` implementation will check whether a plugin implements the transactions and will ignore the transaction if it is not implemented, possibly after producing a log message, thus falling back to +current transactionless state. Since these function pointers will be added after existing pointers, already compiled plugins will keep functioning and they can be gradually start supporting transactions. Any filesystem plugin that start supporting transaction will be used by the framework. + +## Example use + +With these final modifications, there is no need to carry transaction tokens through different compilation units or ops in the graph. For example current checkpointing logic involve adding one or more `SaveV2` ops and a `MergeV2Checkpoints` op to the graph. In order to prevent +corrupt checkpoints in case of errors and optimize i/o, SaveV2 ops write their outputs to the temporary directories, given as constant input arguments, and MergeV2Checkpoints op is given the names of the temporary save files generated by all SaveV2 ops which reads and merges +them into a final file and then deletes temporary files. Just by adding a few lines to SaveV2 to and MergeV2Checkpoints ops, these operations can be completed transactionally. In this case overview of the operations would be, + +- SaveV2 op uses `Env::GetTokenOrStartTransaction(base_dir)` call to start a new transaction or get an existing transaction on the base output directory. +- SaveV2 op adds files it generates to the transaction. +- MergeV2Checkpoint op uses `Env::GetTransactionTokenForPath(base_dir)` to get the transaction started by SaveV2 ops and adds its output to the same transaction. +- MergeV2Checkpoint op removes intermediate files and calls `EndTransaction()` to finalize the transaction. + +Then a transactional file system may operate in a cache and move files to the final destination only at the end of the transaction which would ensure that the files in the checkpoint directory are consistent. The +implementation of how this is achieved may be different for each filesystem but the end result should be consistent. + +An example implementation of this could be as shown in the diff set below. 
+```diff +--- a/tensorflow/core/kernels/save_restore_v2_ops.cc ++++ b/tensorflow/core/kernels/save_restore_v2_ops.cc +@@ -237,6 +237,17 @@ class MergeV2Checkpoints : public OpKernel { + gtl::ArraySlice(checkpoint_prefixes.flat()); + Env* env = Env::Default(); + const string& merged_prefix = destination_prefix.scalar()(); ++ auto token_deleter = [env](TransactionToken* token) { ++ if (token) { ++ env->EndTransaction(token); ++ } ++ }; ++ TransactionToken* token = nullptr; ++ env->GetTokenOrStartTransaction(string(io::Dirname(input_prefixes[0])), &token); ++ auto token_scope = ++ std::unique_ptr( ++ token, token_deleter); ++ env->AddToTransaction(string(io::Dirname(merged_prefix)), token); + OP_REQUIRES_OK( + context, tensorflow::MergeBundles(env, input_prefixes, merged_prefix)); + +--- a/tensorflow/core/util/tensor_bundle/tensor_bundle.cc ++++ b/tensorflow/core/util/tensor_bundle/tensor_bundle.cc +@@ -421,9 +421,10 @@ BundleWriter::BundleWriter(Env* env, StringPiece prefix, const Options& options) + if (!status_.ok() && !errors::IsAlreadyExists(status_)) { + return; + } +- ++ TransactionToken* token=nullptr; ++ env->GetTokenOrStartTransaction(string(io::Dirname(prefix_)),&token).IgnoreError(); + std::unique_ptr wrapper; +- status_ = env_->NewWritableFile(data_path_, &wrapper); ++ status_ = env_->NewWritableFile(data_path_, &wrapper,token); + if (!status_.ok()) return; + out_ = std::unique_ptr( + new FileOutputBuffer(wrapper.release(), 8 << 20 /* 8MB write buffer */)); +@@ -527,7 +528,9 @@ Status BundleWriter::Finish() { + if (!status_.ok()) return status_; + // Build key -> BundleEntryProto table. + std::unique_ptr file; +- status_ = env_->NewWritableFile(metadata_path_, &file); ++ TransactionToken* token; ++ env_->GetTokenOrStartTransaction(string(io::Dirname(prefix_)),&token).IgnoreError(); ++ status_ = env_->NewWritableFile(metadata_path_, &file, token); + if (!status_.ok()) return status_; + { + // N.B.: the default use of Snappy compression may not be supported on all +@@ -554,7 +557,7 @@ Status BundleWriter::Finish() { + } + status_.Update(file->Close()); + if (!status_.ok()) { +- Env::Default()->DeleteFile(metadata_path_).IgnoreError(); ++ Env::Default()->DeleteFile(metadata_path_, token).IgnoreError(); + return status_; + } else if (use_temp_file_) { + status_ = Env::Default()->RenameFile(metadata_path_, MetaFilename(prefix_)); +@@ -590,6 +593,8 @@ static Status MergeOneBundle(Env* env, StringPiece prefix, + MergeState* merge_state) { + VLOG(1) << "Merging bundle:" << prefix; + const string filename = MetaFilename(prefix); ++ TransactionToken* token=nullptr; ++ env->GetTokenOrStartTransaction(string(filename),&token).IgnoreError(); + uint64 file_size; + TF_RETURN_IF_ERROR(env->GetFileSize(filename, &file_size)); + std::unique_ptr file; +@@ -690,7 +695,9 @@ Status MergeBundles(Env* env, gtl::ArraySlice prefixes, + // Merges all metadata tables. + // TODO(zhifengc): KeyValue sorter if it becomes too big. 
+ MergeState merge; +- Status status = env->CreateDir(string(io::Dirname(merged_prefix))); ++ TransactionToken *token=nullptr; ++ env->GetTokenOrStartTransaction(string(io::Dirname(prefixes[0])),&token); ++ Status status = env->CreateDir(string(io::Dirname(merged_prefix)),token); + if (!status.ok() && !errors::IsAlreadyExists(status)) return status; + for (int i = 0; i < prefixes.size(); ++i) { + TF_RETURN_IF_ERROR(MergeOneBundle(env, prefixes[i], &merge)); +@@ -708,7 +715,7 @@ Status MergeBundles(Env* env, gtl::ArraySlice prefixes, + // Writes the final metadata table under the merged prefix. + std::unique_ptr merged_metadata; + TF_RETURN_IF_ERROR( +- env->NewWritableFile(MetaFilename(merged_prefix), &merged_metadata)); ++ env->NewWritableFile(MetaFilename(merged_prefix), &merged_metadata,token)); + { + table::TableBuilder builder(TableBuilderOptions(), merged_metadata.get()); + // Header entry. + +``` + +[filesystem_plugin]: https://github.com/tensorflow/community/blob/master/rfcs/20190506-filesystem-plugin-modular-tensorflow.md diff --git a/rfcs/20200519-csr-sparse-matrix.md b/rfcs/20200519-csr-sparse-matrix.md new file mode 100644 index 000000000..402179653 --- /dev/null +++ b/rfcs/20200519-csr-sparse-matrix.md @@ -0,0 +1,335 @@ +# CSR Sparse Matrix + +| Status | Accepted | +:-------------- |:---------------------------------------------------- | +| **RFC #** | 246 | +| **Author(s)** | Penporn Koanantakool (penporn@google.com) | +| **Sponsor** | Rasmus Larsen (rmlarsen@google.com), Tatiana Shpeisman (shpeisman@google.com)| +| **Updated** | 2019-05-28 | + +## Objective + +Support the [compressed sparse row (CSR)](https://en.wikipedia.org/wiki/Sparse_matrix#Compressed_sparse_row_(CSR,_CRS_or_Yale_format)) sparse matrix format in TensorFlow. Implement linear algebra and other common neural networks operations on both CPU and GPU devices for seamless user experience. + +### Goals +* Enable storing batched sparse matrices in CSR format in TensorFlow. +* Provide efficient linear algebra and other common neural network kernels for the CSR format on both CPU and GPU. +* Provide clean and easy to use Python APIs. +* Support backpropagation. + +### Non-goals +* Support for other sparse formats such as Block CSR, CSC, etc. Block CSR is a future work. +* Implement XLA / TPU kernels. + + +## Motivation + +Sparse tensor representation has a significant impact on performance. Modern architectures are most suited for non-random, bulk memory accesses. Sparse formats that optimize for data locality and reuse can achieve large performance gains. TensorFlow currently stores sparse tensors in [coordinate (COO)](https://www.tensorflow.org/api_docs/python/tf/sparse/SparseTensor) format, which works well for tensors with very few nonzeroes, and is inefficient otherwise. Deep learning [[1](https://dl.acm.org/doi/pdf/10.1145/3140659.3080254), [2](https://arxiv.org/abs/2006.10901)] and sparse linear algebra applications typically do not have sufficient sparsities to benefit from the COO format. The compressed sparse row (CSR) format is one of the most commonly used formats. It generally requires less storage and is faster than COO, sometimes by up to orders of magnitude. + +We propose supporting the CSR format in TensorFlow to accelerate sparse linear algebra and applicable deep learning applications in TensorFlow. + + +## User Benefits + +* Fast sparse linear algebra routines such as matrix multiplications, tensor contractions (convolutions), Cholesky, LU, and QR factorizations on CPU and GPU. 
+
+* Users can add existing CSR kernels from other libraries and use them directly without format conversion.
+
+
+## Design Proposal
+
+A k-dimensional sparse tensor is stored as a batch of sparse CSR matrices. The k-2 outermost dimensions are batch dimensions, and the two innermost dimensions are matrix dimensions. All batches are stored in shared values, column indices, and row pointers arrays. The figure below shows how `CSRSparseMatrix` stores a 2x3x4 sparse tensor A. `Values` stores all nonzero values in row-major order. `Column indices` stores the corresponding column index of each nonzero in `Values`. `Row pointers` stores the position of the beginning of each matrix row in `Values`. `Batch pointers` stores the position of the beginning of each batch in `Values`.
+
+![CSRSparseMatrix format](20200519-csr-sparse-matrix/format.png)
+
+All kernel implementations are in C++, with a Python wrapper for Python APIs. During the experimental phase, `CSRSparseMatrix` APIs will be in the `tf.linalg.experimental.sparse` package.
+
+
+### Supported Operations
+See APIs in the Detailed Design section.
+* Construction ops:
+  * From a dense tensor.
+  * From a SparseTensor.
+  * From given `batch_pointers`, `row_pointers`, `col_indices`, and `values` arrays.
+* An op that returns the CSR components.
+* Conversion ops:
+  * Convert to and from a dense tensor.
+  * Convert to and from a SparseTensor.
+* Sparse linear algebra ops:
+  * Sparse matrix-vector multiplication (SpMV)
+  * Sparse-dense matrix multiplication (SpMM)
+  * Sparse-sparse matrix multiplication (SpGEMM)
+  * Sparse-sparse matrix addition (SpGEAM)
+  * Sparse matrix transpose
+  * Sparse Cholesky factorization
+  * Sparse LU factorization
+  * Sparse QR factorization
+
+General ops, borrowing APIs from the numpy and scipy ([sparse](https://docs.scipy.org/doc/scipy/reference/sparse.html), [sparse.csr_matrix](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html)) packages:
+* Generation ops: [eye](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.eye.html) and [rand](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.rand.html) +* Unary ops that preserve sparsity structure: + * [sin](https://docs.scipy.org/doc/numpy/reference/generated/numpy.sin.html), [cos](https://docs.scipy.org/doc/numpy/reference/generated/numpy.cos.html), [tan](https://docs.scipy.org/doc/numpy/reference/generated/numpy.tan.html), [sinh](https://docs.scipy.org/doc/numpy/reference/generated/numpy.sinh.html), [cosh](https://docs.scipy.org/doc/numpy/reference/generated/numpy.cosh.html), [tanh](https://docs.scipy.org/doc/numpy/reference/generated/numpy.tanh.html) + * [arcsin](https://docs.scipy.org/doc/numpy/reference/generated/numpy.arcsin.html), [arcsinh](https://docs.scipy.org/doc/numpy/reference/generated/numpy.arcsinh.html), [arccos](https://docs.scipy.org/doc/numpy/reference/generated/numpy.arccos.html), [arccosh](https://docs.scipy.org/doc/numpy/reference/generated/numpy.arccosh.html), [arctan](https://docs.scipy.org/doc/numpy/reference/generated/numpy.arctan.html), [arctanh](https://docs.scipy.org/doc/numpy/reference/generated/numpy.arctanh.html) + * [conj](https://docs.scipy.org/doc/numpy/reference/generated/numpy.conj.html), [sign](https://docs.scipy.org/doc/numpy/reference/generated/numpy.sign.html) + * [ceil](https://docs.scipy.org/doc/numpy/reference/generated/numpy.ceil.html), [floor](https://docs.scipy.org/doc/numpy/reference/generated/numpy.floor.html), [trunc](https://docs.scipy.org/doc/numpy/reference/generated/numpy.trunc.html), [rint](https://docs.scipy.org/doc/numpy/reference/generated/numpy.rint.html) + * [deg2rad](https://docs.scipy.org/doc/numpy/reference/generated/numpy.deg2rad.html), [rad2deg](https://docs.scipy.org/doc/numpy/reference/generated/numpy.rad2deg.html) + * [expm1](https://docs.scipy.org/doc/numpy/reference/generated/numpy.expm1.html), [log1p](https://docs.scipy.org/doc/numpy/reference/generated/numpy.log1p.html), [power](https://docs.scipy.org/doc/numpy/reference/generated/numpy.power.html), [sqrt](https://docs.scipy.org/doc/numpy/reference/generated/numpy.sqrt.html) + * [astype](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.astype.html) +* Unary ops that change sparsity structure: + * [softmax](https://docs.scipy.org/doc/scipy/reference/generated/scipy.special.softmax.html), [clip](https://docs.scipy.org/doc/numpy/reference/generated/numpy.clip.html), [threshold](https://docs.scipy.org/doc/scipy-0.15.1/reference/generated/scipy.stats.threshold.html) + * [eliminate_zeros](https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.sparse.csr_matrix.eliminate_zeros.html) + * [sum_duplicates](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.sum_duplicates.html) + * [transpose](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.transpose.html) +* Binary element-wise ops that preserve sparsity structure: + * [with_values](https://www.tensorflow.org/api_docs/python/tf/RaggedTensor#with_values) +* Binary element-wise ops (may change sparsity structure): + * [maximum](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.maximum.html), [minimum](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.minimum.html) + * add, sub, [multiply](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.multiply.html), divide, 
[dot](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.dot.html) +* Binary element-wise ops that changes sparsity structure: + * [setdiag](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.setdiag.html) + * [kron](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.kron.html), [kronsum](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.kronsum.html) +* Reduction ops: + * [max](https://docs.scipy.org/doc/scipy-0.18.1/reference/generated/scipy.sparse.csr_matrix.max.html), [count_nonzero](https://docs.scipy.org/doc/scipy-0.18.1/reference/generated/scipy.sparse.csr_matrix.count_nonzero.html), [getnnz](https://docs.scipy.org/doc/scipy-0.18.1/reference/generated/scipy.sparse.csr_matrix.getnnz.html), [min](https://docs.scipy.org/doc/scipy-0.18.1/reference/generated/scipy.sparse.csr_matrix.min.html), [mean](https://docs.scipy.org/doc/scipy-0.18.1/reference/generated/scipy.sparse.csr_matrix.mean.html), [sum](https://docs.scipy.org/doc/scipy-0.18.1/reference/generated/scipy.sparse.csr_matrix.sum.html) +* Shape manipulation ops: + * [getcol](https://docs.scipy.org/doc/scipy-0.18.1/reference/generated/scipy.sparse.csr_matrix.getcol.html), [getrow](https://docs.scipy.org/doc/scipy-0.18.1/reference/generated/scipy.sparse.csr_matrix.getrow.html), [reshape](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.reshape.html), slice, [concatenate](https://numpy.org/devdocs/reference/generated/numpy.concatenate.html), [stack](https://docs.scipy.org/doc/numpy/reference/generated/numpy.stack.html) +* Ops that returns indices: + * [argmax](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.argmax.html), [argmin](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.argmin.html), [diagonal](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.diagonal.html), [nonzero](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.nonzero.html) + +The ops will support broadcasting when possible. + +### Alternatives Considered +We have considered several other sparse formats in addition to CSR. +* [Compressed Sparse Column (CSC)](https://en.wikipedia.org/wiki/Sparse_matrix#Compressed_sparse_column_(CSC_or_CCS)): CSC is another common and performant format. But CSR still has more kernel implementations in existing libraries. CSC is a column-major format and is less suitable for TensorFlow which uses row-major storage, similar to CSR. +* [Doubly-Compressed Sparse Row (DCSR) and Doubly-Compressed Sparse Column (DCSC)](https://people.eecs.berkeley.edu/~aydin/hypersparse-ipdps08.pdf): DCSR and DCSC compress both dimensions of a matrix. DCSR is CSR with row pointers compressed. DCSC is CSC with column pointers compressed. These formats are more beneficial than CSR and CSC when matrices are hypersparse, i.e., have much fewer than one nonzero per row or column. Hypersparse matrices are not very common in TensorFlow’s current workload. +* [Block Compressed Sparse Row (BSR or BCSR)](https://scipy-lectures.org/advanced/scipy_sparse/bsr_matrix.html): BSR is the CSR format with fixed-size dense blocks for each nonzero position instead of just a scalar value. + * It is more efficient than CSR, but matrices with block sparsity occur less often in scientific computing. There aren’t as many efficient kernels provided in existing libraries yet. 
Our immediate goal is to enable efficient sparse linear algebra processing, so we picked CSR first. + * We plan to support BSR in a future proposal, as block sparsity is used increasingly in neural network weight pruning. +* [Compressed Sparse Fiber (CSF)](https://www.cs.umn.edu/sites/cs.umn.edu/files/tech_reports/15-015.pdf): CSF is a generalization of CSR/CSC to multi-dimensional tensors. CSF can compress one user-specified dimension of the tensor. For example, on a matrix, CSF is CSR when compressing the row dimension, and CSC when compressing the column dimension. Even though CSF is used often in tensor factorizations in scientific computing, it does not have highly optimized kernels in vendor libraries yet. +* [Tensor algebra compiler (Taco)](http://tensor-compiler.org/kjolstad-oopsla17-tensor-compiler.pdf): Taco is a compiler for sparse tensor algebra. Its storage format is exponentially more generic compared to CSF. + * Taco supports any traversal order, and can compress any number of dimensions at once. + * For a rank k tensor, Taco can express k! * 2^k storage formats, even more so if the tensor is blocked. All the aforementioned formats (CSR, CSC, DCSR, DCSC, BSR, and CSF) can be represented with Taco. + * Taco has also been [extended](http://tensor-compiler.org/chou-oopsla18-taco-formats.pdf) to support COO and other niche formats such as DIA, ELLPACK, etc. + * Taco can generate fast single-threaded CPU code for arbitrary kernels comprised of basic tensor algebra. + * Taco is not quite ready for production use yet. There aren’t vendor-supplied kernels and Taco’s efficient parallel CPU/GPU code generation is still under the work. We are keeping an eye on Taco as a candidate for TensorFlow’s generic sparse tensor support in the future. +* Hierarchical formats such as [Hierarchically Semi-Separable (HSS)](https://arxiv.org/pdf/1803.10274.pdf) and [Hierarchical COOrdinate (HiCOO)](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8665782): These format stores matrix/tensor partitions recursively. They are too application-specific for our scope. + +We picked the CSR format because +* It is generally suitable for the sparsity levels and patterns observed in TensorFlow workloads. +* It has a vast amount of vendor-optimized kernels readily available. + + +### Performance Implications +* We expect significant speed improvements from CSRSparseMatrix over SparseTensor in general. We observed from 2x to over 100x speedups in a multithreaded SpMV benchmark on CPUs. Even in the rare cases where CSRSparseMatrix is slower, there shouldn’t be regressions as CSRSparseMatrix doesn’t automatically replace SparseTensor. (Users can keep using SparseTensor in these cases.) +* There will be microbenchmarks for each sparse linear algebra op. +* We will create a few end-to-end tests and benchmarks once the library is ready. + + +### Dependencies +* CSRSparseMatrix does not add any new dependencies to TensorFlow. +* It depends on two existing dependencies: Eigen and CUDA. +* It wouldn’t affect any TensorFlow components or its dependents. + + +### Engineering Impact +* CSRSparseMatrix would increase the binary size, build and test times, but we don’t expect the increase to be significant. The library shouldn’t affect TensorFlow’s startup time. +* The library will be maintained by penporn@, with possible help from rmlarsen@, ebrevdo@, and collaborators from other teams outside of TensorFlow. + + +### Platforms and Environments +CSRSparseMatrix is compatible with all platforms supported by TensorFlow. 
+ + +### Best Practices +* Make sure all the ops that CSRSparseMatrix flows through support CSRSparseMatrix. Avoid converting back-and-forth between dense and sparse tensors. +* If a tensor is not sparser than a certain threshold (usually over 90%, depending on tensor size, sparsity distribution, and hardware), keeping it as a dense tensor will achieve better performance. + + +### Tutorials and Examples +Once the implementation is complete, we plan to post a tutorial on the TensorFlow blog or the TensorFlow Tutorial section. All Python and C++ APIs will be documented on the TensorFlow website just like other TensorFlow ops. + +The following snippet shows how CSRSparseMatrix can be used in Python. +```Python +import tf.linalg.experimental.sparse as csr + +# CSR matrix creation. A can be a dense tensor or SparseTensor (COO format). +A_csr = csr.CSRSparseMatrix(A) + +# Conversion to other formats. +A_dense = A_csr.to_dense() # To dense tensor, using a member method. +A_dense = csr.to_dense(A_csr) # To dense tensor, using API. +A_coo = A_csr.to_sparse_tensor() # To COO, using a member method. +A_coo = csr.to_sparse_tensor(A_csr) # To COO, using API. + +# Generic matrix multiplication interface. +C = csr.matmul(A_csr, B) # Sparse-dense matmul. +C = csr.matmul(A, B_csr) # Dense-sparse matmul. +C_csr = csr.matmul(A_csr, B_csr) # Sparse-sparse matmul. +y = csr.matmul(A_csr, x) # Sparse matrix-vector multiplication (SpMV). +C_csr = A_csr * B_csr # Operator overloading. + +# Matrix addition. +C = csr.add(A_csr, B) # A and B can be dense or sparse. +C = A + B_csr # Operator overloading. + +# Transpose. +A_tranpose_csr = A_csr.transpose() # Through a member method of CSRSparseMatrix. +A_transpose_csr = csr.transpose(A_csr) # Directly through API. + +# Sparse Cholesky. +L_csr = csr.cholesky(A_csr) + +# Matrix creation. +I = csr.eye(5) # Creates a 5x5 Identity matrix. +A_csr = csr.rand(rows, cols, density) # Generates a random matrix. + +# Set diagonal. +A_csr.setdiag(diag) # Through a member method. +C_csr = csr.setdiag(A_csr, diag) # Through API. + +# Element-wise processing. +T_csr = csr.threshold(A_csr, threshmin=0) # Thresholding. +C_csr = csr.maximum(A_csr, B_csr) # Element-wise max between two matrices. +nonzero_idx = csr.nonzeros(A_csr) # Returns indices of nonzeros. + +# Reduction. +s = csr.sum(A_csr, axis=0) # Sums A_csr along the row dimension. + +# Concatenates matrices along a specified axis. +D_csr = csr.concat([A_csr, B_csr, C_csr], axis=1) +``` + +### Compatibility +* This design will conform to the backward and forward compatibility requirements once it is moved outside the experimental package. It only adds new functionalities without making changes to existing features. +* How this proposal interacts with other parts of the TensorFlow Ecosystem: + * TFLite: TFLite already supports the CSR format. TensorFlow should be able to pass the format to TFLite without problems. + * Distribution strategies: Don’t plan on interacting with this in this initial phase. + * tf.function: Can be made to work with tf.function. Will work straightforwardly if CSRSparseMatrix is a [CompositeTensor](https://cs.opensource.google/tensorflow/tensorflow/+/v2.2.0:tensorflow/python/framework/composite_tensor.py;l=31). + * GPU: We plan to make all CSRSparseMatrix operations work on GPUs. + * TPU: We don’t plan on supporting CSRSparseMatrix on TPUs yet. + * SavedModel: Should work just like any other ops/tensors. 
+
+
+### User Impact
+* The library will be rolled out as an experimental package first (`tf.linalg.experimental.sparse`).
+* There might be backward-incompatible changes while the library is in the experimental phase.
+* Once out of the experimental phase, the package will have official TensorFlow APIs and will conform to TensorFlow’s backward and forward compatibility requirements.
+
+
+
+## Detailed Design
+
+There are CSRSparseMatrix classes on both the C++ and Python sides. The C++ CSRSparseMatrix object is stored as a blob in TensorFlow’s Variant tensor. The Python CSRSparseMatrix class is a wrapper of the Variant tensor with some basic matrix manipulation functions such as conversions. The figure below shows the relationship between the two classes.
+
+![CSRSparseMatrix classes](20200519-csr-sparse-matrix/classes.png)
+
+
+### C++ Layer
+The C++ [CSRSparseMatrix](https://cs.opensource.google/tensorflow/tensorflow/+/v2.2.0:tensorflow/core/kernels/sparse/sparse_matrix.h;l=35) class has the following properties:
+
+| Property | Description |
+| :------------- | :------------------------------------------------------------------ |
+| dtype | The data type of the values. |
+| dense_shape | The shape of the tensor. Host int64 vector of size k >= 2. Takes on values: (batch_dim_0, …, batch_dim_k-2, rows, cols). The batch dimensions are optional. |
+| batch_pointers | Batch offset pointers into col_indices and values. Host int32 vector of size (batch_size + 1), where batch_size = batch_dim_0 * … * batch_dim_k-2. Takes on values: (0, nnz[0], nnz[0] + nnz[1], …, total_nnz). |
+| row_pointers | Row offset pointers into col_indices and values. Device int32 vector of size ((rows + 1) * batch_size). Each batch b of size (rows + 1) takes on values: (0, num_rows[b][0], num_rows[b][0] + num_rows[b][1], …, nnz[b]). |
+| col_indices | Column indices of each nonzero. Device int32 vector of size nnz. Takes on values: (col_index_of_the_first_nonzero, …, col_index_of_the_last_nonzero). |
+| values | Value of each nonzero. Device dtype vector of size total_nnz. Takes on values: (value_of_the_first_nonzero, …, value_of_the_last_nonzero). |
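+
+To make the layout above concrete, the sketch below (which uses NumPy and `scipy.sparse` purely for illustration; they are not part of this proposal) computes the analogous arrays for a batch of two 3x4 matrices. The printed vectors follow the property descriptions in the table above.
+
+```python
+import numpy as np
+from scipy.sparse import csr_matrix
+
+# A batch of two 3x4 matrices, i.e. a sparse tensor of dense_shape (2, 3, 4).
+batch = [np.array([[1, 0, 0, 2],
+                   [0, 0, 3, 0],
+                   [0, 4, 0, 0]]),
+         np.array([[0, 5, 0, 0],
+                   [6, 0, 0, 0],
+                   [0, 0, 0, 7]])]
+
+csr = [csr_matrix(m) for m in batch]
+values = np.concatenate([c.data for c in csr])          # All nonzeros, row-major.
+col_indices = np.concatenate([c.indices for c in csr])  # Column index per nonzero.
+row_pointers = np.concatenate([c.indptr for c in csr])  # (rows + 1) entries per batch.
+batch_pointers = np.cumsum([0] + [c.nnz for c in csr])  # (0, nnz[0], nnz[0] + nnz[1]).
+
+print(values)          # [1 2 3 4 5 6 7]
+print(col_indices)     # [0 3 2 1 1 0 3]
+print(row_pointers)    # [0 2 3 4 0 1 2 3]
+print(batch_pointers)  # [0 4 7]
+```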
+ + +### Python Layer +The Python [CSRSparseMatrix](https://cs.opensource.google/tensorflow/tensorflow/+/v2.2.0:tensorflow/python/ops/linalg/sparse/sparse_csr_matrix_ops.py;l=315) class is a subclass of [SparseMatrix](https://cs.opensource.google/tensorflow/tensorflow/+/v2.2.0:tensorflow/python/ops/linalg/sparse/sparse_csr_matrix_ops.py;l=248), which stores common sparse matrix properties. Other new sparse formats can be added as subclasses of SparseMatrix. CSRSparseMatrix has the following properties: + +| Property | Description | +| :-------- | :----------------------------------- | +| shape | The shape of the tensor. | +| dtype | The data type of the content values. | +| csr_matrix | A DT_VARIANT tensor storing the C++ CSRSparseMatrix object blob. Auto-generated Python APIs for CSRSparseMatrix kernels take this as input. | + + +### Shape Inference +`Variant` tensors are perceived as scalars in TensorFlow. For proper shape inference, we store `CSRSparseMatrix`’s shape and data type in a shape inference primitive, [ShapeAndType](https://cs.opensource.google/tensorflow/tensorflow/+/v2.2.0:tensorflow/core/framework/shape_inference.h;l=133), and access them through [input_handle_shapes_and_types](https://cs.opensource.google/tensorflow/tensorflow/+/v2.2.0:tensorflow/core/framework/shape_inference.h;l=584) and [set_output_handle_shapes_and_types](https://cs.opensource.google/tensorflow/tensorflow/+/v2.2.0:tensorflow/core/framework/shape_inference.h;l=588) during shape inference. + + +### APIs +`CSRSparseMatrix` class APIs +* `__init__(input, indices=None, name=None)`: `input` can be SparseTensor or dense tensor. +* `to_dense()`: Convert to dense tensor. +* `to_sparse_tensor()`: Convert to SparseTensor. +* `conj()`: Returns the conjugated matrix. +* `transpose()`: Returns the transposed matrix. +* `hermitian_transpose()`: Returns the Hermitian transpose of the matrix. + +Generic Linear algebra API +* `matmul(a, b, transpose_a, transpose_b)` + * `a` and `b` can be sparse or dense. + * Also handles SpMV + +`CSRSparseMatrix`-specific APIs +* `sparse_matrix_add(a, b, alpha, beta)` +* `sparse_matrix_ordering_amd(a)`: Returns the approximate minimum degree (AMD) permutation vector of sparse matrix `a`. Used in sparse Cholesky factorization. +* `sparse_matrix_sparse_cholesky(a, ordering_amd)`: Cholesky factorization. +* `sparse_matrix_sparse_qr(a)`: Returns QR factorization and possibly a permutation matrix `P`. +* `sparse_matrix_sparse_solve(A, b, triangular=false)`: Solves `Ax = b`. +* `sparse_matrix_softmax(logits)` +* `sparse_matrix_softmax_grad(softmax, softmax_grad)` + +Other sparse APIs follow NumPy and SciPy APIs. See links in [Supported Operations](#supported-operations). + + +## Questions and Discussion Topics + +* Any comments on the APIs? +* Are there any more operations that we should support? +* Should we add `CSRSparseMatrix` support to existing standard ops as well, e.g., `tf.math.{add,asin,atan,ceil}`, etc? +* Would love to hear about more use cases. +* For neural networks, would CSR be useful for you (while Block CSR is still a future work)? +* Should we make CSRSparseMatrix a [CompositeTensor](https://cs.opensource.google/tensorflow/tensorflow/+/v2.2.0:tensorflow/python/framework/composite_tensor.py;l=31)? Would the effort be worth it since we will transition to the new TFRT/MLIR backend soon? How should this be prioritized? +* Should `SparseMatrix` replace `SparseTensor`? 
+ + diff --git a/rfcs/20200519-csr-sparse-matrix/classes.png b/rfcs/20200519-csr-sparse-matrix/classes.png new file mode 100644 index 000000000..f1aa30f55 Binary files /dev/null and b/rfcs/20200519-csr-sparse-matrix/classes.png differ diff --git a/rfcs/20200519-csr-sparse-matrix/format.png b/rfcs/20200519-csr-sparse-matrix/format.png new file mode 100644 index 000000000..3dd116f70 Binary files /dev/null and b/rfcs/20200519-csr-sparse-matrix/format.png differ diff --git a/rfcs/20200520-tensor-float-32.md b/rfcs/20200520-tensor-float-32.md new file mode 100644 index 000000000..f66864536 --- /dev/null +++ b/rfcs/20200520-tensor-float-32.md @@ -0,0 +1,84 @@ +# TensorFloat-32 in TensorFlow + +| Status | Accepted | +:-------------- |:---------------------------------------------------- | +| **RFC #** | [247](https://github.com/tensorflow/community/pull/247) | +| **Author(s)** | Reed Wanderman-Milne (reedwm@google.com) | +| **Sponsor** | Sanjoy Das (sanjoy@google.com) | +| **Updated** | 2020-06-10 | + +## Objective + +Allow [TensorFloat-32](https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format) to be used in TensorFlow to improve performance. + +## Motivation + +[NVIDIA Ampere](https://www.nvidia.com/en-us/data-center/nvidia-ampere-gpu-architecture/), an upcoming generation of NVIDIA GPUs announced at GTC 2020, introduces a new numeric format called TensorFloat-32, or TF32 for short. +TF32 has the range of float32/bfloat16 (i.e. 8 bits of exponent) and the precision of fp16 (i.e. 10 bits of mantissa). +It is not an in-memory format, but tensor cores natively support it as a computation format. +TF32 should not be thought of as an in-memory dtype but instead a computation mode that increases performance and decreases numeric precision for certain float32 operations. +NVIDIA has not found any cases where TF32 reduces the convergence of deep learning models. + +Upcoming versions of cuDNN, cuBLAS, and other CUDA libraries will expose a mode of execution that has float32 inputs and outputs, but internally truncates float32 to TF32 and uses tensor cores. This is expected to be sufficiently accurate to reach the same convergence as the “full” float32 mode of execution but significantly faster. Each element still takes four bytes, so there is still a memory and performance penalty compared to using float16 or bfloat16. + +As TF32 is only usable by tensor cores, it can only be used for matrix multiplications and other ops implemented in terms of matrix multiplications, such as convolutions. It is not used for pointwise ops or reductions. + +TF32 will benefit users who run float32 models on Ampere GPUs, so we need an API to allow these users to enable TF32. + +## Design Proposal + +In TensorFlow, TF32 can be enabled for supported ops on Ampere GPUs with the following call: + +```python +tf.config.allow_tensor_float_32_execution(True) +``` + +The word "allow" emphasizes only certain devices (Ampere GPUs) and ops (such as matmuls and convolutions) will be affected. Once enabled, all local and remote Ampere GPUs use TF32 for supported float32 ops. + +Passing `False` to `allow_tensor_float_32_execution` will disable TF32 if already enabled. This is useful if multiple models are run sequentially in the same process, where only some should use TF32. It is also useful for tests, as it allows a test class to test both TF32 being enabled and disabled. 
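+
+For example, a test might toggle the setting around the code under test. The sketch below is illustrative only and uses the API names proposed in this RFC; it will not run until the API actually lands:
+
+```python
+import tensorflow as tf
+
+def run_matmul():
+  a = tf.random.normal([1024, 1024])
+  b = tf.random.normal([1024, 1024])
+  return tf.matmul(a, b)  # May use TF32 tensor cores on Ampere GPUs when allowed.
+
+# Proposed API, as described above; not yet available in released TensorFlow.
+tf.config.allow_tensor_float_32_execution(True)
+tf32_result = run_matmul()
+
+tf.config.allow_tensor_float_32_execution(False)  # Back to full float32 precision.
+fp32_result = run_matmul()
+```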
+ +We call the function "allow_tensor_float_32_execution" instead of the more concise "allow_tf32_execution" because people may mistakenly interpret the phrase "tf32" to refer to TensorFlow instead of TensorFloat. + +The following can be used to query whether TF32 is enabled. The function returns a bool. + +```python +tf.config.tensor_float_32_execution_allowed() +``` + +Since TF32 only affects Ampere GPUs, moving an op to a GPU can affect numerics. Grappler and other graph optimizations will not consider this, and will freely move ops between devices without regard to numeric stability. As a result, explicitly putting an op on the CPU does not ensure it will use the full float32 precision instead of TF32. + +Since TensorFlow 2.3 will not support CUDA 11, which is required for TF32, this API will first be exposed in TensorFlow 2.4. However, downstream repackagers of TensorFlow (such as Google Cloud) are encouraged to cherrypick CUDA 11 and this API into their version of 2.3, so they can offer TF32 support to their customers who use TensorFlow 2.3. + + +### Turning TF32 on by default + +Numerical studies by NVIDIA covering many common models suggest that TF32 is numerically robust for deep learning applications. In order to take advantage of these new accelerations in Ampere hardware for float32 models, we would like to enable TF32 by default. However, since the TensorFlow 2.4 release is still months away and we intend to use that time to further test and evaluate TF32, it is too early to decide in this RFC whether TF32 execution will be enabled or disabled by default. Here we begin a discussion by listing the most likely scenarios. Comments are also welcome. The scenarios are: + +1. Turn it on by default in 2.4, the first release with the TF32 API. +2. Turn it on by default in 2.5, the second release with the TF32 API. +3. Do not turn it on by default. + + +The advantage of (1) is that all Ampere float32 users get the performance benefit unless they opt out. Additionally, Ampere numerics will not be loosened in a new release: TensorFlow 2.4 will be the first release with Ampere support, and it will immediately default to TF32 being enabled. The disadvantage is that we cannot collect as much feedback from users before defaulting to TF32, because no stable version of TensorFlow will support TF32 but not have it enabled by default. + +The advantage of (2) is that it allows users to test and give feedback on TF32 with a stable version of TensorFlow before we decide whether it should be default. The disadvantage is it’s possible we break Ampere users who relied on the full float32 precision in 2.4 when they upgrade to 2.5 + +The advantage of (3) is that a user’s model will never break due to using reduced precision, even if they upgrade from an earlier GPU to Ampere. The disadvantage is that many Ampere users would not get the performance benefit from TF32 as they would not know about the API to enable it. + +Another advantage of turning on TF32 by default is that it makes TensorFlow’s behavior with GPUs more consistent with TPUs. TPUs internally use lower precision for float32 matmuls and convolutions, similar to how Ampere GPUs will use lower precision for float32 matmuls and convolutions if TF32 is enabled. + +**If you know of any models whose accuracy may be impacted by TF32, please comment on this RFC.** Note that TF32 is equivalent to float32 except it has 10 bits of mantissa instead of 23 bits. 
It will initially be used only for matmuls and convolutions, but may be used for other ops in the future if they are implemented in terms of a matmul. Once TensorFlow 2.4 is released, you will be able to test the impact of TF32 on your models if you have Ampere GPUs. You will be able to test earlier if you use Tensorflow nightly packages, and even earlier if you build from source with CUDA 11 support. + +### Remote devices + +Enabling TF32 will affect remote Ampere GPUs in addition to local Ampere GPUs. In particular, it will affect devices on hosts connected to via [`tf.config.experimental_connect_to_host`](https://www.tensorflow.org/api_docs/python/tf/config/experimental_connect_to_host) or [`tf.config.experimental_connect_to_cluster`](https://www.tensorflow.org/api_docs/python/tf/config/experimental_connect_to_cluster). The initial, unexposed version of the function in TensorFlow 2.3 will likely only support local devices, not remote devices, since we will probably not have time to implement remote device support. + +We will need to issue an RPC to remote devices when TF32 is enabled or disabled. This means calling `allow_tensor_float_32_execution` will be a fairly heavy function call. It should only be used at the beginning of the program, or in between executing two models or tests. It is not intended to be used within a single model to make parts of it run in TF32 and parts of it run in float32, especially considering that approach would also not work within a `tf.function`. + +### Alternatives considered + +We could have an API to enable TF32 on a per-op basis, to allow users to run only part of their model in TF32. This would be useful if they discover certain TF32 ops in their model need the full float32 precision. However, we anticipate that almost every model can run safely in TF32, so we do not think this alternative is necessary. If we discover specifying TF32 on a per-op basis is useful, we can later add a TF32 scope or some other mechanism to do this. + +We could disallow enabling/disabling TF32 once a tensor has been created. This makes dealing with remote devices simpler, since we would only have to modify an RPC to create a context with TF32 enabled. We would not have to support updating a context to enable/disable TF32 after the context has been created. `tf.config.set_visible_devices` has this behavior. However, this is more limiting, and it will be non obvious to users that they have to enable TF32 before creating any tensors. + +We could export this API in TensorFlow 2.3. The issue is we don’t plan on building TensorFlow 2.3 with CUDA 11. Without CUDA 11 support, TF32 cannot be used, so the API would not be usable except by those who build TensorFlow from source. 
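+
+To give a rough feel for the precision difference discussed above, the standalone NumPy sketch below zeroes out the 13 least-significant mantissa bits of a float32 value, which approximates (by truncation rather than rounding) the 10-bit mantissa of TF32. This is purely illustrative and is not how the hardware is exposed:
+
+```python
+import numpy as np
+
+def truncate_to_tf32_like(x):
+  """Keep only the top 10 of float32's 23 mantissa bits (truncation, not rounding)."""
+  bits = np.asarray(x, dtype=np.float32).view(np.uint32)
+  return (bits & np.uint32(0xFFFFE000)).view(np.float32)
+
+x = np.float32(1.001)
+print(x, truncate_to_tf32_like(x))  # Prints 1.001 and its value with the low mantissa bits dropped.
+```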
diff --git a/rfcs/20200525-gelu-migration.md b/rfcs/20200525-gelu-migration.md
new file mode 100644
index 000000000..bd6271dee
--- /dev/null
+++ b/rfcs/20200525-gelu-migration.md
@@ -0,0 +1,80 @@
+# Migrate gelu activation from TensorFlow Addons to TensorFlow Core
+
+| Status                   | Accepted                                                |
+| :----------------------- | :------------------------------------------------------ |
+| **RFC #**                | [252](https://github.com/tensorflow/community/pull/252) |
+| **Authors**              | Tzu-Wei Sung (@WindQAQ) & Sean Morgan (@seanpmorgan)     |
+| **Sponsor**              | @alextp                                                  |
+| **Updated**              | 2020-07-15                                               |
+| **Sponsorship Deadline** | 2020-07-17 (45 Days after submission)                    |
+
+## Rationale for Migration
+* [Gaussian Error Linear Units (GELUs)](https://arxiv.org/pdf/1606.08415.pdf) cited 600+ times
+* Used in BERT and other influential architectures
+* Multiple approvals from TF side:
+  * https://github.com/tensorflow/tensorflow/pull/33945#issuecomment-617832325
+  * https://github.com/tensorflow/tensorflow/issues/32783#issuecomment-537284266
+
+## Historical Information
+* Have there been significant issues reported to Addons that need to be addressed?
+  * Only ABI incompatibilities for the custom-op (not an issue if built along with core TF)
+* When was it implemented in Addons?
+  * C++ custom-op added **2019-08-2019 (TFA 0.5.0)**
+  * Python composite op added **2020-02-26 (TFA 0.9.0)**
+* We have [performed benchmarking of the GELU activation](https://colab.research.google.com/drive/1rLb4EuydbFg9PbhboXhCDqopcl6BmphG#scrollTo=0GL2x2S4zxW3)
+which shows it may be beneficial to retain the custom-ops, but the maintenance burden has grown
+too much for us to continue to support it in Addons.
+* This migration is long overdue, but we've struggled with finalizing the migration process.
+
+## Implementation Details
+* Link to implementation in Addons:
+  * Python: https://github.com/tensorflow/addons/blob/r0.10/tensorflow_addons/activations/gelu.py
+  * C++: https://github.com/tensorflow/addons/blob/r0.10/tensorflow_addons/custom_ops/activations/cc/kernels/gelu_op.h
+* Does this include custom-op kernels?
+  * Yes, but we are currently proposing to migrate just the Python composite op. This may
+    change with discussion in the RFC.
+  * Are they CPU/GPU/TPU compatible?
+    * CPU/GPU compatible. No support for TPU.
+* What is the pytest coverage of the addon?
+  * `tensorflow_addons/activations/gelu.py 89%`
+
+## Changes to Implementation (If Needed)
+```python
+def gelu(x: types.TensorLike, approximate: bool = True) -> tf.Tensor:
+    x = tf.convert_to_tensor(x)
+    if approximate:
+        pi = tf.cast(math.pi, x.dtype)
+        coeff = tf.cast(0.044715, x.dtype)
+        return 0.5 * x * (1.0 + tf.tanh(tf.sqrt(2.0 / pi) * (x + coeff * tf.pow(x, 3))))
+    else:
+        return 0.5 * x * (1.0 + tf.math.erf(x / tf.cast(tf.sqrt(2.0), x.dtype)))
+```
+The above implementation would only bring over the Python composite op. Since there is
+[no way for us to build tfxla kernels](https://github.com/tensorflow/tensorflow/pull/33945#issuecomment-617842977)
+we had no support for TPUs in Addons. [There were comments](https://github.com/tensorflow/tensorflow/pull/33945#issuecomment-625380208)
+about using a "selector", but we would need guidance on how to implement that.
+
+We may also want to discuss the `approximate` bool and whether it should be included in the
+upstream version.
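+
+To illustrate what the `approximate` flag changes numerically, the standalone check below (illustrative only; it assumes TensorFlow and NumPy are installed and reimplements both branches directly rather than calling the Addons op) compares the exact and tanh-approximated forms:
+
+```python
+import numpy as np
+import tensorflow as tf
+
+x = tf.constant(np.linspace(-3.0, 3.0, 7), dtype=tf.float32)
+
+# Exact form: 0.5 * x * (1 + erf(x / sqrt(2))).
+exact = 0.5 * x * (1.0 + tf.math.erf(x / tf.cast(tf.sqrt(2.0), x.dtype)))
+
+# Tanh approximation from Hendrycks & Gimpel (2016).
+pi = tf.cast(np.pi, x.dtype)
+coeff = tf.cast(0.044715, x.dtype)
+approx = 0.5 * x * (1.0 + tf.tanh(tf.sqrt(2.0 / pi) * (x + coeff * tf.pow(x, 3))))
+
+print(tf.reduce_max(tf.abs(exact - approx)).numpy())  # Small but nonzero difference.
+```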
+ + +## Transition Plan +* The activation would land in [nn_ops.py](https://github.com/tensorflow/tensorflow/blob/r2.2/tensorflow//python/ops/nn_ops.py), [keras activations](https://github.com/tensorflow/tensorflow/blob/r2.2/tensorflow/python/keras/activations.py), + and possibly in [keras advaced_activation layers](https://github.com/tensorflow/tensorflow/blob/r2.2/tensorflow/python/keras/layers/advanced_activations.py) +* No planned changes to the parameter signatures at this time +* Addons would deprecate our activation and make a call to the core functionality. +* After merging to TF Core: + * Consolidate/remove https://github.com/tensorflow/models/blob/r2.2.0/official/modeling/activations/gelu.py + * Consolidate/remove https://github.com/tensorflow/models/blob/r2.2.0/official/modeling/activations/gelu_test.py + * Consolidate/remove https://github.com/tensorflow/models/blob/r2.2.0/official/nlp/xlnet/xlnet_modeling.py#L29 + +## Relevant GitHub Issues +* https://github.com/tensorflow/tensorflow/pull/33945 +* https://github.com/tensorflow/addons/issues/550 +* https://github.com/tensorflow/tensorflow/issues/32783 + +## Questions and Discussion Topics +* Whom from the TF core team would sponsor this migration and ownership of the API? +* Is it worth bringing over the custom-op kernels for CPU/GPU? + +## Final Decision +TBD diff --git a/rfcs/20200601-tfx-udsl-semantics.md b/rfcs/20200601-tfx-udsl-semantics.md new file mode 100644 index 000000000..30bc028e2 --- /dev/null +++ b/rfcs/20200601-tfx-udsl-semantics.md @@ -0,0 +1,426 @@ +# Advanced TFX DSL semantics + +| Status | Accepted | +| :------------ | :---------------------------------------------------------- | +| **Author(s)** | Ruoyu Liu (ruoyu@google.com), Konstantin Shtoyk (kostik@google.com), Mitch Trott (trott@google.com), Zhitao Li (zhitaoli@google.com) | +| **Sponsor** | Konstantinos Katsiapis (katsiapis@google.com) | +| **Updated** | 2020-04-08 | + +## Background + +The existing TFX DSL mainly focuses on one-shot pipelines with static execution +plan. While it is good enough for many use cases, there are some scenarios that +the current DSL fails to support. Some of those scenarios are becoming more and +more crucial to modern ML production pipelines. The rest of this section will go +through some scenarios that this proposal is trying to address. + +### Components with different schedules + +Until now, all components in a TFX pipeline are expected to be executed the same +number of times if no failure occurs. For example, in the +[Chicago Taxi](https://github.com/tensorflow/tfx/tree/v0.21.2/tfx/examples/chicago_taxi_pipeline) +example, the `ExampleGen` component will be executed no more or no less than the +`Trainer` component or any other component if every pipeline run finishes +successfully. This is a strong guarantee but also a strict limitation. The use +cases shown below are common patterns that require relaxation of this +limitation. + +#### Different input sources + +It is common that a ML pipeline uses more than one input sources. It is also +common that different input sources are generated with different frequencies. In +this case, it is undesirable to fit the components consuming different data +sources into the same pipeline with the same schedule since it prevents the +optimization between data freshness and resources efficiency. 
On the other hand,
+it is also undesirable to break a logical pipeline into several pieces and run them separately, since the 'wait for all inputs to be available' semantics of the original pipeline is broken, which might result in unintended behaviors such as consuming partially ready data. Thus, it is important to be able to author a pipeline that consists of components with different schedules, and we want that experience to be well defined and intuitive.
+
+#### 'Best effort' components
+
+There are some components that are not regarded as _essential_ to a pipeline but take very long to run. Thus, we do not want to put them into the critical path of the pipeline, nor spend too many resources to make sure they are executed successfully for every pipeline run. A possible approach to address these requirements is to make these components optional and not tie them to the same schedule as other components in the pipeline. Similar to the previous use case, this also requires us to support components with different schedules in a pipeline.
+
+#### Limited computation resource, too much data
+
+Data is the most critical piece of ML engineering, and recent advances in data collection and log processing enable us to get more data faster. Sometimes the data volume is so high and arrives so fast that it exceeds the computation resource limits of the system. For example:
+
+* New data arrives every hour but the entire ML pipeline takes one day to run end-to-end.
+
+In most cases, there are one or a few components that take significantly more time than the others. Indeed, from what we have observed within Google, `Trainer` is most likely to be the one that cannot catch up with the speed of other components such as `ExampleGen`. We have a couple of options in the existing TFX DSL under this context:
+
+1. Adapt the pipeline scheduling to the data arrival frequency. In the context of the previous example, the pipeline will be scheduled to run every hour.
+2. Adapt the pipeline scheduling to the pipeline running time. In the context of the previous example, the pipeline will be scheduled to run every day.
+3. Split the pipeline into multiple parts, each of which employs a different schedule.
+
+Option 1 is fine in an ideal world where computation resources are unlimited. In the real world, however, this option will likely cause pipeline runs to pile up at a certain stage, which is problematic due to resource contention. Option 2 avoids the resource contention problem but gives up the benefit of timely-arriving data: the `Trainer` will likely train on old data despite more recent data being available, which compromises model freshness and is not ideal for freshness-sensitive scenarios. Option 3 can potentially solve the problem but is not user-friendly and is brittle with respect to future changes to individual components and the pipeline shape.
+
+On the other hand, if we are able to support components with different schedules in a pipeline, the problem is naturally solved.
+
+### Synchronization barrier
+
+It is not recommended to have concurrent pipeline runs, since they cause problems for components that need to be guarded by a 'synchronization barrier', such as `Pusher`. However, this is sometimes not avoidable, even if the intention is not to have pipeline runs concurrently.
+
+*(Figure: dj_normal)*
+
+The figure above shows a daily ML pipeline in an ideal world. For simplicity, we combine all components except `Pusher` into a single stage. As you can see, we expect a pipeline run to finish within a day so that it does not overlap with other runs. However, there might be cases where the pipeline overruns, resulting in concurrent pipeline runs. The figure below shows an extreme but possible case, where the Tuesday run actually finishes after the Wednesday run. This is problematic since the Tuesday run will push an older model into production, which is likely to cause a regression.
+
+*(Figure: dj_abnormal)*
+ +To address the problem, we need to guarantee that: + +* There is only one instance of `Pusher` running at any time. +* Each `Pusher` run always pushes a model that is considered better than any + previously pushed model. + +However, there is no good solution to provide such guarantees without compromise +in the existing TFX DSL. + +## Design proposal + +### A new way to categorize ML pipelines + +Before going into the design details, we would like to introduce a new way to +categorize ML pipelines. Understanding these concepts is critical to evaluate +the rest of this RFC. + +#### Synchronous execution & Asynchronous execution + +Synchronous execution refers to the pipeline execution style we have seen so far +in the TFX DSL. It has several properties: + +* A synchronous execution pipeline can be represented as a _Directed Acyclic + Graph_ (DAG) where any edge in the DAG represents the task dependency + relationship between two components in the pipeline. +* A synchronous execution pipeline is scheduled in the unit of pipeline runs. + In each pipeline run, components in the pipeline will be visited in + topological order according to the pipeline DAG. +* Every component in the pipeline will be executed *exactly once* for each + pipeline run. For example, `ExampleGen` (represented by `Eg`), `Transform` + (represented by `Tx`) and `Trainer` (represented by `Tr`) in the figure + below will share the same number of executions if no error occurs. +* A component in the pipeline will only be triggered when all its upstream + components finish. For example, `Trainer` will only be triggered when + `ExampleGen` and `Transform` all finish in the figure below. + +
+
+Asynchronous execution refers to the pipeline execution style where each component in the pipeline is a stand-alone job (usually a long-running job). It has the following properties:
+
+* There are no explicit task dependencies in asynchronous execution pipelines. Components are loosely connected through data dependencies. For example, in the figure below we consider `ExampleGen` and `Transform` to be connected (through the dashed line) only because `Transform` will consume the output of `ExampleGen`.
+* There is no 'pipeline run' for asynchronous execution pipelines. Each component in the pipeline runs on its own schedule.
+* Each component in an asynchronous execution pipeline can be triggered by newly available data from any of its data dependencies. For example, `Trainer` can be triggered by either new example data produced by `ExampleGen` or a new transform graph produced by `Transform`.
+* At any given time, there is at most one running instance per component.
+
+More details about the usage of asynchronous execution pipelines within Google can be found in this [paper](https://www.usenix.org/system/files/opml19papers-baylor.pdf).
+
+ +#### Synchronous data & Asynchronous data + +Under the context of TFX, a component is running in synchronous data mode if it +only consumes the immediate outputs of its upstream components in the **same** +pipeline run. This is **only** possible if the pipeline is in synchronous +execution mode mentioned in previous section. + +On the other hand, if a component is able to consume more than the immediate +outputs of its upstream components but also the historical outputs of its +upstream components in previous pipeline runs, it is running in asynchronous +data mode: + +- In synchronous execution pipelines, this means that a component consumes not + only the outputs of its direct upstream component runs, but also the + historical data from previous runs. This can be achieved by leveraging + [Resolvers](https://github.com/tensorflow/community/blob/master/rfcs/20190828-tfx-resolver.md) + in the existing TFX DSL [^1]. +- In asynchronous execution pipelines, all components are running in + asynchronous data mode naturally. + +[^1]: Existing examples of + [warm start](https://github.com/tensorflow/tfx/blob/r0.21.4/tfx/examples/chicago_taxi_pipeline/taxi_pipeline_warmstart.py#L102-L115) + and + [base model selection](https://github.com/tensorflow/tfx/blob/r0.21.4/tfx/examples/iris/iris_pipeline_native_keras.py#L111-L139) + all use Resolvers to achieve asynchronous data. + +### Support asynchronous execution pipelines + +The table below lists all possible combinations of execution mode and data mode +for an ML pipeline. As we discussed previously, TFX DSL already supports both +combinations with synchronous execution in it. We propose to support +asynchronous execution, which will help us to cover all combinations in the +table. + +| | Synchronous execution | Asynchronous execution| +| :-------------------: | :-------------------: | :-------------------: | +| **Synchronous data** | Default mode in existing TFX DSL | Not meaningful | +| **Asynchronous data** | Supported through [Resolvers](https://github.com/tensorflow/community/blob/master/rfcs/20190828-tfx-resolver.md) | Introduced in this RFC | + +By supporting asynchronous execution pipelines, we are able to address all use +cases mentioned in the [Background](#Background) section: + +* Components are naturally running with different frequencies in asynchronous + execution pipelines. The running frequencies are decided by the combination + of the following: + * The frequencies of the new data arrival that can trigger a run. + * The time needed to finish a run. + * Potentially scheduling optimizer. + +Note: Self-determined running frequency, as explained above, is not the same as +statically-defined running schedule that normally happens in the synchronous +execution pipeline world. However it can achieve similar goal in this context. + +* Synchronization barrier is also conveniently available since it is + guaranteed that only one instance of a component will be running at any + time. + +The rest of this section will go into details about what additional semantics +and syntax will be added to TFX DSL to support asynchronous execution pipelines. + +### Execution mode + +A piece of good news is that components in the existing TFX DSL are already +connected through data dependencies, instead of direct explicit task +dependencies. The only place we need to change in the syntax is to add an +`execution_mode` option to the pipeline constructor interface. + +```python +def create_pipeline(): + eg = ExampleGen(...) + tr = Trainer(...) 
+  p = Pusher(...)
+
+  return Pipeline.pipeline(
+      components=[eg, tr, p],
+      # The only difference compared with the existing DSL. Also note that this
+      # field is optional and defaults to `SYNC` for backward compatibility.
+      execution_mode=ASYNC,
+      ...)
+```
+
+### Sub-pipeline
+
+Asynchronous execution pipelines are able to provide a lot of flexibility and unblock many use cases. However, vanilla asynchronous execution pipelines have their own problems.
+
+Consider the example below, for which there is no good way to express the intent that `Trainer` needs to read examples and a transform graph that satisfy the following:
+
+- If the transform graph produced by a `Transform` execution E1 is used by a `Trainer` execution E2, then E1 and E2 should use the same examples.
+
+This is a typical data synchronization problem inside an asynchronous execution pipeline. For simple cases like the one above, it is still possible (although strongly discouraged) to work around this by hardcoding the synchronization logic into a specific component (in this case, `Trainer`) and some special stamping on related artifacts. However, this will soon become unmanageable as the number of components involved in the data synchronization problem increases. What we need is a mechanism to make part of an asynchronous execution pipeline run synchronously.
+
+ +To address the problem, we propose 'sub-pipeline', which refers to a synchronous +execution pipeline inside a parent pipeline. In this case, we can have a mix of +synchronous execution and asynchronous execution together in one pipeline +definition. There are several attributes related to sub-pipeline: + +1. If we view sub-pipeline as a node in its parent pipeline, there is only one + execution mode (synchronous vs asynchronous) in the parent pipeline. +2. A sub-pipeline is **always** in synchronous execution mode, i.e., all nodes + inside the sub-pipeline are executed synchronously in topological order. +3. Each node inside the sub-pipeline can be configured to run with synchronous + data mode or asynchronous data mode. +4. Sub-pipeline inputs and outputs can be wired in either synchronous or + asynchronous fashion (introduced below). + +We will use the scenario represented by the figure below to better demonstrate +the proposal. There are 5 nodes in the parent pipeline: + +- An `ExampleGen` component, represented by `Eg`. +- An `EmbeddingGenerator` component, represented by `Eb`. +- A sub-pipeline that consists of three nodes: + - A `Transform` component, represented by `Tx`. It will take the examples + produced by `ExampleGen` and output a transform graph. + - A `Trainer` component, represented by `Tr`. It will take three inputs: + (1) the examples produced by `ExampleGen`; (2) the transform graph + produced by `Transform`; (3) the embedding produced by + `EmbeddingGenerator`. It will output a model artifact. There are two + special requirements for these inputs: + - `Trainer` and `Transform` should use the same examples. + - The embedding used by a `Trainer` execution should be as fresh as + possible. + - An `InfraValidator` component, represented by `Iv`. It will take the + model produced by `Trainer` and evaluate whether the model can be + deployed without correctness or performance issue. The `InfraValidator` + component will output a validation result artifact. +- A `Pusher` component, represented by `P`. It will take the model produced by + `Trainer` in the sub-pipeline and push the model to the model server. For a + model to be regarded as 'valid', there is a requirement that the `Pusher` + will only read a model that has gone through infra validation. +- A `TFLiteConverter` component, represented by `Lt`. It will take the model + produced by `Trainer` in the sub-pipeline and convert it into a mobile + friendly model. Since the conversion does not rely on server side infra + validation, we want it to start process as soon as a new model is produced + by `Trainer`. + +
+
+#### Inputs to sub-pipelines
+
+There are two flavors for a node inside the sub-pipeline to get input artifacts
+that are **NOT** produced inside the sub-pipeline:
+
+-   Asynchronous inputs. This is the same behavior as for a normal node in an
+    asynchronous pipeline. If two nodes inside a sub-pipeline try to read the
+    output of a node outside of the sub-pipeline asynchronously, they might get
+    different results. This is demonstrated by `Trainer` reading the output of
+    `EmbeddingGenerator`. Note that this is only available when the parent
+    pipeline is in asynchronous execution mode.
+-   Synchronous inputs. As a comparison, if two nodes in a sub-pipeline are
+    reading from the same synchronous input, it is guaranteed that they will
+    get the same set of artifacts. This is done by snapshotting the inputs at
+    the beginning of the pipeline (as represented by the small barnacle
+    attached to the left side of the sub-pipeline box in the figure above). In
+    our example, `Trainer` and `Transform` are both reading the output of
+    `ExampleGen` as a synchronous input. Note that if the parent pipeline is in
+    synchronous execution mode, the nodes inside the sub-pipeline will always
+    read synchronous inputs.
+
+NOTE: By default, a sub-pipeline can be triggered by any newly available
+synchronous input, but it will not be triggered by any newly available
+asynchronous input. We will also discuss and provide custom triggering options
+in future designs.
+
+#### Outputs of sub-pipelines
+
+Symmetrically, there are two flavors for a node outside the sub-pipeline to
+read the outputs of a node inside the sub-pipeline:
+
+-   Asynchronous outputs. This is the same behavior as normal asynchronous data
+    fetching. `TFLiteConverter` above demonstrates this behavior: it can be
+    triggered as soon as `Trainer` in the sub-pipeline produces a new model.
+    Note that this is only available when the parent pipeline is in
+    asynchronous execution mode.
+-   Synchronous outputs. As a comparison, when a node outside of the
+    sub-pipeline tries to read the synchronous outputs of a node inside the
+    sub-pipeline, the outside node will not get the artifacts until all the
+    nodes inside the sub-pipeline finish execution. This is demonstrated by
+    `Pusher` in the example above: it will be able to read a model produced by
+    a `Trainer` execution only when all nodes in the sub-pipeline finish
+    execution for that sub-pipeline run. Similar to synchronous inputs to a
+    sub-pipeline, this is achieved by snapshotting the outputs of nodes inside
+    a sub-pipeline after all nodes finish execution (as represented by the
+    small barnacle attached to the right side of the sub-pipeline box in the
+    figure above).
+
+### Syntax
+
+The example code below demonstrates the syntax for all proposed semantics in
+this RFC. It uses the same example as the figure above.
+
+```python
+def create_subpipeline(eg, eb):
+  b = tfx.experimental.SubpipelineInputs(
+      inputs={'examples': eg.outputs['examples']},
+      async_inputs={'embedding': eb.outputs['embedding']})
+  tx = tfx.Transform(
+      examples=b.inputs['examples'])
+  tr = tfx.Trainer(
+      examples=b.inputs['examples'],
+      embedding=b.async_inputs['embedding'],
+      transform_graph=tx.outputs['transform_graph'])
+  iv = tfx.InfraValidator(model=tr.outputs['model'])
+
+  return tfx.experimental.Subpipeline(
+      components=[tx, tr, iv],
+      inputs=b,
+      outputs={
+          'model': tr.outputs['model'],
+          'validation_result': iv.outputs['validation_result']
+      },
+      async_outputs={'model': tr.outputs['model']})
+
+
+eg = tfx.ExampleGen(...)  # Irrelevant parts omitted
+eb = tfx.EmbeddingGenerator(...)  # Irrelevant parts omitted
+sp = create_subpipeline(eg, eb)
+p = tfx.Pusher(
+    model=sp.outputs['model'],
+    validation_result=sp.outputs['validation_result'])
+lt = tfx.TFLiteConverter(model=sp.async_outputs['model'])
+
+async_pipeline = pipeline.Pipeline(
+    components=[eg, eb, sp, p, lt], execution_mode=ASYNC)
+```
+
+## Future work
+
+There are multiple topics that we would like to address in future design
+proposals.
+
+Most importantly, we will introduce a data model and a serialization format for
+ML pipelines that will uniformly support:
+
+-   Both synchronous execution mode and asynchronous execution mode.
+-   Both synchronous data mode and asynchronous data mode.
+
+Besides that, we will also explore:
+
+-   Formalizing `TriggerPolicy` as part of the node abstraction. This also
+    includes exploring options and APIs to support custom triggering logic for
+    single nodes as well as sub-pipelines.
+-   Options and APIs to support parallel executor runs within a component to
+    add more efficiency and flexibility.
diff --git a/rfcs/20200601-tfx-udsl-semantics/async_pipeline_example.png b/rfcs/20200601-tfx-udsl-semantics/async_pipeline_example.png
new file mode 100644
index 000000000..03c769d26
Binary files /dev/null and b/rfcs/20200601-tfx-udsl-semantics/async_pipeline_example.png differ
diff --git a/rfcs/20200601-tfx-udsl-semantics/daily_job_abnormal.png b/rfcs/20200601-tfx-udsl-semantics/daily_job_abnormal.png
new file mode 100644
index 000000000..37a43f045
Binary files /dev/null and b/rfcs/20200601-tfx-udsl-semantics/daily_job_abnormal.png differ
diff --git a/rfcs/20200601-tfx-udsl-semantics/daily_job_normal.png b/rfcs/20200601-tfx-udsl-semantics/daily_job_normal.png
new file mode 100644
index 000000000..46e043bab
Binary files /dev/null and b/rfcs/20200601-tfx-udsl-semantics/daily_job_normal.png differ
diff --git a/rfcs/20200601-tfx-udsl-semantics/sub_pipeline.png b/rfcs/20200601-tfx-udsl-semantics/sub_pipeline.png
new file mode 100644
index 000000000..17a486d16
Binary files /dev/null and b/rfcs/20200601-tfx-udsl-semantics/sub_pipeline.png differ
diff --git a/rfcs/20200601-tfx-udsl-semantics/sync_pipeline_example.png b/rfcs/20200601-tfx-udsl-semantics/sync_pipeline_example.png
new file mode 100644
index 000000000..7d7b39a3b
Binary files /dev/null and b/rfcs/20200601-tfx-udsl-semantics/sync_pipeline_example.png differ
diff --git a/rfcs/20200616-keras-multihead-attention.md b/rfcs/20200616-keras-multihead-attention.md
new file mode 100644
index 000000000..c930fbfe1
--- /dev/null
+++ b/rfcs/20200616-keras-multihead-attention.md
@@ -0,0 +1,356 @@
+# RFC: Multihead Attention and EinsumDense on Keras
+
+| Status        | Accepted                                                 |
+| :------------ | :------------------------------------------------------ |
+| **RFC #**     | [260](https://github.com/tensorflow/community/pull/260)  |
+| **Author(s)** | Hongkun Yu (hongkuny@google.com), Mark Omernick (momernick@google.com) |
+| **Sponsor**   | Francois Chollet (fchollet@google.com)                   |
+| **Updated**   | 2020-06-16                                               |
+
+## Objective
+
+Introduce the MultiHeadAttention layer and EinsumDense layer to tf.keras.
+
+## Motivation
+
+MultiHeadAttention is very popular and has become standard for deep learning
+libraries. We propose to contribute a flexible, well-defined implementation
+inside Keras, absorbing common best practices from reference libraries.
+
+## User Benefit
+
+We can standardize the implementation of Transformer layers and use best
+practices. We offer a rich set of functionalities for different use cases, e.g.
+different projection spaces, outputting multi-head attention scores for
+analysis, etc. We also modularize computations to make the MultiHeadAttention
+layer extensible to variants.
+
+## Design Proposal
+
+### Key Features
+
+* Returns multi-headed attention scores, which is commonly useful for
+  attention visualization and analysis.
+* Supports query (Q), key (K), value (V) tensors as individual inputs and
+  supports projecting Q, K, V to different dimensions.
+* Final outputs are projected to user-specified dimensions.
+* Uses tf.einsum to express high-dimensional computation and adopts the
+  [tf.keras.layers.experimental.EinsumDense](https://www.tensorflow.org/api_docs/python/tf/keras/layers/experimental/EinsumDense)
+  layer.
+* Supports high-dimensional attention when the target and source are 2D, 3D,
+  etc.
+
+### Code Examples
+
+* How to write a TransformerBlock for an encoder.
+
+```python
+class TransformerBlock(tf.keras.layers.Layer):
+  def __init__(self, embed_dim, num_heads, ff_dim):
+    super(TransformerBlock, self).__init__()
+    self.att = MultiHeadAttention(num_heads=num_heads, key_size=embed_dim)
+    self.ffn = tf.keras.Sequential(
+        [tf.keras.layers.Dense(ff_dim, activation="relu"),
+         tf.keras.layers.Dense(embed_dim),]
+    )
+    self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
+    self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
+
+  def call(self, inputs, attention_mask=None):
+    attn_output = self.att(inputs, inputs, attention_mask=attention_mask)
+    out1 = self.layernorm1(inputs + attn_output)
+    ffn_output = self.ffn(out1)
+    return self.layernorm2(out1 + ffn_output)
+```
+
+* Use an attention mask to avoid performing attention on padding token
+  indices.
+
+```python
+test_layer = TransformerBlock(
+    embed_dim=2,
+    num_heads=2,
+    ff_dim=4)
+query = np.array([[[0.1, 0.2], [0.0, 0.0]]])
+mask = np.array([[[1, 0], [1, 0]]], dtype='bool')
+output = test_layer(query, attention_mask=mask)
+```
+
+* Inside a Transformer decoder, we often want to output the cross-attention
+  scores to analyze how the target sequence attends to the source sequence. We
+  are able to visualize the alignment according to the attention scores.
+
+```python
+test_layer = MultiHeadAttention(
+    num_heads=2, key_size=2, return_attention_scores=True)
+target = np.array([[[0.1, 0.2], [0.0, 0.0]]])
+source = np.array([[[0.1, 0.2], [3.0, 1.0]]])
+output, scores = test_layer(query=target, value=source)
+scores = tf.math.reduce_sum(scores, axis=1)  # shape = (1, 2, 2)
+```
+
+* Attention beyond sequences, taking 2D or 3D targets and sources.
+
+```python
+test_layer = MultiHeadAttention(num_heads=2, key_size=2)
+query_shape = [2, 3, 4, 4]  # batch, target, target, embedding.
+value_shape = [2, 3, 2, 4]  # batch, source, source, embedding.
+mask_shape = [2, 3, 4, 3, 2]
+query = 10 * np.random.random_sample(query_shape)
+value = 10 * np.random.random_sample(value_shape)
+mask_data = np.random.randint(2, size=mask_shape).astype("bool")
+output = test_layer(query=query, value=value, attention_mask=mask_data)
+```
+
+### Interface
+
+```python
+class MultiHeadAttention(tf.keras.layers.Layer):
+  """MultiHeadAttention layer.
+
+  This is an implementation of multi-headed attention based on "Attention
+  is all you Need". If `query`, `key`, `value` are the same, then
+  this is self-attention. Each timestep in `query` attends to the
+  corresponding sequence in `key`, and returns a fixed-width vector.
+
+  This layer first projects `query`, `key` and `value`. These are
+  (effectively) a list of tensors of length `num_attention_heads`, where the
+  corresponding shapes are [batch_size, <query dimensions>, key_size],
+  [batch_size, <key dimensions>, key_size],
+  [batch_size, <value dimensions>, value_size].
+
+  Then, the query and key tensors are dot-producted and scaled. These are
+  softmaxed to obtain attention probabilities. The value tensors are then
+  interpolated by these probabilities, then concatenated back to a single
+  tensor.
+
+  Finally, the result tensor with the last dimension as value_size can take a
+  linear projection and be returned.
+
+  Examples:
+
+  Performs 1D cross-attention over two sequence inputs with an attention mask.
+  Returns the additional attention weights over heads.
+
+  >>> layer = MultiHeadAttention(num_heads=2, key_size=2,
+  ...                            return_attention_scores=True)
+  >>> target = tf.keras.Input(shape=[8, 16])
+  >>> source = tf.keras.Input(shape=[4, 16])
+  >>> mask_tensor = tf.keras.Input(shape=[8, 4])
+  >>> output_tensor, weights = layer(query=target, value=source,
+  ...                                attention_mask=mask_tensor)
+  >>> print(output_tensor.shape)
+  (None, 8, 16)
+  >>> print(weights.shape)
+  (None, 2, 8, 4)
+
+  Performs 2D self-attention over a 5D input tensor on axes 2 and 3.
+
+  >>> layer = MultiHeadAttention(num_heads=2, key_size=2, attention_axes=(2, 3))
+  >>> input_tensor = tf.keras.Input(shape=[5, 3, 4, 16])
+  >>> output_tensor = layer(query=input_tensor, value=input_tensor)
+  >>> print(output_tensor.shape)
+  (None, 5, 3, 4, 16)
+
+  Arguments:
+    num_heads: Number of attention heads.
+    key_size: Size of each attention head for query and key.
+    value_size: Size of each attention head for value.
+    dropout: Dropout probability for a Dropout layer on attention_scores.
+    use_bias: Boolean, whether the dense layers use bias vectors/matrices.
+    output_shape: The expected shape of an output tensor, besides the batch and
+      sequence dims. If not specified, projects back to the key feature dim.
+    attention_axes: axes over which the attention is applied. `None` means
+      attention over all axes except batch, heads, and features.
+    return_attention_scores: bool, if `True`, returns the multi-head attention
+      scores as an additional output argument.
+    kernel_initializer: Initializer for dense layer kernels.
+    bias_initializer: Initializer for dense layer biases.
+    kernel_regularizer: Regularizer for dense layer kernels.
+    bias_regularizer: Regularizer for dense layer biases.
+    activity_regularizer: Regularizer for dense layer activity.
+    kernel_constraint: Constraint for dense layer kernels.
+    bias_constraint: Constraint for dense layer biases.
+  """
+
+  def call(self, query, value, key=None, attention_mask=None):
+    """Implements the forward pass.
+
+    Size glossary:
+      * Number of heads (H): the number of attention heads.
+      * Value size (V): the size of each value embedding per head.
+      * Key size (K): the size of each key embedding per head. Equally, the
+        size of each query embedding per head. Typically K <= V.
+      * Batch dimensions (B).
+      * Query (target) attention axes shape (T).
+      * Value (source) attention axes shape (S), the rank must match the
+        target.
+
+    Args:
+      query: Query `Tensor` of shape `[B, T, dim]`.
+      value: Value `Tensor` of shape `[B, S, dim]`.
+      key: Optional key `Tensor` of shape `[B, S, dim]`. If not given, will
+        use `value` for both `key` and `value`, which is the most common case.
+      attention_mask: a boolean mask of shape `[B, T, S]`, that prevents
+        attention to certain positions.
+
+    Returns:
+      attention_output: The result of the computation, of shape [B, T, E],
+        where `T` is for target sequence shapes and `E` is the query input
+        last dimension if `output_shape` is `None`. Otherwise, the multi-head
+        outputs are projected to the shape specified by `output_shape`.
+      attention_scores: [Optional] multi-head attention coefficients over
+        attention axes.
+    """
+```
+
+### Auxiliary Layers and Changes
+
+* EinsumDense layer
+
+We use `tf.einsum` to implement a dense layer that can perform einsum
+calculations of arbitrary dimensionality. This example shows how to instantiate
+a layer that applies the same dense operation to every element in a sequence.
+Here, the `output_shape` has two values (since there are two non-batch
+dimensions in the output); the first dimension in the `output_shape` is `None`,
+because the sequence dimension `b` has an unknown shape.
+
+```python
+layer = EinsumDense("abc,cd->abd", output_shape=(None, 64), bias_axes="d")
+input_tensor = tf.keras.Input(shape=[32, 128])
+output_tensor = layer(input_tensor)  # output shape is (None, 32, 64)
+```
+
+* Masked Softmax
+
+Inside the attention computation, we need to mask the logits before the
+softmax, which has become a common treatment in many applications. We propose
+to add an optional `mask` argument to `tf.nn.softmax`. The downstream keras
+`Softmax` layer will also take an optional `mask` tensor. This `mask` tensor
+should have the same rank as the input tensor and mask elements on the axes
+over which the softmax is performed. An illustrative sketch of these masking
+semantics is included after the alternatives below.
+
+Inside the `MultiHeadAttention` keras layer, we will use the keras `Softmax`
+layer with a mask and adjust the attention mask shape to match the inputs. The
+dimension expansion logic and multi-axis softmax will be handled locally in the
+`MultiHeadAttention` layer.
+
+* Keras Dense Attention
+
+We propose two changes to
+[tf.keras.layers.Attention](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Attention).
+(1) The layer call method takes an optional argument, `mask`, which requires
+two tensors, `q_mask` and `v_mask`. They follow the keras framework
+requirements, with (batch_size, target_length) and (batch_size, source_length)
+as shapes. This limits the flexibility of masking, whereas the
+`MultiHeadAttention` layer generalizes the attention mask to
+(batch dims, target dims, source dims). To be consistent, we would like to
+introduce an optional argument `attention_mask` for
+`tf.keras.layers.Attention`. In the reduced case of `tf.keras.layers.Attention`,
+its shape is (batch_size, target_length, source_length). Whenever
+`attention_mask` is specified, the `mask` argument may be omitted.
+(2) The layer does not return attention scores. We will add a bool argument,
+`return_attention_scores`, to `__init__` and return the attention score tensor
+if it is `True`.
+
+* TFA `MultiHeadAttention` Deprecation and Re-mapping
+
+[MultiHeadAttention](https://github.com/tensorflow/addons/blob/master/tensorflow_addons/layers/multihead_attention.py)
+has been released. The proposed `MultiHeadAttention` has similar `__init__`
+arguments and `call` interface, where the minor differences are argument names
+and the attention `mask` shape. We expect the new `MultiHeadAttention` keras
+layer to cover its functionality. Once the implementations are merged as
+experimental layers, we will work with the TF Addons team to design the
+deprecation and re-mapping procedure.
+
+### Alternatives Considered
+
+We examined the multi-head attention layers implemented in various libraries.
+There are a few features that we do not include inside this keras layer, and we
+feel it is better to subclass the `MultiHeadAttention` layer to fulfill those
+needs.
+
+* Attention caching for decoding. Implemented in
+  [Flax](https://github.com/google/flax/blob/master/flax/nn/attention.py#L301).
+  The caching is a special treatment for inference, and we noticed that
+  different treatments are required for dynamic- or static-shape programs.
+  Thus, subclassing as a
+  [CachedAttention](https://github.com/tensorflow/models/blob/master/official/nlp/modeling/layers/attention.py)
+  layer is the solution inside the model garden.
+* A [MultiHeadAttention](https://github.com/tensorflow/addons/blob/master/tensorflow_addons/layers/multihead_attention.py)
+  keras layer is also implemented in TF-Addons. The design in this doc covers
+  the features of the TF-Addons implementation but generalizes to more use
+  cases.
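+
+To make the masking semantics above concrete, here is a minimal, illustrative
+sketch written with existing TF ops only. The proposed `mask` argument on
+`tf.nn.softmax` is not assumed to exist here; the helper below simply
+demonstrates the intended behavior by pushing masked logits to a large negative
+value.
+
+```python
+import tensorflow as tf
+
+def masked_softmax(logits, mask, axis=-1):
+  """Illustrative only: softmax over `axis` with masked positions suppressed.
+
+  `mask` is a boolean tensor broadcastable to `logits`; positions where it is
+  False receive (approximately) zero probability.
+  """
+  adder = (1.0 - tf.cast(mask, logits.dtype)) * -1e9
+  return tf.nn.softmax(logits + adder, axis=axis)
+
+# Example: attention scores of shape [batch, heads, target, source], with the
+# last source position masked out for every target position.
+scores = tf.random.normal([1, 2, 3, 4])
+mask = tf.constant([[[[True, True, True, False]]]])  # broadcasts over B, H, T
+probs = masked_softmax(scores, mask)
+```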
+
+### Performance Implications
+
+* We will add microbenchmarks following the common practices of keras layers.
+* We have end-to-end integration/regression tests for models using this layer,
+  e.g. BERT.
+
+### Dependencies
+
+No dependencies.
+
+### Engineering Impact
+
+* The keras layer can be tested inside the package.
+* The TensorFlow team will maintain the code.
+
+### Platforms and Environments
+
+* Works for all platforms and environments.
+
+### Best Practices
+
+* No changes to TensorFlow best practices.
+
+### Tutorials and Examples
+
+* Code examples can be found inside the TensorFlow Model Garden. For example,
+  an encoder
+  [Transformer](https://github.com/tensorflow/models/blob/master/official/nlp/modeling/layers/transformer.py).
+
+* A 2D attention example is in the
+  [unit test](https://github.com/tensorflow/models/blob/master/official/nlp/modeling/layers/attention_test.py#L135).
+
+### Compatibility
+
+* This is a new layer without compatibility concerns.
+* The proposal works with TFLite, distribution strategies, tf.function, and
+  GPU/TPU, and is serializable to SavedModel. These are tested inside
+  TensorFlow Model Garden applications.
+
+### User Impact
+
+* We will first introduce the layer as
+  `tf.keras.layers.experimental.MultiHeadAttention` and
+  `tf.keras.layers.experimental.EinsumDense`. When the APIs are stable and the
+  functionalities are fully verified, the next step is to graduate them to core
+  keras layers by removing the `experimental` scope.
+
+## Detailed Design
+
+The layer has been implemented as the
+[MultiHeadAttention](https://github.com/tensorflow/models/blob/master/official/nlp/modeling/layers/attention.py#L116)
+inside the TensorFlow Model Garden.
+
+First, as we rely on `tf.einsum` to define the projections and the attention
+computation, we need to figure out the einsum notation of each computation.
+Furthermore, to make the layer generalize to high-dimensional cases, i.e. where
+there may be more than one batch dimension and the attention softmax can be
+performed on multiple axes, we need to track the batch axes and attention axes
+inside the einsum notations. We use a vector of chars and two local methods to
+generate the einsum notations for the projections and the attention computation
+(a worked sketch of such notations follows the questions section below).
+
+Second, the layer by default implements the most common dot-product attention.
+There are various ways to implement the attention computation, so we modularize
+it as two methods, `build_attention` and `compute_attention`. Thus, users will
+be able to just override them to get a new keras layer with a novel attention
+method. For example, we implemented
+[TalkingHeadAttention](https://github.com/tensorflow/models/blob/master/official/nlp/modeling/layers/talking_heads_attention.py),
+introduced in the ["Talking-Heads Attention"](https://arxiv.org/abs/2003.02436)
+paper. As another example, since the keras Attention layer supports the basic
+single-head 1-D attention case, we can use it inside `build_attention` and
+`compute_attention`.
+
+## Questions and Discussion Topics
+
+- cuDNN has a
+  [multi-head attention](https://docs.nvidia.com/deeplearning/sdk/cudnn-api/index.html#cudnnMultiHeadAttnForward)
+  function. How do we incorporate it? A: We modularize the attention
+  computation components in order to support new low-level functions without
+  changing this layer interface. The cuDNN function supports the classic
+  dot-product attention with classic input dimensions. We will be able to use
+  it once TensorFlow adds an op for it.
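+
+As a concrete illustration of the einsum-notation approach described in the
+detailed design, the sketch below spells out one possible set of equations for
+the standard 1-D case (a single batch dimension and a single attention axis).
+The exact notations generated by the layer are an implementation detail; the
+equations and random kernels here are only an assumed example of the technique.
+
+```python
+import tensorflow as tf
+
+B, T, S, D = 2, 8, 4, 16   # batch, target length, source length, model dim
+N, H = 2, 8                # num_heads, key_size (per-head dimension)
+
+query = tf.random.normal([B, T, D])
+key = tf.random.normal([B, S, D])
+value = tf.random.normal([B, S, D])
+
+# Per-head projections expressed as einsum equations (EinsumDense applies the
+# same kind of equation with learned kernels; random kernels are used here).
+wq = tf.random.normal([D, N, H])
+wk = tf.random.normal([D, N, H])
+wv = tf.random.normal([D, N, H])
+q = tf.einsum('btd,dnh->btnh', query, wq)   # [B, T, N, H]
+k = tf.einsum('bsd,dnh->bsnh', key, wk)     # [B, S, N, H]
+v = tf.einsum('bsd,dnh->bsnh', value, wv)   # [B, S, N, H]
+
+# Scaled dot-product attention, also written as einsum equations.
+scores = tf.einsum('btnh,bsnh->bnts', q, k) / tf.math.sqrt(float(H))
+probs = tf.nn.softmax(scores, axis=-1)      # [B, N, T, S]
+context = tf.einsum('bnts,bsnh->btnh', probs, v)
+
+# Output projection back to the model dimension.
+wo = tf.random.normal([N, H, D])
+output = tf.einsum('btnh,nhd->btd', context, wo)  # [B, T, D]
+```
+
+Generalizing to multiple batch or attention axes only changes the index strings
+in these equations, which is why the layer generates them programmatically
+instead of hard-coding them.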
diff --git a/rfcs/yyyymmdd-rfc-template.md b/rfcs/yyyymmdd-rfc-template.md index c647c8eb3..ca386f181 100644 --- a/rfcs/yyyymmdd-rfc-template.md +++ b/rfcs/yyyymmdd-rfc-template.md @@ -2,6 +2,7 @@ | Status | (Proposed / Accepted / Implemented / Obsolete) | :-------------- |:---------------------------------------------------- | +| **RFC #** | [NNN](https://github.com/tensorflow/community/pull/NNN) (update when you have community PR #)| | **Author(s)** | My Name (me@example.org), AN Other (you@example.org) | | **Sponsor** | A N Expert (whomever@tensorflow.org) | | **Updated** | YYYY-MM-DD | @@ -33,15 +34,50 @@ idea, and list pros/cons to each approach. If there are alternatives that you have eliminated, you should also list those here, and explain why you believe your chosen approach is superior. -Factors to consider include: +Make sure you’ve thought through and addressed the following sections. If a section is not relevant to your specific proposal, please explain why, e.g. your RFC addresses a convention or process, not an API. -* performance implications -* dependencies -* maintenance -* platforms and environments impacted (e.g. hardware, cloud, other software - ecosystems) -* [compatibility](https://www.tensorflow.org/programmers_guide/version_compat) -* how will this change impact users, and how will that be managed? + +### Alternatives Considered +* Make sure to discuss the relative merits of alternatives to your proposal. + +### Performance Implications +* Do you expect any (speed / memory)? How will you confirm? +* There should be microbenchmarks. Are there? +* There should be end-to-end tests and benchmarks. If there are not (since this is still a design), how will you track that these will be created? + +### Dependencies +* Dependencies: does this proposal add any new dependencies to TensorFlow? +* Dependent projects: are there other areas of TensorFlow or things that use TensorFlow (TFX/pipelines, TensorBoard, etc.) that this affects? How have you identified these dependencies and are you sure they are complete? If there are dependencies, how are you managing those changes? + +### Engineering Impact +* Do you expect changes to binary size / startup time / build time / test times? +* Who will maintain this code? Is this code in its own buildable unit? Can this code be tested in its own? Is visibility suitably restricted to only a small API surface for others to use? + +### Platforms and Environments +* Platforms: does this work on all platforms supported by TensorFlow? If not, why is that ok? Will it work on embedded/mobile? Does it impact automatic code generation or mobile stripping tooling? Will it work with transformation tools? +* Execution environments (Cloud services, accelerator hardware): what impact do you expect and how will you confirm? + +### Best Practices +* Does this proposal change best practices for some aspect of using/developing TensorFlow? How will these changes be communicated/enforced? + +### Tutorials and Examples +* If design changes existing API or creates new ones, the design owner should create end-to-end examples (ideally, a tutorial) which reflects how new feature will be used. Some things to consider related to the tutorial: + - The minimum requirements for this are to consider how this would be used in a Keras-based workflow, as well as a non-Keras (low-level) workflow. If either isn’t applicable, explain why. + - It should show the usage of the new feature in an end to end example (from data reading to serving, if applicable). 
Many new features have unexpected effects in parts far away from the place of change that can be found by running through an end-to-end example. TFX [Examples](https://github.com/tensorflow/tfx/tree/master/tfx/examples) have historically been good in identifying such unexpected side-effects and are as such one recommended path for testing things end-to-end. + - This should be written as if it is documentation of the new feature, i.e., consumable by a user, not a TensorFlow developer. + - The code does not need to work (since the feature is not implemented yet) but the expectation is that the code does work before the feature can be merged. + +### Compatibility +* Does the design conform to the backwards & forwards compatibility [requirements](https://www.tensorflow.org/programmers_guide/version_compat)? +* How will this proposal interact with other parts of the TensorFlow Ecosystem? + - How will it work with TFLite? + - How will it work with distribution strategies? + - How will it interact with tf.function? + - Will this work on GPU/TPU? + - How will it serialize to a SavedModel? + +### User Impact +* What are the user-facing changes? How will this feature be rolled out? ## Detailed Design diff --git a/sigs/addons/RELEASE.md b/sigs/addons/RELEASE.md index 75753f20c..02973bb1b 100644 --- a/sigs/addons/RELEASE.md +++ b/sigs/addons/RELEASE.md @@ -4,14 +4,10 @@ SIG Addons release process consists of the folowing steps: 1. Create new rX.X branch on tensorflow/addons 2. Create and merge a new PR into the release branch * Set the correct version and suffix in [version.py](https://github.com/tensorflow/addons/blob/master/tensorflow_addons/version.py) - * Freeze the tensorflow version in [requirements.txt](https://github.com/tensorflow/addons/blob/master/requirements.txt) - * Remove `--nightly` flag from [release scripts](https://github.com/tensorflow/addons/tree/master/tools/ci_build/builds) - * Compile the docs: [instructions](https://github.com/tensorflow/addons/tree/master/tools/docs) -3. Trigger [Travis build](https://travis-ci.org/tensorflow/addons) - * This will test and build linux+macos wheels and publish to PyPi -4. Publish and tag a [release on Github](https://github.com/tensorflow/addons/releases) +3. Publish and tag a [release on Github](https://github.com/tensorflow/addons/releases) * Add updates for new features, enhancements, bug fixes * Add contributors using `git shortlog ..HEAD -s` + * **NOTE: This will trigger a GitHub action to release the wheels on PyPi** ## SIG Addons Release Team @@ -19,3 +15,4 @@ SIG Addons release process consists of the folowing steps: Current Release Team: - Sean Morgan - GitHub: [@seanpmorgan](https://github.com/seanpmorgan) - PyPI: [seanmorgan](https://pypi.org/user/seanmorgan/) - Yan Facai(颜发才) - GitHub: [@facaiy](https://github.com/facaiy) - PyPI: [facaiy](https://pypi.org/user/facaiy/) + \ No newline at end of file diff --git a/sigs/build/CHARTER.md b/sigs/build/CHARTER.md index 9df6ecc2a..3531475ca 100644 --- a/sigs/build/CHARTER.md +++ b/sigs/build/CHARTER.md @@ -19,7 +19,7 @@ Archives of the mailing list will be publicly accessible. 
## Contacts -* Project leads: Jason Zaman [@perfinion](https://github.com/perfinion), Austin Anderson [@angersson](https://github.com/angersson) +* Project leads: Jason Zaman [@perfinion](https://github.com/perfinion), Austin Anderson [@angerson](https://github.com/angerson) * For administrative questions, contact Edd Wilder-James @ewilderj - ewj at google diff --git a/sigs/build/tensorflow-testing.md b/sigs/build/tensorflow-testing.md index a25ee06bf..896e610d4 100644 --- a/sigs/build/tensorflow-testing.md +++ b/sigs/build/tensorflow-testing.md @@ -12,11 +12,12 @@ TensorFlow is truly a community effort, and **we would love to have your feedbac ### 🐞 Report a Bug -Please submit all bugs, errors, and pecularities on GitHub. Differences between documentation and implementation, lack of +Please submit all bugs, errors, and peculiarities on GitHub. Differences between documentation and implementation, lack of documentation, performance issues, or compatibility problems are all fair game. Please be specific and include all information that would be helpful to debug the issue using our issue templates: -* **[Bug / Performance Issue](https://github.com/tensorflow/tensorflow/issues/new?template=00-bug-performance-issue.md)** +* **[Bug Issue](https://github.com/tensorflow/tensorflow/issues/new?labels=type%3Abug&template=00-bug-issue.md)** +* **[Performance Issue](https://github.com/tensorflow/tensorflow/issues/new?labels=type%3Aperformance&template=80-performance-issue.md)** * **[Build / Installation Issue](https://github.com/tensorflow/tensorflow/issues/new?template=10-build-installation-issue.md)** * **[Documentation Issue](https://github.com/tensorflow/tensorflow/issues/new?template=20-documentation-issue.md)** * **[Other Issue - Not Listed](https://github.com/tensorflow/tensorflow/issues/new?template=50-other-issues.md)** @@ -37,7 +38,7 @@ If you would like to submit general feedback about TensorFlow (and in particular **Friction logs** are documents that describe the frustrations and delights of a product, focused around a specific use case (for example, creating an LSTM model for text classification). They're also intended to be brutally honest - feel free to vent or to praise! 😊 -An template and example of a TensorFlow friction log can be found [here](https://docs.google.com/document/d/1HVG3t-mgGZKU4iMeguTWGejbnQ54qUTXwdCFkA5xHG0/edit?usp=sharing). +A template and example of a TensorFlow friction log can be found [here](https://docs.google.com/document/d/1HVG3t-mgGZKU4iMeguTWGejbnQ54qUTXwdCFkA5xHG0/edit?usp=sharing). Once you have completed such a document, please email it to our [testing team](mailto:testing@tensorflow.org). diff --git a/sigs/graphics/CHARTER.md b/sigs/graphics/CHARTER.md new file mode 100644 index 000000000..79d4bad13 --- /dev/null +++ b/sigs/graphics/CHARTER.md @@ -0,0 +1,65 @@ +# SIG Graphics + +## Goal + +Facilitate community-contributed graphics, 3D and geometric components to [tensorflow/graphics](http://github.com/tensorflow/graphics), and enable researchers in the differentiable 3D, differentiable computer graphics, and neural rendering fields. + +## Objectives + +* Foster research on differentiable 3D, differentiable computer graphics and neural rendering +* Create a surface of collaboration for expert practitioners and enthusiasts excited to push the state-of-the-art in those areas. 
+* Build standard structures and implementations for 3D and graphics components
+* Provide a central place to contribute and access tooling and implementations for the computer vision and computer graphics communities
+* Host academic work in a sub-folder of the repository, where contributors can have their work available to the public and integrated with the rest of the TensorFlow Graphics repository.
+* Make it easier for research results to be implemented in the industry.
+
+## Resources
+
+* [TensorFlow Graphics code repository](https://github.com/tensorflow/graphics)
+* [TensorFlow Graphics documentation](https://www.tensorflow.org/graphics/overview)
+* [Original SIG proposal](https://docs.google.com/document/d/1RBnBuTb0eZropAeawwQKNwqE94mtp7InAvM2y_hitIk/edit#)
+
+## Contacts
+
+* Project leads:
+  * Julien Valentin [@julienvalentin](https://github.com/julienvalentin), Google
+  * Andrea Tagliasacchi [@ataiya](https://github.com/ataiya/), Google
+  * Sofien Bouaziz [@sofienbouaziz](https://github.com/sofienbouaziz), Google
+  * [Derek Nowrouzezahrai](http://www.cim.mcgill.ca/~derek/), McGill
+
+* For TensorFlow questions, contact Paige Bailey, [@dynamicwebpaige](https://github.com/dynamicwebpaige) - webpaige at google dot com
+* For administrative questions, contact Thea Lamkin, [@theadactyl](https://github.com/theadactyl) - thealamkin at google dot com
+
+## Membership
+
+We encourage any researchers and developers working in the computer graphics and computer vision communities to participate in this SIG. Whether you are conducting academic research or building industry applications in this space, we welcome your feedback on and contributions to TensorFlow Graphics’ components and central tooling, and are eager to hear about any downstream research results and implementations of our code.
+We have multiple channels for participation, and publicly archive discussions in graphics@ and graphics-dev@:
+
+* graphics@tensorflow.org -- our general mailing list that all are welcome to join
+* graphics-dev@tensorflow.org -- mailing list for active contributors to TF Graphics tooling
+* graphics-core@tensorflow.org -- mailing list for core SIG leaders in academia
+
+### Founding Members
+
+* Paige Bailey (Google)
+* Luke Barrington (Google)
+* Thabo Beeler (Google)
+* Sofien Bouaziz (Google)
+* Fredo Durand (MIT)
+* Leonidas Guibas (Stanford)
+* Otmar Hilliges (ETH Zurich)
+* Tzu-Mao Li (MIT)
+* Matthias Niessner (TUM)
+* Derek Nowrouzezahrai (McGill)
+* Avneesh Sud (Google)
+* Andrea Tagliasacchi (Google)
+* Christian Theobalt (MPI-Inf)
+* Justus Thies (TUM)
+* Julien Valentin (Google)
+* Thomas Vetter (Unibas)
+
+## Code of Conduct
+
+As with all forums and spaces related to TensorFlow, SIG Graphics is subject to
+the [TensorFlow Code of
+Conduct](https://github.com/tensorflow/tensorflow/blob/master/CODE_OF_CONDUCT.md).
diff --git a/sigs/io/CHARTER.md b/sigs/io/CHARTER.md index ffe98ee15..90913ceec 100644 --- a/sigs/io/CHARTER.md +++ b/sigs/io/CHARTER.md @@ -33,7 +33,7 @@ Information about SIG IO releases and the release team could be found in [RELEAS * Project leads: - Yong Tang [@yongtang](https://github.com/yongtang) - yong.tang.github@outlook.com - Anthony Dmitriev [@dmitrievanthony](https://github.com/dmitrievanthony) - dmitrievanthony@gmail.com -* TensorFlow technical contact [@mrry](https://github.com/mrry) - mrry@google.com +* TensorFlow technical contact [@jsimsa](https://github.com/jsimsa) - jsimsa@google.com * For administrative questions, contact Edd Wilder-James [@ewilderj](https://github.com/ewilderj) - ewj at google diff --git a/sigs/io/RELEASE.md b/sigs/io/RELEASE.md index aefe82b8b..f8647e18e 100644 --- a/sigs/io/RELEASE.md +++ b/sigs/io/RELEASE.md @@ -23,18 +23,18 @@ At the moment Python package (whl files) is created automatically, upon each successful Travis CI on master branch. At the end of each Travis CI build on master branch, all whl files (2.7, 3.4, 3.5, 3.6, 3.7 on Linux and 2.7 on macOS) are pushed to -Bintray and are available in: +Dropbox and are available in: -https://dl.bintray.com/tensorflow-io/tensorflow-io-nightly/ +https://www.dropbox.com/sh/dg0npidir5v1xki/AACor-91kbJh1ScqAdYpxdEca?dl=0 To perform a release in PyPI, first make sure the binary whl files are the correct one from corresponding Travis CI build number. This could be verified by checking the Travis CI history where at the end of the log, the sha256 of all whl files are calculated and displayed. The sha256 of each file displayed on Travis CI log should match the sha256 -of the files downloaded from Bintray. +of the files downloaded from Dropbox. -Once sha256 are verified against every whl files on Bintray, perform +Once sha256 are verified against every whl files on Dropbox, perform a sanity check, then upload all of the whl files (2.7, 3.4, 3.5, 3.6, 3.7 on Linux and 2.7 on macOS) to PyPI.org: @@ -45,10 +45,11 @@ twine upload *.whl ## CRAN R Package Release Before submitting the R package to CRAN, manually perform and check the following items: -* Make sure the documentation in `README.md` and `vignettes` is up-to-date +* Make sure the documentation in `README.md`, `docs/`, and `vignettes/` is up-to-date * Update `Version` field in `DESCRIPTION` file * Update `NEWS.md` to include items for this new release -* Run `devtools::check()` and fix all the notable issues, especially warnings and errors +* Run `devtools::check()` and fix all the notable issues, especially warnings and errors that appear +either locally or in [CRAN package check result](https://cran.r-project.org/web/checks/check_results_tfio.html) * Update `cran-comments.md` to include any unsolvable issues from `devtools::check()` and other comments/responses to CRAN maintainers * Run checks on R-hub via `devtools::check_rhub()` and on win-builder via `devtools::check_win_devel()`. 
This is
@@ -82,3 +83,4 @@ Current Release Team:
 - Anthony Dmitriev - GitHub: [@dmitrievanthony](https://github.com/dmitrievanthony) - PyPI: [dmitrievanthony](https://pypi.org/user/dmitrievanthony)
 - Yuan (Terry) Tang - GitHub: [@terrytangyuan](https://github.com/terrytangyuan) - PyPI: [terrytangyuan](https://pypi.org/user/terrytangyuan)
 - Bryan Cutler - GitHub: [@BryanCutler](https://github.com/BryanCutler) - PyPI: [cutlerb](https://pypi.org/user/cutlerb)
+- Aleksey Vlasenko - GitHub: [@vlasenkoalexey](https://github.com/vlasenkoalexey) - PyPI: [vlasenkoalexey](https://pypi.org/user/vlasenkoalexey)
diff --git a/sigs/jvm/rfcs/20190802-java-repositories.md b/sigs/jvm/rfcs/20190802-java-repositories.md
new file mode 100644
index 000000000..953744976
--- /dev/null
+++ b/sigs/jvm/rfcs/20190802-java-repositories.md
@@ -0,0 +1,77 @@
+# Java Repositories
+| Status        | Accepted |
+:-------------- |:---------------------------------------------------- |
+| **Author**    | Karl Lessard (karl.lessard@gmail.com) |
+| **Sponsor**   | James Ring (Google) |
+| **Updated**   | 2019-08-02 |
+
+## Objective
+
+Create new repositories under the `github.com/tensorflow` organization to host the code supported by SIG JVM, including the
+actual Java client found in the TensorFlow core repository.
+
+## Motivation
+
+In the spirit of TensorFlow modularization, one main goal of SIG JVM is to migrate the [TensorFlow Java client](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/java)
+to its own repository so it can evolve and be released independently from the TensorFlow core libraries (internally, the SIG calls this migration the *Jexit*, which is self-explanatory).
+
+Additionally, some repositories are also requested to distribute high-level abstractions of TensorFlow in Java that will also evolve independently
+from the client and have their own release cycles.
+
+## User Benefit
+
+Having repositories outside the TensorFlow core will help in the development of some major changes to the client architecture,
+which might include the complete replacement of its native binding layer. Doing such experimentation in the main repository is certainly not advised.
+
+Also, having distinct repositories should allow the SIG to take part in the code review process, which could unblock
+new features developed by its members more quickly and distribute them as soon as the community agrees.
+
+It is also important to note that the *Jexit* is a good candidate to start TensorFlow modularization because the Java client already relies heavily
+on the C ABI for its interaction with the TensorFlow core libraries.
+
+## Design Proposal
+
+The current request focuses on the creation of the two following repositories:
+
+### /tensorflow/java
+
+This is the main repository for hosting TF Java code. It will consist of multiple modules that will all be released together and built with Maven.
+
+Right now, the list of modules planned for this repository is:
+
+#### core
+
+All artifacts composing the actual Java client, including the Java code, its native layer, and the different generators used to create Java classes dynamically at compile time, including TF operation wrappers. Each of these components will also be released as separate modules.
+
+#### nio
+
+A self-contained Java library that provides advanced support for large buffer I/O operations (exceeding 2^32 - 1 bytes) and for n-dimensional data structures.
+
+At some point, the Java client core will also be based on this library to improve I/O performance and usability.
+The `nio` name comes from the similarities between this library and the [`java.nio`](https://docs.oracle.com/javase/8/docs/api/java/nio/package-summary.html)
+package found in the JDK, which unfortunately lacks support for 64-bit indexing.
+
+#### model-framework
+
+A proper abstraction API (e.g. GraphRunner) that hides the raw tensors and so can be used by non-machine-learning experts.
+In the future those libraries will also allow using the models in a transfer-learning setting with TensorFlow Java.
+
+More details in the next section.
+
+#### keras
+
+An adaptation of the Keras library to Java, which will serve as the main API for training with TF Java.
+
+### /tensorflow/java-models
+
+The java-models repository will contain Java inference libraries for various pre-trained TensorFlow models, based on the Java
+TF model framework.
+
+This repository hosts a set of Java libraries for loading and running inference with various pre-trained TensorFlow models.
+It provides quick reference integrations for some of the popular TensorFlow models such as object detection, pose estimation, face detection, and the like.
+
+The java-models repository will provide OOTB utilities for Java developers to jump-start using various pre-trained models, archived locally and hosted online.
+For example, they can use any of the object-detection models in https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/detection_model_zoo.md by
+just passing in the URI.
+
+We will try to add models that complement the existing set of models and can be used as building blocks in other apps.
diff --git a/sigs/micro/CHARTER.md b/sigs/micro/CHARTER.md
index c7188aa62..0db1068a3 100644
--- a/sigs/micro/CHARTER.md
+++ b/sigs/micro/CHARTER.md
@@ -12,8 +12,10 @@ Archives of the mailing list will be publicly available.

 ## Resources

-* [sig-micro mailing list](https://groups.google.com/a/tensorflow.org/forum/#!forum/micro)
-* [TensorFlow Micro code](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/experimental/micro)
+* [SIG Micro mailing list](https://groups.google.com/a/tensorflow.org/forum/#!forum/micro)
+* [SIG Micro Gitter chat channel](https://gitter.im/tensorflow/sig-micro)
+* [TensorFlow Micro code](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/micro)
+* [SIG Micro monthly meeting notes and agenda](https://goo.gle/tf-micro-notes)

 ## Contacts

@@ -23,6 +25,6 @@ Archives of the mailing list will be publicly available.

 ## Code of Conduct

-As with all forums and spaces related to TensorFlow, SIG-Micro is subject to
+As with all forums and spaces related to TensorFlow, SIG Micro is subject to
 the [TensorFlow Code of
 Conduct](https://github.com/tensorflow/tensorflow/blob/master/CODE_OF_CONDUCT.md).
diff --git a/sigs/mlir/CHARTER.md b/sigs/mlir/CHARTER.md
index 7543d35fa..38c82bfad 100644
--- a/sigs/mlir/CHARTER.md
+++ b/sigs/mlir/CHARTER.md
@@ -1,4 +1,4 @@
-# Proposed name: SIG MLIR
+# MLIR

 ## Objective

@@ -17,6 +17,7 @@ Anyone involved in or interested in high performance compilers and their applica

 * Mailing list: [mlir@tensorflow.org](https://groups.google.com/a/tensorflow.org/forum/#!forum/mlir)
 * GitHub repo: [tensorflow/mlir](https://github.com/tensorflow/mlir)
+* [Agenda document](https://docs.google.com/document/d/1y_9f1AbfgcoVdJh4_aM6-BaSHvrHl8zuA5G4jv_94K8) containing the meeting notes and recordings.

 ## Collaboration

@@ -29,7 +30,7 @@ We plan to start by focusing on the Graph Compiler.
As such, we’ll have a 45 m ## Contacts -* Project lead: Tatiana Shpeisman [@tatianashp](https://github.com/tatianashp) - speisman at google +* Project lead: Tatiana Shpeisman [@tatianashp](https://github.com/tatianashp) - shpeisman at google * Project manager: Pankaj Kanwar [@pkanwar23](https://github.com/pkanwar23) - pkanwar at google * For administrative questions, contact Edd Wilder-James [@ewilderj](https://github.com/ewilderj) - ewj at google diff --git a/sigs/testing/README.md b/sigs/testing/README.md index dfad336f6..920ce68bc 100644 --- a/sigs/testing/README.md +++ b/sigs/testing/README.md @@ -6,9 +6,11 @@ A key element of the evolution of TensorFlow is TF 2.0, which is primarily focus * Making TensorFlow more intuitive and easier to debug; and * Continuing to enable scalable production deployment. -**TF 2.0 is just in the beginning of this transition. ** +**TF 2.0 is just in the beginning of this transition.** -You can experiment with the alpha release today! Please let us know what you create, and what the experience is like. Over the next few months, we will be focused on making it RC/production ready both internally and externally; making TF 2.0 compatible with the rest of TensorFlow ecosystem; and sharing that journey with you. _We’d love for you to join us and help us!_ +You can experiment with the alpha release today! Please let us know what you create, and what the experience is like. Over the next few months, we will be focused on making it RC/production ready both internally and externally; making TF 2.0 compatible with the rest of TensorFlow ecosystem; and sharing that journey with you. + +_We’d love for you to join us and help us!_ ## Installation Instructions diff --git a/sigs/testing/faq.md b/sigs/testing/faq.md index 615158b97..d19633807 100644 --- a/sigs/testing/faq.md +++ b/sigs/testing/faq.md @@ -1,6 +1,6 @@ # TensorFlow 2.0: An Overview -Last Updated: _Mar 6, 2019_ +Last Updated: _September 10, 2019_ A key element of the evolution of TensorFlow (TF) is TF 2.0, which is primarily focused on: @@ -103,7 +103,7 @@ Yes. Use tf.disable_eager_execution() or tf.compat.v1.disable_eager_execution(). **Where can I find a style guide for TensorFlow 2.0?** -There are multiple changes in TensorFlow 2.0 to help support end-user productivity. For a style guide including best practices for API clean-up, @tf.function, see [Effective TF 2.0 Style Guide](https://github.com/tensorflow/docs/blob/master/site/en/r2/guide/effective_tf2.md) and this accompanying [blog post](https://medium.com/tensorflow/effective-tensorflow-2-0-best-practices-and-whats-changed-a0ca48767aff). +There are multiple changes in TensorFlow 2.0 to help support end-user productivity. For a style guide including best practices for API clean-up, @tf.function, see [Effective TF 2.0 Style Guide](https://www.tensorflow.org/beta/guide/effective_tf2) and this accompanying [blog post](https://medium.com/tensorflow/effective-tensorflow-2-0-best-practices-and-whats-changed-a0ca48767aff). **Where can I find a mapping of all API symbols in TensorFlow 1.x to their equivalents in TF 2.0, TensorFlow Probability, addons, etc.?** @@ -122,11 +122,11 @@ An upgrade utility called tf_upgrade_v2 is included with every install of Tensor **How do I convert my code from tf.Session, tf.cond, etc., to @tf.function?** -See [Effective TF 2.0 Style Guide](https://github.com/tensorflow/docs/blob/master/site/en/r2/guide/effective_tf2.md). +See [Effective TF 2.0 Style Guide](https://www.tensorflow.org/beta/guide/effective_tf2). 
**Where can I find a list of all of the changes in TensorFlow 2.0?** -You can find the API symbol 1:1 map [here](https://docs.google.com/spreadsheets/d/1FLFJLzg7WNP6JHODX5q8BDgptKafq_slHpnHVbJIteQ/edit#gid=0), RFCs on Github, and the [Effective TF 2.0 Style Guide](https://github.com/tensorflow/docs/blob/master/site/en/r2/guide/effective_tf2.md). +You can find the API symbol 1:1 map [here](https://docs.google.com/spreadsheets/d/1FLFJLzg7WNP6JHODX5q8BDgptKafq_slHpnHVbJIteQ/edit#gid=0), RFCs on Github, and the [Effective TF 2.0 Style Guide](https://www.tensorflow.org/beta/guide/effective_tf2). **How long will TensorFlow 1.x be supported?** @@ -159,8 +159,8 @@ TensorFlow 2.0 API documentation can be found [here](https://www.tensorflow.org/ We recommend that greenfield projects should begin using TensorFlow 2.0. Here’s how to get started: - [Udacity Course](https://www.udacity.com/course/intro-to-tensorflow-for-deep-learning--ud187) -- [DeepLearning.ai Course](https://www.deeplearning.ai/tensorflow-specialization/) -- [TensorFlow Documentation](https://www.tensorflow.org/alpha) +- [DeepLearning.ai Course](https://www.deeplearning.ai/tensorflow-in-practice/) +- [TensorFlow Documentation](https://www.tensorflow.org/beta/) **I’ve noticed a problem with TensorFlow 2.0. How can I file an issue?** @@ -231,7 +231,7 @@ The original TensorFlow API’s approach to variables had many drawbacks. As det **What’s the deal with collections?** -Global collections have been removed in TensorFlow 2.0, in favor of variable garbage collecting. For more on variables in TF 2.0, and how they’ve changed since TF 1.x, please refer to the [Effective TF 2.0 Style Guide](https://github.com/tensorflow/docs/blob/master/site/en/r2/guide/effective_tf2.md). +Global collections have been removed in TensorFlow 2.0, in favor of variable garbage collecting. For more on variables in TF 2.0, and how they’ve changed since TF 1.x, please refer to the [Effective TF 2.0 Style Guide](https://www.tensorflow.org/beta/guide/effective_tf2). **I use PyTorch, but would like to try TF 2.0. Is there a migration guide?** @@ -250,10 +250,10 @@ No, but we are in the process of creating one. Please reach out through the [Ten We will have several models ready for the alpha release (some CPU, some single-node GPU, and some available on a cluster of GPUs). You can track the bugs listed below for more information about timelines and implementation details. 
- [ResNet50 v1.5 & Resnet56 CIFAR-10](https://github.com/tensorflow/tensorflow/issues/25340) -- [NMT Model](https://github.com/tensorflow/tensorflow/issues/25343) ([Example Colab](https://colab.sandbox.google.com/github/tensorflow/docs/blob/master/site/en/r2/tutorials/sequences/nmt_with_attention.ipynb)) -- [Pix2Pix](https://github.com/tensorflow/examples/tree/master/tensorflow_examples/models/pix2pix) ([Example Colab](https://colab.sandbox.google.com/github/tensorflow/docs/blob/master/site/en/r2/tutorials/generative/pix2pix.ipynb)) +- [NMT Model](https://github.com/tensorflow/examples/tree/master/tensorflow_examples/models/nmt_with_attention) ([Example Colab](https://www.tensorflow.org/tutorials/text/nmt_with_attention)) +- [Pix2Pix](https://github.com/tensorflow/examples/tree/master/tensorflow_examples/models/pix2pix) ([Example Colab](https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/tutorials/generative/pix2pix.ipynb)) - [DenseNet](https://github.com/tensorflow/examples/tree/master/tensorflow_examples/models/densenet) -- [Dcgan](https://github.com/tensorflow/examples/tree/master/tensorflow_examples/models/dcgan) ([Example Colab](https://colab.sandbox.google.com/github/tensorflow/docs/blob/master/site/en/r2/tutorials/generative/dcgan.ipynb)) +- [Dcgan](https://github.com/tensorflow/examples/tree/master/tensorflow_examples/models/dcgan) ([Example Colab](https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/tutorials/generative/dcgan.ipynb)) - [NCF Model](https://github.com/tensorflow/tensorflow/issues/25344) **Will static graphs still be supported in TensorFlow 2.0?** @@ -262,13 +262,13 @@ Yes. For an example, please refer to testing estimator ResNet56 Cifar-10 with co **While Keras is exciting, what other options will I have for building fully-customizable models?** -You can do a lot with Keras, including [subclassing layers](https://medium.com/tensorflow/what-are-symbolic-and-imperative-apis-in-tensorflow-2-0-dfccecb01021), or writing your own training logic by subclassing Model. If you are a framework developer and you need to be free of the conventions Keras’ classes impose, take a look at tf.Module. ([Variables](https://github.com/tensorflow/docs/blob/master/site/en/r2/guide/variables.md), [Custom Layers](https://critique.corp.google.com/#review/231992098/depot/google3/third_party/py/tensorflow_docs/g3doc/en/tutorials/eager/custom_layers.ipynb)) +You can do a lot with Keras, including [subclassing layers](https://medium.com/tensorflow/what-are-symbolic-and-imperative-apis-in-tensorflow-2-0-dfccecb01021), or writing your own training logic by subclassing Model. If you are a framework developer and you need to be free of the conventions Keras’ classes impose, take a look at tf.Module. ([Variables](https://www.tensorflow.org/guide/variable), [Custom Layers](https://www.tensorflow.org/tutorials/customization/custom_layers)) **What options do I have for distributed training with TensorFlow 2.0?** You can train your TensorFlow 2.0 models with multiple GPUs today, using distribution strategies. For more information on distributed training, be sure to check out the [TensorFlow 2.0 Project Tracker](https://github.com/orgs/tensorflow/projects/4) on Github, and search for the keyword “dist-strat”. -For further information, see our tutorials [here](https://github.com/tensorflow/docs/tree/master/site/en/r2/tutorials/distribute) and [here](https://github.com/tensorflow/examples/tree/master/tensorflow_examples/models/densenet). 
+For further information, see our tutorials [here](https://github.com/tensorflow/docs/tree/master/site/en/tutorials/distribute) and [here](https://github.com/tensorflow/examples/tree/master/tensorflow_examples/models/densenet).

 **How can I deploy TF 2.0 models to other platforms (TF.js, TensorFlow Lite, etc.)?**

@@ -280,7 +280,7 @@ No.

 **What is the preferred format for saving a TF model, going forward? (saved_model or HD5)**

-Saved_model is the preferred format. For more on exporting, restoring, and running a saved model in [TensorFlow 2.0](https://www.tensorflow.org/r2/tutorials/beginner/tf2_overview#export_a_savedmodel). This format is compatible with TensorFlow.js, TFLite, and more.
+Saved_model is the preferred format. For more on exporting, restoring, and running a saved model, see the [TensorFlow 2.0 guide](https://www.tensorflow.org/guide/saved_model). This format is compatible with TensorFlow.js, TFLite, and more.

 **Will the TensorFlow team convert all checkpoints available in the tensorflow/models repo to HD5 or saved_model?**