diff --git a/governance/api-reviews.md b/governance/api-reviews.md
new file mode 100644
index 000000000..61dc3e64e
--- /dev/null
+++ b/governance/api-reviews.md
@@ -0,0 +1,202 @@

# tensorflow/api-owners review practices

## Overview

This is an attempt to gather the topics most commonly discussed in API reviews. It will hopefully be a useful resource both to API owners and to people proposing API changes. [TF API Owners](https://github.com/orgs/tensorflow/teams/api-owners) meet twice weekly to discuss changes. We try to get to PRs by the next meeting, but we don't always make it all the way through the queue. If your change is particularly urgent, please ping us on the PR.

## Process

We only look at changes which have already been approved by other reviewers. If there are major outstanding comments, we hold off on the API review until those are resolved. If you have questions for the API owners, raise them explicitly in the comments to get an answer.

## High level points

### Backward and forward compatibility

We avoid backwards-incompatible API changes. We also avoid backwards-incompatible behavior changes, such as restricting the set of valid inputs to a function or extending the set of valid outputs of a function. Adding support for previously unsupported behavior is okay, as are changes to explicitly experimental APIs (within reason). When we need to provide new or different behavior, we strongly prefer a new version of the API over breaking backwards compatibility. Note that we are free to deprecate APIs; we just cannot break code which relies on their documented behavior. We need to worry about backward compatibility both of our Python APIs and of the serialized GraphDefs, and in general breaking serialized GraphDefs is worse than breaking the Python APIs.

Forward compatibility is more subtle: we should avoid changing the graph produced by currently correct Python code without three weeks' notice. This comes up most frequently when adding new ops, but it also applies to non-obvious things such as the graph emitted by gradients or pfor.

Including the name "experimental" in an API endpoint allows you to break backwards compatibility in the future, as noted above. However, we still prefer to mark an experimental API as deprecated for one release before removing it in the subsequent release. Please do not use the experimental namespace as an escape hatch for bad code or for functionality you don't-really-want-but-need-for-now; experimental APIs should be APIs that we intend to make standard in the near future, but that have some lingering questions to resolve first.

### Docstrings

TF APIs should have comprehensive documentation in the form of docstrings. If at all possible these docstrings should have runnable examples, and these examples should form a doctest so they stay correct. The examples should demonstrate an end-to-end user workflow, such that it's clear how to generate the necessary inputs for the API and what to do with its outputs. The docstring should be understandable by someone who is not familiar with TF. See the [guide to writing TF docstrings](https://www.tensorflow.org/community/contribute/docs_ref) for more information.
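For instance, a minimal sketch of such a docstring (`row_means` is a hypothetical function invented for illustration, not an existing TF API):

```python
import tensorflow as tf

def row_means(matrix):
  """Computes the mean of each row of `matrix`.

  Usage example (runnable as a doctest):

  >>> x = tf.constant([[1.0, 3.0], [2.0, 6.0]])
  >>> row_means(x)
  <tf.Tensor: shape=(2,), dtype=float32, numpy=array([2., 4.], dtype=float32)>

  Args:
    matrix: A rank-2 floating-point `tf.Tensor`.

  Returns:
    A rank-1 `tf.Tensor` containing the mean of each row of `matrix`.
  """
  return tf.reduce_mean(matrix, axis=1)
```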
Our documentation generator for classes only sees methods, so prefer exposing members as properties rather than assigning them in `__init__`.

Docstrings should only refer to other public TF API symbols (i.e., do not refer to other symbols defined in the same file as a function which is only now being made public), and should refer to public API symbols by their full exported names.

### Common names

Prefer `keepdims` over `keep_dims`. Prefer `axis` over `dim`. Data types are called `dtype`. `name` is a common last argument of ops, but backward compatibility mandates that new arguments be added after the last existing argument, even if that results in `name` no longer being last.

We generally prefer spelling things out over using abbreviations, except when abbreviations are more standard than the spelled-out form (e.g., don't spell out linalg or svd). When in doubt we ask domain experts or use web search to see which spelling is most common.

If possible we prefer to name things in a similar way to numpy (e.g., we would not pick einsum as a name, but numpy and others before it have, and that precedent is very strong).

We prefer experimental namespaces (e.g. tf.data.experimental.foobar) over experimental-prefixed names (e.g. tf.data.experimental_foobar), except when adding an experimental class method or an experimental argument. Experimental endpoints should be deprecated in a minor release before they can be removed in the next. We would like new experimental symbols to be things which will eventually end up in core TF, as opposed to things we expect will be phased out with no clear replacement. The best expectation to have for an experimental endpoint is that the "experimental" will simply be removed; if you don't believe that will work, the endpoint should probably not be added in its current form.

### Style

Generally, follow Google style.

Avoid redundancy. Do not write arguments of the form `function(..., enable_feature=False, feature_config=None)` if you can instead write `function(..., feature_config=None)`, where implicitly `enable_feature = feature_config is not None`.

Try to embed well in the ambient language. Think about how your API interacts with language idioms (e.g., in Python: can it be hashed, i.e., used as a dict key? Is it iterable? Is it a mapping? Can it be equality compared? Ordered?). Think about how your API interacts with other pieces of the Python ecosystem as well: is there an analogue in numpy or PyTorch that we should consider aligning with?

Use language-native constructs wherever you can. In Python, a tuple should be a tuple. The bar for custom configuration objects is relatively high; a dict or namedtuple goes a long way.

In particular, do not expose protobufs directly as part of an API. You can use protobufs for serialization or to encode network traffic. Protobufs should always be an implementation detail, never visible on the API surface. Use language-native constructs (dicts or classes for Python, structs for C/C++) if you need configuration objects.

Avoid global (or any non-local) state as much as possible (this includes Python `with` scopes). If you need global context, consider whether it can be thread-local. The TF API is supposed to be thread-safe. Avoid stateful operation (mutability) if you can. Both global state and mutability make code hard to reason about and make composability harder to achieve.

We prefer strings ("auto", "never", etc.) over enums (tf.namespace.AUTO, etc.). Strings are easier to type, and they force us to document all possible values and their semantics everywhere the string is accepted, rather than only in the enum definition; this is a little friendlier to users.
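To make the redundancy and strings-over-enums points concrete, a hedged sketch (the `process` and `resize` signatures are invented for illustration):

```python
# Redundant: the boolean duplicates information already in `feature_config`.
# def process(data, enable_feature=False, feature_config=None): ...

# Preferred: passing a config implies the feature is enabled.
# def process(data, feature_config=None): ...

def resize(image, method="auto"):
  """Resizes `image`.

  Args:
    image: A 3-D image tensor.
    method: One of "auto", "bilinear", or "nearest". "auto" lets the
      implementation choose. All accepted values are documented here, at
      the point of use, rather than only on an enum definition.
  """
```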
### Orthogonality and integration with the existing APIs

Is the new API implementable in terms of existing APIs? If so, we might want to consider pointing users to the existing APIs instead. Does the new API add enough value over a combination of existing APIs? Does the API solve only a narrow, specific problem (that's usually a sign that a combination of existing APIs would be preferred)?

If not, are existing APIs implementable in terms of the new API? If this is simple, we might want to steer users towards the new API and away from the old one (possibly deprecating the old APIs as the new one is introduced).

If neither is the case, it might be possible that there is a more general API which makes both the existing API and the new one easy to express. We try to keep the global consistency of our API in mind when reviewing new changes.

How will this API work together with others? Does it do something slightly differently than others? Does it expect inputs which match what other parts of TensorFlow produce? Does its output match what other parts of TensorFlow can consume?

Does it do things the same way other similar pieces of TensorFlow do? E.g., if a common pattern to achieve a behavior is an extra argument, don't use a function decorator to achieve the same thing in a different area of the API.

Two wrongs don't make a right. That is, if a bad API already exists in TF, that does not give new APIs license to be bad in the same way. Improvement must be balanced with consistency, however, and sometimes it's okay to carry small imperfections into new APIs for the sake of consistency with old ones.

### Does it belong in TF at all?

As TF evolves there's a tendency to put everything inside it, with costs compounding over the long term. If there is a reasonable home for a new API outside core TF (say in addons, io, TFP, or another project entirely), that can be strongly preferable. If new code can be released as an independent library, it should be. This is especially true for APIs that are actively evolving; core TF imposes many restrictions, so it's far better to trial new APIs outside of the core library.

## Adding new ops

Adding new ops to TF should be done with care. We generally prefer not to add new ops if possible, but performance, hardware compatibility, and other concerns often do require them.

When adding new ops, look for:

- closure under automatic differentiation (i.e. we avoid ops which are differentiable but not twice-differentiable, or which are technically differentiable but not marked as such; see the sketch after this list)
- performant kernels (it's better not to have an op than to have an op with a suboptimal kernel; we need to make sure kernel experts have reviewed the code)
- broadcasting (all numerical ops should broadcast using numpy rules)
- vectorization (does support for this op have to be added to pfor/vectorized_map?)
- dtype support (in general all numerical ops should support the common integer, floating point, and complex dtypes, where they all make sense; we need to watch out for int32 on GPUs though)
- device support (CUDA kernels should be implemented if possible, and similarly a tf/xla bridge entry should be added if it makes sense)
- attributes versus inputs (anything which can be an input to an operation should be an input; attributes should only be used to parametrize the operation in ways that affect the output dtypes or sometimes shapes)
- state management (is the op stateful? Can it instead be made stateless and rely on optimizations like memory reuse for performance? Can it be made to keep its state using one of the existing mechanisms like variables? If not, its state should be encapsulated using resource handles if at all possible)
- device-agnostic code (we generally don't like ops which are supported only on a single device, be it CPU, GPU, XLA, or TPU, and prefer to have at least a plan for writing device-agnostic code)
- composite tensor support (should the Python layer for this operation support RaggedTensor/SparseTensor?)
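For the differentiation bullet above, one way to spot-check that an op composes with higher-order gradients is a nested `tf.GradientTape`; a minimal sketch, with `tf.square` standing in for the new op:

```python
import tensorflow as tf

x = tf.constant(3.0)
with tf.GradientTape() as outer:
  outer.watch(x)
  with tf.GradientTape() as inner:
    inner.watch(x)
    y = tf.square(x)  # stand-in for the new op
  dy_dx = inner.gradient(y, x)  # first derivative: 2x -> 6.0
d2y_dx2 = outer.gradient(dy_dx, x)  # second derivative: 2.0
# `d2y_dx2` coming back as `None` would signal a missing higher-order gradient.
print(dy_dx.numpy(), d2y_dx2.numpy())
```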
diff --git a/governance/design-reviews.md b/governance/design-reviews.md
new file mode 100644
index 000000000..7fe586eff
--- /dev/null
+++ b/governance/design-reviews.md
@@ -0,0 +1,227 @@

# tf-design-reviews criteria

## Overview

The TensorFlow team has run internal and public design reviews for a while now. This document tries to capture the types of questions that get asked and the concerns that get addressed in TF design reviews. It is intended to be used by design authors as a way of spot-checking whether a design review will be useful to them, and by design sponsors as a way of making sure a design proposal clears the bar for review (ideally every topic in this document should be addressed by a proposal before it is considered).

The main goal of tf-design-reviews is to socialize big changes to TF, document them, and ensure all stakeholders get a chance to comment on planned improvements before they're implemented. Any time a change is made to TF that will affect multiple aspects of its design or user interface, we should solicit a design review. TF design reviews themselves are not binding: final approval rests with whoever has the authority to approve the required code changes, and the design review is a tool to build consensus and gather feedback on big changes before actual approval.

By default TF design reviews should go through the open RFC process in the tensorflow/community repository, but on rare occasions we will accept design reviews of google-internal TF-related infrastructure which must be kept private for reasons beyond our control.

## General considerations

Every item in this section should be addressed by a TF design review. We do not require a solution prior to review, but we do want to see that the design author has considered these issues. It is the design sponsor's job to ensure that review documents have thought them through.

### Performance

Performance is the core reason why most end users use TensorFlow at all; hand-writing code with the same level of performance is prohibitively expensive, and in principle any similarly- or better-performing solution can also be integrated into the ecosystem. In that vein, all new designs for TF should carefully consider their performance implications.

Performance in TF is multi-faceted: we need to worry about scaling from very small devices (including microcontrollers) to very large ones (beyond TPU pods); we need to worry about interactive usage (so the cost of making small changes should be small) and about batch usage (where it's okay to sacrifice some startup time to improve steady-state performance); we care both about throughput (maximizing accelerator utilization saves a lot of money) and about latency (as TF is used in all parts of Google's software stack); and we care about performance on many types of hardware.
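As a toy illustration of the interactive-versus-batch trade-off, a sketch (under the assumption that tracing dominates the first call) of measuring one-time `tf.function` tracing cost against steady-state throughput:

```python
import time
import tensorflow as tf

@tf.function
def step(x):
  return tf.matmul(x, x)

x = tf.random.normal([256, 256])

start = time.perf_counter()
step(x)  # the first call pays a one-time tracing cost
print("first call:", time.perf_counter() - start)

start = time.perf_counter()
for _ in range(100):
  step(x)  # later calls reuse the traced graph
print("100 steady-state calls:", time.perf_counter() - start)
```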
Can a particular design proposal be implemented efficiently? Does it impose any inherent limits on performance in any of the scenarios above? How will it interact with our other tools for performance (Grappler, XLA, Eigen, tf.data, etc.)?

### Scope

Does this proposal even belong in TF? As TF itself grows, it's becoming substantially more expensive to develop software inside TF than as a separate TF-using project. In this light we need to evaluate whether it's at all possible to release a broadly new API or library as its own separate project in the TF ecosystem.

Even separate projects in the TF ecosystem can benefit from TF's devrel, blog, Twitter, etc. for promotion. It might be possible to share resources dedicated to CI or GitHub triage, or to share infrastructure around syncing to/from google3.

Ideally the only things still being added to core TF are things which, if left out of core TF, would dramatically limit the usefulness of core TF itself. General protocols and APIs which different libraries in the TF ecosystem can implement or accept are good examples of things which undoubtedly belong in core TF. Ops and kernels used to need to be in core TF, but this is no longer the case, as other projects have sustainable releases of their own binary blobs and the TF team is working to make it cheaper to release ops and kernels outside core TF.

Note that we also encourage using the TF design review slot for reviewing proposals which, despite not living inside core TF, are expected to be part of the broader TF ecosystem.

### Programmability / flexibility

TensorFlow is fundamentally a library to be programmed, not a collection of packaged black-box solutions. While it's cheaper to solve any individual problem with a simple one-line push-button packaged solution, this tends to work poorly in the long run and leads to usability cliffs and undue API pressure.

For example, consider what would happen if, instead of providing tools to build neural network layers, TF only provided a function that built an entire network for you. At first we could have very impressively short code examples ("switch from inception to resnet50 in one line of code!"), but over time users whose use cases are not exactly covered by this API would either have to reimplement substantial parts of it themselves or (most likely) file bugs asking for small extensions to the API ("can we make resnet52? resnet36?"). Over time, these bottleneck APIs develop large lists of mutually exclusive parameters which amount to a poorly defined configuration language for how to use them.

A key consideration when evaluating a TF design proposal is what would happen for use cases that are slightly different from the ones covered in the proposal itself. The goal is not that the proposal should cover everything, but rather that it should be possible to easily reimplement parts of the proposal using lower-level APIs already in TF. If this is not the case, then instead of first implementing the end-to-end solution we need to discuss what low-level APIs TF should have so that this proposal could be easily implemented on top of them, and only then reevaluate the proposal. A sketch of this contrast follows at the end of this section.

We also worry about proposals which are too device-specific (be it TPU-specific, GPU-specific, or CPU-specific). While many such things seem reasonable when first proposed, they break down over time, as the sets of users for different devices overlap quite a bit.
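Returning to the layers-versus-monolithic example above, a sketch of the contrast (`build_image_classifier` is a hypothetical endpoint invented for illustration; the building-block version uses the real Keras functional API):

```python
import tensorflow as tf

# Hypothetical bottleneck API: every new variant becomes a feature request.
# model = build_image_classifier(arch="resnet50", classes=10, dropout=0.1)

# Composable building blocks: a slightly different use case is a small,
# local change rather than a new parameter on a monolithic endpoint.
inputs = tf.keras.Input(shape=(32, 32, 3))
x = tf.keras.layers.Conv2D(64, 3, activation="relu")(inputs)
x = tf.keras.layers.GlobalAveragePooling2D()(x)
outputs = tf.keras.layers.Dense(10)(x)
model = tf.keras.Model(inputs, outputs)
```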
### Integration

As TF has grown, it has sprouted an ecosystem of tools and libraries both internal and external to TF. New entries to this ecosystem should, as much as possible, coexist and cooperate peacefully with other entities in the TF ecosystem. Failing that, new entries should cleanly replace existing ones. Awkward coexistence is not an option we recommend.

The ecosystem includes both things currently inside TF, such as Estimator, Keras, tf.data, tf.distribute, tf.tpu, XLA, and tf.saved_model, as well as things developed outside TF, such as TF Probability, Vizier, TF Serving, MLIR, TFRT, and Sonnet, among others. If existing integration points do not suffice, we should consider developing new integration points (e.g. the Sonnet team developed tf.Module to integrate Sonnet, which lives outside TF, with tf.train.Checkpoint, tf.keras, tf.function, and tf.saved_model).

It is also important that new designs don't break existing abstractions which TF supports, such as eager execution, functions, control flow, gradients, or automatic vectorization. In general, libraries which use simpler TF primitives (like tf.data) are easier to integrate into the ecosystem than libraries which try to rewrite TF programs (like TF Transform v1). Similarly, we should prefer proposals which rely on explicit APIs over proposals which want to do post-hoc graph rewriting (or write converters or exporters), as those tend to integrate poorly with each other and to be hard to program directly.

### Maintenance

As many proposals for TF improvements come from outside TF, or from outside the subteams in TF which currently maintain related functionality, TF design proposals should be upfront about the maintenance story for new functionality and code.

It is perfectly fine (and common) to punt maintenance to the TF team, but we should figure out, ahead of the design review, who specifically on the TF team is signing up to maintain the design in question.

### User groups

While TensorFlow cannot be everything for everyone, we do try to cover a broad swath of machine learning use cases, spanning the spectrum from research to production, from small devices to large ones, and from commercial to educational uses.

It is important for every TF proposal to spell out which segments of our user community it addresses and for which segments it is expected to be irrelevant. Specifically, consider the stereotypical pure ML researcher, the researcher applying ML to another field, the student learning ML, the industry professional applying ML with little to no ML background, the applied-ML developer in industry, the mobile developer, and others.

## Specific considerations

Some particular subsystems of TF have their own considerations which are often relevant for TF design reviews. It is up to an individual design's sponsor whether any of these topics needs to be addressed in the document before review.

### Eager/graph mode

In TF 1.x many libraries implicitly or explicitly assumed graph-based execution. Now that TF 2.0 has been released, eager execution is on by default. This means that all new TF APIs should be usable both from eager execution and from graphs, and new design proposals should be implemented so they work with both.
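A minimal sketch of this dual-mode expectation: the same function should run eagerly and inside a `tf.function`-traced graph, without relying on global graph state (`scaled_sum` is a made-up example):

```python
import tensorflow as tf

def scaled_sum(x, scale=2.0):
  return tf.reduce_sum(x) * scale

x = tf.constant([1.0, 2.0, 3.0])
print(scaled_sum(x))               # eager: executes immediately
print(tf.function(scaled_sum)(x))  # graph: traced once, then executed
```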
In practice this means we cannot rely on per-graph global state, reference dtypes, or graph pruning to ensure program correctness. Similarly, in some cases in TF 1.x it was possible to treat a Tensor as a promise; in TF2, however, a Tensor is an already-computed value, and if you need a promise, use instead a function which can compute the tensor on demand.

### Keras

Keras has a special status, existing both inside and outside TF. As such, changes to Keras need to consider the impact on the entire Keras community. New APIs proposed for Keras can comfortably live in tf-addons. Changes to core Keras APIs need a review owner or sponsor from the Keras team before a TF-wide review.

Further, changes outside the scope of Keras should address how they will interact with Keras users, if at all. For example, if a new CompositeTensor is proposed, will it be a plausible input to Keras layers? If so, how will that support be implemented?

### tf.data

tf.data is TensorFlow's recommended API for data (pre)processing, and any designs that pertain to handling data should consider how they relate to tf.data. New designs in this space should provide useful functionality on top of tf.data (such as the TFDS project, or a library of common transformations for tf.data.Dataset.map) as opposed to alternatives to tf.data.

In addition, new CompositeTensor subclasses should strongly consider implementing the optional BatchableTypeSpec interface, which tf.data needs in order to batch and unbatch instances of the subclass.

### SavedModel

SavedModel changes in particular need to be both forward and backward compatible, as SavedModel files will be written and read by entirely different TF versions. In general, removing things from the format is not okay, but adding things is possible as long as the additions are not required for older binaries to read the model correctly.

### "Impossible" designs

There are many things which are possible to make work for specific cases but impossible to make work in general. We should avoid proposing changes to TF that look like they work in general but in practice require each new use case to be covered by manual work from the TF team.

### Distribution Strategies

tf.distribute is the recommended API for distributing computation over GPUs, TPUs, and multiple machines. It is important to consider the implications of a new design with respect to how it would work in a distributed setting. Explicit changes may be required to ensure the new functionality works seamlessly with and without tf.distribute.
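As a sketch of what "works seamlessly with and without tf.distribute" can look like in practice, the same model code run under a strategy scope:

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()  # e.g. one host, multiple GPUs
with strategy.scope():
  # Variables created here are mirrored across replicas; the identical
  # model code runs unchanged on a single device if the scope is removed.
  model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
  model.compile(optimizer="sgd", loss="mse")
```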