
Commit 734b50d

Balandat authored and facebook-github-bot committed
Docs on t/q batch mode decorators (#70)
Summary: Adds some explanation for these (we should still consider getting rid of the q-batch one)

Pull Request resolved: #70

Reviewed By: eytan

Differential Revision: D14993807

Pulled By: Balandat

fbshipit-source-id: 3f0de22c16a5a889877049fde8ca6eff9ff4cfd3
1 parent d78c11f commit 734b50d

File tree: 6 files changed (+179, −102 lines)


docs/acquisition.md

Lines changed: 2 additions & 2 deletions
@@ -59,8 +59,8 @@ stochastic optimization methods.
 ![resampling_fixed](assets/EI_resampling_fixed.png)

 If the base samples are fixed, the problem of optimizing the acquisition function
-is deterministic, allowing for conventional quasi-second order methods to be used
-(e.g., `L-BFGS` or sequential least-squares programming `SLSQP`). These have
+is deterministic, allowing for conventional quasi-second order methods such as
+L-BFGS or sequential least-squares programming (SLSQP) to be used. These have
 faster convergence rates than first-order methods and can speed up acquisition
 function optimization significantly.

docs/batching.md

Lines changed: 171 additions & 0 deletions
@@ -0,0 +1,171 @@
---
id: batching
title: Batching
---

botorch makes frequent use of "batching", both in the sense of batch acquisition
functions for multiple candidates as well as in the sense of parallel or batch
computation (neither of these should be confused with mini-batch training).
Here we explain some of the common patterns you will see in botorch for
exploiting parallelism, including common shapes and decorators for more
conveniently handling these shapes.

## Batch Acquisition Functions

botorch supports batch acquisition functions that assign a joint utility to a
set of $q$ design points in the parameter space. These are, for obvious reasons,
referred to as q-Acquisition Functions. For instance, botorch ships with support
for q-EI, q-UCB, and a few others.

As discussed in the
[design philosophy](design_philosophy#batching-batching-batching),
botorch has adopted the convention of referring to batches in the
batch-acquisition sense as "q-batches", and to batches in the torch
batch-evaluation sense as "t-batches".

Internally, q-batch acquisition functions operate on input tensors of shape
$b \times q \times d$, where $b$ is the number of t-batches, $q$ is the number
of design points to be considered concurrently, and $d$ is the dimension of the
parameter space. Their output is a one-dimensional tensor with $b$ elements,
with the $i$-th element corresponding to the $i$-th t-batch. Always requiring an
explicit t-batch dimension makes it much easier and less ambiguous to work with
samples from the posterior in a consistent fashion.
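
The following minimal sketch illustrates these shapes. The toy data, the use of
`SingleTaskGP`, and the exact `qExpectedImprovement` signature are assumptions
based on the botorch API and may differ slightly across versions; the point is
only that a $b \times q \times d$ input yields $b$ acquisition values.

```python
import torch
from botorch.models import SingleTaskGP
from botorch.acquisition import qExpectedImprovement

# Toy single-output model: n=20 training points in d=3 dimensions.
train_X = torch.rand(20, 3)
train_Y = train_X.sum(dim=-1, keepdim=True)
model = SingleTaskGP(train_X, train_Y)

# b=5 t-batches, each containing q=2 candidate points in d=3 dimensions.
X = torch.rand(5, 2, 3)

qEI = qExpectedImprovement(model, best_f=train_Y.max())
values = qEI(X)  # one joint utility value per t-batch
print(values.shape)  # expected: torch.Size([5])
```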

#### Batch-mode decorators

In order to simplify the user-facing API for evaluating acquisition functions,
botorch implements the
[`@t_batch_mode_transform`](../api/utils.html#botorch.utils.transforms.t_batch_mode_transform)
and
[`@q_batch_mode_transform`](../api/utils.html#botorch.utils.transforms.q_batch_mode_transform)
decorators.

##### `@t_batch_mode_transform`

This decorator simplifies evaluating MC-based acquisition functions using
inputs in non-batch mode. If applied to an instance method with a single `Tensor`
argument, an input tensor to that method without a t-batch dimension (i.e.
tensors of shape $q \times d$) will automatically be converted to a t-batch of
size 1 (i.e. of `batch_shape` `torch.Size([1])`). This is typically used on the
`forward` method of an `MCAcquisitionFunction`.
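
Conceptually, the decorator only needs to add a leading singleton t-batch
dimension before delegating to the wrapped method. The following is a simplified
stand-in for illustration, not the actual botorch implementation:

```python
from functools import wraps

from torch import Tensor


def toy_t_batch_mode_transform(method):
    """Simplified stand-in for `@t_batch_mode_transform` (illustration only)."""

    @wraps(method)
    def decorated(acqf, X: Tensor) -> Tensor:
        # If X has no t-batch dimension (shape `q x d`), add one of size 1.
        X = X if X.dim() > 2 else X.unsqueeze(0)
        return method(acqf, X)

    return decorated
```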

##### `@q_batch_mode_transform`

This decorator simplifies evaluating analytic acquisition functions with input
tensors that do not have a q-batch dimension. If applied to an instance method
with a single `Tensor` argument, an input tensor to that method will
automatically receive an additional singleton dimension at the second-to-last
dimension. This is typically used on the `forward` method of an
`AnalyticAcquisitionFunction`.
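
The effect on shapes is simply an `unsqueeze` in the second-to-last position; a
minimal illustration (plain tensor manipulation, not the botorch decorator
itself):

```python
import torch

X = torch.rand(5, 3)      # b=5 points in d=3 dimensions, no q-batch dimension
X_aug = X.unsqueeze(-2)   # insert a singleton q-dimension second-to-last
print(X_aug.shape)        # torch.Size([5, 1, 3])
```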


## Batching Sample Shapes

botorch evaluates Monte-Carlo acquisition functions using (quasi-) Monte-Carlo
sampling from the posterior at the input features $X$. Hence, on top of the
existing q-batch and t-batch dimensions, we also end up with another batch
dimension corresponding to the MC samples we draw. We use the PyTorch notions of
`sample_shape` and `event_shape`.

`event_shape` is the shape of a single sample drawn from the underlying
distribution. For instance,
- evaluating a single-output model at a $1 \times n \times d$ tensor,
  representing $n$ data points in $d$ dimensions each, yields a posterior with
  `event_shape` $1 \times n \times 1$. Evaluating the same model at a
  $\textit{batch_shape} \times n \times d$ tensor (representing a t-batch-shape
  of $\textit{batch_shape}$, with $n$ $d$-dimensional data points in each batch)
  yields a posterior with `event_shape` $\textit{batch_shape} \times n \times 1$.
- evaluating a multi-output model with $t$ outputs at a $\textit{batch_shape}
  \times n \times d$ tensor yields a posterior with `event_shape`
  $\textit{batch_shape} \times n \times t$.
- recall from the previous section that internally, all acquisition functions
  are evaluated using a single t-batch dimension.
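
For instance, checking the `event_shape` of a posterior (a sketch with toy data;
the model setup and the `Posterior.event_shape` property are assumed to behave
as described above):

```python
import torch
from botorch.models import SingleTaskGP

# Toy single-output model on n=20 training points in d=3 dimensions.
model = SingleTaskGP(torch.rand(20, 3), torch.rand(20, 1))

# Evaluate at a t-batch of shape 2 x 4, with n=5 test points in d=3 dimensions.
posterior = model.posterior(torch.rand(2, 4, 5, 3))
print(posterior.event_shape)  # expected: torch.Size([2, 4, 5, 1])
```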

`sample_shape` is the shape (possibly multi-dimensional) of the samples drawn
*independently* from the distribution with `event_shape`, resulting in a tensor
of samples of shape `sample_shape` + `event_shape`. For instance,
- drawing a sample of shape $s_1 \times s_2$ from a posterior with `event_shape`
  $\textit{batch_shape} \times n \times t$ results in a tensor of shape
  $s_1 \times s_2 \times \textit{batch_shape} \times n \times t$, where each of
  the $s_1 s_2$ tensors of shape $\textit{batch_shape} \times n \times t$ is an
  independent draw.
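
Continuing the sketch above (again assuming the standard `Posterior.rsample`
interface):

```python
# Draw a sample_shape of 3 x 2, i.e. six independent draws from the posterior.
samples = posterior.rsample(sample_shape=torch.Size([3, 2]))
print(samples.shape)  # expected: torch.Size([3, 2, 2, 4, 5, 1]), i.e. sample_shape + event_shape
```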


## Batched Evaluation of Models and Acquisition Functions

The GPyTorch models implemented in botorch support t-batched evaluation with
arbitrary t-batch shapes.

##### Non-batched Models

In the simplest case, a model is fit to non-batched training points with shape
$n \times d$.
- *Non-batched evaluation* on a set of test points with shape $m \times d$
  yields a joint posterior over the $m$ points.
- *Batched evaluation* on a set of test points with shape
  $\textit{batch_shape} \times m \times d$ yields $\textit{batch_shape}$
  joint posteriors over the $m$ points in each respective batch.
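
In code (reusing the toy $n \times d$ model from the sketches above; the shapes
in the comments are the expected results):

```python
# Non-batched evaluation: one joint posterior over m=4 test points.
single = model.posterior(torch.rand(4, 3))
print(single.mean.shape)   # expected: torch.Size([4, 1])

# Batched evaluation: 10 independent joint posteriors over m=4 points each.
batched = model.posterior(torch.rand(10, 4, 3))
print(batched.mean.shape)  # expected: torch.Size([10, 4, 1])
```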

##### Batched Models

The GPyTorch models can also be fit on batched training points with shape
$\textit{input_batch_shape} \times n \times d$. Here, each batch is modeled
independently (each batch has its own hyperparameters).
For example, if the training points have shape $b_1 \times b_2 \times n \times d$
(two batch dimensions), the batched GPyTorch model is effectively $b_1 \times b_2$
independent models. More generally, suppose we fit a model to training points
with shape $\textit{input_batch_shape} \times n \times d$.
Then, the test points must support broadcasting to the $\textit{input_batch_shape}$.

* *Non-batched evaluation* on a set of test points with shape
  $\textit{input_batch_shape}^* \times m \times d$, where each dimension of
  $\textit{input_batch_shape}^*$ either matches the corresponding dimension of
  $\textit{input_batch_shape}$ or is 1 to support broadcasting, yields
  $\textit{input_batch_shape}$ joint posteriors over the $m$ points (one per
  batch, with the test points broadcast as necessary).

* *Batched evaluation* on a set of test points with shape
  $\textit{new_batch_shape} \times \textit{input_batch_shape}^* \times m \times d$,
  where $\textit{new_batch_shape}$ is the new batch shape for batched evaluation,
  yields $\textit{new_batch_shape} \times \textit{input_batch_shape}$ joint
  posteriors over the $m$ points in each respective batch (broadcasting as
  necessary over $\textit{input_batch_shape}$); see the sketch below.
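
A sketch of both cases (assuming, as above, that `SingleTaskGP` accepts batched
training data; the shapes in the comments are the expected results):

```python
import torch
from botorch.models import SingleTaskGP

# input_batch_shape = 3: three independent models, each on n=20 points in d=2.
batched_model = SingleTaskGP(torch.rand(3, 20, 2), torch.rand(3, 20, 1))

# Non-batched evaluation: test points broadcast over input_batch_shape.
post = batched_model.posterior(torch.rand(1, 5, 2))
print(post.mean.shape)  # expected: torch.Size([3, 5, 1])

# Batched evaluation with new_batch_shape = 4.
post = batched_model.posterior(torch.rand(4, 3, 5, 2))
print(post.mean.shape)  # expected: torch.Size([4, 3, 5, 1])
```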

#### Batched Multi-Output Models

The [`BatchedMultiOutputGPyTorchModel`](../api/models.html#batchedmultioutputgpytorchmodel)
class implements a fast multi-output model (assuming conditional independence of
the outputs given the input) by batching over the outputs.

##### Training inputs/targets

Given training inputs with shape $\textit{input_batch_shape} \times n \times d$
and training outputs with shape $\textit{input_batch_shape} \times n \times o$,
the `BatchedMultiOutputGPyTorchModel` permutes the training outputs to make the
output $o$-dimension a batch dimension, such that the augmented training targets
have shape $o \times \textit{input_batch_shape} \times n$. The training inputs
(which are required to be the same for all outputs) are expanded to
$o \times \textit{input_batch_shape} \times n \times d$.

##### Evaluation

When evaluating test points with shape
$\textit{new_batch_shape} \times \textit{input_batch_shape} \times m \times d$
via the `posterior` method, the test points are broadcast to the model(s) for
each output. This results in a batched posterior whose mean has shape
$\textit{new_batch_shape} \times o \times \textit{input_batch_shape} \times m$,
which is then permuted back to the original multi-output shape
$\textit{new_batch_shape} \times \textit{input_batch_shape} \times m \times o$.
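
From the user's perspective this output batching is transparent; a short sketch
(assuming `SingleTaskGP` uses this machinery when given multi-output training
targets, as described above):

```python
import torch
from botorch.models import SingleTaskGP

# n=20 training points in d=3 dimensions, o=2 outputs.
mo_model = SingleTaskGP(torch.rand(20, 3), torch.rand(20, 2))

posterior = mo_model.posterior(torch.rand(5, 3))
print(posterior.mean.shape)  # expected: torch.Size([5, 2]), i.e. m x o
```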

#### Batched Optimization of Random Restarts

botorch uses random restarts to optimize an acquisition function from multiple
starting points. To efficiently optimize an acquisition function for a $q$-batch
of candidate points using $r$ random restarts, botorch uses batched
evaluation on an $r \times q \times d$ set of candidate points to independently
evaluate and optimize each random restart in parallel.
In order to optimize the $r$ acquisition functions using gradient information,
the acquisition values of the $r$ random restarts are summed before
back-propagating.
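
Summing works because restart $i$'s acquisition value depends only on restart
$i$'s candidates, so the gradient of the sum decomposes into per-restart
gradients. A toy illustration of the pattern (using a stand-in quadratic
objective rather than a real acquisition function):

```python
import torch

def toy_acqf(X):
    # Stand-in "acquisition function": one value per restart (X has shape r x q x d).
    return -(X ** 2).sum(dim=(-2, -1))

r, q, d = 8, 2, 3
X = torch.rand(r, q, d, requires_grad=True)
optimizer = torch.optim.Adam([X], lr=0.05)

for _ in range(100):
    optimizer.zero_grad()
    loss = -toy_acqf(X).sum()  # sum over restarts; each restart keeps its own gradient
    loss.backward()
    optimizer.step()
```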

#### Batched Cross Validation

See the
[Using batch evaluation for fast cross validation](../tutorials/batch_mode_cross_validation)
tutorial for details on using batching for fast cross validation.

docs/design_philosophy.md

Lines changed: 2 additions & 2 deletions
@@ -19,6 +19,7 @@ botorch adheres to the following main design tenets:
 - Extend the applicability of Bayesian Optimization to very large problems by
   harnessing scalable modeling frameworks such as [GPyTorch](https://gpytorch.ai/).

+
 ## Parallelism through batched computation

 Batching (as in batching data, batching computations) is a central component to
@@ -44,8 +45,7 @@ stochastic gradient descent using mini-batch training, botorch itself abstracts
 away from this.

 For more detail on the different batch notions in botorch, take a look at the
-[More on Batching](#more_on_batching) section.
-
+[Batching in botorch](#batching) section.


 ## Optimizing Acquisition Functions

docs/more_on_batching.md

Lines changed: 0 additions & 93 deletions
This file was deleted.

docs/optimization.md

Lines changed: 3 additions & 4 deletions
@@ -20,7 +20,7 @@ The default method used by botorch to optimize acquisition functions is
 [`gen_candidates_scipy()`](../api/gen.html#botorch.gen.gen_candidates_scipy).
 Given a set of starting points (for multiple restarts) and an acquisition
 function, this optimizer makes use of `scipy.optimize.minimize()` for
-optimization, via either the `L-BFGS-B` or `SLSQP` routines.
+optimization, via either the L-BFGS-B or SLSQP routines.
 `gen_candidates_scipy()` automatically handles conversion between `torch` and
 `numpy` types, and utilizes PyTorch's autograd capabilities to compute the
 gradient of the acquisition function.
@@ -30,9 +30,8 @@ gradient of the acquisition function.
 A `torch` optimizer such as `torch.optim.Adam` or `torch.optim.SGD` can also be
 used directly, without the need to perform `numpy` conversion. These first-order
 gradient-based optimizers are particularly useful for the case when the
-acquisition function is stochastic, where algorithms `L-BFGS` or Sequential
-Least-Squares Programming designed for deterministic functions should not be
-applied. The function
+acquisition function is stochastic, where algorithms like L-BFGS or SLSQP that
+are designed for deterministic functions should not be applied. The function
 [`gen_candidates_torch()`](../api/gen.html#botorch.gen.gen_candidates_torch)
 provides an interface for `torch` optimizers and handles bounding.
 See the example notebooks

website/sidebars.json

Lines changed: 1 addition & 1 deletion
@@ -3,6 +3,6 @@
     "About": ["introduction", "design_philosophy", "botorch_and_ax"],
     "General": ["getting_started"],
     "Basic Concepts": ["overview", "models", "posteriors", "acquisition", "optimization"],
-    "Advanced Topics": ["more_on_batching", "objectives", "samplers", "mtmo_models"]
+    "Advanced Topics": ["batching", "objectives", "samplers"]
   }
 }
