Many problems involving correlated inputs require optimizing an unknown reward function from which only noisy observations can be obtained. For example, in the field of spinal therapy, multiple electrodes may be placed at different locations on the spine. Determining the locations and polarities (cathode or anode) of the electrodes that produce the best response to electrical stimulation involves testing many combinations of locations and polarities, and the measured outputs reflect both the reaction and noise. The outputs of the function are analyzed to refine successive sets of input data until the electrode configuration that maximizes the desired attributes of the stimulus response is found.
A central challenge is choosing actions that both explore the function (i.e., estimate it) and exploit knowledge about likely high-reward regions in order to find the best result. Carefully calibrating this exploration-exploitation tradeoff when selecting input data is especially important when the experiments are costly in some sense, e.g., when each experiment takes a long time to perform and the time window available for experiments is short. A purely sequential approach requires each experiment to be completed before the next input values can be chosen, so conducting multiple trials, each of which relies on the results of the previous trials, takes considerable time.
In many applications, it is desirable to select a batch of input data to be evaluated in parallel, which speeds up the search for solutions because the inputs constituting a batch may be tested simultaneously. By parallelizing the experiments, substantially more information may be gathered in the same time frame; however, future actions must then be chosen without the benefit of intermediate results. The challenge is to assemble groups of experiments, to be run simultaneously, that both explore the function and exploit the regions currently known to be high-performing. This challenge is significant when dealing with a combinatorially large set of possible data inputs. Further, it is important to resolve the statistical question of how, quantitatively, the algorithm's performance depends on the size of the batch (i.e., the degree of informational parallelism).
Exploration-exploitation tradeoffs have been studied in the context of the multi-armed bandit problem, in which a single action is taken at each round and a corresponding (possibly noisy) reward is observed. Early work focused on the case of a finite number of decisions with payoffs that are independent across the arms. In this setting, under some strong assumptions, optimal policies can be computed.
Optimistic allocation of actions according to upper-confidence bounds (UCB) on the payoffs has proven to be particularly effective. Recently, approaches for coping with large (or infinite) sets of decisions have been developed. In these cases, dependence between the payoffs associated with different decisions must be modeled and exploited. Examples include bandits with linear or Lipschitz-continuous payoffs and bandits on trees. The exploration-exploitation tradeoff has also been studied in Bayesian global optimization and response surface modeling, where Gaussian process models are often used due to their flexibility in incorporating prior assumptions about the payoff function.
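By way of illustration only, optimistic allocation over a finite set of independent arms may be sketched as follows. The sketch uses the well-known UCB1 index (empirical mean plus the square root of 2 ln t / n); the function names `ucb1` and `bernoulli`, the payoff probabilities, and the number of rounds are assumptions chosen for the example and are not taken from any particular prior-art system.

```python
import math
import random


def ucb1(reward_fns, rounds, seed=0):
    """Run UCB1 on a finite set of arms; return pull counts and mean estimates."""
    rng = random.Random(seed)
    n_arms = len(reward_fns)
    counts = [0] * n_arms      # number of times each arm was pulled
    means = [0.0] * n_arms     # running average reward of each arm
    for t in range(1, rounds + 1):
        if t <= n_arms:
            arm = t - 1        # initialization: pull every arm once
        else:
            # Optimism: pick the arm with the largest upper confidence bound.
            arm = max(
                range(n_arms),
                key=lambda a: means[a] + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        reward = reward_fns[arm](rng)          # noisy observation
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]
    return counts, means


def bernoulli(p):
    """Hypothetical noisy arm: pays 1.0 with probability p, else 0.0."""
    return lambda rng: 1.0 if rng.random() < p else 0.0


arms = [bernoulli(0.2), bernoulli(0.5), bernoulli(0.8)]
counts, means = ucb1(arms, rounds=2000)
```

In such a sketch, rarely pulled arms retain a wide confidence interval and are therefore revisited from time to time (exploration), while arms with high estimated payoff are pulled most often (exploitation).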
One natural application is the design of high-throughput experiments, in which several experiments are performed in parallel and feedback is received only after the experiments have concluded. In other settings, feedback may be received only after a delay. To enable parallel selection, one must account for the lag between decisions and observations. Most existing approaches that can deal with such delay incur a multiplicative increase in the cumulative regret as the delay grows. Only recently have methods demonstrated that it is possible to obtain regret bounds that increase only additively with the delay (i.e., the penalty becomes negligible for large numbers of decisions). However, such approaches apply only to contextual bandit problems with finite decision sets, and thus not to settings with complex (even nonparametric) payoff functions.
There is therefore a need for a method to select a batch of input data for numerous evaluations of a function performed in parallel to maximize reward. There is also a need for a system to use existing Gaussian process models with upper confidence bounds in order to select a batch of input data without relying on previous evaluation output data. There is a further need to provide a process for selecting a batch with a variable length for function evaluation in parallel.
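For illustration only, one way in which a batch might be assembled from a Gaussian process model and upper confidence bounds, without waiting for the outputs of the pending evaluations, is sketched below. The sketch relies on the fact that the posterior variance of a Gaussian process depends only on where evaluations are made and not on the observed values, so the variance can be updated for each pending input while the mean is held fixed. The squared-exponential kernel, its length scale, the noise variance, the confidence parameter `beta`, and the names `rbf_kernel`, `gp_posterior`, and `select_batch` are all assumptions made for the example; the sketch is not a statement of the claimed method.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=0.2):
    """Squared-exponential kernel between two 1-D arrays of inputs."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)


def gp_posterior(x_obs, y_obs, x_cand, noise_var=0.01):
    """Posterior mean and variance of a zero-mean, unit-variance GP prior."""
    if len(x_obs) == 0:
        return np.zeros(len(x_cand)), np.ones(len(x_cand))
    k_oo = rbf_kernel(x_obs, x_obs) + noise_var * np.eye(len(x_obs))
    k_oc = rbf_kernel(x_obs, x_cand)
    solved = np.linalg.solve(k_oo, k_oc)       # K_oo^{-1} K_oc
    mean = solved.T @ y_obs
    var = 1.0 - np.sum(k_oc * solved, axis=0)  # does not involve y_obs
    return mean, np.maximum(var, 0.0)


def select_batch(x_obs, y_obs, x_cand, batch_size, beta=4.0):
    """Pick candidate indices greedily; the mean uses real observations only."""
    mean, _ = gp_posterior(x_obs, y_obs, x_cand)
    pending = np.asarray(x_obs, dtype=float)
    batch = []
    for _ in range(batch_size):
        # Variance conditioned on observed AND pending inputs; the pending
        # outputs are unknown, so placeholder zeros are passed and ignored.
        _, var = gp_posterior(pending, np.zeros(len(pending)), x_cand)
        idx = int(np.argmax(mean + np.sqrt(beta * var)))
        batch.append(idx)
        pending = np.append(pending, x_cand[idx])
    return batch


candidates = np.linspace(0.0, 1.0, 21)
batch = select_batch(np.array([0.2, 0.8]), np.array([0.1, 0.9]),
                     candidates, batch_size=3)
```

Because each pending input lowers the variance in its neighborhood, later selections within the same batch tend to be steered toward regions not already covered by the batch. The `batch_size` argument may be varied from one call to the next.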