Optimal control of systems that are only partially observable, that is, where the state of the system can be only partially determined through available measurements (observations) of the environment, continues to be an active area of research. This so-called “partial observability” arises from causes such as sensor inaccuracy or, more significantly, the lack of an appropriate sensor for a state that is relevant to an action that the controller may initiate. Such conditions are likely to occur, in particular, in relatively unstructured environments in which the system to be controlled is a small part of a large scenario containing other equally autonomous entities that were not specifically designed to match the capabilities of the controlled system. With the expected increase in the use of robotic systems in all walks of life (for example automated border surveillance, unmanned crop spraying in agriculture, and mapping of intestinal tracts using autonomous surveillance ‘pills’), such conditions are of increasing importance and there are many potential applications for this technology.
A key requirement in controlling these systems is an ability to react reasonably to a very broad range of external disturbances, using very limited sensor information, particularly where the disturbances are likely to be responsive to an action of the controller. For example, in aerial crop spraying there might be a need to avoid bird-strike, but the action of traversing the whole area of a crop to be sprayed will necessarily lead to birds taking flight from the crop: the birds may not be visible on the crop but experience in the form of a model of bird behaviour might be used to define best patterns of spraying to minimise the incidence of bird-strike.
An approach to addressing this class of problem is provided by optimal control theory. The fundamental equations providing a recursive solution to this general problem are the Hamilton Bellman equations, as described for example by R. E. Bellman in “Dynamic Programming”, Princeton University Press (1957). These equations can be solved exactly only for a restricted range of relatively simple control problems.
For linear systems in particular, where performance measures are quadratic and where any noise arising in observations of the environment is assumed to be Gaussian in nature, Goodwin and Kwai Sang Sin (Adaptive Filtering, Prediction and Control, Prentice Hall (1984)), for example, describe a method for controlling a system comprising estimating the states of the system, on-line, using a Kalman filter and using these state estimates in a full state feedback controller that has been designed separately from the state estimation process. This method involves an off-line solution of a set of Riccati equations very similar to those of the Kalman filter itself. While this process leads to an elegant solution to the associated control problem, the assumptions that are needed to obtain the solution are rarely satisfied in real systems.
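By way of illustration only, the separate design of estimator and controller described above can be sketched for a scalar linear system. The sketch is not taken from the cited text: the system coefficients, the cost and noise weights, and the function names (`riccati_fixed_point`, `lqr_gain`, `kalman_gain`, `lqg_step`) are all assumptions chosen for the example. It shows the same Riccati recursion being solved off-line twice, once for the feedback gain and once for the Kalman gain.

```python
# Illustrative scalar LQG design. All names and numbers here are assumptions
# made for this sketch; they do not come from the cited references.

def riccati_fixed_point(a, b, q, r, iterations=200):
    """Iterate the scalar discrete-time Riccati recursion to a steady state."""
    p = q
    for _ in range(iterations):
        p = q + a * a * p - (a * b * p) ** 2 / (r + b * b * p)
    return p


def lqr_gain(a, b, q, r):
    """Full state feedback gain K for the control law u = -K * x."""
    p = riccati_fixed_point(a, b, q, r)
    return a * b * p / (r + b * b * p)


def kalman_gain(a, c, process_var, measurement_var):
    """Steady-state Kalman gain, from the dual of the control recursion."""
    s = riccati_fixed_point(a, c, process_var, measurement_var)
    return s * c / (c * c * s + measurement_var)


# Assumed model: x' = A x + B u + w,  y = C x + v.
A, B, C = 1.0, 1.0, 1.0
K = lqr_gain(A, B, q=1.0, r=1.0)                              # off-line
L = kalman_gain(A, C, process_var=1.0, measurement_var=1.0)   # off-line


def lqg_step(x_pred, y):
    """One on-line step: correct the estimate, then apply state feedback."""
    x_est = x_pred + L * (y - C * x_pred)   # Kalman measurement update
    u = -K * x_est                          # controller sees only the estimate
    x_next_pred = A * x_est + B * u         # prediction for the next step
    return u, x_next_pred
```

With these assumed unit weights, both gains converge to the same value, approximately 0.618, which reflects the similarity between the control and filter Riccati equations noted above.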
In more realistic situations it is not possible to invoke the ‘separation principle’ that allows the controller and the state estimation to be solved for separately. Such a separation enables both the estimation and the control to be expressed compactly in terms of separate pre-computed gain factors. Without this separation, the optimal control depends on the results of observations and not just on an instantaneous error estimate. Because of this intimate coupling, the usual approach to the design of controllers for such complex systems, characterised by nonlinear sensors and dynamics and correspondingly non-Gaussian statistics (for example as described by D. Karagiannis, R. Ortega, and A. Astolfi in “Nonlinear adaptive stabilization via system immersion: control design and applications”, Lecture Notes in Control and Information Sciences, Springer-Verlag, Berlin, 311, pp. 1-21 (2005)), is to avoid state estimation completely and to design a controller directly on the basis of an analysis of the dynamics of the controlled system.
While this approach is sufficient for control, it does not provide for an interpretation of the control actions that were taken. This is not a serious issue for a closed autonomous system, but if such a system is linked to collaborating human operators, undesirable behaviours can result simply because, without the system controller providing an explanation for its actions, the human collaborator can end up counteracting them. Pilot-induced oscillation is an example of this phenomenon.
In the case of a linear system in which the states of the system can be defined by discrete values that are directly observed, the optimal control problem is just that of the well-known ‘travelling salesman’ problem, which is NP-complete; that is, no algorithm is known that obtains a general solution in a time that scales polynomially with the number of states. Nonetheless, at least in the case of an effectively finite horizon, a number of algorithms are known that enable practical approximate solutions to be obtained, in which the control algorithms are either solved off-line or iteratively improved as control actions are taken. These methods are generally described within a framework of Markov Decision Processes (MDPs) that comprise a Bayesian network model representing states and measurements of the system, augmented with control actions and performance measures. Within this model, a probabilistic link between a state and a measurement is termed a measurement model and the links between the states at different times constitute a model of the system dynamics. These dynamics are parameterised by the controller action and the performance of the system is monitored by rewards that are (possibly stochastic) variables conditional on the states of the system. Given this probabilistic structure, the benefits of particular control actions can be accumulated in a net return function. Choosing the actions so as to optimise the net return in the long run provides the route to the design of an optimal controller.
In the case of an MDP system with discrete-valued states, the Hamilton Bellman equation, referenced above, is a matrix equation in the discrete state space and solutions for optimal control are obtained by iteration of this matrix equation. The individual iterations involve matrix multiplication that is polynomial in the number of states, so that, provided a finite horizon can be assumed, a whole solution can be found approximately in polynomial time. In the textbook “Reinforcement Learning: An Introduction”, by R. S. Sutton and A. G. Barto, MIT Press, Cambridge, Mass. (1998), Sutton and Barto describe a number of different methods for solving such discrete state decision problems.
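The iteration over a finite horizon can be illustrated with a minimal sketch for a fully observed two-state problem. The states, the two actions (“wait” and “move”), the transition probabilities and the rewards are all hypothetical values chosen for the example, and the function name `finite_horizon_values` is likewise an assumption. Each pass applies one backup of the recursion, and each backup costs a number of multiplications that is polynomial in the number of states.

```python
# Hypothetical two-state MDP: state 0 is "off target", state 1 is "on target".
REWARD = [0.0, 1.0]                       # reward received in each state
TRANSITION = {                            # TRANSITION[action][s][s_next]
    "wait": [[1.0, 0.0], [0.0, 1.0]],     # state is unchanged
    "move": [[0.2, 0.8], [0.8, 0.2]],     # state usually flips
}


def finite_horizon_values(horizon):
    """Backward iteration of the recursion over a finite horizon."""
    n = len(REWARD)
    values = [0.0] * n
    policy = [None] * n
    for _ in range(horizon):
        new_values = []
        for s in range(n):
            best_action, best_value = None, float("-inf")
            for action, matrix in TRANSITION.items():
                expected = sum(matrix[s][t] * values[t] for t in range(n))
                if expected > best_value:
                    best_action, best_value = action, expected
            new_values.append(REWARD[s] + best_value)
            policy[s] = best_action       # best first action with this many steps to go
        values = new_values
    return values, policy


values, policy = finite_horizon_values(3)
# values is approximately [1.76, 3.0]; policy is ["move", "wait"]
```

In this assumed example the controller moves when off target and waits when on target, and the net return over three steps is 1.76 from state 0 and 3.0 from state 1.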
In the general case where not all states are “well observed”, that is, where it is not possible to determine all the states of a system with certainty on the basis of observations, control problems need to be solved by means of what is referred to as a Partially Observed Markov Decision Process (POMDP). In this case the MDP solution techniques mentioned above are not practicable. Although the same MDP approach is equally valid when states are not well observed, the resulting equations involve integrals over the probabilities of states rather than sums over the states themselves. These equations are not exactly solvable and even the known approximate solution methods (for example the “Witness” algorithm) are not guaranteed to run in polynomial time.
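The reason the equations come to involve probabilities of states can be seen from a minimal sketch of the Bayesian update of a belief, that is, of a probability distribution over the discrete states. The function name `belief_update` and the numerical values are assumptions made for illustration; the update itself is the standard prediction through the dynamics model followed by correction with the measurement model.

```python
def belief_update(belief, transition, likelihood):
    """Bayes update of a belief over discrete states.

    belief[s]        : prior probability of state s
    transition[s][t] : probability of moving from s to t under the chosen action
    likelihood[t]    : probability of the received observation given state t
    """
    n = len(belief)
    # Prediction through the model of the system dynamics.
    predicted = [sum(transition[s][t] * belief[s] for s in range(n))
                 for t in range(n)]
    # Correction with the measurement model, then normalisation.
    unnormalised = [likelihood[t] * predicted[t] for t in range(n)]
    total = sum(unnormalised)
    if total == 0.0:
        raise ValueError("observation has zero probability under the model")
    return [w / total for w in unnormalised]


# Assumed example: the action leaves the state unchanged, and the sensor
# reports "detected" with probability 0.2 in state 0 and 0.8 in state 1.
STAY = [[1.0, 0.0], [0.0, 1.0]]
DETECTION = [0.2, 0.8]
posterior = belief_update([0.5, 0.5], STAY, DETECTION)
# posterior is approximately [0.2, 0.8]
```

Because the belief is a continuous quantity even when the underlying states are discrete, a value function defined over beliefs can no longer be tabulated state by state.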
The Witness algorithm, developed by Kaelbling et al. and described for example in Anthony R. Cassandra, Leslie Pack Kaelbling, and Michael L. Littman, “Acting optimally in partially observable stochastic domains”, Proceedings of the Twelfth National Conference on Artificial Intelligence, Seattle, Wash. (1994), is difficult to use in practice, if only because its tractability is difficult to gauge a priori. In this algorithm, the integral problem referred to above is converted into a discrete problem by assuming the global benefit function to be piece-wise linear. This assumption is consistent with the form of the fundamental Hamilton Bellman equation, but it places no restriction on the number of linear facets that might result in the solution. The solution time is polynomial in this number of facets, but since it is possible for the number of facets to be indefinitely large, the solution time may still be non-polynomial.
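The piece-wise linear representation can be illustrated by a short sketch in which each facet is a vector of per-state values and the benefit at a belief is the largest inner product over the facets. This is only the representation, not the Witness algorithm itself, and the facet values and the name `value_at_belief` are hypothetical.

```python
def value_at_belief(belief, facets):
    """Evaluate a piece-wise linear convex function at a belief.

    Each facet is a vector with one entry per state; the value is the
    maximum over facets of the inner product with the belief.
    """
    best_index, best_value = 0, float("-inf")
    for index, facet in enumerate(facets):
        value = sum(f * b for f, b in zip(facet, belief))
        if value > best_value:
            best_index, best_value = index, value
    return best_value, best_index


# Three assumed facets over a two-state belief space.
FACETS = [(1.0, 0.0), (0.0, 1.0), (0.6, 0.6)]

uncertain = value_at_belief((0.5, 0.5), FACETS)   # third facet dominates
confident = value_at_belief((0.9, 0.1), FACETS)   # first facet dominates
```

Evaluation is linear in the number of facets, which is why the overall cost is governed by how many facets the solution process generates.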
In practice, this might not be a serious limitation if the controller design is performed off-line, since it may be possible to adjust the effective horizon of the solution so as to obtain a reasonable solution in a practicable time. However, the method does have the undesirable feature that the number of facets and hence the computational time only emerge as the solution process proceeds. This means that at best the approach is limited to off-line controller (decision-maker) design.
The Witness algorithm approach does have the advantage that it addresses the issue of providing both optimal control actions and optimal estimation of the system state, giving a rationale for the actions. In contrast, recent work by McAllester and Singh (“Uncertainty in Artificial Intelligence”, vol. 5, p 409 (1999)) takes an approach analogous to that of Karagiannis et al., referenced above, in conventional control and avoids state estimation entirely by seeking control solutions for the POMDP problem directly in terms of histories of measurements. As noted previously, such an approach makes it difficult to integrate multiple levels of control, particularly those involving interactions with human players.
Although partial observation leads to a problem for which there is no known approach providing a solution that can be made arbitrarily accurate, there are many situations where the lack of complete observability has few consequences and it is possible to obtain an approximate solution by solving the completely observed case and then looking for a perturbation solution by considering fluctuations around this classical solution. For a continuous state space, this is the Laplace approximation.
There have been recent attempts to extend this approach by allowing multiple classical paths. In particular, in H. J. Kappen, “Path integrals and symmetry breaking for optimal control theory”, arXiv: physics/0505066 4 (2005), Kappen has addressed this extension by sampling over the system trajectories from the neighbourhoods of all the classical optimal solution paths. In this approach the optimally controlled path is obtained by averaging over samples of likely paths, with appropriate likelihood weightings. This approach loses the potential advantages of a recursive formulation that would allow on-line optimisation and it also avoids the need to compute estimates of the states. While the latter might be seen as an advantage, it is not if there is a need to ‘understand’ the controlled actions of an autonomous robotic system.
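The idea of obtaining a control by a likelihood-weighted average over sampled paths can be sketched for a deliberately simple case. The sketch assumes a scalar system dx = u dt + dξ with a quadratic end cost only, a unit control cost weight (so that the weighting temperature equals the noise variance), and paths sampled from the uncontrolled dynamics; these assumptions, the parameter values and the function name `path_integral_control` are illustrative and are not taken from the cited paper.

```python
import math
import random


def path_integral_control(x0, target, noise_var=1.0, horizon=1.0,
                          steps=10, samples=20000, seed=0):
    """Estimate the initial control as a weighted average of sampled noise."""
    rng = random.Random(seed)
    dt = horizon / steps
    temperature = noise_var              # assumes unit control cost weight
    step_std = math.sqrt(noise_var * dt)
    weighted_first, weight_sum = 0.0, 0.0
    for _ in range(samples):
        first = rng.gauss(0.0, step_std)         # first noise increment
        x = x0 + first
        for _k in range(steps - 1):              # remainder of the path
            x += rng.gauss(0.0, step_std)
        cost = 0.5 * (x - target) ** 2           # assumed end cost only
        weight = math.exp(-cost / temperature)   # likelihood weighting
        weighted_first += weight * first
        weight_sum += weight
    # Weighted mean of the first increment, converted to a control rate.
    return weighted_first / (weight_sum * dt)


u_towards_plus = path_integral_control(0.0, 1.0)
u_towards_minus = path_integral_control(0.0, -1.0)
```

Under these assumptions the exact initial control has magnitude 0.5 directed towards the target, and the sampled estimates fluctuate around that value. The whole batch of paths has to be resampled at every control step, and no state estimate is produced along the way.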