The present disclosure generally relates to systems and methods for planning and sequential decision making, for example, in real-world domains. More particularly, the present application relates to planning via Monte-Carlo simulation trials utilizing an innovative decomposition of observable state information that enables tackling larger state spaces than is feasible with established state-of-the-art methods.
A planning problem, also known as a sequential decision-making problem, is commonly characterized by a set of possible states of the problem, a set of allowable actions that can be performed in each state, a process for generating transitions to successor states given the current state and current selected action, a planning horizon (i.e., total anticipated number of decisions to be made in the problem), and a measure of utility or reward obtained at one or more steps of the planning problem. Typically, the objective is to compute a sequence of decisions that maximizes expected cumulative discounted or undiscounted reward. Additionally, planning problems presume that observable information pertaining to the state of the problem is available at each step in the sequence. If the observable information uniquely identifies the state, and the processes that generate rewards and state transitions are stationary and history-independent, the problem is formally classified as a Markov Decision Process (MDP). Alternatively, if the observable information does not uniquely identify the state, the problem is a Partially Observable Markov Decision Process (POMDP), provided that the reward and state transition processes are still stationary and history-independent.
Monte-Carlo Planning methods use a simulation model of the real domain, and estimate the cumulative reward of performing an action in a given state on the basis of Monte-Carlo simulation trials. Such simulation trials comprise one or more steps, each of which typically comprises a simulation of performing an action in the current simulated state, generating a transition to a new state, and generating an immediate reward (if applicable). The selection of an action at each step is performed by a simulation policy, i.e., a method which selects one of the available legal actions responsive to information observed in the current or previous steps of a simulation trial. The outcomes of Monte-Carlo simulation trials are assumed to be non-deterministic. The non-determinism may arise from non-deterministic rewards or state transitions, as well as from the use of a non-deterministic simulation policy. As a result, Monte-Carlo simulation trials provide a means of statistically evaluating the long-range cumulative expected reward obtained by performing a given action in a given state of the simulated domain.
Many methods are known in the art for planning based on Monte-Carlo simulation trials. One of the earliest and simplest methods is the so-called “rollout algorithm” (G. Tesauro and G. R. Galperin, “On-line policy improvement using Monte-Carlo search,” in: Advances in Neural Information Processing Systems, vol. 9, pp. 1068-1074, 1997). In this method, a number of simulated trials (“rollouts”) are performed, each starting from a common initial state corresponding to the current state of the real domain. Each trial comprises selection of a legal action in the root state according to a sampling policy, and then actions in subsequent steps of the trial are performed by a fixed simulation policy. Mean reward statistics are maintained for each top-level action, and upon termination of all simulated trials, the method returns the top-level action with highest mean reward to be executed in the real domain.
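The rollout procedure described above can be sketched as follows. This is a minimal illustration, not the cited authors' implementation: the toy one-dimensional domain, the round-robin sampling of top-level actions, the uniform-random simulation policy, and all function names (`rollout_plan`, `legal_actions`, `step`, `is_terminal`, `random_policy`) are assumptions introduced here for illustration only.

```python
import random

def rollout_plan(root, legal_actions, step, is_terminal, sim_policy,
                 n_trials=2000, seed=0):
    """Return (best top-level action, mean reward per top-level action)."""
    rng = random.Random(seed)
    actions = list(legal_actions(root))
    totals = {a: 0.0 for a in actions}
    counts = {a: 0 for a in actions}
    for t in range(n_trials):
        # Sampling policy for the first step: simple round-robin (assumption).
        first = actions[t % len(actions)]
        state, total = step(root, first, rng)
        # Remaining steps follow the fixed simulation policy.
        while not is_terminal(state):
            state, reward = step(state, sim_policy(state, rng), rng)
            total += reward
        totals[first] += total
        counts[first] += 1
    means = {a: totals[a] / counts[a] for a in actions if counts[a]}
    return max(means, key=means.get), means

# Hypothetical toy domain: a walk on a line lasting a fixed number of steps.
# A state is (position, steps_left); each step's reward is the move made.
def legal_actions(state):
    return (-1, +1)

def step(state, action, rng):
    pos, steps_left = state
    # Non-deterministic transition: the intended move succeeds 80% of the time.
    move = action if rng.random() < 0.8 else -action
    return (pos + move, steps_left - 1), float(move)

def is_terminal(state):
    return state[1] == 0

def random_policy(state, rng):
    return rng.choice(legal_actions(state))

best_action, mean_rewards = rollout_plan(
    (0, 5), legal_actions, step, is_terminal, random_policy)
```

In this toy domain the first move has an expected reward of +0.6 for action +1 and -0.6 for action -1, while the random simulation policy contributes zero on average afterwards, so the sketch should select +1 as the top-level action.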
More recently, a number of Monte-Carlo planning methods have been published (e.g., L. Kocsis and Cs. Szepesvari, “Bandit-based Monte-Carlo Planning,” Proceedings of European Conference on Machine Learning, pp. 282-293, 2006) that extend the rollout algorithm to multiple levels of evaluation. That is, mean reward statistics are computed and maintained at subsequent steps of a trial in addition to the top-level step. This is typically accomplished by maintaining a collection of “nodes” (i.e., symbolic representations of states, or legal actions in a given state) encountered during the trials, computing total reward at the end of each trial, and then updating mean reward statistics of nodes participating in a given trial based on the total reward obtained in the trial. A sampling policy (e.g., sampling according to multi-armed bandit theory) is used not only for the initial step, but also for subsequent steps of a trial. While these methods are capable of producing effective sequential plans in domains with arbitrary topological relations between nodes (e.g., general MDPs with multiple paths to a given state, and loops back to previously encountered states), the preferred embodiment of these methods comprises nodes organized in a strict tree structure. For this reason, the methods are commonly referred to as Monte-Carlo Tree Search (MCTS) methods.
The recent advances in the use of MCTS methods enable effective on-the-fly planning in real-world domains such as Computer Go (S. Gelly and D. Silver, “Achieving Master Level Play in 9×9 Computer Go,” Proc. of AAAI, 2008). In that MCTS method, a tree of alternating action (parent) nodes and child nodes based on a simulated game is dynamically grown. The MCTS tree and data associated with nodes are represented as data structures in a computer system memory. From stochastic simulations involving randomness of sequential game moves (e.g., simulations of playouts in the case of Computer Go), intelligence is gathered at each of the nodes (e.g., an evaluation based on a winning percentage). For example, in the case of Computer Go, statistics data at each node is maintained based on the number of trials and simulated playout win outcomes. Associated reward values may be computed and stored in association with that node of the tree. On the basis of the intelligence gathered from the simulations, a good strategy for a player's move (decisions) can be inferred.
FIG. 1 illustrates a data structure 400 constructed by a computing system for computing the optimal move for one player (for example, White) in an example Computer Go game. The data structure 400 depicts a current state of an example Computer Go game, and includes a root node 405, and an alternating tree structure of action nodes represented by circles (e.g., nodes 407, 410 and 420) and successor state nodes represented by squares (e.g., nodes 405, 415, 425 and 430). An action node refers to a node in the data structure 400 that specifies at least one next action for the planning agent to perform in the parent node state. For example, node 410 may represent the action of White playing at E5 in the root state, and node 407 may represent White playing at F6 in the root state. Successor state nodes indicate possible environmental responses to the planning agent's actions, for example, possible moves that can be made by Black in response to a particular move by White. For example, node 425 represents the response B-D7 to W-E5, and node 430 represents the response B-E3 to W-E5. Within the data structure, standard MCTS planning methods perform a plurality of trials for each top-level action. A trial comprises a sequence of simulation steps, each step comprising a selection of an action node from among the available actions, and simulating a response of the environment leading to a successor state. The actions in the initial steps of a trial are typically selected by a bandit-sampling policy. If the trial selects an action that has not been previously sampled, a new node corresponding to the action is added to the data structure, and the steps of the trial continue, making use of a non-deterministic “playout” policy to select moves for both players. The steps of the trial continue until the simulated game terminates according to the rules of Go.
As a result of the simulated actions according to the MCTS approach, the data maintained at each action node includes the total number of simulated trials including the given action node, and the number of trials resulting in a win outcome. For example, node 410 contains the data record “3/7” indicating that a total of 7 trials have been performed including this node, of which 3 trials resulted in wins for White. After completion of a trial, the MCTS approach performs an update, in which the statistics of the nodes participating in the trial are updated with the game results, i.e., results of simulated trials are propagated up successive levels of the tree.
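The per-node record and the update step can be illustrated with a small sketch. The `Node` class and the `backpropagate` function are hypothetical names introduced here. For simplicity the sketch keeps statistics on every node rather than only on action nodes, and it counts all wins from White's perspective. The counts given to nodes 405 and 425 are invented for the example; only the “3/7” record of node 410 comes from the description above.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    label: str
    parent: Optional["Node"] = None
    wins: int = 0      # trials through this node that ended in a win for White
    trials: int = 0    # total trials that included this node

    def win_rate(self) -> float:
        return self.wins / self.trials if self.trials else 0.0

def backpropagate(leaf: Node, won: bool) -> None:
    """Propagate one trial result from the leaf up to the root."""
    node = leaf
    while node is not None:
        node.trials += 1
        node.wins += int(won)
        node = node.parent

# Node 410 holds the record "3/7"; the other counts are hypothetical.
root_405 = Node("405", wins=4, trials=12)
node_410 = Node("410 (W-E5)", parent=root_405, wins=3, trials=7)
node_425 = Node("425 (B-D7)", parent=node_410, wins=1, trials=2)

# One more simulated trial passes through 410 and 425 and ends in a win.
backpropagate(node_425, won=True)
```

After this single update, node 410 would hold the record 4/8, and every ancestor on the path to the root would have its trial count incremented by one.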
The value of any given node is estimated according to the node's mean win rate, i.e., the ratio of number of wins to total number of trials including the given node. During the performance of simulated trials within a decision cycle, selecting actions according to mean win rate may result in greater likelihood of achieving a more favorable outcome. For example, at node 405, the root node representing the current state of the game, the expected reward of each child action node is estimated (e.g., a value of 1/5 for node 407 and a value of 3/7 for node 410 in FIG. 1). Based on the observed statistics, selecting action 410 is more likely to achieve a win than selecting action 407. In practice, bandit-sampling algorithms used in MCTS select actions in a manner that balances exploitation (achieving high win rate) and exploration (sampling nodes with few trials). The tree is expanded by one node for each simulated game and is grown dynamically by adding new leaf nodes and performing corresponding simulations.
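One widely used bandit-sampling rule is UCB1, which adds an exploration bonus to the mean win rate. The sketch below applies it to the 1/5 and 3/7 statistics of nodes 407 and 410. It is offered only as an illustration of the exploitation/exploration balance: the description above does not state which bandit rule is used, and the function name `ucb1_select` and the exploration constant `c` are assumptions.

```python
import math

def ucb1_select(stats, c=math.sqrt(2)):
    """Pick an action from stats, a dict mapping action -> (wins, trials).

    Score = mean win rate + c * sqrt(ln(total trials) / trials of action).
    Untried actions are selected first.
    """
    for action, (_, n) in stats.items():
        if n == 0:
            return action
    total = sum(n for _, n in stats.values())

    def score(action):
        wins, n = stats[action]
        return wins / n + c * math.sqrt(math.log(total) / n)

    return max(stats, key=score)

# Statistics of the two top-level action nodes described above.
stats = {"407": (1, 5), "410": (3, 7)}

greedy_choice = ucb1_select(stats, c=0.0)      # pure exploitation
default_choice = ucb1_select(stats)            # c = sqrt(2)
exploratory_choice = ucb1_select(stats, c=3.0) # heavier exploration bonus
```

With these counts, pure exploitation and the default constant both favor node 410, whereas a sufficiently large exploration constant shifts the choice to the less-sampled node 407.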
While MCTS methods enable effective planning in many real domains, it is widely understood that the computational cost of such methods scales poorly (i.e., exponentially) with the number of state variables, the granularity of possible values of the state variables, and the number of legal actions in typical states. Consider, for example, domains comprising a number of continuous-valued state variables. A literal implementation of standard MCTS, maintaining separate nodes for each distinct state, may well result in no node being encountered more than once, since each encountered state may never match a previously encountered state to infinite precision in all state variables. Hence, the mean reward statistics in every node would only comprise results of a single trial, and would thus provide a highly unreliable estimate of a node's true expected value.
Discretizing the state space (of continuous variables) could address the above limitation of MCTS methods. For example, continuous robot arm motion may be discretized in units of 1 degree angles. However, the number of encountered states in the Monte-Carlo trials may still be infeasibly large, and the number of visits of any given node may still be too small to provide a reliable basis for effective planning. Moreover, such an approach fails to exploit a natural smoothness property in many real-world domains, in that similar states tend to have similar expected values, so that statistical evidence gathered from neighboring states could provide highly relevant evidence of the expected value of any particular node.
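The effect of discretization on node visit counts can be illustrated with a short experiment. The sketch below is an assumption-laden toy: joint angles are drawn uniformly at random rather than produced by a real simulator, and the names `discretize` and `count_distinct` are hypothetical. It counts how many distinct node keys arise from 10,000 sampled states when the key is the raw continuous state, a 1-degree discretization of one joint, and a 1-degree discretization of three joints.

```python
import random

def discretize(angles_deg, bin_width=1.0):
    """Map a tuple of continuous joint angles to a tuple of integer bin indices."""
    return tuple(int(a // bin_width) for a in angles_deg)

def count_distinct(n_samples, n_joints, key_fn, seed=0):
    """Count distinct node keys among randomly sampled joint-angle states."""
    rng = random.Random(seed)
    seen = set()
    for _ in range(n_samples):
        state = tuple(rng.uniform(0.0, 360.0) for _ in range(n_joints))
        seen.add(key_fn(state))
    return len(seen)

# Raw continuous states: essentially every sample is a new node.
raw_1 = count_distinct(10_000, 1, key_fn=lambda s: s)
# One joint in 1-degree bins: roughly 360 nodes, each visited many times.
binned_1 = count_distinct(10_000, 1, key_fn=discretize)
# Three joints in 1-degree bins: about 360**3 possible nodes, so revisits are rare.
binned_3 = count_distinct(10_000, 3, key_fn=discretize)
```

In this toy setting, discretization helps for a single variable, but the number of possible bins grows exponentially with the number of state variables, so with three joints almost every sampled state again maps to a node that is visited only once.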
Hence, it would be desirable to provide a system and method implementing improved Monte-Carlo planning that reduces the nominal search complexity of standard Monte-Carlo Tree Search, and effectively exploits smooth dependence of expected cumulative reward on some or all of the observable state variables in a given real-world domain.