In adversarial multiagent domains, security, commonly defined as the ability to deal with intentional threats from other agents, is a critical issue. These domains can be modeled as Bayesian games. Much work has been done on finding equilibria for such games. It is often the case, however, in multiagent security domains that one agent can commit to a mixed strategy that its adversaries observe before choosing their own strategies. In this case, the agent can maximize reward by finding an optimal strategy, without requiring equilibrium. Previous work has shown this problem of optimal strategy selection to be NP-hard.
In many multiagent domains, agents must act in order to provide security against attacks by adversaries. A common issue that agents face in such security domains is uncertainty about the adversaries they may be facing. For example, a security robot may need to make a choice about which areas to patrol, and how often. It will not, however, know in advance exactly where a robber will choose to strike. A team of unmanned aerial vehicles (UAVs) monitoring a region undergoing a humanitarian crisis may also need to choose a patrolling policy. They must make this decision without knowing in advance whether terrorists or other adversaries may be waiting to disrupt the mission at a given location. It may indeed be possible to model the motivations of the types of adversaries the agent or agent team is likely to face in order to target these adversaries more closely. In both cases, however, the security robot or UAV team will not know exactly which kinds of adversaries may be active on any given day.
A common approach for choosing a policy for agents in such scenarios is, as described previously, to model the scenarios as Bayesian games. A Bayesian game is a game in which agents may belong to one or more types; the type of an agent determines its possible actions and payoffs. The distribution of adversary types that an agent will face may be known or inferred from historical data. Usually, these games are analyzed according to the solution concept of a Bayes-Nash equilibrium, an extension of the Nash equilibrium for Bayesian games. In many settings, however, a Nash or Bayes-Nash equilibrium is not an appropriate solution concept, since it assumes that the agents' strategies are chosen simultaneously.
In some settings, one player can commit to a strategy before the other players choose their strategies, and by doing so, attain a higher reward than if the strategies were chosen simultaneously. These scenarios are known as Stackelberg games. In a Stackelberg game, a leader commits to a strategy first, and then a follower (or group of followers) selfishly optimizes its own reward, taking into account the strategy chosen by the leader. For example, the security agent (leader) may first commit to a mixed strategy for patrolling various areas in order to be unpredictable to the robbers (followers). The robbers, after observing the pattern of patrols over time, can then choose their own strategy, namely a location to rob.
To see the advantage of being the leader in a Stackelberg game, consider a simple game with the payoff table as shown in Table 1, infra. The leader is the row player and the follower is the column player. Here, the leader's payoff is listed first.
TABLE 1
Payoff table for example normal form game.

         1         2         3
1      5, 5      0, 0      3, 10
2      0, 0      2, 2      5, 0
The only Nash equilibrium for this game is when the leader plays 2 and the follower plays 2 which gives the leader a payoff of 2. However, if the leader commits to a uniform mixed strategy of playing 1 and 2 with equal (0.5) probability, the follower's best response is to play 3 to get an expected payoff of 5 (10 and 0 with equal probability). The leader's payoff would then be 4 (3 and 5 with equal probability). In this case, the leader now has an incentive to deviate and choose a pure strategy of 2 (to get a payoff of 5). However, this would cause the follower to deviate to strategy 2 as well, resulting in the Nash equilibrium. Thus, by committing to a strategy that is observed by the follower, and by avoiding the temptation to deviate, the leader manages to obtain a reward higher than that of the best Nash equilibrium.
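The arithmetic in this example can be checked with a short sketch. The Python below is illustrative only: the names `LEADER`, `FOLLOWER`, `expected`, `best_response`, and `pure_nash_equilibria` are hypothetical, the payoffs are those of Table 1, and only pure-strategy equilibria are enumerated. Ties in the follower's best response are broken in the leader's favor, which is an assumption of the sketch.

```python
from fractions import Fraction

# Payoffs from Table 1: rows are leader strategies 1-2, columns are follower
# strategies 1-3. Indices in the code are 0-based.
LEADER = [[5, 0, 3], [0, 2, 5]]
FOLLOWER = [[5, 0, 10], [0, 2, 0]]


def expected(payoffs, mix, col):
    """Expected payoff in column `col` when the leader plays the mixed strategy `mix`."""
    return sum(p * row[col] for p, row in zip(mix, payoffs))


def best_response(mix):
    """Follower column with the highest expected payoff (ties favor the leader)."""
    return max(
        range(3),
        key=lambda j: (expected(FOLLOWER, mix, j), expected(LEADER, mix, j)),
    )


def pure_nash_equilibria():
    """All (leader, follower) pure-strategy pairs from which neither player gains by deviating."""
    found = []
    for i in range(2):
        for j in range(3):
            leader_ok = all(LEADER[i][j] >= LEADER[k][j] for k in range(2))
            follower_ok = all(FOLLOWER[i][j] >= FOLLOWER[i][k] for k in range(3))
            if leader_ok and follower_ok:
                found.append((i + 1, j + 1))  # report 1-based strategy labels
    return found


uniform = [Fraction(1, 2), Fraction(1, 2)]
j = best_response(uniform)
print(pure_nash_equilibria())  # [(2, 2)]
print(j + 1, expected(FOLLOWER, uniform, j), expected(LEADER, uniform, j))  # 3 5 4
```

Under the uniform commitment the follower's best response is strategy 3, and the leader's expected payoff of 4 exceeds the payoff of 2 at the pure-strategy equilibrium.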
The problem of choosing an optimal strategy for the leader to commit to in a Stackelberg game has been analyzed previously and found to be NP-hard in the case of a Bayesian game with multiple types of followers. Thus, developing efficient heuristic techniques for choosing high-reward strategies in these games is an important open issue. Methods for finding optimal leader strategies for non-Bayesian games can be applied to this problem by converting the Bayesian game into a normal-form game by the Harsanyi transformation. If, on the other hand, one wishes to compute the highest-reward Nash equilibrium, new methods using mixed-integer linear programs (MILPs) may be used, since the highest-reward Bayes-Nash equilibrium is equivalent to the corresponding Nash equilibrium in the transformed game. However, by transforming the game, the compact structure of the Bayesian game is lost. In addition, since the Nash equilibrium assumes a simultaneous choice of strategies, the advantages of being the leader are not considered.
A Bayesian game can be transformed into a normal-form game using the Harsanyi transformation. Once this is done, prior art linear-program (LP)-based methods for finding high-reward strategies for normal-form games can be used to find a strategy in the transformed game; this strategy can then be used for the Bayesian game. While prior art methods exist for finding Bayes-Nash equilibria directly, without the Harsanyi transformation, they find only a single equilibrium in the general case, which may not be of high reward.
In most security patrolling domains, the security agents (like UAVs or security robots) cannot feasibly patrol all areas all the time. Instead, they must choose a policy by which they patrol various routes at different times, taking into account factors such as the likelihood of crime in different areas, possible targets for crime, and the security agents' own resources (number of security agents, amount of available time, fuel, etc.). It is usually beneficial for this policy to be nondeterministic so that robbers cannot safely rob certain locations, knowing that they will be safe from the security agents. To demonstrate the utility of our algorithm, we use a simplified version of such a domain, expressed as a game.
The most basic version of such a scenario game consists of two players: the security agent (the leader) and the robber (the follower) in a world consisting of m houses, 1 . . . m. The security agent's set of pure strategies consists of possible routes of d houses to patrol (in an order). The security agent can choose a mixed strategy so that the robber will be unsure of exactly where the security agent may patrol, but the robber will know the mixed strategy the security agent has chosen. For example, the robber can observe over time how often the security agent patrols each area. With this knowledge, the robber must choose a single house to rob. We assume that the robber generally takes a long time to rob a house. If the house chosen by the robber is not on the security agent's route, then the robber successfully robs the house. Otherwise, if it is on the security agent's route, then the earlier the house is on the route, the easier it is for the security agent to catch the robber before he finishes robbing it.
We model the payoffs for this game with the following variables:
        $v_{l,x}$: value of the goods in house $l$ to the security agent.
        $v_{l,q}$: value of the goods in house $l$ to the robber.
        $c_x$: reward to the security agent of catching the robber.
        $c_q$: cost to the robber of getting caught.
        $p_l$: probability that the security agent can catch the robber at the $l$th house in the patrol ($p_l < p_{l'} \iff l' < l$).
The security agent's set of possible pure strategies (patrol routes) is denoted by $X$ and includes all $d$-tuples $i = \langle w_1, w_2, \ldots, w_d \rangle$ with $w_1, \ldots, w_d \in \{1, \ldots, m\}$, where no two elements are equal (the agent is not allowed to return to the same house). The robber's set of possible pure strategies (e.g., houses to rob) is denoted by $Q$ and includes all integers $j = 1, \ldots, m$. The payoffs (security agent, robber) for pure strategies $i$, $j$ are:
        $-v_{l,x},\ v_{l,q}$, for $j = l \notin i$.
        $p_l c_x + (1 - p_l)(-v_{l,x}),\ -p_l c_q + (1 - p_l)(v_{l,q})$, for $j = l \in i$.
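The payoff rule above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a definitive implementation: the function names `patrol_routes` and `payoffs` are hypothetical, the catch probability is indexed by the robbed house's position on the route (one reading of the definition of the catch probability), and the three-house world with its values is invented for the example.

```python
from fractions import Fraction
from itertools import permutations


def patrol_routes(m, d):
    """All ordered routes of d distinct houses drawn from houses 1..m."""
    return list(permutations(range(1, m + 1), d))


def payoffs(route, house, v_x, v_q, c_x, c_q, p):
    """Return (security agent payoff, robber payoff) for one pure-strategy pair.

    v_x, v_q: dicts mapping house -> value to the agent / to the robber.
    p: list of catch probabilities by route position (assumed indexing),
       decreasing so that earlier positions are easier catches.
    """
    if house not in route:
        # House is not patrolled: the robbery succeeds.
        return (-v_x[house], v_q[house])
    catch = p[route.index(house)]
    agent = catch * c_x + (1 - catch) * (-v_x[house])
    robber = -catch * c_q + (1 - catch) * v_q[house]
    return (agent, robber)


# Hypothetical example world: 3 houses, routes of length 2.
values = {1: Fraction(3, 4), 2: Fraction(1, 4), 3: Fraction(1, 2)}
c_x, c_q = Fraction(1, 2), Fraction(1)
p = [Fraction(1), Fraction(1, 2)]

print(len(patrol_routes(3, 2)))                          # 6
print(payoffs((1, 2), 1, values, values, c_x, c_q, p))   # caught at the first house
print(payoffs((1, 2), 3, values, values, c_x, c_q, p))   # house 3 is off the route
```

With these values, a robber caught at the first house on the route receives −1 and the security agent receives 1/2, while a robbery at an unpatrolled house costs the agent the full value of the goods.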
With this structure it is possible to model many different types of robbers who have differing motivations; for example, one robber may have a lower cost of getting caught than another, or may value the goods in the various houses differently. If the distribution of different robber types is known or inferred from historical data, then the game can be modeled as a Bayesian game [6].
Bayesian Games
A Bayesian game contains a set of $N$ agents, and each agent $n$ must be one of a given set of types $\theta_n$. For the case of a simplified patrolling domain, two agents are present, the security agent and the robber. $\theta_1$ is the set of security agent types and $\theta_2$ is the set of robber types. Since there is only one type of security agent, $\theta_1$ contains only one element. During the game, the robber knows its type but the security agent does not know the robber's type. For each agent $n$ (the security agent or the robber), there is a set of strategies $\sigma_n$ and a utility function $u_n : \theta_1 \times \theta_2 \times \sigma_1 \times \sigma_2 \rightarrow \mathbb{R}$.
As referenced previously, a Bayesian game can be transformed into a normal-form game using the Harsanyi transformation, as described in J. C. Harsanyi and R. Selten, “A generalized Nash solution for two-person bargaining games with incomplete information,” Management Science, 18(5):80-106, 1972; the entire contents of which are incorporated herein by reference. Once this is done, linear-program (LP)-based methods for finding high-reward strategies for normal-form games can be used to find a strategy in the transformed game; this strategy can then be used for the Bayesian game. While methods exist for finding Bayes-Nash equilibria directly, without the Harsanyi transformation, they find only a single equilibrium in the general case, which may not be of high reward. Recent work has led to efficient mixed-integer linear program techniques to find the best Nash equilibrium for a given agent. These techniques, however, do require a normal-form game, and so to compare the policies given by embodiments of the presently disclosed ASAP method against the optimal policy, as well as against the highest-reward Nash equilibrium, these techniques can be applied to the Harsanyi-transformed matrix, as described infra.
Harsanyi Transformation
The first step in solving Bayesian games is to apply the Harsanyi transformation that converts the incomplete information game into a normal form game. Given that the Harsanyi transformation is a standard concept in game theory, it is described briefly through a simple example without introducing the mathematical formulations. An initial assumption is that there are two robber types a and b in the Bayesian game. Robber a will be active with probability α, and robber b will be active with probability 1−α. The rules described under the heading “The Patrolling Domain,” supra, can allow construction of simple payoff tables.
For example, one can assume that there are two houses in the world (1 and 2) and hence there are two patrol routes (pure strategies) for the agent: {1,2} and {2,1}. The robber can rob either house 1 or house 2 and hence he has two strategies (denoted as $1_l$, $2_l$ for robber type $l$). Since there are two types assumed (denoted as a and b), two payoff tables (shown in Table 2) can be constructed corresponding to the security agent playing a separate game with each of the two robber types with probabilities α and 1−α. First, consider robber type a. Borrowing the notation from the domain section, supra, the following values can be assigned to the variables: $v_{1,x} = v_{1,q} = 3/4$, $v_{2,x} = v_{2,q} = 1/4$, $c_x = 1/2$, $c_q = 1$, $p_1 = 1$, $p_2 = 1/2$. Using these values, a base payoff table can be constructed as the payoff for the game against robber type a. For example, if the security agent chooses route {1,2} when robber a is active, and robber a chooses house 1, the robber receives a reward of −1 (for being caught) and the agent receives a reward of 0.5 for catching the robber. The payoffs for the game against robber type b are constructed using different values.
TABLE 2
Payoff Tables: Security Agents vs. Robbers a and b

                        Security Agent
                 {1, 2}                {2, 1}
Robber a
    1a          −1, 0.5            −0.375, 0.125
    2a       −0.125, −0.125           −1, 0.5
Robber b
    1b         −0.9, 0.6           −0.275, 0.225
    2b       −0.025, −0.25           −0.9, 0.6
Using the Harsanyi technique involves introducing a chance node that determines the robber's type, thus transforming the security agent's incomplete information regarding the robber into imperfect information. The Bayesian equilibrium of the game is then precisely the Nash equilibrium of the imperfect information game. The transformed, normal-form game is shown in Table 3, below.
In the transformed game, the security agent is the column player, and the set of all robber types together is the row player. Suppose that robber type a robs house 1 and robber type b robs house 2, while the security agent chooses patrol {1,2}. Then, the security agent and the robber receive an expected payoff corresponding to their payoffs from the agent encountering robber a at house 1 with probability α and robber b at house 2 with probability 1−α.
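The construction of the transformed game can be sketched as follows. This is an illustrative Python sketch, not the disclosed method: the names `ROUTES`, `TABLES`, and `harsanyi` are hypothetical, the per-type payoffs are copied from Table 2 as printed with each cell read as (robber, security agent), and the prior of 0.5 for α is chosen only for the example.

```python
from itertools import product

ROUTES = [(1, 2), (2, 1)]  # security agent's pure strategies (columns)

# Table 2 as printed: TABLES[type][house][route index] = (robber, agent) payoffs.
TABLES = {
    "a": {1: [(-1.0, 0.5), (-0.375, 0.125)], 2: [(-0.125, -0.125), (-1.0, 0.5)]},
    "b": {1: [(-0.9, 0.6), (-0.275, 0.225)], 2: [(-0.025, -0.25), (-0.9, 0.6)]},
}


def harsanyi(tables, prior):
    """Build the normal-form game over joint robber strategies.

    A joint strategy picks one house per robber type; each payoff is the
    prior-weighted average of the per-type payoffs.
    Returns {joint strategy: [(robber, agent) per route]}.
    """
    types = sorted(tables)
    houses_per_type = [sorted(tables[t]) for t in types]
    game = {}
    for joint in product(*houses_per_type):
        row = []
        for r in range(len(ROUTES)):
            robber = sum(prior[t] * tables[t][h][r][0] for t, h in zip(types, joint))
            agent = sum(prior[t] * tables[t][h][r][1] for t, h in zip(types, joint))
            row.append((robber, agent))
        game[joint] = row
    return game


alpha = 0.5  # hypothetical probability that robber a is active
game = harsanyi(TABLES, {"a": alpha, "b": 1 - alpha})
for joint, row in game.items():
    print(joint, row)
```

Each key of `game` is a pair (house robbed by type a, house robbed by type b), so two houses and two robber types yield four rows in the transformed game.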
Finding an Optimal Strategy
Although a Nash equilibrium is the standard solution concept for games in which agents choose strategies simultaneously, in our security domain, the security agent (the leader) can gain an advantage by committing to a mixed strategy in advance. Since the followers (the robbers) will know the leader's strategy, the optimal response for the followers will be a pure strategy. Given the common assumption that, when followers are indifferent, they will choose the strategy that benefits the leader, there must exist a guaranteed optimal strategy for the leader.
From the Bayesian game in Table 2, the Harsanyi transformed bimatrix is constructed in Table 3. The index sets of the security agent's and the robbers' pure strategies are denoted as $X = \sigma_1^{\theta_2} = \sigma_1$ and $Q = \sigma_2^{\theta_2}$, respectively, with $R$ and $C$ as the corresponding payoff matrices. $R_{ij}$ is the reward of the security agent and $C_{ij}$ is the reward of the robbers when the security agent takes pure strategy $i$ and the robbers take pure strategy $j$. A mixed strategy for the security agent is a probability distribution over its set of pure strategies and will be represented by a vector $x = (p_{x1}, p_{x2}, \ldots, p_{x|X|})$, where $p_{xi} \ge 0$ and $\sum p_{xi} = 1$. Here, $p_{xi}$ is the probability that the security agent will choose its $i$th pure strategy.
TABLE 3
Harsanyi Transformed Payoff Table

              {1, 2}                          {2, 1}
{1a, 1b}      −1α − 0.9(1 − α),               −0.375α − 0.275(1 − α),
              0.5α + 0.6(1 − α)               0.125α + 0.225(1 − α)
{1a, 2b}      −1α − 0.025(1 − α),             −0.375α − 0.9(1 − α),
              0.5α − 0.025(1 − α)             0.125α + 0.6(1 − α)
{2a, 1b}      −0.125α − 0.9(1 − α),           −1α − 0.275(1 − α),
              −0.125α + 0.6(1 − α)            0.5α + 0.225(1 − α)
{2a, 2b}      −0.125α − 0.25(1 − α),          −1α − 0.9(1 − α),
              −0.125α + 0.025(1 − α)          0.5α + 0.6(1 − α)
The optimal mixed strategy for the security agent can be found in time polynomial in the number of rows in the normal form game using the following linear program formulation.
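For every possible pure strategy $j$ by the follower (the set of all robber types),

\[
\begin{aligned}
\max \quad & \sum_{i \in X} p_{xi} R_{ij} \\
\text{s.t.} \quad & \forall j' \in Q, \quad \sum_{i \in \sigma_1} p_{xi} C_{ij} \ge \sum_{i \in \sigma_1} p_{xi} C_{ij'} \\
& \sum_{i \in X} p_{xi} = 1 \\
& \forall i \in X, \quad p_{xi} \ge 0
\end{aligned}
\tag{Eq. 1}
\]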
Then, for all feasible follower strategies $j$, choose the one that maximizes $\sum_{i \in X} p_{xi} R_{ij}$, the reward for the security agent (leader). The $p_{xi}$ variables give the optimal strategy for the security agent.
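The procedure of solving one linear program per follower strategy and keeping the best feasible result can be illustrated on the small game of Table 1. The sketch below is a simplification under stated assumptions: it handles only a leader with two pure strategies, so the mixed strategy reduces to a single unknown p and each linear program of Eq. 1 becomes a one-dimensional problem solved by intersecting intervals. The names `solve_for_follower` and `optimal_commitment` are hypothetical, and integer payoffs are assumed so that exact fractions can be used. A general implementation would hand Eq. 1 to a linear programming solver.

```python
from fractions import Fraction

# Table 1 payoffs: rows = leader strategies 1-2, columns = follower strategies 1-3.
R = [[5, 0, 3], [0, 2, 5]]    # leader rewards R[i][j]
C = [[5, 0, 10], [0, 2, 0]]   # follower rewards C[i][j]


def solve_for_follower(j, R, C):
    """Solve Eq. 1 for follower strategy j with leader mix (p, 1 - p).

    Returns (leader reward, p), or None if no p makes j a best response.
    """
    lo, hi = Fraction(0), Fraction(1)
    for k in range(len(C[0])):
        # Best-response constraint for j against k, as slope*p + intercept >= 0.
        slope = (C[0][j] - C[1][j]) - (C[0][k] - C[1][k])
        intercept = C[1][j] - C[1][k]
        if slope > 0:
            lo = max(lo, Fraction(-intercept, slope))
        elif slope < 0:
            hi = min(hi, Fraction(-intercept, slope))
        elif intercept < 0:
            return None  # constraint can never hold
    if lo > hi:
        return None  # feasible interval is empty

    def reward(p):
        return p * R[0][j] + (1 - p) * R[1][j]

    # A linear objective on an interval attains its maximum at an endpoint.
    return max((reward(p), p) for p in (lo, hi))


def optimal_commitment(R, C):
    """Best (leader reward, p, follower strategy label) over all follower strategies."""
    best = None
    for j in range(len(C[0])):
        solution = solve_for_follower(j, R, C)
        if solution is not None and (best is None or solution[0] > best[0]):
            best = (solution[0], solution[1], j + 1)
    return best


print(optimal_commitment(R, C))  # (Fraction(14, 3), Fraction(1, 6), 3)
```

For this game the sketch finds that the leader does best by playing strategy 1 with probability 1/6, which induces follower strategy 3 and yields an expected leader reward of 14/3. Follower strategy 1 is never a best response to any commitment, so its linear program is infeasible.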
Note that while this method is polynomial in the number of rows in the transformed, normal-form game, the number of rows increases exponentially with the number of robber types. Using this method for a Bayesian game thus requires running $|\sigma_2|^{|\theta_2|}$ separate linear programs. This is not a surprise, since finding the optimal strategy to commit to for the leader in a Bayesian game is NP-hard.
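The growth in the number of linear programs can be made concrete with a small sketch. The function names below are hypothetical, and the sketch assumes every robber type has the same set of actions (houses to rob).

```python
from itertools import product


def follower_pure_strategies(actions, num_types):
    """Joint follower strategies: one action chosen for each robber type."""
    return list(product(actions, repeat=num_types))


def num_linear_programs(num_actions, num_types):
    """One linear program is solved per joint follower strategy."""
    return num_actions ** num_types


print(follower_pure_strategies([1, 2], 2))  # [(1, 1), (1, 2), (2, 1), (2, 2)]
print(num_linear_programs(2, 2))            # 4
print(num_linear_programs(10, 5))           # 100000
```

Two houses and two robber types give the four rows of the transformed game, whereas ten houses and five robber types would already require 100,000 linear programs.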
The patrolling problem has recently received growing attention from the multiagent community due to its wide range of applications. However, most of this work is focused on either limiting the energy consumption involved in patrolling or optimizing criteria such as the length of the path traveled, without reasoning about any explicit model of an adversary.
What is desirable, therefore, are devices and techniques that address such limitations described for the prior art.