Recent years have seen growing interest in applying game-theoretic methods to real-world problems in which one player (referred to as the leader) chooses a strategy to commit to (possibly a non-deterministic, i.e., mixed, strategy) and waits for the other player (referred to as the follower) to respond. Examples of such problems include network monitoring, public surveillance and infrastructure security domains, where the leader commits to a mixed, randomized patrolling strategy in an attempt to thwart the follower from compromising resources of high value to the leader. In particular, a known technique referred to as the ARMOR system, described in the reference to Pita, J., Jain, M., Western, C., Portway, C., Tambe, M., Ordonez, F., Kraus, S., Paruchuri, P. entitled “Deployed ARMOR protection: The application of a game-theoretic model for security at the Los Angeles International Airport,” in Proceedings of AAMAS (Industry Track) (2008), suggests where to deploy security checkpoints to protect terminal approaches of Los Angeles International Airport. A further technique, described in a reference to Tsai, J., Rathi, S., Kiekintveld, C., Ordonez, F., Tambe, M. entitled “IRIS - A tool for strategic security allocation in transportation networks,” in Proceedings of AAMAS (Industry Track) (2009), proposes flight routes for the Federal Air Marshals to protect domestic and international flights from being hijacked. In addition, the PROTECT system (under development) suggests routes for the United States Coast Guard to survey critical infrastructure in the Boston harbor.
In arriving at optimal leader strategies for the above-mentioned and other domains, the leader's ability to profile the followers is of critical importance. In essence, determining the follower's preferences over its actions is a vital step in predicting the follower's rational response to leader actions, which in turn allows the leader to optimize the mixed strategy it commits to. In security domains in particular, it is very difficult to obtain precise and accurate information about the preferences and capabilities of possible attackers. For example, the follower's valuation of the resources that the leader protects might differ from the leader's own valuation, which leads to situations where some leader resources are at an elevated risk of being compromised. For instance, a leader might value an airport fuel depot at $10 M whereas the follower (without knowing that the depot is empty) might value the same depot at $20 M. A fundamental problem that the leader thus has to address is how to act, over a prolonged period of time, given the initial lack of knowledge (or only a vague estimate) about the types of the followers and their preferences. Examples of such problems can be found in security applications for computer networks; see, for instance, the reference to Alpcan, T., Basar, T. entitled “A game theoretic approach to decision and analysis in network intrusion detection,” in Proceedings of the 42nd IEEE Conference on Decision and Control, pp. 2595-2600 (2003), and the reference to Nguyen, K. C., Basar, T. A. T. entitled “Security games with incomplete information,” in Proceedings of IEEE International Conference on Communications (ICC 2009) (2009), where the hackers are rarely caught and prevented from future attacks while their profiles are initially unknown.
Domains where the leader acts first by choosing a mixed strategy to commit to and the follower acts second by responding to the leader's strategy can be modeled as Stackelberg games.
In a Bayesian Stackelberg game the situation is more complex as the follower agent can be of multiple types (encountered with a given probability), and each type can have a different payoff matrix associated with it. The optimal strategy of the leader must therefore consider that the leader might end up playing the game with any opponent type. It has been shown that computing the Strong Bayesian Stackelberg Equilibrium is an NP-hard problem.
Formally, a Stackelberg game is defined as follows: A_l = {a_l1, . . . , a_lM} is a set of leader actions and A_f = {a_f1, . . . , a_fN} is a set of follower actions. (Note that the number M of leader actions does not have to be equal to the number N of follower actions.) The leader's utility function is u_l: A_l × A_f → ℝ. The follower is of a type θ from a set Θ, i.e., θ∈Θ, which determines its payoff function u_f: Θ × A_l × A_f → ℝ. The leader acts first by committing to a mixed strategy σ∈Σ, where σ(a_l) is the probability of the leader executing its pure strategy a_l∈A_l. For a given leader strategy σ∈Σ and a follower of type θ∈Θ, the follower's “best” response to σ is a pure strategy B(θ,σ)∈A_f that satisfies:
$$B(\theta,\sigma) = \operatorname*{argmax}_{a_f \in A_f} \sum_{a_l \in A_l} \sigma(a_l)\, u_f(\theta, a_l, a_f).$$
Given the follower type θ∈Θ, the expected utility of the leader strategy σ is therefore given by:
$$U(\theta,\sigma) = \sum_{a_l \in A_l} \sigma(a_l)\, u_l\bigl(a_l, B(\theta,\sigma)\bigr).$$
Given a probability distribution P(Θ) over the follower types, the expected utility of the leader strategy σ over all the follower types is hence:
$$U(\sigma) = \sum_{\theta \in \Theta} P(\theta) \sum_{a_l \in A_l} \sigma(a_l)\, u_l\bigl(a_l, B(\theta,\sigma)\bigr). \qquad (3)$$
Solving a single-round Bayesian Stackelberg game involves finding σ* = argmax_{σ∈Σ} U(σ).
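The definitions above can be sketched in a few lines of Python. This is a minimal illustration, not a solver: payoff matrices are assumed to be nested lists indexed as [leader action][follower action], the two follower types and their numbers are hypothetical, ties in the follower's argmax are broken by lowest index (an assumption, since the text does not fix a tie-breaking rule), and `grid_search_two_actions` is a brute-force stand-in for the argmax over Σ that only handles a leader with two actions.

```python
from typing import Dict, List, Sequence, Tuple

Matrix = List[List[float]]  # indexed [leader action][follower action]


def best_response(sigma: Sequence[float], u_f: Matrix) -> int:
    """B(theta, sigma): follower action maximizing sum_l sigma[l] * u_f[l][f].

    Ties go to the lowest index (an assumption made for this sketch).
    """
    def value(f: int) -> float:
        return sum(p * row[f] for p, row in zip(sigma, u_f))
    return max(range(len(u_f[0])), key=value)


def leader_utility(sigma: Sequence[float], u_l: Matrix,
                   u_f_by_type: Dict[str, Matrix],
                   prior: Dict[str, float]) -> float:
    """U(sigma): leader payoff averaged over follower types, as in equation (3)."""
    total = 0.0
    for theta, p_theta in prior.items():
        b = best_response(sigma, u_f_by_type[theta])
        total += p_theta * sum(p * row[b] for p, row in zip(sigma, u_l))
    return total


def grid_search_two_actions(u_l: Matrix, u_f_by_type: Dict[str, Matrix],
                            prior: Dict[str, float],
                            steps: int = 100) -> Tuple[float, float]:
    """Brute-force approximation of sigma* for a leader with exactly two actions."""
    candidates = [(k / steps, 1 - k / steps) for k in range(steps + 1)]
    return max(candidates,
               key=lambda s: leader_utility(s, u_l, u_f_by_type, prior))


# Hypothetical two-type game (all numbers are illustrative only).
u_l = [[3, -1], [-2, 2]]
u_f_by_type = {"A": [[-2, 2], [2, -4]], "B": [[1, 0], [0, 3]]}
prior = {"A": 0.5, "B": 0.5}

sigma_star = grid_search_two_actions(u_l, u_f_by_type, prior)
print(sigma_star, leader_utility(sigma_star, u_l, u_f_by_type, prior))
```

The grid search is exponential in the number of leader actions if generalized naively, which is in keeping with the hardness result mentioned earlier; it is included only to make the objective concrete.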
In an example Stackelberg game 10 such as shown in FIG. 1, a leader agent 11 (e.g., a security force) first commits to a mixed strategy. The follower agent 13 (e.g., the adversary or opponent), here of just a single type, then observes the leader strategy and responds optimally to it with a pure strategy, so as to maximize its own immediate payoff. For example, the leader mixed strategy to “Patrol Terminal #1” with probability 0.5 and “Patrol Terminal #2” with probability 0.5 triggers the follower strategy “Attack Terminal #1”, because its expected utility of 0.5·(−2)+0.5·(2)=0 is greater than the expected utility of 0.5·(2)+0.5·(−4)=−1 of the alternative response “Attack Terminal #2”. The expected utility for the above-mentioned leader strategy is therefore 0.5·(3)+0.5·(−2)=0.5 (which is higher than the utility for the leader playing either of its two pure strategies).
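The arithmetic in this example can be checked with a short Python sketch. FIG. 1 is not reproduced here, so the payoff entries below are reconstructed from the sums quoted in the text and should be read as an assumption; the leader payoffs for “Attack Terminal #2” (−1 and 2) come from the single-round calculation given later in the text. The names `LEADER`, `FOLLOWER`, `follower_values` and `leader_value` are introduced only for illustration.

```python
# Rows: leader patrols Terminal #1 / Terminal #2.
# Columns: follower attacks Terminal #1 / Terminal #2.
# Entries reconstructed from the arithmetic in the text (an assumption).
LEADER = [[3, -1], [-2, 2]]
FOLLOWER = [[-2, 2], [2, -4]]


def follower_values(sigma):
    """Expected follower payoff of each attack under the leader mix sigma."""
    return [sum(p * row[f] for p, row in zip(sigma, FOLLOWER)) for f in range(2)]


def leader_value(sigma):
    """Leader's expected payoff when the follower best-responds to sigma."""
    values = follower_values(sigma)
    attack = values.index(max(values))  # first maximal entry wins ties
    return sum(p * row[attack] for p, row in zip(sigma, LEADER))


print(follower_values([0.5, 0.5]))  # [0.0, -1.0]
print(leader_value([0.5, 0.5]))     # 0.5
print(leader_value([1, 0]))         # -1 (follower attacks Terminal #2)
print(leader_value([0, 1]))         # -2 (follower attacks Terminal #1)
```

Under these reconstructed payoffs, the mixed strategy yields 0.5, while the two pure strategies yield −1 and −2, which matches the claim that the mixed strategy is better than either pure strategy.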
Despite recent progress on solving Bayesian Stackelberg games (games where the leader faces an opponent that can be of different types, with different preferences), it is commonly assumed that the payoff structures (and thus also the preferences) of both players are known to the players (either as payoff matrices or as probability distributions over the payoffs).
It would be highly desirable to provide an approach to the problem of solving a repeated Stackelberg Game, played for a fixed number of rounds, where the payoffs or preferences of the follower and the prior probability distribution over follower types are initially unknown to the leader.
Multiple Rounds, Unknown Followers
In repeated Stackelberg games such as described in Letchford et al., entitled “Learning and Approximating the Optimal Strategy to Commit To,” in Proceedings of the Symposium on Algorithmic Game Theory, 2009, nature first selects a follower type θ∈Θ, upon which the leader then plays H rounds of a Stackelberg game against that follower. Across all rounds, the follower is assumed to act rationally (albeit myopically), whereas the leader aims to act strategically, so as to maximize total utility collected in all H stages of the game. The leader may never quite learn the exact type θ that it is playing against: Instead, the leader uses the observed follower responses to its actions to narrow down the subset of types and utility functions that are consistent with the observed responses.
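The narrowing step described above can be sketched as a simple filter. This is only an illustration under strong assumptions: the leader is assumed to hold a finite set of candidate follower payoff matrices (the names `t1`, `t2`, `t3` and their entries are hypothetical), the follower is assumed to best-respond myopically in every round, and ties are broken by lowest index.

```python
def best_response(sigma, u_f):
    """Follower action maximizing expected payoff under the leader mix sigma."""
    values = [sum(p * row[f] for p, row in zip(sigma, u_f))
              for f in range(len(u_f[0]))]
    return values.index(max(values))  # first maximal entry wins ties


def consistent_types(candidates, history):
    """Keep the candidate types whose best response matches every observation.

    candidates: dict mapping a type name to its payoff matrix u_f[l][f]
    history:    list of (sigma, observed follower action index) pairs
    """
    return {
        name: u_f
        for name, u_f in candidates.items()
        if all(best_response(sigma, u_f) == seen for sigma, seen in history)
    }


# Hypothetical candidate follower types.
candidates = {
    "t1": [[-2, 2], [2, -4]],
    "t2": [[1, 0], [0, 1]],
    "t3": [[0, 1], [0, 1]],
}

# Round 1: leader plays action 0 with certainty, follower answers with action 1.
after_one = consistent_types(candidates, [((1.0, 0.0), 1)])
# Round 2: leader plays action 1 with certainty, follower answers with action 0.
after_two = consistent_types(candidates, [((1.0, 0.0), 1), ((0.0, 1.0), 0)])

print(sorted(after_one))  # ['t1', 't3']
print(sorted(after_two))  # ['t1']
```

Each observation can only shrink the consistent set, so the leader's uncertainty never grows, but nothing guarantees the set collapses to a single type within H rounds.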
To illustrate the concept of a repeated Stackelberg game with unknown follower preferences, refer again to FIG. 1, but this time assume that the follower payoffs, indicated as follower payoffs 16, 18, are unknown to the leader. If the game were played for only a single round and the leader believed that each response of the follower is equally likely (i.e., occurs with probability 0.5), then the optimal (mixed) strategy of the leader would be to “Patrol Terminal #1” with probability 1.0, as this provides the leader with the expected utility of 0.5·3+0.5·(−1)=1. (Note that the worst mixed strategy of the leader is to “Patrol Terminal #2” with probability 1.0, yielding the expected utility of 0.5·(−2)+0.5·2=0.) Now, if the Stackelberg game spans two rounds, the optimal strategy of the leader is conditioned on the leader's observation of the follower response in the first round of the game. In particular, if the leader plays “Patrol Terminal #1” in the first round and observes the follower response “Attack Terminal #2”, the optimal action of the leader in the next round is to switch to “Patrol Terminal #2” with probability 1.0, which yields the expected utility of 0, as opposed to continuing to “Patrol Terminal #1” with probability 1.0, which yields the exact utility of −1. In contrast, if the leader plays “Patrol Terminal #1” in the first round and observes the follower response “Attack Terminal #1”, the optimal action of the leader in the next round is to continue to “Patrol Terminal #1” with probability 1.0, which yields the exact utility of 3. In so doing, the leader has deliberately chosen not to learn anything about the follower preferences in response to the leader strategy “Patrol Terminal #2”, as this extra information cannot improve on the utility of 3 that the leader is now guaranteed to receive by “Patrolling Terminal #1”. This contrasts sharply with the approach in the above-identified Letchford et al., where the leader would choose to “Patrol Terminal #2” in order to learn the complete follower preference structure in as few game rounds as possible.
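The second-round decision in this illustration can be reproduced with a small Python sketch. It rests on several assumptions: the leader payoffs are reconstructed from the arithmetic in the text rather than read from FIG. 1, only the leader's pure strategies are compared, the follower is assumed to repeat its observed response when the leader repeats the same pure strategy, and an untried patrol is evaluated under the uniform belief of 0.5 per follower response. The names `LEADER`, `BELIEF`, `expected_unknown` and `second_round_choice` are hypothetical.

```python
# Leader payoffs, indexed [patrol terminal][attacked terminal] (0 = #1, 1 = #2).
# Reconstructed from the sums quoted in the text (an assumption).
LEADER = [[3, -1], [-2, 2]]
BELIEF = [0.5, 0.5]  # uniform belief over the follower's response


def expected_unknown(patrol):
    """Expected leader payoff of a patrol whose follower response is unknown."""
    return sum(p * u for p, u in zip(BELIEF, LEADER[patrol]))


def second_round_choice(first_patrol, observed_attack):
    """Return (patrol, value) maximizing the leader's round-two payoff.

    The patrol already tried has an exact payoff (the follower is assumed to
    repeat its response); the other patrol is valued under the uniform belief.
    """
    options = {}
    for patrol in range(2):
        if patrol == first_patrol:
            options[patrol] = LEADER[patrol][observed_attack]
        else:
            options[patrol] = expected_unknown(patrol)
    best = max(options, key=options.get)
    return best, options[best]


print(expected_unknown(0), expected_unknown(1))  # 1.0 0.0
print(second_round_choice(0, 1))  # (1, 0.0): switch to Patrol Terminal #2
print(second_round_choice(0, 0))  # (0, 3): keep Patrol Terminal #1
```

The sketch makes the exploitation argument explicit: once the exact payoff of 3 is observed, no untried patrol has an expected value that beats it, so further exploration is not worthwhile in the final round.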
Letchford et al. propose a method for learning the follower preferences in as few game rounds as possible. However, this technique is deficient in two respects. First, while the method ensures that the leader learns the complete follower preference structure (i.e., the follower responses to any mixed strategy of the leader) in as few rounds as possible (by probing the follower responses with carefully chosen leader mixed strategies), it ignores the payoffs that the leader receives during these rounds. In essence, the leader only values exploration of the follower preferences and ignores the exploitation of the already known follower preferences for its own benefit. Second, the prior art method does not allow the follower to be of multiple types.
Further, existing work has predominantly focused on single-round games and, as such, has considered only the exploitation part of the problem. That is, existing methods may compute the optimal leader mixed strategy for just a single round of the game, given all the available information about the follower preferences and/or payoffs. In contrast, while the work by Letchford et al. considers a repeated-game scenario, it does not consider that the leader would optimize its own payoffs. Instead, that work presumes that the leader acts so as to uniquely determine the follower preferences in the fewest number of rounds, which may be arbitrarily expensive for the leader. In addition, the technique proposed by Letchford et al. only considers non-Bayesian Stackelberg games, in that the authors assume that the follower is of a single type.