Conventionally, the bandit problem has been known as a representative example of a problem of searching for a solution that maximizes an expected value. The purpose of the bandit problem is to maximize the expected value of the total reward a player can receive. In the bandit problem, a player repeatedly chooses one action from n different options of action. After each selection, a result drawn from a probability distribution that depends on the selected action is given to the player as a reward.
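The setting described above can be written compactly as follows. The notation is an assumption introduced here for illustration and does not appear in the original description: a_t is the action chosen at the t-th selection, P_a is the reward distribution associated with action a, r_t is the reward received, and T is the total number of selections.

```latex
a_t \in \{1, \dots, n\}, \qquad
r_t \sim P_{a_t}, \qquad
\text{maximize} \;\; \mathbb{E}\!\left[ \sum_{t=1}^{T} r_t \right]
```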
Take the following case as an example. There are several slot machines, and a player can get a coin (a reward), according to a certain probability distribution, by pulling the lever of a machine. The probability of getting a coin (the winning rate) differs from one slot machine to another, and the player has no knowledge of the winning rates. In such a case, the most common method for evaluating the winning rate of each slot machine is simply to play each of the slot machines multiple times, one after another. The slot machine that actually provides the highest reward is then determined to be the machine with the highest winning rate.
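The simple evaluation method described above might be sketched as follows. This is only an illustration under stated assumptions: each machine pays one coin with a fixed probability, and the winning rates, the number of pulls, and the names play and evaluate_by_uniform_play are hypothetical and do not come from the original description.

```python
import random


def play(win_rate, rng):
    """Pull the lever once: return 1 (a coin) with probability win_rate, else 0."""
    return 1 if rng.random() < win_rate else 0


def evaluate_by_uniform_play(win_rates, pulls_per_machine, rng):
    """Play every machine the same number of times and pick the best observed one.

    Returns the index of the machine with the highest total reward,
    together with the total reward collected from each machine.
    """
    totals = []
    for rate in win_rates:
        # The player never reads `rate` directly; it only drives the simulation.
        totals.append(sum(play(rate, rng) for _ in range(pulls_per_machine)))
    best = max(range(len(win_rates)), key=lambda i: totals[i])
    return best, totals


rng = random.Random(0)
hidden_win_rates = [0.2, 0.5, 0.8]  # hypothetical values, unknown to the player
best, totals = evaluate_by_uniform_play(hidden_win_rates, 1000, rng)
print("machine judged best:", best)
print("coins collected per machine:", totals)
```

In this sketch the player spends 3,000 pulls in total, most of them on machines that turn out to be inferior, which is the cost the simple method incurs.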
In the above method, however, the player has to play the slot machines a considerable number of times in order to determine which machine actually has the highest winning rate. This, as a result, requires a large investment. It is therefore necessary to create an algorithm that can find a solution efficiently while reducing the investment as much as possible when searching for the winning rate of each of the slot machines.
To find a solution, the above case can be formulated as the bandit problem described above, which is a problem of searching for a solution that maximizes the expected value of the total reward (for example, refer to Non Patent Literature 1). In particular, the combinatorial bandit problem has recently been drawing attention. In the combinatorial bandit problem, a combination of options that is expected to yield an optimal result is selected from n different options of action. The combinatorial bandit problem is needed in various fields beyond selecting, from among a plurality of slot machines, a combination of slot machines that is likely to provide a high payout. For example, it can be used to select an optimal channel combination that maximizes the amount of data transmission in cognitive radio communication, an optimal advertisement combination that maximizes the click count in Internet advertising, and a portfolio of financial instruments with the highest return on investment. In these applications, the bandit problem takes the more general form of a combinatorial reward maximization problem. In other words, there are multiple players, and the amount of reward for each player is determined by the players' choices (for example, through a payoff matrix). In the present description, however, for simplicity, a combinatorial reward maximization problem in which the slot machines are independent of one another (in particular, a combination of two slot machines) is described with examples.
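The simplified case of choosing two independent slot machines might be illustrated as follows. This is a minimal sketch under assumptions not stated in the original description: the reward of a pair is taken to be the sum of the rewards of its two machines, and the estimated winning rates and the name best_pair are hypothetical.

```python
from itertools import combinations
from math import comb


def best_pair(estimated_rates):
    """Return the pair of machine indices with the highest summed estimate.

    Assumes the machines are independent and that the reward of a pair is
    the sum of the two individual rewards (an illustrative assumption).
    """
    pairs = combinations(range(len(estimated_rates)), 2)
    return max(pairs, key=lambda p: estimated_rates[p[0]] + estimated_rates[p[1]])


estimated_rates = [0.2, 0.5, 0.8, 0.4]  # hypothetical estimates for four machines
pair = best_pair(estimated_rates)
num_pairs = comb(len(estimated_rates), 2)  # number of candidate pairs
print("selected pair:", pair)
print("candidate pairs:", num_pairs)
```

With n machines there are n(n-1)/2 candidate pairs, so the number of combinations to compare grows faster than the number of machines.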