The present disclosure relates to a method, an apparatus, and a computer program for using a cyclic Markov decision process to determine, with reduced computational processing load, an optimum policy that minimizes an average cost.
Solving a control problem formulated as a so-called “Markov decision process” is a technique that can be applied to autonomous control problems in a wide variety of fields, such as robotics, power plants, factories, and railroads. In a Markov decision process, a control problem involving time-dependent state transitions of events is solved by using the distance (cost) from an ideal state transition as the evaluation criterion.
For example, JP2011-022902A discloses an electric power transaction management system that manages automatic electric power interchange at power generation and power consumption sites, such as minimal clusters that include equipment (for example, a power generator, an electric storage device, and electric equipment) and a power router, and that uses a Markov decision process to determine an optimum transaction policy. JP2005-084834A discloses an adaptive controller that uses a Markov decision process in which a controlled device transitions to the next state according to a state transition probability distribution. The controller thus operates as a probabilistic controller, which reduces the amount of computation required by algorithms such as dynamic programming algorithms, in which accumulated costs are computed, and exhaustive search algorithms, in which a policy is searched for directly.
Other approaches that use Markov decision processes to determine optimum policies include value iteration, policy iteration, and so-called linear programming, which is disclosed in JP2011-022902A. When a Markov decision process has a special structure, that structure itself can be exploited to determine an optimum policy efficiently, as disclosed in JP2005-084834A.
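As a rough illustration of the value iteration approach mentioned above, the following Python sketch applies relative value iteration, a textbook variant of value iteration for the average-cost criterion, to a small finite Markov decision process. It is not the method of the present disclosure or of the cited publications. The function name `relative_value_iteration`, the array layout of `P` and `C`, and the parameter `tau` are assumptions introduced only for this example. Because plain relative value iteration can oscillate on periodic (cyclic) chains, the sketch first applies the standard aperiodicity transformation, which mixes in self-loops without changing the average cost of any stationary policy.

```python
import numpy as np


def relative_value_iteration(P, C, tau=0.5, ref_state=0, tol=1e-10, max_iter=100_000):
    """Relative value iteration for a finite average-cost MDP (illustrative sketch).

    P: shape (A, S, S); P[a, s, t] is the probability of moving from s to t under action a.
    C: shape (S, A); C[s, a] is the expected one-step cost of action a in state s.
    tau: self-loop weight of the aperiodicity transformation (0 <= tau < 1).
    Returns (gain, h, policy): the estimated minimum average cost, a relative
    value function with h[ref_state] == 0, and a greedy stationary policy.
    """
    P = np.asarray(P, dtype=float)
    C = np.asarray(C, dtype=float)
    n_states = P.shape[1]
    # Aperiodicity transformation: adding self-loops keeps each policy's
    # stationary distribution (hence its average cost) unchanged, but removes
    # the periodicity that would make plain iteration oscillate on cyclic chains.
    P_t = tau * np.eye(n_states)[None, :, :] + (1.0 - tau) * P
    h = np.zeros(n_states)
    gain = 0.0
    for _ in range(max_iter):
        # Q[s, a] = C[s, a] + sum_t P_t[a, s, t] * h[t]
        Q = C + np.einsum("ast,t->sa", P_t, h)
        v = Q.min(axis=1)
        gain = v[ref_state]
        h_new = v - gain  # renormalize at the reference state so h stays bounded
        if np.max(np.abs(h_new - h)) < tol:
            h = h_new
            break
        h = h_new
    # Greedy policy with respect to the converged relative values.
    Q = C + np.einsum("ast,t->sa", P_t, h)
    policy = Q.argmin(axis=1)
    # Undo the 1 / (1 - tau) scaling that the transformation introduces into h.
    return gain, (1.0 - tau) * h, policy
```

For a two-state cycle in which every action moves the process 0 → 1 → 0, with costs `C = [[2, 1], [3, 5]]`, the sketch returns a gain of 2.0 and the policy `[1, 0]`, that is, the cheapest action in each state of the cycle.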