A plant control device processes measured signals obtained from a plant, which is the control target, and calculates operation signals to be given to the plant. The plant control device includes an algorithm that calculates the operation signals so that the measured signals from the plant satisfy an operation target.
Control algorithms used for plant control include the proportional-integral (PI) control algorithm. To derive an operation signal, this algorithm multiplies the deviation between a target operation value and a measured signal by a proportional gain, and adds to that product a value obtained by integrating the deviation over time. The use of a learning algorithm to derive a plant operation signal is also known.
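The PI calculation described above can be sketched as follows. This is a minimal illustration, not a controller taken from any cited document: the class name `PIController`, the separate integral gain `ki`, and the rectangular (Euler) approximation of the time integral are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class PIController:
    """Hypothetical discrete-time PI controller (illustrative only)."""

    kp: float              # proportional gain
    ki: float              # integral gain (assumed; scales the integrated deviation)
    integral: float = 0.0  # running time integral of the deviation

    def step(self, target: float, measured: float, dt: float) -> float:
        """Return the operation signal for one sampling period of length dt."""
        deviation = target - measured
        # Rectangular approximation of the integral of the deviation over time.
        self.integral += deviation * dt
        # Proportional term plus integral term.
        return self.kp * deviation + self.ki * self.integral


controller = PIController(kp=2.0, ki=0.5)
first_signal = controller.step(target=1.0, measured=0.0, dt=0.1)
second_signal = controller.step(target=1.0, measured=0.5, dt=0.1)
```

With these assumed gains, the first call yields 2.0 × 1.0 from the proportional term plus 0.5 × 0.1 from the integral term.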
In recent years, a technique called reinforcement learning has become available in the field of unsupervised learning. Reinforcement learning is known as a learning control framework in which a method of generating the operation signals to be given to a control target is learned through trial-and-error interaction with the control target, so that desirable measured signals are obtained from the control target.
In reinforcement learning, a scalar evaluated value is calculated from the signals obtained from the control target (this evaluated value is called a “reward” in reinforcement learning). The operation signal generation method is learned so that the expected value of the evaluated values obtained at present and in the future is maximized.
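One common way to combine the evaluated values obtained at present and in the future into a single quantity is a discounted sum. The sketch below assumes that formulation; the discount factor `gamma` and the function name `discounted_return` are not taken from the text above and are introduced only for illustration. The expected value referred to above would then be the average of this quantity over many trials.

```python
def discounted_return(evaluated_values, gamma=0.9):
    """Sum evaluated values, weighting a value k steps ahead by gamma**k.

    `gamma` (0 <= gamma <= 1) is an assumed discount factor.
    """
    total = 0.0
    # Accumulate from the last value backwards: G_t = r_t + gamma * G_{t+1}.
    for value in reversed(evaluated_values):
        total = value + gamma * total
    return total


# Evaluated values observed now, one step later, and two steps later.
example = discounted_return([1.0, 0.0, 1.0], gamma=0.5)
```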
Non-patent Document 1 describes a technique in which a positive evaluated value is given when a measured signal satisfies a target operation value, and an operation signal generation method is learned by using Actor-Critic, Q-learning, real-time dynamic programming, or another algorithm. A framework called the Dyna architecture is also introduced as a development of this technique. In this framework, the controller contains a model that simulates the control target. The model takes the operation signal to be given to the control target as a model input and calculates a model output, which is a predicted value of the measured signal of the control target. The model is built by using physical equations and statistical techniques. An evaluated value calculated from the model output is used to learn a method of generating the model input.
In the Dyna architecture, the model input generation method is learned in advance so that the target model output value is attained, and the operation signal to be applied to the control target is determined from the learning result.
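The model-based learning step described above can be sketched with tabular Q-learning run entirely against a model. Everything concrete here is an assumption for illustration: the one-dimensional `model`, the discretization into `N_STATES` levels, the three candidate operation signals in `ACTIONS`, the `TARGET` level, and all learning parameters. The sketch covers only learning in advance on the model and then reading an operation signal out of the learning result; it does not reproduce the full architecture of Non-patent Document 1.

```python
import random

N_STATES = 11         # discretized levels of the measured signal (assumed)
ACTIONS = (-1, 0, 1)  # candidate operation signals (assumed)
TARGET = 7            # target level of the model output (assumed)


def model(state, action):
    """Hypothetical plant model: predicts the next measured level."""
    return min(max(state + action, 0), N_STATES - 1)


def evaluate(output):
    """Give a positive evaluated value only when the target is satisfied."""
    return 1.0 if output == TARGET else 0.0


def learn_on_model(episodes=2000, steps=30, alpha=0.5, gamma=0.9,
                   epsilon=0.2, seed=0):
    """Learn a table of action values by trial and error on the model only."""
    rng = random.Random(seed)
    q = [[0.0] * len(ACTIONS) for _ in range(N_STATES)]
    for _ in range(episodes):
        state = rng.randrange(N_STATES)
        for _ in range(steps):
            if rng.random() < epsilon:
                a = rng.randrange(len(ACTIONS))  # explore
            else:
                a = max(range(len(ACTIONS)), key=lambda i: q[state][i])
            nxt = model(state, ACTIONS[a])
            reward = evaluate(nxt)
            # Q-learning update toward reward plus discounted best next value.
            q[state][a] += alpha * (reward + gamma * max(q[nxt]) - q[state][a])
            state = nxt
    return q


def operation_signal(q, state):
    """Determine the operation signal from the learning result (greedy)."""
    best = max(range(len(ACTIONS)), key=lambda i: q[state][i])
    return ACTIONS[best]


q_table = learn_on_model()
```

After learning, `operation_signal(q_table, measured_level)` would be called with the level measured from the actual control target; how well this works depends on how closely the model matches the plant.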
Non-patent Document 1: Reinforcement Learning, translated by Sadayoshi Mikami and Masaaki Minagawa, Kitagawa Shuppan Kabushiki Kaisha, Dec. 20, 2000, pp. 142-172, pp. 247-253